Latent Diffusion — Morph-Chain Generator — User Guide

A latent-diffusion resynthesis engine for Praat: audio events are encoded into a low-dimensional latent space by an on-the-fly autoencoder, grouped into K acoustic identity clusters (k-means++), and regenerated by a temperature-annealed diffusion loop, yielding one continuous morph-chain per cluster that evolves from noise into a recognisable instrument identity.

Author: Shai Cohen
Affiliation: Department of Music, Bar-Ilan University, Israel
Version: 1.0 (2025)
License: MIT License
Citation: Cohen, S. (2025). Praat AudioTools
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

What this does

This script implements a Latent Diffusion Resynthesis engine — a morph-chain generator that encodes audio events into a low-dimensional latent space via an on-the-fly autoencoder, discovers K acoustic identity clusters (k-means++), then runs a temperature-annealed diffusion loop that transforms a maximally-corrupted (noisy) seed vector back toward its cluster identity. The output is a Morph-Chain: one continuous audio sequence per cluster, evolving from static / noise-like texture into a recognisable instrument identity.

🧠 What is Latent Diffusion?

Latent diffusion models apply a diffusion process in a learned latent space. In this implementation:

  • Events are encoded into latent vectors Z via autoencoder
  • Clusters (K acoustic identities) are discovered via k-means++
  • Seed events (farthest from cluster center) are heavily corrupted with Gaussian noise
  • Diffusion steps gradually denoise the vector, guided by cluster statistics
  • Boltzmann weighting allows stochastic exploration at high temperatures
  • Morph-Chain results: each step maps to a real event, creating a smooth evolution from noise to identity

Technical Implementation

  1. Event Segmentation: Praat segments the audio into events.
  2. Mel Patches: a 40×32 log-mel patch is extracted per event.
  3. Autoencoder: trained on-the-fly; events are encoded into latent space Z.
  4. Clustering: k-means++ on Z → K clusters.
  5. Diffusion Engine: for each cluster, seed from the farthest event, corrupt it, then iteratively denoise with temperature annealing, Boltzmann weights, and anti-loop selection.
  6. Reconstruction: map each diffusion step to the nearest real event; concatenate the chains with silence gaps.
  7. Visualization & Stats: draw the analysis display and report diagnostics.

Quick start

  1. In Praat, select exactly one Sound object (any duration, any content).
  2. Run script… → select LatentDiffusion.praat.
  3. Choose Preset (2-7 for specific strategies, 1 for custom).
  4. Set latent size, learning steps, number of clusters, diffusion steps.
  5. Adjust entropy threshold, temperature range, denoising strength.
  6. Enable Draw_visualization for analysis display.
  7. Click OK — engine segments, trains autoencoder, clusters, runs diffusion, reconstructs.
Quick tip: Start with Gentle Crystallisation preset on a 10-20 second recording with varied texture. Enable visualization — you'll see the diffusion panel with temperature annealing bar (hot→cold) and cluster population bars. Listen to the morph-chains: each cluster evolves from noise to its acoustic identity. The output appears as "source_diffusion" in the Objects window.
Important:
  • Python dependencies: requires numpy, soundfile, scipy (no scikit-learn needed).
  • Autoencoder training happens on-the-fly and may take 30-60 seconds.
  • Cluster count should match the number of distinct acoustic identities in your source.
  • Diffusion steps controls chain length — more steps = longer morph chains.
  • Anti-loop mechanisms prevent repetitive loops: tabu penalty, stochastic top-K, temperature inheritance, and sparse-pool adaptation.

Latent Diffusion Theory

Autoencoder Encoding

Input: log-mel patch (40 mel bands × 32 frames = 1280 features)
Encoder: input (1280) → hidden (h) → latent (L)
Decoder: latent (L) → hidden (h) → output (1280)
Hidden size: h = max(L×2, min(256, √(1280×L))) (geometric-mean scaling)
Activations: leaky ReLU (α=0.01) for hidden layers, linear for output
Training: denoising autoencoder with Adam optimiser
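The shapes and the hidden-size rule above can be sketched in numpy. This is illustrative only: the weights are randomly initialised, the Adam training loop is omitted, and the function names are not the script's actual API.

```python
import numpy as np

def hidden_size(n_in=1280, latent=8):
    """h = max(L*2, min(256, sqrt(1280*L))) -- the geometric-mean rule from the guide."""
    return int(max(latent * 2, min(256, np.sqrt(n_in * latent))))

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

rng = np.random.default_rng(42)
n_in, latent = 1280, 8
h = hidden_size(n_in, latent)            # 101 for the default sizes

# Untrained weights; the real script fits these with a denoising objective + Adam.
W1 = rng.normal(0, 0.01, (n_in, h))
W2 = rng.normal(0, 0.01, (h, latent))
W3 = rng.normal(0, 0.01, (latent, h))
W4 = rng.normal(0, 0.01, (h, n_in))

def encode(x):
    # leaky-ReLU hidden layer; latent layer kept linear (an assumption)
    return leaky_relu(x @ W1) @ W2

def decode(z):
    # leaky-ReLU hidden layer; linear output, as stated in the guide
    return leaky_relu(z @ W3) @ W4

x = rng.normal(size=(1, n_in))           # one flattened 40x32 log-mel patch
z = encode(x)
x_hat = decode(z)
```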

Cluster Discovery

K-means++ on latent vectors Z ∈ ℝᴺˣᴸ:

  1. Choose the first center uniformly at random.
  2. For each subsequent center, sample a point with probability ∝ distance² to the nearest existing center.
  3. Iterate until convergence (up to 60 iterations).

Per-cluster diagonal variance: σ_k² = var(Z[mask == k]) + 1e-8 (fallback to global variance for small clusters)

Cross-entropy for a point z to cluster k: CE_k = 0.5 × Σ_d (z_d − μ_kd)² / σ_kd²
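A minimal numpy sketch of the k-means++ seeding and the diagonal-Gaussian cross-entropy above. Function names and the synthetic data are illustrative, not the script's actual code; the Lloyd iterations are omitted.

```python
import numpy as np

def kmeanspp_init(Z, k, rng):
    """k-means++ seeding: first centre uniform, then distance^2-weighted sampling."""
    centers = [Z[rng.integers(len(Z))]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest existing centre
        d2 = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[rng.choice(len(Z), p=d2 / d2.sum())])
    return np.array(centers)

def cross_entropy(z, mu, var):
    """CE_k = 0.5 * sum_d (z_d - mu_kd)^2 / sigma_kd^2 (diagonal Gaussian)."""
    return 0.5 * np.sum((z - mu) ** 2 / var)

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 8))                              # N=50 latent vectors, L=8
centers = kmeanspp_init(Z, k=3, rng=rng)

labels = np.argmin(((Z[:, None] - centers[None]) ** 2).sum(-1), axis=1)
var0 = Z[labels == 0].var(axis=0) + 1e-8                  # per-cluster diagonal variance
ce = cross_entropy(Z[0], centers[0], var0)
```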

Diffusion Step

📈 Temperature-Annealed Gradient Descent

At step t with temperature T:

  1. Compute Boltzmann weights: w_k = softmax(−CE_k / T)
  2. Gradient: g = Σ_k w_k · (z − μ_k) / σ_k²
  3. Step size: step = denoising_strength × (1/T) / (1 + 1/T) × (1 − 0.4·step_frac)
  4. Denoised: z′ = z − step × g
  5. Annealed noise: ε ∼ 𝒩(0, √T × (1 − step_frac) × 0.4)
  6. New z = z′ + ε

Temperature annealing: Exponential from T_start to T_end (with floor at 0.05)

Effect: High T → exploratory, low T → deterministic convergence
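The six-step update and the exponential annealing schedule can be sketched as a toy numpy loop. Assumptions: the noise term is read as a standard deviation, the schedule is T_t = T_start·(T_end/T_start)^(t/(N−1)) with a 0.05 floor, and the cluster statistics here are synthetic.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def diffusion_step(z, mus, varis, T, step_frac, denoise=0.6, rng=None):
    """One temperature-annealed denoising step toward the cluster identities."""
    ce = 0.5 * np.sum((z - mus) ** 2 / varis, axis=1)      # CE_k per cluster
    w = softmax(-ce / T)                                   # Boltzmann weights
    g = np.sum(w[:, None] * (z - mus) / varis, axis=0)     # weighted gradient
    step = denoise * (1 / T) / (1 + 1 / T) * (1 - 0.4 * step_frac)
    z_new = z - step * g
    noise_sd = np.sqrt(T) * (1 - step_frac) * 0.4          # annealed noise scale
    return z_new + rng.normal(0, noise_sd, size=z.shape)

rng = np.random.default_rng(1)
K, L = 3, 8
mus = rng.normal(size=(K, L))
varis = np.full((K, L), 0.5)
z = mus[0] + rng.normal(0, 2.0, L)                         # corrupted seed

N, T_start, T_end = 25, 2.0, 0.1
for t in range(N):
    # exponential annealing with a floor at 0.05, per the guide
    T = max(0.05, T_start * (T_end / T_start) ** (t / (N - 1)))
    z = diffusion_step(z, mus, varis, T, step_frac=t / (N - 1), rng=rng)
```

At high T the weights spread across clusters and the added noise dominates (exploration); as T falls the step becomes a near-deterministic pull toward the nearest cluster mean.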

Anti-Loop Mechanisms

🔄 Preventing Repetitive Loops

1. Tabu Penalty: Events used in last N steps get distance multiplied by penalty factor (default 5.0). Discourages immediate repetition.

2. Stochastic Top-K Selection: At each step, consider only the K nearest events, then sample with Boltzmann probability p_i ∝ exp(-d_i / T_sel). T_sel inherits from diffusion temperature.

3. Temperature Inheritance: Selection temperature = max(diffusion_T, T_min), where T_min adapts to pool size (0.08-0.35).

4. Sparse-Pool Adaptation: When event pool < 20, parameters scale up:

  • top_k grows to cover 25-50% of pool
  • tabu_size grows to cover 25-40% of pool
  • T_min raised to keep Boltzmann weights diffuse
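Mechanisms 1-3 can be sketched together as a single selection function. This is a toy numpy illustration, not the script's chain_to_event_sequence(); the default values and variable names are assumptions.

```python
import numpy as np

def select_event(z, E, used, T_sel, top_k=5, tabu_penalty=5.0, rng=None):
    """Map a latent vector to a real event index with tabu + stochastic top-K."""
    d = np.linalg.norm(E - z, axis=1)          # distance to every event latent
    d[list(used)] *= tabu_penalty              # tabu: penalise recently used events
    idx = np.argsort(d)[:top_k]                # restrict to the K nearest candidates
    p = np.exp(-d[idx] / T_sel)                # Boltzmann probability over candidates
    p /= p.sum()
    return rng.choice(idx, p=p)

rng = np.random.default_rng(2)
E = rng.normal(size=(30, 8))                   # 30 event latents
z = E[7] + 0.01                                # current diffusion state, near event 7
used = {7, 12}                                 # tabu list from the last steps
choice = select_event(z, E, used, T_sel=0.2, rng=rng)
```

Because the tabu penalty only multiplies distances, a penalised event can still be chosen when the pool is sparse, which matches the sparse-pool adaptation described above.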

Morph-Chain Construction

For each cluster:

  1. Start with the seed event (farthest from the cluster center).
  2. Corrupt it with noise: z₀ = z_seed + T_start × σ(Z) × ε
  3. Run the diffusion steps to get a chain of latent vectors [z₀, z₁, ..., z_N].
  4. Map each z_t to the nearest real event using anti-loop selection.
  5. Concatenate the resulting event clips with crossfades.

Output: [Chain0: noisy→refined] | silence | [Chain1: noisy→refined] | …
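Step 5, splicing the selected event clips, might look like this minimal sketch. It assumes a simple linear crossfade; the script's actual fade shape and helper names are not specified here.

```python
import numpy as np

def concat_with_crossfade(clips, sr, xfade_sec=0.008):
    """Concatenate event clips, overlapping each joint by a short linear crossfade."""
    n = int(sr * xfade_sec)                     # crossfade length in samples
    out = clips[0].astype(float)
    for clip in clips[1:]:
        clip = clip.astype(float)
        fade = np.linspace(0.0, 1.0, n)
        # fade the tail of the running output into the head of the next clip
        out[-n:] = out[-n:] * (1 - fade) + clip[:n] * fade
        out = np.concatenate([out, clip[n:]])
    return out

sr = 44100
rng = np.random.default_rng(3)
clips = [rng.normal(0, 0.1, sr // 10) for _ in range(4)]    # four 100 ms clips
chain = concat_with_crossfade(clips, sr)
```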

Preset Strategies

Preset 2: Gentle Crystallisation

❄️ Gradual Emergence

Latent: 8 | Steps: 100 | Clusters: 3

Diffusion: 25 steps | Entropy: 1.2 | T: 1.0→0.06 | Denoise: 0.5

Character: Gentle emergence from noise to identity — subtle, smooth evolution

Use on: Ambient, gradual transformations

Preset 3: Stochastic Melt

🌊 Exploratory Diffusion

Latent: 8 | Steps: 100 | Clusters: 3

Diffusion: 40 steps | Entropy: 0.6 | T: 3.0→0.30 | Denoise: 0.3

Character: High start temperature, low denoising — exploratory, stochastic

Use on: Chaotic, unpredictable textures

Preset 4: Deep Freeze

🧊 Highly Deterministic

Latent: 10 | Steps: 150 | Clusters: 3

Diffusion: 50 steps | Entropy: 0.5 | T: 2.5→0.05 | Denoise: 0.8

Character: Strong denoising, low final T — deterministic convergence

Use on: Precise identity emergence

Preset 5: Plasma Burst

⚡ Fast, Explosive

Latent: 8 | Steps: 80 | Clusters: 4

Diffusion: 15 steps | Entropy: 2.0 | T: 4.0→0.50 | Denoise: 0.9

Character: Short chains, high start T, rapid denoising — explosive emergence

Use on: Percussive, dramatic textures

Preset 6: Slow Diffusion

🐢 Long, Gradual

Latent: 12 | Steps: 150 | Clusters: 3

Diffusion: 80 steps | Entropy: 0.8 | T: 1.5→0.08 | Denoise: 0.5

Character: Long chains, slow annealing — very gradual emergence

Use on: Long-form evolution, meditation

Preset 7: Multi-Identity

🎭 Many Clusters

Latent: 12 | Steps: 150 | Clusters: 6

Diffusion: 35 steps | Entropy: 1.0 | T: 2.0→0.10 | Denoise: 0.6

Character: Up to 6 clusters, moderate diffusion — explores many identities

Use on: Complex material with many acoustic states

Parameters & Controls

Autoencoder Parameters

Parameter        Default   Description
Latent_size      8         Autoencoder latent dimensions (2–32)
Learning_steps   100       Training iterations (10–500)

Cluster Parameters

Parameter            Default   Description
Number_of_clusters   3         K acoustic identity clusters (2–8)

Diffusion Parameters

Parameter            Default   Description
Diffusion_steps      30        Number of diffusion steps per chain (5–100)
Entropy_threshold    1.0       Early stop when cross-entropy < threshold
Temperature_start    2.0       Initial temperature (0.1–10)
Temperature_end      0.1       Final temperature (≥0.01, ≤ T_start)
Denoising_strength   0.6       Strength of gradient step (0–1)

Output

Parameter            Default   Description
Seed                 42        Random seed for reproducibility
Draw_visualization   1         Generate 8-panel analysis display
Play_result          1         Audition after processing

Visualization & Analysis

8-Panel Display

Latent Diffusion Visualization:

  • Panel 1: TITLE — script name, source name, preset, clusters, temperature range
  • Panel 2: INPUT WAVEFORM — gray waveform with red dotted lines = event boundaries; title: "Original (N events)"
  • Panel 3: OUTPUT WAVEFORM — blue waveform = diffused morph-chains; title: "Diffused"; X-axis: Time (s)
  • Panel 4: ORIGINAL SPECTROGRAM — 0-5000 Hz spectrogram of the original; title: "Original spectrogram"
  • Panel 5: OUTPUT SPECTROGRAM — 0-5000 Hz spectrogram of the diffused output; title: "Diffused spectrogram (Morph-Chain)"
  • Panel 6: DIFFUSION PANEL — temperature annealing bar (10 segments from hot orange→cold blue) labelled with T_start and T_end; cluster population bars (colored by cluster, height = % of events); statistics: steps, chains, early stops, denoising strength, latent spread, events, mean duration; title: "Diffusion:"
  • Panel 7: INTENSITY COMPARISON — X-axis: Time, Y-axis: dB; gray line = original intensity, blue line = diffused intensity; title: "Intensity: Grey = original | Blue = diffused"
  • Panel 8: SUMMARY PANEL — preset, clusters, steps, seed; autoencoder loss (initial→final), latent size; duration in/out, RMS comparison; temperature range, denoising strength, early stops; warnings if any

Reading the Diffusion Panel

What the bars show:
  • Temperature bar: 10 segments from orange (hot) to blue (cold) — shows annealing schedule
  • Cluster bars: Each colored bar's height = percentage of events in that cluster
  • Colors: Cluster 0 (blue), 1 (red), 2 (green), 3 (purple), etc. — consistent across visualizations
  • Numbers: Steps taken, number of morph chains, early stops count
  • Latent spread: Mean standard deviation of latent dimensions — measure of space coverage

Interpreting Morph-Chain Output

What you'll hear:
  • Each chain: Begins with noise-like texture, gradually crystallizes into cluster identity
  • Between chains: Brief silence gap (60ms) separates clusters
  • Chain length: Varies depending on early stopping
  • Anti-loop effects: Even in long chains, you'll hear variety rather than repetitive loops
  • The spectrogram will show gradual spectral refinement from noisy to structured

Applications

Electroacoustic Composition

Use case: Creating morphing textures that evolve from noise to identity

Technique: Gentle Crystallisation or Slow Diffusion presets

Workflow:

Sound Design for Media

Use case: Creating evolving textures, risers, transitions

Technique: Plasma Burst or Stochastic Melt on appropriate sources

Applications:

Music Production

Use case: Creating evolving pads, generative textures

Technique: Multi-Identity preset to explore many acoustic states

Examples:

Research & Education

Use case: Studying diffusion processes, latent spaces, clustering

Technique: Compare presets on same source, examine cluster distributions

Learning outcomes:

Practical Workflow Examples

🎬 Film Scene: Identity Emergence

Goal: Create 60-second cue representing a character's identity emerging from chaos

Settings:

  • Source: 30-second voice recording with multiple characters
  • Preset: Gentle Crystallisation
  • Clusters: 4 (one per character identity)

Result: Four morph-chains, each evolving from noise to a different vocal character

🎚️ Electronic Music: Riser

Goal: Create 15-second riser from synth stab

Settings:

  • Source: 8-second synth stab
  • Preset: Plasma Burst
  • Custom: diffusion_steps=10, T_start=5.0, T_end=0.5

Result: Short, explosive chain from chaos to synth identity — perfect riser

🎙️ Voice Processing: Character Exploration

Goal: Explore different vocal identities in a single recording

Settings:

  • Source: 20-second vocal improvisation
  • Preset: Multi-Identity
  • Clusters: 6, diffusion_steps=40

Result: Six morph-chains, each revealing a different vocal personality from the same source

Troubleshooting Common Issues

Problem: Python not found or missing packages
Cause: Python not installed, or packages missing
Solution: Install Python and the required packages: pip install numpy soundfile scipy

Problem: All events in one cluster
Cause: Source too homogeneous, or K too high
Solution: Reduce number_of_clusters, or use a source with more variety

Problem: Chains sound repetitive (looping)
Cause: Anti-loop mechanisms insufficient for the pool size
Solution: Increase tabu_penalty or top_k, or reduce diffusion_steps

Problem: Output has clicks
Cause: Crossfade insufficient at splice points
Solution: Increase XFADE_SEC in the Python script (currently 8 ms)

Problem: Chains too short
Cause: Early stopping triggered too soon
Solution: Decrease entropy_threshold (early stop fires when cross-entropy falls below it), increase diffusion_steps, or lower denoising_strength

Advanced Techniques

Custom anti-loop tuning:

In chain_to_event_sequence(), modify tabu_penalty, top_k, and T_sel_min to adjust repetition avoidance.

Annealing schedule modification:

In run_diffusion_chains(), replace exponential annealing with linear or custom curves.
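For example, a linear schedule can be swapped in for the exponential one. This is a sketch of the two curves only; run_diffusion_chains() itself is not reproduced here, and the 0.05 floor follows the guide.

```python
import numpy as np

def exp_anneal(t, N, T_start, T_end, floor=0.05):
    """Exponential schedule from T_start to T_end (the script's default, per the guide)."""
    return max(floor, T_start * (T_end / T_start) ** (t / (N - 1)))

def lin_anneal(t, N, T_start, T_end, floor=0.05):
    """Drop-in linear alternative: cools more slowly at the start, faster at the end."""
    return max(floor, T_start + (T_end - T_start) * t / (N - 1))

N = 30
exp_curve = [exp_anneal(t, N, 2.0, 0.1) for t in range(N)]
lin_curve = [lin_anneal(t, N, 2.0, 0.1) for t in range(N)]
```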

Cluster statistics:

In cluster_statistics(), use full covariance instead of diagonal for more accurate Gaussian models.
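A sketch of the full-covariance variant, using the Mahalanobis form of the cross-entropy. The regularisation value and variable names are assumptions, not the script's code.

```python
import numpy as np

rng = np.random.default_rng(4)
Zk = rng.normal(size=(40, 8))                        # latent vectors of one cluster

mu = Zk.mean(axis=0)
# full covariance with a small ridge so the inverse stays well-conditioned
cov = np.cov(Zk, rowvar=False) + 1e-6 * np.eye(8)
cov_inv = np.linalg.inv(cov)

def cross_entropy_full(z, mu, cov_inv):
    """0.5 * (z - mu)^T Sigma^{-1} (z - mu): replaces the diagonal CE_k."""
    d = z - mu
    return 0.5 * d @ cov_inv @ d

ce = cross_entropy_full(Zk[0], mu, cov_inv)
```

The trade-off: full covariance captures correlated latent dimensions but needs enough events per cluster to estimate L×L parameters reliably, which is why the script defaults to diagonal variances.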

Multi-channel input:

Script converts to mono for analysis. For stereo, modify to process each channel separately and recombine.