Latent Diffusion — Morph-Chain Generator — User Guide
Encodes audio events into a low-dimensional latent space via an on-the-fly autoencoder, discovers K acoustic identity clusters (k-means++), then runs a temperature-annealed diffusion loop that transforms a maximally-corrupted (noisy) seed vector back toward its cluster identity. Output is a Morph-Chain: one continuous audio sequence per cluster, evolving from static / noise-like texture into a recognisable instrument identity.
What this does
This script implements a Latent Diffusion Resynthesis engine: a morph-chain generator that follows the pipeline summarised above, from autoencoder encoding through k-means++ clustering to temperature-annealed diffusion. It produces one Morph-Chain per cluster.
🧠 What is Latent Diffusion?
Latent diffusion models apply a diffusion process in a learned latent space. In this implementation:
- Events are encoded into latent vectors Z via autoencoder
- Clusters (K acoustic identities) are discovered via k-means++
- Seed events (farthest from cluster center) are heavily corrupted with Gaussian noise
- Diffusion steps gradually denoise the vector, guided by cluster statistics
- Boltzmann weighting allows stochastic exploration at high temperatures
- Morph-Chain results: each step maps to a real event, creating a smooth evolution from noise to identity
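The seed-and-corrupt step in the list above can be sketched as follows. This is a minimal NumPy illustration, not the script's code: the function names `pick_seed` and `corrupt` and the noise scale `sigma` are assumptions made for the example.

```python
import numpy as np

def pick_seed(Z, labels, centers, k):
    """Return the index of the event in cluster k that lies farthest from its center."""
    members = np.flatnonzero(labels == k)
    dists = np.linalg.norm(Z[members] - centers[k], axis=1)
    return int(members[np.argmax(dists)])

def corrupt(z, sigma, rng):
    """Add isotropic Gaussian noise; sigma is an assumed, tunable noise scale."""
    return z + sigma * rng.standard_normal(z.shape)

# Toy latent vectors: three events in cluster 0, one in cluster 1.
rng = np.random.default_rng(42)
Z = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [10.0, 10.0]])
labels = np.array([0, 0, 0, 1])
centers = np.array([Z[labels == k].mean(axis=0) for k in range(2)])

seed_idx = pick_seed(Z, labels, centers, k=0)
z_noisy = corrupt(Z[seed_idx], sigma=2.0, rng=rng)
```

Starting from the outlier of each cluster and adding heavy noise gives the chain the longest possible path back to the cluster identity.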
Key Features:
- 6 Preset Strategies — Gentle Crystallisation to Multi-Identity, plus Custom
- On-the-Fly Autoencoder — Pure numpy MLP with Adam, trained on log-mel patches
- K-means++ Clustering — Discovers acoustic identity clusters
- Temperature-Annealed Diffusion — Exponential annealing from T_start to T_end
- Boltzmann-Weighted Gradient — Soft cluster assignment guides denoising
- 4 Anti-Loop Mechanisms — Tabu penalty, stochastic top-K, temperature inheritance, sparse-pool adaptation
- Morph-Chain Output — One chain per cluster, from noisy to refined
- Comprehensive Visualization — 6-panel display with waveforms, spectrograms, diffusion panel, cluster bars, intensity comparison
Technical Implementation:
1. Event Segmentation: Praat segments audio into events.
2. Mel Patches: 40×32 log-mel patches per event.
3. Autoencoder: Train on-the-fly, encode to latent space Z.
4. Clustering: k-means++ on Z → K clusters.
5. Diffusion Engine: For each cluster, seed from farthest event, corrupt, iteratively denoise with temperature annealing, Boltzmann weights, anti-loop selection.
6. Reconstruction: Map each diffusion step to nearest real event, concatenate chains with silence gaps.
7. Visualization & Stats.
Quick start
- In Praat, select exactly one Sound object (any duration, any content).
- Run script… → select LatentDiffusion.praat.
- Choose Preset (2-7 for specific strategies, 1 for custom).
- Set latent size, learning steps, number of clusters, diffusion steps.
- Adjust entropy threshold, temperature range, denoising strength.
- Enable Draw_visualization for analysis display.
- Click OK — engine segments, trains autoencoder, clusters, runs diffusion, reconstructs.
Latent Diffusion Theory
Autoencoder Encoding
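The guide describes the encoder as a pure NumPy MLP trained with Adam on 40×32 log-mel patches. A minimal sketch of that idea follows. The single tanh bottleneck layer, the initialisation, the learning rate and every function name here are assumptions for illustration, not the script's actual architecture, and the random input only stands in for real patches.

```python
import numpy as np

def init_params(d, latent_size, rng):
    """Small random weights for encoder (W1, b1) and decoder (W2, b2)."""
    return {
        "W1": rng.standard_normal((d, latent_size)) / np.sqrt(d),
        "b1": np.zeros(latent_size),
        "W2": rng.standard_normal((latent_size, d)) / np.sqrt(latent_size),
        "b2": np.zeros(d),
    }

def encode(X, params):
    """Map flattened patches (n, d) to latent vectors Z (n, latent_size)."""
    return np.tanh(X @ params["W1"] + params["b1"])

def loss_and_grads(params, X):
    """Mean squared reconstruction error and its gradients (manual backprop)."""
    Z = encode(X, params)
    err = Z @ params["W2"] + params["b2"] - X
    loss = float(np.mean(err ** 2))
    d_out = 2.0 * err / err.size                      # dLoss/dReconstruction
    d_pre = (d_out @ params["W2"].T) * (1.0 - Z ** 2)  # back through tanh
    grads = {"W1": X.T @ d_pre, "b1": d_pre.sum(axis=0),
             "W2": Z.T @ d_out, "b2": d_out.sum(axis=0)}
    return loss, grads

def train_autoencoder(X, latent_size=8, steps=100, lr=1e-3, seed=42):
    """Full-batch Adam training; returns parameters and the loss history."""
    rng = np.random.default_rng(seed)
    params = init_params(X.shape[1], latent_size, rng)
    m = {name: np.zeros_like(p) for name, p in params.items()}
    v = {name: np.zeros_like(p) for name, p in params.items()}
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    losses = []
    for t in range(1, steps + 1):
        loss, grads = loss_and_grads(params, X)
        losses.append(loss)
        for name in params:
            m[name] = beta1 * m[name] + (1 - beta1) * grads[name]
            v[name] = beta2 * v[name] + (1 - beta2) * grads[name] ** 2
            m_hat = m[name] / (1 - beta1 ** t)   # bias-corrected first moment
            v_hat = v[name] / (1 - beta2 ** t)   # bias-corrected second moment
            params[name] -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, losses

# Stand-in data: 24 "events", each a flattened 40x32 patch.
X = np.random.default_rng(0).standard_normal((24, 40 * 32))
params, losses = train_autoencoder(X, latent_size=8, steps=100)
Z = encode(X, params)   # latent vectors used by the clustering stage
```

Inspecting `losses` is a quick way to check whether the chosen Learning_steps and learning rate actually converge on a given recording.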
Cluster Discovery
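Clusters are found with k-means++ on the latent vectors Z. The sketch below shows the standard k-means++ seeding followed by Lloyd iterations. The function names, iteration cap and toy data are assumptions, and the script's own implementation may differ in detail.

```python
import numpy as np

def kmeans_pp_init(Z, k, rng):
    """k-means++ seeding: each new center is drawn with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [Z[rng.integers(len(Z))]]
    for _ in range(1, k):
        diffs = Z[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=-1).min(axis=1)
        total = d2.sum()
        # Fall back to a uniform draw if all points coincide with a center.
        probs = d2 / total if total > 0 else np.full(len(Z), 1.0 / len(Z))
        centers.append(Z[rng.choice(len(Z), p=probs)])
    return np.array(centers)

def kmeans(Z, k, n_iter=50, seed=42):
    """Lloyd's algorithm started from k-means++ seeds."""
    rng = np.random.default_rng(seed)
    centers = kmeans_pp_init(Z, k, rng)
    for _ in range(n_iter):
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Three well-separated toy blobs in a 2-D latent space.
rng = np.random.default_rng(0)
offsets = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
Z = np.concatenate([off + 0.1 * rng.standard_normal((10, 2)) for off in offsets])
labels, centers = kmeans(Z, k=3)
```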
Diffusion Step
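One plausible reading of the Boltzmann-weighted gradient step is sketched below: cluster energies are turned into soft assignment weights at temperature T, and the vector moves a fraction (Denoising_strength) of the way toward the weighted cluster mean. The diagonal-Gaussian energy, the update rule and the function names are assumptions; the script's exact formula is not given in this guide.

```python
import numpy as np

def boltzmann_weights(z, means, variances, T):
    """Soft cluster assignment: w_k proportional to exp(-E_k / T), with E_k a
    diagonal-Gaussian energy (assumed form)."""
    energy = 0.5 * (((z - means) ** 2) / variances).sum(axis=1)
    logits = -(energy - energy.min()) / T   # shift by the minimum for stability
    w = np.exp(logits)
    return w / w.sum()

def denoise_step(z, means, variances, T, strength):
    """Move z toward the Boltzmann-weighted mean of the cluster centers."""
    w = boltzmann_weights(z, means, variances, T)
    target = w @ means
    return z + strength * (target - z), w

means = np.array([[0.0, 0.0], [4.0, 4.0]])
variances = np.ones_like(means)
z = np.array([1.0, 0.5])

# Low temperature: almost all weight on the nearest cluster.
z_cold, w_cold = denoise_step(z, means, variances, T=0.1, strength=0.6)
# High temperature: weights are nearly uniform, so the pull is less committed.
z_hot, w_hot = denoise_step(z, means, variances, T=100.0, strength=0.6)
```

At high T the step is pulled toward a blend of all clusters, which is what makes the early part of a chain exploratory. At low T it converges on a single identity.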
📈 Temperature-Annealed Gradient Descent
Temperature annealing: Exponential decay from T_start to T_end, with a floor at 0.05 (a T_end below 0.05 is effectively clamped to the floor)
Effect: High T → exploratory, low T → deterministic convergence
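A geometric interpolation is one common way to realise an exponential schedule with a floor. The sketch below assumes that form; the function name and the exact interpolation are not taken from the script.

```python
import numpy as np

def exponential_schedule(t_start, t_end, n_steps, floor=0.05):
    """Geometric interpolation from t_start to t_end, clamped at `floor`."""
    if n_steps == 1:
        return np.array([max(t_start, floor)])
    frac = np.arange(n_steps) / (n_steps - 1)      # 0 at first step, 1 at last
    temps = t_start * (t_end / t_start) ** frac
    return np.maximum(temps, floor)

# Defaults from the parameter table: 2.0 -> 0.1 over 30 steps.
temps = exponential_schedule(2.0, 0.1, 30)
# A T_end below the floor is clamped.
clamped = exponential_schedule(1.0, 0.01, 10)
```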
Anti-Loop Mechanisms
🔄 Preventing Repetitive Loops
1. Tabu Penalty: Events used in the last N steps have their distance multiplied by a penalty factor (default 5.0), which discourages immediate repetition.
2. Stochastic Top-K Selection: At each step, consider only the top_k nearest events (not to be confused with the K clusters), then sample with Boltzmann probability p_i ∝ exp(-d_i / T_sel). T_sel inherits from the diffusion temperature.
3. Temperature Inheritance: Selection temperature = max(diffusion_T, T_min), where T_min adapts to pool size (0.08-0.35).
4. Sparse-Pool Adaptation: When the event pool holds fewer than 20 events, parameters scale up:
- top_k grows to cover 25-50% of pool
- tabu_size grows to cover 25-40% of pool
- T_min raised to keep Boltzmann weights diffuse
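The first three mechanisms above can be combined in a single selection function, sketched here. The tabu penalty of 5.0 and the lower bound of 0.08 for T_min come from this guide. The function names, the `top_k` value and the tabu length used in the demo are placeholders, and sparse-pool scaling is left out because its exact formulas are not documented here.

```python
import numpy as np
from collections import deque

def penalised_distances(z, Z_pool, recent, tabu_penalty=5.0):
    """Distance from z to every event, multiplied by the penalty for tabu events."""
    d = np.linalg.norm(Z_pool - z, axis=1)
    if len(recent) > 0:
        d[np.unique(list(recent))] *= tabu_penalty
    return d

def select_event(z, Z_pool, T, recent, rng, top_k=5, tabu_penalty=5.0, T_sel_min=0.08):
    """Sample one of the top_k nearest events with Boltzmann probabilities."""
    d = penalised_distances(z, Z_pool, recent, tabu_penalty)
    cand = np.argsort(d)[:min(top_k, len(d))]
    T_sel = max(T, T_sel_min)                      # temperature inheritance
    logits = -(d[cand] - d[cand].min()) / T_sel
    p = np.exp(logits)
    p /= p.sum()
    return int(rng.choice(cand, p=p))

# Toy pool of six events on a line; z is held fixed here for simplicity,
# whereas in a real chain it would change at every diffusion step.
rng = np.random.default_rng(42)
Z_pool = np.arange(6, dtype=float).reshape(-1, 1)
z = np.array([0.4])
recent = deque(maxlen=2)                           # tabu list of the last 2 picks
chain = []
for T in np.linspace(1.0, 0.1, 30):
    idx = select_event(z, Z_pool, T, recent, rng, top_k=3)
    recent.append(idx)
    chain.append(idx)
```

Note that a distance of exactly zero is unaffected by a multiplicative penalty, so an event that coincides with z can still be reselected.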
Morph-Chain Construction
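Once each diffusion step has been mapped to a real event, the event audio is spliced into a chain and the chains are joined with silence. The sketch below uses the 8 ms crossfade (XFADE_SEC) and 60 ms gap mentioned elsewhere in this guide. The linear crossfade shape, the `GAP_SEC` name and the function names are assumptions.

```python
import numpy as np

XFADE_SEC = 0.008   # crossfade at splice points (8 ms, per this guide)
GAP_SEC = 0.060     # silence between chains (60 ms, per this guide)

def crossfade_concat(segments, sr, xfade_sec=XFADE_SEC):
    """Concatenate mono segments with a linear crossfade at each splice."""
    out = np.asarray(segments[0], dtype=float).copy()
    for seg in segments[1:]:
        seg = np.asarray(seg, dtype=float)
        n = min(int(round(xfade_sec * sr)), len(out), len(seg))
        if n > 0:
            ramp = np.linspace(0.0, 1.0, n)
            overlap = out[-n:] * (1.0 - ramp) + seg[:n] * ramp
            out = np.concatenate([out[:-n], overlap, seg[n:]])
        else:
            out = np.concatenate([out, seg])
    return out

def join_chains(chains, sr, gap_sec=GAP_SEC):
    """Join finished chains with a short silence between them."""
    gap = np.zeros(int(round(gap_sec * sr)))
    parts = []
    for i, chain in enumerate(chains):
        if i > 0:
            parts.append(gap)
        parts.append(np.asarray(chain, dtype=float))
    return np.concatenate(parts)

# Two constant segments at a toy sample rate of 1 kHz.
sr = 1000
chain = crossfade_concat([np.ones(100), np.ones(100)], sr)
output = join_chains([chain, chain], sr)
```

Each splice shortens the chain by the crossfade length, so chain duration is slightly less than the sum of its event durations.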
Preset Strategies
Preset 2: Gentle Crystallisation
❄️ Gradual Emergence
Latent: 8 | Steps: 100 | Clusters: 3
Diffusion: 25 steps | Entropy: 1.2 | T: 1.0→0.06 | Denoise: 0.5
Character: Gentle emergence from noise to identity — subtle, smooth evolution
Use on: Ambient, gradual transformations
Preset 3: Stochastic Melt
🌊 Exploratory Diffusion
Latent: 8 | Steps: 100 | Clusters: 3
Diffusion: 40 steps | Entropy: 0.6 | T: 3.0→0.30 | Denoise: 0.3
Character: High start temperature, low denoising — exploratory, stochastic
Use on: Chaotic, unpredictable textures
Preset 4: Deep Freeze
🧊 Highly Deterministic
Latent: 10 | Steps: 150 | Clusters: 3
Diffusion: 50 steps | Entropy: 0.5 | T: 2.5→0.05 | Denoise: 0.8
Character: Strong denoising, low final T — deterministic convergence
Use on: Precise identity emergence
Preset 5: Plasma Burst
⚡ Fast, Explosive
Latent: 8 | Steps: 80 | Clusters: 4
Diffusion: 15 steps | Entropy: 2.0 | T: 4.0→0.50 | Denoise: 0.9
Character: Short chains, high start T, rapid denoising — explosive emergence
Use on: Percussive, dramatic textures
Preset 6: Slow Diffusion
🐢 Long, Gradual
Latent: 12 | Steps: 150 | Clusters: 3
Diffusion: 80 steps | Entropy: 0.8 | T: 1.5→0.08 | Denoise: 0.5
Character: Long chains, slow annealing — very gradual emergence
Use on: Long-form evolution, meditation
Preset 7: Multi-Identity
🎭 Many Clusters
Latent: 12 | Steps: 150 | Clusters: 6
Diffusion: 35 steps | Entropy: 1.0 | T: 2.0→0.10 | Denoise: 0.6
Character: Up to 6 clusters, moderate diffusion — explores many identities
Use on: Complex material with many acoustic states
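For side-by-side comparison, the six presets above can be collected in one table and checked against the ranges given under Parameters & Controls. The dictionary layout and key names are illustrative only, and "Steps" is read here as Learning_steps (an assumption based on the separate "Diffusion: N steps" entry).

```python
# Columns: name, latent_size, learning_steps, clusters, diffusion_steps,
#          entropy_threshold, t_start, t_end, denoise
_ROWS = {
    2: ("Gentle Crystallisation", 8, 100, 3, 25, 1.2, 1.0, 0.06, 0.5),
    3: ("Stochastic Melt",        8, 100, 3, 40, 0.6, 3.0, 0.30, 0.3),
    4: ("Deep Freeze",           10, 150, 3, 50, 0.5, 2.5, 0.05, 0.8),
    5: ("Plasma Burst",           8,  80, 4, 15, 2.0, 4.0, 0.50, 0.9),
    6: ("Slow Diffusion",        12, 150, 3, 80, 0.8, 1.5, 0.08, 0.5),
    7: ("Multi-Identity",        12, 150, 6, 35, 1.0, 2.0, 0.10, 0.6),
}
_KEYS = ("name", "latent_size", "learning_steps", "clusters", "diffusion_steps",
         "entropy_threshold", "t_start", "t_end", "denoise")
PRESETS = {num: dict(zip(_KEYS, row)) for num, row in _ROWS.items()}

def in_documented_ranges(p):
    """Check a preset against the ranges listed in the parameter tables."""
    return (2 <= p["latent_size"] <= 32
            and 10 <= p["learning_steps"] <= 500
            and 2 <= p["clusters"] <= 8
            and 5 <= p["diffusion_steps"] <= 100
            and 0.1 <= p["t_start"] <= 10
            and 0.01 <= p["t_end"] <= p["t_start"]
            and 0 <= p["denoise"] <= 1)

all_valid = all(in_documented_ranges(p) for p in PRESETS.values())
```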
Parameters & Controls
Autoencoder Parameters
| Parameter | Default | Description |
|---|---|---|
| Latent_size | 8 | Autoencoder latent dimensions (2–32) |
| Learning_steps | 100 | Training iterations (10–500) |
Cluster Parameters
| Parameter | Default | Description |
|---|---|---|
| Number_of_clusters | 3 | K acoustic identity clusters (2–8) |
Diffusion Parameters
| Parameter | Default | Description |
|---|---|---|
| Diffusion_steps | 30 | Number of diffusion steps per chain (5–100) |
| Entropy_threshold | 1.0 | Early stop when cross-entropy < threshold |
| Temperature_start | 2.0 | Initial temperature (0.1–10) |
| Temperature_end | 0.1 | Final temperature (≥0.01, ≤ T_start) |
| Denoising_strength | 0.6 | Strength of gradient step (0–1) |
Output
| Parameter | Default | Description |
|---|---|---|
| Seed | 42 | Random seed for reproducibility |
| Draw_visualization | 1 | Generate 6-panel analysis display |
| Play_result | 1 | Audition after processing |
Visualization & Analysis
6-Panel Display
Reading the Diffusion Panel
- Temperature bar: 10 segments from orange (hot) to blue (cold) — shows annealing schedule
- Cluster bars: Each colored bar's height = percentage of events in that cluster
- Colors: Cluster 0 (blue), 1 (red), 2 (green), 3 (purple), etc. — consistent across visualizations
- Numbers: Steps taken, number of morph chains, early stops count
- Latent spread: Mean standard deviation of latent dimensions — measure of space coverage
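Two of the numbers in this panel are simple to reproduce from the latent vectors and cluster labels. The sketch below follows the definitions in the list above; the function names are assumptions.

```python
import numpy as np

def latent_spread(Z):
    """Mean, over latent dimensions, of the per-dimension standard deviation."""
    return float(Z.std(axis=0).mean())

def cluster_percentages(labels, k):
    """Percentage of events assigned to each of the k clusters (bar heights)."""
    counts = np.bincount(labels, minlength=k)
    return 100.0 * counts / counts.sum()

Z = np.array([[0.0, 0.0], [2.0, 4.0]])
spread = latent_spread(Z)
shares = cluster_percentages(np.array([0, 0, 1, 2]), k=3)
```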
Interpreting Morph-Chain Output
- Each chain: Begins with noise-like texture, gradually crystallizes into cluster identity
- Between chains: Brief silence gap (60 ms) separates clusters
- Chain length: Varies depending on early stopping
- Anti-loop effects: Even in long chains, you'll hear variety rather than repetitive loops
- The spectrogram will show gradual spectral refinement from noisy to structured
Applications
Electroacoustic Composition
Use case: Creating morphing textures that evolve from noise to identity
Technique: Gentle Crystallisation or Slow Diffusion presets
Workflow:
- Select a 20-60 second recording with multiple acoustic states
- Run with Gentle Crystallisation preset
- Listen to how each cluster emerges from noise to its characteristic sound
- Export and use as movement in larger work
Sound Design for Media
Use case: Creating evolving textures, risers, transitions
Technique: Plasma Burst or Stochastic Melt on appropriate sources
Applications:
- Risers: Plasma Burst with short, explosive chains
- Ambient evolution: Slow Diffusion with long, gradual chains
- Character emergence: Deep Freeze for precise identity crystallization
Music Production
Use case: Creating evolving pads, generative textures
Technique: Multi-Identity preset to explore many acoustic states
Examples:
- Pad evolution: Gentle Crystallisation on synth pad
- Rhythmic textures: Plasma Burst on percussive loops
- Generative beds: Slow Diffusion with long chains
Research & Education
Use case: Studying diffusion processes, latent spaces, clustering
Technique: Compare presets on same source, examine cluster distributions
Learning outcomes:
- Understand how temperature annealing affects diffusion
- See how clustering discovers acoustic identities
- Observe anti-loop mechanisms in action
- Explore relationship between latent space and perceptual identity
Practical Workflow Examples
🎬 Film Scene: Identity Emergence
Goal: Create 60-second cue representing a character's identity emerging from chaos
Settings:
- Source: 30-second voice recording with multiple characters
- Preset: Gentle Crystallisation
- Clusters: 4 (one per character identity)
Result: Four morph-chains, each evolving from noise to a different vocal character
🎚️ Electronic Music: Riser
Goal: Create 15-second riser from synth stab
Settings:
- Source: 8-second synth stab
- Preset: Plasma Burst
- Custom: diffusion_steps=10, T_start=5.0, T_end=0.5
Result: Short, explosive chain from chaos to synth identity — perfect riser
🎙️ Voice Processing: Character Exploration
Goal: Explore different vocal identities in a single recording
Settings:
- Source: 20-second vocal improvisation
- Preset: Multi-Identity
- Clusters: 6, diffusion_steps=40
Result: Six morph-chains, each revealing a different vocal personality from the same source
Troubleshooting Common Issues
Cause: Python not installed, or packages missing
Solution: Install Python and required packages: pip install numpy soundfile scipy
Cause: Source too homogeneous, or k too high
Solution: Reduce number_of_clusters, or use source with more variety
Cause: Anti-loop mechanisms insufficient for pool size
Solution: Increase tabu_penalty, top_k, or reduce diffusion_steps
Cause: Crossfade insufficient at splice points
Solution: Increase XFADE_SEC in the Python script (currently 8 ms)
Cause: Early stopping triggered too early
Solution: Increase entropy_threshold, reduce diffusion_steps, or lower denoising_strength
Advanced Techniques
In chain_to_event_sequence(), modify tabu_penalty, top_k, and T_sel_min to adjust repetition avoidance.
In run_diffusion_chains(), replace exponential annealing with linear or custom curves.
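Two drop-in alternatives to the exponential curve are sketched below. How the schedule is wired into run_diffusion_chains() is not shown in this guide, so these helpers are hypothetical and only produce the temperature array.

```python
import numpy as np

def linear_schedule(t_start, t_end, n_steps):
    """Constant temperature decrement per step."""
    return np.linspace(t_start, t_end, n_steps)

def cosine_schedule(t_start, t_end, n_steps):
    """Slow start, fast middle, slow finish (half-cosine curve)."""
    frac = np.linspace(0.0, 1.0, n_steps)
    return t_end + 0.5 * (t_start - t_end) * (1.0 + np.cos(np.pi * frac))

lin = linear_schedule(2.0, 0.1, 30)
cos = cosine_schedule(2.0, 0.1, 30)
```

Compared with exponential annealing, both curves spend more steps at high temperature, so chains stay noise-like for longer before crystallising.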
In cluster_statistics(), use full covariance instead of diagonal for more accurate Gaussian models.
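The difference between the two models shows up in the Mahalanobis distance. The sketch below compares a diagonal and a full-covariance version; the function names and the ridge term are assumptions. Some regularisation is advisable because clusters with few events give a singular or near-singular covariance estimate.

```python
import numpy as np

def mahalanobis_diag(z, mean, var):
    """Squared distance under a diagonal Gaussian (per-dimension variances)."""
    return float((((z - mean) ** 2) / var).sum())

def mahalanobis_full(z, mean, cov, ridge=1e-6):
    """Squared distance under a full-covariance Gaussian, with a small ridge
    added to the diagonal so the solve stays well conditioned."""
    diff = z - mean
    reg = cov + ridge * np.eye(len(mean))
    return float(diff @ np.linalg.solve(reg, diff))

# A cluster whose two latent dimensions are strongly correlated.
mean = np.zeros(2)
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
var = np.diag(cov)

along = np.array([1.0, 1.0])     # lies along the correlation axis
across = np.array([1.0, -1.0])   # lies across it

d_diag_along = mahalanobis_diag(along, mean, var)
d_diag_across = mahalanobis_diag(across, mean, var)
d_full_along = mahalanobis_full(along, mean, cov)
d_full_across = mahalanobis_full(across, mean, cov)
```

The diagonal model rates both points as equally typical, while the full model treats the point across the correlation axis as far less likely. In practice `cov` would be estimated per cluster, for example with `np.cov(Z_k, rowvar=False)`.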
The script converts to mono for analysis. For stereo, modify it to process each channel separately and recombine the results.
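A per-channel wrapper might look like the sketch below. `process_mono` is a hypothetical stand-in for the whole mono pipeline, and zero-padding to a common length is an assumption needed because independently processed channels can yield chains of different lengths.

```python
import numpy as np

def process_stereo(x, process_mono):
    """Run a mono processing function on each channel of x (n_samples, n_channels)
    and stack the results, zero-padding shorter outputs to a common length."""
    outs = [np.asarray(process_mono(x[:, c]), dtype=float) for c in range(x.shape[1])]
    n = max(len(o) for o in outs)
    padded = [np.pad(o, (0, n - len(o))) for o in outs]
    return np.stack(padded, axis=1)

# Toy stand-in for the pipeline: drop zero samples, so the two channels
# come back with different lengths.
x = np.array([[1.0, 1.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
y = process_stereo(x, lambda ch: ch[ch != 0])
```

Processing the channels independently can select different event sequences for left and right. Reusing the same Seed, or the event sequence found for one channel, keeps the stereo image coherent.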