Latent Diffusion — Morph-Chain Generator — User Guide

A latent-diffusion resynthesis engine for Praat: audio events are encoded into a low-dimensional latent space by an on-the-fly autoencoder, grouped into K acoustic identity clusters (k-means++), and regenerated by a temperature-annealed diffusion loop, yielding one continuous morph-chain per cluster that evolves from noise into a recognisable instrument identity.

Author: Shai Cohen
Affiliation: Department of Music, Bar-Ilan University, Israel
Version: 1.0 (2025)
License: MIT License
Citation: Cohen, S. (2025). Praat AudioTools
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

What this does

This script implements a Latent Diffusion Resynthesis engine — a morph-chain generator that encodes audio events into a low-dimensional latent space via an on-the-fly autoencoder, discovers K acoustic identity clusters (k-means++), then runs a temperature-annealed diffusion loop that transforms a maximally-corrupted (noisy) seed vector back toward its cluster identity. The output is a Morph-Chain: one continuous audio sequence per cluster, evolving from static / noise-like texture into a recognisable instrument identity.

🧠 What is Latent Diffusion?

Latent diffusion models apply a diffusion process in a learned latent space. In this implementation:

  • Events are encoded into latent vectors Z via autoencoder
  • Clusters (K acoustic identities) are discovered via k-means++
  • Seed events (farthest from cluster center) are heavily corrupted with Gaussian noise
  • Diffusion steps gradually denoise the vector, guided by cluster statistics
  • Boltzmann weighting allows stochastic exploration at high temperatures
  • Morph-Chain results: each step maps to a real event, creating a smooth evolution from noise to identity

Technical Implementation

  1. Event Segmentation: Praat segments the audio into events.
  2. Mel Patches: a 40×32 log-mel patch is extracted per event.
  3. Autoencoder: trained on-the-fly; events are encoded into latent space Z.
  4. Clustering: k-means++ on Z → K clusters.
  5. Diffusion Engine: for each cluster, seed from the farthest event, corrupt it, then iteratively denoise with temperature annealing, Boltzmann weights, and anti-loop selection.
  6. Reconstruction: map each diffusion step to the nearest real event; concatenate the chains with silence gaps.
  7. Visualization & Stats: draw the analysis display and report diagnostics.

Quick start

  1. In Praat, select exactly one Sound object (any duration, any content).
  2. Run script… → select LatentDiffusion.praat.
  3. Choose Preset (2-7 for specific strategies, 1 for custom).
  4. Set latent size, learning steps, number of clusters, diffusion steps.
  5. Adjust entropy threshold, temperature range, denoising strength.
  6. Enable Draw_visualization for analysis display.
  7. Click OK — engine segments, trains autoencoder, clusters, runs diffusion, reconstructs.
Quick tip: Start with Gentle Crystallisation preset on a 10-20 second recording with varied texture. Enable visualization — you'll see the diffusion panel with temperature annealing bar (hot→cold) and cluster population bars. Listen to the morph-chains: each cluster evolves from noise to its acoustic identity. The output appears as "source_diffusion" in the Objects window.
Important:
  • Python dependencies: requires numpy, soundfile, scipy (no scikit-learn needed).
  • Autoencoder training happens on-the-fly and may take 30-60 seconds.
  • Cluster count should match the number of distinct acoustic identities in your source.
  • Diffusion steps controls chain length — more steps = longer morph chains.
  • Anti-loop mechanisms prevent repetitive loops: tabu penalty, stochastic top-K, temperature inheritance, and sparse-pool adaptation.

Latent Diffusion Theory

Autoencoder Encoding

Input: log-mel patch (40 mel bands × 32 frames = 1280 features)
Encoder: input (1280) → hidden (h) → latent (L)
Decoder: latent (L) → hidden (h) → output (1280)
Hidden size: h = max(L×2, min(256, √(1280×L))) (geometric-mean scaling)
Activations: leaky ReLU (α=0.01) for hidden layers, linear for output
Training: denoising autoencoder with Adam optimiser
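The shapes and the hidden-size rule above can be sketched in numpy. This is illustrative only: the weights are randomly initialised, the Adam training loop is omitted, and the function names are not the script's actual API.

```python
import numpy as np

def hidden_size(n_in=1280, latent=8):
    """h = max(L*2, min(256, sqrt(1280*L))) -- the geometric-mean rule from the guide."""
    return int(max(latent * 2, min(256, np.sqrt(n_in * latent))))

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

rng = np.random.default_rng(42)
n_in, latent = 1280, 8
h = hidden_size(n_in, latent)            # 101 for the default sizes

# Untrained weights; the real script fits these with a denoising objective + Adam.
W1 = rng.normal(0, 0.01, (n_in, h))
W2 = rng.normal(0, 0.01, (h, latent))
W3 = rng.normal(0, 0.01, (latent, h))
W4 = rng.normal(0, 0.01, (h, n_in))

def encode(x):
    # leaky-ReLU hidden layer; latent layer kept linear (an assumption)
    return leaky_relu(x @ W1) @ W2

def decode(z):
    # leaky-ReLU hidden layer; linear output, as stated in the guide
    return leaky_relu(z @ W3) @ W4

x = rng.normal(size=(1, n_in))           # one flattened 40x32 log-mel patch
z = encode(x)
x_hat = decode(z)
```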

Cluster Discovery

K-means++ on latent vectors Z ∈ ℝᴺˣᴸ:

  1. Choose the first center uniformly at random.
  2. For each subsequent center, sample a point with probability ∝ distance² to the nearest existing center.
  3. Iterate until convergence (up to 60 iterations).

Per-cluster diagonal variance: σ_k² = var(Z[mask == k]) + 1e-8 (fallback to global variance for small clusters)

Cross-entropy for a point z to cluster k: CE_k = 0.5 × Σ_d (z_d − μ_kd)² / σ_kd²
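A minimal numpy sketch of the k-means++ seeding and the diagonal-Gaussian cross-entropy above. Function names and the synthetic data are illustrative, not the script's actual code; the Lloyd iterations are omitted.

```python
import numpy as np

def kmeanspp_init(Z, k, rng):
    """k-means++ seeding: first centre uniform, then distance^2-weighted sampling."""
    centers = [Z[rng.integers(len(Z))]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest existing centre
        d2 = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[rng.choice(len(Z), p=d2 / d2.sum())])
    return np.array(centers)

def cross_entropy(z, mu, var):
    """CE_k = 0.5 * sum_d (z_d - mu_kd)^2 / sigma_kd^2 (diagonal Gaussian)."""
    return 0.5 * np.sum((z - mu) ** 2 / var)

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 8))                              # N=50 latent vectors, L=8
centers = kmeanspp_init(Z, k=3, rng=rng)

labels = np.argmin(((Z[:, None] - centers[None]) ** 2).sum(-1), axis=1)
var0 = Z[labels == 0].var(axis=0) + 1e-8                  # per-cluster diagonal variance
ce = cross_entropy(Z[0], centers[0], var0)
```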

Diffusion Step

📈 Temperature-Annealed Gradient Descent

At step t with temperature T:

  1. Compute Boltzmann weights: w_k = softmax(−CE_k / T)
  2. Gradient: g = Σ_k w_k · (z − μ_k) / σ_k²
  3. Step size: step = denoising_strength × (1/T) / (1 + 1/T) × (1 − 0.4·step_frac)
  4. Denoised: z′ = z − step × g
  5. Annealed noise: ε ∼ 𝒩(0, √T × (1 − step_frac) × 0.4)
  6. New z = z′ + ε

Temperature annealing: Exponential from T_start to T_end (with floor at 0.05)

Effect: High T → exploratory, low T → deterministic convergence
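The six-step update and the exponential annealing schedule can be sketched as a toy numpy loop. Assumptions: the noise term is read as a standard deviation, the schedule is T_t = T_start·(T_end/T_start)^(t/(N−1)) with a 0.05 floor, and the cluster statistics here are synthetic.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def diffusion_step(z, mus, varis, T, step_frac, denoise=0.6, rng=None):
    """One temperature-annealed denoising step toward the cluster identities."""
    ce = 0.5 * np.sum((z - mus) ** 2 / varis, axis=1)      # CE_k per cluster
    w = softmax(-ce / T)                                   # Boltzmann weights
    g = np.sum(w[:, None] * (z - mus) / varis, axis=0)     # weighted gradient
    step = denoise * (1 / T) / (1 + 1 / T) * (1 - 0.4 * step_frac)
    z_new = z - step * g
    noise_sd = np.sqrt(T) * (1 - step_frac) * 0.4          # annealed noise scale
    return z_new + rng.normal(0, noise_sd, size=z.shape)

rng = np.random.default_rng(1)
K, L = 3, 8
mus = rng.normal(size=(K, L))
varis = np.full((K, L), 0.5)
z = mus[0] + rng.normal(0, 2.0, L)                         # corrupted seed

N, T_start, T_end = 25, 2.0, 0.1
for t in range(N):
    # exponential annealing with a floor at 0.05, per the guide
    T = max(0.05, T_start * (T_end / T_start) ** (t / (N - 1)))
    z = diffusion_step(z, mus, varis, T, step_frac=t / (N - 1), rng=rng)
```

At high T the weights spread across clusters and the added noise dominates (exploration); as T falls the step becomes a near-deterministic pull toward the nearest cluster mean.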

Anti-Loop Mechanisms

🔄 Preventing Repetitive Loops

1. Tabu Penalty: Events used in last N steps get distance multiplied by penalty factor (default 5.0). Discourages immediate repetition.

2. Stochastic Top-K Selection: At each step, consider only the K nearest events, then sample with Boltzmann probability p_i ∝ exp(-d_i / T_sel). T_sel inherits from diffusion temperature.

3. Temperature Inheritance: Selection temperature = max(diffusion_T, T_min), where T_min adapts to pool size (0.08-0.35).

4. Sparse-Pool Adaptation: When event pool < 20, parameters scale up:

  • top_k grows to cover 25-50% of pool
  • tabu_size grows to cover 25-40% of pool
  • T_min raised to keep Boltzmann weights diffuse
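Mechanisms 1-3 can be sketched together as a single selection function. This is a toy numpy illustration, not the script's chain_to_event_sequence(); the default values and variable names are assumptions.

```python
import numpy as np

def select_event(z, E, used, T_sel, top_k=5, tabu_penalty=5.0, rng=None):
    """Map a latent vector to a real event index with tabu + stochastic top-K."""
    d = np.linalg.norm(E - z, axis=1)          # distance to every event latent
    d[list(used)] *= tabu_penalty              # tabu: penalise recently used events
    idx = np.argsort(d)[:top_k]                # restrict to the K nearest candidates
    p = np.exp(-d[idx] / T_sel)                # Boltzmann probability over candidates
    p /= p.sum()
    return rng.choice(idx, p=p)

rng = np.random.default_rng(2)
E = rng.normal(size=(30, 8))                   # 30 event latents
z = E[7] + 0.01                                # current diffusion state, near event 7
used = {7, 12}                                 # tabu list from the last steps
choice = select_event(z, E, used, T_sel=0.2, rng=rng)
```

Because the tabu penalty only multiplies distances, a penalised event can still be chosen when the pool is sparse, which matches the sparse-pool adaptation described above.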

Morph-Chain Construction

For each cluster:

  1. Start with the seed event (farthest from the cluster center).
  2. Corrupt it with noise: z₀ = z_seed + T_start × σ(Z) × ε
  3. Run the diffusion steps to get a chain of latent vectors [z₀, z₁, ..., z_N].
  4. Map each z_t to the nearest real event using anti-loop selection.
  5. Concatenate the resulting event clips with crossfades.

Output: [Chain0: noisy→refined] | silence | [Chain1: noisy→refined] | …
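Step 5, splicing the selected event clips, might look like this minimal sketch. It assumes a simple linear crossfade; the script's actual fade shape and helper names are not specified here.

```python
import numpy as np

def concat_with_crossfade(clips, sr, xfade_sec=0.008):
    """Concatenate event clips, overlapping each joint by a short linear crossfade."""
    n = int(sr * xfade_sec)                     # crossfade length in samples
    out = clips[0].astype(float)
    for clip in clips[1:]:
        clip = clip.astype(float)
        fade = np.linspace(0.0, 1.0, n)
        # fade the tail of the running output into the head of the next clip
        out[-n:] = out[-n:] * (1 - fade) + clip[:n] * fade
        out = np.concatenate([out, clip[n:]])
    return out

sr = 44100
rng = np.random.default_rng(3)
clips = [rng.normal(0, 0.1, sr // 10) for _ in range(4)]    # four 100 ms clips
chain = concat_with_crossfade(clips, sr)
```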

Preset Strategies

Preset 2: Gentle Crystallisation

❄️ Gradual Emergence

Latent: 8 | Steps: 100 | Clusters: 3

Diffusion: 25 steps | Entropy: 1.2 | T: 1.0→0.06 | Denoise: 0.5

Character: Gentle emergence from noise to identity — subtle, smooth evolution

Use on: Ambient, gradual transformations

Preset 3: Stochastic Melt

🌊 Exploratory Diffusion

Latent: 8 | Steps: 100 | Clusters: 3

Diffusion: 40 steps | Entropy: 0.6 | T: 3.0→0.30 | Denoise: 0.3

Character: High start temperature, low denoising — exploratory, stochastic

Use on: Chaotic, unpredictable textures

Preset 4: Deep Freeze

🧊 Highly Deterministic

Latent: 10 | Steps: 150 | Clusters: 3

Diffusion: 50 steps | Entropy: 0.5 | T: 2.5→0.05 | Denoise: 0.8

Character: Strong denoising, low final T — deterministic convergence

Use on: Precise identity emergence

Preset 5: Plasma Burst

⚡ Fast, Explosive

Latent: 8 | Steps: 80 | Clusters: 4

Diffusion: 15 steps | Entropy: 2.0 | T: 4.0→0.50 | Denoise: 0.9

Character: Short chains, high start T, rapid denoising — explosive emergence

Use on: Percussive, dramatic textures

Preset 6: Slow Diffusion

🐢 Long, Gradual

Latent: 12 | Steps: 150 | Clusters: 3

Diffusion: 80 steps | Entropy: 0.8 | T: 1.5→0.08 | Denoise: 0.5

Character: Long chains, slow annealing — very gradual emergence

Use on: Long-form evolution, meditation

Preset 7: Multi-Identity

🎭 Many Clusters

Latent: 12 | Steps: 150 | Clusters: 6

Diffusion: 35 steps | Entropy: 1.0 | T: 2.0→0.10 | Denoise: 0.6

Character: Up to 6 clusters, moderate diffusion — explores many identities

Use on: Complex material with many acoustic states

Parameters & Controls

Autoencoder Parameters

Parameter        Default   Description
Latent_size      8         Autoencoder latent dimensions (2–32)
Learning_steps   100       Training iterations (10–500)

Cluster Parameters

Parameter            Default   Description
Number_of_clusters   3         K acoustic identity clusters (2–8)

Diffusion Parameters

Parameter            Default   Description
Diffusion_steps      30        Number of diffusion steps per chain (5–100)
Entropy_threshold    1.0       Early stop when cross-entropy < threshold
Temperature_start    2.0       Initial temperature (0.1–10)
Temperature_end      0.1       Final temperature (≥0.01, ≤ T_start)
Denoising_strength   0.6       Strength of gradient step (0–1)

Output

Parameter            Default   Description
Seed                 42        Random seed for reproducibility
Draw_visualization   1         Generate 8-panel analysis display
Play_result          1         Audition after processing

Visualization & Analysis

8-Panel Display

Latent Diffusion Visualization:

  • Panel 1: TITLE — script name, source name, preset, clusters, temperature range
  • Panel 2: INPUT WAVEFORM — gray waveform with red dotted lines = event boundaries; title: "Original (N events)"
  • Panel 3: OUTPUT WAVEFORM — blue waveform = diffused morph-chains; title: "Diffused"; X-axis: Time (s)
  • Panel 4: ORIGINAL SPECTROGRAM — 0-5000 Hz spectrogram of the original; title: "Original spectrogram"
  • Panel 5: OUTPUT SPECTROGRAM — 0-5000 Hz spectrogram of the diffused output; title: "Diffused spectrogram (Morph-Chain)"
  • Panel 6: DIFFUSION PANEL — temperature annealing bar (10 segments from hot orange→cold blue) labelled with T_start and T_end; cluster population bars (colored by cluster, height = % of events); statistics: steps, chains, early stops, denoising strength, latent spread, events, mean duration; title: "Diffusion:"
  • Panel 7: INTENSITY COMPARISON — X-axis: Time, Y-axis: dB; gray line = original intensity, blue line = diffused intensity; title: "Intensity: Grey = original | Blue = diffused"
  • Panel 8: SUMMARY PANEL — preset, clusters, steps, seed; autoencoder loss (initial→final), latent size; duration in/out, RMS comparison; temperature range, denoising strength, early stops; warnings if any

Reading the Diffusion Panel

What the bars show:
  • Temperature bar: 10 segments from orange (hot) to blue (cold) — shows annealing schedule
  • Cluster bars: Each colored bar's height = percentage of events in that cluster
  • Colors: Cluster 0 (blue), 1 (red), 2 (green), 3 (purple), etc. — consistent across visualizations
  • Numbers: Steps taken, number of morph chains, early stops count
  • Latent spread: Mean standard deviation of latent dimensions — measure of space coverage

Interpreting Morph-Chain Output

What you'll hear:
  • Each chain: Begins with noise-like texture, gradually crystallizes into cluster identity
  • Between chains: Brief silence gap (60ms) separates clusters
  • Chain length: Varies depending on early stopping
  • Anti-loop effects: Even in long chains, you'll hear variety rather than repetitive loops
  • The spectrogram will show gradual spectral refinement from noisy to structured

Applications

Electroacoustic Composition

Use case: Creating morphing textures that evolve from noise to identity

Technique: Gentle Crystallisation or Slow Diffusion presets

Workflow:

Sound Design for Media

Use case: Creating evolving textures, risers, transitions

Technique: Plasma Burst or Stochastic Melt on appropriate sources

Applications:

Music Production

Use case: Creating evolving pads, generative textures

Technique: Multi-Identity preset to explore many acoustic states

Examples:

Research & Education

Use case: Studying diffusion processes, latent spaces, clustering

Technique: Compare presets on same source, examine cluster distributions

Learning outcomes:

Practical Workflow Examples

🎬 Film Scene: Identity Emergence

Goal: Create 60-second cue representing a character's identity emerging from chaos

Settings:

  • Source: 30-second voice recording with multiple characters
  • Preset: Gentle Crystallisation
  • Clusters: 4 (one per character identity)

Result: Four morph-chains, each evolving from noise to a different vocal character

🎚️ Electronic Music: Riser

Goal: Create 15-second riser from synth stab

Settings:

  • Source: 8-second synth stab
  • Preset: Plasma Burst
  • Custom: diffusion_steps=10, T_start=5.0, T_end=0.5

Result: Short, explosive chain from chaos to synth identity — perfect riser

🎙️ Voice Processing: Character Exploration

Goal: Explore different vocal identities in a single recording

Settings:

  • Source: 20-second vocal improvisation
  • Preset: Multi-Identity
  • Clusters: 6, diffusion_steps=40

Result: Six morph-chains, each revealing a different vocal personality from the same source

Troubleshooting Common Issues

Problem: Python not found or missing packages
Cause: Python not installed, or packages missing
Solution: Install Python and the required packages: pip install numpy soundfile scipy

Problem: All events in one cluster
Cause: Source too homogeneous, or K too high
Solution: Reduce number_of_clusters, or use a source with more variety

Problem: Chains sound repetitive (looping)
Cause: Anti-loop mechanisms insufficient for the pool size
Solution: Increase tabu_penalty or top_k, or reduce diffusion_steps

Problem: Output has clicks
Cause: Crossfade insufficient at splice points
Solution: Increase XFADE_SEC in the Python script (currently 8 ms)

Problem: Chains too short
Cause: Early stopping triggered too soon
Solution: Decrease entropy_threshold (early stop fires when cross-entropy falls below it), increase diffusion_steps, or lower denoising_strength

Advanced Techniques

Custom anti-loop tuning:

In chain_to_event_sequence(), modify tabu_penalty, top_k, and T_sel_min to adjust repetition avoidance.

Annealing schedule modification:

In run_diffusion_chains(), replace exponential annealing with linear or custom curves.
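For example, a linear schedule can be swapped in for the exponential one. This is a sketch of the two curves only; run_diffusion_chains() itself is not reproduced here, and the 0.05 floor follows the guide.

```python
import numpy as np

def exp_anneal(t, N, T_start, T_end, floor=0.05):
    """Exponential schedule from T_start to T_end (the script's default, per the guide)."""
    return max(floor, T_start * (T_end / T_start) ** (t / (N - 1)))

def lin_anneal(t, N, T_start, T_end, floor=0.05):
    """Drop-in linear alternative: cools more slowly at the start, faster at the end."""
    return max(floor, T_start + (T_end - T_start) * t / (N - 1))

N = 30
exp_curve = [exp_anneal(t, N, 2.0, 0.1) for t in range(N)]
lin_curve = [lin_anneal(t, N, 2.0, 0.1) for t in range(N)]
```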

Cluster statistics:

In cluster_statistics(), use full covariance instead of diagonal for more accurate Gaussian models.
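A sketch of the full-covariance variant, using the Mahalanobis form of the cross-entropy. The regularisation value and variable names are assumptions, not the script's code.

```python
import numpy as np

rng = np.random.default_rng(4)
Zk = rng.normal(size=(40, 8))                        # latent vectors of one cluster

mu = Zk.mean(axis=0)
# full covariance with a small ridge so the inverse stays well-conditioned
cov = np.cov(Zk, rowvar=False) + 1e-6 * np.eye(8)
cov_inv = np.linalg.inv(cov)

def cross_entropy_full(z, mu, cov_inv):
    """0.5 * (z - mu)^T Sigma^{-1} (z - mu): replaces the diagonal CE_k."""
    d = z - mu
    return 0.5 * d @ cov_inv @ d

ce = cross_entropy_full(Zk[0], mu, cov_inv)
```

The trade-off: full covariance captures correlated latent dimensions but needs enough events per cluster to estimate L×L parameters reliably, which is why the script defaults to diagonal variances.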

Multi-channel input:

Script converts to mono for analysis. For stereo, modify to process each channel separately and recombine.