AI Phase Diffusion Engine — v5.0 User Guide
Latent‑space phase diffusion & Paulstretch architecture with on‑the‑fly autoencoder training. Three models: PCA‑weighted, AR‑smear gated by AE coherence, and full latent‑space diffusion (walk toward cluster centroids).
What this does
This script implements an AI‑driven phase diffusion engine based on a Paulstretch‑style time‑domain processor, but with a crucial addition: a NumpyAutoencoder trained on the input signal itself. The autoencoder learns a compressed latent representation of log‑mel spectral patches. Its reconstruction error becomes a per‑frequency coherence weight: low error → structured, tonal regions → receive more diffusion / smearing; high error → noisy, transient regions → are protected.
Three operating modes (the “models”):
- Phase PCA — AE‑weighted Paulstretch. Coherence weights from AE error modulate per‑bin magnitude attenuation.
- Phase AR — AR(1) magnitude smearing gated by both AR coefficient (temporal sustain) and AE coherence.
- Latent — full latent‑space diffusion. Walk each event’s latent vector toward its cluster centroid (temperature‑annealed), decode to magnitude envelope, blend with original.
latent_diffusion.py and latent_barycentric.py are now integrated into the phase‑diffusion engine.
The signal itself defines its own “vocabulary” of acoustic events.
Quick start
- In Praat, select exactly one Sound object.
- Run script… → PhaseDiffusion.praat (from plugin AudioTools).
- Choose a Preset (Custom / Veil / Spectral Fog / … Latent Deep) or adjust parameters manually.
- Select Model: Phase PCA, Phase AR, or Latent.
- Set core parameters: diffusion_amount, steps, window/hop size, mag_smear.
- Adjust autoencoder parameters: latent_size, train_steps, n_clusters, temperature (Latent model).
- Enable Draw_visualization to see waveforms, spectrograms, intensity, and latent info panel.
- Click OK – Python auto‑detection runs, autoencoder trains (150 steps default), processing starts.
Requirements: Python with numpy and soundfile.
Auto‑detection tries python3 / python / py. First run may be slower because of on‑the‑fly training (~150 steps).
The Latent model performs gradient steps per event (diffusion_steps) – larger values increase CPU time.
Preserve_transients option reduces diffusion amount on high‑spectral‑flux frames.
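One way transient protection of this kind could work is sketched below. The function name `transient_scaled_amount`, the flux normalisation and the `floor` parameter are assumptions made for illustration; the script's actual rule may differ.

```python
import numpy as np

def transient_scaled_amount(mags, diffusion_amount, floor=0.25):
    """Per-frame diffusion amount, reduced on high spectral-flux frames.

    mags:  array of shape (n_frames, n_bins) holding STFT magnitudes.
    floor: hypothetical lower bound on the scaling factor (assumption).
    """
    mags = np.asarray(mags, dtype=float)
    # Spectral flux: summed positive magnitude increase between frames.
    diff = np.diff(mags, axis=0, prepend=mags[:1])
    flux = np.maximum(diff, 0.0).sum(axis=1)
    # Normalise flux to 0..1, guarding against an all-zero flux.
    peak = flux.max()
    flux_norm = flux / peak if peak > 0 else np.zeros_like(flux)
    # High flux pulls the scale toward `floor`; low flux leaves it at 1.
    scale = 1.0 - (1.0 - floor) * flux_norm
    return diffusion_amount * scale

frames = np.ones((5, 4))
frames[3] *= 10.0  # a sudden jump stands in for a transient
amounts = transient_scaled_amount(frames, 0.8)
```

In this toy input only the frame with the sudden jump receives a reduced amount; all other frames keep the full diffusion_amount.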
Architecture v5.0 — latent autoencoder integration
The diagram below shows the signal flow. All three models share the first stages.
Stage 2 – Log‑Mel patches (40 mel bands, 16 frames per event, flattened)
Stage 3 – NumpyAutoencoder (input→hidden→latent→hidden→output; leaky ReLU, Adam, denoising, L2)
Stage 4 – Encode → latent Z & reconstruction error (error = coherence measure)
Stage 4b – K‑means++ clustering (in latent space, for latent model only)
Stage 5 – Project error to FFT bin weights (mel filterbank inversion) → ae_weights[f]
Stage 6 – Paulstretch loop (window/hop) with model‑specific diffusion:
- PCA: magnitude attenuation = 1 − diffusion_amount × (1−ae_weight)^mag_smear
- AR: IIR smear state = decay × state + (1−decay) × mag, with decay = ARcoeff × (1−ae_weight) × diffusion_amount × mag_smear (clipped to ≤0.97)
- Latent: diffuse latent Z toward centroid → decode → magnitude envelope (mel→FFT) → blend with frame magnitude
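Stage 5 can be illustrated with a small sketch that spreads a per-mel-band error onto FFT bins. It assumes a plain triangular mel filterbank and treats "inversion" as a normalised transpose of that filterbank; both choices, and the helper names `mel_filterbank` and `error_to_bin_weights`, are assumptions rather than the script's confirmed method.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filterbank, shape (n_mels, n_fft // 2 + 1)."""
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def error_to_bin_weights(mel_error, fb):
    """Weighted average of the mel-band errors that cover each FFT bin."""
    num = fb.T @ mel_error   # (n_bins,)
    den = fb.sum(axis=0)     # total filter weight per bin
    w = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return np.clip(w, 0.0, 1.0)

fb = mel_filterbank(n_mels=40, n_fft=8192, sr=44100)
mel_error = np.linspace(0.0, 1.0, 40)  # toy error: low bands structured
ae_weights = error_to_bin_weights(mel_error, fb)
```

With this toy error curve, low-frequency bins end up with small ae_weights (treated as structured) and high-frequency bins with weights near 1.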
NumpyAutoencoder (pure NumPy, no external ML)
📐 Architecture
Input dim = N_MELS × MEL_FRAMES = 40×16 = 640.
Hidden = max(2×latent, min(256, √(640×latent))).
Latent = user‑defined (2–16).
Training: 150 steps (default), denoising noise annealed 0.3→0.15, learning rate 0.003→0.002, Adam optimizer.
Reconstruction error per event maps to a scale from 0 (structured) to 1 (noisy).
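The sizing rule and the annealed training values can be worked through in a few lines. This is a sketch only: truncating the hidden width with int() and using a linear annealing curve are assumptions, and `hidden_size` and `linear_anneal` are hypothetical helper names.

```python
import math

N_MELS, MEL_FRAMES = 40, 16
INPUT_DIM = N_MELS * MEL_FRAMES  # 640

def hidden_size(latent):
    """Hidden width from the formula above; int() truncation is an assumption."""
    return int(max(2 * latent, min(256, math.sqrt(INPUT_DIM * latent))))

def linear_anneal(start, end, step, total_steps):
    """Interpolate from start to end across training (linear shape assumed)."""
    frac = step / max(1, total_steps - 1)
    return start + (end - start) * frac

sizes = {latent: hidden_size(latent) for latent in (2, 8, 16)}
noise_first = linear_anneal(0.3, 0.15, 0, 150)
noise_last = linear_anneal(0.3, 0.15, 149, 150)
```

Under these assumptions the hidden layer grows from 35 units at latent size 2 to 101 units at latent size 16, so the 256 cap is never reached within the allowed 2–16 range.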
Three models — detailed behaviour
1. Phase PCA (AE‑weighted Paulstretch)
Formula per bin: coherence_w = 1 - ae_weight[f]
attenuation = 1 − diffusion_amount × (coherence_w ** mag_smear)
mag_wet = magnitude × attenuation
Phase randomised uniformly.
Result: Tonal bins (low ae_weight) are attenuated most heavily; noisy bins keep more of their original magnitude.
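The per-bin formula above translates almost directly into NumPy. The sketch below assumes magnitude and ae_weight are arrays over FFT bins; the function name `pca_wet_spectrum` and the explicit random generator argument are illustrative choices.

```python
import numpy as np

def pca_wet_spectrum(magnitude, ae_weight, diffusion_amount, mag_smear, rng):
    """Apply AE-weighted attenuation and uniform random phase to one frame."""
    coherence_w = 1.0 - ae_weight
    attenuation = 1.0 - diffusion_amount * coherence_w ** mag_smear
    mag_wet = magnitude * attenuation
    # Phase is drawn uniformly in [0, 2*pi) for every bin.
    phase = rng.uniform(0.0, 2.0 * np.pi, size=magnitude.shape)
    return mag_wet * np.exp(1j * phase)

rng = np.random.default_rng(42)
magnitude = np.array([1.0, 1.0, 1.0])
ae_weight = np.array([0.0, 0.5, 1.0])  # structured, mixed, noisy
wet = pca_wet_spectrum(magnitude, ae_weight, 0.6, 1.0, rng)
```

With diffusion_amount = 0.6 and mag_smear = 1.0, the three bins keep 40 %, 70 % and 100 % of their magnitude respectively.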
2. Phase AR (IIR smear gated by AE & AR)
First, AR(1) coefficients are fit per bin from the sequence of magnitude frames.
Decay per bin = ARcoeff[f] × (1 − ae_weight[f]) × diffusion_amount × mag_smear (clipped ≤0.97).
Smear state updated: state = decay * state + (1‑decay) * magnitude
Phase always randomised.
Result: Only bins that are both temporally sustained (high AR) and spectrally structured (low ae_weight) receive heavy smearing.
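A minimal sketch of the AR gating follows. It assumes the AR(1) coefficient is a mean-removed least-squares estimate clipped to 0..1, and that the smear state starts from the first frame; neither detail is stated in this guide, and `fit_ar1` and `ar_smear` are hypothetical names.

```python
import numpy as np

def fit_ar1(mags):
    """Least-squares AR(1) coefficient per bin; mags is (n_frames, n_bins)."""
    x = mags - mags.mean(axis=0)
    num = (x[1:] * x[:-1]).sum(axis=0)
    den = (x[:-1] ** 2).sum(axis=0)
    coeff = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return np.clip(coeff, 0.0, 1.0)

def ar_smear(mags, ae_weight, diffusion_amount, mag_smear):
    """Run the one-pole smear over all frames; returns (smeared, decay)."""
    ar_coeff = fit_ar1(mags)
    decay = ar_coeff * (1.0 - ae_weight) * diffusion_amount * mag_smear
    decay = np.minimum(decay, 0.97)
    state = mags[0].copy()
    out = np.empty_like(mags)
    for t, mag in enumerate(mags):
        state = decay * state + (1.0 - decay) * mag
        out[t] = state
    return out, decay

t = np.arange(64)
sustained = 1.0 + 0.5 * np.sin(2.0 * np.pi * t / 64.0)  # slowly varying bin
flicker = 1.0 + 0.5 * (-1.0) ** t                       # frame-to-frame noise
mags = np.stack([sustained, flicker], axis=1)
out, decay = ar_smear(mags, np.array([0.1, 0.1]), 1.0, 1.0)
```

Both bins share the same ae_weight here, yet only the sustained bin gets a large decay; the flickering bin has no positive lag-1 correlation and passes through unsmeared.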
3. Latent (full latent‑space diffusion)
Each event has latent vector Z. K‑means++ clusters Z into N groups. Then for each event:
- Walk Z toward its cluster centroid using temperature‑annealed gradient steps (same engine as latent_diffusion.py).
- Diffused Z is decoded → log‑mel patch → exponentiated → projected to FFT bins via mel filterbank → magnitude envelope.
- Frame magnitude = (1‑diffusion_amount) * orig_mag + diffusion_amount * latent_envelope (scaled).
- Phase randomised.
Result: Each event becomes a blend of acoustically similar events from the recording, navigated by temperature (higher = more chaotic walk).
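The final blend step can be sketched as follows. The guide does not say how the latent envelope is scaled, so this example assumes it is matched to the RMS of the original frame magnitude; `blend_with_envelope` is a hypothetical name.

```python
import numpy as np

def blend_with_envelope(orig_mag, latent_envelope, diffusion_amount):
    """Crossfade a frame's magnitude with the decoded latent envelope."""
    # Assumption: "scaled" means matching the envelope's RMS to the frame's.
    env_rms = np.sqrt(np.mean(latent_envelope ** 2))
    frame_rms = np.sqrt(np.mean(orig_mag ** 2))
    if env_rms > 0:
        scaled = latent_envelope * (frame_rms / env_rms)
    else:
        scaled = orig_mag  # degenerate envelope: fall back to the original
    return (1.0 - diffusion_amount) * orig_mag + diffusion_amount * scaled

orig_mag = np.array([3.0, 4.0, 0.0, 0.0])
envelope = np.ones(4)
dry = blend_with_envelope(orig_mag, envelope, 0.0)
wet = blend_with_envelope(orig_mag, envelope, 1.0)
```

At diffusion_amount = 0 the frame is unchanged; at 1 the frame takes the envelope's shape while keeping the original frame's overall level.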
Presets (12 built‑in)
| Preset | Model | Amount | Steps | Window | Latent | Description |
|---|---|---|---|---|---|---|
| Veil | PCA | 0.30 | 20 | 8192 | 6 | Barely‑there haze |
| Spectral Fog | PCA | 0.60 | 30 | 8192 | 8 | Mid diffusion, AE‑weighted |
| Ambient Wash | PCA | 0.80 | 30 | 16384 | 8 | Deep smooth wash, transients protected |
| Drone Cloud | AR | 0.95 | 30 | 32768 | 8 | Max wash, heavy smear |
| Stutter Field | PCA | 0.70 | 20 | 2048 | 6 | Grainy micro‑texture |
| Formant Ghost | AR | 0.55 | 30 | 8192 | 8 | Heavy magnitude smear (2.0), envelope blur |
| Phase Plasma | PCA | 1.00 | 30 | 16384 | 8 | Full PCA, no transient protection |
| Void | AR | 1.00 | 50 | 32768 | 12 | Maximum everything |
| Latent Drift | Latent | 0.50 | 20 | 8192 | 8 | Gentle latent walk, low temp (0.5) |
| Latent Morph | Latent | 0.75 | 40 | 8192 | 8 | Stronger blend, temp=1.5 |
| Latent Deep | Latent | 1.00 | 60 | 16384 | 12 | Maximum latent diffusion, temp=3.0 |
Parameters & defaults
Common parameters
| Parameter | Default | Description |
|---|---|---|
| Preset | Custom | Load predefined combination |
| Diffusion_amount | 0.70 | Dry/wet crossfade (0–1) |
| Diffusion_steps | 30 | Latent: gradient steps; PCA/AR: reserved |
| Window_size | 8192 | FFT window (samples) |
| Hop_size | 2048 | STFT hop (samples) |
| Mag_smear | 1.0 | PCA: exponent on coherence; AR: multiplier on decay |
| Model | Phase PCA | pca / ar / latent |
Autoencoder / latent parameters
| Parameter | Default | Description |
|---|---|---|
| Latent_size | 8 | Bottleneck dimension (2–16) |
| Train_steps | 150 | AE training iterations |
| N_clusters | 4 | k‑means++ clusters in latent space (2–8) |
| Temperature | 1.0 | Latent diffusion temperature (≥0.05) |
Options
| Flag | Default | Effect |
|---|---|---|
| Preserve_transients | yes | Reduce diffusion_amount on high‑flux frames |
| Draw_visualization | yes | Show waveform, spectrograms, intensity, latent info |
| Play_result | yes | Auto‑play after processing |
| Debug | no | Print detailed Python info |
Applications & techniques
✨ Ambient / drone texture
Use Drone Cloud (AR) or Ambient Wash (PCA) with large window (16384–32768). Enable preserve_transients to keep attacks if needed. The AE weights protect noisy fragments, so texture remains clean.
🌀 Spectral morphing (latent models)
Latent Morph or Latent Deep. Each event’s spectrum is replaced by a blend of similar events from the same file. Increase temperature for more adventurous walks between clusters.
🎚️ Gating / transient design
Set preserve_transients = yes and use Phase PCA with low amount (0.3–0.5). Transients remain sharp, sustained parts diffuse.
🔬 Reproducible sound design
Because the autoencoder is trained from scratch on the input, the same preset yields different results on different signals, but identical results on the same signal (fixed seed = 42). This repeatability suits research work where a result must be reproduced exactly.
📊 Understanding the latent panel
• Original / diffused waveforms & spectrograms
• Intensity overlay (grey=original, colour=diffused)
• Autoencoder panel: architecture description, training steps, latent size, clusters
• For latent model: temperature, steps, “each event Z walked toward centroid”
• Parameter summary: RMS change, window, hop, transient protection
Troubleshooting
- No sound / nearly silent: diffusion_amount may be too high with preserve_transients = no. Try a lower amount or enable preservation.
- Python not found: install Python with numpy and soundfile, or set the python command manually in the script.
- Latent model slow: reduce diffusion_steps (20–30) or window_size.
- Clipping: output is RMS‑matched and capped at 3× gain; if still clipping, reduce input level.
- Phase inversion (silence): random phase can produce cancellations; blend with dry (diffusion_amount < 1) to avoid.
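The RMS matching with a 3× gain cap mentioned under Clipping could look like the sketch below. The function name `rms_match` and the handling of a silent wet signal are assumptions.

```python
import numpy as np

def rms_match(wet, dry, max_gain=3.0):
    """Scale `wet` toward the RMS of `dry`, never boosting by more than max_gain."""
    wet_rms = np.sqrt(np.mean(wet ** 2))
    dry_rms = np.sqrt(np.mean(dry ** 2))
    if wet_rms == 0.0:
        return wet.copy(), 1.0  # nothing to scale (assumed behaviour)
    gain = min(dry_rms / wet_rms, max_gain)
    return wet * gain, gain

dry = 0.5 * np.ones(100)
matched, gain_a = rms_match(0.25 * np.ones(100), dry)  # needs 2x: allowed
capped, gain_b = rms_match(0.01 * np.ones(100), dry)   # needs 50x: capped
```

The cap explains why a very quiet wet signal stays quiet after processing: the gain stops at 3× instead of restoring the full input level.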
Mathematical deep dive – latent diffusion step
Weight for cluster k: w_k ∝ exp( –‖Z – μ_k‖² / (2 σ_k² T) )
Gradient: ∇ = Σ_k w_k (Z – μ_k) / σ_k²
Step: Z ← Z – step_size · ∇ + noise · √T
step_size = 0.6 · (1/T) / (1 + 1/T) · (1 – 0.4 · frac)
After N steps (diffusion_steps), blend with the original Z: Z_out = (1 – amount) · Z_orig + amount · Z_walked
The decoded magnitude envelope is then used as the spectral shape for each frame of that event – a form of latent‑guided cross‑synthesis with itself.
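The update rule above can be sketched in NumPy as follows. Several details are assumptions: frac is taken to be the fraction of steps completed, the weights are normalised to sum to 1, and the noise amplitude is exposed as a hypothetical `noise_scale` parameter because the guide does not give its value.

```python
import numpy as np

def latent_walk(z_orig, centroids, sigmas, temperature, n_steps, amount,
                noise_scale=0.05, rng=None):
    """Walk one latent vector toward nearby centroids, then blend with the original."""
    if rng is None:
        rng = np.random.default_rng(42)
    z = np.asarray(z_orig, dtype=float).copy()
    T = max(temperature, 0.05)
    for i in range(n_steps):
        frac = i / max(1, n_steps - 1)       # assumed: progress 0..1
        diff = z - centroids                 # (n_clusters, latent_dim)
        d2 = (diff ** 2).sum(axis=1)
        logits = -d2 / (2.0 * sigmas ** 2 * T)
        w = np.exp(logits - logits.max())    # stable softmax weights
        w /= w.sum()
        grad = (w[:, None] * diff / sigmas[:, None] ** 2).sum(axis=0)
        step_size = 0.6 * (1.0 / T) / (1.0 + 1.0 / T) * (1.0 - 0.4 * frac)
        noise = noise_scale * rng.standard_normal(z.shape)
        z = z - step_size * grad + noise * np.sqrt(T)
    return (1.0 - amount) * z_orig + amount * z

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
sigmas = np.array([1.0, 1.0])
z0 = np.array([1.0, 1.0])
z_walked = latent_walk(z0, centroids, sigmas, 1.0, 20, 1.0, noise_scale=0.0)
z_kept = latent_walk(z0, centroids, sigmas, 1.0, 20, 0.0, noise_scale=0.0)
```

With noise switched off, the point near the first centroid contracts toward it over the 20 steps, while amount = 0 returns the original vector untouched. Raising the temperature flattens the weights, so distant centroids pull harder and the walk becomes less local.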