AI Phase Diffusion Engine — v5.0 User Guide

Latent‑space phase diffusion & Paulstretch architecture with on‑the‑fly autoencoder training. Three models: PCA‑weighted, AR‑smear gated by AE coherence, and full latent‑space diffusion (walk toward cluster centroids).

Author: Shai Cohen
Affiliation: Department of Music, Bar‑Ilan University, Israel
Version: 5.0 (2025)
License: MIT License
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

What this does

This script implements an AI‑driven phase diffusion engine based on a Paulstretch‑style STFT processor, with one crucial addition: a NumpyAutoencoder trained on the input signal itself. The autoencoder learns a compressed latent representation of log‑mel spectral patches, and its reconstruction error becomes a per‑frequency coherence weight. Low error marks structured, tonal regions, which receive more diffusion / smearing; high error marks noisy, transient regions, which are protected.

What is phase diffusion? In Paulstretch‑like processing, the magnitude of STFT frames is kept (or smoothed) while the phase is randomised. This creates ethereal, time‑stretched‑sounding textures without changing duration. This engine adds signal‑aware weighting using a neural network trained on the fly.
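
As a rough illustration of the phase randomisation described above (a minimal NumPy sketch, not the plugin's actual code; the function name randomise_phase and the 8192‑sample test frame are assumptions), each bin keeps its magnitude and receives a new uniformly random phase:

```python
import numpy as np

def randomise_phase(spectrum, rng):
    """Keep |spectrum| and draw a new phase uniformly from [0, 2*pi)."""
    magnitude = np.abs(spectrum)
    phase = rng.uniform(0.0, 2.0 * np.pi, size=magnitude.shape)
    return magnitude * np.exp(1j * phase)

rng = np.random.default_rng(42)
frame = np.hanning(8192) * rng.standard_normal(8192)  # one windowed test frame
spectrum = np.fft.rfft(frame)
diffused = randomise_phase(spectrum, rng)
frame_out = np.fft.irfft(diffused, n=frame.size)      # back to the time domain
```

The magnitude spectrum is unchanged, so the overall spectral colour survives while the temporal fine structure of the frame is scrambled.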

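Three operating modes (the “models”):

  • Phase PCA: AE‑weighted Paulstretch.
  • Phase AR: IIR smear gated by AE coherence and per‑bin AR coefficients.
  • Latent: full latent‑space diffusion (each event's latent vector walks toward its cluster centroid).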

Key innovation v5.0: the same log‑mel + NumpyAutoencoder + k‑means++ pipeline used in latent_diffusion.py and latent_barycentric.py is now integrated into the phase‑diffusion engine. The signal itself defines its own “vocabulary” of acoustic events.

Quick start

  1. In Praat, select exactly one Sound object.
  2. Run script…PhaseDiffusion.praat (from plugin AudioTools).
  3. Choose a Preset (Custom / Veil / Spectral Fog / … Latent Deep) or adjust parameters manually.
  4. Select Model: Phase PCA, Phase AR, or Latent.
  5. Set core parameters: diffusion_amount, steps, window/hop size, mag_smear.
  6. Adjust autoencoder parameters: latent_size, train_steps, n_clusters, temperature (Latent model).
  7. Enable Draw_visualization to see waveforms, spectrograms, intensity, and latent info panel.
  8. Click OK – Python auto‑detection runs, autoencoder trains (150 steps default), processing starts.

Quick tip: Start with “Spectral Fog” (Phase PCA) to hear gentle AE‑weighted diffusion. For extreme texture, try “Void” (Phase AR, max smear). The new latent presets (Latent Drift, Morph, Deep) navigate the learned cluster space. Enable Draw_visualization to see the latent panel with cluster sizes and AE details.

Important: This script calls an external Python process and requires numpy and soundfile. Auto‑detection tries python3 / python / py. The first run may be slower because of on‑the‑fly training (~150 steps). The Latent model performs gradient steps per event (diffusion_steps), so larger values increase CPU time. The Preserve_transients option reduces the diffusion amount on high‑spectral‑flux frames.
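
If auto‑detection fails, it can help to check the environment by hand. The sketch below is a hypothetical helper, not part of the plugin: it looks for the same three interpreter names on the PATH and reports which of the required modules cannot be imported by the interpreter that runs it.

```python
import importlib.util
import shutil

def find_python(candidates=("python3", "python", "py")):
    """Return the path of the first interpreter name found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None

def missing_modules(names=("numpy", "soundfile")):
    """Return the module names that the current interpreter cannot locate."""
    return [n for n in names if importlib.util.find_spec(n) is None]

print("interpreter:", find_python())
print("missing modules:", missing_modules())
```

An empty list of missing modules means that interpreter should be able to run the processing script; otherwise install the listed packages for that interpreter.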

Architecture v5.0 — latent autoencoder integration

The diagram below shows the signal flow. All three models share the first stages.

Stage 1 – Segmentation (spectral flux onset detection → events 0.1–2.0 s)
Stage 2 – Log‑Mel patches (40 mel bands, 16 frames per event, flattened)
Stage 3 – NumpyAutoencoder (input→hidden→latent→hidden→output; leaky ReLU, Adam, denoising, L2)
Stage 4 – Encode → latent Z & reconstruction error (error = coherence measure)
Stage 4b – K‑means++ clustering (in latent space, for latent model only)
Stage 5 – Project error to FFT bin weights (mel filterbank inversion) → ae_weights[f]
Stage 6 – Paulstretch loop (window/hop) with model‑specific diffusion:
  • PCA: magnitude attenuation = 1 − diffusion_amount × (1 − ae_weight) ** mag_smear
  • AR: IIR smear state = decay * state + (1‑decay)*mag, decay = ARcoeff × (1−ae_weight) × amount × smear
  • Latent: diffuse latent Z toward centroid → decode → magnitude envelope (mel→FFT) → blend with frame magnitude
Stage 7 – Overlap‑add, dry/wet crossfade, RMS match, output
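
Stage 5 can be pictured with the following sketch, which spreads one value per mel band onto the FFT bins through a triangular mel filterbank. This is an illustration under assumptions, not the plugin's code: the helper names, the HTK‑style mel formula, the 44100 Hz sample rate, and the normalisation by the summed filter weights are all choices made here.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters, shape (n_mels, n_fft // 2 + 1)."""
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def project_to_bins(band_values, fb):
    """Weighted average of the band values that overlap each FFT bin."""
    total = fb.sum(axis=0)
    return (fb.T @ band_values) / np.maximum(total, 1e-12)

fb = mel_filterbank(n_mels=40, n_fft=8192, sr=44100)
band_error = np.linspace(0.0, 1.0, 40)        # made-up per-band error values
ae_weights = project_to_bins(band_error, fb)  # one weight per FFT bin
```

Each FFT bin receives a blend of the values of the one or two mel bands that cover it, so ae_weights varies smoothly across frequency.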

NumpyAutoencoder (pure NumPy, no external ML)

📐 Architecture

Input dim = N_MELS × MEL_FRAMES = 40×16 = 640.
Hidden = max(2×latent, min(256, √(640×latent))).
Latent = user‑defined (2–16).
Training: 150 steps (default), denoising noise annealed 0.3→0.15, learning rate 0.003→0.002, Adam optimizer.
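
The sizing rule and the annealed training settings can be checked with a few lines of Python. This is a sketch only: truncating the hidden size to an integer and interpolating the schedules linearly are assumptions, since the guide does not state either.

```python
import math

N_MELS, MEL_FRAMES = 40, 16
INPUT_DIM = N_MELS * MEL_FRAMES  # 640

def hidden_size(latent):
    """Hidden width from the formula above (integer truncation is assumed)."""
    return int(max(2 * latent, min(256, math.sqrt(INPUT_DIM * latent))))

def anneal(start, end, step, total):
    """Linear interpolation from start to end (schedule shape is assumed)."""
    frac = step / max(total - 1, 1)
    return start + (end - start) * frac

for latent in (2, 8, 16):
    print(latent, hidden_size(latent))

noise_mid = anneal(0.3, 0.15, 75, 150)   # denoising noise halfway through
lr_mid = anneal(0.003, 0.002, 75, 150)   # learning rate halfway through
```

Under these assumptions the default latent size of 8 gives a hidden layer of 71 units, and the extremes 2 and 16 give 35 and 101.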

Reconstruction error per event is mapped to a scale from 0 (structured) to 1 (noisy).
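
One plausible way to obtain such a 0 to 1 error score is a per‑event mean squared error followed by min‑max scaling across events. The sketch below assumes exactly that; the guide does not say which error measure or normalisation the script uses, and the noisy "reconstructions" here are synthetic stand‑ins for real autoencoder output.

```python
import numpy as np

def normalised_error(patches, reconstructions):
    """Per-event mean squared error, rescaled to the range 0..1."""
    err = np.mean((patches - reconstructions) ** 2, axis=1)
    span = err.max() - err.min()
    if span == 0.0:
        return np.zeros_like(err)
    return (err - err.min()) / span

rng = np.random.default_rng(42)
patches = rng.standard_normal((5, 640))                  # 5 events, 40 x 16 values each
noise_levels = np.array([0.0, 0.1, 0.2, 0.4, 0.8])[:, None]
reconstructions = patches + noise_levels * rng.standard_normal((5, 640))
event_error = normalised_error(patches, reconstructions)
```

The event that is reconstructed perfectly scores 0 and the worst‑reconstructed event scores 1.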

Three models — detailed behaviour

1. Phase PCA (AE‑weighted Paulstretch)

Formula per bin: coherence_w = 1 - ae_weight[f]
attenuation = 1 − diffusion_amount × (coherence_w ** mag_smear)
mag_wet = magnitude × attenuation
Phase randomised uniformly.

Result: Tonal bins (low ae_weight) are attenuated most before their phase is randomised; noisy bins keep more of their original magnitude.
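
The per‑bin formula above translates almost directly into NumPy. The function name pca_diffuse_frame and the toy four‑bin spectrum are illustrative assumptions, not the script's own identifiers.

```python
import numpy as np

def pca_diffuse_frame(spectrum, ae_weight, diffusion_amount, mag_smear, rng):
    """Attenuate magnitudes by AE coherence, then randomise the phase."""
    coherence_w = 1.0 - ae_weight
    attenuation = 1.0 - diffusion_amount * coherence_w ** mag_smear
    mag_wet = np.abs(spectrum) * attenuation
    phase = rng.uniform(0.0, 2.0 * np.pi, size=mag_wet.shape)
    return mag_wet * np.exp(1j * phase)

rng = np.random.default_rng(42)
spectrum = np.ones(4, dtype=complex)          # toy frame with unit magnitudes
ae_weight = np.array([0.0, 0.5, 1.0, 1.0])    # 0 = structured, 1 = noisy
wet = pca_diffuse_frame(spectrum, ae_weight, diffusion_amount=0.6,
                        mag_smear=1.0, rng=rng)
```

With diffusion_amount = 0.6 and mag_smear = 1.0, a bin with ae_weight 0 keeps 40% of its magnitude, a bin with ae_weight 0.5 keeps 70%, and a bin with ae_weight 1 is left at full magnitude.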

2. Phase AR (IIR smear gated by AE & AR)

First, AR(1) coefficients are fit per bin from the sequence of magnitude frames.
Decay per bin = ARcoeff[f] × (1 − ae_weight[f]) × diffusion_amount × mag_smear (clipped ≤0.97).
Smear state updated: state = decay * state + (1‑decay) * magnitude
Phase always randomised.

Result: Only bins that are both temporally sustained (high AR) and spectrally structured (low ae_weight) receive heavy smearing.
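
A compact sketch of this model follows. Several details are assumptions because the guide does not specify them: the AR(1) coefficient is fitted by least squares without mean removal and clipped to 0..1, and the smear state starts at the first frame's magnitude. The function names are hypothetical.

```python
import numpy as np

def ar1_coefficients(mags):
    """Least-squares AR(1) fit per bin: mags[t] ~ a * mags[t-1]."""
    prev, curr = mags[:-1], mags[1:]
    num = np.sum(prev * curr, axis=0)
    den = np.sum(prev * prev, axis=0)
    return np.clip(num / np.maximum(den, 1e-12), 0.0, 1.0)

def ar_smear(mags, ae_weight, diffusion_amount, mag_smear):
    """Run the one-pole smear over a (frames, bins) magnitude array."""
    decay = ar1_coefficients(mags) * (1.0 - ae_weight) * diffusion_amount * mag_smear
    decay = np.clip(decay, 0.0, 0.97)
    state = mags[0].copy()                    # assumed initial state
    out = np.empty_like(mags)
    for t, mag in enumerate(mags):
        state = decay * state + (1.0 - decay) * mag
        out[t] = state
    return out, decay

mags = np.ones((10, 3))
mags[5:, :] = 4.0                             # a step in level at frame 5
ae_weight = np.array([0.0, 0.5, 1.0])
smeared, decay = ar_smear(mags, ae_weight, diffusion_amount=1.0, mag_smear=1.0)
```

In this toy input the bin with ae_weight 1 has decay 0 and follows the step immediately, while the structured bins lag behind it.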

3. Latent (full latent‑space diffusion)

Each event has latent vector Z. K‑means++ clusters Z into N groups. Then for each event:

  • Walk Z toward its cluster centroid using temperature‑annealed gradient steps (same engine as latent_diffusion.py).
  • Diffused Z is decoded → log‑mel patch → exponentiated → projected to FFT bins via mel filterbank → magnitude envelope.
  • Frame magnitude = (1‑diffusion_amount) * orig_mag + diffusion_amount * latent_envelope (scaled).
  • Phase randomised.

Result: Each event becomes a blend of acoustically similar events from the recording, navigated by temperature (higher = more chaotic walk).
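
The clustering step can be sketched as a textbook k‑means with k‑means++ seeding. This is a generic illustration with hypothetical function names; the iteration count and the handling of empty clusters are assumptions, and the script's own implementation may differ.

```python
import numpy as np

def kmeans_pp_init(Z, k, rng):
    """k-means++ seeding: later centres are drawn with probability
    proportional to the squared distance to the nearest chosen centre."""
    centres = [Z[rng.integers(len(Z))]]
    for _ in range(1, k):
        d2 = np.min([np.sum((Z - c) ** 2, axis=1) for c in centres], axis=0)
        p = d2 / d2.sum() if d2.sum() > 0 else None
        centres.append(Z[rng.choice(len(Z), p=p)])
    return np.array(centres, dtype=float)

def assign(Z, centres):
    """Index of the nearest centre for every row of Z."""
    d2 = ((Z[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(Z, k, rng, iters=20):
    centres = kmeans_pp_init(Z, k, rng)
    for _ in range(iters):
        labels = assign(Z, centres)
        for j in range(k):
            if np.any(labels == j):           # leave empty clusters untouched
                centres[j] = Z[labels == j].mean(axis=0)
    return centres, assign(Z, centres)

rng = np.random.default_rng(42)
Z = np.vstack([rng.normal(0.0, 0.1, (20, 8)),     # 20 events near the origin
               rng.normal(10.0, 0.1, (20, 8))])   # 20 events far away
centres, labels = kmeans(Z, k=2, rng=rng)
```

Each event's label then selects the centroid that its latent vector walks toward.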

Presets (12 built‑in)

Preset | Model | Amount | Steps | Window | Latent | Description
Veil | PCA | 0.30 | 20 | 8192 | 6 | Barely‑there haze
Spectral Fog | PCA | 0.60 | 30 | 8192 | 8 | Mid diffusion, AE‑weighted
Ambient Wash | PCA | 0.80 | 30 | 16384 | 8 | Deep smooth wash, transients protected
Drone Cloud | AR | 0.95 | 30 | 32768 | 8 | Max wash, heavy smear
Stutter Field | PCA | 0.70 | 20 | 2048 | 6 | Grainy micro‑texture
Formant Ghost | AR | 0.55 | 30 | 8192 | 8 | Heavy magnitude smear (2.0), envelope blur
Phase Plasma | PCA | 1.00 | 30 | 16384 | 8 | Full PCA, no transient protection
Void | AR | 1.00 | 50 | 32768 | 12 | Maximum everything
Latent Drift | Latent | 0.50 | 20 | 8192 | 8 | Gentle latent walk, low temp (0.5)
Latent Morph | Latent | 0.75 | 40 | 8192 | 8 | Stronger blend, temp=1.5
Latent Deep | Latent | 1.00 | 60 | 16384 | 12 | Maximum latent diffusion, temp=3.0

Parameters & defaults

Common parameters

Parameter | Default | Description
Preset | Custom | Load predefined combination
Diffusion_amount | 0.70 | Dry/wet crossfade (0–1)
Diffusion_steps | 30 | Latent: gradient steps; PCA/AR: reserved
Window_size | 8192 | FFT window (samples)
Hop_size | 2048 | STFT hop (samples)
Mag_smear | 1.0 | PCA: exponent on coherence; AR: multiplier on decay
Model | Phase PCA | pca / ar / latent

Autoencoder / latent parameters

Parameter | Default | Description
Latent_size | 8 | Bottleneck dimension (2–16)
Train_steps | 150 | AE training iterations
N_clusters | 4 | k‑means++ clusters in latent space (2–8)
Temperature | 1.0 | Latent diffusion temperature (≥0.05)

Options

Flag | Default | Effect
Preserve_transients | yes | Reduce diffusion_amount on high‑flux frames
Draw_visualization | yes | Show waveform, spectrograms, intensity, latent info
Play_result | yes | Auto‑play after processing
Debug | no | Print detailed Python info

Applications & techniques

✨ Ambient / drone texture

Use Drone Cloud (AR) or Ambient Wash (PCA) with large window (16384–32768). Enable preserve_transients to keep attacks if needed. The AE weights protect noisy fragments, so texture remains clean.

🌀 Spectral morphing (latent models)

Latent Morph or Latent Deep. Each event’s spectrum is replaced by a blend of similar events from the same file. Increase temperature for more adventurous walks between clusters.

🎚️ Gating / transient design

Set preserve_transients = yes and use Phase PCA with low amount (0.3–0.5). Transients remain sharp, sustained parts diffuse.
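
To make the idea of transient protection concrete, the sketch below computes spectral flux per frame and scales the diffusion amount down where the flux is high. The linear mapping, the strength parameter, and the function names are assumptions made for illustration; the guide only states that the amount is reduced on high‑flux frames.

```python
import numpy as np

def spectral_flux(mags):
    """Sum of positive magnitude increases between consecutive frames."""
    diff = np.diff(mags, axis=0, prepend=mags[:1])
    return np.sum(np.maximum(diff, 0.0), axis=1)

def protected_amount(mags, diffusion_amount, strength=1.0):
    """Per-frame diffusion amount, lowered in proportion to normalised flux."""
    flux = spectral_flux(mags)
    peak = flux.max()
    flux_norm = flux / peak if peak > 0 else np.zeros_like(flux)
    return diffusion_amount * (1.0 - strength * flux_norm)

mags = np.ones((6, 4))
mags[3:, :] = 5.0                      # an onset at frame 3
amounts = protected_amount(mags, diffusion_amount=0.5)
```

In this toy input only the onset frame has its amount reduced; the sustained frames before and after it keep the full 0.5.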

🔬 Reproducible sound design

Because the autoencoder is trained from scratch on the input, the same preset yields different results on different signals, but identical results on the same signal (fixed seed = 42). This repeatability is useful whenever audio results must be reproducible, for example in research.

📊 Understanding the latent panel

Visualisation includes:
• Original / diffused waveforms & spectrograms
• Intensity overlay (grey=original, colour=diffused)
• Autoencoder panel: architecture description, training steps, latent size, clusters
• For latent model: temperature, steps, “each event Z walked toward centroid”
• Parameter summary: RMS change, window, hop, transient protection

Troubleshooting:
  • No sound / nearly silent: diffusion_amount too high + preserve_transients = no? Try lower amount or enable preservation.
  • Python not found: make sure Python 3 is installed together with numpy & soundfile, or set the python command manually in the script.
  • Latent model slow: reduce diffusion_steps (20–30) or window_size.
  • Clipping: output is RMS‑matched and capped at 3× gain; if still clipping, reduce input level.
  • Phase inversion (silence): random phase can produce cancellations; blend with dry (diffusion_amount < 1) to avoid.
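
The RMS matching with a gain cap mentioned in the clipping item can be sketched as follows. This is an assumption‑laden illustration: the function names are hypothetical, and matching the wet signal to the dry signal's RMS (rather than the other way round) is assumed.

```python
import numpy as np

def rms(x):
    return float(np.sqrt(np.mean(np.square(x))))

def rms_match(wet, dry, max_gain=3.0):
    """Scale wet toward the RMS of dry, never boosting by more than max_gain."""
    wet_rms = rms(wet)
    if wet_rms == 0.0:
        return wet
    gain = min(rms(dry) / wet_rms, max_gain)
    return wet * gain

t = np.linspace(0.0, 1.0, 44100, endpoint=False)
dry = 0.5 * np.sin(2.0 * np.pi * 220.0 * t)
quiet_wet = 0.1 * dry                   # would need 10x gain to match
matched = rms_match(quiet_wet, dry)     # gain is capped at 3x instead
```

Matching RMS does not bound the peak level, which is why a loud input can still clip after processing.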

Mathematical deep dive – latent diffusion step

For latent vector Z, cluster centres μ_k, cluster variances σ_k², and temperature T:
Weight for cluster k: w_k ∝ exp( –‖Z − μ_k‖² / (2 σ_k² T) )
Gradient: ∇ = Σ_k w_k (Z − μ_k) / σ_k²
Step: Z ← Z – step_size · ∇ + noise · √T
step_size = 0.6 · (1/T) / (1 + 1/T) · (1 − 0.4 · frac)
After N steps (diffusion_steps), blend with the original Z: Z_out = (1 − amount) · Z_orig + amount · Z_walked

The decoded magnitude envelope is then used as the spectral shape for each frame of that event – a form of latent‑guided cross‑synthesis with itself.
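
The update rule above can be written out as a short NumPy sketch. Several points are assumptions, because the guide leaves them open: the weights are normalised to sum to 1, frac is taken to be the fraction of the walk completed, each cluster has one scalar variance, and noise is a user‑chosen scale for Gaussian noise. The function name latent_walk is hypothetical.

```python
import numpy as np

def latent_walk(z, centres, variances, temperature, n_steps, amount, noise, rng):
    """Walk z toward nearby cluster centres, then blend with the original z."""
    z_orig = np.asarray(z, dtype=float)
    z_walk = z_orig.copy()
    inv_t = 1.0 / temperature
    for i in range(n_steps):
        frac = i / max(n_steps - 1, 1)               # assumed: walk progress 0..1
        diff = z_walk - centres                      # shape (k, d)
        d2 = np.sum(diff ** 2, axis=1)
        logits = -d2 / (2.0 * variances * temperature)
        w = np.exp(logits - logits.max())
        w /= w.sum()                                 # assumed: weights sum to 1
        grad = np.sum(w[:, None] * diff / variances[:, None], axis=0)
        step_size = 0.6 * inv_t / (1.0 + inv_t) * (1.0 - 0.4 * frac)
        jitter = noise * np.sqrt(temperature) * rng.standard_normal(z_walk.shape)
        z_walk = z_walk - step_size * grad + jitter
    return (1.0 - amount) * z_orig + amount * z_walk

rng = np.random.default_rng(42)
centres = np.array([[0.0, 0.0], [10.0, 10.0]])
variances = np.array([1.0, 1.0])
z_out = latent_walk([1.0, 1.0], centres, variances, temperature=1.0,
                    n_steps=30, amount=1.0, noise=0.0, rng=rng)
```

With noise set to 0 and unit variances, the walk contracts toward the nearest centroid; raising temperature flattens the weights across clusters and enlarges the noise term, which matches the guide's description of a more chaotic walk.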