AI Phase Diffusion Engine — v5.0 User Guide

Latent‑space phase diffusion & Paulstretch architecture with on‑the‑fly autoencoder training. Three models: PCA‑weighted, AR‑smear gated by AE coherence, and full latent‑space diffusion (walk toward cluster centroids).

Author: Shai Cohen
Affiliation: Department of Music, Bar‑Ilan University, Israel
Version: 5.0 (2025)
License: MIT License
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

Contents:

What it does · Quick start · Architecture v5.0 · Three models · Presets · Parameters · Applications

What this does

This script implements an AI‑driven phase diffusion engine based on a Paulstretch‑style STFT processor, with one crucial addition: a NumpyAutoencoder trained on the input signal itself. The autoencoder learns a compressed latent representation of log‑mel spectral patches. Its reconstruction error becomes a per‑frequency coherence weight: low error marks structured, tonal regions, which receive more diffusion and smearing; high error marks noisy, transient regions, which are protected.

What is phase diffusion? In Paulstretch‑like processing, the magnitude of STFT frames is kept (or smoothed) while the phase is randomised. This creates ethereal, time‑stretched‑sounding textures without changing duration. This engine adds signal‑aware weighting using a neural network trained on the fly.
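The idea of keeping magnitude while randomising phase can be illustrated on a single windowed frame. This is a minimal NumPy sketch, not the script's actual code: the function name `randomise_phase`, the Hann window, and the 440 Hz test tone are assumptions made for illustration.

```python
import numpy as np

def randomise_phase(frame, rng):
    """Keep the magnitude spectrum of one windowed frame and replace its phase."""
    spectrum = np.fft.rfft(frame)
    magnitude = np.abs(spectrum)
    # Uniform random phase in [0, 2*pi) for every bin.
    phase = rng.uniform(0.0, 2.0 * np.pi, size=magnitude.shape)
    # DC and Nyquist bins must stay real for a real-valued inverse transform.
    phase[0] = 0.0
    if frame.size % 2 == 0:
        phase[-1] = 0.0
    return np.fft.irfft(magnitude * np.exp(1j * phase), n=frame.size)

rng = np.random.default_rng(42)
t = np.arange(8192) / 44100.0                      # illustrative sample rate
frame = np.hanning(8192) * np.sin(2 * np.pi * 440.0 * t)
diffused = randomise_phase(frame, rng)
```

The diffused frame has the same magnitude spectrum as the input but a different waveform; overlap‑adding many such frames produces the smeared texture.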

Three operating modes (the “models”):

Phase PCA — AE‑weighted Paulstretch. Coherence weights from AE error modulate per‑bin magnitude attenuation.
Phase AR — AR(1) magnitude smearing gated by both AR coefficient (temporal sustain) and AE coherence.
Latent — full latent‑space diffusion. Walk each event’s latent vector toward its cluster centroid (temperature‑annealed), decode to magnitude envelope, blend with original.

Key innovation v5.0: the same log‑mel + NumpyAutoencoder + k‑means++ pipeline used in latent_diffusion.py and latent_barycentric.py is now integrated into the phase‑diffusion engine. The signal itself defines its own “vocabulary” of acoustic events.

Quick start

In Praat, select exactly one Sound object.
Run script… → PhaseDiffusion.praat (from plugin AudioTools).
Choose a Preset (Custom / Veil / Spectral Fog / … Latent Deep) or adjust parameters manually.
Select Model: Phase PCA, Phase AR, or Latent.
Set core parameters: diffusion_amount, steps, window/hop size, mag_smear.
Adjust autoencoder parameters: latent_size, train_steps, n_clusters, temperature (Latent model).
Enable Draw_visualization to see waveforms, spectrograms, intensity, and latent info panel.
Click OK – Python auto‑detection runs, autoencoder trains (150 steps default), processing starts.

Quick tip: Start with “Spectral Fog” (Phase PCA) to hear gentle AE‑weighted diffusion. For extreme texture, try “Void” (Phase AR, max smear). The new latent presets (Latent Drift, Morph, Deep) navigate the learned cluster space. Enable Draw_visualization to see the latent panel with cluster sizes and AE details.

Important: This script calls an external Python process, which requires numpy and soundfile. Auto‑detection tries python3, python, and py in turn. The first run may be slower because of on‑the‑fly training (~150 steps). The Latent model performs diffusion_steps gradient steps per event, so larger values increase CPU time. The Preserve_transients option reduces the diffusion amount on high‑spectral‑flux frames.
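To check the Python side before running the Praat script, something like the following can help. It is a sketch only: the plugin does its own detection inside Praat, and the helper names `find_python` and `missing_modules` are hypothetical. Note that `missing_modules` checks the interpreter that runs this snippet, which may differ from the one Praat finds.

```python
import importlib.util
import shutil

def find_python(candidates=("python3", "python", "py")):
    """Return the path of the first candidate command found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None

def missing_modules(names=("numpy", "soundfile")):
    """Return the listed modules that the current interpreter cannot import."""
    return [n for n in names if importlib.util.find_spec(n) is None]

print("interpreter:", find_python())
print("missing modules:", missing_modules())
```

An empty list of missing modules means both dependencies are importable from the interpreter that ran the check.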

Architecture v5.0 — latent autoencoder integration

The stage list below shows the signal flow. All three models share the first stages.

Stage 1 – Segmentation (spectral flux onset detection → events 0.1–2.0 s)
Stage 2 – Log‑Mel patches (40 mel bands, 16 frames per event, flattened)
Stage 3 – NumpyAutoencoder (input→hidden→latent→hidden→output; leaky ReLU, Adam, denoising, L2)
Stage 4 – Encode → latent Z & reconstruction error (error = coherence measure)
Stage 4b – K‑means++ clustering (in latent space, for latent model only)
Stage 5 – Project error to FFT bin weights (mel filterbank inversion) → ae_weights[f]
Stage 6 – Paulstretch loop (window/hop) with model‑specific diffusion:

PCA: magnitude attenuation = 1 − diffusion_amount × (1−ae_weight)^mag_smear
AR: IIR smear state = decay * state + (1‑decay)*mag, decay = ARcoeff × (1−ae_weight) × amount × smear
Latent: diffuse latent Z toward centroid → decode → magnitude envelope (mel→FFT) → blend with frame magnitude

Stage 7 – Overlap‑add, dry/wet crossfade, RMS match, output

NumpyAutoencoder (pure NumPy, no external ML)

📐 Architecture

Input dim = N_MELS × MEL_FRAMES = 40×16 = 640.
Hidden = max(2×latent, min(256, √(640×latent))).
Latent = user‑defined (2–16).
Training: 150 steps (default), denoising noise annealed 0.3→0.15, learning rate 0.003→0.002, Adam optimizer.
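As a worked example of the hidden‑layer formula above, the sketch below evaluates it for a few latent sizes. Truncating to an integer is an assumption; the guide does not say how the value is rounded.

```python
import math

N_MELS = 40
MEL_FRAMES = 16

def hidden_size(latent, input_dim=N_MELS * MEL_FRAMES):
    """Hidden = max(2*latent, min(256, sqrt(input_dim*latent)))."""
    # int() truncation is an assumption, not confirmed by the guide.
    return int(max(2 * latent, min(256, math.sqrt(input_dim * latent))))

for latent in (2, 8, 16):
    print(latent, hidden_size(latent))
```

With the default latent size of 8 this gives a hidden layer of 71 units, so the network under these assumptions is 640 → 71 → 8 → 71 → 640.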

Reconstruction error per event is mapped to a 0–1 range: 0 (structured) … 1 (noisy).

Three models — detailed behaviour

1. Phase PCA (AE‑weighted Paulstretch)

Formula per bin: coherence_w = 1 - ae_weight[f]
attenuation = 1 − diffusion_amount × (coherence_w ** mag_smear)
mag_wet = magnitude × attenuation
Phase randomised uniformly.

Result: Tonal bins (low ae_weight) are attenuated most; noisy bins keep more of their original magnitude. Phase is randomised in every bin.
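The per‑bin formula above translates directly into NumPy. This is a sketch with illustrative inputs; the function name `pca_attenuation` and the three sample weights are assumptions.

```python
import numpy as np

def pca_attenuation(ae_weight, diffusion_amount=0.70, mag_smear=1.0):
    """attenuation = 1 - diffusion_amount * (1 - ae_weight) ** mag_smear"""
    coherence_w = 1.0 - np.asarray(ae_weight, dtype=float)
    return 1.0 - diffusion_amount * coherence_w ** mag_smear

ae_weight = np.array([0.0, 0.5, 1.0])      # tonal, mixed, noisy (illustrative)
magnitude = np.ones(3)
att = pca_attenuation(ae_weight)
mag_wet = magnitude * att
print(att)
```

With the default amount of 0.70 and mag_smear of 1.0, a fully tonal bin keeps 30% of its magnitude, while a fully noisy bin is left untouched.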

2. Phase AR (IIR smear gated by AE & AR)

First, AR(1) coefficients are fit per bin from the sequence of magnitude frames.
Decay per bin = ARcoeff[f] × (1 − ae_weight[f]) × diffusion_amount × mag_smear (clipped ≤0.97).
Smear state updated: state = decay * state + (1‑decay) * magnitude
Phase always randomised.

Result: Only bins that are both temporally sustained (high AR) and spectrally structured (low ae_weight) receive heavy smearing.
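The gating described above can be sketched as follows. The AR(1) estimator used here (mean‑centred lag‑1 least squares, clipped to 0–1) is an assumption, since the guide does not specify how the coefficients are fit; the decay formula, the 0.97 clip, and the state update follow the text.

```python
import numpy as np

def ar1_coefficients(mags):
    """Lag-1 least-squares AR(1) coefficient per bin; mags is (frames, bins)."""
    centred = mags - mags.mean(axis=0)
    num = np.sum(centred[1:] * centred[:-1], axis=0)
    den = np.sum(centred[:-1] ** 2, axis=0) + 1e-12
    return np.clip(num / den, 0.0, 1.0)   # clipping range is an assumption

def ar_smear(mags, ae_weight, diffusion_amount=0.70, mag_smear=1.0):
    """Smear magnitudes with a per-bin one-pole IIR gated by AR and AE weights."""
    decay = ar1_coefficients(mags) * (1.0 - ae_weight) * diffusion_amount * mag_smear
    decay = np.clip(decay, 0.0, 0.97)
    state = mags[0].copy()
    out = np.empty_like(mags)
    for i, mag in enumerate(mags):
        state = decay * state + (1.0 - decay) * mag
        out[i] = state
    return out, decay

rng = np.random.default_rng(42)
mags = np.abs(rng.normal(size=(200, 4)))
mags[:, 0] = 1.0 + 0.5 * np.sin(np.linspace(0, 4 * np.pi, 200))  # sustained bin
ae_weight = np.array([0.1, 0.1, 0.9, 1.0])
smeared, decay = ar_smear(mags, ae_weight)
```

In this toy input only bin 0 is both sustained and structured, so it receives the largest decay; bin 3 has ae_weight = 1 and passes through unchanged.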

3. Latent (full latent‑space diffusion)

Each event has latent vector Z. K‑means++ clusters Z into N groups. Then for each event:

Walk Z toward its cluster centroid using temperature‑annealed gradient steps (same engine as latent_diffusion.py).
Diffused Z is decoded → log‑mel patch → exponentiated → projected to FFT bins via mel filterbank → magnitude envelope.
Frame magnitude = (1‑diffusion_amount) * orig_mag + diffusion_amount * latent_envelope (scaled).
Phase randomised.

Result: Each event becomes a blend of acoustically similar events from the recording, navigated by temperature (higher = more chaotic walk).
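The clustering step can be illustrated with a compact k‑means++ implementation. This is a generic textbook sketch, not the code from the plugin: the helper names, the iteration count, and the synthetic latent vectors are assumptions.

```python
import numpy as np

def sq_dists(z, centres):
    """Squared Euclidean distance from every row of z to every centre."""
    return ((z[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)

def kmeans_pp(z, n_clusters=4, n_iter=20, seed=42):
    rng = np.random.default_rng(seed)
    # k-means++ seeding: first centre uniform, the rest drawn with
    # probability proportional to squared distance from the nearest centre.
    centres = [z[rng.integers(len(z))]]
    for _ in range(n_clusters - 1):
        d2 = sq_dists(z, np.array(centres)).min(axis=1)
        centres.append(z[rng.choice(len(z), p=d2 / d2.sum())])
    centres = np.array(centres, dtype=float)
    # Lloyd iterations: assign to nearest centre, then recompute means.
    for _ in range(n_iter):
        labels = sq_dists(z, centres).argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centres[k] = z[labels == k].mean(axis=0)
    return centres, sq_dists(z, centres).argmin(axis=1)

rng = np.random.default_rng(0)
z = np.vstack([rng.normal(0.0, 0.1, (20, 8)), rng.normal(5.0, 0.1, (20, 8))])
centres, labels = kmeans_pp(z, n_clusters=2)
```

Here `z` stands in for the per‑event latent vectors (latent size 8); each event's centroid is `centres[labels[i]]`.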

Presets (12 choices: Custom plus the 11 built‑in presets below)

Preset	Model	Amount	Steps	Window	Latent size	Description
Veil	PCA	0.30	20	8192	6	Barely‑there haze
Spectral Fog	PCA	0.60	30	8192	8	Mid diffusion, AE‑weighted
Ambient Wash	PCA	0.80	30	16384	8	Deep smooth wash, transients protected
Drone Cloud	AR	0.95	30	32768	8	Max wash, heavy smear
Stutter Field	PCA	0.70	20	2048	6	Grainy micro‑texture
Formant Ghost	AR	0.55	30	8192	8	Heavy magnitude smear (2.0), envelope blur
Phase Plasma	PCA	1.00	30	16384	8	Full PCA, no transient protection
Void	AR	1.00	50	32768	12	Maximum everything
Latent Drift	Latent	0.50	20	8192	8	Gentle latent walk, low temp (0.5)
Latent Morph	Latent	0.75	40	8192	8	Stronger blend, temp=1.5
Latent Deep	Latent	1.00	60	16384	12	Maximum latent diffusion, temp=3.0

Parameters & defaults

Common parameters

Parameter	Default	Description
Preset	Custom	Load predefined combination
Diffusion_amount	0.70	Dry/wet crossfade (0–1)
Diffusion_steps	30	Latent: gradient steps; PCA/AR: reserved
Window_size	8192	FFT window (samples)
Hop_size	2048	STFT hop (samples)
Mag_smear	1.0	PCA: exponent on coherence; AR: multiplier on decay
Model	Phase PCA	pca / ar / latent

Autoencoder / latent parameters

Parameter	Default	Description
Latent_size	8	Bottleneck dimension (2–16)
Train_steps	150	AE training iterations
N_clusters	4	k‑means++ clusters in latent space (2–8)
Temperature	1.0	Latent diffusion temperature (≥0.05)

Options

Flag	Default	Effect
Preserve_transients	yes	Reduce diffusion_amount on high‑flux frames
Draw_visualization	yes	Show waveform, spectrograms, intensity, latent info
Play_result	yes	Auto‑play after processing
Debug	no	Print detailed Python info
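One way Preserve_transients could work is sketched below. Spectral flux is computed in its common half‑wave‑rectified form; the scaling rule and the `protection` parameter are hypothetical, since the guide only says that diffusion_amount is reduced on high‑flux frames.

```python
import numpy as np

def spectral_flux(mags):
    """Sum of positive frame-to-frame magnitude increases; mags is (frames, bins)."""
    diff = np.diff(mags, axis=0, prepend=mags[:1])
    return np.sum(np.maximum(diff, 0.0), axis=1)

def per_frame_amount(mags, diffusion_amount=0.70, protection=0.8):
    """Scale the amount down on high-flux frames (hypothetical rule)."""
    flux = spectral_flux(mags)
    peak = flux.max()
    flux_norm = flux / peak if peak > 0 else np.zeros_like(flux)
    return diffusion_amount * (1.0 - protection * flux_norm)

mags = np.zeros((6, 3))
mags[3:] = 1.0                      # a sudden onset at frame 3
amounts = per_frame_amount(mags)
print(amounts)
```

In this toy input only the onset frame has non‑zero flux, so only that frame gets a reduced amount.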

Applications & techniques

✨ Ambient / drone texture

Use Drone Cloud (AR) or Ambient Wash (PCA) with large window (16384–32768). Enable preserve_transients to keep attacks if needed. The AE weights protect noisy fragments, so texture remains clean.

🌀 Spectral morphing (latent models)

Latent Morph or Latent Deep. Each event’s spectrum is replaced by a blend of similar events from the same file. Increase temperature for more adventurous walks between clusters.

🎚️ Gating / transient design

Set preserve_transients = yes and use Phase PCA with low amount (0.3–0.5). Transients remain sharp, sustained parts diffuse.

🔬 Reproducible sound design

Because the autoencoder is trained from scratch on the input, the same preset yields different results on different signals, but identical results on the same signal (fixed seed = 42). This makes runs repeatable, which is useful when results must be reproduced, for example in research.

📊 Understanding the latent panel

Visualisation includes:
• Original / diffused waveforms & spectrograms
• Intensity overlay (grey=original, colour=diffused)
• Autoencoder panel: architecture description, training steps, latent size, clusters
• For latent model: temperature, steps, “each event Z walked toward centroid”
• Parameter summary: RMS change, window, hop, transient protection

Troubleshooting:

No sound / nearly silent: diffusion_amount too high + preserve_transients = no? Try lower amount or enable preservation.
Python not found: make sure Python 3 is installed along with numpy & soundfile, or manually set the python command in the script.
Latent model slow: reduce diffusion_steps (20–30) or window_size.
Clipping: output is RMS‑matched and capped at 3× gain; if still clipping, reduce input level.
Phase inversion (silence): random phase can produce cancellations; blend with dry (diffusion_amount < 1) to avoid.
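The RMS matching with a 3× gain cap mentioned under Clipping can be sketched like this. The function name `rms_match` and the handling of a silent wet signal are assumptions.

```python
import numpy as np

def rms_match(wet, dry, max_gain=3.0):
    """Scale wet so its RMS matches dry, never applying more than max_gain."""
    rms_dry = np.sqrt(np.mean(dry ** 2))
    rms_wet = np.sqrt(np.mean(wet ** 2))
    if rms_wet == 0.0:
        return wet                  # nothing to scale (assumed behaviour)
    gain = min(rms_dry / rms_wet, max_gain)
    return wet * gain

t = np.arange(44100) / 44100.0
dry = 0.5 * np.sin(2 * np.pi * 220.0 * t)
matched = rms_match(0.5 * dry, dry)      # needs gain 2, within the cap
capped = rms_match(0.1 * dry, dry)       # needs gain 10, capped at 3
```

RMS matching does not bound the peak level, which is why a diffused signal with a higher crest factor than the input can still clip after matching.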

Mathematical deep dive – latent diffusion step

For latent vector Z, cluster centres μ_k, cluster variances σ_k², temperature T:
Weight for cluster k: w_k ∝ exp( –‖Z‑μ_k‖²/(2σ_k² T) )
Gradient: ∇ = Σ w_k (Z‑μ_k)/(σ_k²)
Step: Z ← Z – step_size · ∇ + noise·√T
step_size = 0.6·(1/T) / (1+1/T) · (1‑0.4·frac)
After N steps (diffusion_steps), blend with original Z: Z_out = (1‑amount)·Z_orig + amount·Z_walked

The decoded magnitude envelope is then used as the spectral shape for each frame of that event – a form of latent‑guided cross‑synthesis with itself.
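The update rule above can be written out in NumPy as a sketch. Two things are assumptions: `frac` is taken to be the fraction of steps completed (i / N), and the noise term is Gaussian with a hypothetical `noise_scale` parameter, since the guide does not define either.

```python
import numpy as np

def latent_walk(z_orig, centroids, variances, temperature=1.0, steps=30,
                amount=0.70, noise_scale=0.0, rng=None):
    """Walk z toward nearby cluster centroids, then blend with the original."""
    if rng is None:
        rng = np.random.default_rng(42)
    z = np.asarray(z_orig, dtype=float).copy()
    inv_t = 1.0 / temperature
    for i in range(steps):
        diff = z - centroids                                   # (k, d)
        # w_k proportional to exp(-||Z - mu_k||^2 / (2 sigma_k^2 T)), normalised.
        logw = -np.sum(diff ** 2, axis=1) / (2.0 * variances * temperature)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        grad = np.sum(w[:, None] * diff / variances[:, None], axis=0)
        frac = i / steps                                       # assumed meaning
        step_size = 0.6 * inv_t / (1.0 + inv_t) * (1.0 - 0.4 * frac)
        noise = noise_scale * rng.normal(size=z.shape)         # assumed form
        z = z - step_size * grad + noise * np.sqrt(temperature)
    return (1.0 - amount) * np.asarray(z_orig, dtype=float) + amount * z

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
variances = np.array([1.0, 1.0])
z0 = np.array([1.0, 1.0])
z_out = latent_walk(z0, centroids, variances, amount=1.0)
```

With noise disabled, the walk converges on the nearest centroid. Since (1/T)/(1+1/T) = 1/(1+T), a higher temperature shrinks the step size and flattens the cluster weights, while the noise term grows with √T.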