Latent Barycentric Mutation — User Guide

Trains a VAE on-the-fly from event-level audio patches, then navigates the latent space according to a navigation plan. At each step, K nearest-neighbor events are found and their waveforms are mixed with barycentric (inverse-distance) weights.

Author: Shai Cohen
Affiliation: Department of Music, Bar-Ilan University, Israel
Version: 1.1 (2025)
License: MIT License
Citation: Cohen, S. (2025). Praat AudioTools
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

What this does

This script implements a Latent Barycentric Mutation engine — a system that trains a VAE on-the-fly from event-level audio patches, then navigates the latent space according to a navigation plan. At each step, K nearest-neighbor events are found and their waveforms are mixed with barycentric (inverse-distance) weights, creating seamless morphs between acoustic identities.

🧠 What is Barycentric Mutation?

This approach combines several advanced concepts:

  • VAE encoding: Each event is encoded to a latent mean vector μ via a Variational Autoencoder
  • Navigation plan: A sequence of steps with modes (drift, mutate, return, settle) that determine how to move through latent space
  • K-nearest neighbors: At each step, find the K events closest to the current latent position
  • Barycentric mixing: Mix the K event waveforms with weights inversely proportional to distance
  • Pitch preservation: Optional phase-vocoder time-stretching preserves pitch when durations differ

The result is a continuous morph through the acoustic identity space of the source.

Key Features:

Technical Implementation:

  1. Event Segmentation: Praat segments audio into events.
  2. Mel Patches: 40×32 log-mel patches per event.
  3. VAE: Train on-the-fly, encode to latent means μ.
  4. Navigation Plan: Load from CSV or auto-generate with 4 modes.
  5. Execution: For each step, find K nearest events, compute barycentric weights, mix, apply duration/energy scaling, crossfade.
  6. Pitch Preservation: Phase vocoder or pad/trim as needed.
  7. Normalization: Scale to match input RMS or peak.
  8. Visualization.

Quick start

  1. In Praat, select exactly one Sound object (any duration, any content).
  2. Run script… → select LatentBarycentric.praat.
  3. Choose Preset (2-5 for specific strategies, 1 for custom).
  4. Set latent size and duration (0 = original).
  5. Choose Plan source (External CSV or Auto-generate).
  6. If auto-generate, set generator controls (steps, mode, step size, temperature, K, return strength, etc.).
  7. Select normalization mode and pitch preservation mode.
  8. Enable Draw_visualization for analysis display.
  9. Click OK — engine segments, trains VAE, executes navigation plan, reconstructs.
Quick tip: Start with the Gentle Drift preset and auto-generate mode "cycle" on a 10-20 second recording with varied texture. Enable visualization to see the navigation phase panel, which shows how many steps fall in each mode. Listen to how the sound morphs through the latent space, transitioning smoothly between acoustic identities. The output appears as "source_bary" in the Objects window.
Important:

  • Python dependencies: requires numpy, soundfile, scipy (no scikit-learn).
  • VAE training happens on-the-fly and may take 1-2 minutes.
  • An external plan CSV must match the expected schema.
  • Pitch mode "preserve_spectral_envelope" uses a phase vocoder, which is computationally intensive but preserves pitch.
  • Normalization mode "rms" scales the output to match the input RMS (IRCAM-style).
  • A 12 ms crossfade between mixed segments prevents clicks.
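Before running the script, it can help to confirm that the three packages named above are importable. The following is a minimal sketch using only the standard library; the function name `missing_packages` is hypothetical and is not part of the plugin. It assumes you run it with the same Python interpreter that Praat calls.

```python
import importlib.util

# Packages this guide lists as required.
REQUIRED = ("numpy", "soundfile", "scipy")

def missing_packages(names=REQUIRED):
    """Return the package names that cannot be imported by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing:", ", ".join(missing))
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```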

Latent Barycentric Theory

VAE Architecture

Input: log-mel patch (40 mel bands × 32 frames = 1280 features)
Encoder: input (1280) → hidden (h) → (μ, log σ²) [latent = L]
Decoder: sample z ∼ 𝒩(μ, σ) → hidden (h) → output (1280)
  where h = max(L×2, min(256, √(1280×L))) (geometric mean scaling)
Loss = MSE_recon + β × KL[𝒩(μ,σ) || 𝒩(0,I)] (β = 0.05)

We use only μ for latent navigation (no sampling at inference).
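As a worked example of the hidden-size rule and the loss above, here is a small NumPy sketch. It is not the script's own code: the function names, the truncation of h to an integer, and the batch reduction of the KL term are assumptions. The KL expression is the standard closed form for a diagonal Gaussian against a standard normal.

```python
import numpy as np

N_FEATURES = 40 * 32   # 1280 log-mel values per patch
BETA = 0.05            # KL weight stated in this guide

def hidden_size(latent):
    """h = max(2L, min(256, sqrt(1280 * L))). Truncating to int is an assumption."""
    return int(max(2 * latent, min(256, np.sqrt(N_FEATURES * latent))))

def vae_loss(x, x_recon, mu, logvar, beta=BETA):
    """MSE reconstruction term plus beta-weighted KL to a standard normal."""
    mse = np.mean((x - x_recon) ** 2)
    # Closed-form KL[N(mu, sigma^2) || N(0, I)], summed over latent dimensions.
    # Averaging over the batch is an assumption about the reduction.
    kl = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1)
    return mse + beta * np.mean(kl)

for latent in (2, 8, 32):
    print(latent, hidden_size(latent))
```

Under these assumptions the default latent size of 8 gives a hidden layer of 101 units, and the 256 cap is never reached within the documented latent range of 2 to 32.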

Barycentric Mixing

⚖️ Inverse-Distance Weights

At each step, given current latent position z and K nearest neighbors z₁...zₖ:

dᵢ = ||z - zᵢ||₂
wᵢ = (1/dᵢ) / Σⱼ (1/dⱼ) (inverse distance weighting)
mixed = Σᵢ wᵢ × clipᵢ

This creates a smooth blend where closer events contribute more.
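The weighting above can be sketched in a few lines of NumPy. This is an illustration rather than the script's implementation: `barycentric_mix` is a hypothetical name, the clips are assumed to have already been brought to equal length, and the small `eps` guard against a zero distance is an assumption.

```python
import numpy as np

def barycentric_mix(z, latents, clips, k=4, eps=1e-9):
    """Mix the k clips whose latent means lie closest to z.

    latents: (n_events, latent_dim) array of encoder means
    clips:   sequence of equal-length 1-D waveforms, one per event
    eps:     guard against division by zero (assumption, not from the guide)
    """
    d = np.linalg.norm(latents - z, axis=1)   # Euclidean distances d_i
    idx = np.argsort(d)[:k]                   # indices of the k nearest events
    inv = 1.0 / (d[idx] + eps)
    w = inv / inv.sum()                       # normalised weights, sum to 1
    mixed = sum(wi * clips[i] for wi, i in zip(w, idx))
    return mixed, idx, w

# Toy usage: four events on a line, query point close to event 0.
latents = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [10.0, 0.0]])
clips = [np.full(8, float(i + 1)) for i in range(4)]
mixed, idx, w = barycentric_mix(np.array([0.25, 0.0]), latents, clips, k=2)
print(idx, w)
```

With distances 0.25 and 0.75, the nearer event receives three times the weight of the farther one.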

Navigation Modes

Pitch Preservation

Three modes:

off: linear interpolation resampling (may shift pitch)
    t_src = np.linspace(0, n-1, target_len)
    result = np.interp(t_src, np.arange(n), clip)

preserve_f0: pad/trim without resampling (duration changes, pitch preserved)
  • If longer: fade out near trim point, then truncate
  • If shorter: fade out near end, then zero-pad

preserve_spectral_envelope: phase vocoder time-stretch
  • STFT with 75% overlap (Hann window, OLA normalisation)
  • Instantaneous frequency estimation for phase propagation
  • Preserves both pitch and spectral envelope
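The preserve_f0 behaviour (pad or trim without resampling) could look roughly like the sketch below. The function name `fit_length_keep_pitch`, the linear fade shape, and the 64-sample fade length are all hypothetical; the guide does not state the values the script uses.

```python
import numpy as np

def fit_length_keep_pitch(clip, target_len, fade_len=64):
    """Pad or trim clip to target_len without resampling, so pitch is unchanged.

    fade_len (samples) and the linear fade are assumptions for illustration.
    """
    clip = np.asarray(clip, dtype=float)
    if len(clip) == target_len:
        return clip.copy()                      # nothing to do
    out = clip[:target_len].copy()              # truncate if the clip is longer
    n_fade = min(fade_len, len(out))
    if n_fade > 0:
        # Fade out the last samples before the cut (or before the padding).
        out[-n_fade:] *= np.linspace(1.0, 0.0, n_fade)
    if len(out) < target_len:
        out = np.pad(out, (0, target_len - len(out)))   # zero-pad if shorter
    return out
```

Because no samples are interpolated, the waveform keeps its original pitch; only its duration changes.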

Normalization Modes

none: no scaling (pass through)

peak: scale so peak = -1 dBFS (0.891)
    out *= 0.891 / peak

rms / loudness: scale output RMS to match input RMS, then limit peak ≤ 0.891
    gain = ref_rms / out_rms
    out *= gain
    if peak > 0.891: out *= 0.891 / peak
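The three modes can be combined into one function, sketched here with NumPy. The function name `normalize`, the handling of silent input, and the error for unknown modes are assumptions added for the example.

```python
import numpy as np

PEAK_TARGET = 0.891   # about -1 dBFS

def normalize(out, ref, mode="rms"):
    """Scale `out` according to the mode, using `ref` (the input) for RMS matching."""
    out = np.asarray(out, dtype=float).copy()
    peak = np.max(np.abs(out)) if out.size else 0.0
    if mode == "none" or peak == 0.0:           # silent input is left alone (assumption)
        return out
    if mode == "peak":
        return out * (PEAK_TARGET / peak)
    if mode in ("rms", "loudness"):
        ref_rms = np.sqrt(np.mean(np.square(ref)))
        out_rms = np.sqrt(np.mean(np.square(out)))
        out *= ref_rms / out_rms                # match the input RMS
        peak = np.max(np.abs(out))
        if peak > PEAK_TARGET:                  # then limit the peak
            out *= PEAK_TARGET / peak
        return out
    raise ValueError(f"unknown mode: {mode}")
```

Note that when the peak limiter engages in rms mode, the final RMS ends up below the input RMS.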

Preset Strategies

Preset 2: Gentle Drift

🌊 Smooth, Coherent

Latent: 6 | Plan: auto (defaults) | Pitch: off

Character: Smooth drift through latent space — gentle, coherent evolution

Use on: Ambient, gradual transformations

Preset 3: Full Mutation Arc

🌀 Exploratory Arc

Latent: 10 | Plan: auto (defaults) | Pitch: off

Character: Full cycle through all modes — drift → mutate → return → settle

Use on: Complete narrative arcs

Preset 4: Return Focus

🎯 Anchor-Focused

Latent: 8 | Plan: auto (defaults) | Pitch: off

Character: Emphasis on return mode — pulls toward anchor positions

Use on: Creating tension and resolution

Preset 5: Slow Settle

🧘 Convergent

Latent: 6 | Plan: auto (defaults) | Pitch: off

Character: Gradual settling into stable region

Use on: Resolution, conclusion

Parameters & Controls

Core Settings

Parameter | Default | Description
Latent_size | 8 | VAE latent dimensions (2–32)
Duration (0 = original) | 0 | Target output duration (seconds)
Seed | 42 | Random seed for reproducibility

Plan Generator

Parameter | Default | Description
Plan_source | External CSV | External CSV or Auto-generate
Plan_mode_preset | cycle | drift, mutate, return, cycle
Plan_steps | 60 | Number of navigation steps (4–500)
Plan_step_size | 0.35 | Movement step size (0–1)
Plan_temperature | 0.40 | Noise level (0–1)
Plan_k_neighbors | 4 | Number of neighbors for barycentric mix (2–8)
Plan_return_strength | 0.65 | Strength of return pull (0–1)
Plan_anchor_strategy | center | center, step0, last, periodic
Plan_anchor_period | 15 | Steps between anchors (periodic mode)
Plan_dur_scale | 1.0 | Duration multiplier per step (0.25–4.0)
Plan_dur_jitter | 0.0 | Random variation in duration (±)
Plan_eng_scale | 1.0 | Energy multiplier per step (0.1–3.0)
Plan_eng_jitter | 0.0 | Random variation in energy (±)

Output Level

Parameter | Default | Description
Normalize_mode | rms | none, peak, rms, loudness

Pitch Preservation

Parameter | Default | Description
Pitch_mode | off | off, preserve_f0, preserve_spectral_envelope

Output

Parameter | Default | Description
Draw_visualization | 1 | Generate 6-panel analysis display
Play_result | 1 | Audition after processing

Visualization & Analysis

6-Panel Display

Latent Barycentric Mutation Visualization:

Panel 1: TITLE
  • Script name, source name, preset, latent size, plan source

Panel 2: INPUT WAVEFORM
  • Gray waveform with red dotted lines = event boundaries
  • Title: "Original (N events)"

Panel 3: OUTPUT WAVEFORM
  • Blue waveform = barycentric output
  • Title: "Barycentric"
  • X-axis: Time (s)

Panel 4: INPUT SPECTROGRAM
  • 0-5000 Hz spectrogram of original
  • Title: "Original spectrogram"

Panel 5: OUTPUT SPECTROGRAM
  • 0-5000 Hz spectrogram of barycentric output (L channel)
  • Title: "Barycentric output spectrogram (L channel)"

Panel 6: NAVIGATION PLAN PHASE PANEL
  • Colored bars showing steps in each mode:
      - Blue = drift (coherent small steps)
      - Red = mutate (large exploratory steps)
      - Green = return (gravitational pull)
      - Yellow = settle (cool-down)
  • Step counts for each mode displayed
  • Title: "Navigation plan phases (auto-generated — cycle):"

Panel 7: SUMMARY PANEL
  • Events, plan steps, used steps, mean event duration
  • VAE loss (initial→final), latent size, seed
  • Duration in/out, normalization mode, RMS comparison
  • Pitch mode, plan source
  • Warnings if any

Reading the Navigation Phase Panel

What the colors mean:
  • Blue (drift): Smooth, coherent movement with inertia — gradual evolution
  • Red (mutate): Exploratory jumps with high temperature — sudden changes
  • Green (return): Pull toward anchor positions — creates tension and resolution
  • Yellow (settle): Cool-down, convergence — stabilization
  • The step counts show how many steps in each mode (useful for understanding the plan)
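
The per-mode step counts shown in the panel amount to a simple tally over the plan. A minimal sketch, assuming the plan is available as a list of mode names (the actual plan format is not described in this guide, and `mode_counts` is a hypothetical helper):

```python
from collections import Counter

# The four navigation modes, in the order listed above.
MODES = ("drift", "mutate", "return", "settle")

def mode_counts(step_modes):
    """Count plan steps per navigation mode, including modes with zero steps."""
    counts = Counter(step_modes)
    return {mode: counts.get(mode, 0) for mode in MODES}

# Toy plan for illustration only.
plan = ["drift"] * 3 + ["mutate"] * 2 + ["return"] * 2 + ["settle"]
print(mode_counts(plan))
```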

Interpreting Output Spectrogram

What to look for:
  • Smooth transitions: The barycentric mixing should create seamless morphs between events
  • Pitch preservation: With preserve_f0 or preserve_spectral_envelope, pitch should remain stable even as timbre evolves
  • Mode changes: In mutate mode, you may hear more abrupt changes; in drift, smoother evolution

Applications

Electroacoustic Composition

Use case: Creating continuous morphs between acoustic identities

Technique: Full Mutation Arc preset with auto-generate cycle

Workflow:

Sound Design for Media

Use case: Creating evolving textures, transitions, risers

Technique: Gentle Drift with preserve_spectral_envelope

Applications:

Music Production

Use case: Creating evolving pads, generative textures

Technique: Custom plan with specific mode distribution

Examples:

Research & Education

Use case: Studying VAE latent spaces, barycentric interpolation, navigation strategies

Technique: Compare plans on same source, examine phase panel

Learning outcomes:

Practical Workflow Examples

🎬 Film Scene: Emotional Arc

Goal: Create 60-second cue with emotional arc (calm → agitated → resolution)

Settings:

  • Source: 30-second ambient recording
  • Plan: auto, mode=cycle (drift→mutate→return→settle)
  • Pitch mode: preserve_f0

Result: Smooth drift, then exploratory mutations, a return to the anchor, and a final settle, matching the calm → agitated → resolution arc

🎚️ Electronic Music: Risers

Goal: Create 15-second riser

Settings:

  • Source: 8-second synth stab
  • Plan: auto, mode=mutate (all steps mutate)
  • Temperature: 0.8 (high exploration)

Result: Continuous exploration through latent space — chaotic riser

🎙️ Voice Processing: Character Morph

Goal: Morph between different vocal characters

Settings:

  • Source: 20-second vocal with multiple characters
  • Plan: external CSV from AI planner
  • Pitch mode: preserve_spectral_envelope

Result: Smooth morphs between vocal identities with preserved pitch

Troubleshooting Common Issues

Problem: Python not found or missing packages
Cause: Python not installed, or packages missing
Solution: Install Python and required packages: pip install numpy soundfile scipy

Problem: External plan CSV not found
Cause: Plan source set to External but file missing
Solution: Switch to Auto-generate, or provide valid CSV

Problem: Phase vocoder artifacts (metallic sounds)
Cause: Extreme time-stretching ratios
Solution: Use preserve_f0 mode instead, or reduce dur_scale range

Problem: Output has clicks
Cause: Crossfade insufficient at splice points
Solution: Increase XFADE_SEC in Python script (currently 12 ms)

Problem: No audible movement
Cause: step_size too small, or temperature too low
Solution: Increase step_size, increase temperature, check plan modes
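
To see why a longer crossfade helps with the "Output has clicks" problem, here is a minimal sketch of joining two segments with an overlap. `XFADE_SEC` is the constant named above; the function `crossfade_concat` and the linear fade shape are assumptions, and the script may use a different curve.

```python
import numpy as np

XFADE_SEC = 0.012   # 12 ms, the value quoted in this guide

def crossfade_concat(a, b, sr, xfade_sec=XFADE_SEC):
    """Join segments a and b, overlapping them with a linear crossfade."""
    n = min(int(round(xfade_sec * sr)), len(a), len(b))
    if n == 0:
        return np.concatenate([a, b])           # nothing to overlap
    fade_in = np.linspace(0.0, 1.0, n)
    # Tail of a fades out while the head of b fades in.
    overlap = a[-n:] * (1.0 - fade_in) + b[:n] * fade_in
    return np.concatenate([a[:-n], overlap, b[n:]])
```

The result is shorter than the two segments combined by the overlap length, so a larger XFADE_SEC smooths splices at the cost of slightly shortening the output.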

Advanced Techniques

Custom external plans:

Generate navigation plans with other AI tools (e.g., latent navigation AI) and load them via External CSV. The schema must match the expected format.

Phase vocoder tuning:

In _phase_vocoder_stretch(), modify n_fft, hop_out, and OLA floor to adjust quality/performance trade-off.
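
For orientation, the sketch below shows one standard phase-vocoder formulation and where an FFT size and an OLA floor enter. It is not the script's `_phase_vocoder_stretch()`: it uses a single hop with interpolated analysis frames rather than a separate `hop_out`, and the names `pv_stretch` and `ola_floor` are hypothetical.

```python
import numpy as np

def pv_stretch(x, stretch, n_fft=1024, ola_floor=1e-3):
    """Time-stretch x by `stretch` (>1 = longer) while keeping pitch."""
    x = np.asarray(x, dtype=float)
    hop = n_fft // 4                                   # 75% overlap
    win = np.hanning(n_fft)
    xp = np.pad(x, n_fft // 2)                         # centre the first frame on sample 0
    n_frames = 1 + (len(xp) - n_fft) // hop
    if n_frames < 2:
        raise ValueError("clip is too short for this n_fft")
    frames = np.stack([xp[i * hop:i * hop + n_fft] for i in range(n_frames)])
    spec = np.fft.rfft(frames * win, axis=1)           # shape (n_frames, n_bins)

    # Expected phase advance per hop for each bin.
    omega = 2 * np.pi * hop * np.arange(spec.shape[1]) / n_fft
    steps = np.arange(0, n_frames - 1, 1.0 / stretch)  # fractional read positions
    phase = np.angle(spec[0])
    y = np.zeros(n_fft + hop * (len(steps) - 1))
    norm = np.zeros_like(y)
    for j, t in enumerate(steps):
        i = min(int(t), n_frames - 2)                  # guard against float rounding
        frac = t - i
        mag = (1 - frac) * np.abs(spec[i]) + frac * np.abs(spec[i + 1])
        frame = np.fft.irfft(mag * np.exp(1j * phase), n=n_fft) * win
        y[j * hop:j * hop + n_fft] += frame
        norm[j * hop:j * hop + n_fft] += win ** 2
        # Instantaneous frequency: measured phase step, wrapped around omega.
        dphi = np.angle(spec[i + 1]) - np.angle(spec[i]) - omega
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
        phase += omega + dphi
    y /= np.maximum(norm, ola_floor)                   # OLA normalisation with a floor
    target = int(round(len(x) * stretch))
    y = y[n_fft // 2:n_fft // 2 + target]
    return np.pad(y, (0, target - len(y)))
```

In this formulation a larger n_fft gives finer frequency resolution at the cost of time smearing and more computation, and a higher floor suppresses noise amplification at the edges where few windows overlap.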

Anchor strategies:

Experiment with different anchor_strategy values in auto-generate to control return behavior.

Multi-channel input:

The script extracts a mono signal for analysis but keeps stereo through the mixing stage (delayed Haas effect). For true stereo, supply a stereo input; the output will then be stereo.