The Latent Counterpoint — User Guide


Author: Shai Cohen
Affiliation: Department of Music, Bar-Ilan University, Israel
Version: 1.0 (2025)
License: MIT License
Citation: Cohen, S. (2025). Praat AudioTools
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

What this does

This script implements The Latent Counterpoint — an AI-powered multi-agent system that learns a latent space from event-level audio patches using an on-the-fly autoencoder, then deploys multiple agents that navigate this space simultaneously with counterpoint forces (attraction, repulsion, inertia, jitter) to produce polyphonic recombination of the input material.

🎵 What is Latent Counterpoint?

Traditional counterpoint is the art of combining independent melodic lines. This system creates a latent counterpoint:

  • Events are segmented from the source audio (200ms–3s)
  • Autoencoder learns a latent space where each event becomes a point
  • Agents (Cantus, Florid, Shadow) navigate this space with distinct behaviors
  • Forces (attraction to events, repulsion from other agents) create emergent polyphony
  • The result is a new composition where multiple "voices" independently recombine the source material

Key Features:

Technical Implementation:

  1. Event Segmentation: Praat segments audio into 200ms–3s events.
  2. Mel Patches: Python extracts 40×32 log-mel patches per event.
  3. Autoencoder: Trains an MLP with one hidden layer, leaky ReLU, denoising, L2 regularization, and Adam.
  4. Latent Geometry: Computes the latent center, periphery scores, and distance matrix.
  5. Agent Physics: Runs a multi-agent simulation with forces, inertia, and jitter.
  6. Polyphonic Reconstruction: Sums agent outputs with panning and volume scaling.
  7. Visualization & Stats.
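The training step can be sketched roughly as follows. This is a minimal NumPy toy, not the script's actual code: it uses plain gradient descent instead of Adam, a reduced input dimensionality for speed, and random data in place of real log-mel patches; all names and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def train_autoencoder(X, latent=8, steps=300, lr=0.005, noise=0.1, l2=1e-4):
    """Single-hidden-layer denoising autoencoder: X (n, d) -> Z (n, latent)."""
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, latent))          # encoder weights
    b1 = np.zeros(latent)
    W2 = rng.normal(0, 0.1, (latent, d))          # linear decoder weights
    b2 = np.zeros(d)
    leaky = lambda a: np.where(a > 0, a, 0.01 * a)
    dleaky = lambda a: np.where(a > 0, 1.0, 0.01)
    for _ in range(steps):
        Xn = X + rng.normal(0, noise, X.shape)    # denoising: corrupt the input
        A = Xn @ W1 + b1
        Z = leaky(A)
        Xhat = Z @ W2 + b2
        err = Xhat - X                            # ...but reconstruct the clean input
        gW2 = Z.T @ err / n + l2 * W2             # L2-regularized gradients
        gb2 = err.mean(axis=0)
        dZ = (err @ W2.T) * dleaky(A)
        gW1 = Xn.T @ dZ / n + l2 * W1
        gb1 = dZ.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    Z = leaky(X @ W1 + b1)                        # final latent codes (clean input)
    return Z, float((err ** 2).mean())

X = rng.normal(0, 1, (30, 64))   # 30 fake "events"; real patches would be 40x32 flattened
Z, loss = train_autoencoder(X)
print(Z.shape)  # (30, 8)
```

With real material, each row of X would presumably hold one flattened 40×32 log-mel patch, and Z would supply the latent points the agents navigate.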

Quick start

  1. In Praat, select exactly one Sound object (any duration, any content).
  2. Run script… → select LatentCounterpoint.praat.
  3. Choose Preset (2-7 for specific strategies, 1 for custom).
  4. Set number of agents, latent size, counterpoint rigidity, speed.
  5. Set target duration (0 = original).
  6. Enable Draw_visualization for analysis display.
  7. Click OK — the engine segments the audio, trains the autoencoder, runs the agent physics, and reconstructs the output.

Quick tip: Start with the Trio preset on a 10-20 second recording with varied texture. Enable visualization — you'll see event boundaries (red lines) on the input waveform, agent profiles with stats, and unison rates. Listen to how the three voices create independent yet related lines, each navigating the latent space differently. The output appears as "source_cp" in the Objects window.

Important:
  • Python dependencies: requires numpy, soundfile, and scipy (no scikit-learn).
  • Autoencoder training happens on-the-fly and may take 30-60 seconds.
  • Event segmentation uses intensity peaks — if your material has few dynamic changes, consider a different source.
  • Latent size affects the representation — too small may lose detail, too large may overfit.
  • Counterpoint rigidity controls agent repulsion — higher = more independent voices.

Latent Counterpoint Theory

Agent Profiles

🎭 Three Behavioral Archetypes

Profile   Role                  Mass   Max Speed   Jitter   Attraction   Behavior
Cantus    Stable leader         3.0    0.3         0.05     1.0          Gravitates to the center of gravity; slow, stable
Florid    Peripheral wanderer   0.5    1.5         0.2      0.6          Attracted to rare/peripheral sounds; fast, exploratory
Shadow    Lagging mirror        2.0    0.4         0.08     0.3          Mirrors Cantus with 3-step lag + inversion; moderate

Physics Engine

βš™οΈ Agent Dynamics

At each time step, each agent experiences:

Attraction force:
  • Cantus: to the center + mild pull to the nearest event
  • Florid: to peripheral events (weighted by periphery score)
  • Shadow: to the mirrored position of the lagged Cantus + mild event pull

Repulsion force: F_rep = rigidity × median_dist² / d² (capped at speed × 3)

Jitter: deterministic pseudo-random noise scaled by profile

Velocity update:
  a = F / mass
  v = 0.85·v + a
  speed = min(||v||, max_speed × speed_global × median_dist)
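As a rough illustration of the velocity update, here is a NumPy sketch; function and parameter names are illustrative, not the script's actual identifiers:

```python
import numpy as np

def step_agent(pos, vel, force, mass, max_speed, speed_global, median_dist):
    """One physics step: inertia-damped velocity update with a speed cap."""
    accel = force / mass
    vel = 0.85 * vel + accel                 # inertia term
    speed = np.linalg.norm(vel)
    cap = max_speed * speed_global * median_dist
    if speed > cap:                          # clamp to the profile's speed limit
        vel = vel / speed * cap
    return pos + vel, vel

# A heavy Cantus-like agent (mass 3.0, max_speed 0.3) pulled by a constant force:
pos, vel = np.zeros(8), np.zeros(8)          # 8-D latent space
force = np.full(8, 0.5)
pos, vel = step_agent(pos, vel, force, mass=3.0, max_speed=0.3,
                      speed_global=0.5, median_dist=1.0)
print(round(float(np.linalg.norm(vel)), 3))  # 0.15: clamped to the cap
```

Note how the heavy Cantus profile resists acceleration (force divided by mass) while the cap keeps even a large accumulated velocity within its slow, stable character.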

LRU memory: Each agent remembers last 5 chosen events and applies distance penalty to avoid repetition.
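A sketch of how such an LRU distance penalty might work; the 0.3 scaling echoes the value quoted in Advanced Techniques, but all names here are illustrative:

```python
import numpy as np
from collections import deque

def penalized_dists(agent_pos, Z, memory, median_dist, penalty=0.3):
    """Agent-to-event distances, with recently used events pushed away."""
    d = np.linalg.norm(Z - agent_pos, axis=1)
    for idx in memory:                       # LRU memory of recent choices
        d[idx] += median_dist * penalty
    return d

Z = np.array([[0.0, 0.0],                    # event 0: nearest, but just used
              [0.1, 0.0],                    # event 1: nearly as close
              [2.0, 2.0]])                   # event 2: far away
memory = deque([0], maxlen=5)                # remembers the last 5 choices
d = penalized_dists(np.zeros(2), Z, memory, median_dist=1.0)
print(int(d.argmin()))  # 1: the penalty steers the agent away from event 0
```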

Latent Geometry

For latent vectors Z[1..n_events]:

  center = mean(Z)
  periphery[i] = ||Z[i] - center|| / max(||Z - center||)
  dists[i,j] = ||Z[i] - Z[j]||
  median_dist = median(dists where i ≠ j)

These provide the spatial reference for agent movement.
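These quantities are straightforward to compute with NumPy; a sketch with toy data (identifiers illustrative):

```python
import numpy as np

def latent_geometry(Z):
    """Center, periphery scores, pairwise distance matrix, median distance."""
    center = Z.mean(axis=0)
    radial = np.linalg.norm(Z - center, axis=1)
    periphery = radial / radial.max()                  # 1.0 = most peripheral
    dists = np.linalg.norm(Z[:, None] - Z[None, :], axis=2)
    offdiag = dists[~np.eye(len(Z), dtype=bool)]       # exclude i == j
    return center, periphery, dists, float(np.median(offdiag))

# Four toy latent points; the last one is an outlier:
Z = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0]])
center, periphery, dists, median_dist = latent_geometry(Z)
print(int(periphery.argmax()))  # 3: the outlier scores highest
```

High-periphery events like this are exactly what the Florid profile seeks out.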

Polyphonic Reconstruction

Each agent produces a mono sequence by concatenating its chosen events:

  layer_i = concatenate(clips[history_i]) with crossfades

Panning and volume by profile:
  • Cantus: vol=0.6, pan=0.5 (center)
  • Florid: vol=0.4, pan=0.75 (right-of-center)
  • Shadow: vol=0.35, pan=0.25 (left-of-center)

Output = Σ layer_i × gain_i × pan_i (stereo)
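A sketch of the concatenate-with-crossfade and panning steps, assuming a linear crossfade and linear panning law (the script's exact curves may differ, and all names are illustrative):

```python
import numpy as np

def concat_xfade(clips, xfade=64):
    """Concatenate mono clips with a short linear crossfade at each splice."""
    out = clips[0].astype(float).copy()
    ramp = np.linspace(0.0, 1.0, xfade)
    for clip in clips[1:]:
        clip = clip.astype(float)
        out[-xfade:] = out[-xfade:] * (1.0 - ramp) + clip[:xfade] * ramp
        out = np.concatenate([out, clip[xfade:]])
    return out

def pan_stereo(layer, vol, pan):
    """Linear pan: pan=0 is hard left, pan=1 is hard right."""
    return np.stack([layer * vol * (1.0 - pan), layer * vol * pan], axis=1)

# Two toy "event" clips for one agent, mixed Florid-style (vol 0.4, pan 0.75):
clips = [np.ones(200), 0.5 * np.ones(300)]
layer = concat_xfade(clips)
stereo = pan_stereo(layer, vol=0.4, pan=0.75)
print(layer.shape, stereo.shape)  # (436,) (436, 2)
```

The final stereo output would then be the sample-wise sum of each agent's panned layer.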

Unison Rate

For agents i and j at step s:

  unison(s) = 1 if history_i[s] == history_j[s] else 0
  unison_rate = Σ unison(s) / total_steps

Lower unison rates indicate more independent voices — true counterpoint.
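Computing the unison rate from two agents' event histories is simple; a sketch with made-up histories:

```python
def unison_rate(history_i, history_j):
    """Fraction of steps at which two agents chose the same event."""
    same = sum(a == b for a, b in zip(history_i, history_j))
    return same / len(history_i)

# Hypothetical event histories for two agents over 8 steps:
cantus = [0, 1, 2, 1, 0, 3, 2, 1]
florid = [4, 1, 5, 6, 0, 7, 5, 6]
print(unison_rate(cantus, florid))  # 0.25: the voices coincide at steps 1 and 4
```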

Preset Strategies

Preset 2: Duo (2 voices)

🎼 Two-Voice Counterpoint

Agents: 2 | Latent: 6 | Rigidity: 0.4 | Speed: 0.4

Character: Gentle duo — one Cantus, one Florid, moderate independence

Use on: Simple material, exploring two-voice counterpoint

Preset 3: Trio (3 voices)

🎵 Three-Voice Counterpoint

Agents: 3 | Latent: 8 | Rigidity: 0.5 | Speed: 0.5

Character: Balanced trio — Cantus, Florid, Shadow — full archetype set

Use on: General purpose, exploring three-voice interactions

Preset 4: Quartet (4 voices)

🎻 Four-Voice Ensemble

Agents: 4 | Latent: 10 | Rigidity: 0.6 | Speed: 0.5

Character: Denser texture — profiles cycle: Cantus, Florid, Shadow, Florid

Use on: Richer material, complex counterpoint

Preset 5: Dense Ensemble (5 voices)

🌫️ Five-Voice Texture

Agents: 5 | Latent: 12 | Rigidity: 0.7 | Speed: 0.6

Character: Dense polyphonic texture with strong repulsion

Use on: Complex material, dense textures

Preset 6: Tight Counterpoint

🔗 Highly Independent

Agents: 3 | Latent: 8 | Rigidity: 1.5 | Speed: 0.3

Character: Very strong repulsion — voices stay far apart in latent space, minimal unison

Use on: Maximizing voice independence

Preset 7: Free Scatter

🌀 Loose, Scattered

Agents: 3 | Latent: 10 | Rigidity: 0.1 | Speed: 1.2

Character: Very weak repulsion, high speed — agents wander freely, overlapping often

Use on: Loose, overlapping textures

Parameters & Controls

Agent Parameters

Parameter               Default   Description
Number_of_agents        3         2-6 agents (profiles cycle through Cantus/Florid/Shadow)
Latent_size             8         Autoencoder latent dimensions (2–32)
Counterpoint_rigidity   0.5       Repulsion strength between agents (0–2)
Speed                   0.5       Global agent movement speed (0.1–3)

Duration

Parameter                 Default   Description
Duration (0 = original)   0         Target output duration (seconds)

Output

Parameter            Default   Description
Seed                 42        Random seed for reproducibility
Draw_visualization   1         Generate the multi-panel analysis display
Play_result          1         Audition after processing

Visualization & Analysis

Multi-Panel Display

The Latent Counterpoint visualization:

Panel 1: TITLE
  • Script name, source name, preset, agent count, rigidity

Panel 2: INPUT WAVEFORM
  • Gray waveform with red dotted lines = event boundaries
  • Title: "Original (N events)"

Panel 3: OUTPUT WAVEFORM
  • Purple waveform = stereo counterpoint output
  • Title: "Counterpoint"
  • X-axis: Time (s)

Panel 4: ORIGINAL SPECTROGRAM
  • 0-5000 Hz spectrogram of the original
  • Title: "Original spectrogram"

Panel 5: OUTPUT SPECTROGRAM
  • 0-5000 Hz spectrogram of the counterpoint (L channel)
  • Title: "Counterpoint spectrogram (L channel)"

Panel 6: AGENT PROFILES
  • For each agent, color-coded by agent ID: Agent 0 (blue), Agent 1 (red), Agent 2 (green), Agent 3 (purple), etc.
  • Profile name (Cantus/Florid/Shadow)
  • Steps, unique events, repetition rate
  • Average travel distance in latent space
  • Title: "Agent Profiles:"

Panel 7: COUNTERPOINT
  • Unison rates for each agent pair (lower = more independent)
  • Title: "Counterpoint (unison rates — lower = more independent):"

Panel 8: SUMMARY
  • Event count, total unique events used, mean event duration
  • Autoencoder loss (initial → final), latent size, seed
  • Duration in/out, RMS comparison
  • Warnings, if any

Reading Agent Profiles

What the agent stats mean:
  • Steps: Number of events selected by this agent (should be similar across agents)
  • Unique: How many distinct source events the agent used — lower = more repetition
  • Rep rate: (steps - unique)/steps — higher = more repetition
  • Travel: Average latent distance between consecutive chosen events — higher = more exploratory
  • Periphery: Average distance of chosen events from the latent center — higher = more peripheral

Interpreting Unison Rates

What the numbers mean:
  • Unison rate is the percentage of time two agents select the same event
  • High unison (>30%): Voices are highly correlated — may sound like a single line
  • Medium unison (10-30%): Some independence, occasional agreements
  • Low unison (<10%): Highly independent voices — true counterpoint
  • Adjust counterpoint_rigidity to control unison rates

Applications

Electroacoustic Composition

Use case: Creating polyphonic textures from single-source material

Technique: Trio or Quartet presets on varied source material

Workflow:

Generative Music

Use case: Creating endless variations with different seeds

Technique: Same preset, different seed values

Examples:

Sound Design for Media

Use case: Creating layered, evolving textures

Technique: Dense Ensemble on appropriate sources

Applications:

Research & Education

Use case: Studying emergent polyphony, agent-based models

Technique: Compare presets on same source, examine agent trajectories

Learning outcomes:

Practical Workflow Examples

🎬 Film Scene: Multiple Personalities

Goal: Create 60-second cue representing multiple characters from a single voice

Settings:

  • Source: 30-second vocal improvisation
  • Preset: Quartet
  • Custom: rigidity=0.8 (strong independence), speed=0.4

Result: Four independent vocal streams, each with distinct character (Cantus, Florid, Shadow, Florid)

🎚️ Electronic Music: Polyphonic Pad

Goal: Create evolving pad with multiple layers

Settings:

  • Source: 8-second synth pad
  • Preset: Trio
  • Custom: speed=0.3 (slow), rigidity=0.3 (loose)

Result: 24-second evolving pad with three overlapping voices

πŸŽ™οΈ Voice Processing: Choral Effect

Goal: Create choral texture from solo voice

Settings:

  • Source: 10-second vocal phrase
  • Preset: Dense Ensemble (5 voices)
  • Custom: rigidity=0.5, speed=0.5

Result: Five-voice polyphony from a single voice — creates a choral effect

Troubleshooting Common Issues

Problem: Python not found or missing packages
Cause: Python not installed, or packages missing
Solution: Install Python and the required packages: pip install numpy soundfile scipy

Problem: Too few events detected
Cause: Source has few intensity peaks, or segmentation parameters are inappropriate
Solution: Use a source with more dynamic variation, or adjust the min/max event duration in the script

Problem: Autoencoder loss not decreasing
Cause: Too few steps, too small a latent size, or data too complex
Solution: Increase learning_steps, increase latent_size, or use a simpler source

Problem: All agents sound similar (high unison rate)
Cause: counterpoint_rigidity too low, or speed too high causing convergence
Solution: Increase rigidity, reduce speed, check agent profiles

Problem: Output has clicks
Cause: Crossfade insufficient at splice points
Solution: Increase XFADE_SEC in the Python script (currently 8 ms)

Advanced Techniques

Custom agent profiles:

In Python script, modify mass, max_speed, jitter_scale, and attraction_weight in Agent.__init__() to create new behavioral types.

Memory size adjustment:

Change a.memory_size (default 5) to control how strongly agents avoid recent events.

LRU penalty tuning:

Modify penalty scaling (currently median_dist * 0.3) to control repetition avoidance strength.

Multi-channel output:

Script outputs stereo. For multi-channel, modify reconstruct_polyphonic() to output N channels with custom panning.