KD-Tree Timbral Counterpoint — Neural Counterpoint Engine

Creates N contrapuntal layers from a target sound by searching a corpus for grains at different timbral distances using a KD-Tree for efficient N-dimensional nearest-neighbour lookup across MFCC, pitch, centroid, intensity, HNR, and ZCR features.

Author: Shai Cohen
Affiliation: Department of Music, Bar-Ilan University, Israel
Version: 1.0 (2026)
License: MIT License
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

What this does

This script implements KD-Tree Timbral Counterpoint — a computational approach to generating polyphonic textures by searching a corpus of audio grains for matches at different timbral distances. A target sound is divided into overlapping grains, each described by 11 acoustic features (6 MFCCs, pitch, spectral centroid, intensity, HNR, ZCR). A KD-Tree provides efficient N-dimensional nearest-neighbour lookup across a corpus of WAV files. Each voice is assigned a different neighbour rank, producing layers ranging from close imitation to distant timbral echo.

What is Timbral Counterpoint? Traditional counterpoint arranges pitches in time. Timbral counterpoint arranges timbres — each layer searches the same corpus at different distances in feature space. Voice 1 (rank 1) finds nearly identical grains (close imitation). Voice 2 (rank 3) finds moderately different grains (shadow). Voice 3 (rank 8) finds quite distant grains (cousin). Voice 4 (rank 20) finds very distant grains (ghost). The result is a polyphonic texture where each voice echoes the target's structure but with different timbral characters, like a canon of sounds.

Key Features:

Technical Implementation:

1. Feature extraction: target and corpus sounds are divided into overlapping grains (grain_size_ms, overlap_percent). For each grain: MFCC (6 coefficients), pitch (mean F0), spectral centroid, intensity (RMS), HNR, ZCR.
2. KD-Tree construction: corpus features are standardized, weighted, and indexed.
3. Neighbour search: for each target grain and each voice, find the rank-N neighbour (with randomness and a repetition penalty).
4. Resynthesis: extract the corpus grain, apply gain, pan (cosine law), delay, crossfade, and place it in the voice timeline.
5. Envelope shaping: optional pause detection and amplitude envelope from the target.
6. Visualization: target/mix waveforms, spectrograms, contrapuntal voice timeline, summary panel.

Quick start

  1. In Praat, select exactly one Sound object (the target).
  2. Prepare a corpus folder containing .wav files (any length, any sample rate).
  3. Run KDTreeTimbralCounterpoint.praat (in Praat: Praat → Open Praat script…, then Run).
  4. Enter Corpus folder path (use forward slashes).
  5. Set Grain size (ms) and Overlap (%) — smaller grains = more granular detail.
  6. Choose a preset or configure custom weights and neighbour ranks.
  7. Adjust Envelope shaping (Off / Pauses only / Amplitude envelope only / Both).
  8. Click OK — script extracts features, runs KD-Tree matching, and resynthesises polyphony.
Quick tip: Start with Spectral Counterpoint preset for balanced results. For close imitation, use Strict Doppelgänger. For ethereal textures, use Ghost Choir (ranks 5,12,25,50). Enable Draw_visualization to see the contrapuntal voice timeline — colour-coded blocks show where each voice plays. The script automatically normalises sample rates to 44.1 kHz and trims trailing silence with a cosine fade-out.
Important: Python + numpy + scipy required — install with pip install numpy scipy (scipy for cKDTree). Corpus folder must contain at least one .wav file. Grain size should not exceed the shortest corpus sound. Processing time depends on corpus size — a few hundred grains takes seconds; thousands may take minutes. Envelope shaping (pauses) uses intensity thresholding (-35 dB) — adjust if target has low-level noise.

4 Contrapuntal Voices — Built-in Separation Rules

Voice 1 — Close

Neighbour rank: 1 (closest match)

Pan: Centre (0.0)

Delay: 0 ms

Gain: 0.9

Character: Almost identical to target — close imitation, like a canon at the unison.

Voice 2 — Shadow

Neighbour rank: 3

Pan: Alternating ±0.5

Delay: 20 ms

Gain: 0.65

Character: Slightly different timbre, stereo separation, short echo.

Voice 3 — Cousin

Neighbour rank: 8

Pan: Alternating ±0.8

Delay: 45 ms

Gain: 0.45

Character: Distantly related timbre, wide stereo, noticeable delay.

Voice 4 — Ghost

Neighbour rank: 20

Pan: Random ±1.0

Delay: 80 ms

Gain: 0.30

Character: Very different timbre, unpredictable panning, long echo — like a ghost of the original.

Pan law: Script uses cosine panning: left_gain = cos(angle), right_gain = sin(angle), where angle = (pan + 1) × π/4. This ensures constant power across the stereo field.
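The pan law above can be sketched in Python as follows (the function name `cosine_pan` is illustrative, not taken from the script):

```python
import math

def cosine_pan(pan):
    """Constant-power pan: pan in [-1, +1] -> (left_gain, right_gain).

    angle = (pan + 1) * pi/4 maps pan -1..+1 onto 0..pi/2, so
    left^2 + right^2 = cos^2 + sin^2 = 1 at every position.
    """
    angle = (pan + 1.0) * math.pi / 4.0
    return math.cos(angle), math.sin(angle)
```

At centre (pan = 0) both channels get 1/√2 ≈ 0.707, so the summed power matches a hard-panned grain; a linear pan law would instead dip by about 3 dB in the middle.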

Presets

| Preset | Ranks | Weights (MFCC/Pitch/Cent/Int/HNR) | Randomness | Character |
|---|---|---|---|---|
| Strict Doppelgänger | 1, 2, 3, 4 | MFCC=1.5, Int=1.0, others=0.8 | 0.05 | Very close imitation — almost identical layers, subtle stereo spread. |
| Spectral Counterpoint | 1, 3, 8, 20 | Balanced (1.0, 0.8, 0.5, 0.5, 0.5) | 0.2 | Standard preset — clear hierarchy from close to distant. |
| Ghost Choir | 5, 12, 25, 50 | Balanced | 0.5 | All voices distant — ethereal, choir-like texture with unpredictable timbres. |
| Orchestral Shadow | 1, 3, 8, 20 | Pitch=1.5, Centroid=1.2, others=1.0 | 0.2 | Emphasises pitch and spectral shape — good for melodic material. |
| Noise Doppelgänger | 1, 3, 8, 20 | HNR=2.0, Centroid=1.5, others=1.0 | 0.2 | Emphasises noise content — transforms pitched sounds into noisy textures. |
| Custom | user-defined | user-defined | user-defined | Full manual control over all parameters. |
Weighted Euclidean distance in standardized feature space, between a target grain t and a corpus grain c:
distance = sqrt( Σ w_i × ((t_i - c_i) / σ_i)² )

Where:
- t_i, c_i = feature values of the target and corpus grains (MFCC1..6, centroid, pitch, intensity, HNR, ZCR)
- μ_i, σ_i = mean and standard deviation computed from all corpus grains (μ_i cancels in the difference but is used when standardizing features for the tree)
- w_i = user weight per feature group (MFCCs share the same weight)
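A minimal sketch of this distance, assuming plain Python lists for the feature vectors (the function name `timbral_distance` is illustrative):

```python
import math

def timbral_distance(t, c, sigma, w, eps=1e-8):
    """Weighted Euclidean distance between two feature vectors t and c,
    with per-feature standard deviations sigma and user weights w."""
    return math.sqrt(sum(
        wi * ((ti - ci) / (si + eps)) ** 2
        for ti, ci, si, wi in zip(t, c, sigma, w)
    ))
```

Identical vectors yield distance 0; raising a feature's weight stretches that axis, so mismatches there count for more when ranking neighbours.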

Feature Space — 11 Dimensions

| Feature | Description | Weight group |
|---|---|---|
| MFCC 1–6 | Mel-frequency cepstral coefficients — spectral envelope shape (timbre). | mfcc_weight (default 1.0) |
| Pitch (F0) | Fundamental frequency in Hz (0 for unvoiced). | pitch_weight (default 0.8) |
| Spectral centroid | Centre of gravity of the spectrum (brightness). | spectral_centroid_weight (default 0.5) |
| Intensity | RMS energy in dB. | intensity_weight (default 0.5) |
| HNR | Harmonics-to-noise ratio — periodicity vs. noise content. | hnr_weight (default 0.5) |
| ZCR | Zero-crossing rate — correlates with noisiness/spectral slope. | hnr_weight (same as HNR) |

KD-Tree indexing

The script builds a scipy.spatial.cKDTree from all corpus grains after standardization and weighting. Query time is O(log N) — efficient even for large corpora (tens of thousands of grains). For each target grain and each voice, the tree is queried for k = (rank + 30) neighbours, then the rank-th neighbour is selected (with randomness and repetition penalty).
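The build-and-query pattern can be sketched as follows, using synthetic random features in place of real corpus grains (the variable names and the unit weight vector are illustrative assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 11))           # 1000 corpus grains, 11 features
mu = corpus.mean(axis=0)
sigma = corpus.std(axis=0) + 1e-8              # epsilon avoids division by zero
w = np.ones(11)                                # per-feature weights (all 1.0 here)

# Standardize and weight the corpus, then index it.
tree = cKDTree((corpus - mu) / sigma * w)

# Query one target grain for k = rank + 30 neighbours, pick the rank-th.
target = rng.normal(size=11)
rank = 8
dists, idxs = tree.query((target - mu) / sigma * w, k=rank + 30)
chosen = idxs[rank - 1]                        # 0-based index of rank-th neighbour
```

Querying for rank + 30 candidates (rather than exactly rank) leaves headroom for the randomness shift and the repetition penalty to move the selection away from the nominal rank.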

Repetition penalty: When enabled, the script avoids selecting any corpus grain that appears in the last 20 selections for the current voice. This prevents the same grain from repeating too frequently, encouraging variety across the timeline.
Randomness parameter: Adds stochastic variation to the selected rank: actual_rank = rank + random(-shift..+shift), where shift = rank × randomness_amount. Higher randomness = more exploration of distant neighbours, less strict adherence to the rank hierarchy.
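The two mechanisms above can be combined roughly as follows; the walk-outward fallback when all nearby candidates were recently used is an assumption about the tie-breaking, not documented behaviour:

```python
import random

def pick_with_jitter(neighbour_idxs, rank, randomness, recent, memory=20):
    """Pick a corpus-grain index near the nominal rank.

    neighbour_idxs: indices returned by the KD-Tree query, nearest first.
    randomness: 0..1; shift = rank * randomness samples a jittered rank.
    recent: previously selected indices for this voice (repetition penalty).
    """
    shift = int(rank * randomness)
    r = rank + random.randint(-shift, shift) if shift > 0 else rank
    r = max(1, min(r, len(neighbour_idxs)))
    # Walk outward from the jittered rank until an unused grain is found.
    for offset in range(len(neighbour_idxs)):
        for cand in (r - 1 + offset, r - 1 - offset):
            if 0 <= cand < len(neighbour_idxs) and \
                    neighbour_idxs[cand] not in recent[-memory:]:
                return neighbour_idxs[cand]
    return neighbour_idxs[r - 1]  # everything is recent: accept a repeat
```

With randomness = 0 and an empty history this reduces to deterministic rank selection; the penalty only kicks in when the jittered choice was used within the last 20 picks.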

Visualization (Praat Picture)

When Draw_visualization is enabled, the script generates a comprehensive 8×8 cm picture with six panels:

| Panel | Content |
|---|---|
| Title bar | Script name, sound name, voice count, ranks, preset name. |
| Target waveform | Target sound with vertical lines at each grain boundary. |
| Output mix waveform | Final stereo mix after all voices, trimming, and fade-out. |
| Target spectrogram | 0–5000 Hz, Gaussian window — left side of canvas. |
| Output spectrogram | Right side of canvas — compare spectral transformation. |
| Contrapuntal voice timeline | Colour-coded horizontal bars for each voice (V1–V4). Each bar = one corpus grain placed in time. Gaps = silence between grains (preserves target rhythm). |
| Summary panel | Voices, ranks, randomness, grain parameters, file counts, durations, weights, envelope shaping mode. |
Interpreting the voice timeline: Each coloured block represents a corpus grain placed at a specific time. Block colour varies by voice (blue, red, green, purple). Gaps between blocks occur when the target grain sequence has pauses — the script preserves the target's rhythmic structure exactly. This reveals how the target's temporal shape is mirrored across all voices, but with different timbral content.

Applications

Generative Polyphonic Texture

Use case: Transform a monophonic recording into a rich, evolving polyphonic texture with 4 distinct timbral layers.

Technique: Use Spectral Counterpoint preset. Target can be voice, instrument, or field recording. Corpus should contain sounds with complementary timbres (e.g., same instrument different articulations, or cross-synthesis between voice and synthesisers).

Timbral Canon / Echo

Use case: Create a canon where each voice echoes the target's structure but with different timbres, like a "sound canon".

Technique: Strict Doppelgänger with ranks 1, 2, 3, 4, short grain size (50–100 ms), low randomness. The result: four nearly identical layers with slight timbral variations, panned across the stereo field.

Corpus Exploration & Composition

Use case: Discover unexpected relationships within a corpus. Use a silent target (or constant tone) to "scan" the corpus at different ranks.

Technique: Target a simple sine wave or white noise. The result is a composition where grains are selected based on timbral proximity to the target — revealing the corpus's internal structure.

Experimental Music & Sound Art

Use case: Generate impossible textures — sounds that are simultaneously identical in rhythm but wildly different in timbre.

Technique: Ghost Choir preset with large grain size (500–1000 ms), high randomness (0.5–0.8), envelope shaping = "Pauses only" to preserve target rhythm while letting timbres drift.

Workflow: Voice → Percussion Ghost Choir

Target: Spoken phrase (3–5 seconds).
Corpus: Folder of drum loops, cymbals, foley sounds.
Settings: Ghost Choir preset, grain=150 ms, overlap=50%, envelope shaping = "Pauses only".
Result: The spoken rhythm is preserved, but each syllable triggers drum/percussion grains at different timbral distances — a percussive "ghost choir".

Workflow: Piano → Orchestral Shadow

Target: Piano melody.
Corpus: Orchestral samples (strings, woodwinds, brass, pizzicato).
Settings: Orchestral Shadow preset (pitch_weight=1.5, centroid_weight=1.2).
Result: Each piano note is echoed by orchestral grains with similar pitch and brightness — a spectral canon where piano becomes the conductor of an imaginary orchestra.

Troubleshooting Common Issues:
Scipy not installed: Run pip install scipy. The script falls back to brute-force search (linear scan of all corpus grains per query) but warns about performance.
No .wav files in corpus: Ensure corpus folder contains WAV files (mono or stereo, any sample rate — script resamples to 44.1 kHz).
Grain size too large: If grain size exceeds the shortest corpus sound, extraction fails. Use smaller grains (e.g., 100–300 ms).
Output silence: Check that corpus grains have reasonable amplitude. Increase final_gain in voice rules (edit script) or use envelope shaping = "Amplitude envelope only" to boost soft grains.
KD-Tree query slow: Reduce corpus size or increase grain size (fewer grains). For huge corpora (10k+ grains), consider subsetting.
Envelope shaping ignores low-level content: Adjust silence threshold in the To TextGrid (silences) command (currently -35 dB) — lower for quiet sources.

Advanced: Custom Voice Separation

To modify voice rules (gain, pan, delay, rank), edit the Python section in kd_tree_timbral_counterpoint.py:
if v_idx == 0:  # Voice 1
    gain, pan, delay = 0.9, 0.0, 0.0
elif v_idx == 1:  # Voice 2
    gain, pan, delay = 0.65, pan_val, 20.0
elif v_idx == 2:  # Voice 3
    gain, pan, delay = 0.45, pan_val, 45.0
else:  # Voice 4+
    gain, pan, delay = 0.30, random.uniform(-1.0, 1.0), 80.0

Mathematical Deep Dive

Feature Extraction per Grain

For each grain (start to end):

- MFCC (13 coefficients, first 6 used): cepstral representation of the spectral envelope
- Pitch: median F0 via Praat's autocorrelation method (75–600 Hz)
- Spectral centroid: Σ (freq × magnitude) / Σ magnitude
- Intensity: RMS = sqrt( mean(x²) ), converted to dB
- HNR (harmonicity): ratio of periodic to noisy energy, in dB
- ZCR: zero crossings per second = n_zero_crossings / duration
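The features that don't need Praat's analysers (ZCR, RMS intensity, spectral centroid) can be sketched in NumPy as below; MFCC, pitch, and HNR come from Praat itself, and the function name `grain_features` is illustrative:

```python
import numpy as np

def grain_features(x, sr):
    """ZCR (crossings/s), RMS intensity (dB), spectral centroid (Hz)
    for one mono grain x sampled at sr Hz."""
    duration = len(x) / sr
    zcr = np.count_nonzero(np.diff(np.signbit(x))) / duration
    rms = np.sqrt(np.mean(x ** 2))
    intensity_db = 20.0 * np.log10(rms + 1e-12)
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    centroid = float(np.sum(freqs * mag) / (np.sum(mag) + 1e-12))
    return zcr, intensity_db, centroid
```

For a 440 Hz sine this gives roughly 880 crossings per second (two per cycle), about -3 dB RMS (full-scale sine), and a centroid near 440 Hz, which is a quick sanity check for the feature pipeline.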

Standardization & Weighting

Corpus feature matrix X_c (N × 11):

μ = mean(X_c, axis=0)
σ = std(X_c, axis=0) + ε   (ε = 1e-8 to avoid division by zero)

Standardized: X_std = (X - μ) / σ
Weighted: X_w = X_std × w   (w broadcasts across feature dimensions)

Euclidean distance in weighted, standardized space:

d(p, q) = || (p - q) × w / σ ||₂

Resynthesis: Grain Placement & Crossfade

Each selected corpus grain is processed as follows:

1. Extract the grain [c_start, c_end] from the corpus file.
2. Apply gain (linear amplitude scaling).
3. Apply cosine panning to stereo.
4. Apply delay (shift the start time by delay_ms / 1000).
5. Apply a cosine crossfade at grain boundaries (crossfade_duration, default 50 ms):
   fade-in: 0.5 - 0.5 × cos(π × t / fade_dur)
   fade-out: 0.5 + 0.5 × cos(π × t / fade_dur)
6. Place the grain into the voice timeline — non-overlapping placement (simple scheduler).
7. Mix all voices into the stereo output.
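The crossfade windows in step 5 can be generated as below (a sketch; the function name `cosine_fades` is illustrative):

```python
import numpy as np

def cosine_fades(n_fade):
    """Equal-gain cosine fade-in/fade-out envelopes of n_fade samples.

    fade_in(t)  = 0.5 - 0.5 * cos(pi * t)   rises 0 -> 1
    fade_out(t) = 0.5 + 0.5 * cos(pi * t)   falls 1 -> 0
    and fade_in + fade_out = 1 at every sample.
    """
    t = np.arange(n_fade) / n_fade
    fade_in = 0.5 - 0.5 * np.cos(np.pi * t)
    fade_out = 0.5 + 0.5 * np.cos(np.pi * t)
    return fade_in, fade_out
```

Because the two windows sum to exactly 1, overlapping the fade-out of one grain with the fade-in of the next keeps the summed amplitude constant across the boundary, avoiding clicks and level dips.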