KL Divergence Corpus Resynthesis — Information-Theoretic Mosaicking

Corpus A defines a target timbral distribution. Corpus B provides source grains. The script greedily builds an output by selecting grains that minimise KL divergence between the reference distribution and the growing output distribution — actively driving synthesis, not just analysing afterwards.

Author: Shai Cohen
Affiliation: Department of Music, Bar-Ilan University, Israel
Version: 1.0 (2026)
License: MIT License
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

What this does

This script implements information-theoretic corpus mosaicking. Two audio corpora are analysed: Corpus A (REFERENCE) defines a target timbral distribution; Corpus B (SOURCE) provides the building blocks (grains). The script greedily builds an output sound by repeatedly choosing the Corpus B grain that, when added, minimises the KL divergence between the reference distribution P and the running output distribution Q. KL actively drives synthesis — it is not computed after the fact.

What is KL divergence? Kullback-Leibler divergence measures how one probability distribution diverges from a reference distribution. D(P||Q) quantifies the information loss when Q approximates P. Lower KL = closer match. This script uses per-dimension normalised histograms (not GMMs) — an honest, feasible choice inside Praat. The greedy algorithm selects grains one by one, updating Q incrementally, so the output's timbral distribution progressively matches the reference.
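As a minimal sketch of the histogram model described above (not the script's own code), the following Python builds an eps-smoothed, normalised histogram for one feature dimension and compares two candidate outputs against a reference. The function names, the value range, the bin count and the eps value are all illustrative assumptions.

```python
import math

def histogram(values, lo, hi, n_bins, eps=1e-6):
    """Return an eps-smoothed, normalised histogram of `values` over [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        # Clamp out-of-range values into the first or last bin.
        idx = min(n_bins - 1, max(0, int((v - lo) / width)))
        counts[idx] += 1
    total = sum(counts) + n_bins * eps
    return [(c + eps) / total for c in counts]

def kl(p, q):
    """D(P||Q) in nats for two strictly positive distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def mean_kl(p_dims, q_dims):
    """Average per-dimension KL values (this is the feature-independence assumption)."""
    return sum(kl(p, q) for p, q in zip(p_dims, q_dims)) / len(p_dims)

# Hypothetical feature values for one dimension, scaled to [0, 1].
reference = [0.1, 0.2, 0.25, 0.8, 0.85, 0.9]
close = [0.15, 0.22, 0.3, 0.75, 0.9, 0.95]   # similar spread of values
far = [0.45, 0.5, 0.5, 0.55, 0.6, 0.52]      # concentrated in the middle

p = histogram(reference, 0.0, 1.0, 4)
kl_close = kl(p, histogram(close, 0.0, 1.0, 4))
kl_far = kl(p, histogram(far, 0.0, 1.0, 4))
print(kl_close < kl_far)
```

The eps term keeps every bin strictly positive, so the logarithm is always defined even when an output bin is still empty.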

Key Features:

Performance optimisation: The inner KL loop uses flat Praat vectors and precomputed corrections — no object access inside the hot loop. This makes the greedy selection feasible for hundreds of steps. For 8-second output at 10 ms hop (~800 grains) with 40 candidates per step, the script runs in seconds, not minutes.
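The source does not show the script's correction formulas, so the following is only one way the "base plus per-bin correction" idea could work, sketched in Python for forward KL on a single dimension. It assumes P is eps-smoothed and sums to 1; the names `base_and_corrections` and `forward_kl` are hypothetical.

```python
import math

def forward_kl(p, counts, eps):
    """Brute-force D(P||Q), where Q is the eps-smoothed, normalised `counts`."""
    total = sum(counts) + len(counts) * eps
    return sum(pk * math.log(pk * total / (c + eps)) for pk, c in zip(p, counts))

def base_and_corrections(p, counts, eps):
    """Precompute values so that KL after adding one count to bin b is base + corr[b]."""
    new_total = sum(counts) + 1 + len(counts) * eps
    # Terms that are identical whichever bin receives the new count.
    base = sum(pk * (math.log(pk) - math.log(c + eps)) for pk, c in zip(p, counts))
    base += math.log(new_total)  # valid because sum(p) == 1
    # Only bin b's log term changes when its count goes from c to c + 1.
    corr = [-pk * (math.log(c + 1 + eps) - math.log(c + eps))
            for pk, c in zip(p, counts)]
    return base, corr

eps = 1e-3
p = [0.5, 0.3, 0.15, 0.05]   # reference distribution (already smoothed)
counts = [3, 0, 1, 0]        # running output counts

base, corr = base_and_corrections(p, counts, eps)
fast = [base + corr[b] for b in range(len(counts))]

# Reference values: recompute the whole KL for each possible target bin.
slow = []
for b in range(len(counts)):
    trial = list(counts)
    trial[b] += 1
    slow.append(forward_kl(p, trial, eps))

best_bin = min(range(len(fast)), key=fast.__getitem__)
print(best_bin)
```

Under these assumptions, scoring a candidate costs one lookup per feature dimension instead of a full pass over the bins, which is the kind of saving the paragraph above describes.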

Quick start

  1. Prepare two folders: Corpus A (reference) and Corpus B (source) containing audio files (WAV, AIFF, MP3).
  2. In Praat, run KL_Divergence_Corpus_Resynthesis.praat.
  3. Enter folder paths (or leave blank for chooser dialogs).
  4. Choose a preset (Coarse/fast, Balanced, Fine match, Dense mosaic, or Custom).
  5. Adjust analysis parameters: Frame length (default 46.4 ms), Overlap ratio (1:1 to 1:8), Number of MFCCs, Histogram bins.
  6. Select KL mode (Forward / Reverse / Symmetric).
  7. Set Output_duration (seconds) and Crossfade (overlap between grains).
  8. Click OK — script analyses both corpora, builds reference distribution, runs greedy selection, synthesises output.
Quick tip: Use Coarse/fast for exploratory runs (16 bins, 20 candidate pool). Fine match (48 bins, 80 candidates) produces more precise timbral matching but takes longer. Balanced (32 bins, 40 candidates) is a good starting point. For maximum timbral accuracy, use Symmetric KL — it balances forward and reverse divergence, avoiding mode-seeking or mass-covering behaviour.
Important: Both corpora should contain monophonic or homogeneous timbres for meaningful results. Corpus A defines the target distribution — if it contains diverse sounds (e.g., both voice and drums), the output will try to match that mixture. Corpus B must be long enough to provide sufficient grains; the script can reuse grains (no "without replacement" constraint). The script automatically resamples all files to target_sample_rate (default 44.1 kHz) and normalises peak to 0.99 before analysis.

5 Presets

Preset         Histogram bins  Candidate pool  KL mode         Overlap            Character
Coarse / fast  16              20              Forward (P||Q)  1:1 (no overlap)   Fastest, coarser distribution matching, minimal smoothing.
Balanced       32              40              Symmetric       1:2 (50% overlap)  Good quality/time trade-off; recommended default.
Fine match     48              80              Symmetric       1:2 (50% overlap)  Higher precision, slower, more accurate distribution matching.
Dense mosaic   32              60              Symmetric       1:4 (75% overlap)  Denser grain placement, smoother temporal evolution.
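To make the preset values concrete, here is a small Python sketch that stores them as data and derives the hop size and step count. It assumes that an overlap ratio of 1:N means hop = frame_length / N, which agrees with the "50% overlap" and "75% overlap" labels; the dictionary keys and function names are illustrative and are not the script's variable names.

```python
# Preset values copied from the table; key names are hypothetical.
PRESETS = {
    "Coarse / fast": {"bins": 16, "pool": 20, "kl_mode": "forward", "overlap": 1},
    "Balanced": {"bins": 32, "pool": 40, "kl_mode": "symmetric", "overlap": 2},
    "Fine match": {"bins": 48, "pool": 80, "kl_mode": "symmetric", "overlap": 2},
    "Dense mosaic": {"bins": 32, "pool": 60, "kl_mode": "symmetric", "overlap": 4},
}

def hop_seconds(frame_length, overlap):
    """Hop size for a 1:overlap ratio (assumed: hop = frame_length / overlap)."""
    return frame_length / overlap

def overlap_fraction(overlap):
    """Fraction of each frame shared with the next one."""
    return 1.0 - 1.0 / overlap

def n_steps(output_duration, frame_length, overlap):
    """Number of greedy selection steps for a given output duration."""
    return round(output_duration / hop_seconds(frame_length, overlap))

frame_length = 0.0464  # default frame length in seconds
for name, preset in PRESETS.items():
    hop = hop_seconds(frame_length, preset["overlap"])
    steps = n_steps(30.0, frame_length, preset["overlap"])
    print(f"{name}: hop = {hop * 1000:.1f} ms, {steps} steps for 30 s")
```

Because the step count grows with the overlap ratio and each step scores a full candidate pool, Dense mosaic does roughly four times as many selection steps as Coarse / fast for the same output duration.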

KL Divergence & Greedy Selection

Kullback-Leibler divergence

Forward KL: D_KL(P || Q) = Σ_x P(x) · log(P(x) / Q(x))
Measures how much information is lost when Q approximates P. Minimising it tends to be mass-covering: Q is penalised heavily wherever P has mass that Q lacks, so Q spreads to cover P's support.

Reverse KL: D_KL(Q || P) = Σ_x Q(x) · log(Q(x) / P(x))
Minimising it tends to be mode-seeking: Q is penalised for placing mass where P has little, so Q concentrates on the high-probability regions of P.

Symmetric KL: (D_KL(P || Q) + D_KL(Q || P)) / 2
Balances both behaviours; often the best choice for timbral matching.
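The three modes can be compared numerically. The Python sketch below uses made-up four-bin distributions (a bimodal reference, a single-peak candidate and a flat candidate) to show which candidate each mode ranks as closer; the function name `kl_divergence` and its `mode` argument are illustrative assumptions.

```python
import math

def kl(p, q):
    """D(P||Q) in nats; both inputs must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def kl_divergence(p, q, mode="symmetric"):
    """Forward, reverse or symmetric KL between reference p and output q."""
    if mode == "forward":
        return kl(p, q)
    if mode == "reverse":
        return kl(q, p)
    if mode == "symmetric":
        return 0.5 * (kl(p, q) + kl(q, p))
    raise ValueError(f"unknown mode: {mode}")

p = [0.49, 0.01, 0.01, 0.49]        # bimodal reference
peaked = [0.97, 0.01, 0.01, 0.01]   # covers only one of the two modes
broad = [0.25, 0.25, 0.25, 0.25]    # spreads mass over every bin

for mode in ("forward", "reverse", "symmetric"):
    print(mode,
          round(kl_divergence(p, peaked, mode), 3),
          round(kl_divergence(p, broad, mode), 3))
```

With these numbers, forward KL scores the flat candidate lower (it leaves no reference mass uncovered), while reverse KL scores the single-peak candidate lower (it puts little mass where the reference is nearly empty).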

Greedy selection algorithm:
  1. Compute reference distribution P from Corpus A (per-dimension histograms, eps-smoothed).
  2. Initialise output distribution Q as zero counts.
  3. For each output step (duration / hop_size):
    • Precompute base KL (without adding any grain) and per-bin corrections.
    • Sample candidate pool (size = candidate_pool) from Corpus B grains.
    • For each candidate, quickly compute KL = (1/featDim) × Σd (base_d + corr_d[addBin])
    • Select grain with lowest KL, commit it: update Q counts, append grain to output.
  4. Resynthesise output by overlap-add of selected grains (crossfade = crossfade seconds).
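Steps 1 to 3 above can be sketched in Python as follows. This is a simplified illustration rather than the script's implementation: each grain is reduced to one histogram-bin index per feature dimension, candidates are scored by recomputing forward KL in full instead of using precomputed corrections, and every name and constant is hypothetical.

```python
import math
import random

EPS = 1e-6  # smoothing constant (illustrative value)

def smooth(counts):
    """Turn raw bin counts into an eps-smoothed probability distribution."""
    total = sum(counts) + len(counts) * EPS
    return [(c + EPS) / total for c in counts]

def mean_forward_kl(p_dims, q_counts):
    """Average D(P||Q) over feature dimensions."""
    total = 0.0
    for p, counts in zip(p_dims, q_counts):
        q = smooth(counts)
        total += sum(pk * math.log(pk / qk) for pk, qk in zip(p, q))
    return total / len(p_dims)

def greedy_mosaic(p_dims, grains, n_steps, pool_size, seed=0):
    """grains: list of tuples holding one bin index per feature dimension."""
    rng = random.Random(seed)
    q_counts = [[0] * len(p) for p in p_dims]
    chosen, history = [], []
    for _ in range(n_steps):
        pool = rng.sample(range(len(grains)), min(pool_size, len(grains)))
        best_idx, best_kl = None, math.inf
        for idx in pool:
            # Tentatively add the candidate, score it, then undo the change.
            for d, b in enumerate(grains[idx]):
                q_counts[d][b] += 1
            score = mean_forward_kl(p_dims, q_counts)
            for d, b in enumerate(grains[idx]):
                q_counts[d][b] -= 1
            if score < best_kl:
                best_idx, best_kl = idx, score
        # Commit the winner: grains may be reused in later steps.
        for d, b in enumerate(grains[best_idx]):
            q_counts[d][b] += 1
        chosen.append(best_idx)
        history.append(best_kl)
    return chosen, history, q_counts

# Toy data: two feature dimensions with three bins each.
p_dims = [smooth([6, 3, 1]), smooth([1, 1, 8])]
grains = [(a, b) for a in range(3) for b in range(3)]
chosen, history, q_counts = greedy_mosaic(p_dims, grains, n_steps=40, pool_size=9)
print(q_counts, round(history[-1], 4))
```

In this toy run the output counts end up roughly proportional to the reference histograms. With a candidate pool smaller than the corpus, each step sees only a random subset, so the match is generally looser.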
Why greedy, not global optimisation? Choosing the best grain for every output step jointly is a combinatorial problem (comparable to set cover with a KL objective) and is intractable for realistic corpus sizes. Greedy selection is only locally optimal, but each step is cheap: KL is additive over the independent feature dimensions, so a candidate can be scored from precomputed per-dimension terms. In empirical tests the greedy KL decreases monotonically and appears to plateau near the global optimum.

Feature Space (16 dimensions)

MFCC c1–c13 (13 dims)

Mel-frequency cepstral coefficients: spectral envelope shape (timbre). c1 = overall spectral tilt, c2–c13 = finer formant structure.

RMS (dB) — 1 dim

Root-mean-square amplitude converted to decibels. Captures loudness/dynamics.

Spectral centroid — 1 dim

Centre of gravity of the spectrum (brightness). Higher = brighter timbre.

Spectral spread — 1 dim

Standard deviation of the spectrum around the centroid (bandwidth). Captures noisiness/roughness.

Why these features? MFCCs capture timbral envelope; RMS captures dynamics; centroid captures brightness; spread captures noisiness. Together, they form a compact representation that distinguishes most musical timbres (voice vs. piano vs. drums, bright vs. dark, loud vs. soft). The per-dimension histogram model assumes feature independence — a simplification, but effective for corpus mosaicking.
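For readers who want to see the three non-MFCC features spelled out, here is a standard-library Python sketch that computes RMS in dB, spectral centroid and spectral spread for one frame. It uses a naive DFT and power weighting, both of which are assumptions; the script itself runs inside Praat and may compute these values differently. MFCCs are omitted because they need a mel filterbank and a cosine transform.

```python
import cmath
import math

def rms_db(frame, floor=1e-12):
    """RMS level in dB relative to an amplitude of 1.0."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(mean_square, floor))

def power_spectrum(frame):
    """Naive O(N^2) DFT power for bins 0..N/2 (use an FFT for real frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def centroid_and_spread(frame, sample_rate):
    """Power-weighted mean frequency and standard deviation around it, in Hz."""
    power = power_spectrum(frame)
    total = sum(power)
    if total == 0.0:
        return 0.0, 0.0  # silent frame
    freqs = [k * sample_rate / len(frame) for k in range(len(power))]
    centroid = sum(f * p for f, p in zip(freqs, power)) / total
    variance = sum((f - centroid) ** 2 * p for f, p in zip(freqs, power)) / total
    return centroid, math.sqrt(variance)

# Test frame: a 1000 Hz sine, exactly 8 cycles in 64 samples at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 8 * i / 64) for i in range(64)]

level = rms_db(frame)
centroid, spread = centroid_and_spread(frame, sample_rate)
print(round(level, 2), round(centroid, 1), round(spread, 3))
```

A pure sine concentrates all its power in one bin, so its spread is close to zero; noisier frames distribute power across bins and produce a larger spread.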

Applications

Timbral transfer / style emulation

Use case: Corpus A = recordings of a specific instrument (e.g., cello), Corpus B = a different instrument (e.g., voice). The output resynthesises the voice such that its timbral distribution matches the cello — a form of timbre transfer without machine learning.

Settings: Balanced preset, Symmetric KL. Output duration = length of desired output.

Corpus summarisation / interpolation

Use case: Corpus A = long recording, Corpus B = short fragments from the same recording. The output selects grains that approximate the long recording's distribution — effectively a "timbre summary".

Settings: Fine match preset, output duration shorter than original.

Experimental / generative music

Use case: Corpus A = desired timbral "target" (e.g., a specific synthetic texture), Corpus B = a large database of found sounds. The output is a mosaic that sounds like it's made from Corpus B but with the distribution of Corpus A.

Settings: Custom: high histogram bins (48), large candidate pool (80), reverse KL for mode-seeking behaviour (extreme timbral focus).

Workflow: Voice → cello timbre transfer

Corpus A (reference): 5 minutes of solo cello (sustained tones, pizzicato, arco).
Corpus B (source): 2 minutes of spoken voice (vowels, consonants, breaths).
Settings: Balanced preset, Symmetric KL, output_duration = 30 s.
Result: The output sounds like voice fragments but with the timbral distribution of cello — vocal formants shaped like cello spectral envelopes.

Workflow: Drum loop → ambient texture

Corpus A (reference): 10-minute ambient drone (sustained, smooth).
Corpus B (source): 30-second drum loop (transient-rich, noisy).
Settings: Forward KL (mass-covering; the Coarse preset default), Coarse preset, output_duration = 60 s.
Result: The output selects drum hits whose spectral features match the drone's distribution — likely quiet, sustained drum sounds (e.g., cymbal rolls, tom sustains) rather than attacks.

Workflow: Corpus interpolation (morphing)

Corpus A: Bright, high-centroid sounds (e.g., piccolo, bells).
Corpus B: Dark, low-centroid sounds (e.g., bass clarinet, cello).
Settings: Symmetric KL, Balanced preset. To interpolate, build a custom target distribution by mixing the two corpora's histograms, e.g. 0.5×P_A + 0.5×P_B (this requires combining the histograms manually).
Result: Output sounds like a timbral cross-fade between the two extremes.
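The manual histogram combination mentioned in this workflow amounts to a weighted average per bin. A minimal Python sketch, assuming both histograms use the same bin edges; the function name and the example values are hypothetical:

```python
def mix_histograms(p_a, p_b, weight_a=0.5):
    """Blend two normalised histograms with identical binning and renormalise."""
    if len(p_a) != len(p_b):
        raise ValueError("histograms must have the same number of bins")
    mixed = [weight_a * a + (1.0 - weight_a) * b for a, b in zip(p_a, p_b)]
    total = sum(mixed)
    return [m / total for m in mixed]

# Hypothetical centroid histograms: one dark-leaning, one bright-leaning.
p_dark = [0.7, 0.2, 0.1]
p_bright = [0.1, 0.2, 0.7]

target = mix_histograms(p_dark, p_bright, weight_a=0.5)
print([round(t, 3) for t in target])
```

The same mix would be applied to every feature dimension. Changing `weight_a` from 0 to 1 across several runs gives a sequence of targets between the two extremes.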

Troubleshooting:
Output sounds static / repetitive: If Corpus B has few distinct grains (small folder or short files), the algorithm will reuse grains. Increase Max_files_per_corpus or use a larger Corpus B.
KL barely decreases: This happens when P and Q are already similar from the start, or when Corpus B cannot match P's distribution (e.g., P has mass at high centroid values but Corpus B has no high-centroid sounds). Check the visualisation histogram.
Output clicks at grain boundaries: Increase Crossfade (e.g., 0.02–0.05 s). Ensure the overlap ratio is greater than 1:1.
Processing is slow: Reduce candidate_pool (20–30) or histogram_bins (16–24), or use the Coarse preset. Fewer bins also make the histogram updates faster.
Corpus B has many files, but the script uses only a few: Check Max_files_per_corpus (0 = no limit). Also ensure files are longer than frame_length + hop_size (shorter files are skipped).

Visualisation: P vs. Q histograms

When Draw_visualization is enabled, the script displays:
  • Title bar — preset name and displayed feature dimension
  • Histogram panel — light blue bars = reference distribution P (Corpus A), red bars = achieved distribution Q (output mosaic).
  • Legend — KL mode, bin count, initial and final KL values.
The histogram shows how well the greedy selection matched the target distribution. If red bars poorly match blue bars, try increasing candidate_pool or histogram_bins, or switch KL mode.
Grain boundary overlap: The script places grains at intervals of frame_length - crossfade. If crossfade is large (e.g., frame_length/2), grains overlap significantly, producing smooth transitions. Overlap ratio presets (1:2, 1:4, 1:8) control how many grains overlap — higher overlap = smoother, more blurring.
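The placement rule above (grains spaced frame_length - crossfade apart) can be illustrated with a short overlap-add sketch in Python. It works in samples rather than seconds and uses a linear crossfade, which is an assumption: the source does not state the script's fade shape. The function name is hypothetical.

```python
def overlap_add(grains, crossfade):
    """Join equal-length grains (lists of samples), overlapping by `crossfade` samples."""
    if not grains:
        return []
    length = len(grains[0])
    if not 0 <= crossfade <= length // 2:
        raise ValueError("crossfade must be between 0 and half the grain length")
    hop = length - crossfade  # spacing between grain start points
    out = [0.0] * (hop * (len(grains) - 1) + length)
    last = len(grains) - 1
    for g, grain in enumerate(grains):
        start = g * hop
        for i, x in enumerate(grain):
            gain = 1.0
            if g > 0 and i < crossfade:
                gain = (i + 1) / (crossfade + 1)       # linear fade-in
            elif g < last and i >= length - crossfade:
                gain = (length - i) / (crossfade + 1)  # matching fade-out
            out[start + i] += gain * x
    return out

# Three constant grains: a correct crossfade keeps the level at 1.0 throughout.
grains = [[1.0] * 8 for _ in range(3)]
output = overlap_add(grains, crossfade=2)
print(len(output), [round(v, 3) for v in output])
```

The fade-in and fade-out gains sum to one at every overlapping sample, so a steady signal passes through without a level dip or bump. The first grain's start and the last grain's end are left unfaded in this sketch.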