Semantic Timbre Retrieval — User Guide

Free‑text timbre search over a corpus of audio files. Type a prompt like “dark airy swelling scrape” – the engine analyses the corpus, extracts 8 semantic dimensions (brightness, noisiness, tonalness, stability, impulsiveness, sustain, roughness, spatiality), applies rule‑based tagging, and retrieves the best‑matching files or segments using hybrid scoring (semantic + tag + keyword). Returns a ranked list and an optional preview montage.

Author: Shai Cohen
Affiliation: Department of Music, Bar‑Ilan University, Israel
Version: 1.0 (2026)
License: MIT License
Repo: GitHub

Contents:

What it does
Quick start
The 8 semantic dimensions
Prompt parsing
Scoring & ranking
Parameters
Visualization
FAQ / troubleshooting

What this does

Semantic Timbre Retrieval is a content‑based audio search engine that lets you find sounds by describing them in everyday language. Given a folder of audio files (the corpus) and a free‑text prompt (e.g., “dark airy swelling scrape”), the system:

Scans the corpus and segments each file into regions (energy‑based or fixed‑window).
Extracts 16 raw acoustic features (RMS, spectral centroid, ZCR, onset strength, pitch confidence, etc.).
Derives 8 semantic dimensions (brightness, noisiness, tonalness, stability, impulsiveness, sustain, roughness, spatiality) in [0,1].
Applies rule‑based tagging (e.g., “dark”, “airy”, “swelling”, “scrape”) from a human‑editable JSON rule file.
Parses the free‑text prompt using a lexicon (intensity modifiers, synonyms, negation).
Scores each segment via a weighted hybrid of semantic distance, tag matches, and keyword presence.
Ranks results with a diversity penalty (prevents the same source file from dominating).
Exports ranked results to CSV and optionally builds a preview montage WAV.

No neural networks, no embeddings, no external APIs. The system uses hand‑crafted rules (JSON) for tagging and a lexicon for prompt parsing. All processing is local, fast, and explainable. The output includes a short caption for each match (e.g., “bright tonal sustained gesture with smooth edge”).

Quick start

Prepare a folder of audio files (WAV, FLAC, AIFF, MP3, OGG).
In Praat, run the script semantic_timbre_retrieval.praat.
Select the corpus folder.
Enter a Prompt (e.g., “dark airy swelling scrape”).
Choose Retrieval_mode:
- files – each file is treated as a single item.
- segments – files are split into shorter segments (energy‑based with min/max duration).
Set Top_matches (number of results to return).
Optionally enable Build_preview_montage – creates a concatenated WAV of the top matches with crossfades.
Adjust weights: Weight_semantic (dimension distance), Weight_tag (tag matches), Weight_keyword (keyword presence), Diversity_penalty (avoid over‑sampling the same file).
Click OK. Python scans the corpus, extracts features, scores, and returns a list of matches in the Info window. If preview is enabled, a Sound object named str_preview_prompt is created.

Tip: Start with a small corpus (10–20 files) to test the prompt. Use segments mode for detailed retrieval, files mode for a coarse match. The preview montage is a great way to audition the top results.

Important: Python dependencies: numpy, scipy, soundfile, librosa. The first run on a large corpus may take a few minutes while librosa extracts features. Subsequent runs take about as long, because the engine re‑analyses the corpus each time (no persistent index yet).

The 8 semantic dimensions

The engine maps raw acoustic features to eight perceptual dimensions, each normalised to [0,1]. These form the “semantic space” for retrieval.

[…] 1 – noisiness + pitch confidence. […]

Dimension	Low (0)	High (1)	Derivation
brightness
noisiness
tonalness
stability
impulsiveness
sustain
roughness
spatiality

Prompt parsing (lexicon‑driven)

How the prompt is interpreted

The system uses two JSON files:

semantic_prompt_lexicon.json – defines intensity modifiers (“slightly”, “very”), synonyms (“deep” → “dark”), negations (“not”, “without”), and token definitions (each token has a target dim delta and tag boosts).
semantic_timbre_rules.json – defines rule‑based tagging: per‑dim band thresholds and compound rules (e.g., “drone” requires high sustain + low impulsiveness + high stability).

Example: in “very dark airy”, “very” scales the intensity by 1.35, “dark” shifts brightness by -0.7, and “airy” shifts noisiness by +0.4 and spatiality by +0.3. The resulting target vector is compared with each segment’s semantic dimensions.

Supported prompt features:

Intensity modifiers: slightly, somewhat, moderately, very, extremely, less, more, a bit, etc.
Negation: “not bright”, “without sustain” – flips the effect and adds exclusion tags.
Synonyms: “deep” → “dark”, “shiny” → “bright”, “whoosh” → “airy”, etc.
Compound tags: “drone”, “burst”, “cloud”, “scrape”, “gesture”, “stream”, “metallic”, “wooden”, “glassy”, “frictional”, “vowel‑like”, etc.
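
The parsing behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the engine's actual parser: the lexicon layout, the `parse_prompt` helper, the value for "slightly", the delta for "bright", and the rule that a modifier or negation applies only to the next token are all assumptions. Only the 1.35 factor for "very" and the deltas for "dark" and "airy" come from the example above.

```python
# Hypothetical miniature lexicon; the real semantic_prompt_lexicon.json
# may be structured differently.
LEXICON = {
    "modifiers": {"slightly": 0.5, "very": 1.35},   # "slightly" value is assumed
    "synonyms": {"deep": "dark", "shiny": "bright"},
    "negations": {"not", "without"},
    "tokens": {
        "dark": {"brightness": -0.7},
        "bright": {"brightness": 0.7},              # assumed mirror of "dark"
        "airy": {"noisiness": 0.4, "spatiality": 0.3},
    },
}


def parse_prompt(prompt, lexicon=LEXICON):
    """Return (per-dimension deltas, excluded tags) for a free-text prompt."""
    deltas, excluded = {}, []
    scale, negate = 1.0, False
    for word in prompt.lower().split():
        word = lexicon["synonyms"].get(word, word)  # resolve synonyms first
        if word in lexicon["modifiers"]:
            scale = lexicon["modifiers"][word]
        elif word in lexicon["negations"]:
            negate = True
        elif word in lexicon["tokens"]:
            sign = -1.0 if negate else 1.0          # negation flips the effect
            for dim, delta in lexicon["tokens"][word].items():
                deltas[dim] = deltas.get(dim, 0.0) + sign * scale * delta
            if negate:
                excluded.append(word)               # negated token becomes an exclusion tag
            scale, negate = 1.0, False              # modifiers apply to one token only
        # unknown words are ignored in this sketch
    return deltas, excluded
```

Under these assumptions, `parse_prompt("not shiny")` resolves "shiny" to "bright", flips its brightness delta, and lists "bright" as an excluded tag.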

Scoring & ranking

Semantic score: 1 – weighted Euclidean distance between target dim vector and segment dim vector. Weights are derived from the prompt (dims mentioned get higher weight).
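
As an illustration of the semantic score, here is a small sketch of one way to compute 1 minus a weighted Euclidean distance. The function name `semantic_score` and the choice to normalise the weights so that they sum to 1 (which keeps the result in [0,1] when all dimensions are in [0,1]) are assumptions; the engine may normalise differently.

```python
import math


def semantic_score(target, segment, weights):
    """1 - weighted Euclidean distance between two dimension vectors.

    All three sequences have one entry per semantic dimension. Weights are
    normalised to sum to 1 (an assumption of this sketch), so the distance
    cannot exceed 1 when every dimension lies in [0, 1].
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    squared = sum(
        (w / total) * (t - s) ** 2
        for t, s, w in zip(target, segment, weights)
    )
    return 1.0 - math.sqrt(squared)
```

Giving a larger weight to the dimensions named in the prompt makes a mismatch on those dimensions cost more than a mismatch on dimensions the prompt never mentioned.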

Tag score: sum of (boost × confidence) for matched tags, minus penalties for excluded tags, normalised to [0,1].
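
A possible reading of the tag score, sketched under stated assumptions: the `tag_score` helper, the fixed `penalty` value, and the normalisation (divide by the sum of boosts, then clip to [0,1]) are hypothetical and only illustrate the "boost × confidence minus penalties" idea.

```python
def tag_score(tag_boosts, segment_tags, excluded=(), penalty=0.5):
    """Score a segment's tags against the prompt.

    tag_boosts:   {tag: boost} requested by the prompt
    segment_tags: {tag: confidence in [0, 1]} assigned by the tagging rules
    excluded:     tags the prompt negated
    penalty:      assumed cost per unit confidence of an excluded tag
    """
    max_raw = sum(tag_boosts.values())
    if max_raw <= 0:
        return 0.0
    raw = sum(boost * segment_tags.get(tag, 0.0)
              for tag, boost in tag_boosts.items())
    raw -= sum(penalty * segment_tags[tag]
               for tag in excluded if tag in segment_tags)
    return min(1.0, max(0.0, raw / max_raw))  # clip to [0, 1]
```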

Keyword score: fraction of prompt keywords that appear in the segment’s short caption.

Total score: weighted average of semantic, tag, and keyword scores (weights adjustable).
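
The keyword and total scores can be illustrated with a short sketch. The helper names are hypothetical, and whole-word, case-insensitive matching against the caption is an assumption; the default weights are the ones listed under Parameters & defaults.

```python
def keyword_score(keywords, caption):
    """Fraction of prompt keywords found as whole words in the caption."""
    if not keywords:
        return 0.0
    caption_words = set(caption.lower().split())
    hits = sum(1 for k in keywords if k.lower() in caption_words)
    return hits / len(keywords)


def total_score(semantic, tag, keyword,
                weight_semantic=1.0, weight_tag=0.5, weight_keyword=0.25):
    """Weighted average of the three component scores."""
    weight_sum = weight_semantic + weight_tag + weight_keyword
    if weight_sum <= 0:
        return 0.0
    return (weight_semantic * semantic
            + weight_tag * tag
            + weight_keyword * keyword) / weight_sum
```

With the default weights, a segment scoring 0.8 (semantic), 0.6 (tag) and 0.5 (keyword) gets (0.8 + 0.3 + 0.125) / 1.75 = 0.7.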

Diversity penalty: after ranking, a greedy algorithm picks the highest‑scoring candidate, then reduces the score of any remaining candidate from the same source file by diversity_penalty × k (where k is the number of already‑picked siblings).
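
The greedy selection described above might look like the following sketch. The function name and the `(source_file, score)` tuple layout are assumptions made for illustration.

```python
def rank_with_diversity(candidates, top_n=8, diversity_penalty=0.15):
    """Greedy re-ranking with a per-file diversity penalty.

    candidates: list of (source_file, score) tuples.
    Returns up to top_n (source_file, adjusted_score) tuples in pick order.
    """
    remaining = list(candidates)
    picked = []
    picked_per_file = {}  # source_file -> number of siblings already picked

    def adjusted(candidate):
        source, score = candidate
        return score - diversity_penalty * picked_per_file.get(source, 0)

    while remaining and len(picked) < top_n:
        best = max(remaining, key=adjusted)
        remaining.remove(best)
        picked.append((best[0], adjusted(best)))
        picked_per_file[best[0]] = picked_per_file.get(best[0], 0) + 1
    return picked
```

For example, with scores of 0.90, 0.85 and 0.84 from a.wav and 0.80 from b.wav, the second pick is b.wav: after the first pick, the remaining a.wav segments drop to 0.70 and 0.69.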

The output includes a short caption (e.g., “dark tonal sustained gesture with smooth edge”) and an explanation (e.g., “strong on brightness+stability | tags: dark/sustained | weak on noisiness”).

Parameters & defaults

Corpus & prompt


Parameter	Default	Description
Prompt	“dark airy swelling scrape”
Retrieval_mode	segments
Top_matches	8

Segmentation (segments mode only)

Parameter	Range	Default	Description
Min_segment_sec	0.1–2.0	0.25
Max_segment_sec	0.5–8.0	4.0
Segment_gate_dB	-50 to -10	-35.0

Weights & diversity


Parameter	Range	Default
Weight_semantic	0–2	1.0
Weight_tag	0–2	0.5
Weight_keyword	0–2	0.25
Diversity_penalty	0–0.5	0.15

Preview

Parameter	Default	Description
Build_preview_montage	yes
Crossfade_ms	30 ms

Output

Parameter	Default	Description
Draw_visualization	yes
Play_preview	yes

Visualization (Praat picture)

When Draw_visualization is enabled, the script draws a 5‑panel figure. The panels include:

Retrieved profile – bar chart of the average semantic dimensions of the top‑3 matches (target profile).
Ranked matches list – each rank shows source file, time range, short caption, and a horizontal bar representing the total score.
Per‑rank semantic heatmap – 8 columns (dimensions) × up to 8 rows (ranks). Darker blue = higher value in that dimension.
Summary panel – corpus files scanned, items indexed, prompt tokens resolved, weights, gate, render time.

Tip: The heatmap is the most informative panel – it shows the semantic profile of each retrieved result. If the top result is dark (low brightness) but the second result is bright, you’ll see a clear colour difference.

FAQ / troubleshooting

“Missing Python dependencies”

Install: pip install numpy scipy soundfile librosa. The script checks for these packages and exits cleanly if missing.
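
To check your environment yourself before running the script, something like the following works. It is a sketch of a dependency check, not the script's own code; `missing_packages` is a hypothetical helper built on the standard library's `importlib.util.find_spec`.

```python
import importlib.util

REQUIRED = ("numpy", "scipy", "soundfile", "librosa")


def missing_packages(names=REQUIRED):
    """Return the top-level package names that cannot be imported."""
    return [name for name in names
            if importlib.util.find_spec(name) is None]


def install_hint(missing):
    """Build a pip command for the missing packages, or '' if none."""
    return "pip install " + " ".join(missing) if missing else ""
```

Calling `install_hint(missing_packages())` returns an empty string when everything is installed, or a ready-to-paste pip command otherwise.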

No results / all scores low

If the corpus is small or doesn’t contain sounds that match the prompt, scores will be low. Try a broader prompt, or increase Top_matches to see more results. The semantic heatmap will show the actual dims of the retrieved items – you can see which dimensions are mismatched.

Preview montage is silent / contains clicks

Check that Crossfade_ms is not too short (the default of 30 ms is safe). The montage uses equal‑power crossfades (cosine/sine). If segments are extremely short, the crossfade may fail; in that case reduce Crossfade_ms or increase Min_segment_sec.
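
For reference, an equal‑power (cosine/sine) crossfade can be sketched as below. This is an illustration in plain Python lists, not the montage code itself; the `equal_power_crossfade` name and the choice to clamp the fade to the shorter segment (one way to avoid failures on very short segments) are assumptions.

```python
import math


def equal_power_crossfade(a, b, fade_len):
    """Join sample lists a and b, overlapping fade_len samples.

    The outgoing gain is cos(phase) and the incoming gain is sin(phase),
    so the squared gains always sum to 1 (constant power).
    """
    fade_len = min(fade_len, len(a), len(b))  # clamp for short segments (assumed)
    if fade_len == 0:
        return list(a) + list(b)
    split = len(a) - fade_len
    overlap = []
    for i in range(fade_len):
        phase = (i + 0.5) / fade_len * (math.pi / 2)
        overlap.append(a[split + i] * math.cos(phase) + b[i] * math.sin(phase))
    return list(a[:split]) + overlap + list(b[fade_len:])
```

A Crossfade_ms value converts to samples as `int(sample_rate * crossfade_ms / 1000)`, so 30 ms at 44100 Hz is 1323 samples. A segment shorter than that cannot carry the full fade, which is why very short segments are a problem.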

Customising the lexicon and rules

The two JSON files are located in the plugin_AudioTools/py/ folder. You can edit them to add new semantic tokens, adjust intensity modifiers, or modify tagging thresholds. The files are human‑readable and documented with comments.

Energy‑based segmentation

The engine uses RMS‑gated segmentation: regions above peak × 10^(gate/20) are kept. This works well for most sounds. If the corpus contains very quiet or very dynamic material, adjust Segment_gate_dB (e.g., -20 dB for louder material, -50 dB for very quiet).
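
The gating rule can be illustrated with a short sketch. It assumes that "peak" means the highest frame RMS in the file and uses non‑overlapping frames; both are assumptions, as are the helper names. Enforcing Min_segment_sec and Max_segment_sec is left out.

```python
import math


def frame_rms(samples, frame_len):
    """RMS of consecutive non-overlapping frames (trailing remainder dropped)."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def gated_regions(rms, gate_db=-35.0):
    """Return (start, end) frame index pairs above the gate; end is exclusive."""
    if not rms:
        return []
    threshold = max(rms) * 10 ** (gate_db / 20)  # peak x 10^(gate/20)
    regions, start = [], None
    for i, value in enumerate(rms):
        if value > threshold and start is None:
            start = i                       # region opens
        elif value <= threshold and start is not None:
            regions.append((start, i))      # region closes
            start = None
    if start is not None:
        regions.append((start, len(rms)))   # region runs to the end
    return regions
```

At the default of -35 dB the threshold is about 1.8% of the peak RMS. Raising the gate to -20 dB moves it to 10%, so only louder regions survive.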

Diversity penalty

Without diversity, the top 8 results might all come from the same file (if that file contains many similar segments). The penalty encourages variety by reducing the score of additional segments from the same source file. A value of 0.15 is typical – increase to 0.3 for more aggressive diversification.