Semantic Timbre Retrieval — User Guide
Free‑text timbre search over a corpus of audio files. Type a prompt like “dark airy swelling scrape” – the engine analyses the corpus, extracts 8 semantic dimensions (brightness, noisiness, tonalness, stability, impulsiveness, sustain, roughness, spatiality), applies rule‑based tagging, and retrieves the best‑matching files or segments using hybrid scoring (semantic + tag + keyword). Returns a ranked list and an optional preview montage.
What this does
Semantic Timbre Retrieval is a content‑based audio search engine that lets you find sounds by describing them in everyday language. Given a folder of audio files (the corpus) and a free‑text prompt (e.g., “dark airy swelling scrape”), the system:
- Scans the corpus and segments each file into regions (energy‑based or fixed‑window).
- Extracts 16 raw acoustic features (RMS, spectral centroid, ZCR, onset strength, pitch confidence, etc.).
- Derives 8 semantic dimensions (brightness, noisiness, tonalness, stability, impulsiveness, sustain, roughness, spatiality) in [0,1].
- Applies rule‑based tagging (e.g., “dark”, “airy”, “swelling”, “scrape”) from a human‑editable JSON rule file.
- Parses the free‑text prompt using a lexicon (intensity modifiers, synonyms, negation).
- Scores each segment via a weighted hybrid of semantic distance, tag matches, and keyword presence.
- Ranks results with a diversity penalty (prevents the same source file from dominating).
- Exports ranked results to CSV and optionally builds a preview montage WAV.
Quick start
- Prepare a folder of audio files (WAV, FLAC, AIFF, MP3, OGG).
- In Praat, run the script semantic_timbre_retrieval.praat.
- Select the corpus folder.
- Enter a Prompt (e.g., “dark airy swelling scrape”).
- Choose Retrieval_mode:
  - files – each file is treated as a single item.
  - segments – files are split into shorter segments (energy‑based with min/max duration).
- Set Top_matches (number of results to return).
- Optionally enable Build_preview_montage – creates a concatenated WAV of the top matches with crossfades.
- Adjust weights: Weight_semantic (dimension distance), Weight_tag (tag matches), Weight_keyword (keyword presence), Diversity_penalty (avoid over‑sampling the same file).
- Click OK. Python scans the corpus, extracts features, scores each item, and returns a list of matches in the Info window. If preview is enabled, a Sound object named str_preview_prompt is created.
Requirements: Python with numpy, scipy, soundfile, and librosa. The first run on a large corpus may take a few minutes while librosa extracts features. Subsequent runs take just as long, because the engine re‑analyses the corpus each time (no persistent index yet).
The 8 semantic dimensions
The engine maps raw acoustic features to eight perceptual dimensions, each normalised to [0,1]. These form the “semantic space” for retrieval.
| Dimension | Low (0) | High (1) | Derivation |
|---|---|---|---|
| brightness | | | |
| noisiness | | | |
| tonalness | | | |
| stability | | | |
| impulsiveness | | | |
| sustain | | | |
| roughness | | | |
| spatiality | | | |
Prompt parsing (lexicon‑driven)
📝 How the prompt is interpreted
The system uses two JSON files:
- semantic_prompt_lexicon.json – defines intensity modifiers (“slightly”, “very”), synonyms (“deep” → “dark”), negations (“not”, “without”), and token definitions (each token has a target dim delta and tag boosts).
- semantic_timbre_rules.json – defines rule‑based tagging: per‑dim band thresholds and compound rules (e.g., “drone” requires high sustain + low impulsiveness + high stability).
Example: in “very dark airy”, “very” scales the intensity of the following token by 1.35, “dark” shifts brightness by -0.7, and “airy” shifts noisiness by +0.4 and spatiality by +0.3. The resulting target vector is compared to each segment’s semantic dims.
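The example above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout, the `build_target_deltas` function, and the rule that a modifier scales only the next token are assumptions, not the contents of semantic_prompt_lexicon.json. The numeric values are the ones quoted in the example.

```python
# Hypothetical in-memory lexicon; the real entries live in
# semantic_prompt_lexicon.json and may be structured differently.
DIMS = ["brightness", "noisiness", "tonalness", "stability",
        "impulsiveness", "sustain", "roughness", "spatiality"]
MODIFIERS = {"very": 1.35}
TOKENS = {
    "dark": {"brightness": -0.7},
    "airy": {"noisiness": 0.4, "spatiality": 0.3},
}

def build_target_deltas(prompt):
    """Accumulate per-dimension deltas for a prompt.

    Assumption: an intensity modifier scales only the next known token.
    """
    deltas = {dim: 0.0 for dim in DIMS}
    scale = 1.0
    for word in prompt.lower().split():
        if word in MODIFIERS:
            scale = MODIFIERS[word]
        elif word in TOKENS:
            for dim, delta in TOKENS[word].items():
                deltas[dim] += scale * delta
            scale = 1.0  # the modifier is consumed by this token
    return deltas

deltas = build_target_deltas("very dark airy")
print({dim: round(v, 3) for dim, v in deltas.items() if v != 0.0})
```

How these deltas are turned into a target vector in [0,1] is not specified in this guide, so the sketch stops at the deltas.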
Supported prompt features:
- Intensity modifiers: slightly, somewhat, moderately, very, extremely, less, more, a bit, etc.
- Negation: “not bright”, “without sustain” – flips the effect and adds exclusion tags.
- Synonyms: “deep” → “dark”, “shiny” → “bright”, “whoosh” → “airy”, etc.
- Compound tags: “drone”, “burst”, “cloud”, “scrape”, “gesture”, “stream”, “metallic”, “wooden”, “glassy”, “frictional”, “vowel‑like”, etc.
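A compound rule such as “drone” (high sustain + low impulsiveness + high stability) might be evaluated along the following lines. The band thresholds, the rule layout, and the function names are hypothetical; the real thresholds are defined in semantic_timbre_rules.json.

```python
# Hypothetical band thresholds; the actual per-dimension bands come
# from semantic_timbre_rules.json.
HIGH_THRESHOLD = 0.65
LOW_THRESHOLD = 0.35

# Assumed rule layout: tag -> required band per dimension.
COMPOUND_RULES = {
    "drone": {"sustain": "high", "impulsiveness": "low", "stability": "high"},
}

def band(value):
    """Classify a [0,1] dimension value as 'low', 'mid', or 'high'."""
    if value >= HIGH_THRESHOLD:
        return "high"
    if value <= LOW_THRESHOLD:
        return "low"
    return "mid"

def compound_tags(dims):
    """Return every compound tag whose band requirements are all met."""
    return [
        tag
        for tag, requirements in COMPOUND_RULES.items()
        if all(band(dims[dim]) == wanted for dim, wanted in requirements.items())
    ]

print(compound_tags({"sustain": 0.9, "impulsiveness": 0.1, "stability": 0.8}))
```

Keeping the rules as plain data is what makes a rule file human-editable: adding a tag means adding an entry, not changing code.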
Scoring & ranking
Tag score: sum of (boost × confidence) for matched tags, minus penalties for excluded tags, normalised to [0,1].
Keyword score: fraction of prompt keywords that appear in the segment’s short caption.
Total score: weighted average of semantic, tag, and keyword scores (weights adjustable).
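As a rough sketch of how the tag, keyword, and total scores could be combined: the function names, the penalty value, and the normalisation by the sum of boosts are assumptions made for illustration. Only the default weights (1.0, 0.5, 0.25) are taken from this guide.

```python
def tag_score(tag_boosts, segment_tags, excluded=(), penalty=0.5):
    """Sum boost x confidence for matched tags, subtract penalties for
    excluded tags, and clamp to [0,1].

    tag_boosts: {tag: boost} requested by the prompt.
    segment_tags: {tag: confidence} assigned to the segment.
    Assumptions: normalisation by the total boost, and the penalty value.
    """
    total_boost = sum(tag_boosts.values())
    if total_boost <= 0:
        return 0.0
    raw = sum(boost * segment_tags.get(tag, 0.0) for tag, boost in tag_boosts.items())
    raw -= sum(penalty * segment_tags.get(tag, 0.0) for tag in excluded)
    return min(1.0, max(0.0, raw / total_boost))

def keyword_score(keywords, caption):
    """Fraction of prompt keywords that appear in the short caption."""
    if not keywords:
        return 0.0
    words = set(caption.lower().split())
    return sum(1 for k in keywords if k.lower() in words) / len(keywords)

def total_score(semantic, tag, keyword, w_sem=1.0, w_tag=0.5, w_kw=0.25):
    """Weighted average of the three component scores."""
    weight_sum = w_sem + w_tag + w_kw
    if weight_sum == 0:
        return 0.0
    return (w_sem * semantic + w_tag * tag + w_kw * keyword) / weight_sum

caption = "dark tonal sustained gesture with smooth edge"
kw = keyword_score(["dark", "scrape"], caption)
print(kw, round(total_score(0.8, 0.6, kw), 3))
```

Dividing by the sum of the weights keeps the total in [0,1] whatever weights are chosen, so scores stay comparable across runs with different settings.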
Diversity penalty: after ranking, a greedy algorithm picks the highest‑scoring candidate, then reduces the score of any remaining candidate from the same source file by diversity_penalty × k (where k is the number of already‑picked siblings).
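The greedy selection with a diversity penalty can be sketched as follows. The `diversify` function and the tuple layout are hypothetical, and the sketch assumes the penalty is always applied to the candidate's original score (adjusted = score - diversity_penalty × k) rather than accumulated step by step.

```python
def diversify(candidates, top_n=8, diversity_penalty=0.15):
    """Greedy re-ranking with a per-source penalty.

    candidates: list of (source_file, score) tuples.
    Returns a list of (source_file, original_score, adjusted_score).
    """
    remaining = list(candidates)
    picked_per_source = {}  # source_file -> number of items already picked
    result = []

    def adjusted(candidate):
        source, score = candidate
        return score - diversity_penalty * picked_per_source.get(source, 0)

    while remaining and len(result) < top_n:
        best = max(remaining, key=adjusted)
        result.append((best[0], best[1], adjusted(best)))
        picked_per_source[best[0]] = picked_per_source.get(best[0], 0) + 1
        remaining.remove(best)
    return result

ranked = diversify(
    [("a.wav", 0.90), ("a.wav", 0.85), ("b.wav", 0.80), ("a.wav", 0.84)],
    top_n=3,
)
print([source for source, _, _ in ranked])
```

In this toy input the second pick is b.wav even though two a.wav segments have higher raw scores, because each of them is penalised by 0.15 once a.wav has been picked.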
The output includes a short caption (e.g., “dark tonal sustained gesture with smooth edge”) and an explanation (e.g., “strong on brightness+stability | tags: dark/sustained | weak on noisiness”).
Parameters & defaults
Corpus & prompt
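| Parameter | Default | Description |
|---|---|---|
| Prompt | “dark airy swelling scrape” | Free‑text description of the timbre to search for. |
| Retrieval_mode | segments | files (each file is a single item) or segments (files are split into shorter segments). |
| Top_matches | 8 | Number of results to return. |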
Segmentation (segments mode only)
Weights & diversity
| Parameter | Range | Default | Description |
|---|---|---|---|
| Weight_semantic | 0–2 | 1.0 | Weight of the semantic dimension distance. |
| Weight_tag | 0–2 | 0.5 | Weight of tag matches. |
| Weight_keyword | 0–2 | 0.25 | Weight of keyword presence. |
| Diversity_penalty | 0–0.5 | 0.15 | Penalty that avoids over‑sampling the same source file. |
Preview
Output
Visualization (Praat picture)
When Draw_visualization = 1, the script draws a 5‑panel figure:
- Retrieved profile – bar chart of the average semantic dimensions of the top‑3 matches (target profile).
- Ranked matches list – each rank shows source file, time range, short caption, and a horizontal bar representing the total score.
- Per‑rank semantic heatmap – 8 columns (dimensions) × up to 8 rows (ranks). Darker blue = higher value in that dimension.
- Summary panel – corpus files scanned, items indexed, prompt tokens resolved, weights, gate, render time.
FAQ / troubleshooting
Install: pip install numpy scipy soundfile librosa. The script checks for these packages and exits cleanly if missing.
If the corpus is small or doesn’t contain sounds that match the prompt, scores will be low. Try a broader prompt, or increase Top_matches to see more results. The semantic heatmap will show the actual dims of the retrieved items – you can see which dimensions are mismatched.
If the preview montage sounds wrong or fails to build, check that Crossfade_ms is not too short (the 30 ms default is safe). The montage uses equal‑power crossfades (cosine/sine). If segments are extremely short, the crossfade may fail – reduce the crossfade or increase min_segment_sec.
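For reference, an equal‑power (cosine/sine) crossfade between two segments can be sketched as below. This is an illustrative pure‑Python version, not the engine's code; the function name is hypothetical, and clamping the fade to the shorter segment is a choice made for the sketch.

```python
import math

def equal_power_crossfade(a, b, fade_len):
    """Join sample lists a and b, overlapping fade_len samples.

    The outgoing segment is weighted by cos(theta) and the incoming one by
    sin(theta), so the squared gains always sum to 1 (constant power).
    Assumption: the fade is clamped to the shorter segment instead of failing.
    """
    fade_len = min(fade_len, len(a), len(b))
    if fade_len == 0:
        return list(a) + list(b)
    head = list(a[:len(a) - fade_len])
    tail = list(b[fade_len:])
    overlap = []
    for i in range(fade_len):
        theta = (i + 0.5) / fade_len * (math.pi / 2)
        fade_out = a[len(a) - fade_len + i] * math.cos(theta)
        fade_in = b[i] * math.sin(theta)
        overlap.append(fade_out + fade_in)
    return head + overlap + tail

# 30 ms at 44.1 kHz is 1323 samples.
joined = equal_power_crossfade([0.5] * 4410, [0.5] * 4410, fade_len=1323)
print(len(joined))
```

The example shows why very short segments are a problem: a segment shorter than the crossfade leaves no unfaded material, which is the situation the advice about min_segment_sec is meant to avoid.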
The two JSON files are located in the plugin_AudioTools/py/ folder. You can edit them to add new semantic tokens, adjust intensity modifiers, or modify tagging thresholds. The files are human‑readable and documented with comments.
The engine uses RMS‑gated segmentation: regions above peak × 10^(gate/20) are kept. This works well for most sounds. If the corpus contains very quiet or very dynamic material, adjust Segment_gate_dB (e.g., -20 dB for louder material, -50 dB for very quiet).
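The gate formula can be illustrated with a small pure‑Python sketch. The frame length, the function name, and the interpretation of “peak” as the loudest frame RMS are assumptions; only the threshold expression peak × 10^(gate/20) comes from the description above.

```python
import math

def gated_regions(samples, sample_rate, gate_db, frame_sec=0.02):
    """Return (start_sec, end_sec) regions whose frame RMS is at or above
    peak_rms * 10 ** (gate_db / 20).

    Assumptions: fixed non-overlapping frames, and 'peak' taken as the
    maximum frame RMS of the file.
    """
    frame = max(1, int(frame_sec * sample_rate))
    rms = []
    for start in range(0, len(samples), frame):
        chunk = samples[start:start + frame]
        rms.append(math.sqrt(sum(x * x for x in chunk) / len(chunk)))
    peak = max(rms, default=0.0)
    if peak == 0.0:
        return []  # silent or empty input: nothing passes the gate
    threshold = peak * 10 ** (gate_db / 20)

    regions = []
    region_start = None
    for i, value in enumerate(rms):
        if value >= threshold and region_start is None:
            region_start = i
        elif value < threshold and region_start is not None:
            regions.append((region_start * frame / sample_rate, i * frame / sample_rate))
            region_start = None
    if region_start is not None:
        regions.append((region_start * frame / sample_rate, len(samples) / sample_rate))
    return regions

signal = [0.0] * 100 + [1.0] * 200 + [0.0] * 100
print(gated_regions(signal, sample_rate=1000, gate_db=-20))
```

Because the threshold is relative to the file's own peak, a gate of -20 dB keeps only frames within a factor of 10 of the loudest frame, while -50 dB also keeps much quieter material.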
Without diversity, the top 8 results might all come from the same file (if that file contains many similar segments). The penalty encourages variety by reducing the score of additional segments from the same source file. A value of 0.15 is typical – increase to 0.3 for more aggressive diversification.