Timbral Similarity Browser — User Guide
Content-based audio navigation: analyzes MFCC timbral features, computes acoustic similarity, and creates seamless listening paths through sound collections ordered by perceptual similarity.
What this does
This script implements timbral similarity browsing, a content-based approach to navigating sound collections. The process has five stages:
(1) Batch loading: Loads all WAV files from a folder, converting stereo to mono.
(2) MFCC analysis: Extracts Mel-Frequency Cepstral Coefficients to capture perceptual timbral characteristics.
(3) Similarity computation: Calculates Euclidean distances between mean MFCC vectors.
(4) Path construction: Builds a listening sequence with a greedy nearest-neighbor algorithm (best available neighbor at each step, not guaranteed to be globally optimal).
(5) Concatenation: Joins the sounds in similarity order for continuous playback.
Result: A single audio file in which each sound is followed by its most timbrally similar unvisited neighbor.
Key Features:
- Batch Processing: Automatic loading of entire folders
- MFCC Analysis: Perceptually relevant timbral features
- Content-Based Ordering: Sounds ordered by acoustic similarity
- Nearest-Neighbor Path: Greedy similarity-based sequence (best available neighbor at each step)
- Automatic Concatenation: Joins the ordered sounds into one file
- Memory Management: Clean removal of temporary objects
What is timbral similarity browsing?
Traditional file browsing: alphabetical, chronological, or manual organization.
Timbral browsing: automatic organization based on acoustic content.
Benefits:
(1) Discover hidden relationships: Finds connections that file names and folder structure do not reveal.
(2) Intuitive navigation: Similar sounds are placed close together.
(3) Creative inspiration: Reveals unexpected transitions and combinations.
(4) Educational tool: Demonstrates acoustic relationships between sounds.
(5) Efficient organization: Automatically groups related sounds.
Use cases: Sound library management, musical composition, sound design workflows, acoustic research, educational demonstrations, creative exploration.
Technical Implementation:
(1) File loading: Case-insensitive WAV detection, stereo-to-mono conversion, progress tracking.
(2) MFCC extraction: 12 coefficients, 15 ms windows, 5 ms steps, mel filterbank starting at 100 mel with 100 mel between filters, extending up to the Nyquist frequency.
(3) Feature aggregation: Mean MFCC vectors across time for each sound.
(4) Distance computation: Euclidean distance between 12-dimensional MFCC vectors.
(5) Path construction: Greedy nearest-neighbor algorithm starting from the first sound.
(6) Output generation: Concatenation in similarity order, automatic cleanup, optional playback.
Loading and MFCC extraction scale linearly with the number of files and their durations; the distance matrix and path search scale with the square of the number of files.
Quick start
- In Praat, select any Sound object (selection ignored).
- Run script… → timbral_similarity_browser.praat.
- Set max_files_to_load (0 = all files, or limit for testing).
- Enable auto_play for immediate playback of result.
- Click OK → select folder containing WAV files.
- Script processes files → displays similarity path → creates concatenated output.
- Result: "Timbral_Similarity_Path" sound object for playback/export.
Quick tip: Prepare a folder with 10-50 diverse sound files — musical instruments, environmental sounds, vocal samples work well. Use max_files_to_load = 0 for complete analysis. The script shows real-time progress: "Loading sounds" → "MFCC Analysis" → "Computing Similarity" → "Concatenating". Output appears in Objects window as "Timbral_Similarity_Path". Listen to discover unexpected acoustic relationships between your sounds. For large collections (>100 files), processing may take several minutes. All temporary objects are automatically cleaned up.
Important:
- Folder organization: The script processes every WAV file in the selected folder, so keep only the sounds you want compared in it.
- File format: Only .wav files are supported.
- Duration: Files shorter than 20 ms are skipped.
- Memory: All loaded sounds stay in the Objects list until cleanup, so very long files or very large collections can exhaust available RAM.
- Similarity limitations: MFCCs capture timbre but not rhythm, melody, or harmony.
- Path dependency: The nearest-neighbor path depends on the starting sound.
- Concatenation artifacts: Transitions between dissimilar sounds may be abrupt.
- Auto-play: Playback is asynchronous, so the script finishes while the sound is still playing.
MFCC Theory
Mel-Frequency Cepstral Coefficients
What are MFCCs?
Perceptual audio representation:
MFCC = Mel-Frequency Cepstral Coefficients
Purpose: Capture timbral characteristics in perceptually meaningful way
Processing pipeline:
Audio → FFT → Mel filterbank → Log compression → DCT → MFCCs
Why MFCC for timbre?
- Models human frequency perception (Mel scale)
- Decorrelates features (DCT step)
- Focuses on spectral envelope (cepstral analysis)
- Robust to pitch variations
- Standard in speech/sound recognition
Our parameters:
12 coefficients (standard for timbral analysis)
Window: 15ms (captures spectral snapshots)
Step: 5ms (temporal resolution)
Filterbank: first filter at 100 mel, 100 mel between filters, upper limit at the Nyquist frequency
Mel Scale Perception
Human frequency perception:
Mel scale: Nonlinear frequency scale matching human perception
Below 1kHz: approximately linear
Above 1kHz: logarithmic compression
Mel formula: m = 2595 × log₁₀(1 + f/700)
Where:
f = physical frequency (Hz)
m = perceptual frequency (Mels)
Interpretation:
1000 Hz = 1000 Mels
2000 Hz ≈ 1500 Mels (perceived as less than double)
4000 Hz ≈ 2100 Mels (further compression)
MFCC advantage: Allocates finer frequency resolution to low frequencies, where human hearing resolves frequency differences best
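The mel formula above can be checked numerically. The following is a minimal Python sketch; the function names `hz_to_mel` and `mel_to_hz` are illustrative and not part of the script, and Praat's internal mel conversion may use a different formula.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to mels using m = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Invert hz_to_mel: f = 700 * (10^(m/2595) - 1)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


# Print the three reference points listed in the interpretation above.
for f in (1000.0, 2000.0, 4000.0):
    print(f"{f:.0f} Hz -> {hz_to_mel(f):.0f} mel")
```

Doubling the frequency from 2000 Hz to 4000 Hz raises the mel value by well under a factor of two, which is the compression described above.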
MFCC Processing Pipeline
Step-by-Step Transformation
STEP 1: FRAMING
Divide audio into short overlapping frames
Window: 15ms (about 661 samples at 44.1kHz)
Step: 5ms (consecutive frames overlap by 10ms, about 67%)
Purpose: Capture time-varying spectral characteristics
STEP 2: WINDOWING
Apply Hanning window to each frame
Reduces spectral leakage
Tapers frame boundaries smoothly
STEP 3: FFT
Compute Fast Fourier Transform for each frame
Converts time domain → frequency domain
Obtain magnitude spectrum
STEP 4: MEL FILTERBANK
Apply triangular filters spaced according to Mel scale
Typically 20-40 Mel filters
Our implementation: Praat default (approx 26 filters)
Output: Filterbank energies
STEP 5: LOG COMPRESSION
Take logarithm of filterbank energies
Models human loudness perception (logarithmic)
Reduces dynamic range
STEP 6: DCT
Discrete Cosine Transform on log filterbank energies
Decorrelates features → independent coefficients
First coefficient (MFCC0) = overall energy (often discarded)
Coefficients 1-12 = spectral shape characteristics
OUTPUT: 12 MFCCs per frame capturing timbral information
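Steps 5 and 6 can be illustrated in isolation. The sketch below assumes the filterbank energies for one frame are already available; the function names, the 26-filter example frame, and the unnormalized DCT-II scaling are assumptions made for illustration, not Praat's exact implementation.

```python
import math


def dct_ii(values):
    """Unnormalized DCT-II: c[k] = sum_n x[n] * cos(pi * k * (n + 0.5) / N)."""
    n_vals = len(values)
    return [
        sum(x * math.cos(math.pi * k * (n + 0.5) / n_vals)
            for n, x in enumerate(values))
        for k in range(n_vals)
    ]


def mfcc_from_filterbank(energies, n_coeffs=12):
    """Log-compress filterbank energies (step 5), then apply the DCT (step 6).

    Returns (c0, [c1 .. c_n_coeffs]); c0 tracks the overall level.
    """
    log_energies = [math.log(max(e, 1e-12)) for e in energies]  # floor avoids log(0)
    coeffs = dct_ii(log_energies)
    return coeffs[0], coeffs[1:n_coeffs + 1]


# Hypothetical frame: 26 filterbank energies that fall off towards high frequencies.
energies = [math.exp(-0.2 * i) for i in range(26)]
c0, mfcc = mfcc_from_filterbank(energies)
print(len(mfcc))
```

Because of the log step, scaling every energy by the same factor changes only c0, which is why coefficients 1-12 describe spectral shape rather than level.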
MFCC Interpretation
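Rough guide to what each coefficient represents:
MFCC 1: Overall spectral tilt (balance between low and high frequencies)
With the usual DCT-II sign convention: high = energy concentrated at low frequencies (dark), low = relatively more high-frequency energy (bright)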
MFCC 2: Spectral shape (peaked vs flat)
High = peaked spectrum, low = flat spectrum
MFCC 3-6: Mid-frequency spectral details
Capture formant-like structures
MFCC 7-12: Fine spectral details
Capture noise characteristics, fine texture
Together: 12-dimensional "timbral fingerprint"
Similar MFCC vectors = similar perceived timbre
Why MFCC for Similarity?
Acoustic Similarity Metrics
Advantages of MFCC-based similarity:
🎵 Perceptual Relevance
MFCC vs other features:
- Spectral centroid: Only brightness, misses spectral shape
- Spectral rolloff: High-frequency content only
- Zero-crossing rate: Noisiness, but poor timbral discrimination
- Pitch/F0: Captures melody but not timbre
- MFCC (12 coefficients): Comprehensive timbral representation
Real-world performance:
- Speech recognition: Distinguishes phonemes
- Music information retrieval: Genre/instrument classification
- Sound event detection: Identifies sound types
- Our use: Timbre-based sound organization
Euclidean Distance in MFCC Space
Similarity computation:
Given two sounds A and B with mean MFCC vectors:
A = [a₁, a₂, ..., a₁₂]
B = [b₁, b₂, ..., b₁₂]
Euclidean distance:
d(A,B) = √[(a₁-b₁)² + (a₂-b₂)² + ... + (a₁₂-b₁₂)²]
Properties:
- d(A,B) ≥ 0 (non-negative)
- d(A,B) = 0 iff A = B (identity)
- d(A,B) = d(B,A) (symmetry)
- d(A,C) ≤ d(A,B) + d(B,C) (triangle inequality)
Interpretation:
Smaller distance = more similar timbre
Larger distance = more different timbre
Normalization: None is applied. Low-order coefficients typically vary over a wider range than high-order ones, so they contribute most to the distance
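The distance formula translates directly into code. This is a minimal Python sketch; the two 12-dimensional vectors are made-up values for illustration, not measurements from real sounds.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical mean MFCC vectors for two sounds (12 coefficients each).
sound_a = [12.0, -3.5, 4.1, 0.8, -1.2, 2.0, 0.3, -0.7, 1.1, 0.0, -0.4, 0.6]
sound_b = [10.5, -2.0, 3.9, 1.5, -0.9, 1.4, 0.1, -0.2, 0.8, 0.3, -0.1, 0.2]
print(round(euclidean_distance(sound_a, sound_b), 3))
```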
Analysis Pipeline
File Loading Phase
Batch Processing System
Automated folder processing:
STEP 1: Folder Selection
User selects folder via file dialog
Script checks: folder exists, contains .wav files
STEP 2: File Discovery
files$# = fileNames_caseInsensitive$# (directory$ + "*.wav")
Case-insensitive: finds .wav, .WAV, .Wav, etc.
Returns array of filenames
STEP 3: Loading Loop
FOR i from 1 to nFiles:
Read from file: directory$ + filename
Check success: if selected("Sound") ≠ undefined
Handle stereo: if nChannels > 1 → convert to mono
Store sound ID: sound'i' = loaded_sound_id
Progress reporting: "[i/nFiles] Loading: filename"
STEP 4: Validation
Count successful loads
Skip/remove failed loads
Final count: number_of_sounds
Output: Array of mono Sound objects ready for analysis
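For readers who want to prototype the discovery step outside Praat, the sketch below mirrors steps 2 and 4 in Python. The function name `find_wav_files` is hypothetical, and the way `max_files_to_load` truncates the sorted list is an assumption about the script's behavior rather than something the guide specifies.

```python
from pathlib import Path


def find_wav_files(directory, max_files_to_load=0):
    """Return .wav paths in `directory` (any letter case), sorted by name.

    max_files_to_load = 0 means no limit, as in the script's settings form.
    """
    paths = sorted(
        (p for p in Path(directory).iterdir()
         if p.is_file() and p.suffix.lower() == ".wav"),
        key=lambda p: p.name.lower(),
    )
    if max_files_to_load > 0:
        paths = paths[:max_files_to_load]  # assumed: keep the first N files
    return paths
```

Calling `find_wav_files("my_sounds", 10)` on a folder of that (hypothetical) name would return at most ten paths, ignoring any non-WAV files.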
Stereo-to-Mono Conversion
Channel handling:
Why convert to mono?
- MFCC analysis designed for monaural signals
- Prevents channel imbalance issues
- Reduces computation time
- Standardizes feature extraction
Conversion method: Praat's "Convert to mono"
Algorithm: (left_channel + right_channel) / 2
Simple averaging preserves spectral content for typical recordings; material that is out of phase between channels partially cancels
Alternative: Could use only left channel or more sophisticated downmixing
But averaging works well for most applications
All subsequent processing uses mono signals
Timbre perception is largely monaural
Spatial information discarded (irrelevant for timbral similarity)
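The averaging rule can be written out as follows. This is a Python sketch of the arithmetic only, with channels represented as plain lists of samples; it is not Praat's "Convert to mono" implementation.

```python
def to_mono(left, right):
    """Average two equal-length channels sample by sample."""
    if len(left) != len(right):
        raise ValueError("channels must have the same number of samples")
    return [(l + r) / 2.0 for l, r in zip(left, right)]


print(to_mono([1.0, 0.5, -0.5], [0.0, 0.5, -0.5]))
```

A signal that appears with opposite polarity in the two channels averages to silence, which is the main case where simple averaging loses content.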
MFCC Extraction Phase
Parameter Settings
MFCC configuration:
Praat "To MFCC" parameters:
Number of coefficients: 12
Window length: 0.015 seconds (15ms)
Time step: 0.005 seconds (5ms)
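Position of first filter: 100 mel
Distance between filters: 100 mel
Maximum frequency: 0 mel (Praat treats 0 as "up to the Nyquist frequency")
(The fourth to sixth arguments of "To MFCC" are filterbank settings expressed in mels)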
Why these values?
12 coefficients: Standard for timbral analysis
Enough detail, not too many dimensions
15ms window: Good spectral resolution
Captures stationary segments of most sounds
5ms step: Good temporal resolution
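67% overlap between consecutive frames provides smooth analysis
Filterbank from 100 mel in 100 mel steps: Praat's default spacing
Maximum frequency 0: Lets Praat extend the filterbank to the Nyquist frequency
No pre-emphasis is applied: "To MFCC" has no such setting, which suits general audio (not just speech)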
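The window and step values determine how many frames a sound yields. The Python sketch below works through that arithmetic; the frame-count formula is the textbook approximation (one frame per step once a full window fits) and is an assumption, since Praat's exact frame placement may differ slightly.

```python
import math


def framing_summary(duration_s, sample_rate_hz, window_s=0.015, step_s=0.005):
    """Return (window length in samples, fractional overlap, approximate frame count)."""
    window_samples = window_s * sample_rate_hz
    overlap = 1.0 - step_s / window_s
    if duration_s < window_s:
        n_frames = 0  # not even one full window fits
    else:
        # The small epsilon guards against results such as 196.9999999 from float division.
        n_frames = math.floor((duration_s - window_s) / step_s + 1e-9) + 1
    return window_samples, overlap, n_frames


window_samples, overlap, n_frames = framing_summary(1.0, 44100)
print(window_samples, round(overlap, 3), n_frames)
```

Under this approximation a 20 ms file yields two frames and anything shorter than 15 ms yields none.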
Frame-Based to Sound-Based
Feature aggregation:
MFCC analysis produces: nFrames × 12 coefficients
But we need: 1 × 12 features per sound
Solution: Temporal averaging
FOR each coefficient c from 1 to 12:
values = all frame values for coefficient c
mean_value = average(values)
Store mean_value as sound's feature c
Why temporal averaging?
- Reduces each sound to single feature vector
- Captures overall timbral character
- Ignores temporal evolution (for simplicity)
- Works well for steady-state sounds
Alternative: Could use covariance or other statistics
But mean works surprisingly well for timbral similarity
Result: Each sound represented by 12-dimensional mean MFCC vector
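Temporal averaging is a per-coefficient mean over frames. A minimal Python sketch, assuming the MFCC frames are available as a list of equal-length lists (the function name is illustrative):

```python
def mean_mfcc_vector(frames):
    """Average each coefficient over all frames of one sound."""
    if not frames:
        raise ValueError("at least one frame is required")
    n_coeffs = len(frames[0])
    return [
        sum(frame[c] for frame in frames) / len(frames)
        for c in range(n_coeffs)
    ]


# Two hypothetical frames with three coefficients each.
print(mean_mfcc_vector([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]))
```

Reordering the frames leaves the result unchanged, which is what "ignores temporal evolution" means in practice: a sound and its time-reversed copy get the same feature vector.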
Quality Control
Error Handling
Robust processing:
Error handling mechanisms:
File loading failures:
- Check if selected("Sound") = undefined
- Skip failed files, continue with others
- Report failure in info window
Short file detection:
- Check duration < 0.02 seconds
- Skip files too short for MFCC analysis
- The 20 ms threshold equals one 15 ms window plus one 5 ms step
MFCC computation failures:
- Check nFrames > 0
- Skip sounds with no valid MFCC frames
- Report skipped files
Undefined value handling:
- Check for undefined MFCC values
- Skip undefined values in mean calculation
- Use count of valid values for averaging
Final validation:
- Ensure at least one sound successfully analyzed
- Exit gracefully if no sounds processed
- Provide clear error messages
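The checks above can be combined into one validation pass. In this Python sketch, undefined MFCC values are modeled as NaN and each sound is a list of frames; both choices, and the function names, are assumptions made for illustration.

```python
import math


def robust_mean_vector(frames, n_coeffs=12):
    """Per-coefficient mean over defined values only.

    Returns None when there are no frames or a coefficient has no defined value.
    """
    if not frames:
        return None
    means = []
    for c in range(n_coeffs):
        valid = [frame[c] for frame in frames if not math.isnan(frame[c])]
        if not valid:
            return None
        means.append(sum(valid) / len(valid))  # divide by the count of valid values
    return means


def analyze_collection(named_frames, n_coeffs=12):
    """Return (features by name, names of skipped sounds); fail if nothing is usable."""
    features, skipped = {}, []
    for name, frames in named_frames.items():
        vector = robust_mean_vector(frames, n_coeffs)
        if vector is None:
            skipped.append(name)
        else:
            features[name] = vector
    if not features:
        raise RuntimeError("No sounds could be analyzed")
    return features, skipped
```

Skipped names are collected rather than raised immediately, so one bad file does not stop the rest of the batch.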
Similarity Computation
Distance Matrix Construction
Pairwise Similarity
All-pairs distance computation:
INPUT: n sounds with 12-dimensional MFCC vectors
STEP 1: Create distance matrix D[n×n]
Initialize all values to 0
STEP 2: Compute distances
FOR i from 1 to n:
FOR j from i to n: // Exploit symmetry
IF i = j:
D[i,j] = 0 // Self-distance zero
ELSE:
dist = 0
FOR c from 1 to 12:
diff = MFCC[i,c] - MFCC[j,c]
dist = dist + diff × diff
dist = sqrt(dist)
D[i,j] = dist
D[j,i] = dist // Symmetric
STEP 3: Label matrix
Set row/column labels to sound names
For easy interpretation
OUTPUT: Symmetric distance matrix D
D[i,j] = timbral distance between sound i and j
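The all-pairs loop can be sketched in Python as follows. The function name and the use of nested lists for the matrix are illustrative; the script itself stores the result in a Praat TableOfReal.

```python
import math


def distance_matrix(vectors):
    """Symmetric matrix of Euclidean distances between feature vectors."""
    n = len(vectors)
    d = [[0.0] * n for _ in range(n)]  # diagonal stays 0 (self-distance)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only, then mirror
            dist = math.sqrt(
                sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j]))
            )
            d[i][j] = dist
            d[j][i] = dist
    return d


# Three hypothetical 2-dimensional feature vectors.
for row in distance_matrix([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]):
    print(row)
```

Mirroring the upper triangle means only n(n-1)/2 distances are computed for n sounds.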
Matrix Interpretation
Reading the similarity landscape:
Example: 5 sounds distance matrix
         Violin   Cello   Flute    Drum   Noise
Violin [   0.00    1.23    4.56    8.90   12.34 ]
Cello  [   1.23    0.00    5.12    9.87   13.45 ]
Flute  [   4.56    5.12    0.00    3.45   15.67 ]
Drum   [   8.90    9.87    3.45    0.00   18.90 ]
Noise  [  12.34   13.45   15.67   18.90    0.00 ]
Interpretation:
- Violin and Cello very similar (1.23)
- Flute and Drum somewhat similar (3.45)
- Noise very different from everything (>12)
- Violin/Flute moderately different (4.56)
Matrix captures the "acoustic geography" of the collection
Path Construction Algorithm
Nearest-Neighbor Traversal
Greedy path optimization:
INPUT: Distance matrix D[n×n]
STEP 1: Initialization
visited# = [0,0,...,0] // n zeros
path# = [0,0,...,0] // n zeros
current = 1 // Start with first sound
path#[1] = 1
visited#[1] = 1
STEP 2: Greedy traversal
FOR step from 2 to n:
min_dist = very_large_number
next_sound = 0
// Find closest unvisited sound
FOR candidate from 1 to n:
IF visited#[candidate] = 0:
dist = D[current, candidate]
IF dist < min_dist:
min_dist = dist
next_sound = candidate
// Add to path
IF next_sound > 0:
path#[step] = next_sound
visited#[next_sound] = 1
current = next_sound
OUTPUT: path# = ordered list of sound indices
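The traversal can be tried on the five-sound example matrix from the Matrix Interpretation section. This Python sketch uses 0-based indices and a `start` argument, both of which are illustrative choices; the script always starts from the first loaded sound.

```python
def nearest_neighbor_path(d, start=0):
    """Greedy traversal: always move to the closest unvisited sound."""
    n = len(d)
    if n == 0:
        return []
    visited = [False] * n
    visited[start] = True
    path = [start]
    current = start
    for _ in range(n - 1):
        candidates = [j for j in range(n) if not visited[j]]
        # min() keeps the first of equally close candidates (lowest index).
        nxt = min(candidates, key=lambda j: d[current][j])
        visited[nxt] = True
        path.append(nxt)
        current = nxt
    return path


names = ["Violin", "Cello", "Flute", "Drum", "Noise"]
example = [
    [0.00, 1.23, 4.56, 8.90, 12.34],
    [1.23, 0.00, 5.12, 9.87, 13.45],
    [4.56, 5.12, 0.00, 3.45, 15.67],
    [8.90, 9.87, 3.45, 0.00, 18.90],
    [12.34, 13.45, 15.67, 18.90, 0.00],
]
print([names[i] for i in nearest_neighbor_path(example)])
print([names[i] for i in nearest_neighbor_path(example, start=2)])
```

Starting from Violin the sketch visits Violin, Cello, Flute, Drum, Noise; starting from Flute it visits Flute, Drum, Violin, Cello, Noise, which shows how the result depends on the starting sound.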
Algorithm Properties
Nearest-neighbor characteristics:
🔍 Path Optimization
What nearest-neighbor achieves:
- Local optimality: Each step chooses best immediate neighbor
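- Smooth transitions: Each sound is followed by its closest unvisited neighbor, so early transitions are small; later ones can be larger once the close neighbors are used up
- Computational efficiency: O(n²) for the greedy path vs O(n!) for exhaustive search of the optimal path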
- Intuitive progression: Gradual timbral evolution
Limitations:
- Starting point dependency: Different start = different path
- Local optima: May not find globally optimal path
- No backtracking: Once visited, never revisited
Alternatives considered:
- Traveling Salesman Problem (optimal but NP-hard)
- Multidimensional Scaling (preserves global structure)
- Hierarchical clustering (tree-based organization)
Output Generation
Seamless Concatenation
Creating the listening experience:
INPUT: path# = ordered sound indices
STEP 1: Select first sound
first_idx = path#[1]
select sound'first_idx'
STEP 2: Add remaining sounds
FOR i from 2 to n:
idx = path#[i]
plus sound'idx' // Add to selection
STEP 3: Concatenate
Concatenate // Praat built-in function
outputSound = selected("Sound")
STEP 4: Finalize
Rename: "Timbral_Similarity_Path"
Report total duration
Why this order matters:
Creates acoustic journey through similarity space
Each transition = small timbral change
Reveals relationships between sounds
Provides intuitive browsing experience
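Outside Praat, the same ordering step amounts to joining sample arrays in path order. A minimal Python sketch, with sounds represented as lists of samples and 0-based path indices (both are assumptions for illustration):

```python
def concatenate_in_path_order(sounds, path):
    """Join sample lists end to end, following the visiting order in `path`."""
    output = []
    for index in path:
        output.extend(sounds[index])
    return output


# Three hypothetical sounds of different lengths, visited in the order 0, 2, 1.
sounds = [[0.1, 0.2], [0.9], [0.5, 0.6, 0.7]]
print(concatenate_in_path_order(sounds, [0, 2, 1]))
```

The output length is the sum of the input lengths, since sounds are joined without overlap or gaps.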
Memory Management
Cleanup strategy:
Temporary objects created:
- n Sound objects (original loaded files)
- n MFCC objects (analysis results)
- 1 TableOfReal (MFCC features)
- 1 TableOfReal (distance matrix)
Cleanup process:
FOR i from 1 to n:
select sound'i'
Remove
select mfcc'i'
Remove
select featureTable
Remove
select distMatrix
Remove
Final objects remaining:
- Only outputSound ("Timbral_Similarity_Path")
Benefits:
Prevents Praat object clutter
Reduces memory usage
Clear workspace for user
Only final result preserved
Parameters & Settings
Loading Parameters
| Parameter | Type | Default | Description |
| max_files_to_load | integer | 0 | Maximum files to process (0 = all) |
Playback Parameters
| Parameter | Type | Default | Description |
| auto_play | boolean | 1 | Auto-play concatenated result |
MFCC Parameters (Hard-coded)
| Parameter | Value | Description |
| Number of coefficients | 12 | MFCC feature dimension |
| Window length | 0.015s | Analysis frame duration |
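| Time step | 0.005s | Hop between successive frames |
| First filter position | 100 mel | Position of the lowest mel filter |
| Filter spacing | 100 mel | Distance between adjacent mel filters |
| Maximum frequency | 0 | 0 = up to the Nyquist frequency |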
Parameter Guidance
max_files_to_load strategies:
- 0 (all files): Complete analysis, recommended for final processing
- 10-20: Quick testing, familiarization with the tool
- 50-100: Substantial collections, may take several minutes
- >100: Large libraries, consider memory and time constraints
Collection preparation tips:
- File format: Use WAV files for best compatibility
- Duration: Ensure all files > 20ms duration
- Content variety: Mix of similar and different sounds works best
- Naming: Descriptive names help interpret similarity path
- Normalization: Similar volume levels recommended
Applications
Sound Library Management
Use case: Organizing large sound effects libraries
Technique: Process entire library, discover natural groupings
Benefits: Automatic categorization, reveals hidden relationships
Musical Composition Tool
Use case: Finding smooth transitions between sounds
Technique: Process diverse sound palette, create seamless morphing sequences
Example: Granular synthesis source selection, sample-based composition
Acoustic Research
Use case: Studying perceptual similarity relationships
Technique: Compare algorithm results with human similarity judgments
Insights: Validate MFCC as perceptual model, discover acoustic cues
Educational Demonstration
Use case: Teaching timbre perception and audio features
Technique: Show how similar sounds group together automatically
Learning outcomes: Understand MFCC, acoustic similarity, content-based retrieval
Practical Workflow Examples
🎵 Instrument Family Exploration
Goal: Discover relationships between musical instruments
Setup:
- Collection: 30 instrument samples (strings, woodwinds, brass, percussion)
- Files: Single notes, similar pitch, similar duration
- Parameters: max_files_to_load = 0, auto_play = 1
Result: Path showing smooth transitions between instrument families
🌊 Environmental Sound Journey
Goal: Create natural progression through environmental sounds
Setup:
- Collection: Water, wind, forest, city, mechanical sounds
- Files: 10-20 second excerpts
- Parameters: max_files_to_load = 0, auto_play = 1
Result: Acoustic journey through different environments
🔍 Sound Design Analysis
Goal: Analyze timbral relationships in sound design work
Setup:
- Collection: 50 synthesized sounds, effects, processed recordings
- Files: Various synthetic and organic textures
- Parameters: max_files_to_load = 50, auto_play = 0 (listen manually)
Result: Understanding of timbral palette and transition opportunities
Advanced Techniques
Multi-scale similarity analysis:
- Stage 1: Run script on entire collection for global structure
- Stage 2: Run on similar subgroups for fine-grained organization
- Stage 3: Compare paths at different scales
- Result: Hierarchical understanding of timbral relationships
Cross-collection analysis:
- Method: Process multiple folders separately, then compare
- Comparison: Look for sounds that bridge different collections
- Application: Finding connections between different sound libraries
- Result: Unified organization across multiple sources
Troubleshooting Common Issues
Problem: No files loaded
Cause: Folder contains no .wav files, or files corrupted
Solution: Verify folder content, check file formats, try different folder
Problem: Memory errors with large collections
Cause: Too many/large files for available RAM
Solution: Use max_files_to_load to limit, close other applications
Problem: Poor similarity results
Cause: MFCC not capturing relevant features for specific sounds
Solution: Try different feature sets, ensure sounds have clear timbral content
Problem: Clicks at sound boundaries
Cause: Abrupt transitions between dissimilar sounds
Solution: Apply crossfade manually in DAW, or modify script for automatic crossfade
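One way to soften boundary clicks is a short linear crossfade between consecutive sounds. The Python sketch below shows the idea on plain sample lists; it is not part of the script, and the function name and the linear fade shape are assumptions.

```python
def crossfade_concat(a, b, fade_samples):
    """Join two sample lists, overlapping the end of `a` with the start of `b`."""
    fade = min(fade_samples, len(a), len(b))
    if fade <= 0:
        return list(a) + list(b)
    head = list(a[:len(a) - fade])
    mixed = []
    for k in range(fade):
        w = (k + 1) / (fade + 1)  # weight of `b`, rising linearly towards 1
        mixed.append((1.0 - w) * a[len(a) - fade + k] + w * b[k])
    return head + mixed + list(b[fade:])


print(crossfade_concat([1.0] * 4, [0.0] * 4, 2))
```

Each crossfade shortens the output by the fade length, so the total duration is slightly less than with plain concatenation.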
Algorithmic Extensions
Alternative Similarity Metrics
Beyond Euclidean Distance
Other distance measures:
COSINE SIMILARITY:
similarity(A,B) = (A·B) / (||A|| × ||B||)
Measures angle between vectors, ignores magnitude
Good when overall energy differences unimportant
MANHATTAN DISTANCE:
d(A,B) = ∑|aᵢ - bᵢ|
More robust to outliers than Euclidean
MAHALANOBIS DISTANCE:
d(A,B) = √[(A-B)ᵀ Σ⁻¹ (A-B)]
Accounts for feature correlations
Requires covariance matrix estimation
DYNAMIC TIME WARPING:
For comparing temporal evolution
Aligns sequences of different lengths
Computationally expensive
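Two of the measures above are simple enough to sketch directly. The Python functions below are illustrative and not part of the script; Mahalanobis distance and dynamic time warping are omitted because they need a covariance estimate and frame sequences respectively.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(manhattan_distance([0.0, 0.0], [3.0, 4.0]))
```

Note that cosine similarity is a similarity, not a distance: to use it in the nearest-neighbor search, something like 1 minus the similarity would be needed so that smaller values mean closer sounds.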
Advanced Path Finding
Beyond Nearest-Neighbor
Improved traversal algorithms:
OPTIMAL PATH (TRAVELING SALESMAN):
Find path that minimizes total distance
Exact solution: O(n!) by exhaustive search (impractical beyond roughly a dozen sounds); Held-Karp dynamic programming reduces this to O(n²·2ⁿ)
Approximations: Genetic algorithms, simulated annealing
SPANNING TREE APPROACH:
Build Minimum Spanning Tree (Prim's/Kruskal's)
Traverse tree for smooth path
Captures global structure better than greedy
MULTIDIMENSIONAL SCALING:
Project high-dimensional features to 2D/3D
Find path through low-dimensional space
Preserves global similarity relationships
HIERARCHICAL ORGANIZATION:
Build dendrogram using clustering
Traverse tree for hierarchical browsing
Provides multiple similarity scales
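For very small collections, the exhaustive search mentioned under OPTIMAL PATH can be compared with the greedy path. The Python sketch below fixes the starting sound and searches open paths (no return to the start); the four sounds placed on a line are a made-up example chosen so that the two methods disagree.

```python
from itertools import permutations


def path_length(d, path):
    """Total distance along consecutive pairs of a path."""
    return sum(d[a][b] for a, b in zip(path, path[1:]))


def greedy_path(d, start=0):
    """Nearest-neighbor traversal from `start`."""
    path, remaining = [start], set(range(len(d))) - {start}
    while remaining:
        nxt = min(sorted(remaining), key=lambda j: d[path[-1]][j])
        path.append(nxt)
        remaining.remove(nxt)
    return path


def optimal_path_from(d, start=0):
    """Exhaustive search for the shortest open path beginning at `start`."""
    others = [i for i in range(len(d)) if i != start]
    best = min(
        ((start,) + p for p in permutations(others)),
        key=lambda p: path_length(d, p),
    )
    return list(best)


# Hypothetical sounds at positions 0, 1, -2 and 3 on a one-dimensional timbre axis.
positions = [0.0, 1.0, -2.0, 3.0]
d = [[abs(p - q) for q in positions] for p in positions]
print(path_length(d, greedy_path(d)), path_length(d, optimal_path_from(d)))
```

In this example the greedy path has total length 8 while the exhaustive search finds a path of length 7, at the cost of examining every ordering of the remaining sounds.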
Enhanced Feature Extraction
Beyond Mean MFCC
Richer feature representations:
COVARIANCE FEATURES:
Use full covariance matrix of MFCCs
Captures temporal variability
More discriminative but higher dimensionality
MULTIPLE TEMPORAL WINDOWS:
Analyze beginning, middle, end separately
Capture sound evolution characteristics
Concatenate features from different segments
SPECTRAL CONTRAST:
Measure difference between peak and valley energies
Captures harmonicity and spectral shape
Complements MFCC information
RHYTHMIC FEATURES:
Onset detection, tempo estimation
For sounds with rhythmic content
Combine with timbral features