Timbral Similarity Browser — User Guide

Content-based audio navigation: analyzes MFCC timbral features, computes acoustic similarity, and creates seamless listening paths through sound collections ordered by perceptual similarity.

Author: Shai Cohen
Affiliation: Department of Music, Bar-Ilan University, Israel
Version: 0.1 (2025)
License: MIT License
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

What this does

This script implements timbral similarity browsing, a content-based approach to navigating sound collections. The process has five stages:

  1. Batch loading: loads all WAV files from a folder, converting stereo files to mono.
  2. MFCC analysis: extracts Mel-Frequency Cepstral Coefficients to capture perceptual timbral characteristics.
  3. Similarity computation: calculates Euclidean distances between mean MFCC vectors.
  4. Path construction: builds a listening sequence with a greedy nearest-neighbor algorithm.
  5. Seamless concatenation: joins the sounds in similarity order for continuous playback.

Result: a single audio file that moves from each sound to its most timbrally similar unvisited neighbor, giving an acoustic journey through the collection.

Key Features:

What is timbral similarity browsing?

Traditional file browsing: alphabetical, chronological, manual organization.
Timbral browsing: automatic organization based on acoustic content.

Benefits:

  1. Discover hidden relationships: finds connections you might not notice visually.
  2. Intuitive navigation: similar sounds are placed close together.
  3. Creative inspiration: reveals unexpected transitions and combinations.
  4. Educational tool: demonstrates acoustic relationships between sounds.
  5. Efficient organization: automatically groups related sounds.

Use cases: sound library management, musical composition, sound design workflows, acoustic research, educational demonstrations, creative exploration.

Technical Implementation:

  1. File loading: case-insensitive WAV detection, stereo-to-mono conversion, progress tracking.
  2. MFCC extraction: 12 coefficients, 15 ms windows, 5 ms steps, Mel filterbank starting at 100 mel with 100 mel spacing, extending up to the Nyquist frequency.
  3. Feature aggregation: mean MFCC vectors across time for each sound.
  4. Distance computation: Euclidean distance between 12-dimensional MFCC vectors.
  5. Path optimization: greedy nearest-neighbor algorithm starting from the first sound.
  6. Output generation: concatenation in similarity order, automatic cleanup, optional playback.

Loading and MFCC extraction scale linearly with the number of files and their durations; the distance matrix and the path search scale quadratically with the number of files.

Quick start

  1. In Praat, select any Sound object (selection ignored).
  2. Run script…timbral_similarity_browser.praat.
  3. Set max_files_to_load (0 = all files, or limit for testing).
  4. Enable auto_play for immediate playback of result.
  5. Click OK → select folder containing WAV files.
  6. Script processes files → displays similarity path → creates concatenated output.
  7. Result: "Timbral_Similarity_Path" sound object for playback/export.
Quick tip: Prepare a folder with 10-50 diverse sound files — musical instruments, environmental sounds, vocal samples work well. Use max_files_to_load = 0 for complete analysis. The script shows real-time progress: "Loading sounds" → "MFCC Analysis" → "Computing Similarity" → "Concatenating". Output appears in Objects window as "Timbral_Similarity_Path". Listen to discover unexpected acoustic relationships between your sounds. For large collections (>100 files), processing may take several minutes. All temporary objects are automatically cleaned up.
Important:

  • Folder organization is critical: the script processes all WAV files in the selected folder.
  • File format: only .wav files are supported.
  • Duration requirements: files shorter than 20 ms are skipped.
  • Memory considerations: large files or many files may cause memory issues.
  • Similarity limitations: MFCC captures timbre but not rhythm, melody, or harmony.
  • Path dependency: the nearest-neighbor path depends on the starting point.
  • Concatenation artifacts: sudden transitions between dissimilar sounds may be jarring.
  • Auto-play: playback is asynchronous, so the script completes while the sound is still playing.

MFCC Theory

Mel-Frequency Cepstral Coefficients

What are MFCCs?

Perceptual audio representation:

MFCC = Mel-Frequency Cepstral Coefficients
Purpose: Capture timbral characteristics in a perceptually meaningful way

Processing pipeline:
Audio → FFT → Mel filterbank → Log compression → DCT → MFCCs

Why MFCC for timbre?
- Models human frequency perception (Mel scale)
- Decorrelates features (DCT step)
- Focuses on spectral envelope (cepstral analysis)
- Robust to pitch variations
- Standard in speech/sound recognition

Our parameters:
Coefficients: 12 (standard for timbral analysis)
Window: 15 ms (captures spectral snapshots)
Step: 5 ms (temporal resolution)
Filterbank: first filter at 100 mel, filters spaced 100 mel apart, extending up to the Nyquist frequency

Mel Scale Perception

Human frequency perception:

Mel scale: Nonlinear frequency scale matching human perception
Below 1 kHz: approximately linear
Above 1 kHz: logarithmic compression

Mel formula:
m = 2595 × log₁₀(1 + f/700)
Where:
f = physical frequency (Hz)
m = perceptual frequency (Mels)

Interpretation:
1000 Hz ≈ 1000 Mels
2000 Hz ≈ 1500 Mels (perceived as less than double)
4000 Hz ≈ 2100 Mels (further compression)

MFCC advantage: Places more resolution where human hearing is more sensitive
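The Mel formula above can be checked numerically. The following is a minimal Python sketch for illustration only; it is not part of the Praat script, and the function name hz_to_mel is hypothetical.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to Mels using m = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# Print the three reference points listed in the text.
for f in (1000.0, 2000.0, 4000.0):
    print(f"{f:.0f} Hz -> {hz_to_mel(f):.0f} Mels")
```

Doubling the frequency from 2000 Hz to 4000 Hz adds far fewer than 2000 Mels, which is the compression the text describes.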

MFCC Processing Pipeline

Step-by-Step Transformation

STEP 1: FRAMING
  Divide audio into short overlapping frames
  Window: 15 ms (e.g., 661 samples at 44.1 kHz)
  Step: 5 ms (about 67% overlap)
  Purpose: Capture time-varying spectral characteristics

STEP 2: WINDOWING
  Apply Hanning window to each frame
  Reduces spectral leakage
  Tapers frame boundaries smoothly

STEP 3: FFT
  Compute Fast Fourier Transform for each frame
  Converts time domain → frequency domain
  Obtain magnitude spectrum

STEP 4: MEL FILTERBANK
  Apply triangular filters spaced according to Mel scale
  Typically 20-40 Mel filters
  Our implementation: Praat default (approx 26 filters)
  Output: Filterbank energies

STEP 5: LOG COMPRESSION
  Take logarithm of filterbank energies
  Models human loudness perception (logarithmic)
  Reduces dynamic range

STEP 6: DCT
  Discrete Cosine Transform on log filterbank energies
  Decorrelates features → independent coefficients
  First coefficient (MFCC0) = overall energy (often discarded)
  Coefficients 1-12 = spectral shape characteristics

OUTPUT: 12 MFCCs per frame capturing timbral information
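The framing arithmetic in Step 1 can be sketched as follows. This is a simplified Python illustration with a hypothetical function name (frame_layout); it assumes frames are laid end to end from the first sample, so Praat's own frame count for a given sound may differ slightly.

```python
def frame_layout(duration_s, sample_rate, window_s=0.015, step_s=0.005):
    """Return (window_samples, step_samples, overlap_fraction, n_frames)."""
    window_samples = int(window_s * sample_rate)
    step_samples = int(step_s * sample_rate)
    overlap = 1.0 - step_s / window_s          # fraction shared by adjacent frames
    total = int(duration_s * sample_rate)
    if total < window_samples:
        n_frames = 0                           # too short for even one frame
    else:
        n_frames = 1 + (total - window_samples) // step_samples
    return window_samples, step_samples, overlap, n_frames

# One second of audio at 44.1 kHz with the guide's window and step.
print(frame_layout(1.0, 44100))
```

Under these assumptions, a sound shorter than one window yields no frames, which is why very short files have to be skipped before analysis.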

MFCC Interpretation

What each coefficient represents:

MFCC Coefficient Meanings:

MFCC 1: Overall spectral tilt (bright vs dark)
High = bright, low = dark

MFCC 2: Spectral shape (peaked vs flat)
High = peaked spectrum, low = flat spectrum

MFCC 3-6: Mid-frequency spectral details
Capture formant-like structures

MFCC 7-12: Fine spectral details
Capture noise characteristics, fine texture

Together: 12-dimensional "timbral fingerprint"
Similar MFCC vectors = similar perceived timbre

Why MFCC for Similarity?

Acoustic Similarity Metrics

Advantages of MFCC-based similarity:

🎵 Perceptual Relevance

MFCC vs other features:

  • Spectral centroid: Only brightness, misses spectral shape
  • Spectral rolloff: High-frequency content only
  • Zero-crossing rate: Noisiness, but poor timbral discrimination
  • Pitch/F0: Captures melody but not timbre
  • MFCC (12 coefficients): Comprehensive timbral representation

Real-world performance:

  • Speech recognition: Distinguishes phonemes
  • Music information retrieval: Genre/instrument classification
  • Sound event detection: Identifies sound types
  • Our use: Timbre-based sound organization

Euclidean Distance in MFCC Space

Similarity computation:

Given two sounds A and B with mean MFCC vectors:
A = [a₁, a₂, ..., a₁₂]
B = [b₁, b₂, ..., b₁₂]

Euclidean distance:
d(A,B) = √[(a₁-b₁)² + (a₂-b₂)² + ... + (a₁₂-b₁₂)²]

Properties:
- d(A,B) ≥ 0 (non-negative)
- d(A,B) = 0 iff A = B (identity)
- d(A,B) = d(B,A) (symmetry)
- d(A,C) ≤ d(A,B) + d(B,C) (triangle inequality)

Interpretation:
Smaller distance = more similar timbre
Larger distance = more different timbre

Normalization: None is applied; the raw mean MFCC values are compared directly, so coefficients with larger numeric ranges contribute more to the distance.
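As a small illustration of the distance formula, here is a Python sketch (not the Praat implementation; the name euclidean_distance is hypothetical) that works for vectors of any equal length, including the 12-dimensional mean MFCC vectors used here.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# A 3-4-5 triangle makes the result easy to verify by hand.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # → 5.0
```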

Analysis Pipeline

File Loading Phase

Batch Processing System

Automated folder processing:

STEP 1: Folder Selection
  User selects folder via file dialog
  Script checks: folder exists, contains .wav files

STEP 2: File Discovery
  files$# = fileNames_caseInsensitive$# (directory$ + "*.wav")
  Case-insensitive: finds .wav, .WAV, .Wav, etc.
  Returns array of filenames

STEP 3: Loading Loop
  FOR i from 1 to nFiles:
    Read from file: directory$ + filename
    Check success: if selected("Sound") ≠ undefined
    Handle stereo: if nChannels > 1 → convert to mono
    Store sound ID: sound'i' = loaded_sound_id
    Progress reporting: "[i/nFiles] Loading: filename"

STEP 4: Validation
  Count successful loads
  Skip/remove failed loads
  Final count: number_of_sounds

Output: Array of mono Sound objects ready for analysis
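The file-discovery step, combined with the max_files_to_load limit, can be sketched outside Praat as follows. This Python illustration is an assumption about the selection logic, not the script's code; select_wav_names is a hypothetical helper, and the alphabetical ordering is chosen here only to make the result predictable.

```python
def select_wav_names(names, max_files_to_load=0):
    """Keep names ending in .wav (any letter case); 0 means no limit."""
    wavs = sorted((n for n in names if n.lower().endswith(".wav")), key=str.lower)
    if max_files_to_load > 0:
        wavs = wavs[:max_files_to_load]
    return wavs

folder_listing = ["b.WAV", "a.wav", "notes.txt", "c.Wav"]
print(select_wav_names(folder_listing))                       # all WAV files
print(select_wav_names(folder_listing, max_files_to_load=2))  # first two only
```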

Stereo-to-Mono Conversion

Channel handling:

Why convert to mono?
- MFCC analysis designed for monaural signals
- Prevents channel imbalance issues
- Reduces computation time
- Standardizes feature extraction

Conversion method: Praat's "Convert to mono"
Algorithm: (left_channel + right_channel) / 2
Simple averaging preserves spectral content

Alternative: Could use only the left channel or more sophisticated downmixing, but averaging works well for most applications

All subsequent processing uses mono signals
Timbre perception is largely monaural
Spatial information is discarded (irrelevant for timbral similarity)
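The averaging rule can be illustrated with plain sample lists. This is a Python sketch of the arithmetic only, with a hypothetical function name (to_mono); in the script the conversion is done by Praat's own command.

```python
def to_mono(left, right):
    """Average two equal-length channels sample by sample."""
    if len(left) != len(right):
        raise ValueError("channels must have the same number of samples")
    return [(l + r) / 2.0 for l, r in zip(left, right)]

print(to_mono([1.0, 0.0], [0.0, 0.0]))   # → [0.5, 0.0]
print(to_mono([0.5], [-0.5]))            # → [0.0]
```

The second call shows one caveat of plain averaging: content that is exactly out of phase between the two channels cancels in the mono result.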

MFCC Extraction Phase

Parameter Settings

MFCC configuration:

Praat "To MFCC" parameters:
Number of coefficients: 12
Window length: 0.015 seconds (15 ms)
Time step: 0.005 seconds (5 ms)
First filter frequency: 100 mel
Distance between filters: 100 mel
Maximum frequency: 0.0 (0 = up to Nyquist)

Why these values?
12 coefficients: Standard for timbral analysis; enough detail, not too many dimensions
15 ms window: Good spectral resolution; captures stationary segments of most sounds
5 ms step: Good temporal resolution; about 67% overlap provides smooth analysis
100 mel first filter, 100 mel spacing: Praat's standard Mel filterbank layout
Maximum frequency 0.0: Lets Praat use the full frequency range up to Nyquist

Frame-Based to Sound-Based

Feature aggregation:

MFCC analysis produces: nFrames × 12 coefficients
But we need: 1 × 12 features per sound

Solution: Temporal averaging
  FOR each coefficient c from 1 to 12:
    values = all frame values for coefficient c
    mean_value = average(values)
    Store mean_value as sound's feature c

Why temporal averaging?
- Reduces each sound to single feature vector
- Captures overall timbral character
- Ignores temporal evolution (for simplicity)
- Works well for steady-state sounds

Alternative: Could use covariance or other statistics, but the mean works surprisingly well for timbral similarity

Result: Each sound represented by 12-dimensional mean MFCC vector
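A minimal Python sketch of the temporal averaging, assuming frames are given as lists of coefficient values and that undefined values appear as None or NaN (both representations are assumptions for this illustration; mean_mfcc is a hypothetical name). Undefined values are left out of the average, so each mean is divided by the count of valid values.

```python
import math

def mean_mfcc(frames, n_coeffs=12):
    """Average each coefficient over frames, ignoring None/NaN entries."""
    means = []
    for c in range(n_coeffs):
        valid = [f[c] for f in frames
                 if f[c] is not None and not math.isnan(f[c])]
        # If no frame has a valid value, the mean itself is undefined (NaN).
        means.append(sum(valid) / len(valid) if valid else float("nan"))
    return means

# Two frames, two coefficients; the second frame has one undefined value.
frames = [[1.0, 2.0], [3.0, float("nan")]]
print(mean_mfcc(frames, n_coeffs=2))  # → [2.0, 2.0]
```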

Quality Control

Error Handling

Robust processing:

Error handling mechanisms:

File loading failures:
- Check if selected("Sound") = undefined
- Skip failed files, continue with others
- Report failure in info window

Short file detection:
- Check duration < 0.02 seconds
- Skip files too short for MFCC analysis
- Threshold: 20 ms, slightly longer than one 15 ms analysis window

MFCC computation failures:
- Check nFrames > 0
- Skip sounds with no valid MFCC frames
- Report skipped files

Undefined value handling:
- Check for undefined MFCC values
- Skip undefined values in mean calculation
- Use count of valid values for averaging

Final validation:
- Ensure at least one sound successfully analyzed
- Exit gracefully if no sounds processed
- Provide clear error messages

Similarity Computation

Distance Matrix Construction

Pairwise Similarity

All-pairs distance computation:

INPUT: n sounds with 12-dimensional MFCC vectors

STEP 1: Create distance matrix D[n×n]
  Initialize all values to 0

STEP 2: Compute distances
  FOR i from 1 to n:
    FOR j from i to n:          // Exploit symmetry
      IF i = j:
        D[i,j] = 0              // Self-distance zero
      ELSE:
        dist = 0
        FOR c from 1 to 12:
          diff = MFCC[i,c] - MFCC[j,c]
          dist = dist + diff × diff
        dist = sqrt(dist)
        D[i,j] = dist
        D[j,i] = dist           // Symmetric

STEP 3: Label matrix
  Set row/column labels to sound names
  For easy interpretation

OUTPUT: Symmetric distance matrix D
  D[i,j] = timbral distance between sound i and j
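The all-pairs computation can be written compactly in Python. This sketch follows the same logic as the pseudocode (compute each pair once, mirror it across the diagonal) but uses 0-based indices and a nested list rather than a Praat TableOfReal; distance_matrix is a hypothetical name.

```python
import math

def distance_matrix(vectors):
    """Symmetric matrix of Euclidean distances between feature vectors."""
    n = len(vectors)
    d = [[0.0] * n for _ in range(n)]          # diagonal stays 0
    for i in range(n):
        for j in range(i + 1, n):              # upper triangle only
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(vectors[i], vectors[j])))
            d[i][j] = dist
            d[j][i] = dist                     # mirror: d is symmetric
    return d

# Three 2-dimensional points on a line, spaced 5 units apart.
for row in distance_matrix([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]):
    print(row)
```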

Matrix Interpretation

Reading the similarity landscape:

Example: distance matrix for 5 sounds
          Violin   Cello   Flute    Drum   Noise
Violin [    0.00    1.23    4.56    8.90   12.34 ]
Cello  [    1.23    0.00    5.12    9.87   13.45 ]
Flute  [    4.56    5.12    0.00    3.45   15.67 ]
Drum   [    8.90    9.87    3.45    0.00   18.90 ]
Noise  [   12.34   13.45   15.67   18.90    0.00 ]

Interpretation:
- Violin and Cello very similar (1.23)
- Flute and Drum somewhat similar (3.45)
- Noise very different from everything (>12)
- Violin/Flute moderately different (4.56)
Matrix captures the "acoustic geography" of the collection

Path Construction Algorithm

Nearest-Neighbor Traversal

Greedy path optimization:

INPUT: Distance matrix D[n×n]

STEP 1: Initialization
  visited# = [0,0,...,0]        // n zeros
  path# = [0,0,...,0]           // n zeros
  current = 1                   // Start with first sound
  path#[1] = 1
  visited#[1] = 1

STEP 2: Greedy traversal
  FOR step from 2 to n:
    min_dist = very_large_number
    next_sound = 0
    // Find closest unvisited sound
    FOR candidate from 1 to n:
      IF visited#[candidate] = 0:
        dist = D[current, candidate]
        IF dist < min_dist:
          min_dist = dist
          next_sound = candidate
    // Add to path
    IF next_sound > 0:
      path#[step] = next_sound
      visited#[next_sound] = 1
      current = next_sound

OUTPUT: path# = ordered list of sound indices
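To make the traversal concrete, the following Python sketch runs the same greedy rule on the five-sound example matrix from the Matrix Interpretation section. It is an illustration with 0-based indices and hypothetical names (nearest_neighbor_path, path_length), and the optional start argument is an addition for demonstration; the guide describes the script as always starting from the first sound.

```python
def nearest_neighbor_path(dist, start=0):
    """Greedy path: always move to the closest sound not yet visited."""
    n = len(dist)
    if n == 0:
        return []
    visited = [False] * n
    visited[start] = True
    path = [start]
    current = start
    for _ in range(n - 1):
        candidates = [j for j in range(n) if not visited[j]]
        # min() keeps the first candidate on ties, like the strict '<' test.
        current = min(candidates, key=lambda j: dist[current][j])
        visited[current] = True
        path.append(current)
    return path

def path_length(dist, path):
    """Sum of distances between consecutive sounds in the path."""
    return sum(dist[a][b] for a, b in zip(path, path[1:]))

NAMES = ["Violin", "Cello", "Flute", "Drum", "Noise"]
D = [
    [0.00, 1.23, 4.56, 8.90, 12.34],
    [1.23, 0.00, 5.12, 9.87, 13.45],
    [4.56, 5.12, 0.00, 3.45, 15.67],
    [8.90, 9.87, 3.45, 0.00, 18.90],
    [12.34, 13.45, 15.67, 18.90, 0.00],
]

for start in (0, 2):
    p = nearest_neighbor_path(D, start)
    print([NAMES[i] for i in p], round(path_length(D, p), 2))
```

Starting from Violin gives Violin, Cello, Flute, Drum, Noise with a total distance of 28.70, while starting from Flute gives Flute, Drum, Violin, Cello, Noise with a total of 27.03, so the choice of starting sound changes both the order and the total path length.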

Algorithm Properties

Nearest-neighbor characteristics:

🔍 Path Optimization

What nearest-neighbor achieves:

  • Local optimality: Each step chooses best immediate neighbor
  • Smooth transitions: Adjacent sounds in path are very similar
  • Computational efficiency: O(n²) vs O(n!) for optimal path
  • Intuitive progression: Gradual timbral evolution

Limitations:

  • Starting point dependency: Different start = different path
  • Local optima: May not find globally optimal path
  • No backtracking: Once visited, never revisited

Alternatives considered:

  • Traveling Salesman Problem (optimal but NP-hard)
  • Multidimensional Scaling (preserves global structure)
  • Hierarchical clustering (tree-based organization)

Output Generation

Seamless Concatenation

Creating the listening experience:

INPUT: path# = ordered sound indices

STEP 1: Select first sound
  first_idx = path#[1]
  select sound'first_idx'

STEP 2: Add remaining sounds
  FOR i from 2 to n:
    idx = path#[i]
    plus sound'idx'             // Add to selection

STEP 3: Concatenate
  Concatenate                   // Praat built-in function
  outputSound = selected("Sound")

STEP 4: Finalize
  Rename: "Timbral_Similarity_Path"
  Report total duration

Why this order matters:
  Creates acoustic journey through similarity space
  Each transition = small timbral change
  Reveals relationships between sounds
  Provides intuitive browsing experience
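The effect of the concatenation step can be sketched on plain sample lists. This Python illustration assumes each sound is a list of samples at a common sample rate and uses a hypothetical helper name (concatenate_in_path_order); in the script the joining is done by Praat's Concatenate command.

```python
def concatenate_in_path_order(sounds, path):
    """Join sample lists end to end, in the order given by path."""
    out = []
    for idx in path:
        out.extend(sounds[idx])
    return out

sounds = [[1, 2], [3], [4, 5]]      # three tiny "sounds"
print(concatenate_in_path_order(sounds, [2, 0, 1]))  # → [4, 5, 1, 2, 3]
```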

Memory Management

Cleanup strategy:

Temporary objects created:
- n Sound objects (original loaded files)
- n MFCC objects (analysis results)
- 1 TableOfReal (MFCC features)
- 1 TableOfReal (distance matrix)

Cleanup process:
  FOR i from 1 to n:
    select sound'i'
    Remove
    select mfcc'i'
    Remove
  select featureTable
  Remove
  select distMatrix
  Remove

Final objects remaining:
- Only outputSound ("Timbral_Similarity_Path")

Benefits:
- Prevents Praat object clutter
- Reduces memory usage
- Clear workspace for user
- Only final result preserved

Parameters & Settings

Loading Parameters

Parameter           Type      Default   Description
max_files_to_load   integer   0         Maximum files to process (0 = all)

Playback Parameters

Parameter   Type      Default   Description
auto_play   boolean   1         Auto-play concatenated result

MFCC Parameters (Hard-coded)

Parameter                  Value     Description
Number of coefficients     12        MFCC feature dimension
Window length              0.015 s   Analysis frame duration
Time step                  0.005 s   Hop between successive frames
First filter frequency     100 mel   Position of the first Mel filter
Distance between filters   100 mel   Spacing of the Mel filters
Maximum frequency          0.0       0 = up to Nyquist (full range)

Parameter Guidance

max_files_to_load strategies:
  • 0 (all files): Complete analysis, recommended for final processing
  • 10-20: Quick testing, familiarization with the tool
  • 50-100: Substantial collections, may take several minutes
  • >100: Large libraries, consider memory and time constraints
Collection preparation tips:
  • File format: Use WAV files for best compatibility
  • Duration: Ensure all files > 20ms duration
  • Content variety: Mix of similar and different sounds works best
  • Naming: Descriptive names help interpret similarity path
  • Normalization: Similar volume levels recommended

Applications

Sound Library Management

Use case: Organizing large sound effects libraries

Technique: Process entire library, discover natural groupings

Benefits: Automatic categorization, reveals hidden relationships

Musical Composition Tool

Use case: Finding smooth transitions between sounds

Technique: Process diverse sound palette, create seamless morphing sequences

Example: Granular synthesis source selection, sample-based composition

Acoustic Research

Use case: Studying perceptual similarity relationships

Technique: Compare algorithm results with human similarity judgments

Insights: Validate MFCC as perceptual model, discover acoustic cues

Educational Demonstration

Use case: Teaching timbre perception and audio features

Technique: Show how similar sounds group together automatically

Learning outcomes: Understand MFCC, acoustic similarity, content-based retrieval

Practical Workflow Examples

🎵 Instrument Family Exploration

Goal: Discover relationships between musical instruments

Setup:

  • Collection: 30 instrument samples (strings, woodwinds, brass, percussion)
  • Files: Single notes, similar pitch, similar duration
  • Parameters: max_files_to_load = 0, auto_play = 1

Result: Path showing smooth transitions between instrument families

🌊 Environmental Sound Journey

Goal: Create natural progression through environmental sounds

Setup:

  • Collection: Water, wind, forest, city, mechanical sounds
  • Files: 10-20 second excerpts
  • Parameters: max_files_to_load = 0, auto_play = 1

Result: Acoustic journey through different environments

🔍 Sound Design Analysis

Goal: Analyze timbral relationships in sound design work

Setup:

  • Collection: 50 synthesized sounds, effects, processed recordings
  • Files: Various synthetic and organic textures
  • Parameters: max_files_to_load = 50, auto_play = 0 (listen manually)

Result: Understanding of timbral palette and transition opportunities

Advanced Techniques

Multi-scale similarity analysis:
  • Stage 1: Run script on entire collection for global structure
  • Stage 2: Run on similar subgroups for fine-grained organization
  • Stage 3: Compare paths at different scales
  • Result: Hierarchical understanding of timbral relationships
Cross-collection analysis:
  • Method: Process multiple folders separately, then compare
  • Comparison: Look for sounds that bridge different collections
  • Application: Finding connections between different sound libraries
  • Result: Unified organization across multiple sources

Troubleshooting Common Issues

Problem: No files loaded
Cause: Folder contains no .wav files, or files corrupted
Solution: Verify folder content, check file formats, try different folder
Problem: Memory errors with large collections
Cause: Too many/large files for available RAM
Solution: Use max_files_to_load to limit, close other applications
Problem: Poor similarity results
Cause: MFCC not capturing relevant features for specific sounds
Solution: Try different feature sets, ensure sounds have clear timbral content
Problem: Clicks at sound boundaries
Cause: Abrupt transitions between dissimilar sounds
Solution: Apply crossfade manually in DAW, or modify script for automatic crossfade
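
For the boundary clicks described above, one possible form of automatic crossfade is a short linear fade between consecutive sounds. The Python sketch below is an assumption about how such a modification could work, not something the script does; crossfade and fade_samples are hypothetical names, and the sounds are plain sample lists at a common sample rate.

```python
def crossfade(a, b, fade_samples):
    """Join a and b, overlapping the end of a with the start of b."""
    n = min(fade_samples, len(a), len(b))
    if n <= 0:
        return list(a) + list(b)               # plain concatenation
    split = len(a) - n
    mixed = []
    for i in range(n):
        w = (i + 1) / (n + 1)                  # weight of b rises toward 1
        mixed.append(a[split + i] * (1.0 - w) + b[i] * w)
    return list(a[:split]) + mixed + list(b[n:])

# A constant 1.0 signal fading into silence over 2 samples.
print(crossfade([1.0] * 4, [0.0] * 4, fade_samples=2))
```

The overlap shortens the joined result by the fade length, so a path with many short sounds loses some total duration; a fade of a few milliseconds is a common starting point.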

Algorithmic Extensions

Alternative Similarity Metrics

Beyond Euclidean Distance

Other distance measures:

COSINE SIMILARITY:
  similarity(A,B) = (A·B) / (||A|| × ||B||)
  Measures angle between vectors, ignores magnitude
  Good when overall energy differences are unimportant

MANHATTAN DISTANCE:
  d(A,B) = ∑|aᵢ - bᵢ|
  More robust to outliers than Euclidean

MAHALANOBIS DISTANCE:
  d(A,B) = √[(A-B)ᵀ Σ⁻¹ (A-B)]
  Accounts for feature correlations
  Requires covariance matrix estimation

DYNAMIC TIME WARPING:
  For comparing temporal evolution
  Aligns sequences of different lengths
  Computationally expensive
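Two of these alternatives are simple enough to sketch directly. The Python functions below are illustrations with hypothetical names, not part of the script. Note that cosine similarity is a similarity (larger means more alike), so a path-finding step built on distances would need something like 1 minus the similarity.

```python
import math

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(manhattan_distance([0.0, 0.0], [3.0, 4.0]))  # → 7.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # → 1.0
```

The second call shows the magnitude-invariance mentioned above: the two vectors differ in length but point the same way, so their similarity is 1.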

Advanced Path Finding

Beyond Nearest-Neighbor

Improved traversal algorithms:

OPTIMAL PATH (TRAVELING SALESMAN):
  Find path that minimizes total distance
  Exact solution by brute force: O(n!), impractical beyond roughly 10 sounds
  Approximations: Genetic algorithms, simulated annealing

SPANNING TREE APPROACH:
  Build Minimum Spanning Tree (Prim's/Kruskal's)
  Traverse tree for smooth path
  Captures global structure better than greedy

MULTIDIMENSIONAL SCALING:
  Project high-dimensional features to 2D/3D
  Find path through low-dimensional space
  Preserves global similarity relationships

HIERARCHICAL ORGANIZATION:
  Build dendrogram using clustering
  Traverse tree for hierarchical browsing
  Provides multiple similarity scales
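The spanning tree approach can be sketched as Prim's algorithm followed by a depth-first preorder walk of the tree. This Python illustration is one possible reading of "traverse tree for smooth path", not an implementation from the script; mst_preorder_path is a hypothetical name, and visiting closer children first is a design choice made here.

```python
def mst_preorder_path(dist, start=0):
    """Build a minimum spanning tree (Prim) and return its preorder walk."""
    n = len(dist)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n      # cheapest known edge into the tree
    parent = [-1] * n
    children = [[] for _ in range(n)]
    best[start] = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            children[parent[u]].append(u)
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    path, stack = [], [start]
    while stack:
        u = stack.pop()
        path.append(u)
        # Push farthest child first so the nearest child is visited next.
        for v in sorted(children[u], key=lambda c: dist[u][c], reverse=True):
            stack.append(v)
    return path

# Same five-sound example matrix as in the Matrix Interpretation section.
D = [
    [0.00, 1.23, 4.56, 8.90, 12.34],
    [1.23, 0.00, 5.12, 9.87, 13.45],
    [4.56, 5.12, 0.00, 3.45, 15.67],
    [8.90, 9.87, 3.45, 0.00, 18.90],
    [12.34, 13.45, 15.67, 18.90, 0.00],
]
print(mst_preorder_path(D))
```

On this small matrix the tree walk happens to give the same order as the greedy path from the first sound; the two methods can diverge on larger collections.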

Enhanced Feature Extraction

Beyond Mean MFCC

Richer feature representations:

COVARIANCE FEATURES:
  Use full covariance matrix of MFCCs
  Captures temporal variability
  More discriminative but higher dimensionality

MULTIPLE TEMPORAL WINDOWS:
  Analyze beginning, middle, end separately
  Capture sound evolution characteristics
  Concatenate features from different segments

SPECTRAL CONTRAST:
  Measure difference between peak and valley energies
  Captures harmonicity and spectral shape
  Complements MFCC information

RHYTHMIC FEATURES:
  Onset detection, tempo estimation
  For sounds with rhythmic content
  Combine with timbral features
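As an illustration of the multiple temporal windows idea, the Python sketch below splits the frame sequence into equal segments, averages each coefficient within a segment, and concatenates the results. The function name segment_means and the equal-thirds split are assumptions for this example; the guide does not specify how the segments would be chosen.

```python
def segment_means(frames, n_segments=3):
    """Concatenate per-segment coefficient means (beginning, middle, end)."""
    n = len(frames)
    if n < n_segments:
        raise ValueError("need at least one frame per segment")
    features = []
    for s in range(n_segments):
        lo = s * n // n_segments
        hi = (s + 1) * n // n_segments
        segment = frames[lo:hi]
        n_coeffs = len(segment[0])
        for c in range(n_coeffs):
            features.append(sum(f[c] for f in segment) / len(segment))
    return features

# Six frames of one coefficient whose value rises over time.
frames = [[1.0], [1.0], [2.0], [2.0], [3.0], [3.0]]
print(segment_means(frames))  # → [1.0, 2.0, 3.0]
```

With 12 coefficients and three segments each sound would be described by 36 values instead of 12, so a sound that changes over time is no longer indistinguishable from a steady one with the same overall mean.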