Self-Similarity Matrix — User Guide
Multi‑feature audio comparison: compute similarity between all pairs of time frames using pitch, pitch+intensity, MFCC, spectral entropy, LPC, or mel features to reveal musical structure, repetitions, and patterns.
What this does
This script computes a self‑similarity matrix (SSM) — a square matrix where each cell (i,j) represents the similarity between the audio at time frame i and time frame j. Unlike spectrograms, which show frequency content over time, SSMs reveal temporal structure: repetitions and musical form appear as blocks, diagonals, and symmetric patterns in the matrix.
Key Features:
- 6 Feature Types — Pitch, Pitch+Intensity, MFCC, Spectral Entropy, LPC, Mel Features
- Speed Optimization — Downsampling, frame decimation, efficient computation
- Automatic Contrast — Intelligently enhances matrix visibility
- Normalization — Feature scaling for meaningful comparison
- Visualization — Draws matrix directly in Praat Picture window
- Interpretive Output — Detailed info window with pattern explanations
Reading the Matrix:
- Axes: Both axes represent time (same scale)
- Diagonal: Always bright — each frame is perfectly similar to itself
- Symmetry: Matrix is symmetric across diagonal (S[i,j] = S[j,i])
- Blocks: Bright blocks indicate repeated sections
- Patterns: Different musical structures create characteristic patterns
Technical Implementation: The script extracts features from audio in overlapping time frames (default 25ms windows, 10ms steps). Features are normalized (unit length for vectors, 0‑1 range for entropy). The SSM is computed as cosine similarity between feature vectors (dot product) or inverse difference for entropy. Matrix values range 0‑1, where 1 = identical, 0 = completely dissimilar. Automatic contrast enhancement applies power scaling based on mean similarity to improve visualization.
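The core computation can be illustrated with a short Python sketch (illustrative only; the script itself is written in Praat's scripting language). It assumes features have already been extracted as one vector per frame:

```python
import numpy as np

def self_similarity_matrix(features):
    """Cosine-similarity SSM for an (n_frames, n_dims) feature array.

    Rows are unit-length normalized, so the dot product of two rows is
    their cosine similarity; non-negative features yield values in 0..1.
    """
    F = np.asarray(features, dtype=float)
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    norms[norms == 0] = 1.0      # leave silent (all-zero) frames untouched
    F = F / norms
    return F @ F.T               # S[i, j] = similarity of frames i and j

# Tiny demo: frames 0 and 1 identical, frame 2 orthogonal
S = self_similarity_matrix([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

Note how the diagonal is exactly 1 and the matrix is symmetric, matching the properties listed above.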
Quick start
- In Praat, select exactly one Sound object.
- Run script… → self_similarity_matrix.praat.
- Choose Feature Type:
- Pitch only (fastest) — for melodic analysis
- MFCC (best quality) — for timbre/structure
- Spectral Entropy — for texture/complexity
- Other features as needed
- Enable use_downsampling (recommended) and set processing_sample_rate (22050 Hz).
- Set frame_decimation for speed: 1=all frames, 2=2× faster, 5=5× faster.
- Enable auto_contrast and draw_matrix (visualization).
- Click OK — processes audio, extracts features, computes SSM.
- Watch Info window for progress and interpretation guide.
- Matrix appears as originalName_SSM_featureName.
- Picture window shows visual SSM (if draw_matrix=1).
Feature Types Explained
1. Pitch Only
🎵 Melodic/Harmonic Similarity
What it captures: Fundamental frequency (F0) over time. Measures melodic/harmonic similarity between frames.
Technical details:
Best for: Monophonic music (melodies), vocal analysis, harmonic progressions.
Limitations: Ignores timbre/spectrum; only works for pitched sounds.
2. Pitch + Intensity
🎚️ Melody + Dynamics
What it captures: Combines pitch (F0) with intensity (RMS energy). Captures both melodic contour and dynamic changes.
Feature vector: [pitch, intensity] for each frame.
Best for: Expressive performances, speech prosody, music with dynamic variations.
Advantage over pitch only: Distinguishes between same pitch at different volumes.
3. MFCC (Mel‑Frequency Cepstral Coefficients)
🎛️ Spectral Timbre (Best Quality)
What it captures: Short‑term power spectrum in perceptually‑warped mel scale, compressed via DCT. Standard for timbre/speaker recognition.
Technical pipeline:
Best for: Music structure analysis, timbre similarity, instrument recognition, general‑purpose audio comparison.
Quality vs speed: Highest quality but slowest computation.
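The pipeline (windowed frame, power spectrum, triangular mel filterbank, log compression, DCT) can be sketched in Python. This is a simplified illustration, not the script's exact filter shapes or normalization; the 440 Hz test tone and Hann window are arbitrary choices for the demo:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_bands=40, n_mfcc=12):
    """Simplified MFCC for one windowed frame:
    power spectrum -> triangular mel filterbank -> log -> DCT-II."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_bands + 2)
    hz_pts = mel_to_hz(mel_pts)
    energies = np.empty(n_bands)
    for b in range(n_bands):
        lo, mid, hi = hz_pts[b], hz_pts[b + 1], hz_pts[b + 2]
        rise = np.clip((freqs - lo) / (mid - lo), 0.0, 1.0)
        fall = np.clip((hi - freqs) / (hi - mid), 0.0, 1.0)
        energies[b] = np.sum(spectrum * np.minimum(rise, fall))
    log_e = np.log(energies + 1e-12)
    n = np.arange(n_bands)
    # DCT-II compresses the 40 log-energies down to n_mfcc coefficients
    return np.array([np.sum(log_e * np.cos(np.pi * k * (2 * n + 1) / (2 * n_bands)))
                     for k in range(n_mfcc)])

sr = 22050
t = np.arange(int(0.025 * sr)) / sr                       # one 25 ms frame
frame = np.sin(2 * np.pi * 440.0 * t) * np.hanning(len(t))
c = mfcc_frame(frame, sr)
```

Dropping the final DCT step yields the "Mel Features" option described below, which is why that option is faster.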
4. Spectral Entropy
📊 Texture/Complexity Measure
What it captures: Shannon entropy of power distribution across frequency bands. Measures spectral "disorder".
Mathematical definition:
Best for: Texture segmentation, noise vs tone detection, complexity changes.
Interpretation: Bright SSM regions = similar complexity; dark = different complexity.
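A plausible sketch of the normalized entropy computation (equal-width linear bands and the 0‑1 normalization are assumptions; the script's exact band layout may differ):

```python
import numpy as np

def spectral_entropy(frame, sr, n_bands=40, fmin=100.0, fmax=8000.0):
    """Shannon entropy of the power distribution over equal-width
    frequency bands, scaled to 0..1 by dividing by log(n_bands)."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    edges = np.linspace(fmin, fmax, n_bands + 1)
    power = np.array([spectrum[(freqs >= edges[b]) & (freqs < edges[b + 1])].sum()
                      for b in range(n_bands)])
    p = power / (power.sum() + 1e-12)
    h = -np.sum(p * np.log(p + 1e-12))
    return h / np.log(n_bands)   # 0 = all power in one band, 1 = evenly spread

sr = 22050
t = np.arange(1024) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)                      # concentrated spectrum
noise = np.random.default_rng(0).standard_normal(1024)     # spread-out spectrum
```

A pure tone concentrates power in one band (low entropy) while white noise spreads it evenly (entropy near 1), matching the values guide later in this document.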
5. LPC (Linear Predictive Coding)
🗣️ Vocal Tract/Formant Structure
What it captures: All‑pole model of spectral envelope. Represents resonances (formants) as predictor coefficients.
Technical details:
Best for: Speech/phonetic analysis, vowel similarity, singing voice, wind instruments.
Order selection: 12‑16 for speech, 10‑12 for music, 8‑10 for simple tones.
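One standard way to estimate LPC coefficients is the autocorrelation method with the Levinson‑Durbin recursion; a compact Python sketch of the textbook algorithm (not necessarily Praat's internal implementation):

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method LPC via Levinson-Durbin.

    Returns A(z) coefficients [1, a1, ..., a_order]; the one-step
    prediction is x_hat[n] = -(a1*x[n-1] + ... + a_order*x[n-order]).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    r = np.array([x[:n - k] @ x[k:] for k in range(order + 1)])  # autocorrelation
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / err                      # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

# Recover a known 2nd-order autoregressive model from its output
rng = np.random.default_rng(1)
e = rng.standard_normal(5000)
x = np.zeros(5000)
for t in range(2, 5000):
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + e[t]
a = lpc(x, 2)   # expect approximately [1, -0.5, 0.3]
```

The recovered coefficients approximate the true model, illustrating why LPC captures resonant (formant-like) structure so compactly.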
6. Mel Features (Simplified MFCC)
🎧 Perceptually‑Weighted Spectrum
What it captures: Log‑energy in mel‑spaced frequency bands, without DCT compression. Faster alternative to MFCC.
Computation:
Best for: Fast timbre comparison, general audio similarity when MFCC is too slow.
Speed advantage: 3‑5× faster than MFCC (no DCT).
Feature Selection Guide
| Analysis Goal | Recommended Feature | Why | Speed |
|---|---|---|---|
| Musical structure | MFCC | Captures timbre changes | Slow |
| Melodic patterns | Pitch only | Focus on F0 contour | Fastest |
| Speech/phonetics | LPC | Models vocal tract | Medium |
| Texture segmentation | Spectral Entropy | Measures complexity | Medium |
| General purpose | Mel Features | Balanced quality/speed | Medium‑Fast |
| Expressive analysis | Pitch+Intensity | Adds dynamics | Fast |
SSM Theory & Computation
Mathematical Foundation
🔢 Similarity Metrics
For vector features (Pitch, MFCC, LPC, Mel):
For scalar features (Spectral Entropy):
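In symbols, the two metrics can be written as follows (the inverse-difference form for entropy is an assumption based on the description above; the script's exact expression may differ):

```latex
S_{\mathrm{vec}}(i,j) = \frac{\mathbf{f}_i \cdot \mathbf{f}_j}{\lVert \mathbf{f}_i \rVert\, \lVert \mathbf{f}_j \rVert}
\qquad\qquad
S_{\mathrm{ent}}(i,j) = 1 - \lvert H_i - H_j \rvert
```

Since feature vectors are unit-normalized and entropies lie in 0‑1, both metrics stay in the 0‑1 range stated earlier.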
Normalization Strategies
📐 Why Normalization Matters
Problem without normalization:
- Pitch values: 100 Hz vs 200 Hz → difference = 100
- Pitch values: 1000 Hz vs 1100 Hz → difference = 100
- But 100 Hz difference is more significant at low frequencies!
Solution: Unit length normalization
Exception: Spectral Entropy — already dimensionless, scaled to 0‑1 range.
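Unit-length normalization is simple to state in code; a minimal sketch of the idea:

```python
import math

def unit_normalize(vec):
    """Scale a feature vector to length 1 so only its direction matters."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

# [100, 50] and [200, 100] point the same way, so they normalize identically
a = unit_normalize([100.0, 50.0])
b = unit_normalize([200.0, 100.0])
```

This is why the cosine similarity of two proportional frames is exactly 1, regardless of overall level.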
Contrast Enhancement
🎨 Automatic Contrast Adjustment
Problem: Raw similarity matrices often have low contrast (most values near 1).
Solution: Power scaling
Why this works:
- High mean similarity → matrix needs more contrast
- Power function expands differences: 0.99²⁰ = 0.82, 0.95²⁰ = 0.36
- Preserves order: if S(i,j) > S(k,l), then S'(i,j) > S'(k,l)
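The power scaling can be verified directly; a small sketch (how the script maps mean similarity to an exponent is its own heuristic, so the exponent is supplied by the caller here):

```python
def enhance_contrast(S, exponent):
    """Power scaling S' = S**exponent: spreads out values crowded
    near 1 while preserving their ordering."""
    return [[v ** exponent for v in row] for row in S]

# The worked numbers from the text, using exponent 20
out = enhance_contrast([[0.99, 0.95]], 20)[0]
```

After scaling, 0.99 and 0.95 (nearly indistinguishable grey values) become about 0.82 and 0.36, and the larger value stays larger.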
Computational Complexity
⚡ O(N²) Warning
Frame count determines computation time:
Optimization strategies:
- Frame decimation: Use every k‑th frame (k=frame_decimation)
- Downsampling: Reduce sample rate → fewer frequency bins
- Feature choice: Some features faster to extract (Pitch vs MFCC)
- Early experiments: Use short excerpts first
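The frame-count arithmetic behind the O(N²) warning, as a quick sketch:

```python
def frame_count(duration_s, time_step=0.01, decimation=1):
    """Number of analysis frames after decimation."""
    return int(duration_s / (time_step * decimation))

def comparisons(n):
    """Unique cell evaluations for a symmetric N x N matrix (diagonal included)."""
    return n * (n + 1) // 2

n_full = frame_count(180)                  # 3-minute recording, all frames
n_dec = frame_count(180, decimation=10)    # every 10th frame
```

With decimation 10, N drops from 18,000 to 1,800 and the work drops by roughly a factor of 100, because the cost is quadratic in N.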
Pattern Interpretation
Basic SSM Anatomy
📐 Matrix Components
Time j →
┌─────────────────────────────┐
│ │
│ Block │
│ (repetition) │
Time i │ ┌─────┐ │
↓ │ │ │ │
│ │ X │ │
│ │ │ │
│ └─────┘ │
│ Diagonal │
│ (self) │
└─────────────────────────────┘
Key elements:
- Diagonal (X): Always bright (S(i,i)=1). Self‑similarity.
- Blocks: Bright squares indicate repeated sections.
- Checkerboard: Alternating pattern indicates ABAB structure.
- Parallel diagonals: Indicate periodic/repeating patterns.
- Dark areas: Dissimilar sections.
Common Musical Structures
🎵 Verse‑Chorus Form
🔄 Rondo Form (ABACA)
🎼 Through‑Composed (ABCD)
Speech & Phonetic Patterns
🗣️ Phoneme Repetition
With LPC or MFCC features:
- Vowel repetitions: Bright blocks for same vowel sounds
- Consonant‑vowel contrast: Dark areas between consonant and vowel regions
- Speaker consistency: Uniform brightness for same speaker
- Phone transitions: Dark diagonals at phone boundaries
Example sentence: "She sells sea shells"
Texture & Complexity Patterns
🌊 Spectral Entropy SSM
Patterns reveal texture changes:
| Pattern | Meaning | Audio Example |
|---|---|---|
| Bright block | Similar complexity | Two noise sections |
| Dark block | Different complexity | Noise vs tone |
| Gradient | Gradual complexity change | Fade from tone to noise |
| Checkerboard | Alternating textures | Tone‑noise‑tone‑noise |
Entropy values guide:
- 0.0‑0.2: Pure tones, simple harmonic sounds
- 0.2‑0.4: Complex tones, voiced speech
- 0.4‑0.6: Mixed sources, consonants
- 0.6‑0.8: Noise‑like, unvoiced fricatives
- 0.8‑1.0: White noise, applause
Path Detection in SSM
🔍 Finding Patterns Manually
Step‑by‑step analysis:
- Identify diagonal: Should be brightest line
- Look for bright off‑diagonal blocks: These indicate repetitions
- Trace block boundaries: Correspond to section boundaries
- Check symmetry: Pattern should be symmetric across diagonal
- Note parallel lines: Indicate periodic structure
- Compare with audio: Click in matrix to hear corresponding frames
Praat interaction:
- SSM matrix is selectable object
- Use View & Edit to explore values
- Row/column numbers correspond to frame indices
- Convert frame index to time: time = frame_index × time_step
Speed Optimization
Frame Decimation
⏱️ Reduce Temporal Resolution
How it works: Use only every k‑th frame for SSM computation.
Temporal resolution trade‑off:
| Decimation | Effective time step | Max detectable frequency | Use case |
|---|---|---|---|
| 1 (none) | 0.01s | 50 Hz | Detailed analysis |
| 2 | 0.02s | 25 Hz | Music structure |
| 5 | 0.05s | 10 Hz | Section detection |
| 10 | 0.10s | 5 Hz | Quick overview |
Guideline: For music structure analysis, frequencies above 10 Hz (0.1s periods) are rarely meaningful. Decimation=5 (0.05s resolution) is usually sufficient.
Downsampling
📉 Reduce Sample Rate
How it works: Resample audio to lower rate before feature extraction.
When to use:
- Yes: MFCC, Spectral Entropy, Mel Features (preserve timbre up to 11kHz)
- Maybe: LPC (if analyzing speech below 5kHz)
- No: Pitch extraction (needs high frequency resolution)
Combined Optimization
🚀 Maximum Speed Strategy
For long audio exploration (>2 minutes):
Practical example: 3‑minute song (180s)
- Full: N = 180/0.01 = 18,000 frames → ≈162M comparisons (N²/2, exploiting symmetry)
- Optimized: N = 180/0.10 = 1,800 frames → ≈1.62M comparisons
- Speedup: 100× just from decimation
Memory Considerations
💾 Matrix Memory Usage
Memory formula: N² × 8 bytes (double precision)
Solutions:
- Frame decimation: Reduces N linearly → reduces memory quadratically
- Process shorter segments: Analyze 30‑60s chunks instead of whole file
- Use 32‑bit floats: Would halve memory, but Praat Matrix objects only store double precision
- External computation: For very large files, use Python/MATLAB
Safe limits for Praat: Keep N < 2000 frames (≈20s at 0.01s step, or 200s at 0.10s step).
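The memory formula in code form, for quick estimates:

```python
def ssm_memory_mb(n_frames, bytes_per_value=8):
    """Approximate SSM storage: an N x N matrix of doubles."""
    return n_frames ** 2 * bytes_per_value / 1e6

safe = ssm_memory_mb(2000)     # the suggested safe limit
large = ssm_memory_mb(18000)   # an undecimated 3-minute recording
```

At the N < 2000 limit the matrix stays around 32 MB; at N = 18,000 it would exceed 2.5 GB, which is why decimation matters for long files.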
Parameters & Settings
Feature‑Specific Parameters
🎛️ Pitch Settings (for options 1‑2)
| Parameter | Default | Range | Description |
|---|---|---|---|
| pitch_floor | 75 Hz | 50‑200 | Minimum F0 to consider. Lower = more sensitive but more errors. |
| pitch_ceiling | 600 Hz | 200‑2000 | Maximum F0. Set to expected vocal/instrument range. |
Recommendations:
- Male speech: 75‑300 Hz
- Female speech: 100‑500 Hz
- Singing: 80‑1000 Hz (tenor to soprano)
- Instrumental: Match instrument range
🎚️ MFCC Settings
| Parameter | Default | Range | Description |
|---|---|---|---|
| number_of_mfcc | 12 | 8‑20 | Number of MFCC coefficients to keep. First coefficient (energy) always included. |
Coefficient meaning:
- c1: Overall energy (often discarded in speech processing)
- c2‑c4: Broad spectral shape
- c5‑c8: Mid‑frequency spectral details
- c9‑c12: Fine spectral details
- c13+: Very fine details (often noise)
Standard values: 12‑13 for speech, 20 for music, 8‑10 for fast processing.
📊 Spectral Entropy Settings
| Parameter | Default | Range | Description |
|---|---|---|---|
| freq_min_entropy | 100 Hz | 0‑1000 | Lowest frequency band for entropy calculation. |
| freq_max_entropy | 8000 Hz | 1000‑Nyquist | Highest frequency band. |
| num_freq_bands | 40 | 10‑100 | Number of frequency bands to divide spectrum into. |
Bandwidth effect: More bands = finer frequency resolution but noisier entropy estimates.
Frequency range tips: For speech, 100‑5000 Hz covers most information. For music, 100‑8000 Hz or higher.
🗣️ LPC Settings
| Parameter | Default | Range | Description |
|---|---|---|---|
| lpc_order | 16 | 8‑20 | Number of LPC coefficients (model order). |
Order selection guide:
- 8‑10: Simple tones, basic formants
- 12‑14: Speech (2‑3 formants), singing voice
- 16‑18: Detailed speech (3‑4 formants), complex tones
- 20+: Very detailed spectral envelope
Rule of thumb: order ≈ sampling_rate/1000 + 2‑4. For 22050 Hz that suggests about 22+2=24, but 16‑18 is usually sufficient.
🎧 Mel Feature Settings
| Parameter | Default | Range | Description |
|---|---|---|---|
| num_mel_bands | 40 | 20‑80 | Number of mel‑spaced frequency bands. |
| freq_min_mel | 100 Hz | 0‑1000 | Lowest mel band center frequency. |
| freq_max_mel | 8000 Hz | 1000‑Nyquist | Highest mel band center frequency. |
Mel spacing: Bands are spaced equally on mel scale (perceptually uniform), not linear Hz.
Band count trade‑off: More bands = finer frequency resolution but larger feature vectors = slower SSM computation.
General Parameters
| Parameter | Default | Description |
|---|---|---|
| time_step | 0.01s | Time between analysis frames (10ms). Smaller = more temporal resolution. |
| window_length | 0.025s | Analysis window length (25ms). Longer = better frequency resolution. |
| use_downsampling | 1 | Resample to processing_sample_rate before analysis. |
| processing_sample_rate | 22050 Hz | Target sample rate if downsampling enabled. |
| frame_decimation | 1 | Use only every k‑th frame (1=all, 2=half, 5=one‑fifth). |
| draw_matrix | 1 | Draw SSM in Picture window after computation. |
| auto_contrast | 1 | Automatically enhance contrast based on mean similarity. |
Parameter Interactions
- time_step vs window_length: window_length should be 2‑3× time_step for proper overlap.
- frame_decimation vs time_step: Effective time resolution = time_step × frame_decimation.
- processing_sample_rate vs freq_max: Ensure freq_max_entropy/freq_max_mel < processing_sample_rate/2.
- num_mel_bands vs number_of_mfcc: MFCCs compress mel bands via DCT; number_of_mfcc should be ≤ num_mel_bands.
- lpc_order vs window_length: Need enough samples for stable LPC estimation: window_length × sample_rate > 2 × lpc_order.
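The interaction rules above can be bundled into a quick sanity check. The parameter names follow the script's form fields, but this checker itself is a sketch, not part of the script:

```python
def check_params(time_step, window_length, sample_rate, freq_max, lpc_order,
                 num_mel_bands, number_of_mfcc):
    """Sanity-check the parameter interactions listed above; returns warnings."""
    warnings = []
    if not (2 * time_step <= window_length <= 3 * time_step):
        warnings.append("window_length should be 2-3x time_step")
    if freq_max >= sample_rate / 2:
        warnings.append("freq_max must stay below Nyquist (sample_rate / 2)")
    if number_of_mfcc > num_mel_bands:
        warnings.append("number_of_mfcc should be <= num_mel_bands")
    if window_length * sample_rate <= 2 * lpc_order:
        warnings.append("window too short for stable LPC at this order")
    return warnings

ok = check_params(0.01, 0.025, 22050, 8000, 16, 40, 12)    # the defaults: no warnings
bad = check_params(0.01, 0.01, 22050, 12000, 16, 40, 50)   # three violations
```

Running a check like this before a long computation avoids wasting minutes on a misconfigured analysis.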
Practical Applications
Music Structure Analysis
🎵 Automatic Section Detection
Workflow:
- Load song (30‑180 seconds)
- Use MFCC features with frame_decimation=2
- Compute SSM
- Identify bright off‑diagonal blocks
- Map block boundaries to time
- Label sections (verse, chorus, bridge)
Example: Pop song (3‑4 minutes)
Advanced: Checkerboard detection — ABAB patterns create checkerboard SSM.
Speech & Phonetics
🗣️ Phoneme Repetition Analysis
Workflow:
- Use LPC features with order=12‑16
- Process short utterance (5‑10 seconds)
- Enable high resolution (frame_decimation=1)
- Look for small bright blocks corresponding to phonemes
Applications:
- Pronunciation consistency: Compare multiple repetitions of same word
- Dialect analysis: Compare vowel realizations across speakers
- Speech therapy: Track consistency of target phonemes
- Forensics: Compare speech samples for speaker similarity
Example: "The cat sat on the mat" — bright blocks for each /æ/ sound.
Sound Design & Composition
🎨 Texture Mapping
Using Spectral Entropy SSM:
Composition tool: Use SSM to ensure desired structure — e.g., symmetric patterns for formal structure, chaotic patterns for experimental works.
Music Information Retrieval Research
🔬 Feature Comparison Study
Research question: Which feature best captures musical similarity for genre X?
Methodology:
- Select representative audio corpus
- Compute SSM with multiple features (Pitch, MFCC, LPC, etc.)
- Compare SSM patterns to ground‑truth annotations
- Quantify accuracy for section boundary detection
- Analyze computational efficiency
Metrics:
- Boundary F‑score: Precision/recall for detecting section boundaries
- Pattern consistency: How well SSM captures known repetitions
- Computation time: Real‑time feasibility
- Memory usage: Practical limitations
Educational Use
📚 Teaching Musical Form
Classroom activity: Visualizing musical structure
- Students bring favorite songs (30‑second excerpts)
- Predict structure (draw expected SSM pattern)
- Compute actual SSM (MFCC, frame_decimation=5 for speed)
- Compare prediction to reality
- Discuss why patterns appear/disappear
Learning outcomes:
- Understand repetition in music
- Recognize common forms (verse‑chorus, rondo, etc.)
- Connect auditory perception to visual representation
- Learn basics of audio feature extraction
Troubleshooting & Tips
Problem: Matrix shows no clear structure
Causes: 1) Audio has no repetitions, 2) Contrast too low, 3) Feature mismatch, 4) Decimation too high
Solutions: Check audio for actual repetitions, enable auto_contrast, try different features, reduce frame_decimation
Problem: Computation is too slow
Causes: Long audio, high sample rate, low decimation, MFCC features
Solutions: Use shorter excerpt, enable downsampling, increase frame_decimation, use Pitch or Mel Features
Problem: Praat runs out of memory or crashes
Causes: Too many frames (N too large)
Solutions: Increase frame_decimation dramatically (10‑20), use shorter audio segment, process in chunks externally
Problem: Patterns don't match expectations
Causes: Wrong feature type for analysis goal
Solutions: Match feature to task: Pitch for melody, MFCC for timbre, Entropy for texture, LPC for speech
Advanced Techniques
- Multi‑feature SSM: Combine features (e.g., concatenate MFCC and Pitch vectors) for richer representation
- Time‑lag matrix: Instead of S(i,j), compute S(i,i+lag) to find periodicities
- Thresholding: Apply threshold to create binary SSM (similar/dissimilar)
- Novelty detection: Compute novelty curve from SSM diagonal band
- Cross‑SSM: Compare two different audio files (not implemented in this script)
- SSM filtering: Apply smoothing or median filtering to reduce noise
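One common way to implement the novelty-detection idea is Foote's checkerboard-kernel method: slide a small +/- block kernel along the SSM diagonal and record the correlation. A Python sketch (this technique is mentioned above but not implemented in the script):

```python
import numpy as np

def novelty_curve(S, kernel_size=8):
    """Foote-style novelty: correlate a checkerboard kernel along the
    diagonal of an (N, N) similarity matrix. Peaks mark points where
    self-similarity changes, i.e. likely section boundaries."""
    half = kernel_size // 2
    sign = np.ones((kernel_size, kernel_size))
    sign[:half, half:] = -1
    sign[half:, :half] = -1          # checkerboard: ++ and -- quadrants
    S = np.asarray(S, dtype=float)
    n = len(S)
    nov = np.zeros(n)
    for i in range(half, n - half):
        nov[i] = np.sum(sign * S[i - half:i + half, i - half:i + half])
    return nov

# Toy SSM with two homogeneous sections and a boundary at frame 10
S = np.zeros((20, 20))
S[:10, :10] = 1.0
S[10:, 10:] = 1.0
nov = novelty_curve(S)
```

On this toy matrix the curve peaks exactly at the section boundary; on real SSMs the peaks are broader, so smoothing and peak-picking are usually applied afterwards.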