Self-Similarity Matrix — User Guide
Multi‑feature audio comparison: compute similarity between all pairs of time frames using pitch, pitch+intensity, MFCC, spectral entropy, LPC, or mel features to reveal musical structure, repetitions, and patterns.
What this does
This script computes a self‑similarity matrix (SSM) — a square matrix where each cell (i,j) represents the similarity between the audio at time frame i and time frame j. Unlike spectrograms, which show frequency content over time, SSMs reveal temporal structure: repetitions and musical form appear as blocks, diagonals, and symmetric patterns in the matrix.
Key Features:
- 6 Feature Types — Pitch, Pitch+Intensity, MFCC, Spectral Entropy, LPC, Mel Features
- Speed Optimization — Downsampling, frame decimation, efficient computation
- Automatic Contrast — Intelligently enhances matrix visibility
- Normalization — Feature scaling for meaningful comparison
- Visualization — Draws matrix directly in Praat Picture window
- Interpretive Output — Detailed info window with pattern explanations
Reading the Matrix:
- Axes: Both axes represent time (same scale)
- Diagonal: Always bright — each frame is perfectly similar to itself
- Symmetry: Matrix is symmetric across diagonal (S[i,j] = S[j,i])
- Blocks: Bright blocks indicate repeated sections
- Patterns: Different musical structures create characteristic patterns
Technical Implementation: The script extracts features from audio in overlapping time frames (default 25ms windows, 10ms steps). Features are normalized (unit length for vectors, 0‑1 range for entropy). The SSM is computed as cosine similarity between feature vectors (dot product) or inverse difference for entropy. Matrix values range 0‑1, where 1 = identical, 0 = completely dissimilar. Automatic contrast enhancement applies power scaling based on mean similarity to improve visualization.
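The core computation can be illustrated with a short Python sketch (illustrative only; the script itself is written in Praat's scripting language). It assumes features have already been extracted as one vector per frame:

```python
import numpy as np

def self_similarity_matrix(features):
    """Cosine-similarity SSM for an (n_frames, n_dims) feature array.

    Rows are unit-length normalized, so the dot product of two rows is
    their cosine similarity; non-negative features yield values in 0..1.
    """
    F = np.asarray(features, dtype=float)
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    norms[norms == 0] = 1.0      # leave silent (all-zero) frames untouched
    F = F / norms
    return F @ F.T               # S[i, j] = similarity of frames i and j

# Tiny demo: frames 0 and 1 identical, frame 2 orthogonal
S = self_similarity_matrix([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

Note how the diagonal is exactly 1 and the matrix is symmetric, matching the properties listed above.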
Quick start
- In Praat, select exactly one Sound object.
- Run script… → self_similarity_matrix.praat.
- Choose Feature Type:
- Pitch only (fastest) — for melodic analysis
- MFCC (best quality) — for timbre/structure
- Spectral Entropy — for texture/complexity
- Other features as needed
- Enable use_downsampling (recommended) and set processing_sample_rate (22050 Hz).
- Set frame_decimation for speed: 1=all frames, 2=2× faster, 5=5× faster.
- Enable auto_contrast and draw_matrix (visualization).
- Click OK — processes audio, extracts features, computes SSM.
- Watch Info window for progress and interpretation guide.
- Matrix appears as originalName_SSM_featureName.
- Picture window shows visual SSM (if draw_matrix=1).
Feature Types Explained
1. Pitch Only
🎵 Melodic/Harmonic Similarity
What it captures: Fundamental frequency (F0) over time. Measures melodic/harmonic similarity between frames.
Technical details:
Best for: Monophonic music (melodies), vocal analysis, harmonic progressions.
Limitations: Ignores timbre/spectrum; only works for pitched sounds.
2. Pitch + Intensity
🎚️ Melody + Dynamics
What it captures: Combines pitch (F0) with intensity (RMS energy). Captures both melodic contour and dynamic changes.
Feature vector: [pitch, intensity] for each frame.
Best for: Expressive performances, speech prosody, music with dynamic variations.
Advantage over pitch only: Distinguishes between same pitch at different volumes.
3. MFCC (Mel‑Frequency Cepstral Coefficients)
🎛️ Spectral Timbre (Best Quality)
What it captures: Short‑term power spectrum in perceptually‑warped mel scale, compressed via DCT. Standard for timbre/speaker recognition.
Technical pipeline:
Best for: Music structure analysis, timbre similarity, instrument recognition, general‑purpose audio comparison.
Quality vs speed: Highest quality but slowest computation.
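The pipeline (windowed frame, power spectrum, triangular mel filterbank, log compression, DCT) can be sketched in Python. This is a simplified illustration, not the script's exact filter shapes or normalization; the 440 Hz test tone and Hann window are arbitrary choices for the demo:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_bands=40, n_mfcc=12):
    """Simplified MFCC for one windowed frame:
    power spectrum -> triangular mel filterbank -> log -> DCT-II."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_bands + 2)
    hz_pts = mel_to_hz(mel_pts)
    energies = np.empty(n_bands)
    for b in range(n_bands):
        lo, mid, hi = hz_pts[b], hz_pts[b + 1], hz_pts[b + 2]
        rise = np.clip((freqs - lo) / (mid - lo), 0.0, 1.0)
        fall = np.clip((hi - freqs) / (hi - mid), 0.0, 1.0)
        energies[b] = np.sum(spectrum * np.minimum(rise, fall))
    log_e = np.log(energies + 1e-12)
    n = np.arange(n_bands)
    # DCT-II compresses the 40 log-energies down to n_mfcc coefficients
    return np.array([np.sum(log_e * np.cos(np.pi * k * (2 * n + 1) / (2 * n_bands)))
                     for k in range(n_mfcc)])

sr = 22050
t = np.arange(int(0.025 * sr)) / sr                       # one 25 ms frame
frame = np.sin(2 * np.pi * 440.0 * t) * np.hanning(len(t))
c = mfcc_frame(frame, sr)
```

Dropping the final DCT step yields the "Mel Features" option described below, which is why that option is faster.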
4. Spectral Entropy
📊 Texture/Complexity Measure
What it captures: Shannon entropy of power distribution across frequency bands. Measures spectral "disorder".
Mathematical definition:
Best for: Texture segmentation, noise vs tone detection, complexity changes.
Interpretation: Bright SSM regions = similar complexity; dark = different complexity.
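A plausible sketch of the normalized entropy computation (equal-width linear bands and the 0‑1 normalization are assumptions; the script's exact band layout may differ):

```python
import numpy as np

def spectral_entropy(frame, sr, n_bands=40, fmin=100.0, fmax=8000.0):
    """Shannon entropy of the power distribution over equal-width
    frequency bands, scaled to 0..1 by dividing by log(n_bands)."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    edges = np.linspace(fmin, fmax, n_bands + 1)
    power = np.array([spectrum[(freqs >= edges[b]) & (freqs < edges[b + 1])].sum()
                      for b in range(n_bands)])
    p = power / (power.sum() + 1e-12)
    h = -np.sum(p * np.log(p + 1e-12))
    return h / np.log(n_bands)   # 0 = all power in one band, 1 = evenly spread

sr = 22050
t = np.arange(1024) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)                      # concentrated spectrum
noise = np.random.default_rng(0).standard_normal(1024)     # spread-out spectrum
```

A pure tone concentrates power in one band (low entropy) while white noise spreads it evenly (entropy near 1), matching the values guide later in this document.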
5. LPC (Linear Predictive Coding)
🗣️ Vocal Tract/Formant Structure
What it captures: All‑pole model of spectral envelope. Represents resonances (formants) as predictor coefficients.
Technical details:
Best for: Speech/phonetic analysis, vowel similarity, singing voice, wind instruments.
Order selection: 12‑16 for speech, 10‑12 for music, 8‑10 for simple tones.
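One standard way to estimate LPC coefficients is the autocorrelation method with the Levinson‑Durbin recursion; a compact Python sketch of the textbook algorithm (not necessarily Praat's internal implementation):

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method LPC via Levinson-Durbin.

    Returns A(z) coefficients [1, a1, ..., a_order]; the one-step
    prediction is x_hat[n] = -(a1*x[n-1] + ... + a_order*x[n-order]).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    r = np.array([x[:n - k] @ x[k:] for k in range(order + 1)])  # autocorrelation
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / err                      # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

# Recover a known 2nd-order autoregressive model from its output
rng = np.random.default_rng(1)
e = rng.standard_normal(5000)
x = np.zeros(5000)
for t in range(2, 5000):
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + e[t]
a = lpc(x, 2)   # expect approximately [1, -0.5, 0.3]
```

The recovered coefficients approximate the true model, illustrating why LPC captures resonant (formant-like) structure so compactly.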
6. Mel Features (Simplified MFCC)
🎧 Perceptually‑Weighted Spectrum
What it captures: Log‑energy in mel‑spaced frequency bands, without DCT compression. Faster alternative to MFCC.
Computation:
Best for: Fast timbre comparison, general audio similarity when MFCC is too slow.
Speed advantage: 3‑5× faster than MFCC (no DCT).
Feature Selection Guide
| Analysis Goal | Recommended Feature | Why | Speed |
|---|---|---|---|
| Musical structure | MFCC | Captures timbre changes | Slow |
| Melodic patterns | Pitch only | Focus on F0 contour | Fastest |
| Speech/phonetics | LPC | Models vocal tract | Medium |
| Texture segmentation | Spectral Entropy | Measures complexity | Medium |
| General purpose | Mel Features | Balanced quality/speed | Medium‑Fast |
| Expressive analysis | Pitch+Intensity | Adds dynamics | Fast |
SSM Theory & Computation
Mathematical Foundation
🔢 Similarity Metrics
For vector features (Pitch, MFCC, LPC, Mel):
For scalar features (Spectral Entropy):
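In symbols, the two metrics can be written as follows (the inverse-difference form for entropy is an assumption based on the description above; the script's exact expression may differ):

```latex
S_{\mathrm{vec}}(i,j) = \frac{\mathbf{f}_i \cdot \mathbf{f}_j}{\lVert \mathbf{f}_i \rVert\, \lVert \mathbf{f}_j \rVert}
\qquad\qquad
S_{\mathrm{ent}}(i,j) = 1 - \lvert H_i - H_j \rvert
```

Since feature vectors are unit-normalized and entropies lie in 0‑1, both metrics stay in the 0‑1 range stated earlier.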
Normalization Strategies
📐 Why Normalization Matters
Problem without normalization:
- Pitch values: 100 Hz vs 200 Hz → difference = 100
- Pitch values: 1000 Hz vs 1100 Hz → difference = 100
- But 100 Hz difference is more significant at low frequencies!
Solution: Unit length normalization
Exception: Spectral Entropy — already dimensionless, scaled to 0‑1 range.
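Unit-length normalization is simple to state in code; a minimal sketch of the idea:

```python
import math

def unit_normalize(vec):
    """Scale a feature vector to length 1 so only its direction matters."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

# [100, 50] and [200, 100] point the same way, so they normalize identically
a = unit_normalize([100.0, 50.0])
b = unit_normalize([200.0, 100.0])
```

This is why the cosine similarity of two proportional frames is exactly 1, regardless of overall level.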
Contrast Enhancement
🎨 Automatic Contrast Adjustment
Problem: Raw similarity matrices often have low contrast (most values near 1).
Solution: Power scaling
Why this works:
- High mean similarity → matrix needs more contrast
- Power function expands differences: 0.99²⁰ = 0.82, 0.95²⁰ = 0.36
- Preserves order: if S(i,j) > S(k,l), then S'(i,j) > S'(k,l)
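The power scaling can be verified directly; a small sketch (how the script maps mean similarity to an exponent is its own heuristic, so the exponent is supplied by the caller here):

```python
def enhance_contrast(S, exponent):
    """Power scaling S' = S**exponent: spreads out values crowded
    near 1 while preserving their ordering."""
    return [[v ** exponent for v in row] for row in S]

# The worked numbers from the text, using exponent 20
out = enhance_contrast([[0.99, 0.95]], 20)[0]
```

After scaling, 0.99 and 0.95 (nearly indistinguishable grey values) become about 0.82 and 0.36, and the larger value stays larger.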
Computational Complexity
⚡ O(N²) Warning
Frame count determines computation time:
Optimization strategies:
- Frame decimation: Use every k‑th frame (k=frame_decimation)
- Downsampling: Reduce sample rate → fewer frequency bins
- Feature choice: Some features faster to extract (Pitch vs MFCC)
- Early experiments: Use short excerpts first
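The frame-count arithmetic behind the O(N²) warning, as a quick sketch:

```python
def frame_count(duration_s, time_step=0.01, decimation=1):
    """Number of analysis frames after decimation."""
    return int(duration_s / (time_step * decimation))

def comparisons(n):
    """Unique cell evaluations for a symmetric N x N matrix (diagonal included)."""
    return n * (n + 1) // 2

n_full = frame_count(180)                  # 3-minute recording, all frames
n_dec = frame_count(180, decimation=10)    # every 10th frame
```

With decimation 10, N drops from 18,000 to 1,800 and the work drops by roughly a factor of 100, because the cost is quadratic in N.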
Pattern Interpretation
Basic SSM Anatomy
📐 Matrix Components
Time j →
┌─────────────────────────────┐
│ │
│ Block │
│ (repetition) │
Time i │ ┌─────┐ │
↓ │ │ │ │
│ │ X │ │
│ │ │ │
│ └─────┘ │
│ Diagonal │
│ (self) │
└─────────────────────────────┘
Key elements:
- Diagonal (X): Always bright (S(i,i)=1). Self‑similarity.
- Blocks: Bright squares indicate repeated sections.
- Checkerboard: Alternating pattern indicates ABAB structure.
- Parallel diagonals: Indicate periodic/repeating patterns.
- Dark areas: Dissimilar sections.
Common Musical Structures
🎵 Verse‑Chorus Form
🔄 Rondo Form (ABACA)
🎼 Through‑Composed (ABCD)
Speech & Phonetic Patterns
🗣️ Phoneme Repetition
With LPC or MFCC features:
- Vowel repetitions: Bright blocks for same vowel sounds
- Consonant‑vowel contrast: Dark areas between consonant and vowel regions
- Speaker consistency: Uniform brightness for same speaker
- Phone transitions: Dark diagonals at phone boundaries
Example sentence: "She sells sea shells"
Texture & Complexity Patterns
🌊 Spectral Entropy SSM
Patterns reveal texture changes:
| Pattern | Meaning | Audio Example |
|---|---|---|
| Bright block | Similar complexity | Two noise sections |
| Dark block | Different complexity | Noise vs tone |
| Gradient | Gradual complexity change | Fade from tone to noise |
| Checkerboard | Alternating textures | Tone‑noise‑tone‑noise |
Entropy values guide:
- 0.0‑0.2: Pure tones, simple harmonic sounds
- 0.2‑0.4: Complex tones, voiced speech
- 0.4‑0.6: Mixed sources, consonants
- 0.6‑0.8: Noise‑like, unvoiced fricatives
- 0.8‑1.0: White noise, applause
Path Detection in SSM
🔍 Finding Patterns Manually
Step‑by‑step analysis:
- Identify diagonal: Should be brightest line
- Look for bright off‑diagonal blocks: These indicate repetitions
- Trace block boundaries: Correspond to section boundaries
- Check symmetry: Pattern should be symmetric across diagonal
- Note parallel lines: Indicate periodic structure
- Compare with audio: Click in matrix to hear corresponding frames
Praat interaction:
- SSM matrix is selectable object
- Use View & Edit to explore values
- Row/column numbers correspond to frame indices
- Convert frame index to time: time = frame_index × time_step
Speed Optimization
Frame Decimation
⏱️ Reduce Temporal Resolution
How it works: Use only every k‑th frame for SSM computation.
Temporal resolution trade‑off:
| Decimation | Effective time step | Max detectable frequency | Use case |
|---|---|---|---|
| 1 (none) | 0.01s | 50 Hz | Detailed analysis |
| 2 | 0.02s | 25 Hz | Music structure |
| 5 | 0.05s | 10 Hz | Section detection |
| 10 | 0.10s | 5 Hz | Quick overview |
Guideline: For music structure analysis, frequencies above 10 Hz (0.1s periods) are rarely meaningful. Decimation=5 (0.05s resolution) is usually sufficient.
Downsampling
📉 Reduce Sample Rate
How it works: Resample audio to lower rate before feature extraction.
When to use:
- Yes: MFCC, Spectral Entropy, Mel Features (preserve timbre up to 11kHz)
- Maybe: LPC (if analyzing speech below 5kHz)
- No: Pitch extraction (needs high frequency resolution)
Combined Optimization
🚀 Maximum Speed Strategy
For long audio exploration (>2 minutes):
Practical example: 3‑minute song (180s)
- Full: N = 180/0.01 = 18,000 frames → ≈162M comparisons (N²/2, exploiting symmetry)
- Optimized: N = 180/0.10 = 1,800 frames → ≈1.62M comparisons
- Speedup: 100× just from decimation
Memory Considerations
💾 Matrix Memory Usage
Memory formula: N² × 8 bytes (double precision)
Solutions:
- Frame decimation: Reduces N linearly → reduces memory quadratically
- Process shorter segments: Analyze 30‑60s chunks instead of whole file
- Use 32‑bit floats: Would halve memory, but Praat Matrix objects only store double precision
- External computation: For very large files, use Python/MATLAB
Safe limits for Praat: Keep N < 2000 frames (≈20s at 0.01s step, or 200s at 0.10s step).
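The memory formula in code form, for quick estimates:

```python
def ssm_memory_mb(n_frames, bytes_per_value=8):
    """Approximate SSM storage: an N x N matrix of doubles."""
    return n_frames ** 2 * bytes_per_value / 1e6

safe = ssm_memory_mb(2000)     # the suggested safe limit
large = ssm_memory_mb(18000)   # an undecimated 3-minute recording
```

At the N < 2000 limit the matrix stays around 32 MB; at N = 18,000 it would exceed 2.5 GB, which is why decimation matters for long files.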
Parameters & Settings
Feature‑Specific Parameters
🎛️ Pitch Settings (for options 1‑2)
| Parameter | Default | Range | Description |
|---|---|---|---|
| pitch_floor | 75 Hz | 50‑200 | Minimum F0 to consider. Lower = more sensitive but more errors. |
| pitch_ceiling | 600 Hz | 200‑2000 | Maximum F0. Set to expected vocal/instrument range. |
Recommendations:
- Male speech: 75‑300 Hz
- Female speech: 100‑500 Hz
- Singing: 80‑1000 Hz (tenor to soprano)
- Instrumental: Match instrument range
🎚️ MFCC Settings
| Parameter | Default | Range | Description |
|---|---|---|---|
| number_of_mfcc | 12 | 8‑20 | Number of MFCC coefficients to keep. First coefficient (energy) always included. |
Coefficient meaning:
- c1: Overall energy (often discarded in speech processing)
- c2‑c4: Broad spectral shape
- c5‑c8: Mid‑frequency spectral details
- c9‑c12: Fine spectral details
- c13+: Very fine details (often noise)
Standard values: 12‑13 for speech, 20 for music, 8‑10 for fast processing.
📊 Spectral Entropy Settings
| Parameter | Default | Range | Description |
|---|---|---|---|
| freq_min_entropy | 100 Hz | 0‑1000 | Lowest frequency band for entropy calculation. |
| freq_max_entropy | 8000 Hz | 1000‑Nyquist | Highest frequency band. |
| num_freq_bands | 40 | 10‑100 | Number of frequency bands to divide spectrum into. |
Bandwidth effect: More bands = finer frequency resolution but noisier entropy estimates.
Frequency range tips: For speech, 100‑5000 Hz covers most information. For music, 100‑8000 Hz or higher.
🗣️ LPC Settings
| Parameter | Default | Range | Description |
|---|---|---|---|
| lpc_order | 16 | 8‑20 | Number of LPC coefficients (model order). |
Order selection guide:
- 8‑10: Simple tones, basic formants
- 12‑14: Speech (2‑3 formants), singing voice
- 16‑18: Detailed speech (3‑4 formants), complex tones
- 20+: Very detailed spectral envelope
Rule of thumb: order ≈ sampling_rate/1000 + 2‑4. For 22050 Hz that suggests about 22+2=24, but 16‑18 is usually sufficient.
🎧 Mel Feature Settings
| Parameter | Default | Range | Description |
|---|---|---|---|
| num_mel_bands | 40 | 20‑80 | Number of mel‑spaced frequency bands. |
| freq_min_mel | 100 Hz | 0‑1000 | Lowest mel band center frequency. |
| freq_max_mel | 8000 Hz | 1000‑Nyquist | Highest mel band center frequency. |
Mel spacing: Bands are spaced equally on mel scale (perceptually uniform), not linear Hz.
Band count trade‑off: More bands = finer frequency resolution but larger feature vectors = slower SSM computation.
General Parameters
| Parameter | Default | Description |
|---|---|---|
| time_step | 0.01s | Time between analysis frames (10ms). Smaller = more temporal resolution. |
| window_length | 0.025s | Analysis window length (25ms). Longer = better frequency resolution. |
| use_downsampling | 1 | Resample to processing_sample_rate before analysis. |
| processing_sample_rate | 22050 Hz | Target sample rate if downsampling enabled. |
| frame_decimation | 1 | Use only every k‑th frame (1=all, 2=half, 5=one‑fifth). |
| draw_matrix | 1 | Draw SSM in Picture window after computation. |
| auto_contrast | 1 | Automatically enhance contrast based on mean similarity. |
Parameter Interactions
- time_step vs window_length: window_length should be 2‑3× time_step for proper overlap.
- frame_decimation vs time_step: Effective time resolution = time_step × frame_decimation.
- processing_sample_rate vs freq_max: Ensure freq_max_entropy/freq_max_mel < processing_sample_rate/2.
- num_mel_bands vs number_of_mfcc: MFCCs compress mel bands via DCT; number_of_mfcc should be ≤ num_mel_bands.
- lpc_order vs window_length: Need enough samples for stable LPC estimation: window_length × sample_rate > 2 × lpc_order.
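The interaction rules above can be bundled into a quick sanity check. The parameter names follow the script's form fields, but this checker itself is a sketch, not part of the script:

```python
def check_params(time_step, window_length, sample_rate, freq_max, lpc_order,
                 num_mel_bands, number_of_mfcc):
    """Sanity-check the parameter interactions listed above; returns warnings."""
    warnings = []
    if not (2 * time_step <= window_length <= 3 * time_step):
        warnings.append("window_length should be 2-3x time_step")
    if freq_max >= sample_rate / 2:
        warnings.append("freq_max must stay below Nyquist (sample_rate / 2)")
    if number_of_mfcc > num_mel_bands:
        warnings.append("number_of_mfcc should be <= num_mel_bands")
    if window_length * sample_rate <= 2 * lpc_order:
        warnings.append("window too short for stable LPC at this order")
    return warnings

ok = check_params(0.01, 0.025, 22050, 8000, 16, 40, 12)    # the defaults: no warnings
bad = check_params(0.01, 0.01, 22050, 12000, 16, 40, 50)   # three violations
```

Running a check like this before a long computation avoids wasting minutes on a misconfigured analysis.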
Practical Applications
Music Structure Analysis
🎵 Automatic Section Detection
Workflow:
- Load song (30‑180 seconds)
- Use MFCC features with frame_decimation=2
- Compute SSM
- Identify bright off‑diagonal blocks
- Map block boundaries to time
- Label sections (verse, chorus, bridge)
Example: Pop song (3‑4 minutes)
Advanced: Checkerboard detection — ABAB patterns create checkerboard SSM.
Speech & Phonetics
🗣️ Phoneme Repetition Analysis
Workflow:
- Use LPC features with order=12‑16
- Process short utterance (5‑10 seconds)
- Enable high resolution (frame_decimation=1)
- Look for small bright blocks corresponding to phonemes
Applications:
- Pronunciation consistency: Compare multiple repetitions of same word
- Dialect analysis: Compare vowel realizations across speakers
- Speech therapy: Track consistency of target phonemes
- Forensics: Compare speech samples for speaker similarity
Example: "The cat sat on the mat" — bright blocks for each /æ/ sound.
Sound Design & Composition
🎨 Texture Mapping
Using Spectral Entropy SSM:
Composition tool: Use SSM to ensure desired structure — e.g., symmetric patterns for formal structure, chaotic patterns for experimental works.
Music Information Retrieval Research
🔬 Feature Comparison Study
Research question: Which feature best captures musical similarity for genre X?
Methodology:
- Select representative audio corpus
- Compute SSM with multiple features (Pitch, MFCC, LPC, etc.)
- Compare SSM patterns to ground‑truth annotations
- Quantify accuracy for section boundary detection
- Analyze computational efficiency
Metrics:
- Boundary F‑score: Precision/recall for detecting section boundaries
- Pattern consistency: How well SSM captures known repetitions
- Computation time: Real‑time feasibility
- Memory usage: Practical limitations
Educational Use
📚 Teaching Musical Form
Classroom activity: Visualizing musical structure
- Students bring favorite songs (30‑second excerpts)
- Predict structure (draw expected SSM pattern)
- Compute actual SSM (MFCC, frame_decimation=5 for speed)
- Compare prediction to reality
- Discuss why patterns appear/disappear
Learning outcomes:
- Understand repetition in music
- Recognize common forms (verse‑chorus, rondo, etc.)
- Connect auditory perception to visual representation
- Learn basics of audio feature extraction
Troubleshooting & Tips
Problem: Matrix shows no clear structure
Causes: 1) Audio has no repetitions, 2) Contrast too low, 3) Feature mismatch, 4) Decimation too high
Solutions: Check audio for actual repetitions, enable auto_contrast, try different features, reduce frame_decimation
Problem: Computation is too slow
Causes: Long audio, high sample rate, low decimation, MFCC features
Solutions: Use shorter excerpt, enable downsampling, increase frame_decimation, use Pitch or Mel Features
Problem: Praat runs out of memory or crashes
Causes: Too many frames (N too large)
Solutions: Increase frame_decimation dramatically (10‑20), use shorter audio segment, process in chunks externally
Problem: Patterns don't match expectations
Causes: Wrong feature type for analysis goal
Solutions: Match feature to task: Pitch for melody, MFCC for timbre, Entropy for texture, LPC for speech
Advanced Techniques
- Multi‑feature SSM: Combine features (e.g., concatenate MFCC and Pitch vectors) for richer representation
- Time‑lag matrix: Instead of S(i,j), compute S(i,i+lag) to find periodicities
- Thresholding: Apply threshold to create binary SSM (similar/dissimilar)
- Novelty detection: Compute novelty curve from SSM diagonal band
- Cross‑SSM: Compare two different audio files (not implemented in this script)
- SSM filtering: Apply smoothing or median filtering to reduce noise
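One common way to implement the novelty-detection idea is Foote's checkerboard-kernel method: slide a small +/- block kernel along the SSM diagonal and record the correlation. A Python sketch (this technique is mentioned above but not implemented in the script):

```python
import numpy as np

def novelty_curve(S, kernel_size=8):
    """Foote-style novelty: correlate a checkerboard kernel along the
    diagonal of an (N, N) similarity matrix. Peaks mark points where
    self-similarity changes, i.e. likely section boundaries."""
    half = kernel_size // 2
    sign = np.ones((kernel_size, kernel_size))
    sign[:half, half:] = -1
    sign[half:, :half] = -1          # checkerboard: ++ and -- quadrants
    S = np.asarray(S, dtype=float)
    n = len(S)
    nov = np.zeros(n)
    for i in range(half, n - half):
        nov[i] = np.sum(sign * S[i - half:i + half, i - half:i + half])
    return nov

# Toy SSM with two homogeneous sections and a boundary at frame 10
S = np.zeros((20, 20))
S[:10, :10] = 1.0
S[10:, 10:] = 1.0
nov = novelty_curve(S)
```

On this toy matrix the curve peaks exactly at the section boundary; on real SSMs the peaks are broader, so smoothing and peak-picking are usually applied afterwards.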