Self-Similarity Matrix — User Guide

Multi‑feature audio comparison: compute similarity between all time frames using pitch, MFCC, spectral entropy, LPC, or mel features to reveal musical structure, repetitions, and patterns.

Technique: Self-Similarity Analysis
Implementation: Praat Script
Category: Music Information Retrieval
Version: Complete
Based on: Müller, M. (2015). Fundamentals of Music Processing. Springer.

What this does

This script computes a self‑similarity matrix (SSM) — a square matrix where each cell (i,j) represents the similarity between the audio at time frame i and time frame j. Unlike a spectrogram, which shows frequency content over time, an SSM reveals temporal structure: repetitions, patterns, and musical form appear as blocks, stripes, and symmetries in the matrix.

What is a self‑similarity matrix? An SSM is a fundamental tool in Music Information Retrieval (MIR):
  • Axis: Both axes represent time (same scale)
  • Diagonal: Always bright — each frame is perfectly similar to itself
  • Symmetry: Matrix is symmetric across diagonal (S[i,j] = S[j,i])
  • Blocks: Bright blocks indicate repeated sections
  • Patterns: Different musical structures create characteristic patterns
SSMs are used for: structure analysis, cover song detection, pattern finding, music segmentation.

Technical Implementation: The script extracts features from audio in overlapping time frames (default 25ms windows, 10ms steps). Features are normalized (unit length for vectors, 0‑1 range for entropy). The SSM is computed as cosine similarity between feature vectors (dot product) or inverse difference for entropy. Matrix values range 0‑1, where 1 = identical, 0 = completely dissimilar. Automatic contrast enhancement applies power scaling based on mean similarity to improve visualization.

Quick start

  1. In Praat, select exactly one Sound object.
  2. Open and run self_similarity_matrix.praat (Praat menu: Open Praat script…, then Run).
  3. Choose Feature Type:
    • Pitch only (fastest) — for melodic analysis
    • MFCC (best quality) — for timbre/structure
    • Spectral Entropy — for texture/complexity
    • Other features as needed
  4. Enable use_downsampling (recommended) and set processing_sample_rate (22050 Hz).
  5. Set frame_decimation for speed: 1 = all frames; decimation by k gives roughly k²× faster SSM computation (2 ≈ 4× faster, 5 ≈ 25× faster).
  6. Enable auto_contrast and draw_matrix (visualization).
  7. Click OK — processes audio, extracts features, computes SSM.
  8. Watch Info window for progress and interpretation guide.
  9. Matrix appears as originalName_SSM_featureName.
  10. Picture window shows visual SSM (if draw_matrix=1).
Quick tips:
  • For musical structure analysis, use MFCC (best quality) with frame_decimation=2 and use_downsampling=1.
  • For speech/phonetic analysis, use LPC with order=12–16.
  • For texture segmentation, use Spectral Entropy.
  • Start with short sounds (10–30 s) to understand the patterns.
  • Check the Info window after processing — it explains what patterns to look for.
  • Use auto_contrast=1 for the best visualization.
  • For very long files (>2 min), use frame_decimation=5–10.
  • The diagonal should always be bright — if it is not, the contrast is too low.
Important:
  • Computation time: SSM computation is O(N²), where N = number of frames; 1000 frames mean on the order of a million comparisons. Use frame_decimation!
  • Memory usage: an SSM for N frames requires an N×N matrix; 1000 frames → 1M cells (8 MB).
  • Feature choice matters: different features reveal different structures — Pitch for melody, MFCC for timbre, Entropy for texture.
  • Contrast interpretation: auto_contrast modifies values for visualization only; the original similarity values are preserved in the Matrix object.
  • Frame decimation: higher values speed up computation but reduce temporal resolution.
  • Downsampling: reduces the frequency range; keep it for timbre/entropy features, disable it for pitch analysis.

Feature Types Explained

1. Pitch Only

🎵 Melodic/Harmonic Similarity

What it captures: Fundamental frequency (F0) over time. Measures melodic/harmonic similarity between frames.

Technical details:

Praat's pitch extraction algorithm (autocorrelation method):
  • Time step: 0.01 s (10 ms)
  • Pitch range: 75–600 Hz (adjustable)
  • Handles voiced/unvoiced regions
  • Undefined values treated as 0

Similarity calculation:
  • Normalized pitch values (unit length)
  • Cosine similarity: S(i,j) = pitch_i · pitch_j

Best for: Monophonic music (melodies), vocal analysis, harmonic progressions.

Limitations: Ignores timbre/spectrum; only works for pitched sounds.

2. Pitch + Intensity

🎚️ Melody + Dynamics

What it captures: Combines pitch (F0) with intensity (RMS energy). Captures both melodic contour and dynamic changes.

Feature vector: [pitch, intensity] for each frame.

Normalization:
  • Unit length normalization: v = [pitch, intensity] / ||v||
  • Ensures equal weight to pitch and intensity

Similarity: cosine similarity in 2D space.

Best for: Expressive performances, speech prosody, music with dynamic variations.

Advantage over pitch only: Distinguishes between same pitch at different volumes.

3. MFCC (Mel‑Frequency Cepstral Coefficients)

🎛️ Spectral Timbre (Best Quality)

What it captures: Short‑term power spectrum in perceptually‑warped mel scale, compressed via DCT. Standard for timbre/speaker recognition.

Technical pipeline:

1. Frame audio (25 ms windows, 10 ms hop)
2. Compute power spectrum
3. Apply mel filterbank (40 triangular filters)
4. Log compression
5. Discrete Cosine Transform (DCT)
6. Keep first N coefficients (default 12)

MFCC vector: [c1, c2, ..., c12], where c1 ≈ overall energy and c2–c12 describe spectral shape.

Best for: Music structure analysis, timbre similarity, instrument recognition, general‑purpose audio comparison.

Quality vs speed: Highest quality but slowest computation.

4. Spectral Entropy

📊 Texture/Complexity Measure

What it captures: Shannon entropy of power distribution across frequency bands. Measures spectral "disorder".

Mathematical definition:

H = − Σ (pₖ × ln(pₖ)), where pₖ = power_in_bandₖ / total_power

Properties:
  • High entropy = noisy/complex spectrum (white noise → maximum entropy)
  • Low entropy = tonal/simple spectrum (sine wave → zero entropy)
  • Range: 0 to ln(num_bands)

Normalization: scale to the 0–1 range (divide by ln(num_bands)).
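The entropy definition above can be checked numerically. This is an illustrative pure-Python sketch of the formula, not the Praat script's own code:

```python
import math

def spectral_entropy(band_powers):
    """Shannon entropy of a band-power distribution, normalized to [0, 1]."""
    total = sum(band_powers)
    if total <= 0:
        return 0.0
    probs = [p / total for p in band_powers]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(band_powers))  # divide by max entropy ln(num_bands)

# A flat (noise-like) spectrum maximizes entropy; a single peak minimizes it.
flat = spectral_entropy([1.0] * 40)          # white-noise-like -> 1.0
peak = spectral_entropy([0.0] * 39 + [1.0])  # sine-like -> 0.0
```

As the comments note, a uniform power distribution over 40 bands reaches the maximum ln(40), so the normalized value is exactly 1; a single occupied band gives 0.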

Best for: Texture segmentation, noise vs tone detection, complexity changes.

Interpretation: Bright SSM regions = similar complexity; dark = different complexity.

5. LPC (Linear Predictive Coding)

🗣️ Vocal Tract/Formant Structure

What it captures: All‑pole model of spectral envelope. Represents resonances (formants) as predictor coefficients.

Technical details:

LPC models the signal as: x[n] = Σ aₖ·x[n−k] + e[n]
where:
  aₖ = LPC coefficients (order = 12–16)
  e[n] = residual/excitation

Feature vector: [a₁, a₂, ..., a_order] — represents the spectral envelope (ignores excitation).
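For intuition, the standard way such coefficients are estimated is the autocorrelation method with Levinson–Durbin recursion. The script relies on Praat's built-in LPC analysis; this pure-Python sketch only illustrates the underlying math:

```python
def lpc(signal, order):
    """Estimate predictor coefficients a_k with x[n] ~ sum_k a_k * x[n-k]
    (autocorrelation method, Levinson-Durbin recursion)."""
    n = len(signal)
    # Autocorrelation r[0..order]
    r = [sum(signal[i] * signal[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [1.0] + [0.0] * order   # prediction-error filter, a[0] = 1
    err = r[0]
    for m in range(1, order + 1):
        acc = sum(a[j] * r[m - j] for j in range(m))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, m):
            new_a[j] = a[j] + k * a[m - j]
        new_a[m] = k
        a = new_a
        err *= 1.0 - k * k                  # prediction error shrinks each step
    return [-c for c in a[1:]]  # sign flip to match x[n] = sum a_k x[n-k] + e[n]

# A geometric decay x[n] = 0.5^n is perfectly predicted by one coefficient: a1 = 0.5.
sig = [0.5 ** n for n in range(50)]
```

With an order-1 model on that signal, the recursion recovers a₁ ≈ 0.5, matching the x[n] = Σ aₖ·x[n−k] convention above.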

Best for: Speech/phonetic analysis, vowel similarity, singing voice, wind instruments.

Order selection: 12‑16 for speech, 10‑12 for music, 8‑10 for simple tones.

6. Mel Features (Simplified MFCC)

🎧 Perceptually‑Weighted Spectrum

What it captures: Log‑energy in mel‑spaced frequency bands, without DCT compression. Faster alternative to MFCC.

Computation:

1. Frame audio (25 ms windows)
2. Compute power spectrum
3. Apply mel filterbank (40 bands)
4. Log compression: 10×log10(power) + 100
5. No DCT — keep band energies

Feature vector: [E₁, E₂, ..., E₄₀], where Eₖ = log-energy in mel band k.
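The mel spacing used in step 3 can be sketched with the common O'Shaughnessy formula (Praat's own mel conversion may differ slightly; this is an illustration, not the script's code):

```python
import math

def hz_to_mel(f):
    # Common mel-scale formula: mel = 2595 * log10(1 + f/700)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_centers(fmin, fmax, num_bands):
    """Band center frequencies spaced uniformly on the mel scale."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    return [mel_to_hz(lo + i * (hi - lo) / (num_bands - 1))
            for i in range(num_bands)]

centers = mel_band_centers(100.0, 8000.0, 40)  # 40 bands, 100 Hz .. 8000 Hz
```

Because the spacing is uniform in mels but the mel scale is roughly logarithmic in Hz, the resulting centers are densely packed at low frequencies and sparse at high frequencies — the "perceptually uniform" spacing the guide refers to.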

Best for: Fast timbre comparison, general audio similarity when MFCC too slow.

Speed advantage: 3‑5× faster than MFCC (no DCT).

Feature Selection Guide

| Analysis goal        | Recommended feature | Why                     | Speed       |
|----------------------|---------------------|-------------------------|-------------|
| Musical structure    | MFCC                | Captures timbre changes | Slow        |
| Melodic patterns     | Pitch only          | Focus on F0 contour     | Fastest     |
| Speech/phonetics     | LPC                 | Models vocal tract      | Medium      |
| Texture segmentation | Spectral Entropy    | Measures complexity     | Medium      |
| General purpose      | Mel Features        | Balanced quality/speed  | Medium‑fast |
| Expressive analysis  | Pitch+Intensity     | Adds dynamics           | Fast        |

SSM Theory & Computation

Mathematical Foundation

🔢 Similarity Metrics

For vector features (Pitch, MFCC, LPC, Mel):

Given feature vectors v_i and v_j (length D):

1. Normalize to unit length:
   v̂_i = v_i / ||v_i||, where ||v|| = √(Σ vₖ²)
2. Cosine similarity:
   S(i,j) = v̂_i · v̂_j = Σ (v̂_i[k] × v̂_j[k])

Properties:
  • Range: [−1, 1] in general; [0, 1] when the features are non‑negative
  • S(i,i) = 1 (self‑similarity)
  • S(i,j) = S(j,i) (symmetric)
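The vector-similarity computation can be sketched in a few lines of pure Python. This is a minimal illustration of the math, not the Praat script's implementation:

```python
import math

def unit_normalize(v):
    """Scale a vector to unit length (leave zero vectors unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine_similarity(a, b):
    """Dot product of unit-normalized vectors."""
    return sum(x * y for x, y in zip(unit_normalize(a), unit_normalize(b)))

def self_similarity_matrix(features):
    """features: one feature vector per frame; returns the full N x N SSM."""
    n = len(features)
    return [[cosine_similarity(features[i], features[j]) for j in range(n)]
            for i in range(n)]

frames = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
S = self_similarity_matrix(frames)
# Frames 0 and 1 point in the same direction -> similarity 1 despite
# different magnitudes; frames 0 and 2 are orthogonal -> similarity 0.
```

The magnitude-invariance in the comment is exactly the effect of the unit-length normalization discussed below: only the shape/direction of the feature vector matters.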

For scalar features (Spectral Entropy):

Given entropy values e_i and e_j (each already normalized to [0, 1]):

Similarity = 1 − |e_i − e_j|

Properties:
  • S(i,i) = 1
  • S(i,j) decreases linearly with the entropy difference
  • Minimum similarity = 0, reached when |e_i − e_j| = 1

Normalization Strategies

📐 Why Normalization Matters

Problem without normalization:

  • Pitch values: 100 Hz vs 200 Hz → difference = 100
  • Pitch values: 1000 Hz vs 1100 Hz → difference = 100
  • But 100 Hz difference is more significant at low frequencies!

Solution: Unit length normalization

For vector v = [x₁, x₂, ..., x_D]:
  normalized = [x₁/||v||, x₂/||v||, ..., x_D/||v||]
  where ||v|| = √(x₁² + x₂² + ... + x_D²)

Effect:
  • Compares shape/direction, not magnitude
  • Pitch contour matters more than absolute pitch
  • Spectral shape matters more than loudness

Exception: Spectral Entropy — already dimensionless, scaled to 0‑1 range.

Contrast Enhancement

🎨 Automatic Contrast Adjustment

Problem: Raw similarity matrices often have low contrast (most values near 1).

Solution: Power scaling

Given similarity matrix S with mean μ:
  If μ > 0.95: contrast = 20
  Else if μ > 0.90: contrast = 10
  Else if μ > 0.80: contrast = 5
  Else: contrast = 3

Apply power scaling: S' = S^contrast
Then rescale to [0, 1]: S'' = (S' − min(S')) / (max(S') − min(S'))

Why this works:

  • High mean similarity → matrix needs more contrast
  • Power function expands differences: 0.99²⁰ = 0.82, 0.95²⁰ = 0.36
  • Preserves order: if S(i,j) > S(k,l), then S'(i,j) > S'(k,l)
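The whole enhancement step fits in a short function. A pure-Python sketch of the scheme described above (the thresholds mirror the guide; the Praat script's actual code may differ in detail):

```python
def auto_contrast(S):
    """Power-scale a similarity matrix based on its mean, then rescale to [0, 1]."""
    vals = [v for row in S for v in row]
    mu = sum(vals) / len(vals)
    if mu > 0.95:
        power = 20
    elif mu > 0.90:
        power = 10
    elif mu > 0.80:
        power = 5
    else:
        power = 3
    scaled = [[v ** power for v in row] for row in S]
    flat = [v for row in scaled for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant matrix
    return [[(v - lo) / span for v in row] for row in scaled]
```

On a nearly uniform matrix such as [[1.0, 0.99], [0.99, 1.0]], the mean exceeds 0.95, so the 20th power stretches the tiny 0.01 gap to the full [0, 1] range while preserving the ordering of the values.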

Computational Complexity

⚡ O(N²) Warning

Frame count determines computation time:

N = duration / time_step
Comparisons = N × (N+1) / 2 ≈ N²/2

Examples:
  30 s sound, time_step = 0.01 s → N = 3000 → ≈ 4.5 million comparisons
  60 s sound, time_step = 0.01 s → N = 6000 → ≈ 18 million comparisons
  180 s (3 min) sound → N = 18000 → ≈ 162 million comparisons
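The arithmetic above is easy to automate before committing to a long run. A small hypothetical helper (not part of the script):

```python
def ssm_cost(duration_s, time_step=0.01, decimation=1):
    """Estimate frame count and pairwise comparisons for an SSM.

    round() guards against floating-point error in duration / time_step.
    """
    n = round(duration_s / (time_step * decimation))
    comparisons = n * (n + 1) // 2
    return n, comparisons

n, c = ssm_cost(30)                      # 30 s at 10 ms steps
n10, c10 = ssm_cost(180, decimation=10)  # 3-minute song, decimation 10
```

Running this reproduces the guide's figures: 30 s gives N = 3000 and about 4.5 million comparisons; a decimated 3-minute song gives N = 1800 and about 1.62 million.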

Optimization strategies:

  1. Frame decimation: Use every k‑th frame (k=frame_decimation)
  2. Downsampling: Reduce sample rate → fewer frequency bins
  3. Feature choice: Some features faster to extract (Pitch vs MFCC)
  4. Early experiments: Use short excerpts first

Pattern Interpretation

Basic SSM Anatomy

📐 Matrix Components

Time j →
       ┌─────────────────────────────┐
       │                             │
       │      Block                  │
       │      (repetition)           │
Time i │          ┌─────┐            │
   ↓   │          │     │            │
       │          │  X  │            │
       │          │     │            │
       │          └─────┘            │
       │                Diagonal     │
       │                (self)       │
       └─────────────────────────────┘
      

Key elements:

  • Diagonal (X): Always bright (S(i,i)=1). Self‑similarity.
  • Blocks: Bright squares indicate repeated sections.
  • Checkerboard: Alternating pattern indicates ABAB structure.
  • Parallel diagonals: Indicate periodic/repeating patterns.
  • Dark areas: Dissimilar sections.

Common Musical Structures

🎵 Verse‑Chorus Form

Structure: Intro – Verse – Chorus – Verse – Chorus – Outro

SSM pattern:
  • Bright block: Verse 1 vs Verse 2
  • Bright block: Chorus 1 vs Chorus 2
  • Dark: Verse vs Chorus (different material)
  • Parallel diagonals: repeated phrases within sections

For the four core sections, the matrix shows a block structure:

  [V1V1 V1C1 V1V2 V1C2]
  [C1V1 C1C1 C1V2 C1C2]
  [V2V1 V2C1 V2V2 V2C2]
  [C2V1 C2C1 C2V2 C2C2]

Bright cells: V1V1, C1C1, V2V2, C2C2 (diagonal); V1V2, C1C2 (repetitions).

🔄 Rondo Form (ABACA)

Structure: A – B – A – C – A

SSM pattern:
  • Bright blocks at (t_A1, t_A2), (t_A1, t_A3), (t_A2, t_A3)
  • The three A sections create a triangle of bright blocks
  • B and C appear only as isolated bright self‑blocks on the diagonal
  • A vs B and A vs C are dark (different material)

Visual: checkerboard with diagonal stripes.

🎼 Through‑Composed (ABCD)

Structure: A – B – C – D (no repetitions)

SSM pattern:
  • Only the diagonal is bright
  • All off‑diagonal cells are dark (uniform dark background)
  • May show local similarity for short motifs

Visual: bright line on a dark background.

Speech & Phonetic Patterns

🗣️ Phoneme Repetition

With LPC or MFCC features:

  • Vowel repetitions: Bright blocks for same vowel sounds
  • Consonant‑vowel contrast: Dark areas between consonant and vowel regions
  • Speaker consistency: Uniform brightness for same speaker
  • Phone transitions: Dark diagonals at phone boundaries

Example sentence: "She sells sea shells"

Phones: /ʃ/ /i/ /s/ /ɛ/ /l/ /z/ /s/ /i/ /ʃ/ /ɛ/ /l/ /z/

Patterns:
  • Bright blocks for /ʃ/ (she, shells)
  • Bright blocks for /s/ (sells, sea, shells)
  • Bright blocks for /ɛ/ (sells, shells)
  • Bright blocks for /i/ (she, sea)
  • Dark areas between different phonemes

Texture & Complexity Patterns

🌊 Spectral Entropy SSM

Patterns reveal texture changes:

| Pattern      | Meaning                   | Audio example           |
|--------------|---------------------------|-------------------------|
| Bright block | Similar complexity        | Two noise sections      |
| Dark block   | Different complexity      | Noise vs tone           |
| Gradient     | Gradual complexity change | Fade from tone to noise |
| Checkerboard | Alternating textures      | Tone‑noise‑tone‑noise   |

Entropy values guide:

  • 0.0‑0.2: Pure tones, simple harmonic sounds
  • 0.2‑0.4: Complex tones, voiced speech
  • 0.4‑0.6: Mixed sources, consonants
  • 0.6‑0.8: Noise‑like, unvoiced fricatives
  • 0.8‑1.0: White noise, applause

Path Detection in SSM

🔍 Finding Patterns Manually

Step‑by‑step analysis:

  1. Identify diagonal: Should be brightest line
  2. Look for bright off‑diagonal blocks: These indicate repetitions
  3. Trace block boundaries: Correspond to section boundaries
  4. Check symmetry: Pattern should be symmetric across diagonal
  5. Note parallel lines: Indicate periodic structure
  6. Compare with audio: Click in matrix to hear corresponding frames

Praat interaction:

  • SSM matrix is selectable object
  • Use View & Edit to explore values
  • Row/column numbers correspond to frame indices
  • Convert frame index to time: time = frame_index × time_step (multiply by frame_decimation as well if decimation was used)

Speed Optimization

Frame Decimation

⏱️ Reduce Temporal Resolution

How it works: Use only every k‑th frame for SSM computation.

Original: N frames with time_step = 0.01 s
Decimation by k: N' = N/k frames
Computation time: O(N²) → O((N/k)²) = O(N²/k²), i.e. speedup ≈ k²

Examples:
  k = 2 → 4× faster, N' = N/2
  k = 5 → 25× faster, N' = N/5
  k = 10 → 100× faster, N' = N/10

Temporal resolution trade‑off:

| Decimation | Effective time step | Max detectable rate | Use case          |
|------------|---------------------|---------------------|-------------------|
| 1 (none)   | 0.01 s              | 50 Hz               | Detailed analysis |
| 2          | 0.02 s              | 25 Hz               | Music structure   |
| 5          | 0.05 s              | 10 Hz               | Section detection |
| 10         | 0.10 s              | 5 Hz                | Quick overview    |

Guideline: for music structure analysis, structural changes faster than about 10 Hz (periods under 0.1 s) are rarely meaningful. Decimation = 5 (0.05 s resolution) is usually sufficient.
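In code, decimation is just a stride over the frame list. A one-line illustration (not the script's own implementation):

```python
def decimate_frames(features, k):
    """Keep every k-th frame; the effective time step becomes time_step * k."""
    return features[::k]

frames = [[float(i)] for i in range(1000)]
kept = decimate_frames(frames, 5)  # 200 frames -> roughly 25x fewer comparisons
```

With 1000 frames and k = 5, 200 frames remain, so the N²-scaled SSM cost drops by about k² = 25, exactly as in the table above.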

Downsampling

📉 Reduce Sample Rate

How it works: Resample audio to lower rate before feature extraction.

Original: Fs = 44100 Hz
Downsample to: Fs' = 22050 Hz

Effects:
  1. Half as many samples per analysis window → faster FFTs
  2. Frequency range halves; bin spacing (set by window length) is unchanged
  3. Nyquist = 11025 Hz (still covers most musically relevant content)

Feature extraction speedup:
  • MFCC: 2–4× faster
  • Spectrogram: 2–4× faster
  • Pitch extraction: minimal effect

When to use:

  • Yes: MFCC, Spectral Entropy, Mel Features (preserve timbre up to 11kHz)
  • Maybe: LPC (if analyzing speech below 5kHz)
  • No: Pitch extraction (needs high frequency resolution)

Combined Optimization

🚀 Maximum Speed Strategy

For long audio exploration (>2 minutes):

Strategy:
  1. use_downsampling = 1, processing_sample_rate = 22050
  2. frame_decimation = 10
  3. feature_type = Mel Features (faster than MFCC)

Speed improvements:
  • Downsampling: ~4× faster feature extraction
  • Decimation = 10: 100× faster SSM (O(N²) reduction)
  • Total: up to 400× faster than full resolution

Quality trade‑offs:
  • Time resolution: 0.10 s (5 Hz maximum detectable rate)
  • Frequency range: 0–11025 Hz
  • Feature quality: good for structure, less so for fine detail

Practical example: 3‑minute song (180s)

  • Full: N = 180/0.01 = 18,000 frames → 162M comparisons
  • Optimized: N = (180/0.10) = 1,800 frames → 1.62M comparisons
  • Speedup: 100× just from decimation

Memory Considerations

💾 Matrix Memory Usage

Memory formula: N² × 8 bytes (double precision)

Examples:
  N = 1000  → 1000² × 8 = 8,000,000 bytes = 8 MB
  N = 3000  → 9,000,000 × 8 = 72 MB
  N = 6000  → 36,000,000 × 8 = 288 MB
  N = 10000 → 100,000,000 × 8 = 800 MB

Praat limits:
  • Default memory limit: ~1 GB
  • Large matrices may crash Praat
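The formula is simple enough to wrap in a quick pre-flight check before running the script. A hypothetical helper, using the guide's convention of 1 MB = 10⁶ bytes:

```python
def ssm_memory_mb(n_frames, bytes_per_cell=8):
    """Memory for a dense N x N double-precision matrix, in megabytes (10^6 bytes)."""
    return n_frames * n_frames * bytes_per_cell / 1_000_000

# ssm_memory_mb(1000) -> 8.0 MB; ssm_memory_mb(10000) -> 800.0 MB
```

Checking an estimate against the ~1 GB Praat limit before computing can save a crash on long files.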

Solutions:

  1. Frame decimation: Reduces N linearly → reduces memory quadratically
  2. Process shorter segments: Analyze 30‑60s chunks instead of whole file
  3. Use 32‑bit floats: Not available in Praat Matrix objects
  4. External computation: For very large files, use Python/MATLAB

Safe limits for Praat: Keep N < 2000 frames (≈20s at 0.01s step, or 200s at 0.10s step).

Parameters & Settings

Feature‑Specific Parameters

🎛️ Pitch Settings (for options 1‑2)

| Parameter     | Default | Range       | Description                                                      |
|---------------|---------|-------------|------------------------------------------------------------------|
| pitch_floor   | 75 Hz   | 50–200 Hz   | Minimum F0 to consider. Lower = more sensitive but more errors.  |
| pitch_ceiling | 600 Hz  | 200–2000 Hz | Maximum F0. Set to the expected vocal/instrument range.          |

Recommendations:

  • Male speech: 75‑300 Hz
  • Female speech: 100‑500 Hz
  • Singing: 80‑1000 Hz (tenor to soprano)
  • Instrumental: Match instrument range

🎚️ MFCC Settings

| Parameter      | Default | Range | Description                                                                  |
|----------------|---------|-------|------------------------------------------------------------------------------|
| number_of_mfcc | 12      | 8–20  | Number of MFCC coefficients to keep. The first coefficient (energy) is always included. |

Coefficient meaning:

  • c1: Overall energy (often discarded in speech processing)
  • c2‑c4: Broad spectral shape
  • c5‑c8: Mid‑frequency spectral details
  • c9‑c12: Fine spectral details
  • c13+: Very fine details (often noise)

Standard values: 12‑13 for speech, 20 for music, 8‑10 for fast processing.

📊 Spectral Entropy Settings

| Parameter        | Default | Range        | Description                                              |
|------------------|---------|--------------|----------------------------------------------------------|
| freq_min_entropy | 100 Hz  | 0–1000 Hz    | Lowest frequency band for the entropy calculation.       |
| freq_max_entropy | 8000 Hz | 1000–Nyquist | Highest frequency band.                                  |
| num_freq_bands   | 40      | 10–100       | Number of frequency bands the spectrum is divided into.  |

Bandwidth effect: More bands = finer frequency resolution but noisier entropy estimates.

Frequency range tips: For speech, 100‑5000 Hz covers most information. For music, 100‑8000 Hz or higher.

🗣️ LPC Settings

| Parameter | Default | Range | Description                               |
|-----------|---------|-------|-------------------------------------------|
| lpc_order | 16      | 8–20  | Number of LPC coefficients (model order). |

Order selection guide:

  • 8‑10: Simple tones, basic formants
  • 12‑14: Speech (2‑3 formants), singing voice
  • 16‑18: Detailed speech (3‑4 formants), complex tones
  • 20+: Very detailed spectral envelope

Rule of thumb: order ≈ sample_rate/1000 + 2 to 4. For 22050 Hz that suggests 24–26, but 16–18 is usually sufficient.

🎧 Mel Feature Settings

| Parameter     | Default | Range        | Description                           |
|---------------|---------|--------------|---------------------------------------|
| num_mel_bands | 40      | 20–80        | Number of mel‑spaced frequency bands. |
| freq_min_mel  | 100 Hz  | 0–1000 Hz    | Lowest mel band center frequency.     |
| freq_max_mel  | 8000 Hz | 1000–Nyquist | Highest mel band center frequency.    |

Mel spacing: Bands are spaced equally on mel scale (perceptually uniform), not linear Hz.

Band count trade‑off: More bands = finer frequency resolution but larger feature vectors = slower SSM computation.

General Parameters

| Parameter              | Default  | Description                                                                 |
|------------------------|----------|-----------------------------------------------------------------------------|
| time_step              | 0.01 s   | Time between analysis frames (10 ms). Smaller = finer temporal resolution.  |
| window_length          | 0.025 s  | Analysis window length (25 ms). Longer = better frequency resolution.       |
| use_downsampling       | 1        | Resample to processing_sample_rate before analysis.                         |
| processing_sample_rate | 22050 Hz | Target sample rate if downsampling is enabled.                              |
| frame_decimation       | 1        | Use only every k‑th frame (1 = all, 2 = half, 5 = one fifth).               |
| draw_matrix            | 1        | Draw the SSM in the Picture window after computation.                       |
| auto_contrast          | 1        | Automatically enhance contrast based on mean similarity.                    |

Parameter Interactions

Important relationships:
  • time_step vs window_length: window_length should be 2‑3× time_step for proper overlap.
  • frame_decimation vs time_step: Effective time resolution = time_step × frame_decimation.
  • processing_sample_rate vs freq_max: Ensure freq_max_entropy/freq_max_mel < processing_sample_rate/2.
  • num_mel_bands vs number_of_mfcc: MFCCs compress mel bands via DCT; number_of_mfcc should be ≤ num_mel_bands.
  • lpc_order vs window_length: Need enough samples for stable LPC estimation: window_length × sample_rate > 2 × lpc_order.
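These relationships can be encoded as a quick sanity check before a run. A hypothetical helper whose parameter names mirror the script's form fields (it is not part of the script itself):

```python
def check_parameters(time_step, window_length, sample_rate,
                     freq_max, num_mel_bands, number_of_mfcc, lpc_order):
    """Return a list of warnings for inconsistent settings."""
    warnings = []
    if not (2 * time_step <= window_length <= 3 * time_step):
        warnings.append("window_length should be 2-3x time_step")
    if freq_max >= sample_rate / 2:
        warnings.append("freq_max must stay below Nyquist (sample_rate / 2)")
    if number_of_mfcc > num_mel_bands:
        warnings.append("number_of_mfcc should be <= num_mel_bands")
    if window_length * sample_rate <= 2 * lpc_order:
        warnings.append("window too short for stable LPC of this order")
    return warnings

# The guide's defaults pass cleanly:
ok = check_parameters(0.01, 0.025, 22050, 8000, 40, 12, 16)  # -> []
```

Each condition restates one bullet from the list above; setting freq_max_entropy or freq_max_mel above Nyquist, for example, would trigger the second warning.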

Practical Applications

Music Structure Analysis

🎵 Automatic Section Detection

Workflow:

  1. Load song (30‑180 seconds)
  2. Use MFCC features with frame_decimation=2
  3. Compute SSM
  4. Identify bright off‑diagonal blocks
  5. Map block boundaries to time
  6. Label sections (verse, chorus, bridge)

Example: Pop song (3‑4 minutes)

Expected structure: Intro – Verse – Chorus – Verse – Chorus – Bridge – Chorus – Outro

The SSM will show:
  • Bright block: Verse 1 vs Verse 2
  • Bright blocks: Chorus 1 vs Chorus 2, Chorus 1 vs Chorus 3, etc.
  • The bridge appears as an isolated bright square
  • Intro/outro may show self‑similarity only

Advanced: Checkerboard detection — ABAB patterns create checkerboard SSM.

Speech & Phonetics

🗣️ Phoneme Repetition Analysis

Workflow:

  1. Use LPC features with order=12‑16
  2. Process short utterance (5‑10 seconds)
  3. Enable high resolution (frame_decimation=1)
  4. Look for small bright blocks corresponding to phonemes

Applications:

  • Pronunciation consistency: Compare multiple repetitions of same word
  • Dialect analysis: Compare vowel realizations across speakers
  • Speech therapy: Track consistency of target phonemes
  • Forensics: Compare speech samples for speaker similarity

Example: "The cat sat on the mat" — bright blocks for each /æ/ sound.

Sound Design & Composition

🎨 Texture Mapping

Using Spectral Entropy SSM:

Goal: create an evolving texture piece.

Process:
  1. Record/generate various textures: noise, tones, complex sounds
  2. Arrange them in a timeline
  3. Compute the Spectral Entropy SSM
  4. Analyze the transitions between textures

The SSM reveals:
  • Bright blocks: similar textures
  • Dark areas: texture boundaries
  • Gradients: smooth transitions
  • Patterns: repetitive texture sequences

Composition tool: Use SSM to ensure desired structure — e.g., symmetric patterns for formal structure, chaotic patterns for experimental works.

Music Information Retrieval Research

🔬 Feature Comparison Study

Research question: Which feature best captures musical similarity for genre X?

Methodology:

  1. Select representative audio corpus
  2. Compute SSM with multiple features (Pitch, MFCC, LPC, etc.)
  3. Compare SSM patterns to ground‑truth annotations
  4. Quantify accuracy for section boundary detection
  5. Analyze computational efficiency

Metrics:

  • Boundary F‑score: Precision/recall for detecting section boundaries
  • Pattern consistency: How well SSM captures known repetitions
  • Computation time: Real‑time feasibility
  • Memory usage: Practical limitations

Educational Use

📚 Teaching Musical Form

Classroom activity: Visualizing musical structure

  1. Students bring favorite songs (30‑second excerpts)
  2. Predict structure (draw expected SSM pattern)
  3. Compute actual SSM (MFCC, frame_decimation=5 for speed)
  4. Compare prediction to reality
  5. Discuss why patterns appear/disappear

Learning outcomes:

  • Understand repetition in music
  • Recognize common forms (verse‑chorus, rondo, etc.)
  • Connect auditory perception to visual representation
  • Learn basics of audio feature extraction

Troubleshooting & Tips

Problem: SSM shows only the diagonal, no off‑diagonal patterns
Causes: 1) the audio has no repetitions, 2) contrast too low, 3) feature mismatch, 4) decimation too high
Solutions: check the audio for actual repetitions, enable auto_contrast, try a different feature, reduce frame_decimation

Problem: Computation takes too long
Causes: long audio, high sample rate, low decimation, MFCC features
Solutions: use a shorter excerpt, enable downsampling, increase frame_decimation, use Pitch or Mel Features

Problem: Praat crashes or runs out of memory
Causes: too many frames (N too large)
Solutions: increase frame_decimation substantially (10–20), use a shorter audio segment, or process chunks externally

Problem: SSM patterns don't match the perceived structure
Causes: wrong feature type for the analysis goal
Solutions: match the feature to the task — Pitch for melody, MFCC for timbre, Entropy for texture, LPC for speech

Advanced Techniques

For experienced users:
  • Multi‑feature SSM: Combine features (e.g., concatenate MFCC and Pitch vectors) for richer representation
  • Time‑lag matrix: Instead of S(i,j), compute S(i,i+lag) to find periodicities
  • Thresholding: Apply threshold to create binary SSM (similar/dissimilar)
  • Novelty detection: Compute novelty curve from SSM diagonal band
  • Cross‑SSM: Compare two different audio files (not implemented in this script)
  • SSM filtering: Apply smoothing or median filtering to reduce noise