Self-Similarity Spectral Resynthesis: MFCC-Based Audio Transformation

A Praat script that analyzes audio through MFCCs, computes self-similarity matrices, and creates radical spectral transformations based on temporal similarity patterns.

Author: Shai Cohen
Affiliation: Department of Music, Bar-Ilan University, Israel
Version: 2.2 (2025)
Concept: Self-similarity matrix audio processing
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

What this does

The Self-Similarity Spectral Resynthesis script transforms audio by analyzing its internal temporal patterns. Instead of processing frequencies directly, it uses MFCC features to detect when audio segments sound similar to each other, then applies dynamic amplitude changes based on these similarity patterns.

🔍 Core Process Pipeline

  1. MFCC Extraction: Convert audio to Mel-frequency cepstral coefficients
  2. Self-Similarity Matrix (SSM): Compute how each moment resembles every other moment
  3. Similarity Scoring: Calculate average similarity for each time frame
  4. Creative Transformation: Apply one of 5 creative modes to similarity scores
  5. Gain Mapping: Convert scores to amplitude envelopes
  6. Spatial Processing: Apply in mono, stereo, or spatial modes
  7. Resynthesis: Multiply original audio by dynamic envelope

Key Insight: Audio contains hidden temporal patterns. Repetitive sections get boosted (or attenuated), novel sections get opposite treatment. The result reveals the "skeleton" of the audio's temporal structure.
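
The last pipeline step (multiplying the audio by a dynamic envelope) can be sketched as follows. This is a minimal illustration rather than the script's actual code: it assumes one gain value per analysis frame, that frame i sits at time i × time_step, and that gains are linearly interpolated between frames. The function names are hypothetical.

```python
def frame_gains_to_envelope(gains, time_step, sample_rate, n_samples):
    """Linearly interpolate one gain per frame to one gain per sample.

    Assumption: frame i is located at time i * time_step.
    """
    if not gains:
        return [1.0] * n_samples
    envelope = []
    for n in range(n_samples):
        pos = (n / sample_rate) / time_step  # fractional frame index
        i = int(pos)
        if i >= len(gains) - 1:
            envelope.append(gains[-1])  # hold the last gain past the final frame
        else:
            frac = pos - i
            envelope.append(gains[i] * (1.0 - frac) + gains[i + 1] * frac)
    return envelope


def resynthesize(samples, gains, time_step, sample_rate):
    """Multiply the original samples by the interpolated gain envelope."""
    envelope = frame_gains_to_envelope(gains, time_step, sample_rate, len(samples))
    return [s * g for s, g in zip(samples, envelope)]


# Toy example: a constant signal faded in by two frame gains.
faded = resynthesize([1.0] * 5, [0.0, 1.0], time_step=1.0, sample_rate=4)
```

Interpolating between frames avoids audible steps at frame boundaries; a step-wise envelope would correspond to skipping the interpolation.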

Quick Start

  1. In Praat, select exactly one Sound object.
  2. Open the script editor and load Self_Similarity_Resynthesis.praat.
  3. Choose a Preset based on desired effect:
    • Glitch Gating: Extreme rhythmic gating
    • Ghost Remix: Subtle spectral enhancement
    • Brutal Novelty: Maximum contrast effects
    • Spectral Mosaic: Frame substitution effect
    • Chaotic Tremolo: Rapid amplitude modulation
  4. Choose Creative_mode (Standard, Inverted, etc.)
  5. Select Speed_mode based on your patience:
    • Full Quality: Slow but accurate
    • Balanced: 16 kHz, 2-4× faster
    • Fast: 8 kHz, 4-8× faster
  6. Enable Draw_visualization to see SSM and scores.
  7. Click Run (or press Ctrl+R).
  8. Listen to the transformed audio.
Best Practices for First Use:
  • Start with "Ghost Remix" preset - subtle and musical
  • Use Balanced speed mode for reasonable processing time
  • Enable visualization to understand what's happening
  • Try on vocal recordings first - effects are most audible
  • Adjust Similarity_threshold to control effect intensity
Processing Characteristics:
  • MFCC-based: Works on spectral shapes, not raw audio
  • Frame-based: Processes in 10-30ms frames (adjustable)
  • Non-realtime: Analyzes entire file before processing
  • Memory intensive: SSM computation grows with file length
  • Stereo handling: Multiple spatial modes available
  • Speed tradeoffs: Downsampling reduces quality but speeds up

Self-Similarity Matrix Theory

What is a Self-Similarity Matrix (SSM)?

🎲 Mathematical Definition

Given: A sequence of feature vectors F₁, F₂, ..., Fₙ

SSM: An n×n matrix where entry S(i,j) = similarity(Fᵢ, Fⱼ)

S(i,j) = cosine_similarity(Fᵢ, Fⱼ) = (Fᵢ · Fⱼ) / (‖Fᵢ‖ ‖Fⱼ‖)

WHERE:
  Fᵢ  = feature vector at time i (MFCCs)
  ·   = dot product
  ‖·‖ = Euclidean norm

Result: -1 to 1, where 1 = identical, 0 = orthogonal, -1 = opposite

Visual interpretation: A heatmap where bright spots indicate similar moments in time.
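
As an illustration of the definition above, here is a small pure-Python sketch of a cosine-similarity SSM. It is not taken from the Praat script; the function names are hypothetical, and returning 0 for all-zero (silent) frames is an assumption made to avoid division by zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero frame is similar to nothing
    return dot / (norm_a * norm_b)


def self_similarity_matrix(features):
    """Build the n x n matrix S(i, j), filling both triangles by symmetry."""
    n = len(features)
    ssm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = cosine_similarity(features[i], features[j])
            ssm[i][j] = s
            ssm[j][i] = s
    return ssm


# Four toy 2-D "frames": frame 1 is orthogonal to frame 0, frame 2 points the
# opposite way, frame 3 points the same way with a different length.
frames = [[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0], [2.0, 0.0]]
ssm = self_similarity_matrix(frames)
```

Because the measure divides by both norms, frame 3 scores 1 against frame 0 even though it is twice as long.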

SSM Properties in Audio

Diagonal: S(i,i) = 1 (every moment is identical to itself)

Symmetry: S(i,j) = S(j,i) (similarity is mutual)

Repetition patterns: Off-diagonal bright lines indicate repetitions

Novelty: Dark rows/columns indicate unique moments

Structure: Block patterns indicate sections

Computational Optimization

OPTIMIZATION STRATEGIES:

  1. Sparse Computation: Only compute every 5th row and column
     Cuts the number of similarity values from n² to about n²/25 (a constant-factor saving; growth is still quadratic)
  2. Early Normalization: Normalize feature vectors before SSM
     Avoids repeated normalization in similarity calculation
  3. Downsampling: Process at 8-16 kHz instead of original sample rate
     Reduces the number of samples in each analysis frame (the frame count itself is set by Time_step, not by the sample rate)

ACTUAL IMPLEMENTATION:

  For n frames, compute ~n²/25 similarity values
  Memory: store n×n matrix (float, 4 bytes each)
  Example: 1000 frames → 1,000,000 cells → 4 MB
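
The arithmetic behind these figures can be checked with a short sketch. The helper name `ssm_cost` and its parameters are hypothetical; it simply assumes a stride of 5 in both directions and 4-byte cells, as described above.

```python
import math


def ssm_cost(n_frames, stride=5, bytes_per_cell=4):
    """Estimate storage and work for an n x n SSM sampled every `stride` frames."""
    cells = n_frames * n_frames            # full matrix that is stored
    sampled = math.ceil(n_frames / stride)  # rows (and columns) actually visited
    return {
        "cells": cells,
        "computed": sampled * sampled,      # similarity values actually computed
        "bytes": cells * bytes_per_cell,
    }


cost = ssm_cost(1000)
```

For 1000 frames this gives 1,000,000 cells, 40,000 computed values (n²/25) and 4,000,000 bytes, matching the example. Doubling the file length quadruples all three numbers.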

MFCC Analysis Fundamentals

🎵 Mel-Frequency Cepstral Coefficients

Why MFCCs for similarity?

  • Perceptually relevant: Based on Mel scale (human hearing)
  • Compact representation: 8-13 coefficients capture spectral shape
  • Invariant to amplitude: Normalized, so loudness doesn't affect similarity
  • Robust: Less sensitive to pitch shifts than raw spectra
  • Standard: Widely used in speech/music analysis

MFCC Extraction Pipeline

  1. PRE-EMPHASIS: Apply high-pass filter: y[n] = x[n] - 0.97 * x[n-1]
     Boosts high frequencies
  2. FRAMING: Divide into overlapping frames (e.g., 30ms, 10ms hop)
  3. WINDOWING: Apply Hamming window to reduce edge effects
  4. FFT: Compute power spectrum
  5. MEL FILTERBANK: Apply triangular filters on Mel scale (20-40 filters)
  6. LOG COMPRESSION: Take logarithm of filterbank energies
  7. DCT: Discrete Cosine Transform to decorrelate coefficients
     Keep first 8-13 coefficients (cepstral coefficients)
  8. NORMALIZATION: Scale to unit length for similarity computation
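
A few of these steps are simple enough to sketch directly. The code below covers only steps 1, 2, 3 and 8 (pre-emphasis, framing, Hamming window, unit-length normalization); the FFT, Mel filterbank, log and DCT stages are omitted. It is an illustrative sketch with hypothetical function names, not the analysis Praat performs internally, and frame sizes are given in samples rather than seconds.

```python
import math


def pre_emphasis(x, coeff=0.97):
    """Step 1: y[n] = x[n] - coeff * x[n-1]; the first sample is passed through."""
    if not x:
        return []
    return [x[0]] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]


def frame_signal(x, frame_len, hop):
    """Step 2: split into overlapping frames; a trailing partial frame is dropped."""
    frames = []
    start = 0
    while start + frame_len <= len(x):
        frames.append(x[start:start + frame_len])
        start += hop
    return frames


def hamming(n):
    """Step 3: Hamming window coefficients for a frame of n samples."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def unit_normalize(v):
    """Step 8: scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0.0 else list(v)


signal = pre_emphasis([float(n % 4) for n in range(10)])
window = hamming(4)
windowed = [[s * w for s, w in zip(frame, window)] for frame in frame_signal(signal, 4, 2)]
```

With unit-length vectors, the cosine similarity between two frames reduces to a plain dot product, which is the point of normalizing early.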

Parameter Effects on Analysis

Parameter | Effect | Typical Value
Time_step_(s) | Frame rate (temporal resolution) | 0.01 s (100 Hz)
Analysis_frame_length_(s) | Window size (frequency resolution) | 0.03 s (30 ms)
Number_of_MFCCs | Dimensionality of feature vector | 8-13 coefficients

Creative Transformation Modes

Mode 1: Standard (Similarity Boost)

📈 Boost Repetitive Sections

Logic: High similarity → high amplitude

score = average_similarity(frame)
gain = map(score, 0→1, attenuation→boost)

Effect: Repetitive parts (chorus, loops) get louder. Novel parts (verses, transitions) get quieter.

Musical use: Emphasize recurring themes, create "ghostly" emphasis on repeated material.
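
One plausible reading of map(score, 0→1, attenuation→boost) is sketched below. The exact mapping used by the script is not given in this document, so the following are assumptions: the score is clamped to 0-1, raised to Contrast_power, interpolated linearly in dB between the attenuation and boost values, and then converted to a linear gain factor. Defaults mirror the documented parameter defaults (12 dB, -24 dB, 4.0); the function name is hypothetical.

```python
def score_to_gain(score, boost_db=12.0, attenuation_db=-24.0, contrast_power=4.0):
    """Map a 0-1 similarity score to a linear amplitude factor."""
    s = min(max(score, 0.0), 1.0)          # clamp to the expected range
    shaped = s ** contrast_power           # assumption: contrast as an exponent
    gain_db = attenuation_db + shaped * (boost_db - attenuation_db)
    return 10.0 ** (gain_db / 20.0)        # dB to linear amplitude


low = score_to_gain(0.0)    # full attenuation
high = score_to_gain(1.0)   # full boost
mid = score_to_gain(0.5)    # pulled toward attenuation by the exponent
```

Under these assumptions a higher Contrast_power pushes mid-range scores toward the attenuated end, so only the most similar frames receive the boost.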

Mode 2: Inverted (Novelty Extractor)

🔄 Boost Novel Sections

Logic: High similarity → low amplitude (inverted)

score = 1 - average_similarity(frame)
gain = map(score, 0→1, attenuation→boost)

Effect: Unique moments get emphasized. Repetitive material fades to background.

Musical use: Highlight unique events, solos, transitions. Creates "glitch gating" effect.

Mode 3: Diagonal Recurrence

↗️ Emphasize Sequential Patterns

Logic: Weight diagonal neighbors more heavily

score = 0.5 * average_similarity + 0.5 * diagonal_similarity
WHERE diagonal_similarity = average of S(i, i±offset)

Effect: Emphasizes sequential repetitions (echoes, rhythmic patterns).

Musical use: Bring out rhythmic motifs, sequential developments.
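
A sketch of the Mode 3 score for a single frame is shown below. The offset value, the exclusion of the frame's own diagonal entry from the average, and the handling of frames near the start or end are all assumptions; the function name is hypothetical.

```python
def diagonal_recurrence_score(ssm, i, offset=1):
    """0.5 * average similarity + 0.5 * average of S(i, i - offset) and S(i, i + offset)."""
    n = len(ssm)
    others = [ssm[i][j] for j in range(n) if j != i]  # assumption: skip S(i, i)
    average_similarity = sum(others) / len(others) if others else 0.0
    # Keep only the neighbours that fall inside the matrix.
    neighbours = [ssm[i][j] for j in (i - offset, i + offset) if 0 <= j < n]
    if neighbours:
        diagonal_similarity = sum(neighbours) / len(neighbours)
    else:
        diagonal_similarity = average_similarity
    return 0.5 * average_similarity + 0.5 * diagonal_similarity


toy_ssm = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]
edge_score = diagonal_recurrence_score(toy_ssm, 0)
middle_score = diagonal_recurrence_score(toy_ssm, 1)
```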

Mode 4: Hard Quantized (3 levels)

🎚️ Three-Level Amplitude Quantization

Logic: Discretize similarity scores into 3 levels

score = floor(average_similarity * 3) / 3
Result: 0, 0.333, or 0.667 (1.0 is reached only when average_similarity is exactly 1)

Effect: Creates stepped amplitude changes, like a "tremolo with 3 states".

Musical use: Robotic/quantized effects, rhythmic stuttering.
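
The quantization formula is easy to try out. The sketch below clamps the input to 0-1 first, which is an assumption, and uses a hypothetical function name. Note that floor(x × 3) / 3 yields 0, 1/3 or 2/3 for any x below 1, and 1.0 only when x is exactly 1.

```python
import math


def quantize_score(score, levels=3):
    """Discretize a 0-1 score with floor(score * levels) / levels."""
    s = min(max(score, 0.0), 1.0)  # assumption: clamp out-of-range scores
    return math.floor(s * levels) / levels


steps = [quantize_score(x) for x in (0.2, 0.5, 0.7, 1.0)]
```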

Mode 5: Self-Mosaic (Frame Substitution)

🧩 Replace Frames with Similar Ones

Logic: For each frame, find most similar other frame, copy its audio

FOR each target frame t:
  Find source frame s with max S(t,s) where |t-s| > 5
  Copy audio from source s to target t

Effect: Creates audio "mosaic" where each moment is replaced by its most similar counterpart.

Musical use: Extreme texture transformation, "audio collage" effect.
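
The source-frame search can be sketched as below. Only the selection step is shown, not the copying of audio. Keeping a frame unchanged when no candidate lies more than 5 frames away is an assumption, and the function name is hypothetical.

```python
def mosaic_sources(ssm, min_distance=5):
    """For each target frame t, pick the most similar frame s with |t - s| > min_distance."""
    n = len(ssm)
    sources = []
    for t in range(n):
        best_s, best_val = t, float("-inf")  # assumption: fall back to the frame itself
        for s in range(n):
            if abs(t - s) > min_distance and ssm[t][s] > best_val:
                best_s, best_val = s, ssm[t][s]
        sources.append(best_s)
    return sources


# Toy 8-frame matrix: similarity falls off with distance, except that the
# first and last frames resemble each other strongly.
n = 8
toy = [[1.0 - abs(i - j) / 10.0 for j in range(n)] for i in range(n)]
toy[0][7] = toy[7][0] = 0.95
sources = mosaic_sources(toy)
```

The distance constraint prevents each frame from simply choosing its immediate neighbour, which would leave the audio nearly unchanged.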

Preset Configurations

Preset 1: Glitch Gating

⚡ Extreme Rhythmic Gating

Parameter | Value | Effect
Creative_mode | Inverted (2) | Boost novel sections
Time_step | 0.005 s | High temporal resolution
Analysis_frame_length | 0.02 s | Short frames for precision
Similarity_threshold | 0.6 | High threshold = strict similarity
Contrast_power | 6.0 | Extreme contrast
Hard_threshold_gate | On | Binary on/off gating
Gate_threshold | 0.6 | 60% similarity cutoff
High_similarity_boost | 18 dB | Strong boost
Low_similarity_attenuation | -36 dB | Strong attenuation

Result: Aggressive rhythmic gating effect. Works great on drums, percussion, staccato material.

Preset 2: Ghost Remix

👻 Subtle Spectral Enhancement

Parameter | Value | Effect
Creative_mode | Standard (1) | Boost similar sections
Time_step | 0.008 s | Medium resolution
Analysis_frame_length | 0.025 s | Standard frame size
Similarity_threshold | 0.4 | Moderate threshold
Contrast_power | 3.0 | Gentle contrast
Mask_smoothing_frames | 5 | Smooth transitions
High_similarity_boost | 15 dB | Moderate boost
Low_similarity_attenuation | -20 dB | Moderate attenuation

Result: Subtle enhancement that brings out repetitive elements. Good for vocals, melodic material.

Preset 3: Brutal Novelty

💥 Maximum Contrast Effects

Parameter | Value | Effect
Creative_mode | Inverted (2) | Boost novel sections
Time_step | 0.005 s | High resolution
Analysis_frame_length | 0.02 s | Short frames
Similarity_threshold | 0.3 | Low threshold = more "novelty"
Contrast_power | 8.0 | Extreme contrast
Hard_threshold_gate | On | Binary gating
Gate_threshold | 0.7 | 70% cutoff
High_similarity_boost | 24 dB | Very strong boost
Low_similarity_attenuation | -48 dB | Near-silence for similar

Result: Extreme glitch effect that isolates unique moments. Only the most novel material survives.

Preset 4: Spectral Mosaic

🧩 Frame Substitution Effect

Parameter | Value | Effect
Creative_mode | Self-Mosaic (5) | Frame substitution
Time_step | 0.02 s | Longer steps for coherence
Analysis_frame_length | 0.05 s | Long frames
Similarity_threshold | 0.4 | Moderate similarity

Result: Audio "collage" where each moment is replaced by its most similar counterpart. Creates dreamlike, repetitive texture.

Preset 5: Chaotic Tremolo

🌀 Rapid Amplitude Modulation

Parameter | Value | Effect
Creative_mode | Hard Quantized (4) | 3-level quantization
Time_step | 0.003 s | Very fast updates
Analysis_frame_length | 0.015 s | Short analysis
Contrast_power | 10.0 | High contrast
Hard_threshold_gate | On | Binary decisions
Gate_threshold | 0.5 | 50% cutoff
Add_chaos | 0.25 | High randomness
High_similarity_boost | 20 dB | Strong boost
Low_similarity_attenuation | -40 dB | Strong attenuation

Result: Rapid, chaotic amplitude modulation with 3 distinct levels. Creates nervous, animated texture.

Spatial Processing Modes

🔊 Five Spatialization Strategies

Mode | Algorithm | Effect | Best For
Mono | Convert to mono, apply single envelope | Centered, focused | Compatibility, mono systems
Preserve Stereo | Apply same envelope to both channels | Stereo image preserved | Most stereo material
Stereo Wide | Different EQ per channel + envelope | Enhanced width | Creating spatial interest
Rotating | Panning modulation + envelope | Circular motion | Experimental, psychedelic
Mid-Side | Apply envelope mainly to center | Focus on center | Vocals, lead elements

Stereo Wide Implementation

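IF source is stereo: Extract left and right channels
ELSE: Duplicate mono to both channels

LEFT CHANNEL:  Apply envelope × (1 + 0.3 × exp(-5×t))
               Described as boosting low-mid frequencies over time
RIGHT CHANNEL: Apply envelope × (1 + 0.3 × (1 - exp(-5×t)))
               Described as boosting high-mid frequencies over time

RESULT: Frequency-dependent stereo widening

Note: the two expressions above are time-varying gain factors, not filters. The left channel's extra gain starts at +30% and decays toward 0, while the right channel's rises from 0 toward +30%; any per-channel EQ would be a separate step that is not shown here.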
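
The two channel factors can be evaluated directly to see how they evolve. The sketch below implements only the gain expressions quoted above (t in seconds); it applies no filtering, and the function name is hypothetical.

```python
import math


def stereo_wide_gains(envelope_value, t):
    """Return (left, right) gains at time t for one envelope value."""
    decay = math.exp(-5.0 * t)
    left = envelope_value * (1.0 + 0.3 * decay)           # starts at 1.3x, settles to 1.0x
    right = envelope_value * (1.0 + 0.3 * (1.0 - decay))  # starts at 1.0x, settles to 1.3x
    return left, right


start = stereo_wide_gains(1.0, 0.0)
later = stereo_wide_gains(1.0, 10.0)
```

With the exponent -5×t the crossover is essentially complete within the first second, after which the right channel stays about 30% louder than the left.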

Rotating Panning Effect

  1. Convert to mono if stereo
  2. Apply envelope
  3. Create left channel: left = processed × (0.6 + 0.4 × cos(2π × 0.2 × t))
  4. Create right channel: right = processed × (0.6 + 0.4 × sin(2π × 0.2 × t))
  5. Combine to stereo

EFFECT: 0.2 Hz (5 second period) panning rotation
Amplitude modulation creates perceived motion
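
The panning formulas in steps 3 and 4 translate directly into code. This sketch takes one already-enveloped sample and a time in seconds; the function name and the `rate_hz` parameter are hypothetical conveniences.

```python
import math


def rotating_pan(sample, t, rate_hz=0.2):
    """Return (left, right) for one processed sample at time t seconds."""
    phase = 2.0 * math.pi * rate_hz * t
    left = sample * (0.6 + 0.4 * math.cos(phase))
    right = sample * (0.6 + 0.4 * math.sin(phase))
    return left, right


at_start = rotating_pan(1.0, 0.0)     # left at its maximum
quarter = rotating_pan(1.0, 1.25)     # a quarter period later: right at its maximum
full_turn = rotating_pan(1.0, 5.0)    # one full 5-second period
```

Each channel's gain swings between 0.2 and 1.0, so neither side ever goes fully silent.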

Parameters Explained

Analysis Parameters

⏱️ Time_step_(s)

Range: 0.001 - 0.05 seconds (1-50 ms)

Default: 0.01 seconds (10 ms, 100 Hz frame rate)

Effect:

  • Small values (1-5 ms): High temporal resolution, smoother envelopes, slower processing
  • Medium values (8-15 ms): Good balance, musical timing
  • Large values (20-50 ms): Coarse resolution, rhythmic stepping, faster processing

Musical correlation: 10 ms ≈ 128th note at 187.5 BPM (a 32nd note at that tempo lasts 40 ms)
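
Relating a time step to a note value is simple arithmetic: a quarter note lasts 60 / BPM seconds, and shorter notes divide that further. The helper below is a hypothetical convenience for checking such figures, assuming the quarter note carries the beat.

```python
def note_duration(bpm, division):
    """Seconds per 1/division note (4 = quarter, 8 = eighth, 16 = sixteenth, ...)."""
    quarter = 60.0 / bpm
    return quarter * 4.0 / division


d128 = note_duration(187.5, 128)  # 0.01 s
d32 = note_duration(187.5, 32)    # 0.04 s
d16 = note_duration(120.0, 16)    # 0.125 s
```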

🎵 Analysis_frame_length_(s)

Range: 0.01 - 0.1 seconds (10-100 ms)

Default: 0.03 seconds (30 ms)

Tradeoff: Temporal vs. frequency resolution

  • Short frames (10-20 ms): Good temporal precision, poor frequency resolution
  • Medium frames (25-40 ms): Balanced (default)
  • Long frames (50-100 ms): Good frequency resolution, poor temporal precision

Rule: Frame_length should be 2-4× Time_step for reasonable overlap.
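
The ratio rule can be checked numerically for any pair of settings. The sketch below reports the frame-to-step ratio and the fraction of each frame shared with the next one, plus a rough frame count. The count formula is a common approximation and an assumption here; Praat's own frame placement may differ slightly. Function names are hypothetical.

```python
def overlap_info(frame_length, time_step):
    """Return (ratio, overlap fraction) for a frame length and hop, both in seconds."""
    ratio = frame_length / time_step
    overlap = max(0.0, 1.0 - time_step / frame_length)
    return ratio, overlap


def approx_frame_count(duration, frame_length, time_step):
    """Approximate number of whole frames that fit in `duration` seconds."""
    if duration < frame_length:
        return 0
    # The small epsilon guards against floating-point round-off in the division.
    return int((duration - frame_length) / time_step + 1e-9) + 1


ratio, overlap = overlap_info(0.03, 0.01)          # the documented defaults
frames_per_second = approx_frame_count(1.0, 0.03, 0.01)
```

With the defaults the ratio is 3 and consecutive frames share about two thirds of their samples.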

Similarity & Masking Parameters

Parameter | Range | Default | Effect
Similarity_threshold | 0.1-0.9 | 0.5 | Minimum similarity to count in average
Contrast_power | 1.0-20.0 | 4.0 | Nonlinear mapping exponent
High_similarity_boost_(dB) | 0-48 dB | 12 dB | Gain for high similarity frames
Low_similarity_attenuation_(dB) | -60 to 0 dB | -24 dB | Gain for low similarity frames
Mask_smoothing_frames | 0-20 | 3 | Moving average smoothing of scores
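
Two of these parameters act on the per-frame score itself, which the sketch below illustrates. The details are assumptions rather than the script's confirmed behaviour: similarities below Similarity_threshold are left out of the average, the frame's own diagonal entry is skipped, a frame with no qualifying partner scores 0, and smoothing is a centered moving average that shortens at the edges. Function names are hypothetical.

```python
def frame_scores(ssm, similarity_threshold=0.5):
    """Average, per frame, of the similarities that reach the threshold."""
    scores = []
    for i, row in enumerate(ssm):
        kept = [v for j, v in enumerate(row) if j != i and v >= similarity_threshold]
        scores.append(sum(kept) / len(kept) if kept else 0.0)
    return scores


def smooth(scores, frames=3):
    """Centered moving average over roughly `frames` neighbouring scores."""
    if frames <= 1:
        return list(scores)
    half = frames // 2
    out = []
    for i in range(len(scores)):
        window = scores[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


toy_ssm = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.6],
    [0.2, 0.6, 1.0],
]
raw = frame_scores(toy_ssm, similarity_threshold=0.5)
smoothed = smooth([0.0, 3.0, 0.0], frames=3)
```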

Experimental Controls

Add_chaos_(0-1): Adds random noise to similarity scores
  Range: 0.0 (deterministic) to 1.0 (completely random)
  Effect: Creates unpredictable amplitude variations

Hard_threshold_gate: Binary on/off decision instead of smooth fade
  If similarity < Gate_threshold: silence
  Else: full volume

Gate_threshold_(0-1): Similarity cutoff for hard gating
  Range: 0.1-0.9, typically 0.5-0.7
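
The two controls can be sketched as small functions on a list of scores. How the script actually mixes in randomness is not stated here; the crossfade with a uniform random value below is an assumption chosen so that 0.0 leaves scores untouched and 1.0 replaces them entirely. Function names and the `rng` parameter are hypothetical.

```python
import random


def add_chaos(scores, amount, rng=None):
    """Blend each score with a uniform random value; amount 0 = none, 1 = fully random."""
    rng = rng or random.Random()
    out = []
    for s in scores:
        mixed = (1.0 - amount) * s + amount * rng.random()
        out.append(min(max(mixed, 0.0), 1.0))  # keep scores in the 0-1 range
    return out


def hard_gate(scores, gate_threshold=0.5):
    """Binary gate: 0.0 (silence) below the threshold, 1.0 (full volume) otherwise."""
    return [0.0 if s < gate_threshold else 1.0 for s in scores]


scores = [0.2, 0.6, 0.5]
unchanged = add_chaos(scores, 0.0)
noisy = add_chaos(scores, 0.25, rng=random.Random(0))  # seeded for repeatability
gated = hard_gate(scores, gate_threshold=0.5)
```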

Visualization Guide

📊 Understanding the Graphical Output

Six visualization panels:

Panel | Content | Purpose
Title | Script name, filename, mode | Context
Original Waveform | Input audio (gray) | Reference
Processed Waveform | Output audio (blue) | Show effect
Self-Similarity Matrix | SSM heatmap (gray scale) | Show similarity patterns
Similarity Scores | Score curve over time (purple) | Show computed scores
Statistics | Parameters and processing info | Documentation

Reading the SSM Heatmap

Axes: Both X and Y represent time (frame number)

Color coding: Black = similar (SSM value ~1), White = dissimilar (SSM value ~0)

Patterns to look for:

  • Main diagonal: Always at maximum similarity (each frame compared with itself)
  • Off-diagonal lines: Repetitions (e.g., chorus repeating)
  • Blocks: Sections with internal similarity
  • Low-similarity areas: Unique, non-repeating material
  • Symmetry: Should be symmetric across diagonal

Sparse sampling: Only shows every 20th frame (for speed). Complete matrix is computed but not fully displayed.

Interpreting Similarity Scores

X-axis: Frame number (time)

Y-axis: Similarity score (0-1)

What high scores mean: Frame is similar to many other frames (repetitive material)

What low scores mean: Frame is unique (novel material)

Patterns:

  • Constant high: Very repetitive audio (drone, loop)
  • Constant low: Constantly changing (improvisation, noise)
  • Periodic peaks: Repeated sections (verse/chorus structure)
  • Sharp dips: Unique events (transitions, solo moments), since unique frames score low

Creative mode effects: In Inverted mode the score is 1 - average similarity, so the curve is flipped and unique events appear as peaks.

Applications & Creative Uses

Music Production & Sound Design

🎵 Enhancing Repetitive Elements

Use case: Bringing out hidden patterns in ambient music

  1. Load ambient drone or pad sound
  2. Use Ghost Remix preset (Standard mode)
  3. Similarity_threshold = 0.3 (capture subtle repetitions)
  4. Result: Subtle amplitude modulation that follows internal patterns
  5. Mix 30% with original for enhanced texture

Use case: Creating rhythmic gating from non-rhythmic material

  1. Load vocal phrase or sustained instrument
  2. Use Glitch Gating preset
  3. Adjust Time_step to desired rhythm (e.g., 0.125 s for 16th notes at 120 BPM; note that this is above the 0.05 s upper end of the documented Time_step range)
  4. Result: Rhythmic amplitude gating based on phonetic/spectral changes

Audio Restoration & Enhancement

🔍 Noise Reduction via Similarity

Concept: Noise is dissimilar to everything. Signal has self-similarity.

  1. Load noisy recording (speech with background noise)
  2. Use Standard mode with high Similarity_threshold (0.7)
  3. Set High_similarity_boost = 0 dB, Low_similarity_attenuation = -30 dB
  4. Result: Similar (speech) parts preserved, dissimilar (noise) attenuated
  5. Limitation: Works best when noise is spectrally distinct from signal

Vocal breath reduction: Breaths are spectrally similar to silence/noise. Can be attenuated by boosting similar (voiced) sections.

Compositional Analysis

📐 Visualizing Musical Structure

Analytical workflow:

  1. Load complete musical piece
  2. Use Full Quality speed mode
  3. Set Time_step = 0.05 s (coarse for overview)
  4. Enable visualization
  5. Examine SSM for structural patterns:
    • Clear blocks: Distinct sections (verse/chorus)
    • Diagonal lines: Recurring motifs
    • Checkerboard: Alternating patterns
    • Dark areas: Developmental sections

Research applications: Form analysis, repetition detection, style analysis.

Experimental Audio Processing

🧪 Chain Processing for Extreme Effects

EXPERIMENTAL PROCESSING CHAIN:

  1. SOURCE: Speech recording
  2. PROCESS 1: Self-Similarity → Brutal Novelty preset
     Result: Only most unique phonemes survive
  3. PROCESS 2: Take output → Spectral Mosaic preset (Self-Mosaic mode)
     Result: Each moment replaced by similar moment
  4. PROCESS 3: Take output → Chaotic Tremolo preset
     Result: Rapid amplitude quantization
  5. MIX: Blend all three layers with different panning
  6. ADD: Global reverb and delay

RESULT: Completely transformed audio texture

Creative potential: Endless variations by combining modes and presets.