Self-Similarity Spectral Resynthesis: MFCC-Based Audio Transformation

A Praat script that analyzes audio through MFCCs, computes self-similarity matrices, and creates radical spectral transformations based on temporal similarity patterns.

Author: Shai Cohen
Affiliation: Department of Music, Bar-Ilan University, Israel
Version: 2.2 (2025)
Concept: Self-similarity matrix audio processing
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

Contents:

What this does
Quick Start
Self-Similarity Theory
MFCC Analysis
Creative Modes
Preset Configurations
Spatial Processing
Parameters Explained
Visualization Guide
Applications

What this does

The Self-Similarity Spectral Resynthesis script transforms audio by analyzing its internal temporal patterns. Instead of processing frequencies directly, it uses MFCC features to detect when audio segments sound similar to each other, then applies dynamic amplitude changes based on these similarity patterns.

🔍 Core Process Pipeline

MFCC Extraction: Convert audio to Mel-frequency cepstral coefficients
Self-Similarity Matrix (SSM): Compute how each moment resembles every other moment
Similarity Scoring: Calculate average similarity for each time frame
Creative Transformation: Apply one of 5 creative modes to similarity scores
Gain Mapping: Convert scores to amplitude envelopes
Spatial Processing: Apply in mono, stereo, or spatial modes
Resynthesis: Multiply original audio by dynamic envelope

Key Insight: Audio contains hidden temporal patterns. Repetitive sections are boosted (or attenuated, depending on the mode), and novel sections receive the opposite treatment. The result reveals the "skeleton" of the audio's temporal structure.
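The final pipeline step (multiplying the audio by a dynamic envelope) can be sketched in a few lines of Python. This is a minimal illustration, not the script's Praat code: the function name `apply_envelope`, the linear interpolation between frame gains, and the assumption that frame k sits at time k × time_step are all choices made for this example.

```python
def apply_envelope(samples, frame_gains, sample_rate, time_step):
    """Multiply each sample by a gain interpolated between per-frame gains.

    Assumption for this sketch: frame k is located at time k * time_step,
    and gains are interpolated linearly between neighbouring frames.
    """
    last = len(frame_gains) - 1
    out = []
    for n, x in enumerate(samples):
        pos = n / sample_rate / time_step      # fractional frame index
        i = min(int(pos), last)                # frame at or before this sample
        j = min(i + 1, last)                   # next frame (clamped at the end)
        frac = min(pos - i, 1.0)               # position between the two frames
        gain = frame_gains[i] * (1.0 - frac) + frame_gains[j] * frac
        out.append(x * gain)
    return out

# Toy input: four samples at 2 Hz, two frames one second apart.
processed = apply_envelope([1.0, 1.0, 1.0, 1.0], [0.0, 1.0],
                           sample_rate=2, time_step=1.0)
print(processed)
```

Interpolating the gains, rather than holding each one for a whole frame, avoids audible steps at frame boundaries; a stepped envelope would simply use `frame_gains[i]` directly.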

Quick Start

In Praat, select exactly one Sound object.
Open the script editor and load Self_Similarity_Resynthesis.praat.
Choose a Preset based on desired effect:
- Glitch Gating: Extreme rhythmic gating
- Ghost Remix: Subtle spectral enhancement
- Brutal Novelty: Maximum contrast effects
- Spectral Mosaic: Frame substitution effect
- Chaotic Tremolo: Rapid amplitude modulation
Choose Creative_mode (Standard, Inverted, etc.)
Select Speed_mode based on your patience:
- Full Quality: Slow but accurate
- Balanced: Downsamples to 16 kHz, 2-4× faster
- Fast: Downsamples to 8 kHz, 4-8× faster
Enable Draw_visualization to see SSM and scores.
Click Run → Run (or Ctrl+R).
Listen to the transformed audio.

Best Practices for First Use:

Start with "Ghost Remix" preset - subtle and musical
Use Balanced speed mode for reasonable processing time
Enable visualization to understand what's happening
Try on vocal recordings first - effects are most audible
Adjust Similarity_threshold to control effect intensity

Processing Characteristics:

MFCC-based: Works on spectral shapes, not raw audio
Frame-based: Processes in 10-30ms frames (adjustable)
Non-realtime: Analyzes entire file before processing
Memory intensive: SSM computation grows with file length
Stereo handling: Multiple spatial modes available
Speed tradeoffs: Downsampling reduces quality but speeds up

Self-Similarity Matrix Theory

What is a Self-Similarity Matrix (SSM)?

🎲 Mathematical Definition

Given: A sequence of feature vectors F₁, F₂, ..., Fₙ

SSM: An n×n matrix where entry S(i,j) = similarity(Fᵢ, Fⱼ)

S(i,j) = cosine_similarity(Fᵢ, Fⱼ) = (Fᵢ · Fⱼ) / (‖Fᵢ‖ ‖Fⱼ‖)

WHERE:
    Fᵢ = feature vector at time i (MFCCs)
    · = dot product
    ‖·‖ = Euclidean norm

Result: -1 to 1, where 1 = identical, 0 = orthogonal, -1 = opposite

Visual interpretation: A heatmap where bright spots indicate similar moments in time.
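The definition above can be tried on toy data. The following Python sketch is illustrative only: the helper names and the two-dimensional "feature vectors" are made up for the example, and treating an all-zero vector as similarity 0 is an assumption rather than documented behavior of the script.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero frame is treated as dissimilar
    return dot / (norm_a * norm_b)

def self_similarity_matrix(features):
    """Full n x n matrix with S[i][j] = cosine_similarity(F_i, F_j)."""
    n = len(features)
    ssm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):          # upper triangle, mirrored below
            s = cosine_similarity(features[i], features[j])
            ssm[i][j] = s
            ssm[j][i] = s
    return ssm

# Toy 2-D vectors standing in for MFCC frames.
toy_features = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
ssm = self_similarity_matrix(toy_features)
for row in ssm:
    print(row)
```

Frames 0 and 2 point in the same direction and differ only in length, so their similarity is 1; this is the sense in which cosine similarity ignores overall level.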

SSM Properties in Audio

Diagonal: S(i,i) = 1 (every moment is identical to itself)

Symmetry: S(i,j) = S(j,i) (similarity is mutual)

Repetition patterns: Off-diagonal bright lines indicate repetitions

Novelty: Dark rows/columns indicate unique moments

Structure: Block patterns indicate sections

Computational Optimization

OPTIMIZATION STRATEGIES:

1. Sparse Computation: Only compute every 5th row and column
   Cuts the number of similarity values from n² to about n²/25
   (a constant-factor saving; the cost still grows quadratically with n)
2. Early Normalization: Normalize feature vectors before SSM
   Avoids repeated normalization in similarity calculation
3. Downsampling: Process at 8-16 kHz instead of original sample rate
   Reduces the number of samples per analysis frame proportionally
   (the frame count itself is set by duration and Time_step)

ACTUAL IMPLEMENTATION:

For n frames, compute ~n²/25 similarity values
Memory: store n×n matrix (float, 4 bytes each)
Example: 1000 frames → 1,000,000 cells → 4 MB
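The arithmetic behind these figures is easy to reproduce. The Python sketch below uses a hypothetical helper, `ssm_cost`, and assumes that rows and columns are visited at indices 0, 5, 10, and so on, and that each cell takes 4 bytes as stated above.

```python
def ssm_cost(n_frames, stride=5, bytes_per_cell=4):
    """Return (similarity values computed, bytes needed for the full matrix)."""
    sampled = len(range(0, n_frames, stride))   # rows (and columns) visited
    computed = sampled * sampled                # about n^2 / stride^2
    memory_bytes = n_frames * n_frames * bytes_per_cell
    return computed, memory_bytes

computed, memory_bytes = ssm_cost(1000)
print(computed, "similarity values,", memory_bytes / 1e6, "MB")
```

Because both quantities grow with the square of the frame count, doubling the file length (or halving Time_step) roughly quadruples both.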

MFCC Analysis Fundamentals

🎵 Mel-Frequency Cepstral Coefficients

Why MFCCs for similarity?

Perceptually relevant: Based on Mel scale (human hearing)
Compact representation: 8-13 coefficients capture spectral shape
Invariant to amplitude: Normalized, so loudness doesn't affect similarity
Robust: Less sensitive to pitch shifts than raw spectra
Standard: Widely used in speech/music analysis

MFCC Extraction Pipeline

1. PRE-EMPHASIS: Apply high-pass filter: y[n] = x[n] - 0.97 * x[n-1]
   Boosts high frequencies
2. FRAMING: Divide into overlapping frames (e.g., 30ms, 10ms hop)
3. WINDOWING: Apply Hamming window to reduce edge effects
4. FFT: Compute power spectrum
5. MEL FILTERBANK: Apply triangular filters on Mel scale (20-40 filters)
6. LOG COMPRESSION: Take logarithm of filterbank energies
7. DCT: Discrete Cosine Transform to decorrelate coefficients
   Keep first 8-13 coefficients (cepstral coefficients)
8. NORMALIZATION: Scale to unit length for similarity computation
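Steps 1 to 3 and step 8 are simple enough to sketch directly. The Python below is a partial illustration, not the script's code (in Praat, MFCC extraction is handled by Praat itself): the function names are invented for the example, and the FFT, Mel filterbank, log, and DCT stages are deliberately left out.

```python
import math

def pre_emphasis(x, coeff=0.97):
    """Step 1: y[n] = x[n] - coeff * x[n-1]; the first sample passes through."""
    if not x:
        return []
    return [x[0]] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]

def hamming(length):
    """Step 3: standard Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (length - 1))
            for k in range(length)]

def windowed_frames(x, frame_len, hop):
    """Steps 2 and 3: overlapping frames, each multiplied by a Hamming window."""
    window = hamming(frame_len)
    return [[x[start + k] * window[k] for k in range(frame_len)]
            for start in range(0, len(x) - frame_len + 1, hop)]

def unit_length(v):
    """Step 8: scale a feature vector to Euclidean norm 1."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0 else list(v)

# 0.1 s of a 440 Hz sine at 16 kHz, 30 ms frames (480 samples), 10 ms hop (160).
sample_rate = 16000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(1600)]
framed = windowed_frames(pre_emphasis(signal), frame_len=480, hop=160)
print(len(framed), "frames of", len(framed[0]), "samples")
```

With unit-length vectors, the cosine similarity of two frames reduces to a plain dot product, which is the point of the early normalization.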

Parameter Effects on Analysis

Parameter	Effect	Typical Value
Time_step_(s)	Frame rate (temporal resolution)	0.01 s (100 Hz)
Analysis_frame_length_(s)	Window size (frequency resolution)	0.03 s (30 ms)
Number_of_MFCCs	Dimensionality of feature vector	8-13 coefficients

Creative Transformation Modes

Mode 1: Standard (Similarity Boost)

📈 Boost Repetitive Sections

Logic: High similarity → high amplitude

score = average_similarity(frame)
gain = map(score, 0→1, attenuation→boost)

Effect: Repetitive parts (chorus, loops) get louder. Novel parts (verses, transitions) get quieter.

Musical use: Emphasize recurring themes, create "ghostly" emphasis on repeated material.
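One plausible reading of the Standard-mode logic is sketched below in Python. Several details are assumptions made for illustration and are not confirmed by this document: the diagonal is excluded from the average, only similarities at or above the threshold are counted, the contrast exponent is applied to the score before mapping, and the mapping interpolates linearly in dB. The function names are hypothetical; the default values are the ones listed under Parameters Explained.

```python
def frame_scores(ssm, threshold=0.5):
    """Per-frame average of off-diagonal similarities at or above threshold."""
    scores = []
    for i, row in enumerate(ssm):
        kept = [s for j, s in enumerate(row) if j != i and s >= threshold]
        scores.append(sum(kept) / len(kept) if kept else 0.0)
    return scores

def score_to_gain(score, boost_db=12.0, atten_db=-24.0, contrast=4.0):
    """Map a 0..1 score to a linear gain between attenuation and boost."""
    shaped = min(max(score, 0.0), 1.0) ** contrast      # contrast exponent
    gain_db = atten_db + (boost_db - atten_db) * shaped  # interpolate in dB
    return 10.0 ** (gain_db / 20.0)                      # dB to linear amplitude

toy_ssm = [[1.0, 0.8, 0.2],
           [0.8, 1.0, 0.6],
           [0.2, 0.6, 1.0]]
scores = frame_scores(toy_ssm)
gains = [score_to_gain(s) for s in scores]
print(scores)
print(gains)
```

Under these assumptions, a higher Contrast_power pushes mid-range scores toward the attenuation end, so only the most self-similar frames receive the boost.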

Mode 2: Inverted (Novelty Extractor)

🔄 Boost Novel Sections

Logic: High similarity → low amplitude (inverted)

score = 1 - average_similarity(frame)
gain = map(score, 0→1, attenuation→boost)

Effect: Unique moments get emphasized. Repetitive material fades to background.

Musical use: Highlight unique events, solos, transitions. Creates "glitch gating" effect.

Mode 3: Diagonal Recurrence

↗️ Emphasize Sequential Patterns

Logic: Weight diagonal neighbors more heavily

score = 0.5 * average_similarity + 0.5 * diagonal_similarity

WHERE diagonal_similarity = average of S(i, i±offset)

Effect: Emphasizes sequential repetitions (echoes, rhythmic patterns).

Musical use: Bring out rhythmic motifs, sequential developments.
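The diagonal weighting can be sketched as follows. This Python example is a hedged illustration: the document does not state the offset value or how edges are handled, so `offset` is a free parameter here and frames with no neighbor at ±offset fall back to the plain average.

```python
def diagonal_recurrence_scores(ssm, offset=10):
    """score_i = 0.5 * mean(S[i][:]) + 0.5 * mean(S[i][i - offset], S[i][i + offset])."""
    n = len(ssm)
    scores = []
    for i in range(n):
        row = ssm[i]
        average = sum(row) / n
        # Keep only the +/- offset neighbours that fall inside the matrix.
        neighbours = [row[j] for j in (i - offset, i + offset) if 0 <= j < n]
        diagonal = sum(neighbours) / len(neighbours) if neighbours else average
        scores.append(0.5 * average + 0.5 * diagonal)
    return scores

toy_ssm = [[1.0, 0.5, 0.0],
           [0.5, 1.0, 0.5],
           [0.0, 0.5, 1.0]]
diag_scores = diagonal_recurrence_scores(toy_ssm, offset=1)
print(diag_scores)
```

Choosing the offset to match a known period (for example, the number of frames in one beat) would make the score respond to repetitions at that period.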

Mode 4: Hard Quantized (3 levels)

🎚️ Three-Level Amplitude Quantization

Logic: Discretize similarity scores into 3 levels

score = floor(average_similarity * 3) / 3
Result: 0, 0.333, or 0.667 for scores below 1 (a score of exactly 1.0 maps to 1.0)

Effect: Creates stepped amplitude changes, like a "tremolo with 3 states".

Musical use: Robotic/quantized effects, rhythmic stuttering.
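The quantization formula is a one-liner; the Python sketch below (with a hypothetical `quantize` helper) shows which input ranges land on which level.

```python
import math

def quantize(score, levels=3):
    """floor(score * levels) / levels, as in the formula above."""
    return math.floor(score * levels) / levels

for s in (0.0, 0.2, 0.4, 0.5, 0.7, 0.9, 1.0):
    print(s, "->", quantize(s))
```

Scores in [0, 1/3) map to 0, scores in [1/3, 2/3) map to 1/3, and scores in [2/3, 1) map to 2/3. The value 1.0 is produced only when the input is exactly 1, which is why the effect is heard as three levels in practice.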

Mode 5: Self-Mosaic (Frame Substitution)

🧩 Replace Frames with Similar Ones

Logic: For each frame, find most similar other frame, copy its audio

FOR each target frame t:
    Find source frame s with max S(t,s) where |t-s| > 5
    Copy audio from source s to target t

Effect: Creates audio "mosaic" where each moment is replaced by its most similar counterpart.

Musical use: Extreme texture transformation, "audio collage" effect.
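The substitution loop can be sketched in Python as below. The function names, the toy similarity matrix, and the fallback for frames with no candidate far enough away (the frame keeps its own audio) are assumptions for this example; how the script handles that case is not stated here.

```python
def mosaic_sources(ssm, min_distance=5):
    """For each target frame t, pick the most similar frame s with |t - s| > min_distance."""
    n = len(ssm)
    sources = []
    for t in range(n):
        candidates = [s for s in range(n) if abs(t - s) > min_distance]
        if candidates:
            sources.append(max(candidates, key=lambda s: ssm[t][s]))
        else:
            sources.append(t)   # assumption: no distant frame, keep the original
    return sources

def build_mosaic(frames, sources):
    """Replace each frame's audio with a copy of its chosen source frame."""
    return [list(frames[s]) for s in sources]

# Toy 8-frame matrix: even frames resemble even frames, odd resemble odd,
# with a small penalty for distance in time.
n = 8
toy_ssm = [[1.0 - 0.5 * abs(i % 2 - j % 2) - 0.01 * abs(i - j) for j in range(n)]
           for i in range(n)]
sources = mosaic_sources(toy_ssm)
toy_frames = [[float(k)] for k in range(n)]   # one-sample "frames" labelled by index
mosaic = build_mosaic(toy_frames, sources)
print(sources)
```

The minimum-distance constraint is what keeps a frame from selecting itself or an immediate neighbor, which would leave the audio essentially unchanged.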

Preset Configurations

Preset 1: Glitch Gating

⚡ Extreme Rhythmic Gating

Parameter	Value	Effect
Creative_mode	Inverted (2)	Boost novel sections
Time_step	0.005 s	High temporal resolution
Analysis_frame_length	0.02 s	Short frames for precision
Similarity_threshold	0.6	High threshold = strict similarity
Contrast_power	6.0	Extreme contrast
Hard_threshold_gate	On	Binary on/off gating
Gate_threshold	0.6	60% similarity cutoff
High_similarity_boost	18 dB	Strong boost
Low_similarity_attenuation	-36 dB	Strong attenuation

Result: Aggressive rhythmic gating effect. Works great on drums, percussion, staccato material.

Preset 2: Ghost Remix

👻 Subtle Spectral Enhancement

Parameter	Value	Effect
Creative_mode	Standard (1)	Boost similar sections
Time_step	0.008 s	Medium resolution
Analysis_frame_length	0.025 s	Standard frame size
Similarity_threshold	0.4	Moderate threshold
Contrast_power	3.0	Gentle contrast
Mask_smoothing_frames	5	Smooth transitions
High_similarity_boost	15 dB	Moderate boost
Low_similarity_attenuation	-20 dB	Moderate attenuation

Result: Subtle enhancement that brings out repetitive elements. Good for vocals, melodic material.

Preset 3: Brutal Novelty

💥 Maximum Contrast Effects

Parameter	Value	Effect
Creative_mode	Inverted (2)	Boost novel sections
Time_step	0.005 s	High resolution
Analysis_frame_length	0.02 s	Short frames
Similarity_threshold	0.3	Low threshold = more "novelty"
Contrast_power	8.0	Extreme contrast
Hard_threshold_gate	On	Binary gating
Gate_threshold	0.7	70% cutoff
High_similarity_boost	24 dB	Very strong boost
Low_similarity_attenuation	-48 dB	Near-silence for similar

Result: Extreme glitch effect that isolates unique moments. Only the most novel material survives.

Preset 4: Spectral Mosaic

🧩 Frame Substitution Effect

Parameter	Value	Effect
Creative_mode	Self-Mosaic (5)	Frame substitution
Time_step	0.02 s	Longer steps for coherence
Analysis_frame_length	0.05 s	Long frames
Similarity_threshold	0.4	Moderate similarity

Result: Audio "collage" where each moment is replaced by its most similar counterpart. Creates dreamlike, repetitive texture.

Preset 5: Chaotic Tremolo

🌀 Rapid Amplitude Modulation

Parameter	Value	Effect
Creative_mode	Hard Quantized (4)	3-level quantization
Time_step	0.003 s	Very fast updates
Analysis_frame_length	0.015 s	Short analysis
Contrast_power	10.0	High contrast
Hard_threshold_gate	On	Binary decisions
Gate_threshold	0.5	50% cutoff
Add_chaos	0.25	High randomness
High_similarity_boost	20 dB	Strong boost
Low_similarity_attenuation	-40 dB	Strong attenuation

Result: Rapid, chaotic amplitude modulation with 3 distinct levels. Creates nervous, animated texture.

Spatial Processing Modes

🔊 Five Spatialization Strategies

Mode	Algorithm	Effect	Best For
Mono	Convert to mono, apply single envelope	Centered, focused	Compatibility, mono systems
Preserve Stereo	Apply same envelope to both channels	Stereo image preserved	Most stereo material
Stereo Wide	Different EQ per channel + envelope	Enhanced width	Creating spatial interest
Rotating	Panning modulation + envelope	Circular motion	Experimental, psychedelic
Mid-Side	Apply envelope mainly to center	Focus on center	Vocals, lead elements

Stereo Wide Implementation

IF source is stereo:
    Extract left and right channels
ELSE:
    Duplicate mono to both channels

LEFT CHANNEL: Apply envelope × (1 + 0.3 × exp(-5×t))
    Boosts low-mid frequencies over time
RIGHT CHANNEL: Apply envelope × (1 + 0.3 × (1 - exp(-5×t)))
    Boosts high-mid frequencies over time

RESULT: Frequency-dependent stereo widening
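The two channel multipliers can be evaluated directly. Read literally, they are time-dependent gain factors applied on top of the envelope; any per-channel EQ would be a separate step that the formulas above do not show. The Python sketch assumes t is measured in seconds from the start of the sound, and the function name is hypothetical.

```python
import math

def stereo_wide_gains(t):
    """Extra gain factors for (left, right) at time t, per the formulas above."""
    decay = math.exp(-5.0 * t)
    left = 1.0 + 0.3 * decay            # starts at 1.3, settles toward 1.0
    right = 1.0 + 0.3 * (1.0 - decay)   # starts at 1.0, rises toward 1.3
    return left, right

for t in (0.0, 0.1, 0.5, 1.0, 2.0):
    print(t, stereo_wide_gains(t))
```

The two factors always sum to 2.3, so the level shifts from the left channel to the right over roughly the first second while the combined gain stays constant.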

Rotating Panning Effect

1. Convert to mono if stereo
2. Apply envelope
3. Create left channel: left = processed × (0.6 + 0.4 × cos(2π × 0.2 × t))
4. Create right channel: right = processed × (0.6 + 0.4 × sin(2π × 0.2 × t))
5. Combine to stereo

EFFECT: 0.2 Hz (5 second period) panning rotation
Amplitude modulation creates perceived motion
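The panning gains in steps 3 and 4 can be tabulated with a short Python sketch. The function name `rotating_pan_gains` is invented for the example, and t is assumed to be in seconds.

```python
import math

def rotating_pan_gains(t, rate_hz=0.2):
    """Return (left, right) gain factors at time t for the rotating mode."""
    phase = 2.0 * math.pi * rate_hz * t
    left = 0.6 + 0.4 * math.cos(phase)
    right = 0.6 + 0.4 * math.sin(phase)
    return left, right

# One full rotation takes 1 / 0.2 = 5 seconds.
for t in (0.0, 1.25, 2.5, 3.75, 5.0):
    left, right = rotating_pan_gains(t)
    print(t, round(left, 3), round(right, 3))
```

Each channel's gain swings between 0.2 and 1.0, with the right channel lagging the left by a quarter of a cycle; neither channel is ever fully muted, so the motion reads as rotation rather than hard panning.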

Parameters Explained

Analysis Parameters

⏱️ Time_step_(s)

Range: 0.001 - 0.05 seconds (1-50 ms)

Default: 0.01 seconds (10 ms, 100 Hz frame rate)

Effect:

Small values (1-5 ms): High temporal resolution, smoother envelopes, slower processing
Medium values (8-15 ms): Good balance, musical timing
Large values (20-50 ms): Coarse resolution, rhythmic stepping, faster processing

Musical correlation: 10 ms ≈ 128th note at 187.5 BPM (a 32nd note at that tempo lasts 40 ms)
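Relating Time_step to musical durations is plain arithmetic: a quarter note lasts 60 / BPM seconds and shorter notes divide that evenly. The Python helpers below are hypothetical names used only for this illustration.

```python
def note_duration_s(bpm, division):
    """Seconds per note; division=4 is a quarter note, 8 an eighth, 16 a sixteenth."""
    quarter = 60.0 / bpm
    return quarter * 4.0 / division

def frame_rate_hz(time_step_s):
    """Frames per second for a given Time_step."""
    return 1.0 / time_step_s

print(frame_rate_hz(0.01))            # frame rate at the default Time_step
print(note_duration_s(187.5, 32))     # 32nd note at 187.5 BPM
print(note_duration_s(187.5, 128))    # 128th note at 187.5 BPM
print(note_duration_s(120, 16))       # 16th note at 120 BPM
```

Under this convention a 128th note at 187.5 BPM comes out at 0.01 s, which matches the default Time_step.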

🎵 Analysis_frame_length_(s)

Range: 0.01 - 0.1 seconds (10-100 ms)

Default: 0.03 seconds (30 ms)

Tradeoff: Temporal vs. frequency resolution

Short frames (10-20 ms): Good temporal precision, poor frequency resolution
Medium frames (25-40 ms): Balanced (default)
Long frames (50-100 ms): Good frequency resolution, poor temporal precision

Rule: Frame_length should be 2-4× Time_step for reasonable overlap.

Similarity & Masking Parameters

Parameter	Range	Default	Effect
Similarity_threshold	0.1-0.9	0.5	Minimum similarity to count in average
Contrast_power	1.0-20.0	4.0	Nonlinear mapping exponent
High_similarity_boost_(dB)	0-48 dB	12 dB	Gain for high similarity frames
Low_similarity_attenuation_(dB)	-60 to 0 dB	-24 dB	Gain for low similarity frames
Mask_smoothing_frames	0-20	3	Moving average smoothing of scores
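Mask_smoothing_frames is described as a moving average over the score curve. A minimal Python sketch of a centered moving average follows; the handling of the edges (averaging over whatever part of the window fits) and the function name are assumptions for the example.

```python
def smooth_scores(scores, width=3):
    """Centered moving average; the window shrinks at the edges."""
    if width <= 1:
        return list(scores)
    half = width // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed

# A single-frame spike is spread over its neighbours.
smoothed = smooth_scores([0.0, 0.0, 1.0, 0.0, 0.0], width=3)
print(smoothed)
```

Wider windows trade responsiveness for smoothness: isolated spikes in the score are flattened, which reduces clicks in the resulting gain envelope.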

Experimental Controls

Add_chaos_(0-1): Adds random noise to similarity scores
    Range: 0.0 (deterministic) to 1.0 (completely random)
    Effect: Creates unpredictable amplitude variations

Hard_threshold_gate: Binary on/off decision instead of smooth fade
    If similarity < Gate_threshold: silence
    Else: full volume

Gate_threshold_(0-1): Similarity cutoff for hard gating
    Range: 0.1-0.9, typically 0.5-0.7
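The two experimental controls can be sketched in Python as follows. The gate follows the rule stated above. The chaos formula, by contrast, is a guess made for illustration: it crossfades between the score and a uniform random value so that 0.0 is deterministic and 1.0 is fully random, but the script may mix in noise differently. Both function names are hypothetical.

```python
import random

def add_chaos(score, amount, rng):
    """Blend a score with uniform noise (assumed formula), clamped to 0..1."""
    noisy = (1.0 - amount) * score + amount * rng.random()
    return min(max(noisy, 0.0), 1.0)

def hard_gate(score, gate_threshold=0.6):
    """Binary gate: 0.0 (silence) below the threshold, 1.0 (full volume) otherwise."""
    return 0.0 if score < gate_threshold else 1.0

rng = random.Random(1)                 # seeded so the run is repeatable
scores = [0.2, 0.55, 0.61, 0.9]
chaotic = [add_chaos(s, 0.25, rng) for s in scores]
gated = [hard_gate(s) for s in scores]
print(chaotic)
print(gated)
```

When both controls are on, scores close to Gate_threshold are the ones most likely to flip between silence and full volume from run to run, since a small random offset is enough to carry them across the cutoff.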

Visualization Guide

📊 Understanding the Graphical Output

Six visualization panels (title, four plots, and a statistics block):

Panel	Content	Purpose
Title	Script name, filename, mode	Context
Original Waveform	Input audio (gray)	Reference
Processed Waveform	Output audio (blue)	Show effect
Self-Similarity Matrix	SSM heatmap (gray scale)	Show similarity patterns
Similarity Scores	Score curve over time (purple)	Show computed scores
Statistics	Parameters and processing info	Documentation

Reading the SSM Heatmap

Axes: Both X and Y represent time (frame number)

Color coding: Black = similar (SSM value ~1), White = dissimilar (SSM value ~0)

Patterns to look for:

Strong diagonal: Always present (each frame is identical to itself); drawn black under the color coding above
Off-diagonal lines: Repetitions (e.g., chorus repeating)
Blocks: Sections with internal similarity
Pale (near-white) areas: Unique, non-repeating material
Symmetry: Should be symmetric across diagonal

Sparse sampling: Only shows every 20th frame (for speed). Complete matrix is computed but not fully displayed.

Interpreting Similarity Scores

X-axis: Frame number (time)

Y-axis: Similarity score (0-1)

What high scores mean: Frame is similar to many other frames (repetitive material)

What low scores mean: Frame is unique (novel material)

Patterns:

Constant high: Very repetitive audio (drone, loop)
Constant low: Constantly changing (improvisation, noise)
Periodic peaks: Repeated sections (verse/chorus structure)
Spikes: Unique events (transitions, solo moments)

Creative mode effects: Inverted mode shows opposite pattern.

Applications & Creative Uses

Music Production & Sound Design

🎵 Enhancing Repetitive Elements

Use case: Bringing out hidden patterns in ambient music

Load ambient drone or pad sound
Use Ghost Remix preset (Standard mode)
Similarity_threshold = 0.3 (capture subtle repetitions)
Result: Subtle amplitude modulation that follows internal patterns
Mix 30% with original for enhanced texture

Use case: Creating rhythmic gating from non-rhythmic material

Load vocal phrase or sustained instrument
Use Glitch Gating preset
Adjust Time_step to desired rhythm (e.g., 0.125 s for 16th notes at 120 BPM; note that this is above the 0.05 s upper limit listed for Time_step under Parameters Explained)
Result: Rhythmic amplitude gating based on phonetic/spectral changes

Audio Restoration & Enhancement

🔍 Noise Reduction via Similarity

Concept: Noise is dissimilar to everything. Signal has self-similarity.

Load noisy recording (speech with background noise)
Use Standard mode with high Similarity_threshold (0.7)
Set High_similarity_boost = 0 dB, Low_similarity_attenuation = -30 dB
Result: Similar (speech) parts preserved, dissimilar (noise) attenuated
Limitation: Works best when noise is spectrally distinct from signal

Vocal breath reduction: Breaths are spectrally similar to silence/noise, so they can be pushed down relative to the voice by boosting the similar (voiced) sections.

Compositional Analysis

📐 Visualizing Musical Structure

Analytical workflow:

Load complete musical piece
Use Full Quality speed mode
Set Time_step = 0.05 s (coarse for overview)
Enable visualization
Examine SSM for structural patterns:

Clear blocks: Distinct sections (verse/chorus)
Diagonal lines: Recurring motifs
Checkerboard: Alternating patterns
Dark areas: Developmental sections

Research applications: Form analysis, repetition detection, style analysis.

Experimental Audio Processing

🧪 Chain Processing for Extreme Effects

EXPERIMENTAL PROCESSING CHAIN:

1. SOURCE: Speech recording
2. PROCESS 1: Self-Similarity → Brutal Novelty preset
   Result: Only most unique phonemes survive
3. PROCESS 2: Take output → Spectral Mosaic preset (Self-Mosaic mode)
   Result: Each moment replaced by similar moment
4. PROCESS 3: Take output → Chaotic Tremolo preset
   Result: Rapid amplitude quantization
5. MIX: Blend all three layers with different panning
6. ADD: Global reverb and delay

RESULT: Completely transformed audio texture

Creative potential: Endless variations by combining modes and presets.