Self-Similarity Spectral Resynthesis: MFCC-Based Audio Transformation
A Praat script that analyzes audio through MFCCs, computes self-similarity matrices, and creates radical spectral transformations based on temporal similarity patterns.
What this does
The Self-Similarity Spectral Resynthesis script transforms audio by analyzing its internal temporal patterns. Instead of processing frequencies directly, it uses MFCC features to detect when audio segments sound similar to each other, then applies dynamic amplitude changes based on these similarity patterns.
🔍 Core Process Pipeline
- MFCC Extraction: Convert audio to Mel-frequency cepstral coefficients
- Self-Similarity Matrix (SSM): Compute how each moment resembles every other moment
- Similarity Scoring: Calculate average similarity for each time frame
- Creative Transformation: Apply one of 5 creative modes to similarity scores
- Gain Mapping: Convert scores to amplitude envelopes
- Spatial Processing: Apply in mono, stereo, or spatial modes
- Resynthesis: Multiply original audio by dynamic envelope
Key Insight: Audio contains hidden temporal patterns. Depending on the mode, repetitive sections are boosted or attenuated, and novel sections get the opposite treatment. The result reveals the "skeleton" of the audio's temporal structure.
Quick Start
- In Praat, select exactly one Sound object.
- Open the script editor and load Self_Similarity_Resynthesis.praat.
- Choose a Preset based on desired effect:
- Glitch Gating: Extreme rhythmic gating
- Ghost Remix: Subtle spectral enhancement
- Brutal Novelty: Maximum contrast effects
- Spectral Mosaic: Frame substitution effect
- Chaotic Tremolo: Rapid amplitude modulation
- Choose Creative_mode (Standard, Inverted, etc.)
- Select Speed_mode based on your patience:
- Full Quality: Slow but accurate
- Balanced: 16 kHz, 2-4× faster
- Fast: 8 kHz, 4-8× faster
- Enable Draw_visualization to see SSM and scores.
- Click Run → Run (or Ctrl+R).
- Listen to the transformed audio.
- Start with "Ghost Remix" preset - subtle and musical
- Use Balanced speed mode for reasonable processing time
- Enable visualization to understand what's happening
- Try on vocal recordings first - effects are most audible
- Adjust Similarity_threshold to control effect intensity
- MFCC-based: Works on spectral shapes, not raw audio
- Frame-based: Processes in 10-30ms frames (adjustable)
- Non-realtime: Analyzes entire file before processing
- Memory intensive: SSM computation grows with file length
- Stereo handling: Multiple spatial modes available
- Speed tradeoffs: Downsampling reduces quality but speeds up
Self-Similarity Matrix Theory
What is a Self-Similarity Matrix (SSM)?
🎲 Mathematical Definition
Given: A sequence of feature vectors F₁, F₂, ..., Fₙ
SSM: An n×n matrix where entry S(i,j) = similarity(Fᵢ, Fⱼ)
Visual interpretation: A heatmap in which high-similarity cells stand out, marking moments in time that resemble each other.
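As a minimal illustration of this definition, the Python sketch below builds an SSM from a list of feature vectors. Cosine similarity is an assumption here, since the text does not name the similarity measure the script uses, and `cosine_similarity` and `self_similarity_matrix` are hypothetical helper names rather than parts of the Praat script.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: an all-zero frame matches nothing
    return dot / (norm_a * norm_b)

def self_similarity_matrix(features):
    """Return the n x n matrix S with S[i][j] = similarity(F_i, F_j)."""
    n = len(features)
    return [[cosine_similarity(features[i], features[j]) for j in range(n)]
            for i in range(n)]

# Three toy feature frames: frames 0 and 2 point in the same direction.
frames = [[1.0, 2.0, 0.5], [-2.0, 0.1, 1.0], [2.0, 4.0, 1.0]]
ssm = self_similarity_matrix(frames)
```

Cosine similarity ranges over [-1, 1], so an implementation that needs values in [0, 1] would have to clip or rescale the result.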
SSM Properties in Audio
Diagonal: S(i,i) = 1 (every moment is identical to itself)
Symmetry: S(i,j) = S(j,i) (similarity is mutual)
Repetition patterns: Off-diagonal lines of high similarity indicate repetitions
Novelty: Rows/columns of low similarity indicate unique moments
Structure: Block patterns indicate sections
Computational Optimization
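One optimization follows directly from the symmetry and unit-diagonal properties listed above: normalize each frame once, compute only the upper triangle, and mirror it. The Python sketch below is an assumption about how this could be done, not the script's actual code. `ssm_upper_triangle` and `pair_count` are hypothetical names, and cosine similarity is again assumed.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)

def ssm_upper_triangle(features):
    """Cosine SSM that computes each frame pair once and mirrors the result."""
    unit = [normalize(f) for f in features]  # one normalization per frame
    n = len(unit)
    ssm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        ssm[i][i] = 1.0  # diagonal is known in advance
        for j in range(i + 1, n):
            s = sum(x * y for x, y in zip(unit[i], unit[j]))
            ssm[i][j] = s
            ssm[j][i] = s  # symmetry: S(i,j) = S(j,i)
    return ssm

def pair_count(n):
    """Number of dot products needed for n frames with this scheme."""
    return n * (n - 1) // 2
```

For 1000 frames this needs 499,500 dot products instead of 1,000,000, roughly halving the work. The cost still grows quadratically with the number of frames.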
MFCC Analysis Fundamentals
🎵 Mel-Frequency Cepstral Coefficients
Why MFCCs for similarity?
- Perceptually relevant: Based on Mel scale (human hearing)
- Compact representation: 8-13 coefficients capture spectral shape
- Largely invariant to amplitude: A level change mainly shifts the 0th coefficient, so loudness has little effect on similarity once the features are normalized
- Robust: Less sensitive to pitch shifts than raw spectra
- Standard: Widely used in speech/music analysis
MFCC Extraction Pipeline
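A common textbook MFCC pipeline is: window the frame, take the power spectrum, sum it under triangular filters spaced evenly on the mel scale, take the logarithm, and apply a DCT. The Python sketch below follows that outline with a naive DFT so that it stays self-contained. It is not Praat's implementation, whose filter shapes and scaling may differ. The function names and the defaults `n_filters=20` and `n_coeffs=12` are assumptions for illustration.

```python
import cmath
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Hamming-window the frame and return |DFT|^2 for bins 0..N/2 (naive DFT)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)))
                for k, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * b * k / n)
                    for k, x in enumerate(windowed))) ** 2
            for b in range(n // 2 + 1)]

def mel_filterbank_energies(power, sample_rate, n_filters):
    """Sum the power spectrum under triangular filters spaced evenly in mel."""
    n_bins = len(power)
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    # n_filters + 2 edge frequencies, equally spaced on the mel axis
    edges = [mel_to_hz(top * m / (n_filters + 1)) for m in range(n_filters + 2)]
    energies = []
    for f in range(1, n_filters + 1):
        lo, mid, hi = edges[f - 1], edges[f], edges[f + 1]
        total = 0.0
        for b in range(n_bins):
            hz = b * nyquist / (n_bins - 1)
            if lo < hz <= mid:
                total += power[b] * (hz - lo) / (mid - lo)  # rising slope
            elif mid < hz < hi:
                total += power[b] * (hi - hz) / (hi - mid)  # falling slope
        energies.append(total)
    return energies

def mfcc(frame, sample_rate, n_filters=20, n_coeffs=12):
    """Return coefficients c0..c_n_coeffs for one frame of samples."""
    energies = mel_filterbank_energies(power_spectrum(frame), sample_rate, n_filters)
    log_e = [math.log(e + 1e-12) for e in energies]  # small floor avoids log(0)
    # DCT-II of the log filterbank energies
    return [sum(le * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m, le in enumerate(log_e))
            for c in range(n_coeffs + 1)]
```

In this formulation, multiplying the input by a constant adds the same offset to every log filterbank energy, and the DCT places that offset entirely in c0. Coefficients c1 and above are therefore unchanged by a gain change.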
Parameter Effects on Analysis
| Parameter | Effect | Typical Value |
|---|---|---|
| Time_step_(s) | Frame rate (temporal resolution) | 0.01 s (100 Hz) |
| Analysis_frame_length_(s) | Window size (frequency resolution) | 0.03 s (30 ms) |
| Number_of_MFCCs | Dimensionality of feature vector | 8-13 coefficients |
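These parameters also determine how large the self-similarity matrix becomes. The Python sketch below estimates the frame count and the SSM storage for a given duration. It uses a simple framing model (only full frames are counted) and assumes 8 bytes per matrix value. Praat's own framing and storage may give slightly different numbers, and both function names are hypothetical.

```python
def frame_count(duration_s, time_step_s, frame_length_s):
    """Number of full analysis frames that fit in the signal (simple model)."""
    if duration_s < frame_length_s:
        return 0
    # small epsilon guards against floating-point results like 5996.999999
    return int((duration_s - frame_length_s) / time_step_s + 1e-9) + 1

def ssm_megabytes(n_frames, bytes_per_value=8):
    """Storage for a dense n x n matrix, in megabytes (10^6 bytes)."""
    return n_frames * n_frames * bytes_per_value / 1e6

# One minute of audio with the typical values from the table above.
n = frame_count(60.0, 0.01, 0.03)   # 5998 frames
size_mb = ssm_megabytes(n)          # about 287.8 MB
```

Because the matrix is n×n, halving Time_step roughly quadruples the storage and the number of similarity computations.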
Creative Transformation Modes
Mode 1: Standard (Similarity Boost)
📈 Boost Repetitive Sections
Logic: High similarity → high amplitude
Effect: Repetitive parts (chorus, loops) get louder. Novel parts (verses, transitions) get quieter.
Musical use: Emphasize recurring themes, create "ghostly" emphasis on repeated material.
Mode 2: Inverted (Novelty Extractor)
🔄 Boost Novel Sections
Logic: High similarity → low amplitude (inverted)
Effect: Unique moments get emphasized. Repetitive material fades to background.
Musical use: Highlight unique events, solos, transitions. Creates "glitch gating" effect.
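The relationship between Modes 1 and 2 can be sketched in a few lines of Python. This assumes scores are already scaled to [0, 1] and that inversion is a simple `1 - score`. The script may implement it differently, and `apply_mode` is a hypothetical name.

```python
def apply_mode(scores, inverted=False):
    """Map similarity scores in [0, 1] to emphasis values in [0, 1].

    Standard mode keeps the scores as they are (high similarity -> high value).
    Inverted mode flips them (high similarity -> low value).
    """
    if inverted:
        return [1.0 - s for s in scores]
    return list(scores)

scores = [0.9, 0.2, 0.5]
standard = apply_mode(scores)                 # repetitive frames emphasized
novelty = apply_mode(scores, inverted=True)   # novel frames emphasized
```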
Mode 3: Diagonal Recurrence
↗️ Emphasize Sequential Patterns
Logic: Weight diagonal neighbors more heavily
Effect: Emphasizes sequential repetitions (echoes, rhythmic patterns).
Musical use: Bring out rhythmic motifs, sequential developments.
Mode 4: Hard Quantized (3 levels)
🎚️ Three-Level Amplitude Quantization
Logic: Discretize similarity scores into 3 levels
Effect: Creates stepped amplitude changes, like a "tremolo with 3 states".
Musical use: Robotic/quantized effects, rhythmic stuttering.
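A three-level quantizer can be sketched in Python as follows. The cut points (1/3 and 2/3) and the output levels (0, 0.5, 1) are assumptions chosen for illustration. The source does not state which values the script uses, and `quantize_three_levels` is a hypothetical name.

```python
def quantize_three_levels(score, low_cut=1.0 / 3.0, high_cut=2.0 / 3.0):
    """Snap a score in [0, 1] to one of three levels: 0.0, 0.5 or 1.0."""
    if score < low_cut:
        return 0.0
    if score < high_cut:
        return 0.5
    return 1.0

levels = [quantize_three_levels(s) for s in [0.1, 0.5, 0.9]]
```

Because the output jumps between fixed levels, small changes in the score near a cut point produce abrupt amplitude steps, which is what gives the stepped character described above.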
Mode 5: Self-Mosaic (Frame Substitution)
🧩 Replace Frames with Similar Ones
Logic: For each frame, find most similar other frame, copy its audio
Effect: Creates audio "mosaic" where each moment is replaced by its most similar counterpart.
Musical use: Extreme texture transformation, "audio collage" effect.
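The substitution logic can be sketched in Python as below. This is a simplified model under stated assumptions: frames are treated as non-overlapping blocks of `hop` samples, no crossfading is applied, and at least two frames exist. `most_similar_other` and `mosaic` are hypothetical names, not functions of the script.

```python
def most_similar_other(ssm):
    """For each frame i, return the index j != i with the highest S[i][j]."""
    n = len(ssm)
    return [max((j for j in range(n) if j != i), key=lambda j: ssm[i][j])
            for i in range(n)]

def mosaic(samples, hop, ssm):
    """Rebuild the signal by replacing each block with its best-matching block."""
    best = most_similar_other(ssm)
    out = []
    for j in best:
        out.extend(samples[j * hop:(j + 1) * hop])
    return out

# Four frames: 0 and 2 resemble each other, as do 1 and 3.
ssm = [[1.0, 0.2, 0.9, 0.1],
       [0.2, 1.0, 0.3, 0.8],
       [0.9, 0.3, 1.0, 0.4],
       [0.1, 0.8, 0.4, 1.0]]
samples = [0, 0, 1, 1, 2, 2, 3, 3]       # two samples per frame
rebuilt = mosaic(samples, hop=2, ssm=ssm)
```

Excluding `j == i` matters: the diagonal is always the maximum of each row, so without the exclusion every frame would simply be replaced by itself.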
Preset Configurations
Preset 1: Glitch Gating
⚡ Extreme Rhythmic Gating
| Parameter | Value | Effect |
|---|---|---|
| Creative_mode | Inverted (2) | Boost novel sections |
| Time_step | 0.005 s | High temporal resolution |
| Analysis_frame_length | 0.02 s | Short frames for precision |
| Similarity_threshold | 0.6 | High threshold = strict similarity |
| Contrast_power | 6.0 | Extreme contrast |
| Hard_threshold_gate | On | Binary on/off gating |
| Gate_threshold | 0.6 | 60% similarity cutoff |
| High_similarity_boost | 18 dB | Strong boost |
| Low_similarity_attenuation | -36 dB | Strong attenuation |
Result: Aggressive rhythmic gating effect. Works great on drums, percussion, staccato material.
Preset 2: Ghost Remix
👻 Subtle Spectral Enhancement
| Parameter | Value | Effect |
|---|---|---|
| Creative_mode | Standard (1) | Boost similar sections |
| Time_step | 0.008 s | Medium resolution |
| Analysis_frame_length | 0.025 s | Standard frame size |
| Similarity_threshold | 0.4 | Moderate threshold |
| Contrast_power | 3.0 | Gentle contrast |
| Mask_smoothing_frames | 5 | Smooth transitions |
| High_similarity_boost | 15 dB | Moderate boost |
| Low_similarity_attenuation | -20 dB | Moderate attenuation |
Result: Subtle enhancement that brings out repetitive elements. Good for vocals, melodic material.
Preset 3: Brutal Novelty
💥 Maximum Contrast Effects
| Parameter | Value | Effect |
|---|---|---|
| Creative_mode | Inverted (2) | Boost novel sections |
| Time_step | 0.005 s | High resolution |
| Analysis_frame_length | 0.02 s | Short frames |
| Similarity_threshold | 0.3 | Low threshold = more "novelty" |
| Contrast_power | 8.0 | Extreme contrast |
| Hard_threshold_gate | On | Binary gating |
| Gate_threshold | 0.7 | 70% cutoff |
| High_similarity_boost | 24 dB | Very strong boost |
| Low_similarity_attenuation | -48 dB | Near-silence for similar |
Result: Extreme glitch effect that isolates unique moments. Only the most novel material survives.
Preset 4: Spectral Mosaic
🧩 Frame Substitution Effect
| Parameter | Value | Effect |
|---|---|---|
| Creative_mode | Self-Mosaic (5) | Frame substitution |
| Time_step | 0.02 s | Longer steps for coherence |
| Analysis_frame_length | 0.05 s | Long frames |
| Similarity_threshold | 0.4 | Moderate similarity |
Result: Audio "collage" where each moment is replaced by its most similar counterpart. Creates dreamlike, repetitive texture.
Preset 5: Chaotic Tremolo
🌀 Rapid Amplitude Modulation
| Parameter | Value | Effect |
|---|---|---|
| Creative_mode | Hard Quantized (4) | 3-level quantization |
| Time_step | 0.003 s | Very fast updates |
| Analysis_frame_length | 0.015 s | Short analysis |
| Contrast_power | 10.0 | High contrast |
| Hard_threshold_gate | On | Binary decisions |
| Gate_threshold | 0.5 | 50% cutoff |
| Add_chaos | 0.25 | High randomness |
| High_similarity_boost | 20 dB | Strong boost |
| Low_similarity_attenuation | -40 dB | Strong attenuation |
Result: Rapid, chaotic amplitude modulation with 3 distinct levels. Creates nervous, animated texture.
Spatial Processing Modes
🔊 Five Spatialization Strategies
| Mode | Algorithm | Effect | Best For |
|---|---|---|---|
| Mono | Convert to mono, apply single envelope | Centered, focused | Compatibility, mono systems |
| Preserve Stereo | Apply same envelope to both channels | Stereo image preserved | Most stereo material |
| Stereo Wide | Different EQ per channel + envelope | Enhanced width | Creating spatial interest |
| Rotating | Panning modulation + envelope | Circular motion | Experimental, psychedelic |
| Mid-Side | Apply envelope mainly to center | Focus on center | Vocals, lead elements |
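As one illustration of the table, the Mid-Side row can be sketched in Python with the standard mid/side decomposition (mid = (L+R)/2, side = (L-R)/2). The `side_amount` parameter is a hypothetical control for how much of the envelope also reaches the side signal. The source only says the envelope is applied "mainly" to the center.

```python
def mid_side_envelope(left, right, gains, side_amount=0.0):
    """Apply a per-sample gain envelope to the mid signal of a stereo pair.

    side_amount = 0.0 leaves the side signal untouched;
    side_amount = 1.0 applies the full envelope to it as well.
    """
    out_l, out_r = [], []
    for l, r, g in zip(left, right, gains):
        mid = 0.5 * (l + r)
        side = 0.5 * (l - r)
        mid *= g
        side *= 1.0 + side_amount * (g - 1.0)
        out_l.append(mid + side)   # decode back to left
        out_r.append(mid - side)   # decode back to right
    return out_l, out_r
```

With a gain of 1.0 the decode step returns the original left and right samples unchanged, so the effect is transparent wherever the envelope is at unity.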
Stereo Wide Implementation
Rotating Panning Effect
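A rotating pan combined with the gain envelope can be sketched in Python as follows. This assumes a sinusoidal pan position and an equal-power (sine/cosine) pan law. The rotation rate `rate_hz` and the function name `rotating_pan` are hypothetical, and the script's actual modulation shape is not specified in the text.

```python
import math

def rotating_pan(samples, gains, sample_rate, rate_hz=0.5):
    """Pan a mono signal back and forth while applying a gain envelope."""
    out_l, out_r = [], []
    for n, (x, g) in enumerate(zip(samples, gains)):
        # pan position sweeps sinusoidally between 0 (left) and 1 (right)
        pan = 0.5 * (1.0 + math.sin(2.0 * math.pi * rate_hz * n / sample_rate))
        angle = pan * math.pi / 2.0
        out_l.append(x * g * math.cos(angle))  # equal-power pan law
        out_r.append(x * g * math.sin(angle))
    return out_l, out_r
```

With the equal-power law, the squared left and right outputs always sum to the squared input times the gain, so the perceived loudness stays roughly constant as the sound moves.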
Parameters Explained
Analysis Parameters
⏱️ Time_step_(s)
Range: 0.001 - 0.05 seconds (1-50 ms)
Default: 0.01 seconds (10 ms, 100 Hz frame rate)
Effect:
- Small values (1-5 ms): High temporal resolution, smoother envelopes, slower processing
- Medium values (8-15 ms): Good balance, musical timing
- Large values (20-50 ms): Coarse resolution, rhythmic stepping, faster processing
Musical correlation: 10 ms ≈ 128th note at 187.5 BPM (a quarter note lasts 320 ms)
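Relating a time step to a note value is simple arithmetic: a quarter note lasts 60/BPM seconds, and shorter notes divide that duration. A small Python helper (the name `note_duration_s` is hypothetical) makes the conversion explicit, assuming the beat is a quarter note.

```python
def note_duration_s(bpm, note_fraction):
    """Duration in seconds of a note at the given tempo.

    note_fraction is 4 for a quarter note, 8 for an eighth, 16 for a
    sixteenth, and so on. The beat is assumed to be a quarter note.
    """
    quarter = 60.0 / bpm
    return quarter * 4.0 / note_fraction

sixteenth_at_120 = note_duration_s(120.0, 16)   # 0.125 s
step_at_187_5 = note_duration_s(187.5, 128)     # approximately 0.01 s
```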
🎵 Analysis_frame_length_(s)
Range: 0.01 - 0.1 seconds (10-100 ms)
Default: 0.03 seconds (30 ms)
Tradeoff: Temporal vs. frequency resolution
- Short frames (10-20 ms): Good temporal precision, poor frequency resolution
- Medium frames (25-40 ms): Balanced (default)
- Long frames (50-100 ms): Good frequency resolution, poor temporal precision
Rule of thumb: Analysis_frame_length should be about 2-4× Time_step for reasonable overlap.
Similarity & Masking Parameters
| Parameter | Range | Default | Effect |
|---|---|---|---|
| Similarity_threshold | 0.1-0.9 | 0.5 | Minimum similarity for a frame pair to count toward the average score |
| Contrast_power | 1.0-20.0 | 4.0 | Nonlinear mapping exponent |
| High_similarity_boost_(dB) | 0-48 dB | 12 dB | Gain for high similarity frames |
| Low_similarity_attenuation_(dB) | -60 to 0 dB | -24 dB | Gain for low similarity frames |
| Mask_smoothing_frames | 0-20 | 3 | Moving average smoothing of scores |
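One plausible reading of how these parameters interact is sketched below in Python: average the above-threshold similarities per frame, smooth the scores, raise them to `Contrast_power`, and interpolate between the two dB limits. The exact formulas are assumptions, since the table gives only a one-line description per parameter, and all function names are hypothetical.

```python
def frame_scores(ssm, threshold):
    """Per-frame mean of the off-diagonal similarities that reach the threshold."""
    scores = []
    for i, row in enumerate(ssm):
        kept = [s for j, s in enumerate(row) if j != i and s >= threshold]
        scores.append(sum(kept) / len(kept) if kept else 0.0)
    return scores

def smooth(values, width):
    """Centered moving average over `width` frames (width <= 1: unchanged)."""
    if width <= 1:
        return list(values)
    half = width // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out

def score_to_gain(score, contrast_power, boost_db, atten_db):
    """Map a score in [0, 1] to a linear gain between the two dB limits."""
    shaped = min(max(score, 0.0), 1.0) ** contrast_power  # nonlinear contrast
    gain_db = atten_db + shaped * (boost_db - atten_db)
    return 10.0 ** (gain_db / 20.0)  # dB to linear amplitude

# Defaults from the table: contrast 4.0, +12 dB boost, -24 dB attenuation.
loud = score_to_gain(1.0, 4.0, 12.0, -24.0)    # 10 ** 0.6, about 3.98
quiet = score_to_gain(0.0, 4.0, 12.0, -24.0)   # 10 ** -1.2, about 0.063
```

Under this mapping, a larger `Contrast_power` pushes mid-range scores toward the attenuation limit, so only frames with scores near 1 receive the full boost.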
Experimental Controls
Visualization Guide
📊 Understanding the Graphical Output
Six visualization panels:
| Panel | Content | Purpose |
|---|---|---|
| Title | Script name, filename, mode | Context |
| Original Waveform | Input audio (gray) | Reference |
| Processed Waveform | Output audio (blue) | Show effect |
| Self-Similarity Matrix | SSM heatmap (gray scale) | Show similarity patterns |
| Similarity Scores | Score curve over time (purple) | Show computed scores |
| Statistics | Parameters and processing info | Documentation |
Reading the SSM Heatmap
Axes: Both X and Y represent time (frame number)
Color coding: Black = similar (SSM value ~1), White = dissimilar (SSM value ~0)
Patterns to look for:
- Main diagonal: Always at maximum similarity (each frame is identical to itself)
- Off-diagonal lines of high similarity: Repetitions (e.g., chorus repeating)
- Blocks of high similarity: Sections with internal similarity
- Low-similarity areas: Unique, non-repeating material
- Symmetry: The matrix should be symmetric across the diagonal
Sparse sampling: Only every 20th frame is drawn (for speed). The complete matrix is still computed, just not fully displayed.
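Subsampling a matrix for display is straightforward. The Python sketch below keeps every 20th row and column, assuming the matrix is stored as a list of rows (`subsample_matrix` is a hypothetical name).

```python
def subsample_matrix(matrix, step=20):
    """Keep every `step`-th row and column of a square matrix for display."""
    return [row[::step] for row in matrix[::step]]

# A 100 x 100 toy matrix shrinks to 5 x 5 (400 times fewer cells to draw).
full = [[i * 100 + j for j in range(100)] for i in range(100)]
preview = subsample_matrix(full)
```

Features shorter than the sampling step, such as a brief repetition lasting only a few frames, can disappear from a preview of this kind even though they are present in the full matrix.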
Interpreting Similarity Scores
X-axis: Frame number (time)
Y-axis: Similarity score (0-1)
What high scores mean: Frame is similar to many other frames (repetitive material)
What low scores mean: Frame is unique (novel material)
Patterns:
- Constant high: Very repetitive audio (drone, loop)
- Constant low: Constantly changing (improvisation, noise)
- Periodic peaks: Repeated sections (verse/chorus structure)
- Sharp dips: Unique events (transitions, solo moments)
Creative mode effects: In Inverted mode the curve shows the opposite pattern.
Applications & Creative Uses
Music Production & Sound Design
🎵 Enhancing Repetitive Elements
Use case: Bringing out hidden patterns in ambient music
- Load ambient drone or pad sound
- Use Ghost Remix preset (Standard mode)
- Similarity_threshold = 0.3 (capture subtle repetitions)
- Result: Subtle amplitude modulation that follows internal patterns
- Mix 30% with original for enhanced texture
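The final mixing step in the list above can be sketched in Python as a linear wet/dry blend. This reads "30%" as the share of the processed signal, which is an assumption, and `mix_wet_dry` is a hypothetical name.

```python
def mix_wet_dry(dry, wet, wet_amount=0.3):
    """Blend the processed (wet) signal with the original (dry) signal."""
    return [(1.0 - wet_amount) * d + wet_amount * w for d, w in zip(dry, wet)]

blended = mix_wet_dry([1.0, 0.0], [0.0, 1.0], wet_amount=0.3)
```

Both signals need the same length and sampling rate for a sample-by-sample blend. If a downsampling speed mode changes the output rate, the processed signal would have to be resampled first.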
Use case: Creating rhythmic gating from non-rhythmic material
- Load vocal phrase or sustained instrument
- Use Glitch Gating preset
- Adjust Time_step to desired rhythm (e.g., 0.125 s for 16th notes at 120 BPM)
- Result: Rhythmic amplitude gating based on phonetic/spectral changes
Audio Restoration & Enhancement
🔍 Noise Reduction via Similarity
Concept: Noise is dissimilar to everything. Signal has self-similarity.
- Load noisy recording (speech with background noise)
- Use Standard mode with high Similarity_threshold (0.7)
- Set High_similarity_boost = 0 dB, Low_similarity_attenuation = -30 dB
- Result: Similar (speech) parts preserved, dissimilar (noise) attenuated
- Limitation: Works best when noise is spectrally distinct from signal
Vocal breath reduction: Breaths are spectrally closer to silence/noise than to voiced speech, so they can be attenuated relative to the similar (voiced) sections that Standard mode boosts.
Compositional Analysis
📐 Visualizing Musical Structure
Analytical workflow:
- Load complete musical piece
- Use Full Quality speed mode
- Set Time_step = 0.05 s (coarse for overview)
- Enable visualization
- Examine SSM for structural patterns:
- Clear blocks: Distinct sections (verse/chorus)
- Diagonal lines: Recurring motifs
- Checkerboard: Alternating patterns
- Dark areas: Developmental sections
Research applications: Form analysis, repetition detection, style analysis.
Experimental Audio Processing
🧪 Chain Processing for Extreme Effects
Creative potential: Endless variations by combining modes and presets.