MFCC Sound Processor — User Guide

Spectral feature-based audio transformation using Mel-Frequency Cepstral Coefficients (MFCCs): 5 algorithms for pitch, duration, and amplitude control derived from spectral analysis.

Author: Based on Shai Cohen's AudioTools
Affiliation: Department of Music, Bar-Ilan University, Israel
Version: 0.1 (2025)
License: MIT License
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

What this does

The MFCC Sound Processor implements spectral feature-based audio transformation — using Mel-Frequency Cepstral Coefficients (MFCCs) to analyze and manipulate audio in perceptually meaningful ways. By extracting MFCCs (compact representations of spectral shape), the script creates control signals that drive pitch, duration, and amplitude modifications. Unlike traditional signal processing, this approach transforms audio based on its spectral characteristics rather than direct waveform manipulation, resulting in organic, perceptually coherent transformations that maintain the original sound's identity while creating dramatic variations.

What are MFCCs? Mel-Frequency Cepstral Coefficients are compact representations of spectral shape tuned to human hearing. Originally developed for speech recognition, they approximate aspects of the human auditory system in three steps:

  1. Mel scaling: Frequencies are warped to approximate human pitch perception (roughly linear below 1 kHz, logarithmic above).
  2. Cepstral analysis: The spectral envelope is separated from fine structure by transforming the log spectrum (a discrete cosine transform in the MFCC case).
  3. Coefficient interpretation: C1 ≈ overall spectral energy, C2 ≈ spectral tilt (brightness), C3-Cn ≈ spectral shape details.

In this script, C1 controls pitch, C2 controls amplitude, and C3 controls duration. The resulting transformations tend to feel "natural" because they follow perceptually relevant features rather than arbitrary mathematical functions.

Technical Implementation:

  1. MFCC extraction: Sound → MFCC object with 12 coefficients, 15 ms windows, 5 ms steps.
  2. Coefficient selection: The first 3 coefficients (C1-C3) are extracted for basic control; higher coefficients are used for complexity analysis.
  3. Normalization: Each coefficient is scaled to the 0–1 range across all frames.
  4. Algorithm application: Each algorithm maps normalized coefficients to Praat Manipulation parameters (pitch tier, duration tier, amplitude scaling).
  5. Resynthesis: Manipulation → Sound via overlap-add synthesis.
  6. Preset logic: Pre-configured parameter mappings for specific transformation types.

The process maintains temporal coherence while creating spectrally informed transformations.

Quick start

  1. In Praat, select exactly one Sound object (mono or stereo).
  2. Run script… mfcc_sound_processor.praat.
  3. Choose a Preset from the categorized list (the easiest way to start).
  4. For manual control, select Custom and choose an algorithm.
  5. For Algorithm 1, set Pitch_range and Duration_range (0.3-0.6 is typical).
  6. For other algorithms, adjust the relevant parameters.
  7. Enable Play_result to hear the output immediately.
  8. Click OK. MFCC analysis runs first, followed by the transformation.
Quick tip: Start with presets; they are tuned for specific effects.
  • Subtle variations: "Direct: Subtle" or "Scramble: Subtle"
  • Dramatic transformations: "Direct: Wide Range" or "Complexity: Extreme"
  • Rhythmic effects: "Freeze: Dense"
  • Reversed spectral character: "Reverse: Classic"

The first three MFCC coefficients control C1 = pitch (energy), C2 = amplitude (spectral tilt), and C3 = duration (spectral shape). Processing time scales with file length: 1 minute of audio takes roughly 10-20 seconds. The output preserves the original timing structure unless duration manipulation is applied.
Important: SOURCE DEPENDENT. Results vary significantly with input material.
  • Source material: Harmonically rich sounds (speech, music) work best; noise or simple tones may produce less interesting results.
  • MFCC parameters: Fixed at 12 coefficients, 15 ms windows, 5 ms steps, chosen for speech/music.
  • Processing stages: (1) MFCC extraction (slowest), (2) coefficient analysis, (3) Manipulation building, (4) resynthesis.
  • Memory usage: Large files (> 5 minutes) may require significant RAM for the MFCC matrix.
  • Stereo handling: Both channels are processed identically, based on MFCCs of the entire signal.
  • Real-time limitation: Not suitable for live processing; designed for offline transformation.
  • Algorithm-specific effects: Some algorithms affect only certain parameters (e.g., Algorithm 3 affects only duration).

MFCC Theory & Coefficient Interpretation

Mel-Frequency Cepstral Coefficients Explained

🎵 Perceptual Audio Representation

Mel Scale: Frequency warping approximating human pitch perception

Cepstral Analysis: Separating source (excitation) from filter (vocal tract)

Coefficient Meanings: C1=energy, C2=spectral tilt, C3+=spectral shape details

Window/Step: 15ms windows capture spectral snapshots, 5ms steps for smoothness

12 Coefficients: Balance between detail and computational efficiency

Coefficient Interpretation for Audio Transformation

Mapping MFCCs to perceptual parameters:

# MFCC COEFFICIENT INTERPRETATION
# Based on speech processing literature and perceptual studies

C1: LOG ENERGY COEFFICIENT
  - Represents overall signal energy in frame
  - High values = loud/energetic moments
  - Low values = quiet/soft moments
  - MAPPING: Controls pitch (higher energy → higher pitch)
  - Normalized range: 0 (quietest frame) to 1 (loudest frame)

C2: SPECTRAL TILT COEFFICIENT
  - Represents balance between low and high frequencies
  - Positive values = brighter spectrum (high-frequency emphasis)
  - Negative values = darker spectrum (low-frequency emphasis)
  - MAPPING: Controls amplitude (brighter → louder, darker → softer)
  - Normalized range: 0 (darkest) to 1 (brightest)

C3: SPECTRAL SHAPE COEFFICIENT
  - Represents general spectral shape/formant structure
  - Complex relationship to specific spectral features
  - MAPPING: Controls duration (complex shape → longer duration)
  - Normalized range: 0 (simplest shape) to 1 (most complex shape)

C4-C12: HIGHER-ORDER COEFFICIENTS
  - Represent finer spectral details
  - Used for complexity analysis (Algorithm 3)
  - Used for spectral distance calculations (Algorithm 4)
  - Used for trajectory analysis (Algorithm 5)

Processing Pipeline

Complete Transformation Flow

Step-by-step signal processing:

# 1. INPUT PREPARATION
Sound → selected, duration measured, sampling rate stored

# 2. MFCC EXTRACTION (Fixed Parameters)
To MFCC: 12 coefficients, 0.015 window, 0.005 step, 100 mel first filter, 100 mel filter distance
Result: MFCC object with 12×N matrix (N = number of frames)

# 3. COEFFICIENT EXTRACTION & NORMALIZATION
FOR each frame i (1 to numFrames):
    C1[i] = Get value: 1, i
    C2[i] = Get value: 2, i
    C3[i] = Get value: 3, i
    C4-C12[i] = Get values: 4-12, i

Normalize each coefficient across frames:
    C_norm[i] = (C[i] - min(C)) / (max(C) - min(C))

# 4. ALGORITHM-SPECIFIC PROCESSING
Based on selected algorithm:
  - Map normalized coefficients to control parameters
  - Build Praat Manipulation object
  - Create pitch tier, duration tier, amplitude scaling

# 5. RESYNTHESIS
Manipulation → Get resynthesis (overlap-add) → Output Sound
Amplitude scaling applied via Formula

# 6. OUTPUT
Renamed as: originalName_AlgorithmName
Normalized to prevent clipping
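The normalization in step 3 can be sketched in Python as below. This is an illustration, not the script's Praat code: the function name and the sample values are made up, and the handling of a constant track (max equal to min) is an assumption, since the guide does not say what the script does in that case.

```python
def normalize_across_frames(values):
    """Min-max scale one coefficient track to the 0-1 range across all frames."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant track maps to 0.5; the guide does not say
        # what the script does when max(C) equals min(C).
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


c1 = [-120.0, -80.0, -100.0, -60.0]  # made-up C1 values, one per frame
c1_norm = normalize_across_frames(c1)
```

The smallest value in the track maps to 0 and the largest to 1, so the result depends on the whole file rather than on any single frame.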

Temporal Frame Structure

Window, Step, and Frame Alignment

Frame-based processing parameters:

# TEMPORAL PARAMETERS (Fixed in Script)
window_length = 0.015       # 15 milliseconds
time_step = 0.005           # 5 milliseconds (1/3 of the window, ≈67% overlap)
first_filter_freq = 100     # mel
filter_distance = 100       # mel between filters
max_freq = 0                # 0 = Nyquist (half sampling rate)

# FRAME CALCULATION
numFrames = round(duration / time_step) + 1

# TIME MAPPING
frame_time[i] = (i - 1) × time_step + window_length/2
# Example: Frame 1 at 7.5ms, Frame 2 at 12.5ms, etc.

# WHY THESE VALUES?
- 15ms window: Captures 1-2 pitch periods for typical speech/music
- 5ms step: Smooth temporal evolution, ≈67% overlap for stability
- 100 mel first filter: Focus on perceptually relevant frequencies
- 12 coefficients: Sufficient for spectral shape without overfitting
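The frame arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and the frame count is the guide's estimate; the number of frames Praat actually produces may differ slightly near the edges of the sound.

```python
WINDOW_LENGTH = 0.015  # seconds
TIME_STEP = 0.005      # seconds


def frame_time(i):
    """Centre time in seconds of 1-based frame i, per the guide's formula."""
    return (i - 1) * TIME_STEP + WINDOW_LENGTH / 2


def approx_num_frames(duration):
    """Frame-count estimate from the guide; Praat's exact count may differ."""
    return round(duration / TIME_STEP) + 1


# Fraction of each window shared with the next one.
overlap = 1 - TIME_STEP / WINDOW_LENGTH
```

With these values the step is one third of the window, so consecutive windows share about two thirds of their samples.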

Algorithm Details

Algorithm 1: Direct Control

🎛️ Straightforward Coefficient Mapping

Concept: Direct mapping C1→pitch, C2→amplitude, C3→duration

Formula: pitch = map(C1_norm, p_min, p_max), amplitude = map(C2_norm, a_min, a_max), duration = map(C3_norm, d_min, d_max)

Character: Natural spectral-to-parametric transformation

Use: Basic spectral-driven effects, subtle variations

Direct Control Mathematics

# ALGORITHM 1: DIRECT CONTROL MAPPING

# Extract and normalize C1, C2, C3 across frames
C1_norm[i] = (C1[i] - min(C1)) / (max(C1) - min(C1))
C2_norm[i] = (C2[i] - min(C2)) / (max(C2) - min(C2))
C3_norm[i] = (C3[i] - min(C3)) / (max(C3) - min(C3))

# Map to parameter ranges (user-defined or preset)
pitch_factor[i] = pitch_min + C1_norm[i] × (pitch_max - pitch_min)
# Example: pitch_range=0.6 → pitch_min=0.7, pitch_max=1.3
# pitch_factor=0.7-1.3 × original pitch

amplitude_factor[i] = amp_min + C2_norm[i] × (amp_max - amp_min)
# Example: amp_min=0.5, amp_max=1.0 → 50-100% amplitude

duration_factor[i] = dur_min + C3_norm[i] × (dur_max - dur_min)
# Example: dur_range=0.3 → dur_min=0.85, dur_max=1.15

# Apply via Praat Manipulation
Add pitch point at frame_time[i]: pitch_factor[i] × 100Hz
Add duration point at frame_time[i]: duration_factor[i]
Apply amplitude scaling: sound × amplitude_factor[frame_index]
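As a rough Python illustration of the mapping above, the sketch below turns a range setting into bounds and maps normalized coefficients onto them. It assumes the range is a total width centred on 1.0, as in the example pitch_range=0.6 → 0.7 to 1.3; the function names and input values are hypothetical.

```python
def range_to_bounds(total_range):
    """Split a total range into bounds centred on 1.0 (0.6 -> 0.7 and 1.3)."""
    half = total_range / 2
    return 1.0 - half, 1.0 + half


def map_linear(c_norm, lo, hi):
    """Map a normalized coefficient (0-1) onto the interval [lo, hi]."""
    return lo + c_norm * (hi - lo)


pitch_min, pitch_max = range_to_bounds(0.6)
c1_norm = [0.0, 0.25, 1.0]  # made-up normalized C1 values
pitch_factors = [map_linear(c, pitch_min, pitch_max) for c in c1_norm]
```

The same two helpers cover the duration mapping: range_to_bounds(0.3) gives 0.85 and 1.15.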

Algorithm 2: Reverse Control

🔁 Temporal Spectral Reversal

Concept: Reverse MFCC trajectory: last frame controls first output, etc.

Formula: C_reverse[i] = C_original[numFrames - i + 1]

Character: "Inside-out" spectral evolution, reversed spectral character

Use: Abstract transformations, reversed spectral envelopes
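
The index reversal in the formula can be sketched in Python as follows. The guide's formula is 1-based; the sketch translates it to Python's 0-based indexing. The function name and sample values are illustrative only.

```python
def reverse_track(values):
    """Return the coefficient track in reverse frame order."""
    n = len(values)
    # 1-based formula from the guide: C_reverse[i] = C_original[n - i + 1].
    # With 0-based Python indices that becomes values[n - 1 - i].
    return [values[n - 1 - i] for i in range(n)]


c1 = [0.1, 0.4, 0.9]  # made-up normalized C1 values
c1_reversed = reverse_track(c1)
```

Only the control data is reversed; the audio itself still plays forward, which is why the result differs from simply reversing the sound.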

Algorithm 3: Complexity Time-Stretch

⏱️ Spectral Complexity-Driven Duration

Concept: Stretch complex spectral moments, compress simple ones

Formula: complexity = sqrt(C4² + C5² + ... + C12²)

Character: Elastic time based on spectral richness

Use: Rhythmic transformations, emphasis on complex moments

Complexity Calculation

# ALGORITHM 3: COMPLEXITY TIME-STRETCH

# Calculate spectral complexity from higher coefficients
FOR each frame i:
    sum_squares = 0
    FOR coefficient j from 4 to min(12, numCoeffs):
        sum_squares += mfcc_data[i, j]²
    complexity[i] = sqrt(sum_squares)

# Normalize complexity
comp_norm[i] = (complexity[i] - min(complexity)) / (max(complexity) - min(complexity))

# Apply non-linear stretch mapping (threshold = Complexity_threshold parameter)
IF comp_norm[i] > threshold:
    # Complex frames: stretch proportionally to complexity
    stretch_factor = 1 + ((comp_norm[i] - threshold) / (1 - threshold)) × (max_stretch - 1)
ELSE:
    # Simple frames: compress toward minimum
    stretch_factor = min_stretch + (comp_norm[i] / threshold) × (1 - min_stretch)

# Add to duration tier
Add point at frame_time[i]: stretch_factor

# Result: Complex spectral moments elongated, simple moments shortened
# Creates rhythmic emphasis on harmonically rich sections
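The complexity measure and the piecewise stretch mapping can be sketched in Python as below. The default values are taken from the "Complexity: Moderate" preset (threshold 0.5, stretch 0.7-2.0×); the function names and the frame layout (a list with C1 at index 0) are assumptions made for the example.

```python
import math


def complexity(frame):
    """Euclidean norm of C4..C12 for one frame (frame[0] holds C1)."""
    return math.sqrt(sum(c * c for c in frame[3:12]))


def stretch_factor(comp_norm, threshold=0.5, min_stretch=0.7, max_stretch=2.0):
    """Piecewise-linear duration factor from normalized complexity (0-1)."""
    if comp_norm > threshold:
        # Complex frames: 1.0 at the threshold, max_stretch at comp_norm = 1.
        return 1 + (comp_norm - threshold) / (1 - threshold) * (max_stretch - 1)
    # Simple frames: min_stretch at comp_norm = 0, 1.0 at the threshold.
    return min_stretch + (comp_norm / threshold) * (1 - min_stretch)
```

Both branches meet at a factor of 1.0 when the normalized complexity equals the threshold, so the mapping is continuous and frames near the threshold are left almost unchanged.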

Algorithm 4: Freeze Spectral Moments

⏸️ Similarity-Based Time Freezing

Concept: Freeze time when spectral content is stable/similar

Formula: distance[i] = sqrt(∑_j (C[i, j] - C[i-1, j])²), summed over coefficients j

Character: Glitch-like freezes, stuttering on stable sounds

Use: Glitch effects, stutter edits, rhythmic freezing

Algorithm 5: Trajectory Scramble

🎲 Windowed Random Coefficient Reordering

Concept: Scramble MFCCs within temporal windows

Formula: C_scrambled[i] = C_original[random in window(i ± N/2)]

Character: Chaotic yet locally coherent transformations

Use: Experimental textures, granular-like effects

Trajectory Scramble Algorithm

# ALGORITHM 5: TRAJECTORY SCRAMBLE

FOR each frame i:
    window_radius = scramble_window / 2
    window_start = max(1, i - window_radius)
    window_end = min(numFrames, i + window_radius)

    # Random selection within window
    random_index = random integer between window_start and window_end

    # Copy MFCC coefficients from random frame
    FOR coefficient j from 1 to 3:
        scrambled_data[i, j] = mfcc_data[random_index, j]

# Normalize scrambled coefficients
FOR each coefficient j:
    min_val = min(scrambled_data[*, j])
    max_val = max(scrambled_data[*, j])
    FOR each frame i:
        scaled[i, j] = (scrambled_data[i, j] - min_val) / (max_val - min_val)

# Map to parameters
pitch_factor = 0.7 + scaled[i, 1] × 0.6       # 0.7-1.3 range
duration_factor = 0.8 + scaled[i, 2] × 0.4    # 0.8-1.2 range

# Result: Local spectral scrambling creates chaotic yet coherent effect
# Window size controls locality: small=subtle, large=wild
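A minimal Python sketch of the windowed random selection is shown below. It handles a single coefficient track and uses integer division for the window radius, which is an assumption (the pseudocode writes scramble_window / 2 without saying how odd sizes are rounded). The function name, seed, and input values are illustrative.

```python
import random


def scramble_track(values, window, rng):
    """For each frame, pick a random frame within +/- window // 2 frames."""
    n = len(values)
    radius = window // 2
    picked = []
    for i in range(n):
        start = max(0, i - radius)
        end = min(n - 1, i + radius)
        picked.append(rng.randint(start, end))  # randint includes both ends
    return [values[j] for j in picked], picked


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
track = [float(k) for k in range(20)]  # made-up coefficient values
scrambled, picked = scramble_track(track, window=10, rng=rng)
```

Returning the picked indices makes it possible to reuse the same random frame for all three coefficients, as the pseudocode does.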

Presets Gallery

🎯 Direct Control Presets

Direct: Subtle — Gentle variations: pitch ±10%, duration ±5%, amplitude 80-100%

Direct: Wide Range — Dramatic: pitch ±50%, duration ±30%, amplitude 30-100%

Direct: Pitch Focus — Pitch-focused: pitch ±50%, minimal duration/amplitude changes

🔁 Reverse Control Presets

Reverse: Classic — Standard spectral reversal, moderate parameter mapping

Reverse: Dramatic — Enhanced reversal with wider parameter ranges

⏱️ Complexity Stretch Presets

Complexity: Moderate — Balanced: threshold=0.5, stretch=0.7-2.0×

Complexity: Extreme — Exaggerated: threshold=0.4, stretch=0.5-4.0×

⏸️ Freeze Moments Presets

Freeze: Sparse — Occasional freezes: duration=0.15s, threshold=0.2

Freeze: Dense — Frequent freezes: duration=0.2s, threshold=0.4

🎲 Trajectory Scramble Presets

Scramble: Subtle — Local scrambling: window=5 frames (25ms context)

Scramble: Wild — Global scrambling: window=30 frames (150ms context)

Parameters

Preset Selection

Parameter | Type | Default | Description
Preset | option | Custom | 12 curated presets for instant effects

Manual Algorithm Selection

Parameter | Type | Default | Description
Algorithm | option | Direct Control | 5 algorithms with distinct transformation logic

Direct Control Parameters (Algorithm 1 only)

Parameter | Type | Default | Range | Description
Pitch_range | real | 0.6 | 0.1-2.0 | Total range centered on 1.0 (0.6 = 0.7 to 1.3 × original pitch)
Duration_range | real | 0.3 | 0.1-1.0 | Total range centered on 1.0 (0.3 = 0.85 to 1.15 × original duration)

Other Algorithm Parameters

Parameter | Type | Default | Range | Algorithm | Description
Complexity_threshold | positive | 0.5 | 0.1-0.9 | 3 | Normalized complexity value above which frames are stretched
Max_stretch_factor | positive | 2.0 | 1.1-10.0 | 3 | Maximum duration stretch for complex frames
Freeze_duration_(s) | positive | 0.2 | 0.05-1.0 | 4 | Duration of each freeze moment in seconds
Scramble_window_(frames) | positive | 10 | 2-50 | 5 | Window size for trajectory scrambling in frames

Output Control

Parameter | Type | Default | Description
Play_result | boolean | 1 (yes) | Auto-play processed sound

Fixed MFCC Parameters (Not Adjustable)

Parameter | Value | Description
Number of coefficients | 12 | MFCCs extracted per frame
Window length | 0.015 s (15 ms) | Analysis window duration
Time step | 0.005 s (5 ms) | Frame advancement (≈67% overlap)
First filter frequency | 100 mel | Lowest filter center frequency
Filter distance | 100 mel | Spacing between filter center frequencies
Maximum frequency | 0 (Nyquist) | Highest frequency analyzed

Applications

Vocal Transformation & Processing

Use case: Creating spectral-based vocal effects, pitch variations, expressive processing

Recommended algorithms: Direct Control (subtle), Complexity Stretch (rhythmic), Freeze Moments (glitch)

Presets: "Direct: Subtle" for natural variations, "Freeze: Dense" for stutter effects

Musical Composition & Production

Use case: Generating variations, creating evolving textures, rhythmic manipulation

Recommended algorithms: Complexity Stretch (rhythmic interest), Trajectory Scramble (textural)

Sound Design for Media

Use case: Creating unique sound effects, transforming source material

Recommended algorithms: Reverse Control (abstract), Trajectory Scramble (experimental)

Experimental & Electroacoustic Music

Use case: Spectral manipulation, abstract transformations, texture creation

Recommended algorithms: All algorithms with extreme settings

Example: Field recordings processed with Algorithm 5 (wild scramble) for granular textures

Practical Workflow Examples

🎤 Expressive Vocal Processing (Music Production)

Goal: Add spectral-driven expression to vocal tracks

Settings:

  • Algorithm: Direct Control
  • Preset: Direct: Subtle
  • Pitch_range: 0.4 (±20% variation)
  • Duration_range: 0.2 (±10% timing variation)
  • Mix: 50-70% wet (blend with original)

Result: Natural-sounding vocal expression driven by spectral energy
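
The wet/dry blend listed under Settings happens after the script has run, typically in a DAW. As a rough Python illustration of the arithmetic only, the sketch below blends two sample lists; the function name is hypothetical, and truncating to the shorter signal is an assumption. Sample-by-sample blending only lines up when the processed sound keeps the original timing, so duration changes will smear the result.

```python
def mix_wet_dry(dry, wet, wet_amount=0.6):
    """Blend two sample sequences; the longer one is truncated to the shorter."""
    n = min(len(dry), len(wet))
    return [(1 - wet_amount) * dry[k] + wet_amount * wet[k] for k in range(n)]


original = [1.0, 1.0, 1.0]    # made-up dry samples
processed = [0.0, 0.0]        # made-up wet samples
blend = mix_wet_dry(original, processed, wet_amount=0.5)
```

A wet_amount of 0.5 to 0.7 corresponds to the 50-70% wet setting suggested above.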

🎵 Rhythmic Drum Processing (Electronic Music)

Goal: Create complexity-based rhythmic variations

Settings:

  • Algorithm: Complexity Time-Stretch
  • Preset: Complexity: Moderate
  • Source: Drum loop or percussion track
  • Post-process: Compression to maintain consistent level

Result: Rhythmic pattern with stretched complex moments (cymbals, fills)

🎬 Abstract Sound Design (Film/Game)

Goal: Transform ordinary sounds into unusual textures

Settings:

  • Algorithm: Trajectory Scramble
  • Preset: Scramble: Wild
  • Scramble_window: 30 frames (150ms context)
  • Source: Environmental sounds, mechanical noises

Result: Chaotic yet locally coherent abstract textures

Advanced Techniques

Multi-algorithm processing chains:
  • Complexity → Freeze: Stretch complex sections, then freeze stable moments
  • Direct → Scramble: Apply spectral control, then scramble trajectory
  • Reverse → Direct: Create inside-out spectral character with natural mapping
  • Parallel processing: Run different algorithms, mix results

To chain algorithms, process the sound with one algorithm, rename the output, then process it again with a different algorithm.

Source material optimization:
  • Speech: Clear MFCC patterns, works well with all algorithms
  • Music: Harmonic richness creates interesting complexity patterns
  • Percussion: Transient-rich, good for freeze and complexity algorithms
  • Ambient sounds: Smooth spectra work well with direct control
  • Noise: Limited MFCC variation, less dramatic results

Different source materials highlight different algorithm characteristics

Troubleshooting Common Issues

Problem: Processing very slow or memory intensive
Cause: Long source file, high sampling rate
Solution: Use shorter excerpts (1-2 minutes max), resample to lower rate if possible

Problem: Little to no audible effect
Cause: Source with limited spectral variation, or subtle preset
Solution: Try more dramatic presets, use harmonically rich source material

Problem: Artifacts or distortion in output
Cause: Extreme parameter settings, or source with sharp transients
Solution: Reduce parameter ranges, try different algorithm, pre-smooth source

Problem: Output timing feels "wrong" or unnatural
Cause: Duration manipulation too extreme, or algorithm mismatch with material
Solution: Reduce duration_range (Algorithm 1) or max_stretch_factor (Algorithm 3)

Technical Reference

MFCC Extraction Details

Praat's MFCC Implementation

Underlying algorithm parameters:

# PRAAT MFCC EXTRACTION PARAMETERS (Fixed)
number_of_coefficients = 12
window_length = 0.015            # seconds
time_step = 0.005                # seconds
first_filter_frequency = 100     # mel
filter_distance = 100            # mel
maximum_frequency = 0            # 0 = Nyquist frequency

# IMPLEMENTATION STEPS (typical MFCC pipeline; Praat's internals may differ in detail):
1. PRE-EMPHASIS: High-frequency boost (1 - 0.97z⁻¹ filter)
2. WINDOWING: Hamming window applied to each frame
3. FFT: 512-point FFT (for 44.1kHz: ≈ 86Hz frequency resolution)
4. MEL FILTERBANK: 26 triangular filters spaced on mel scale
   mel(f) = 2595 × log₁₀(1 + f/700)
5. LOG COMPRESSION: log₁₀ of filterbank energies
6. DCT: Discrete Cosine Transform to decorrelate → MFCCs
7. LIFTERING: Optional, not applied in this script

# RESULTING COEFFICIENTS:
C0: log energy (often discarded in speech recognition)
C1-C12: Cepstral coefficients representing spectral envelope
Higher coefficients represent finer spectral details
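The mel formula quoted in step 4 can be tried out with the Python sketch below. It implements that formula and its inverse only; other mel definitions exist, and whether Praat uses this exact one internally is not confirmed here. The function names are illustrative.

```python
import math


def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz, using the formula quoted above."""
    return 2595 * math.log10(1 + f_hz / 700)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700 * (10 ** (mel / 2595) - 1)
```

Under this formula 1000 Hz lands very close to 1000 mel, and equal steps in mel correspond to progressively wider steps in Hz at higher frequencies.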

Spectral Distance Calculation (Algorithm 4)

Euclidean Distance in MFCC Space

Measuring spectral similarity between frames:

# ALGORITHM 4: SPECTRAL DISTANCE CALCULATION

# For frames i and i-1:
distance = 0
FOR coefficient j from 1 to min(6, numCoeffs):
    diff = mfcc_data[i, j] - mfcc_data[i-1, j]
    distance += diff × diff
spectral_distance[i] = sqrt(distance)

# NORMALIZATION ACROSS FRAMES:
max_distance = maximum(spectral_distance[2..numFrames])
normalized_distance[i] = spectral_distance[i] / max_distance

# FREEZE DECISION:
IF normalized_distance[i] < similarity_threshold:
    # Spectral content similar → FREEZE this moment
    Add duration tier points to create freeze:
      - Before freeze: duration factor = 1.0
      - Start freeze: duration factor = 5.0 (time slows 5×)
      - During freeze: duration factor = 5.0
      - End freeze: duration factor = 1.0

# Interpretation:
- Low distance = similar spectral shape = stable sound
- High distance = different spectral shape = changing sound
- Freezing occurs during stable, unchanging spectral moments
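The distance and freeze decision can be sketched in Python as below. The sketch only identifies which frames qualify for a freeze; it does not build the duration tier. The default threshold of 0.2 is taken from the "Freeze: Sparse" preset, and the function names, frame values, and handling of an all-identical input are assumptions.

```python
import math


def frame_distance(prev, cur, num_coeffs=6):
    """Euclidean distance between two frames over the first num_coeffs coefficients."""
    pairs = zip(prev[:num_coeffs], cur[:num_coeffs])
    return math.sqrt(sum((c - p) ** 2 for p, c in pairs))


def freeze_candidates(frames, similarity_threshold=0.2):
    """0-based indices of frames whose normalized distance is below the threshold."""
    dists = [frame_distance(p, c) for p, c in zip(frames, frames[1:])]
    max_dist = max(dists)
    if max_dist == 0:
        # Assumption: identical frames throughout count as stable everywhere.
        return list(range(1, len(frames)))
    return [k + 1 for k, d in enumerate(dists) if d / max_dist < similarity_threshold]


# Made-up frames with two coefficients each: two stable pairs and one big jump.
frames = [[0.0, 0.0], [0.1, 0.0], [3.0, 4.0], [3.0, 4.1]]
stable = freeze_candidates(frames)
```

In this example the jump between the second and third frames sets the maximum distance, so only the frames on either side of it, which barely differ from their predecessors, are marked as stable.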