Phonetic Tremolo/Glitch Effect — User Guide

Intelligent audio processing: automatically detects speech sounds and applies targeted effects—tremolo on vowels, glitches on fricatives, attenuation on silence—based on acoustic phonetic analysis.

Author: Shai Cohen
Affiliation: Department of Music, Bar-Ilan University, Israel
Version: 1.0 (2025)
License: MIT License
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

What this does

This script implements an intelligent phonetic effect processor that automatically analyzes speech sounds and applies different audio effects based on phonetic classification. Using acoustic feature extraction (pitch, intensity, formants, harmonicity), the system identifies vowels, fricatives, silence, and other sounds, then applies targeted processing: tremolo modulation to vowels, time-shift glitches to fricatives, heavy attenuation to silence, and mild amplification to other sounds.

Key Features:

What is phonetic effect processing? Traditional audio effects apply uniformly to all sounds. Phonetic effect processing uses speech science principles to intelligently target specific types of sounds. By analyzing acoustic features like harmonicity (voicing), formant structure (vowel quality), and intensity (loudness), the system can distinguish between vowels, fricatives, silence, and other sounds, applying customized effects to each category. This creates musically intelligent processing that respects the phonetic character of the audio material.

Technical Implementation: (1) Feature extraction: Creates Pitch, Intensity, Formant, and Harmonicity objects from the input sound. (2) Frame-based analysis: Processes audio in small time frames (default 10ms). (3) Classification: Uses acoustic thresholds to categorize each frame as vowel, fricative, silence, or other. (4) Targeted processing: Applies tremolo to vowels, glitch effects to fricatives, attenuation to silence, and mild boost to other sounds. (5) Debug output: Provides classification statistics for tuning and analysis.

Quick start

  1. In Praat, select exactly one Sound object (preferably speech).
  2. Run script…phonetic_tremolo_glitch_effect.praat.
  3. Adjust Effect Parameters:
    • tremolo_rate_hz: Speed of vowel modulation (2-15 Hz)
    • tremolo_depth: Intensity of tremolo (0.3-0.9)
    • shift_amount_seconds: Glitch time shift (0.005-0.03 s)
  4. Set Feature Extraction parameters for analysis quality.
  5. Adjust Classification Thresholds to tune sound detection.
  6. Click OK — effect applied, result named "originalname_glitched".
  7. Check Info window for classification statistics and debug information.
Quick tip: Start with default settings for general speech. Use higher tremolo_rate_hz (8-12) for rhythmic effects, lower rates (3-6) for subtle modulation. Increase shift_amount_seconds for more dramatic glitch effects. For clean speech, use default classification thresholds. For noisy recordings, adjust vowel_hnr_threshold and fricative_hnr_max to improve detection. Check the debug output to see how many frames were classified into each category.
Important: Works best with speech material — may produce unexpected results on music or noise. Very short frame_step_seconds values (<0.005) can cause processing artifacts. Extreme shift_amount_seconds (>0.03) may cause audible clicks or distortion. Classification thresholds may need adjustment for different speakers or recording conditions. The effect is destructive — original amplitude relationships are altered. Processing time increases with smaller frame steps and longer audio files.

Phonetic Analysis Theory

Acoustic Feature Extraction

Fundamental Acoustic Properties

The system analyzes four key acoustic features:

1. PITCH (Fundamental Frequency - F0)
  • Measures: voicing, vocal fold vibration
  • F0 present = voiced sounds (vowels, sonorants)
  • No F0 (0 Hz) = unvoiced sounds (fricatives, stops, silence)
2. INTENSITY (Amplitude/Energy)
  • Measures: loudness, acoustic energy
  • High intensity = prominent sounds
  • Low intensity = weak sounds or silence
3. FORMANT FREQUENCIES (Spectral Peaks)
  • F1: First formant - vowel height/openness
  • High F1 = open vowels (/a/, /æ/)
  • Low F1 = close vowels (/i/, /u/)
4. HARMONICITY (Harmonics-to-Noise Ratio)
  • Measures: periodicity, voicing quality
  • High HNR = clean voicing (vowels)
  • Low HNR = noise-like (fricatives, breath)

Praat Analysis Objects

Feature extraction pipeline:

    # Create analysis objects (the Sound must be selected before each "To ..." command)
    # F0 tracking (75-600 Hz range)
    To Pitch: 0, 75, 600
    # Intensity (minimum pitch 75 Hz)
    To Intensity: 75, 0, "yes"
    # Formants (Burg method, 5 formants up to max_formant_hz)
    To Formant (burg): 0, 5, max_formant_hz, 0.025, 50
    # HNR (cross-correlation)
    To Harmonicity (cc): frame_step_seconds, 75, 0.1, 1.0

    # Extract values at each time point (select the matching analysis object first)
    pitch_value = Get value at time: time, "Hertz", "Linear"
    intensity_value = Get value at time: time, "Cubic"
    formant1_value = Get value at time: 1, time, "Hertz", "Linear"
    hnr_value = Get value at time: time, "Cubic"

Acoustic Feature Timeline
    Time: 0.0s     0.5s     1.0s     1.5s     2.0s
    Text:  "S"     "A"      "Y"      "SH"     "EE"
    
    Pitch:  ___    /¯¯¯\     /¯¯\     ___     /¯¯¯\
           0 Hz    120 Hz   130 Hz   0 Hz    125 Hz
    
    HNR:    ___    /¯¯¯\     /¯¯\     ___     /¯¯¯\
           2 dB    18 dB    16 dB    1 dB    20 dB
    
    F1:     ___    /¯¯¯\     \___/    ___     \___/
           0 Hz    600 Hz   300 Hz   0 Hz    280 Hz
    
    Class: Fric    Vowel    Vowel   Fric     Vowel
    Effect:Glitch  Tremolo  Tremolo Glitch   Tremolo
    

Different acoustic features trigger different effects

Frame-Based Processing

Temporal Analysis Windows

The audio is processed in small, contiguous frames, each centered on its analysis time:

    frame_step_seconds = 0.01    # 10 ms frame advance
    n_frames = floor(duration / frame_step_seconds)

    For each frame i from 1 to n_frames:
        time = i × frame_step_seconds
        t_start = max(0, time - frame_step_seconds/2)
        t_end = min(duration, time + frame_step_seconds/2)
        # Extract features at center of frame
        # Apply effect to entire frame region
        Formula (part): t_start, t_end, 1, 1, effect_formula
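
The frame arithmetic above can be checked with a short Python sketch. This is an illustration of the formulas only, not the script's Praat code; the function name frame_bounds is hypothetical.

```python
import math

def frame_bounds(duration, frame_step_seconds=0.01):
    """Return (center, t_start, t_end) for every frame, following the loop above."""
    n_frames = math.floor(duration / frame_step_seconds)
    frames = []
    for i in range(1, n_frames + 1):
        time = i * frame_step_seconds
        # Each frame spans half a step on either side of its center,
        # clipped to the start and end of the sound.
        t_start = max(0.0, time - frame_step_seconds / 2)
        t_end = min(duration, time + frame_step_seconds / 2)
        frames.append((time, t_start, t_end))
    return frames

# Print the frames of a very short sound to inspect the boundaries.
for center, t_start, t_end in frame_bounds(0.05):
    print(f"center {center:.3f} s: {t_start:.3f} .. {t_end:.3f} s")
```

With these formulas the first frame starts at frame_step_seconds/2, so the opening half step is not covered by any frame, and the last frame is clipped at the end of the sound.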

Why Frame-Based Processing?

Advantages of frame-based approach:

Temporal precision: Effects can change rapidly with phonetic context
Smooth transitions: Short frame steps keep effect changes between neighboring frames small
Feature stability: Acoustic features are more reliable over 10-20ms windows
Computational efficiency: Avoids redundant analysis of every sample

Frame step trade-offs:
Small step (5ms): Higher temporal resolution, more processing time
Large step (20ms): Lower resolution, faster processing, potential artifacts

Complete Processing Pipeline

    SETUP:
        Select Sound object
        Calculate duration, sampling rate
        Verify valid audio data

    FEATURE EXTRACTION:
        Create Pitch object (75-600 Hz range)
        Create Intensity object (minimum pitch 75 Hz)
        Create Formant object (5 formants, burg method)
        Create Harmonicity object (cross-correlation)

    FRAME PROCESSING LOOP:
        FOR each frame from 1 to n_frames:
            STEP 1: Calculate frame time bounds
                time = frame_index × frame_step_seconds
                t_start = time - frame_step_seconds/2
                t_end = time + frame_step_seconds/2
            STEP 2: Extract acoustic features
                pitch_value = Pitch at time
                intensity_value = Intensity at time
                formant1_value = Formant 1 at time
                hnr_value = Harmonicity at time
            STEP 3: Classify sound type
                IF vowel_conditions → Class 1 (Vowel)
                ELSIF fricative_conditions → Class 2 (Fricative)
                ELSIF silence_conditions → Class 4 (Silence)
                ELSE → Class 3 (Other)
            STEP 4: Apply targeted effect
                Class 1: Tremolo modulation
                Class 2: Time-shift glitch
                Class 3: Mild amplification
                Class 4: Heavy attenuation

    FINALIZATION:
        Normalize output peak to 0.99
        Play result
        Display classification statistics

    OUTPUT:
        "originalname_glitched" with phonetic effects
        Debug info with frame counts per class

Sound Classification System

Classification Logic

🎯 Phonetic Classification Algorithm

Class 1: VOWELS (Tremolo Effect)

    Conditions:
        hnr_val > vowel_hnr_threshold AND    # High harmonicity
        f0_val > 0 AND                       # Voiced (pitch present)
        f1_val > vowel_f1_min_hz             # Vowel-like F1 frequency

    Effect: Amplitude tremolo
        tremolo_phase = (time × rate_hz) mod 1
        trem_amount = 0.3 + 0.7 × sin(2π × tremolo_phase)
        output = input × trem_amount

Class 2: FRICATIVES (Glitch Effect)

    Conditions:
        int_val > silence_intensity_threshold AND    # Not silent
        hnr_val < fricative_hnr_max AND              # Low harmonicity (noisy)
        f0_val = 0                                   # Unvoiced

    Effect: Time-shift glitch
        output = input[x + shift_amount_seconds] × 1.5

Class 4: SILENCE (Attenuation)

    Condition:
        int_val < silence_intensity_threshold    # Low intensity

    Effect: Heavy attenuation
        output = input × 0.05    # 95% reduction

Class 3: OTHER SOUNDS (Boost)

    Condition:
        Everything else not classified above

    Effect: Mild amplification
        output = input × 1.3    # 30% boost
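
The four rules can be written as one decision function. The following Python sketch mirrors the conditions and their evaluation order (vowel, fricative, silence, other) with the default thresholds. The function name classify_frame is hypothetical, and the sketch assumes that unvoiced frames arrive with f0 set to 0; how the Praat script maps undefined pitch values is not shown in this guide.

```python
def classify_frame(f0, hnr, f1, intensity,
                   vowel_hnr_threshold=5.0, vowel_f1_min_hz=300.0,
                   fricative_hnr_max=3.0, silence_intensity_threshold=45.0):
    """Return 1 (vowel), 2 (fricative), 4 (silence) or 3 (other)."""
    # Class 1: harmonic, voiced, vowel-like F1
    if hnr > vowel_hnr_threshold and f0 > 0 and f1 > vowel_f1_min_hz:
        return 1
    # Class 2: audible, noisy, unvoiced (assumes unvoiced frames have f0 == 0)
    if intensity > silence_intensity_threshold and hnr < fricative_hnr_max and f0 == 0:
        return 2
    # Class 4: below the silence intensity threshold
    if intensity < silence_intensity_threshold:
        return 4
    # Class 3: everything else
    return 3

print(classify_frame(f0=120, hnr=18, f1=600, intensity=70))  # vowel-like frame -> 1
print(classify_frame(f0=0, hnr=2, f1=0, intensity=60))       # noisy unvoiced frame -> 2
print(classify_frame(f0=0, hnr=1, f1=0, intensity=30))       # quiet frame -> 4
print(classify_frame(f0=110, hnr=4, f1=250, intensity=60))   # voiced, weakly harmonic -> 3
```

Because the vowel rule is tested first and does not look at intensity, a quiet but clearly harmonic frame is still treated as a vowel in this sketch.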

Acoustic Thresholds Explained

Threshold                    Typical Range  Default  Purpose                        Affects
vowel_hnr_threshold          3-10 dB        5.0      Minimum HNR for vowels         Vowel detection sensitivity
vowel_f1_min_hz              200-400 Hz     300      Minimum F1 for vowels          Excludes glides, weak vowels
fricative_hnr_max            2-5 dB         3.0      Maximum HNR for fricatives     Fricative detection sensitivity
silence_intensity_threshold  40-55 dB       45       Maximum intensity for silence  Silence detection threshold

Effect Parameters

Parameter             Range         Default  Effect                  Musical Result
tremolo_rate_hz       2-15 Hz       8.0      Vowel modulation speed  Slow: subtle, Fast: rhythmic
tremolo_depth         0.3-0.9       0.7      Vowel modulation depth  Low: gentle, High: dramatic
shift_amount_seconds  0.005-0.03 s  0.015    Fricative time shift    Short: subtle, Long: obvious glitch

Debugging and Tuning

Interpreting Classification Output

Typical classification distribution for clean speech:

Vowels (Class 1): 30-50% of frames ← Tremolo effect
Fricatives (Class 2): 15-25% of frames ← Glitch effect
Other (Class 3): 20-35% of frames ← Boost effect
Silence (Class 4): 10-20% of frames ← Attenuation

Warning signs:
- Vowels < 20%: vowel_hnr_threshold too high
- Fricatives < 10%: fricative_hnr_max too low
- Silence > 40%: silence_intensity_threshold too high
- Other > 50%: thresholds need adjustment
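
The warning signs can be checked automatically against the frame counts from the debug output. A minimal Python sketch, assuming the counts have been copied by hand into a dictionary keyed by class number; the function name classification_warnings is hypothetical. The messages follow from the class conditions: too few vowels points to a high vowel_hnr_threshold, and too many silence frames points to a high silence_intensity_threshold.

```python
def classification_warnings(counts):
    """counts maps class number (1-4) to a frame count; returns warning strings."""
    total = sum(counts.values())
    if total == 0:
        return ["no frames were classified"]
    # Percentage of frames per class
    share = {c: 100.0 * counts.get(c, 0) / total for c in (1, 2, 3, 4)}
    warnings = []
    if share[1] < 20:
        warnings.append("vowels < 20%: vowel_hnr_threshold may be too high")
    if share[2] < 10:
        warnings.append("fricatives < 10%: fricative_hnr_max may be too low")
    if share[4] > 40:
        warnings.append("silence > 40%: silence_intensity_threshold may be too high")
    if share[3] > 50:
        warnings.append("other > 50%: thresholds need adjustment")
    return warnings

# Hypothetical counts for illustration only
for message in classification_warnings({1: 120, 2: 30, 3: 400, 4: 450}):
    print(message)
```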

Tuning for Different Audio Material

For clean studio speech: Use default thresholds. The system should detect vowels and fricatives accurately.

For noisy recordings: Increase vowel_hnr_threshold to 6-8, decrease fricative_hnr_max to 2.0-2.5 to reduce false detections.

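For whispered speech: Set vowel_f1_min_hz to 200 (weaker vowels) and decrease silence_intensity_threshold toward 40 so that quiet speech is not classified as silence. Because whispering is unvoiced (no F0), few frames will meet the vowel conditions regardless of the thresholds.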

For musical vocals: Use lower tremolo_rate_hz (3-5 Hz) for subtle modulation, reduce shift_amount_seconds for less obvious glitches.

Parameters

Effect Parameters

Parameter             Type      Range       Default  Description
tremolo_rate_hz       positive  2.0-15.0    8.0      Tremolo speed on vowels (Hz)
tremolo_depth         positive  0.3-0.9     0.7      Tremolo intensity on vowels
shift_amount_seconds  positive  0.005-0.03  0.015    Time shift for fricative glitches

Feature Extraction

Parameter           Type      Range       Default  Description
frame_step_seconds  positive  0.005-0.02  0.01     Analysis frame step size
max_formant_hz      positive  4000-8000   5500     Maximum formant frequency for analysis

Classification Thresholds

Parameter                    Type      Range     Default  Description
vowel_hnr_threshold          positive  3.0-10.0  5.0      Minimum HNR for vowel detection
vowel_f1_min_hz              positive  200-400   300      Minimum F1 frequency for vowels
fricative_hnr_max            positive  2.0-5.0   3.0      Maximum HNR for fricative detection
silence_intensity_threshold  positive  40-55     45       Maximum intensity for silence detection

Applications

Creative Vocal Processing

Use case: Adding rhythmic interest to vocal tracks

Technique: Use musical tremolo rates synchronized to tempo

Settings:

Result: Vocals with rhythmic pulsation on vowels and subtle texture on consonants

Experimental Sound Design

Use case: Creating glitchy, fragmented vocal textures

Technique: Use extreme settings for dramatic effects

Settings:

Result: Highly processed, glitch-art vocal effects

Speech Enhancement and Effects

Use case: Adding character to spoken word

Technique: Use subtle effects with careful threshold tuning

Settings:

Result: Enhanced speech with added character while maintaining intelligibility

Audio Restoration and Noise Control

Use case: Reducing background noise in speech recordings

Technique: Use aggressive silence detection and attenuation

Settings:

Result: Cleaner speech with reduced background noise during pauses

Practical Workflow Examples

🎤 Vocal Rhythm Effect

Goal: Add tempo-synchronized pulsation to singing

Settings:

  • tremolo_rate_hz: 2.0 (120 BPM = 2 Hz)
  • tremolo_depth: 0.6
  • shift_amount_seconds: 0.01
  • All thresholds: default

Result: Singing with gentle beat-synchronized tremolo
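
The tempo conversion used above (120 BPM = 2 Hz) generalizes to other tempos and note subdivisions. A small Python sketch; the function name and the pulses_per_beat parameter are hypothetical conveniences, not script parameters.

```python
def tremolo_rate_for_tempo(bpm, pulses_per_beat=1):
    """Convert a tempo in beats per minute to a tremolo rate in Hz."""
    return bpm / 60.0 * pulses_per_beat

print(tremolo_rate_for_tempo(120))      # one pulse per beat at 120 BPM
print(tremolo_rate_for_tempo(120, 4))   # four pulses per beat (sixteenth notes)
print(tremolo_rate_for_tempo(90, 2))    # two pulses per beat at 90 BPM
```

Results outside the documented 2-15 Hz range for tremolo_rate_hz would need a different subdivision.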

🔊 Glitch Vocal Effect

Goal: Create experimental glitch vocals

Settings:

  • tremolo_rate_hz: 12.0
  • tremolo_depth: 0.85
  • shift_amount_seconds: 0.025
  • fricative_hnr_max: 2.5 (limits glitches to the noisiest frames; raise above 3.0 for more fricative detection)

Result: Intensely processed glitch-art vocals

🎙️ Speech Enhancement

Goal: Add subtle character to spoken word

Settings:

  • tremolo_rate_hz: 4.0
  • tremolo_depth: 0.4
  • shift_amount_seconds: 0.008
  • vowel_f1_min_hz: 250 (catch weaker vowels)

Result: Enhanced speech with subtle texture

Advanced Techniques

Layered processing:
  • Apply the effect multiple times with different settings
  • First pass: subtle tremolo only
  • Second pass: glitch effects only
  • Creates complex, evolving textures
Selective processing by phonetic context:
  • Process only specific phoneme types by adjusting thresholds
  • High vowel_hnr_threshold + low vowel_f1_min_hz = only clear vowels
  • Low fricative_hnr_max = only strong fricatives
  • Create custom phonetic effect profiles

Troubleshooting Common Issues

Problem: Too many frames classified as "Other"
Cause: Classification thresholds too strict
Solution: Lower vowel_hnr_threshold, raise fricative_hnr_max, adjust vowel_f1_min_hz
Problem: Unwanted tremolo on non-vowel sounds
Cause: vowel_hnr_threshold too low, detecting noise as vowels
Solution: Increase vowel_hnr_threshold, check debug output
Problem: Glitch effects causing clicks or distortion
Cause: shift_amount_seconds too large for frame size
Solution: Reduce shift_amount_seconds, increase frame_step_seconds
Problem: Effect sounds "choppy" or artificial
Cause: frame_step_seconds too large, creating abrupt changes
Solution: Decrease frame_step_seconds for smoother transitions

Technical Deep Dive

Acoustic Phonetics Basis

Phonetic Class Acoustic Signatures

Typical acoustic values for different sound classes:

VOWELS (/a/, /i/, /u/):
  • HNR: 8-25 dB (high periodicity)
  • F0: 80-300 Hz (voiced)
  • F1: 250-1000 Hz (vowel height)
  • Intensity: Medium-High
FRICATIVES (/s/, /ʃ/, /f/):
  • HNR: 0-3 dB (low periodicity, noisy)
  • F0: 0 Hz (unvoiced)
  • F1: N/A or very high
  • Intensity: Variable (/s/=high, /f/=low)
SONORANTS (/m/, /n/, /l/):
  • HNR: 5-15 dB (moderate periodicity)
  • F0: Voiced (like vowels)
  • F1: Lower than vowels
  • Intensity: Medium
SILENCE/PAUSES:
  • HNR: Undefined or very low
  • F0: 0 Hz
  • Intensity: < 45 dB (background noise)

Praat Analysis Methods

Technical implementation details: The script uses Praat's built-in analysis methods: the Burg method for formant analysis (accurate for speech), cross-correlation for harmonicity (robust HNR estimation), and intensity analysis with a 75 Hz minimum pitch, which sets the length of the analysis window. All four analysis objects are queried at the same frame-center time, which keeps them temporally aligned. Values between analysis frames are obtained by linear interpolation (pitch, formant) or cubic interpolation (intensity, harmonicity).

Effect Algorithm Details

Tremolo Implementation

The tremolo effect uses a sine-wave amplitude modulation:

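    tremolo_phase = (time × tremolo_rate_hz) mod 1
    trem_amount = 0.3 + 0.7 × sin(2π × tremolo_phase)

Where:
  • time = current frame center time
  • tremolo_rate_hz = modulation frequency
  • 0.3 = offset (the gain when the sine term is zero)
  • 0.7 = modulation amplitude (equal to the default tremolo_depth)

Because the sine term ranges from -1 to 1, trem_amount ranges from -0.4 to 1.0. In the negative part of the cycle the gain passes through zero and the signal is polarity-inverted at up to 40% of its original level, so the 0.3 offset alone does not keep the modulation dips away from silence. Since time is the frame center, the gain is constant within each frame and changes in steps from frame to frame.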
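
Evaluating the printed formula numerically is a quick way to see the gain range it produces. The Python sketch below samples the formula once per 10 ms frame at the default 8 Hz rate; the function name tremolo_gain is hypothetical, and the constants 0.3 and 0.7 are taken from the formula as printed.

```python
import math

def tremolo_gain(time, tremolo_rate_hz=8.0):
    """Gain from the printed formula: 0.3 + 0.7 * sin(2*pi*phase)."""
    tremolo_phase = (time * tremolo_rate_hz) % 1.0
    return 0.3 + 0.7 * math.sin(2 * math.pi * tremolo_phase)

# One gain value per frame center for the first second of audio
gains = [tremolo_gain(i * 0.01) for i in range(1, 101)]
print(f"lowest gain:  {min(gains):.3f}")
print(f"highest gain: {max(gains):.3f}")
```

The sine term spans -1 to 1, so the gain spans 0.3 - 0.7 = -0.4 to 0.3 + 0.7 = 1.0. A variant that stays between 0.3 and 1.0 would need the sine rescaled to the range 0 to 1 first.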

Glitch Effect Implementation

The glitch effect uses time-domain shifting:

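    Formula: "self (x + 'shift_amount_seconds') * 1.5"

Where:
  • x = current time coordinate
  • shift_amount_seconds = time displacement
  • 1.5 = amplification factor (50% boost)

In Praat formulas, parentheses index a Sound by time, whereas square brackets index by sample number, so a time-shifted read uses self (...). Each sample in the frame is replaced by the sample shift_amount_seconds later, scaled by 1.5. The audible glitch comes from the discontinuities where shifted frames meet unshifted ones.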
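
The time-shifted read can be illustrated on a plain list of samples. This Python sketch replaces every sample in one frame region with the sample shift_amount_seconds later, scaled by 1.5. The function name glitch_region is hypothetical, and reads past the end of the sound are assumed to return 0; it models the idea rather than Praat's Formula command.

```python
def glitch_region(samples, sample_rate, t_start, t_end,
                  shift_amount_seconds=0.015, gain=1.5):
    """Return a copy of samples with [t_start, t_end) read from a later position."""
    out = list(samples)
    shift = round(shift_amount_seconds * sample_rate)   # shift in samples
    first = round(t_start * sample_rate)
    last = min(len(samples), round(t_end * sample_rate))
    for n in range(first, last):
        src = n + shift
        # Assumption: positions beyond the end of the sound read as silence
        out[n] = gain * samples[src] if src < len(samples) else 0.0
    return out

# Toy signal: a ramp of 10 samples at 10 Hz, so each sample lasts 0.1 s
ramp = [float(n) for n in range(10)]
print(glitch_region(ramp, 10, 0.2, 0.4, shift_amount_seconds=0.3))
```

Samples outside the region keep their original values, which is why the jump at each region boundary is audible.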

Performance Considerations

Computational Complexity

Processing time factors:

Audio duration: Linear increase with longer files
Frame step: Inverse relationship (smaller step = more frames)
Analysis parameters: Higher max_formant_hz increases formant analysis time
Hardware: CPU speed and available RAM

Typical processing times:
1-minute speech, default settings: 10-30 seconds
5-minute speech, default settings: 1-3 minutes
Very small frame_step (5ms): 2-3× longer processing
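
The inverse relationship between frame step and workload follows from the frame count, n_frames = floor(duration / frame_step_seconds). A short Python sketch that compares frame counts for a one-minute file; the function name frame_count is hypothetical, and the small epsilon is an addition that guards against floating-point rounding in the division.

```python
import math

def frame_count(duration_seconds, frame_step_seconds=0.01):
    """Number of frames the processing loop visits."""
    # The epsilon avoids results such as 5999.999... being floored one too low
    return math.floor(duration_seconds / frame_step_seconds + 1e-9)

for step in (0.005, 0.01, 0.02):
    print(f"step {step * 1000:.0f} ms: {frame_count(60.0, step)} frames")
```

Halving the step doubles the number of frames, and with it the number of feature queries and Formula (part) calls.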

Memory Usage

The script creates multiple analysis objects:

Memory required ≈ 4 × original sound memory

Objects created:
  • Original Sound
  • Output Sound (copy)
  • Pitch object
  • Intensity object
  • Formant object
  • Harmonicity object

For long files, consider processing in segments. Praat may show "out of memory" for very long, high-sample-rate files.
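
One way to plan segment-wise processing is to compute the segment boundaries first and then extract and process each part in turn. A minimal Python sketch; the function name segment_plan and the 60-second default are hypothetical choices, not script settings.

```python
def segment_plan(duration_seconds, max_segment_seconds=60.0):
    """Split [0, duration] into consecutive (start, end) segments."""
    segments = []
    start = 0.0
    while start < duration_seconds:
        end = min(duration_seconds, start + max_segment_seconds)
        segments.append((start, end))
        start = end
    return segments

for start, end in segment_plan(150.0):
    print(f"process {start:.0f} s to {end:.0f} s")
```

If each segment is processed as a separate Sound, its time axis restarts at the segment's beginning, so the tremolo phase will generally not be continuous across segment boundaries.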