Phonetic Tremolo/Glitch Effect — User Guide

Intelligent audio processing: automatically detects speech sounds and applies targeted effects—tremolo on vowels, glitches on fricatives, attenuation on silence—based on acoustic phonetic analysis.

Author: Shai Cohen
Affiliation: Department of Music, Bar-Ilan University, Israel
Version: 1.0 (2025)
License: MIT License
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

Contents:

What this does
Quick start
Phonetic Analysis Theory
Sound Classification
Parameters
Applications

What this does

This script implements an intelligent phonetic effect processor that automatically analyzes speech sounds and applies different audio effects based on phonetic classification. Using acoustic feature extraction (pitch, intensity, formants, harmonicity), the system identifies vowels, fricatives, silence, and other sounds, then applies targeted processing: tremolo modulation to vowels, time-shift glitches to fricatives, heavy attenuation to silence, and mild amplification to other sounds.

Key Features:

Automatic Phonetic Classification: Frame-by-frame detection of speech sound types
Targeted Effect Application: Different effects for different sound categories
Acoustic Feature Analysis: Uses pitch, formants, harmonicity, intensity
Frame-based Processing: Precise temporal control over effect application
Debug Output: Shows classification statistics and processing results

What is phonetic effect processing? Traditional audio effects apply uniformly to all sounds. Phonetic effect processing uses speech science principles to intelligently target specific types of sounds. By analyzing acoustic features like harmonicity (voicing), formant structure (vowel quality), and intensity (loudness), the system can distinguish between vowels, fricatives, silence, and other sounds, applying customized effects to each category. This creates musically intelligent processing that respects the phonetic character of the audio material.

Technical Implementation: (1) Feature extraction: Creates Pitch, Intensity, Formant, and Harmonicity objects from the input sound. (2) Frame-based analysis: Processes audio in small time frames (default 10ms). (3) Classification: Uses acoustic thresholds to categorize each frame as vowel, fricative, silence, or other. (4) Targeted processing: Applies tremolo to vowels, glitch effects to fricatives, attenuation to silence, and mild boost to other sounds. (5) Debug output: Provides classification statistics for tuning and analysis.

Quick start

In Praat, select exactly one Sound object (preferably speech).
Run script… → phonetic_tremolo_glitch_effect.praat.
Adjust Effect Parameters:
- tremolo_rate_hz: Speed of vowel modulation (2-15 Hz)
- tremolo_depth: Intensity of tremolo (0.3-0.9)
- shift_amount_seconds: Glitch time shift (0.005-0.03 s)
Set Feature Extraction parameters for analysis quality.
Adjust Classification Thresholds to tune sound detection.
Click OK — effect applied, result named "originalname_glitched".
Check Info window for classification statistics and debug information.

Quick tip: Start with default settings for general speech. Use higher tremolo_rate_hz (8-12) for rhythmic effects, lower rates (3-6) for subtle modulation. Increase shift_amount_seconds for more dramatic glitch effects. For clean speech, use default classification thresholds. For noisy recordings, adjust vowel_hnr_threshold and fricative_hnr_max to improve detection. Check the debug output to see how many frames were classified into each category.

Important: The effect works best on speech and may produce unexpected results on music or noise. Very short frame_step_seconds values (<0.005) can cause processing artifacts, and extreme shift_amount_seconds values (>0.03) may cause audible clicks or distortion. Classification thresholds may need adjustment for different speakers or recording conditions. The processing changes level relationships in the signal: each sound class receives a different gain, so the original amplitude balance is not preserved in the "originalname_glitched" output. Processing time increases with smaller frame steps and longer audio files.

Phonetic Analysis Theory

Acoustic Feature Extraction

Fundamental Acoustic Properties

The system analyzes four key acoustic features:

1. PITCH (Fundamental Frequency, F0)
- Measures: voicing, vocal fold vibration
- F0 present = voiced sounds (vowels, sonorants)
- No F0 detected (treated as 0 Hz) = unvoiced sounds (fricatives, stops, silence)

2. INTENSITY (Amplitude/Energy)
- Measures: loudness, acoustic energy
- High intensity = prominent sounds
- Low intensity = weak sounds or silence

3. FORMANT FREQUENCIES (Spectral Peaks)
- F1: first formant, related to vowel height/openness
- High F1 = open vowels (/a/, /æ/)
- Low F1 = close vowels (/i/, /u/)

4. HARMONICITY (Harmonics-to-Noise Ratio)
- Measures: periodicity, voicing quality
- High HNR = clean voicing (vowels)
- Low HNR = noise-like sounds (fricatives, breath)

Praat Analysis Objects

Feature extraction pipeline:

    # Create analysis objects (run each command with the Sound selected)
    # F0 tracking (75-600 Hz range)
    To Pitch: 0, 75, 600
    # Intensity (minimum pitch 75 Hz, automatic time step, subtract mean)
    To Intensity: 75, 0, "yes"
    # Formants (5 formants up to max_formant_hz)
    To Formant (burg): 0, 5, max_formant_hz, 0.025, 50
    # HNR (cross-correlation method)
    To Harmonicity (cc): frame_step_seconds, 75, 0.1, 1.0

    # Extract values at each time point (run each query with the matching analysis object selected)
    pitch_value = Get value at time: time, "Hertz", "Linear"
    intensity_value = Get value at time: time, "Cubic"
    formant1_value = Get value at time: 1, time, "Hertz", "Linear"
    hnr_value = Get value at time: time, "Cubic"

Acoustic Feature Timeline

    Time: 0.0s     0.5s     1.0s     1.5s     2.0s
    Text:  "S"     "A"      "Y"      "SH"     "EE"
    
    Pitch:  ___    /¯¯¯\     /¯¯\     ___     /¯¯¯\
           0 Hz    120 Hz   130 Hz   0 Hz    125 Hz
    
    HNR:    ___    /¯¯¯\     /¯¯\     ___     /¯¯¯\
           2 dB    18 dB    16 dB    1 dB    20 dB
    
    F1:     ___    /¯¯¯\     \___/    ___     \___/
           0 Hz    600 Hz   300 Hz   0 Hz    280 Hz
    
    Class: Fric    Vowel    Vowel   Fric     Vowel
    Effect:Glitch  Tremolo  Tremolo Glitch   Tremolo

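Different acoustic features trigger different effects. The values are illustrative: with the default vowel_f1_min_hz of 300, the "Y" and "EE" frames (F1 of 300 and 280 Hz) would not pass the vowel test, so they would need a lower threshold to receive tremolo.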

Frame-Based Processing

Temporal Analysis Windows

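The audio is processed in small, contiguous frames. Each frame extends half a step on either side of its center time, so neighboring frames meet without overlapping: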

    frame_step_seconds = 0.01    # 10 ms frame advance
    n_frames = floor(duration / frame_step_seconds)

    For each frame i from 1 to n_frames:
        time = i × frame_step_seconds
        t_start = max(0, time - frame_step_seconds/2)
        t_end = min(duration, time + frame_step_seconds/2)
        # Extract features at center of frame
        # Apply effect to entire frame region
        Formula (part): t_start, t_end, 1, 1, effect_formula
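The frame arithmetic above can be checked outside Praat with a short Python sketch. This is an illustration of the loop as documented, not the script's own code; the function name `frame_bounds` is hypothetical.

```python
import math


def frame_bounds(duration, frame_step_seconds=0.01):
    """Return (center, t_start, t_end) for every frame, mirroring the loop above."""
    n_frames = math.floor(duration / frame_step_seconds)
    frames = []
    for i in range(1, n_frames + 1):
        time = i * frame_step_seconds                      # frame center
        t_start = max(0.0, time - frame_step_seconds / 2)  # clip at start of sound
        t_end = min(duration, time + frame_step_seconds / 2)  # clip at end of sound
        frames.append((time, t_start, t_end))
    return frames


if __name__ == "__main__":
    for center, t_start, t_end in frame_bounds(1.0, 0.25):
        print(center, t_start, t_end)
```

For a 1-second sound with a 0.25 s step this gives four frames centered at 0.25, 0.5, 0.75 and 1.0 s. With this indexing, the stretch before frame_step_seconds/2 belongs to no frame, and the last frame is clipped at the end of the sound.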

Why Frame-Based Processing?

Advantages of frame-based approach:

Temporal precision: Effects can change rapidly with phonetic context
Fine-grained transitions: Short frames keep each effect change brief, although the gain can still step at frame boundaries
Feature stability: Acoustic features are more reliable over 10-20ms windows
Computational efficiency: Features are queried once per frame, avoiding redundant analysis of every sample

Frame step trade-offs:
Small step (5ms): Higher temporal resolution, more processing time
Large step (20ms): Lower resolution, faster processing, potential artifacts

Complete Processing Pipeline

SETUP:
    Select Sound object
    Calculate duration, sampling rate
    Verify valid audio data

FEATURE EXTRACTION:
    Create Pitch object (75-600 Hz range)
    Create Intensity object (minimum pitch 75 Hz)
    Create Formant object (5 formants, Burg method)
    Create Harmonicity object (cross-correlation)

FRAME PROCESSING LOOP:
    FOR each frame from 1 to n_frames:
        STEP 1: Calculate frame time bounds
            time = frame_index × frame_step_seconds
            t_start = time - frame_step_seconds/2
            t_end = time + frame_step_seconds/2
        STEP 2: Extract acoustic features
            pitch_value = Pitch at time
            intensity_value = Intensity at time
            formant1_value = Formant 1 at time
            hnr_value = Harmonicity at time
        STEP 3: Classify sound type
            IF vowel_conditions → Class 1 (Vowel)
            ELSIF fricative_conditions → Class 2 (Fricative)
            ELSIF silence_conditions → Class 4 (Silence)
            ELSE → Class 3 (Other)
        STEP 4: Apply targeted effect
            Class 1: Tremolo modulation
            Class 2: Time-shift glitch
            Class 3: Mild amplification
            Class 4: Heavy attenuation

FINALIZATION:
    Normalize output peak to 0.99
    Play result
    Display classification statistics

OUTPUT:
    "originalname_glitched" with phonetic effects
    Debug info with frame counts per class
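The finalization step scales the output so its peak sits at 0.99, which matters because the fricative (× 1.5) and other (× 1.3) gains can push samples past full scale. A minimal Python sketch of that scaling is shown here. The function name `normalize_peak` is hypothetical, and whether the script does this with Praat's Scale peak command or another method is an assumption.

```python
def normalize_peak(samples, target_peak=0.99):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # All-zero (or empty) input: nothing to scale, avoid dividing by zero.
        return list(samples)
    scale = target_peak / peak
    return [s * scale for s in samples]


if __name__ == "__main__":
    boosted = [0.5, -2.0, 1.0]      # contains a value beyond full scale
    print(normalize_peak(boosted))  # largest magnitude becomes 0.99
```

Because a single scale factor is applied to the whole sound, the gain differences between classes are kept; only the overall level changes.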

Sound Classification System

Classification Logic

🎯 Phonetic Classification Algorithm

Class 1: VOWELS (Tremolo Effect)

Conditions:
    hnr_val > vowel_hnr_threshold AND    # High harmonicity
    f0_val > 0 AND                       # Voiced (pitch present)
    f1_val > vowel_f1_min_hz             # Vowel-like F1 frequency
Effect: Amplitude tremolo
    tremolo_phase = (time × tremolo_rate_hz) mod 1
    trem_amount = 0.3 + 0.7 × sin(2π × tremolo_phase)
    output = input × trem_amount

Class 2: FRICATIVES (Glitch Effect)

Conditions:
    int_val > silence_intensity_threshold AND    # Not silent
    hnr_val < fricative_hnr_max AND              # Low harmonicity (noisy)
    f0_val = 0                                   # Unvoiced
Effect: Time-shift glitch
    output = input[x + shift_amount_seconds] × 1.5

Class 4: SILENCE (Attenuation)

Condition:
    int_val < silence_intensity_threshold    # Low intensity
Effect: Heavy attenuation
    output = input × 0.05    # 95% reduction

Class 3: OTHER SOUNDS (Boost)

Condition: Everything else not classified above
Effect: Mild amplification
    output = input × 1.3    # 30% boost
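The four rules are evaluated in order (vowel, fricative, silence, other), so the first match wins. The Python sketch below restates that decision chain with the documented default thresholds. It is an illustration rather than the script's code: the names `Thresholds` and `classify_frame` are hypothetical, and representing an undefined Praat measurement as None (mapped to NaN, which fails every comparison) is an assumption.

```python
import math
from dataclasses import dataclass

VOWEL, FRICATIVE, OTHER, SILENCE = 1, 2, 3, 4


@dataclass(frozen=True)
class Thresholds:
    vowel_hnr_threshold: float = 5.0          # dB
    vowel_f1_min_hz: float = 300.0            # Hz
    fricative_hnr_max: float = 3.0            # dB
    silence_intensity_threshold: float = 45.0  # dB


def _num(value):
    """Map a missing measurement (None) to NaN so all comparisons are False."""
    return math.nan if value is None else float(value)


def classify_frame(f0, intensity, f1, hnr, th=Thresholds()):
    """Return the class number for one frame, testing the rules in documented order."""
    f0, intensity, f1, hnr = (_num(v) for v in (f0, intensity, f1, hnr))
    voiced = f0 > 0
    if hnr > th.vowel_hnr_threshold and voiced and f1 > th.vowel_f1_min_hz:
        return VOWEL
    if (intensity > th.silence_intensity_threshold
            and hnr < th.fricative_hnr_max and not voiced):
        return FRICATIVE
    if intensity < th.silence_intensity_threshold:
        return SILENCE
    return OTHER


if __name__ == "__main__":
    print(classify_frame(f0=120, intensity=65, f1=600, hnr=18))    # open vowel
    print(classify_frame(f0=None, intensity=60, f1=None, hnr=2))   # noisy, unvoiced
```

A voiced, harmonic frame whose F1 is at or below vowel_f1_min_hz falls through to Other, which is one reason a large Other share can point to an F1 threshold that is too high.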

Acoustic Thresholds Explained

Threshold	Typical Range	Default	Purpose	Affects
vowel_hnr_threshold	3-10 dB	5.0	Minimum HNR for vowels	Vowel detection sensitivity
vowel_f1_min_hz	200-400 Hz	300	Minimum F1 for vowels	Excludes glides, weak vowels
fricative_hnr_max	2-5 dB	3.0	Maximum HNR for fricatives	Fricative detection sensitivity
silence_intensity_threshold	40-55 dB	45	Maximum intensity for silence	Silence detection threshold

Effect Parameters

Parameter	Range	Default	Effect	Musical Result
tremolo_rate_hz	2-15 Hz	8.0	Vowel modulation speed	Slow: subtle, Fast: rhythmic
tremolo_depth	0.3-0.9	0.7	Vowel modulation depth	Low: gentle, High: dramatic
shift_amount_seconds	0.005-0.03 s	0.015	Fricative time shift	Short: subtle, Long: obvious glitch

Debugging and Tuning

Interpreting Classification Output

Typical classification distribution for clean speech:

Vowels (Class 1): 30-50% of frames ← Tremolo effect
Fricatives (Class 2): 15-25% of frames ← Glitch effect
Other (Class 3): 20-35% of frames ← Boost effect
Silence (Class 4): 10-20% of frames ← Attenuation

Warning signs:
- Vowels < 20%: vowel_hnr_threshold too high
- Fricatives < 10%: fricative_hnr_max too low
- Silence > 40%: silence_intensity_threshold too high
- Other > 50%: thresholds need adjustment
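These checks can be automated once the per-frame class numbers are available. The following Python sketch computes the share of each class and returns tuning hints. The function names are hypothetical, and the hint wording is derived from the classification conditions (for example, silence is detected when intensity falls below the threshold, so an excess of silence suggests lowering it).

```python
from collections import Counter

VOWEL, FRICATIVE, OTHER, SILENCE = 1, 2, 3, 4


def class_shares(labels):
    """Return the fraction of frames in each class (0.0 for an empty input)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: (counts.get(c, 0) / total if total else 0.0)
            for c in (VOWEL, FRICATIVE, OTHER, SILENCE)}


def tuning_hints(labels):
    """Apply the warning-sign limits listed above and return advice strings."""
    shares = class_shares(labels)
    hints = []
    if shares[VOWEL] < 0.20:
        hints.append("vowels < 20%: try lowering vowel_hnr_threshold")
    if shares[FRICATIVE] < 0.10:
        hints.append("fricatives < 10%: try raising fricative_hnr_max")
    if shares[SILENCE] > 0.40:
        hints.append("silence > 40%: try lowering silence_intensity_threshold")
    if shares[OTHER] > 0.50:
        hints.append("other > 50%: thresholds need adjustment")
    return hints


if __name__ == "__main__":
    labels = [SILENCE] * 50 + [VOWEL] * 30 + [FRICATIVE] * 10 + [OTHER] * 10
    print(tuning_hints(labels))
```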

Tuning for Different Audio Material

For clean studio speech: Use default thresholds. The system should detect vowels and fricatives accurately.

For noisy recordings: Increase vowel_hnr_threshold to 6-8, decrease fricative_hnr_max to 2.0-2.5 to reduce false detections.

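For whispered speech: Set vowel_f1_min_hz to 200 (weaker vowels) and lower silence_intensity_threshold toward 40 so that quiet speech is not classified as silence. Whispered vowels have no F0, so they fail the voicing test and fall into the fricative or other classes rather than the vowel class.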

For musical vocals: Use lower tremolo_rate_hz (3-5 Hz) for subtle modulation, reduce shift_amount_seconds for less obvious glitches.

Parameters

Effect Parameters

Parameter	Type	Range	Default	Description
tremolo_rate_hz	positive	2.0-15.0	8.0	Tremolo speed on vowels (Hz)
tremolo_depth	positive	0.3-0.9	0.7	Tremolo intensity on vowels
shift_amount_seconds	positive	0.005-0.03	0.015	Time shift for fricative glitches

Feature Extraction

Parameter	Type	Range	Default	Description
frame_step_seconds	positive	0.005-0.02	0.01	Analysis frame step size
max_formant_hz	positive	4000-8000	5500	Maximum formant frequency for analysis

Classification Thresholds

Parameter	Type	Range	Default	Description
vowel_hnr_threshold	positive	3.0-10.0	5.0	Minimum HNR for vowel detection
vowel_f1_min_hz	positive	200-400	300	Minimum F1 frequency for vowels
fricative_hnr_max	positive	2.0-5.0	3.0	Maximum HNR for fricative detection
silence_intensity_threshold	positive	40-55	45	Maximum intensity for silence detection

Applications

Creative Vocal Processing

Use case: Adding rhythmic interest to vocal tracks

Technique: Use musical tremolo rates synchronized to tempo

Settings:

tremolo_rate_hz = BPM/60 (one cycle per beat)
Medium tremolo_depth (0.5-0.7) for clear modulation
Subtle shift_amount_seconds (0.008-0.012) for gentle glitches
Default classification thresholds for clean vocals

Result: Vocals with rhythmic pulsation on vowels and subtle texture on consonants
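The tempo relationship is simple arithmetic, sketched here in Python. The function name and the `cycles_per_beat` argument are hypothetical additions for illustration; the guide itself only gives the one-cycle-per-beat case.

```python
def tremolo_rate_for_tempo(bpm, cycles_per_beat=1.0):
    """Convert a tempo in beats per minute to a tremolo rate in Hz."""
    if bpm <= 0 or cycles_per_beat <= 0:
        raise ValueError("bpm and cycles_per_beat must be positive")
    return bpm / 60.0 * cycles_per_beat


if __name__ == "__main__":
    print(tremolo_rate_for_tempo(120))       # one cycle per beat at 120 BPM
    print(tremolo_rate_for_tempo(120, 4.0))  # sixteenth-note pulsation at 120 BPM
```

At 120 BPM one cycle per beat gives 2 Hz, the lower edge of the documented 2-15 Hz range. Slower tempos need more than one cycle per beat to stay inside that range.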

Experimental Sound Design

Use case: Creating glitchy, fragmented vocal textures

Technique: Use extreme settings for dramatic effects

Settings:

Fast tremolo_rate_hz (10-15 Hz) for intense modulation
Large shift_amount_seconds (0.02-0.03) for obvious glitches
High tremolo_depth (0.8-0.9) for dramatic volume swings
Adjust thresholds to target specific sound types

Result: Highly processed, glitch-art vocal effects

Speech Enhancement and Effects

Use case: Adding character to spoken word

Technique: Use subtle effects with careful threshold tuning

Settings:

Slow tremolo_rate_hz (3-5 Hz) for gentle movement
Low tremolo_depth (0.3-0.5) for subtle effect
Small shift_amount_seconds (0.005-0.01) for texture
Tune thresholds for specific speaker characteristics

Result: Enhanced speech with added character while maintaining intelligibility

Audio Restoration and Noise Control

Use case: Reducing background noise in speech recordings

Technique: Use aggressive silence detection and attenuation

Settings:

Increase silence_intensity_threshold to detect more noise as silence
Use default or lower tremolo settings
Minimal glitch effects (small shift_amount_seconds)
Adjust vowel_hnr_threshold higher to avoid detecting noise as vowels

Result: Cleaner speech with reduced background noise during pauses

Practical Workflow Examples

🎤 Vocal Rhythm Effect

Goal: Add tempo-synchronized pulsation to singing

Settings:

tremolo_rate_hz: 2.0 (120 BPM = 2 Hz)
tremolo_depth: 0.6
shift_amount_seconds: 0.01
All thresholds: default

Result: Singing with gentle beat-synchronized tremolo

🔊 Glitch Vocal Effect

Goal: Create experimental glitch vocals

Settings:

tremolo_rate_hz: 12.0
tremolo_depth: 0.85
shift_amount_seconds: 0.025
fricative_hnr_max: 2.5 (below the 3.0 default, so only the noisiest frames are glitched; values above 3.0 glitch more frames)

Result: Intensely processed glitch-art vocals

🎙️ Speech Enhancement

Goal: Add subtle character to spoken word

Settings:

tremolo_rate_hz: 4.0
tremolo_depth: 0.4
shift_amount_seconds: 0.008
vowel_f1_min_hz: 250 (catch weaker vowels)

Result: Enhanced speech with subtle texture

Advanced Techniques

Layered processing:

Apply the effect multiple times with different settings
First pass: subtle tremolo only
Second pass: glitch effects only
Creates complex, evolving textures

Selective processing by phonetic context:

Process only specific phoneme types by adjusting thresholds
High vowel_hnr_threshold + low vowel_f1_min_hz = only clear vowels
Low fricative_hnr_max = only strong fricatives
Create custom phonetic effect profiles

Troubleshooting Common Issues

Problem: Too many frames classified as "Other"
Cause: Classification thresholds too strict
Solution: Lower vowel_hnr_threshold, raise fricative_hnr_max, adjust vowel_f1_min_hz

Problem: Unwanted tremolo on non-vowel sounds
Cause: vowel_hnr_threshold too low, detecting noise as vowels
Solution: Increase vowel_hnr_threshold, check debug output

Problem: Glitch effects causing clicks or distortion
Cause: shift_amount_seconds too large for frame size
Solution: Reduce shift_amount_seconds, increase frame_step_seconds

Problem: Effect sounds "choppy" or artificial
Cause: frame_step_seconds too large, creating abrupt changes
Solution: Decrease frame_step_seconds for smoother transitions

Technical Deep Dive

Acoustic Phonetics Basis

Phonetic Class Acoustic Signatures

Typical acoustic values for different sound classes:

VOWELS (/a/, /i/, /u/):
    HNR: 8-25 dB (high periodicity)
    F0: 80-300 Hz (voiced)
    F1: 250-1000 Hz (vowel height)
    Intensity: Medium-High

FRICATIVES (/s/, /ʃ/, /f/):
    HNR: 0-3 dB (low periodicity, noisy)
    F0: 0 Hz (unvoiced)
    F1: N/A or very high
    Intensity: Variable (/s/ = high, /f/ = low)

SONORANTS (/m/, /n/, /l/):
    HNR: 5-15 dB (moderate periodicity)
    F0: Voiced (like vowels)
    F1: Lower than vowels
    Intensity: Medium

SILENCE/PAUSES:
    HNR: Undefined or very low
    F0: 0 Hz
    Intensity: < 45 dB (background noise)

Praat Analysis Methods

Technical implementation details: The script uses Praat's built-in analysis methods: the Burg method for formant analysis, cross-correlation for harmonicity (HNR estimation), and intensity analysis with a 75 Hz minimum pitch setting, which determines the length of the analysis window rather than applying pre-emphasis. All four analysis objects are queried at the same frame-center time, which keeps the features aligned. Values between analysis frames are interpolated: linearly for pitch and F1, cubically for intensity and HNR.

Effect Algorithm Details

Tremolo Implementation

The tremolo effect uses a sine-wave amplitude modulation:

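    tremolo_phase = (time × tremolo_rate_hz) mod 1
    trem_amount = 0.3 + 0.7 × sin(2π × tremolo_phase)

Where:
    time = current frame center time
    tremolo_rate_hz = modulation frequency
    0.3 = offset (center value of the modulation)
    0.7 = sine amplitude (1.4 peak-to-peak)

Because sin() ranges from -1 to +1, the expression as written swings between -0.4 and 1.0: the gain passes through zero, and the negative portion inverts polarity. Keeping the gain between 30% and 100% of the original, so that modulation dips never reach silence, requires mapping the sine to the 0-1 range first, for example 0.65 + 0.35 × sin(2π × tremolo_phase). Since time is the frame center, trem_amount is constant within a frame and changes in steps from one frame to the next.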
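To see which gain values the tremolo expression produces, it can be evaluated numerically. The Python sketch below computes the expression exactly as printed and, for comparison, a rescaled variant whose gain stays between a floor of 0.3 and 1.0. The second function is an illustrative alternative and is not taken from the script; both function names are hypothetical.

```python
import math


def trem_amount_as_documented(time, tremolo_rate_hz):
    """Evaluate 0.3 + 0.7 * sin(2*pi*phase) exactly as printed in the guide."""
    phase = (time * tremolo_rate_hz) % 1.0
    return 0.3 + 0.7 * math.sin(2 * math.pi * phase)


def trem_amount_unipolar(time, tremolo_rate_hz, floor=0.3):
    """Illustrative variant: map the sine to 0..1 so the gain stays in [floor, 1]."""
    phase = (time * tremolo_rate_hz) % 1.0
    lfo = 0.5 + 0.5 * math.sin(2 * math.pi * phase)  # 0..1
    return floor + (1.0 - floor) * lfo


if __name__ == "__main__":
    # Sample one 1 Hz cycle at quarter-cycle steps.
    for t in (0.0, 0.25, 0.5, 0.75):
        print(t, trem_amount_as_documented(t, 1.0), trem_amount_unipolar(t, 1.0))
```

At three quarters of a cycle the printed expression evaluates to about -0.4, while the rescaled variant bottoms out at 0.3.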

Glitch Effect Implementation

The glitch effect uses time-domain shifting:

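    Formula: "self [x + 'shift_amount_seconds'] * 1.5"

Where:
    x = current time coordinate
    shift_amount_seconds = time displacement
    1.5 = amplification factor (50% boost)

This reads samples from a slightly later time position, so each processed frame plays audio from shift_amount_seconds ahead. The shifted audio replaces the frame rather than being mixed with it, so the audible result is a short skip with possible discontinuities at the frame boundaries. The amplification compensates for potential volume loss.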
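The time-shift idea can be illustrated on a plain list of samples. The Python sketch below replaces each sample in a frame region with the sample shift_amount_seconds later, scaled by 1.5. It is not the script's code, and it makes three assumptions: the shift is rounded to the nearest sample, reads past the end of the sound return zero, and reads come from the unmodified input. The function name `glitch_region` is hypothetical.

```python
def glitch_region(samples, sample_rate, t_start, t_end,
                  shift_amount_seconds=0.015, gain=1.5):
    """Return a copy of samples with [t_start, t_end) replaced by later, boosted audio."""
    out = list(samples)
    shift = round(shift_amount_seconds * sample_rate)  # shift in whole samples
    first = max(0, round(t_start * sample_rate))
    last = min(len(samples), round(t_end * sample_rate))
    for n in range(first, last):
        src = n + shift
        # Read from the original input; treat positions past the end as silence.
        out[n] = samples[src] * gain if src < len(samples) else 0.0
    return out


if __name__ == "__main__":
    ramp = [float(n) for n in range(10)]  # 10 samples at a toy 10 Hz rate
    print(glitch_region(ramp, 10, 0.2, 0.5, shift_amount_seconds=0.2))
```

In the ramp example the value jumps at both edges of the processed region, which is the kind of discontinuity that can become an audible click when the shift is large relative to the frame.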

Performance Considerations

Computational Complexity

Processing time factors:

Audio duration: Linear increase with longer files
Frame step: Inverse relationship (smaller step = more frames)
Analysis parameters: Higher max_formant_hz increases formant analysis time
Hardware: CPU speed and available RAM

Typical processing times:
1-minute speech, default settings: 10-30 seconds
5-minute speech, default settings: 1-3 minutes
Very small frame_step (5ms): 2-3× longer processing

Memory Usage

The script creates multiple analysis objects:

Memory required ≈ 4 × original sound memory

Objects created:
    Original Sound
    Output Sound (copy)
    Pitch object
    Intensity object
    Formant object
    Harmonicity object

For long files, consider processing in segments. Praat may show "out of memory" for very long, high-sample-rate files.
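One way to plan segment-wise processing is to compute the time ranges first and then extract and process each range separately in Praat. The Python sketch below only produces the ranges. The 60-second default and the function name `segment_bounds` are hypothetical choices, not values from the script.

```python
def segment_bounds(duration, segment_seconds=60.0):
    """Split [0, duration] into consecutive (start, end) ranges of at most segment_seconds."""
    if duration <= 0 or segment_seconds <= 0:
        raise ValueError("duration and segment_seconds must be positive")
    bounds = []
    start = 0.0
    while start < duration:
        end = min(duration, start + segment_seconds)  # last segment may be shorter
        bounds.append((start, end))
        start = end
    return bounds


if __name__ == "__main__":
    for start, end in segment_bounds(150.0):
        print(start, end)
```

Cutting at pauses rather than at fixed times is likely to give cleaner joins, since analysis values near a cut are computed from less context.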