Intelligent audio processing that applies stereo vibrato selectively to vocal vowels while preserving consonants and non-vocal sounds using neural network-based phonetic classification.
This script implements neural network-based phonetic vibrato processing — an intelligent audio effect that applies stereo vibrato selectively to vocal vowels while leaving consonants, fricatives, and non-vocal sounds untouched. Unlike traditional vibrato effects that indiscriminately modulate the entire signal, this system uses a trained neural network to classify audio frames into phonetic categories (vowel, fricative, silence, other), then generates "liquid masks" that smoothly blend between dry and vibrato-processed signals based on phonetic content. The result is a natural-sounding vocal enhancement that adds rich stereo vibrato to sustained vowels without affecting speech intelligibility or introducing unnatural modulation to consonants.
Key Features:
Neural Phonetic Classification — feedforward network with one 24-unit hidden layer, trained on 18 acoustic features
Selective Processing — Vibrato only on vowels, dry elsewhere
Liquid Mask Generation — Smooth, continuous mixing between signals
Phase-Inverted Stereo Vibrato — Left/right channels 180° out of phase
Adaptive Voicedness Boost — More vibrato on strongly voiced segments
5 Curated Presets — From subtle thickening to dreamy washes
Why Phonetic-Selective Processing Matters
Traditional vibrato effects apply uniform modulation across all audio content, which can degrade speech intelligibility and sound unnatural on consonants. The human voice naturally exhibits vibrato primarily on sustained vowels during singing or emotional speech, while consonants (especially plosives and fricatives) remain relatively stable. This script mimics that natural behavior by:
(1) Analyzing acoustic features: formants, harmonics, intensity, and MFCCs identify vowel regions.
(2) Neural classification: training a feedforward network to distinguish vowel from non-vowel frames.
(3) Soft mixing: creating smooth gain masks rather than abrupt switching.
(4) Stereo enhancement: applying phase-inverted vibrato for spatial width.
The result preserves the natural character of speech and singing while adding musical enhancement where appropriate.
Technical Implementation
(1) Feature extraction: extract 18 acoustic features per 10 ms frame (MFCCs, formants, pitch, intensity, harmonicity).
(2) Neural training: train a 24-unit feedforward network for 1000 iterations to classify frames as vowel/fricative/silence/other.
(3) Mask generation: apply temperature-scaled softmax to the network outputs and create smooth IntensityTier masks.
(4) Parallel processing: create three signal paths: dry, left vibrato (0° phase), and right vibrato (180° phase).
(5) Mask application: multiply each signal by its respective mask and sum the results.
(6) Stereo synthesis: combine the left and right channels with width control.
The system runs entirely within Praat using its built-in neural network and signal-processing capabilities.
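To make steps 4-6 concrete, here is a minimal Python/NumPy sketch of the parallel-path idea, assuming a sinusoidal LFO-modulated fractional delay for the vibrato. The script's actual Praat implementation may differ in detail; all names here are illustrative.

# ILLUSTRATIVE SKETCH (Python/NumPy): phase-inverted vibrato + mask mixing
import numpy as np

def vibrato(x, sr, rate_hz=5.0, depth_ms=2.0, phase=0.0):
    # Read the signal through an LFO-modulated fractional delay (linear interpolation).
    n = np.arange(len(x))
    delay = (depth_ms / 1000.0) * sr * 0.5 * (1 + np.sin(2 * np.pi * rate_hz * n / sr + phase))
    pos = np.clip(n - delay, 0, len(x) - 1)
    i = pos.astype(int)
    j = np.minimum(i + 1, len(x) - 1)
    frac = pos - i
    return (1 - frac) * x[i] + frac * x[j]

# Three parallel paths, then mask-weighted mixing.
# mask_dry and mask_vib are per-sample linear gains interpolated from the IntensityTier masks:
# vib_l = vibrato(dry, sr, phase=0.0)       # left channel: 0° LFO phase
# vib_r = vibrato(dry, sr, phase=np.pi)     # right channel: 180° LFO phase
# left  = mask_dry * dry + mask_vib * vib_l
# right = mask_dry * dry + mask_vib * vib_r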
Quick start
In Praat, select exactly one Sound object (mono or stereo).
Run script… → neural_phonetic_vibrato.praat.
Choose a Preset (start with "Lush Chorus" for default settings).
Configure neural parameters: Confidence_threshold (lower = more vibrato).
Set Temperature (lower = sharper classification, 0.45 default).
Adjust Voiced_boost (0-1, boosts vibrato on strongly voiced segments).
Set Stereo_width (0-1, 0.9 default for wide stereo).
Enable Play_result to hear output immediately.
Click OK — neural training, analysis, and processing will run.
Quick tip: Start with "Lush Chorus" preset for balanced vocal enhancement. For spoken word, use "Subtle Thickener" with lower depth (1-2ms). For singing/sustained vocals, try "Dreamy Wash" with deeper modulation. The neural network trains automatically on your specific audio (1000 iterations, ~10-30 seconds). Confidence_threshold controls how "sure" the network must be before applying vibrato: lower values = more aggressive vibrato application. Temperature controls classification sharpness: lower values = crisper decisions between vowel/non-vowel. Processing stages: (1) Feature extraction, (2) Neural training, (3) Mask generation, (4) Parallel processing, (5) Stereo mixing. For best results, use clean vocal recordings without heavy background noise.
Important:
MONO CONVERSION: Input is converted to mono for analysis (stereo is restored in the output).
PROCESSING TIME: Neural training takes 10-60 seconds depending on file length.
FEATURE EXTRACTION: Requires Praat's pitch, formant, intensity, MFCC, and harmonicity analyses.
PHONETIC CLASSIFICATION ACCURACY: Neural network performance depends on audio quality and feature separability.
REAL-TIME LIMITATION: Not suitable for live processing; designed for offline enhancement.
VIBRATO ARTIFACTS: Very high depth or rate values may cause unnatural-sounding modulation.
STEREO PHASE: Running the left/right channels 180° out of phase creates width but may cause mono-compatibility issues.
TEMPERATURE EFFECTS: Very low temperature (< 0.1) may create abrupt mask transitions.
VOICED_BOOST: High values may over-emphasize vibrato on already prominent vowels.
Neural Classification Theory
Feedforward Neural Network Architecture
🧠 Feedforward Network for Phonetic Classification (18 → 24 → 4)
Input layer: 18 acoustic features (normalized 0-1)
Hidden layer: 24 units with sigmoid activation
Output layer: 4 units (vowel, fricative, other, silence)
Activation: Softmax with temperature scaling for smooth masks
Network Architecture & Training
# NEURAL NETWORK ARCHITECTURE IN PRAAT
# Created via: To FFNet: hidden_units, 0
# Where hidden_units = 24 (script parameter)
# LAYER STRUCTURE:
Input: 18 features → Hidden: 24 sigmoid units → Output: 4 softmax units
# ACTIVATION FUNCTIONS:
Hidden layer: sigmoid(x) = 1 / (1 + e^{-x})
Output layer: softmax(z)_i = e^{z_i/T} / Σ_j e^{z_j/T}
where T = temperature parameter
# TRAINING PARAMETERS:
Training_iterations = 1000
Train_chunk = 100 (samples per update)
Learning_rate = 0.001
Loss_function = Minimum-squared-error (MSE)
# TRAINING PROCEDURE:
WHILE total_trained < training_iterations:
Learn: train_chunk, learning_rate, "Minimum-squared-error"
total_trained = total_trained + train_chunk
# FEEDFORWARD PROCESS (INFERENCE):
Given input feature vector x:
h = sigmoid(W1·x + b1) # Hidden layer
z = W2·h + b2 # Output logits
y = softmax(z/T) # Temperature-scaled probabilities
# WHERE:
W1: 24×18 weight matrix (input→hidden)
b1: 24×1 bias vector (hidden)
W2: 4×24 weight matrix (hidden→output)
b2: 4×1 bias vector (output)
# PRAAT IMPLEMENTATION:
ffnet = selected("FFNet")
plusObject: pattern
plusObject: output_categories
Learn: train_chunk, learning_rate, "Minimum-squared-error"
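The same inference math is easy to express outside Praat. Below is a minimal NumPy sketch of the 18→24→4 forward pass; it is illustrative only, and `W1`, `b1`, `W2`, `b2` stand in for whatever weights the trained FFNet actually holds.

# ILLUSTRATIVE SKETCH (Python/NumPy): forward pass with temperature-scaled softmax
import numpy as np

def forward(x, W1, b1, W2, b2, T=0.45):
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))   # hidden layer: 24 sigmoid units
    z = W2 @ h + b2                            # output logits: 4 classes
    e = np.exp((z - z.max()) / T)              # temperature scaling, max-subtracted for stability
    return e / e.sum()                         # [p_vowel, p_fricative, p_other, p_silence]

# Shape check with random (untrained) weights:
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(24, 18)), np.zeros(24)
W2, b2 = rng.normal(size=(4, 24)), np.zeros(4)
p = forward(rng.random(18), W1, b1, W2, b2)    # four probabilities summing to 1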
Phonetic Category Definitions
Rule-Based Training Data Generation
Creating ground truth labels from acoustic features:
# PHONETIC CATEGORY RULES (for training data)
# Applied to each 10ms analysis frame:
# 1. SILENCE: Very low intensity
IF intensity < silence_intensity_threshold (45 dB):
category = "silence"
# 2. VOWEL: High harmonicity, has pitch, has formant structure
ELSIF harmonicity > vowel_hnr_threshold (5.0 dB) AND
pitch > 0 AND
F1 > 300 Hz:
category = "vowel"
# 3. FRICATIVE: Moderate intensity, low harmonicity, no pitch
ELSIF intensity > silence_intensity_threshold AND
harmonicity < fricative_hnr_max (3.0 dB) AND
pitch = 0:
category = "fricative"
# 4. OTHER: Everything else (nasals, approximants, transitions)
ELSE:
category = "other"
# PARAMETER SETTINGS:
silence_intensity_threshold = 45 dB
vowel_hnr_threshold = 5.0 dB # Harmonic-to-noise ratio threshold
fricative_hnr_max = 3.0 dB # Maximum HNR for fricatives
# TRAINING DATA GENERATION:
# Create Categories object with one label per frame
Create Categories: "output_categories"
FOR each frame i:
Apply above rules → Append category
# THIS CREATES SUPERVISED TRAINING DATA FOR NEURAL NETWORK
# Network learns to replicate these rules but with smooth probabilities
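The labeling rules translate directly into a few lines of code. A Python sketch, with threshold names mirroring the parameters above:

# ILLUSTRATIVE SKETCH (Python): rule-based ground-truth labels per frame
def label_frame(intensity_db, hnr_db, f0_hz, f1_hz,
                silence_thresh=45.0, vowel_hnr=5.0, fricative_hnr_max=3.0):
    if intensity_db < silence_thresh:
        return "silence"                                  # rule 1: very low intensity
    if hnr_db > vowel_hnr and f0_hz > 0 and f1_hz > 300:
        return "vowel"                                    # rule 2: harmonic, pitched, formant structure
    if hnr_db < fricative_hnr_max and f0_hz == 0:
        return "fricative"                                # rule 3: audible but noisy and unpitched
    return "other"                                        # rule 4: nasals, approximants, transitions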
Temperature-Scaled Softmax
Controlling Classification Certainty
🌡️ Smooth Probability Distributions
Standard softmax (T = 1): baseline sharpness of the probability distribution
Temperature scaling: divides the logits by T to control that sharpness
Effect: higher temperature = more uniform probabilities; lower = more peaked
Application: Creates smooth transitions between categories
Formula: softmax(z/T) where T = temperature parameter
Temperature Scaling Mathematics
# TEMPERATURE-SCALED SOFTMAX
# Standard softmax:
softmax(z)_i = e^{z_i} / Σ_j e^{z_j}
# Temperature-scaled softmax:
softmax(z, T)_i = e^{z_i / T} / Σ_j e^{z_j / T}
# WHERE:
z = [z1, z2, z3, z4] are network output logits
T = temperature parameter (T > 0)
# PROPERTIES:
- T = 1: Standard softmax
- T > 1: Probabilities become more uniform
- T < 1: Probabilities become more peaked
- T → ∞: All probabilities approach 1/4 (uniform)
- T → 0: Approaches one-hot encoding (winner takes all)
# IMPLEMENTATION IN SCRIPT:
tdiv = temperature
if tdiv <= 0.0001: tdiv = 0.0001
# Numerical stability (subtract max):
max_a = max(a1, max(a2, max(a3, a4)))
e1 = exp((a1 - max_a) / tdiv)
e2 = exp((a2 - max_a) / tdiv)
e3 = exp((a3 - max_a) / tdiv)
e4 = exp((a4 - max_a) / tdiv)
denom = e1 + e2 + e3 + e4
if denom <= 0: denom = 1e-12
w_vowel = e1 / denom # Probability of vowel class
w_rest = (e2 + e3 + e4) / denom # Probability of non-vowel classes
# EFFECT ON MASK GENERATION:
High temperature (e.g., 0.8): Smooth, gradual mask transitions
Low temperature (e.g., 0.2): Sharp, abrupt mask transitions
Default temperature (0.45): Balanced, natural transitions
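The numbers make this tangible: the same stabilized softmax applied to one set of logits at three temperatures. A Python sketch; the logit values are made up for illustration.

# ILLUSTRATIVE SKETCH (Python): one logit vector at three temperatures
import math

def tsoftmax(z, T):
    T = max(T, 1e-4)                                      # same floor as the script
    m = max(z)
    e = [math.exp((v - m) / T) for v in z]
    return [v / sum(e) for v in e]

logits = [2.0, 0.5, 0.3, -1.0]                            # [vowel, fricative, other, silence]
for T in (0.2, 0.45, 0.8):
    print(T, [round(p, 3) for p in tsoftmax(logits, T)])
# T = 0.2 → p_vowel ≈ 0.999 (near one-hot); T = 0.45 → ≈ 0.94; T = 0.8 → ≈ 0.77 (noticeably softer)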
Acoustic Feature Extraction
18-Dimensional Feature Vector
📊 Comprehensive Phonetic Characterization
MFCC features (12): Spectral envelope representation
Formant frequencies (3): F1, F2, F3 normalized to kHz
Intensity (1): Normalized to dB scale
Harmonicity (1): Harmonic-to-noise ratio
Pitch (1): Normalized fundamental frequency
Total: 18 features per 10ms frame
Feature Extraction Details
# 18 ACOUSTIC FEATURES EXTRACTED PER FRAME (10ms steps)
# 1. MFCCs (12 coefficients, columns 1-3, 10-18)
MFCC extraction parameters:
Number of coefficients: 12
Window length: 0.025s (25ms)
Time step: 0.01s (10ms)
First filter: 100 Hz
Filter distance: 100 Hz
Features: MFCC 1-12 (cepstral coefficients 1-12)
# 2. FORMANT FREQUENCIES (3 features, columns 4-6)
Formant extraction parameters:
Maximum formant: 5500 Hz
Number of formants: 5
Window length: 0.025s
Time step: 0.01s
Features:
F1_norm = F1 / 1000 # kHz normalized
F2_norm = F2 / 1000
F3_norm = F3 / 1000
# 3. INTENSITY (1 feature, column 7)
Intensity extraction parameters:
Minimum pitch: 75 Hz
Time step: 0.01s
Feature: intensity_norm = (intensity_dB - 60) / 20
# Normalized so 60 dB → 0, 80 dB → 1, 40 dB → -1
# 4. HARMONICITY (1 feature, column 8)
Harmonicity extraction parameters:
Time step: 0.01s
Minimum pitch: 75 Hz
Silence threshold: 0.1
Periods per window: 1.0
Feature: harmonicity_norm = HNR_dB / 20
# Normalized so 20 dB HNR → 1, 0 dB → 0
# 5. PITCH (1 feature, column 9)
Pitch extraction parameters:
Time step: 0.0 (automatic)
Pitch floor: 75 Hz
Pitch ceiling: 600 Hz
Feature: pitch_norm = F0_Hz / 500
# Normalized so 500 Hz → 1; unvoiced frames (undefined F0) are assigned a neutral 0.5 instead
# NORMALIZATION (per feature across entire file):
FOR each feature j (1 to 18):
min_j = minimum(value_j across all frames)
max_j = maximum(value_j across all frames)
range_j = max_j - min_j
IF range_j = 0: range_j = 1
FOR each frame i:
normalized[i,j] = (value[i,j] - min_j) / range_j
# RESULT: N×18 matrix where N = number of frames
# Each row: 18 normalized features for one 10ms frame
# Used as input to neural network
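The per-feature min-max normalization is the one step worth spelling out in code, since a constant column would otherwise divide by zero. A NumPy sketch:

# ILLUSTRATIVE SKETCH (Python/NumPy): column-wise min-max normalization
import numpy as np

def normalize_features(X):
    # X is an N x 18 matrix: one row per 10 ms frame, one column per feature.
    mn, mx = X.min(axis=0), X.max(axis=0)
    rng = np.where(mx - mn == 0, 1.0, mx - mn)            # constant columns: range forced to 1
    return (X - mn) / rng                                 # every feature mapped into [0, 1]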
Frame-Based Processing Parameters
Temporal Resolution & Analysis Windows
# TEMPORAL PROCESSING PARAMETERS
frame_step_seconds = 0.01 # 10ms frame advancement
rows_target = nFrames = duration / frame_step_seconds
# ANALYSIS WINDOWS (different for different features):
MFCC window: 25ms (2.5× frame step, 60% overlap)
Formant window: 25ms
Intensity window: 3.2 / minimum pitch ≈ 43 ms (Praat's default effective window length)
Harmonicity window: 10ms (matches frame step)
Pitch window: adaptive (Praat's default)
# FRAME ALIGNMENT:
All features extracted at time t_i = (i - 0.5) × frame_step_seconds
Example: Frame 1 at 5ms, Frame 2 at 15ms, etc.
# WHY 10MS FRAMES?
- Sufficient temporal resolution for phonetic transitions
- 100 Hz frame rate provides smooth mask generation
- Compatible with most Praat analysis routines
- Balances accuracy and computational load
# FEATURE MATRIX CONSTRUCTION:
Create TableOfReal: "features", rows_target, 18
FOR i = 1 to rows_target:
t = (i - 0.5) × 0.01
Extract all 18 features at time t
Store in row i of feature matrix
# HANDLING UNDEFINED VALUES:
IF feature = undefined (e.g., pitch in unvoiced regions):
Use default/reasonable value:
F1 = 500 Hz, F2 = 1500 Hz, F3 = 2500 Hz
Intensity = 60 dB, Harmonicity = 0 dB
Pitch = 0 Hz → normalized to 0.5
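A small Python sketch of the frame timing and the undefined-value fallback described above; the `safe` helper and the default table are hypothetical names for this behavior, not the script's actual identifiers.

# ILLUSTRATIVE SKETCH (Python): frame centres and undefined-value defaults
import math

FRAME_STEP = 0.01                                         # 10 ms

DEFAULTS = {"F1": 500.0, "F2": 1500.0, "F3": 2500.0,      # values listed above
            "intensity": 60.0, "harmonicity": 0.0, "pitch": 0.0}

def frame_times(duration):
    # Centre of frame i: t_i = (i - 0.5) * step, i.e. 5 ms, 15 ms, 25 ms, ...
    n = int(duration / FRAME_STEP)
    return [(i - 0.5) * FRAME_STEP for i in range(1, n + 1)]

def safe(value, key):
    # Replace an undefined (None/NaN) analysis result with its neutral default.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return DEFAULTS[key]
    return value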
Liquid Mask Generation
From Neural Probabilities to Smooth Masks
🎭 Continuous Gain Control
Input: Neural network class probabilities
Processing: Temperature scaling, adaptive boosting
Output: Two IntensityTier masks (vibrato and dry)
Smoothing: Natural transitions between categories
Application: Multiply audio signals by mask values
Mask Generation Algorithm
# LIQUID MASK GENERATION ALGORITHM
# INPUT: For each frame i at time t_i
a1 = neural output for "vowel" class
a2 = neural output for "fricative" class
a3 = neural output for "other" class
a4 = neural output for "silence" class
# STEP 1: TEMPERATURE-SCALED SOFTMAX
tdiv = temperature (default 0.45)
max_a = max(a1, a2, a3, a4)
e1 = exp((a1 - max_a) / tdiv)
e2 = exp((a2 - max_a) / tdiv)
e3 = exp((a3 - max_a) / tdiv)
e4 = exp((a4 - max_a) / tdiv)
denom = e1 + e2 + e3 + e4
w_vowel = e1 / denom # Probability of vowel
w_rest = (e2 + e3 + e4) / denom # Probability of non-vowel
# STEP 2: ADAPTIVE VOICEDNESS BOOST
# Get normalized harmonicity and pitch from features
norm_hnr = Get value: i, 8 # Harmonicity feature (0-1 normalized)
norm_f0 = Get value: i, 9 # Pitch feature (0-1 normalized)
voicedness = (norm_hnr × 0.5) + (norm_f0 × 0.5) # 0-1 measure
adapt_weight = 1 + voiced_boost × (voicedness - 0.5) × 2
# Apply boost to vowel probability
w_vowel = w_vowel × adapt_weight
# STEP 3: NORMALIZE TO SUM TO 1.0
total = w_vowel + w_rest
w_vowel = w_vowel / total
w_dry = 1.0 - w_vowel # Complement
# STEP 4: CONVERT TO DECIBELS FOR INTENSITYTIER
floor_w = 0.001 # Minimum probability to avoid -inf dB
db_vib = if w_vowel < floor_w then -100 else 20 × log10(w_vowel) fi
db_dry = if w_dry < floor_w then -100 else 20 × log10(w_dry) fi
# STEP 5: CREATE INTENSITYTIER MASKS
Create IntensityTier: "Mask_Vibrato", 0, duration
Create IntensityTier: "Mask_Dry", 0, duration
FOR each frame i:
t = (i - 0.5) × frame_step_seconds
Add point to Mask_Vibrato: t, db_vib[i]
Add point to Mask_Dry: t, db_dry[i]
# RESULT: Two smooth gain envelopes
# Mask_Vibrato: Gain for vibrato-processed signal (high on vowels)
# Mask_Dry: Gain for dry signal (high on non-vowels)
# The two linear gains sum to 1 per frame, keeping overall loudness consistent
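Put together, the five steps reduce to a short per-frame function. A Python sketch under the same defaults; the script performs the equivalent steps in Praat.

# ILLUSTRATIVE SKETCH (Python): per-frame vibrato/dry gains in dB
import math

def mask_gains(a, norm_hnr, norm_f0, temperature=0.45, voiced_boost=0.5):
    # a = [vowel, fricative, other, silence] network activations for one frame
    T = max(temperature, 1e-4)
    m = max(a)
    e = [math.exp((v - m) / T) for v in a]                # step 1: stabilized softmax
    denom = sum(e)
    w_vowel, w_rest = e[0] / denom, (e[1] + e[2] + e[3]) / denom

    voicedness = 0.5 * norm_hnr + 0.5 * norm_f0           # step 2: adaptive voicedness boost
    w_vowel *= 1 + voiced_boost * (voicedness - 0.5) * 2

    w_vowel /= (w_vowel + w_rest)                         # step 3: renormalize to sum to 1
    w_dry = 1.0 - w_vowel

    floor_w = 0.001                                       # step 4: convert to dB with a floor
    def to_db(w):
        return -100.0 if w < floor_w else 20 * math.log10(w)
    return to_db(w_vowel), to_db(w_dry)                   # step 5 writes these to the IntensityTiers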
🌙 Dreamy Wash (Preset 3)
Purpose: Deep, dreamy modulation for singing and sustained vocals
Result: Pronounced, ethereal vibrato with wide stereo
🌀 Experimental Warble (Preset 4)
Purpose: Creative sound design
Settings:
Vibrato rate: 8.0 Hz
Vibrato depth: 3.0 ms
Confidence threshold: 0.1
Temperature: 0.25
Stereo width: 0.7
Voiced boost: 0.8
Result: Pronounced, rhythmic modulation effects
🎯 Precision Vowel Enhancer (Preset 5)
Purpose: Surgical vowel-only vibrato
Settings:
Vibrato rate: 5.5 Hz
Vibrato depth: 1.8 ms
Confidence threshold: 0.5
Temperature: 0.3
Stereo width: 0.5
Voiced boost: 0.3
Result: Surgical application only to clear vowel regions
Advanced Techniques
Cascading with other effects:
Reverb after vibrato: Creates spacious, ethereal vocals
Compression before vibrato: Evens out dynamics for consistent modulation
EQ before classification: Enhance formants for better vowel detection
Parallel processing: Blend dry/wet versions for control
Processing order matters: Neural analysis works best on clean, unprocessed audio
Parameter interaction guidelines:
Depth vs. Rate: Higher rates need lower depth for natural sound
Temperature vs. Threshold: Lower temperature = sharper masks, may need higher threshold
Voiced boost vs. Depth: High boost + high depth = very pronounced vibrato on strong vowels
Stereo width vs. Mono compatibility: Width > 0.5 may cause phase issues in mono
Always listen in both stereo and mono to check compatibility
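As a quick objective check alongside listening, a rough mono-compatibility measure can be computed by summing the channels and comparing levels. A Python sketch, illustrative and not part of the script:

# ILLUSTRATIVE SKETCH (Python/NumPy): rough mono-compatibility check
import numpy as np

def mono_drop_db(left, right):
    # Level change (dB) when the stereo output is summed to mono.
    rms = lambda x: np.sqrt(np.mean(np.square(x)) + 1e-20)
    avg_channel_rms = 0.5 * (rms(left) + rms(right))
    mono_rms = rms(0.5 * (left + right))
    return 20 * np.log10(mono_rms / avg_channel_rms)

# Values well below 0 dB indicate phase cancellation; reduce stereo_width if so.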
Troubleshooting Common Issues
Problem: Vibrato applied to consonants or breath sounds
Cause: Confidence threshold too low, or poor feature separation
Solution: Increase confidence_threshold; check audio quality

Problem: Abrupt transitions between vowel and non-vowel regions
Cause: Temperature too low, creating near-binary masks
Solution: Increase temperature (0.5-0.8) for smoother transitions

Problem: Weak or no vibrato on clear vowels
Cause: Confidence threshold too high, or voiced_boost too low
Solution: Decrease confidence_threshold; increase voiced_boost

Problem: Stereo image collapses in mono
Cause: Phase cancellation from the 180° inverted vibrato
Solution: Reduce stereo_width; check mono compatibility

Problem: Processing time too long
Cause: Long audio file, or many training iterations
Solution: Process shorter segments, or reduce training_iterations