Intelligent audio processing that applies stereo vibrato selectively to vocal vowels while preserving consonants and non-vocal sounds using neural network-based phonetic classification.
This script implements neural network-based phonetic vibrato processing — an intelligent audio effect that applies stereo vibrato selectively to vocal vowels while leaving consonants, fricatives, and non-vocal sounds untouched. Unlike traditional vibrato effects that indiscriminately modulate the entire signal, this system uses a trained neural network to classify audio frames into phonetic categories (vowel, fricative, silence, other), then generates "liquid masks" that smoothly blend between dry and vibrato-processed signals based on phonetic content. The result is a natural-sounding vocal enhancement that adds rich stereo vibrato to sustained vowels without affecting speech intelligibility or introducing unnatural modulation to consonants.
Key Features:
Neural Phonetic Classification — feedforward network with one 24-unit hidden layer, trained on 18 acoustic features
Selective Processing — Vibrato only on vowels, dry elsewhere
Liquid Mask Generation — Smooth, continuous mixing between signals
Phase-Inverted Stereo Vibrato — Left/right channels 180° out of phase
Adaptive Voicedness Boost — More vibrato on strongly voiced segments
5 Curated Presets — From subtle thickening to dreamy washes
Why Phonetic-Selective Processing Matters
Traditional vibrato effects apply uniform modulation across all audio content, which can degrade speech intelligibility and sound unnatural on consonants. The human voice naturally exhibits vibrato primarily on sustained vowels during singing or emotional speech, while consonants (especially plosives and fricatives) remain relatively stable. This script mimics that natural behavior by:
(1) Analyzing acoustic features: formants, harmonics, intensity, and MFCCs identify vowel regions.
(2) Neural classification: training a feedforward network to distinguish vowel from non-vowel frames.
(3) Soft mixing: creating smooth gain masks rather than abrupt switching.
(4) Stereo enhancement: applying phase-inverted vibrato for spatial width.
The result preserves the natural character of speech and singing while adding musical enhancement where appropriate.
Technical Implementation
(1) Feature extraction: extract 18 acoustic features per 10 ms frame (MFCCs, formants, pitch, intensity, harmonicity).
(2) Neural training: train a 24-unit feedforward network for 1000 iterations to classify frames as vowel/fricative/silence/other.
(3) Mask generation: apply temperature-scaled softmax to the network outputs and create smooth IntensityTier masks.
(4) Parallel processing: create three signal paths: dry, left vibrato (0° phase), and right vibrato (180° phase).
(5) Mask application: multiply each signal by its respective mask and sum the results.
(6) Stereo synthesis: combine the left and right channels with width control.
The system runs entirely within Praat using its built-in neural network and signal-processing capabilities.
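To make steps 4-6 concrete, here is a minimal Python/NumPy sketch of the parallel-path idea, assuming a sinusoidal LFO-modulated fractional delay for the vibrato. The script's actual Praat implementation may differ in detail; all names here are illustrative.

# ILLUSTRATIVE SKETCH (Python/NumPy): phase-inverted vibrato + mask mixing
import numpy as np

def vibrato(x, sr, rate_hz=5.0, depth_ms=2.0, phase=0.0):
    # Read the signal through an LFO-modulated fractional delay (linear interpolation).
    n = np.arange(len(x))
    delay = (depth_ms / 1000.0) * sr * 0.5 * (1 + np.sin(2 * np.pi * rate_hz * n / sr + phase))
    pos = np.clip(n - delay, 0, len(x) - 1)
    i = pos.astype(int)
    j = np.minimum(i + 1, len(x) - 1)
    frac = pos - i
    return (1 - frac) * x[i] + frac * x[j]

# Three parallel paths, then mask-weighted mixing.
# mask_dry and mask_vib are per-sample linear gains interpolated from the IntensityTier masks:
# vib_l = vibrato(dry, sr, phase=0.0)       # left channel: 0° LFO phase
# vib_r = vibrato(dry, sr, phase=np.pi)     # right channel: 180° LFO phase
# left  = mask_dry * dry + mask_vib * vib_l
# right = mask_dry * dry + mask_vib * vib_r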
Quick start
In Praat, select exactly one Sound object (mono or stereo).
Run script… → neural_phonetic_vibrato.praat.
Choose a Preset (start with "Lush Chorus" for default settings).
Configure neural parameters: Confidence_threshold (lower = more vibrato).
Set Temperature (lower = sharper classification, 0.45 default).
Adjust Voiced_boost (0-1, boosts vibrato on strongly voiced segments).
Set Stereo_width (0-1, 0.9 default for wide stereo).
Enable Play_result to hear output immediately.
Click OK — neural training, analysis, and processing will run.
Quick tip: Start with "Lush Chorus" preset for balanced vocal enhancement. For spoken word, use "Subtle Thickener" with lower depth (1-2ms). For singing/sustained vocals, try "Dreamy Wash" with deeper modulation. The neural network trains automatically on your specific audio (1000 iterations, ~10-30 seconds). Confidence_threshold controls how "sure" the network must be before applying vibrato: lower values = more aggressive vibrato application. Temperature controls classification sharpness: lower values = crisper decisions between vowel/non-vowel. Processing stages: (1) Feature extraction, (2) Neural training, (3) Mask generation, (4) Parallel processing, (5) Stereo mixing. For best results, use clean vocal recordings without heavy background noise.
Important:
MONO CONVERSION: Input is converted to mono for analysis (stereo is restored in the output).
PROCESSING TIME: Neural training takes 10-60 seconds depending on file length.
FEATURE EXTRACTION: Requires Praat's pitch, formant, intensity, MFCC, and harmonicity analyses.
PHONETIC CLASSIFICATION ACCURACY: Neural network performance depends on audio quality and feature separability.
REAL-TIME LIMITATION: Not suitable for live processing; designed for offline enhancement.
VIBRATO ARTIFACTS: Very high depth or rate values may cause unnatural-sounding modulation.
STEREO PHASE: Running the left/right channels 180° out of phase creates width but may cause mono-compatibility issues.
TEMPERATURE EFFECTS: Very low temperature (< 0.1) may create abrupt mask transitions.
VOICED_BOOST: High values may over-emphasize vibrato on already prominent vowels.
Neural Classification Theory
Feedforward Neural Network Architecture
🧠 Feedforward Network for Phonetic Classification (18 → 24 → 4)
Input layer: 18 acoustic features (normalized 0-1)
Hidden layer: 24 units with sigmoid activation
Output layer: 4 units (vowel, fricative, other, silence)
Activation: Softmax with temperature scaling for smooth masks
Network Architecture & Training
# NEURAL NETWORK ARCHITECTURE IN PRAAT
# Created via: To FFNet: hidden_units, 0
# Where hidden_units = 24 (script parameter)
# LAYER STRUCTURE:
Input: 18 features → Hidden: 24 sigmoid units → Output: 4 softmax units
# ACTIVATION FUNCTIONS:
Hidden layer: sigmoid(x) = 1 / (1 + e^{-x})
Output layer: softmax(z)_i = e^{z_i/T} / Σ_j e^{z_j/T}
where T = temperature parameter
# TRAINING PARAMETERS:
Training_iterations = 1000
Train_chunk = 100 (samples per update)
Learning_rate = 0.001
Loss_function = Minimum-squared-error (MSE)
# TRAINING PROCEDURE:
WHILE total_trained < training_iterations:
Learn: train_chunk, learning_rate, "Minimum-squared-error"
total_trained = total_trained + train_chunk
# FEEDFORWARD PROCESS (INFERENCE):
Given input feature vector x:
h = sigmoid(W1·x + b1) # Hidden layer
z = W2·h + b2 # Output logits
y = softmax(z/T) # Temperature-scaled probabilities
# WHERE:
W1: 24×18 weight matrix (input→hidden)
b1: 24×1 bias vector (hidden)
W2: 4×24 weight matrix (hidden→output)
b2: 4×1 bias vector (output)
# PRAAT IMPLEMENTATION:
ffnet = selected("FFNet")
plusObject: pattern
plusObject: output_categories
Learn: train_chunk, learning_rate, "Minimum-squared-error"
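The same inference math is easy to express outside Praat. Below is a minimal NumPy sketch of the 18→24→4 forward pass; it is illustrative only, and `W1`, `b1`, `W2`, `b2` stand in for whatever weights the trained FFNet actually holds.

# ILLUSTRATIVE SKETCH (Python/NumPy): forward pass with temperature-scaled softmax
import numpy as np

def forward(x, W1, b1, W2, b2, T=0.45):
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))   # hidden layer: 24 sigmoid units
    z = W2 @ h + b2                            # output logits: 4 classes
    e = np.exp((z - z.max()) / T)              # temperature scaling, max-subtracted for stability
    return e / e.sum()                         # [p_vowel, p_fricative, p_other, p_silence]

# Shape check with random (untrained) weights:
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(24, 18)), np.zeros(24)
W2, b2 = rng.normal(size=(4, 24)), np.zeros(4)
p = forward(rng.random(18), W1, b1, W2, b2)    # four probabilities summing to 1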
Phonetic Category Definitions
Rule-Based Training Data Generation
Creating ground truth labels from acoustic features:
# PHONETIC CATEGORY RULES (for training data)
# Applied to each 10ms analysis frame:
# 1. SILENCE: Very low intensity
IF intensity < silence_intensity_threshold (45 dB):
category = "silence"
# 2. VOWEL: High harmonicity, has pitch, has formant structure
ELSIF harmonicity > vowel_hnr_threshold (5.0 dB) AND
pitch > 0 AND
F1 > 300 Hz:
category = "vowel"
# 3. FRICATIVE: Moderate intensity, low harmonicity, no pitch
ELSIF intensity > silence_intensity_threshold AND
harmonicity < fricative_hnr_max (3.0 dB) AND
pitch = 0:
category = "fricative"
# 4. OTHER: Everything else (nasals, approximants, transitions)
ELSE:
category = "other"
# PARAMETER SETTINGS:
silence_intensity_threshold = 45 dB
vowel_hnr_threshold = 5.0 dB # Harmonic-to-noise ratio threshold
fricative_hnr_max = 3.0 dB # Maximum HNR for fricatives
# TRAINING DATA GENERATION:
# Create Categories object with one label per frame
Create Categories: "output_categories"
FOR each frame i:
Apply above rules → Append category
# THIS CREATES SUPERVISED TRAINING DATA FOR NEURAL NETWORK
# Network learns to replicate these rules but with smooth probabilities
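The labeling rules translate directly into a few lines of code. A Python sketch, with threshold names mirroring the parameters above:

# ILLUSTRATIVE SKETCH (Python): rule-based ground-truth labels per frame
def label_frame(intensity_db, hnr_db, f0_hz, f1_hz,
                silence_thresh=45.0, vowel_hnr=5.0, fricative_hnr_max=3.0):
    if intensity_db < silence_thresh:
        return "silence"                                  # rule 1: very low intensity
    if hnr_db > vowel_hnr and f0_hz > 0 and f1_hz > 300:
        return "vowel"                                    # rule 2: harmonic, pitched, formant structure
    if hnr_db < fricative_hnr_max and f0_hz == 0:
        return "fricative"                                # rule 3: audible but noisy and unpitched
    return "other"                                        # rule 4: nasals, approximants, transitions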
Temperature-Scaled Softmax
Controlling Classification Certainty
🌡️ Smooth Probability Distributions
Standard softmax (T = 1): baseline sharpness of the probability distribution
Temperature scaling: divides the logits by T to control that sharpness
Effect: higher temperature = more uniform probabilities; lower = more peaked
Application: Creates smooth transitions between categories
Formula: softmax(z/T) where T = temperature parameter
Temperature Scaling Mathematics
# TEMPERATURE-SCALED SOFTMAX
# Standard softmax:
softmax(z)_i = e^{z_i} / Σ_j e^{z_j}
# Temperature-scaled softmax:
softmax(z, T)_i = e^{z_i / T} / Σ_j e^{z_j / T}
# WHERE:
z = [z1, z2, z3, z4] are network output logits
T = temperature parameter (T > 0)
# PROPERTIES:
- T = 1: Standard softmax
- T > 1: Probabilities become more uniform
- T < 1: Probabilities become more peaked
- T → ∞: All probabilities approach 1/4 (uniform)
- T → 0: Approaches one-hot encoding (winner takes all)
# IMPLEMENTATION IN SCRIPT:
tdiv = temperature
if tdiv <= 0.0001: tdiv = 0.0001
# Numerical stability (subtract max):
max_a = max(a1, max(a2, max(a3, a4)))
e1 = exp((a1 - max_a) / tdiv)
e2 = exp((a2 - max_a) / tdiv)
e3 = exp((a3 - max_a) / tdiv)
e4 = exp((a4 - max_a) / tdiv)
denom = e1 + e2 + e3 + e4
if denom <= 0: denom = 1e-12
w_vowel = e1 / denom # Probability of vowel class
w_rest = (e2 + e3 + e4) / denom # Probability of non-vowel classes
# EFFECT ON MASK GENERATION:
High temperature (e.g., 0.8): Smooth, gradual mask transitions
Low temperature (e.g., 0.2): Sharp, abrupt mask transitions
Default temperature (0.45): Balanced, natural transitions
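The numbers make this tangible: the same stabilized softmax applied to one set of logits at three temperatures. A Python sketch; the logit values are made up for illustration.

# ILLUSTRATIVE SKETCH (Python): one logit vector at three temperatures
import math

def tsoftmax(z, T):
    T = max(T, 1e-4)                                      # same floor as the script
    m = max(z)
    e = [math.exp((v - m) / T) for v in z]
    return [v / sum(e) for v in e]

logits = [2.0, 0.5, 0.3, -1.0]                            # [vowel, fricative, other, silence]
for T in (0.2, 0.45, 0.8):
    print(T, [round(p, 3) for p in tsoftmax(logits, T)])
# T = 0.2 → p_vowel ≈ 0.999 (near one-hot); T = 0.45 → ≈ 0.94; T = 0.8 → ≈ 0.77 (noticeably softer)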
Acoustic Feature Extraction
18-Dimensional Feature Vector
📊 Comprehensive Phonetic Characterization
MFCC features (12): Spectral envelope representation
Formant frequencies (3): F1, F2, F3 normalized to kHz
Intensity (1): Normalized to dB scale
Harmonicity (1): Harmonic-to-noise ratio
Pitch (1): Normalized fundamental frequency
Total: 18 features per 10ms frame
Feature Extraction Details
# 18 ACOUSTIC FEATURES EXTRACTED PER FRAME (10ms steps)
# 1. MFCCs (12 coefficients, columns 1-3, 10-18)
MFCC extraction parameters:
Number of coefficients: 12
Window length: 0.025s (25ms)
Time step: 0.01s (10ms)
First filter: 100 Hz
Filter distance: 100 Hz
Features: MFCC 1-12 (cepstral coefficients 1-12)
# 2. FORMANT FREQUENCIES (3 features, columns 4-6)
Formant extraction parameters:
Maximum formant: 5500 Hz
Number of formants: 5
Window length: 0.025s
Time step: 0.01s
Features:
F1_norm = F1 / 1000 # kHz normalized
F2_norm = F2 / 1000
F3_norm = F3 / 1000
# 3. INTENSITY (1 feature, column 7)
Intensity extraction parameters:
Minimum pitch: 75 Hz
Time step: 0.01s
Feature: intensity_norm = (intensity_dB - 60) / 20
# Normalized so 60 dB → 0, 80 dB → 1, 40 dB → -1
# 4. HARMONICITY (1 feature, column 8)
Harmonicity extraction parameters:
Time step: 0.01s
Minimum pitch: 75 Hz
Silence threshold: 0.1
Periods per window: 1.0
Feature: harmonicity_norm = HNR_dB / 20
# Normalized so 20 dB HNR → 1, 0 dB → 0
# 5. PITCH (1 feature, column 9)
Pitch extraction parameters:
Time step: 0.0 (automatic)
Pitch floor: 75 Hz
Pitch ceiling: 600 Hz
Feature: pitch_norm = F0_Hz / 500
# Normalized so 500 Hz → 1; unvoiced frames (undefined F0) are assigned a neutral 0.5 instead
# NORMALIZATION (per feature across entire file):
FOR each feature j (1 to 18):
min_j = minimum(value_j across all frames)
max_j = maximum(value_j across all frames)
range_j = max_j - min_j
IF range_j = 0: range_j = 1
FOR each frame i:
normalized[i,j] = (value[i,j] - min_j) / range_j
# RESULT: N×18 matrix where N = number of frames
# Each row: 18 normalized features for one 10ms frame
# Used as input to neural network
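The per-feature min-max normalization is the one step worth spelling out in code, since a constant column would otherwise divide by zero. A NumPy sketch:

# ILLUSTRATIVE SKETCH (Python/NumPy): column-wise min-max normalization
import numpy as np

def normalize_features(X):
    # X is an N x 18 matrix: one row per 10 ms frame, one column per feature.
    mn, mx = X.min(axis=0), X.max(axis=0)
    rng = np.where(mx - mn == 0, 1.0, mx - mn)            # constant columns: range forced to 1
    return (X - mn) / rng                                 # every feature mapped into [0, 1]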
Frame-Based Processing Parameters
Temporal Resolution & Analysis Windows
# TEMPORAL PROCESSING PARAMETERS
frame_step_seconds = 0.01 # 10ms frame advancement
rows_target = nFrames = duration / frame_step_seconds
# ANALYSIS WINDOWS (different for different features):
MFCC window: 25ms (2.5× frame step, 60% overlap)
Formant window: 25ms
Intensity window: 3.2 / minimum pitch ≈ 43 ms (Praat's default effective window length)
Harmonicity window: 10ms (matches frame step)
Pitch window: adaptive (Praat's default)
# FRAME ALIGNMENT:
All features extracted at time t_i = (i - 0.5) × frame_step_seconds
Example: Frame 1 at 5ms, Frame 2 at 15ms, etc.
# WHY 10MS FRAMES?
- Sufficient temporal resolution for phonetic transitions
- 100 Hz frame rate provides smooth mask generation
- Compatible with most Praat analysis routines
- Balances accuracy and computational load
# FEATURE MATRIX CONSTRUCTION:
Create TableOfReal: "features", rows_target, 18
FOR i = 1 to rows_target:
t = (i - 0.5) × 0.01
Extract all 18 features at time t
Store in row i of feature matrix
# HANDLING UNDEFINED VALUES:
IF feature = undefined (e.g., pitch in unvoiced regions):
Use default/reasonable value:
F1 = 500 Hz, F2 = 1500 Hz, F3 = 2500 Hz
Intensity = 60 dB, Harmonicity = 0 dB
Pitch = 0 Hz → normalized to 0.5
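A small Python sketch of the frame timing and the undefined-value fallback described above; the `safe` helper and the default table are hypothetical names for this behavior, not the script's actual identifiers.

# ILLUSTRATIVE SKETCH (Python): frame centres and undefined-value defaults
import math

FRAME_STEP = 0.01                                         # 10 ms

DEFAULTS = {"F1": 500.0, "F2": 1500.0, "F3": 2500.0,      # values listed above
            "intensity": 60.0, "harmonicity": 0.0, "pitch": 0.0}

def frame_times(duration):
    # Centre of frame i: t_i = (i - 0.5) * step, i.e. 5 ms, 15 ms, 25 ms, ...
    n = int(duration / FRAME_STEP)
    return [(i - 0.5) * FRAME_STEP for i in range(1, n + 1)]

def safe(value, key):
    # Replace an undefined (None/NaN) analysis result with its neutral default.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return DEFAULTS[key]
    return value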
Liquid Mask Generation
From Neural Probabilities to Smooth Masks
🎭 Continuous Gain Control
Input: Neural network class probabilities
Processing: Temperature scaling, adaptive boosting
Output: Two IntensityTier masks (vibrato and dry)
Smoothing: Natural transitions between categories
Application: Multiply audio signals by mask values
Mask Generation Algorithm
# LIQUID MASK GENERATION ALGORITHM
# INPUT: For each frame i at time t_i
a1 = neural output for "vowel" class
a2 = neural output for "fricative" class
a3 = neural output for "other" class
a4 = neural output for "silence" class
# STEP 1: TEMPERATURE-SCALED SOFTMAX
tdiv = temperature (default 0.45)
max_a = max(a1, a2, a3, a4)
e1 = exp((a1 - max_a) / tdiv)
e2 = exp((a2 - max_a) / tdiv)
e3 = exp((a3 - max_a) / tdiv)
e4 = exp((a4 - max_a) / tdiv)
denom = e1 + e2 + e3 + e4
w_vowel = e1 / denom # Probability of vowel
w_rest = (e2 + e3 + e4) / denom # Probability of non-vowel
# STEP 2: ADAPTIVE VOICEDNESS BOOST
# Get normalized harmonicity and pitch from features
norm_hnr = Get value: i, 8 # Harmonicity feature (0-1 normalized)
norm_f0 = Get value: i, 9 # Pitch feature (0-1 normalized)
voicedness = (norm_hnr × 0.5) + (norm_f0 × 0.5) # 0-1 measure
adapt_weight = 1 + voiced_boost × (voicedness - 0.5) × 2
# Apply boost to vowel probability
w_vowel = w_vowel × adapt_weight
# STEP 3: NORMALIZE TO SUM TO 1.0
total = w_vowel + w_rest
w_vowel = w_vowel / total
w_dry = 1.0 - w_vowel # Complement
# STEP 4: CONVERT TO DECIBELS FOR INTENSITYTIER
floor_w = 0.001 # Minimum probability to avoid -inf dB
db_vib = if w_vowel < floor_w then -100 else 20 × log10(w_vowel) fi
db_dry = if w_dry < floor_w then -100 else 20 × log10(w_dry) fi
# STEP 5: CREATE INTENSITYTIER MASKS
Create IntensityTier: "Mask_Vibrato", 0, duration
Create IntensityTier: "Mask_Dry", 0, duration
FOR each frame i:
t = (i - 0.5) × frame_step_seconds
Add point to Mask_Vibrato: t, db_vib[i]
Add point to Mask_Dry: t, db_dry[i]
# RESULT: Two smooth gain envelopes
# Mask_Vibrato: Gain for vibrato-processed signal (high on vowels)
# Mask_Dry: Gain for dry signal (high on non-vowels)
# The two linear gains sum to 1 per frame, keeping overall loudness consistent
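Put together, the five steps reduce to a short per-frame function. A Python sketch under the same defaults; the script performs the equivalent steps in Praat.

# ILLUSTRATIVE SKETCH (Python): per-frame vibrato/dry gains in dB
import math

def mask_gains(a, norm_hnr, norm_f0, temperature=0.45, voiced_boost=0.5):
    # a = [vowel, fricative, other, silence] network activations for one frame
    T = max(temperature, 1e-4)
    m = max(a)
    e = [math.exp((v - m) / T) for v in a]                # step 1: stabilized softmax
    denom = sum(e)
    w_vowel, w_rest = e[0] / denom, (e[1] + e[2] + e[3]) / denom

    voicedness = 0.5 * norm_hnr + 0.5 * norm_f0           # step 2: adaptive voicedness boost
    w_vowel *= 1 + voiced_boost * (voicedness - 0.5) * 2

    w_vowel /= (w_vowel + w_rest)                         # step 3: renormalize to sum to 1
    w_dry = 1.0 - w_vowel

    floor_w = 0.001                                       # step 4: convert to dB with a floor
    def to_db(w):
        return -100.0 if w < floor_w else 20 * math.log10(w)
    return to_db(w_vowel), to_db(w_dry)                   # step 5 writes these to the IntensityTiers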
🌙 Dreamy Wash (Preset 3)
Purpose: Deep, dreamy modulation for singing and sustained vocals
Result: Pronounced, ethereal vibrato with wide stereo
🌀 Experimental Warble (Preset 4)
Purpose: Creative sound design
Settings:
Vibrato rate: 8.0 Hz
Vibrato depth: 3.0 ms
Confidence threshold: 0.1
Temperature: 0.25
Stereo width: 0.7
Voiced boost: 0.8
Result: Pronounced, rhythmic modulation effects
🎯 Precision Vowel Enhancer (Preset 5)
Purpose: Surgical vowel-only vibrato
Settings:
Vibrato rate: 5.5 Hz
Vibrato depth: 1.8 ms
Confidence threshold: 0.5
Temperature: 0.3
Stereo width: 0.5
Voiced boost: 0.3
Result: Surgical application only to clear vowel regions
Advanced Techniques
Cascading with other effects:
Reverb after vibrato: Creates spacious, ethereal vocals
Compression before vibrato: Evens out dynamics for consistent modulation
EQ before classification: Enhance formants for better vowel detection
Parallel processing: Blend dry/wet versions for control
Processing order matters: Neural analysis works best on clean, unprocessed audio
Parameter interaction guidelines:
Depth vs. Rate: Higher rates need lower depth for natural sound
Temperature vs. Threshold: Lower temperature = sharper masks, may need higher threshold
Voiced boost vs. Depth: High boost + high depth = very pronounced vibrato on strong vowels
Stereo width vs. Mono compatibility: Width > 0.5 may cause phase issues in mono
Always listen in both stereo and mono to check compatibility
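As a quick objective check alongside listening, a rough mono-compatibility measure can be computed by summing the channels and comparing levels. A Python sketch, illustrative and not part of the script:

# ILLUSTRATIVE SKETCH (Python/NumPy): rough mono-compatibility check
import numpy as np

def mono_drop_db(left, right):
    # Level change (dB) when the stereo output is summed to mono.
    rms = lambda x: np.sqrt(np.mean(np.square(x)) + 1e-20)
    avg_channel_rms = 0.5 * (rms(left) + rms(right))
    mono_rms = rms(0.5 * (left + right))
    return 20 * np.log10(mono_rms / avg_channel_rms)

# Values well below 0 dB indicate phase cancellation; reduce stereo_width if so.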
Troubleshooting Common Issues
Problem: Vibrato applied to consonants or breath sounds
Cause: Confidence threshold too low, or poor feature separation
Solution: Increase confidence_threshold; check audio quality

Problem: Abrupt transitions between vowel and non-vowel regions
Cause: Temperature too low, creating near-binary masks
Solution: Increase temperature (0.5-0.8) for smoother transitions

Problem: Weak or no vibrato on clear vowels
Cause: Confidence threshold too high, or voiced_boost too low
Solution: Decrease confidence_threshold; increase voiced_boost

Problem: Stereo image collapses in mono
Cause: Phase cancellation from the 180° inverted vibrato
Solution: Reduce stereo_width; check mono compatibility

Problem: Processing time too long
Cause: Long audio file, or many training iterations
Solution: Process shorter segments, or reduce training_iterations