Neural Phonetic Speed Mapper – User Guide

Uses a feedforward neural network to intelligently speed up or slow down different phonetic segments of speech based on acoustic features, applying differential time-scaling to vowels, fricatives, consonants, and silence.

Category: Speech Analysis / Processing
Praat Script: FFNet_Adaptive_Speed.praat

Contents:

What this does
Quick start
Parameters
Outputs
Technical details

What this does

This Praat script changes speech rate adaptively: a feedforward neural network classifies each analysis frame by sound type, and each type is time-scaled by its own factor. Unlike uniform time-stretching, the script extracts 18 acoustic features (MFCCs, formants, pitch, intensity, harmonicity) every 5 milliseconds, trains the network to label each frame as vowel, fricative, other consonant, or silence, and then applies an independent speed multiplier to each category. Speech can therefore be accelerated substantially while staying intelligible, because critical sounds (especially fricatives) keep relatively more time. To keep the output natural-sounding, the script combines early-stopping training, voicing-adaptive boosting, confidence-based mixing, temporal smoothing, and PSOLA resynthesis. Speed is configurable per phonetic class, from strong slow-down to roughly 3× acceleration (the default vowel multiplier is 3.2).

Quick start

In Praat, select exactly one Sound object.
Run script… → FFNet_Adaptive_Speed.praat.
Adjust the four speed parameters (speed_vowel, speed_fric, speed_other, speed_silence) to control how each phonetic category is time-scaled.
Optionally modify neural network settings (hidden_units, training_iterations) or smoothing parameters (smooth_ms, min_seg_ms).
Click OK.
The output object, named [OriginalName]_ffnet_speeded_adaptive_es, is created. Detailed training and processing statistics appear in the Info window.

Parameters (form fields)

Speed Multipliers

Name (GUI)	Type	Default	Description
speed_vowel	positive	3.2	Time-scale factor for vowel segments. Values >1 speed up, <1 slow down.
speed_fric	positive	0.30	Time-scale factor for fricative consonants (s, f, sh, th, etc.).
speed_other	positive	2.4	Time-scale factor for other consonants (stops, nasals, approximants).
speed_silence	positive	1.0	Time-scale factor for silence/pauses (typically left at 1.0).
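
As a rough illustration of how the four multipliers interact, the sketch below estimates total output duration from per-class durations. It assumes a segment scaled by speed factor s ends up 1/s as long, and it ignores smoothing, confidence mixing, and boosting. The function name and the example durations are hypothetical, not taken from the script.

```python
def output_duration(durations_s, speeds):
    """Total output duration when each class is time-scaled independently.

    Assumes a speed factor s shortens (s > 1) or lengthens (s < 1) a
    segment to 1/s of its original length.
    """
    return sum(durations_s[name] / speeds[name] for name in durations_s)


# Default multipliers from the Speed Multipliers table.
default_speeds = {"vowel": 3.2, "fric": 0.30, "other": 2.4, "silence": 1.0}

# Hypothetical class durations for a 1-second utterance (not measured data).
example_durations = {"vowel": 0.5, "fric": 0.2, "other": 0.2, "silence": 0.1}

total = output_duration(example_durations, default_speeds)
print(f"{total:.3f} s")
```

With these hypothetical durations the total comes out at about 1.006 s, so under this simplification the defaults mostly move time from vowels to fricatives rather than shortening the utterance overall.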

Enable/Disable Phonetic Classes

Name (GUI)	Type	Default	Description
enable_vowel	boolean	1	Include vowels in adaptive speed modification.
enable_fric	boolean	1	Include fricatives in adaptive speed modification.
enable_other	boolean	1	Include other consonants in adaptive speed modification.
enable_silence	boolean	1	Include silence in adaptive speed modification.

Neural Network Configuration

Name (GUI)	Type	Default	Description
hidden_units	integer	24	Number of neurons in the hidden layer of the feedforward network.
training_iterations	integer	1000	Maximum number of training epochs (subject to early stopping).
train_chunk	integer	100	Number of training iterations run per chunk; convergence is checked after each chunk.
learning_rate	positive	0.001	Neural network learning rate for backpropagation.
early_stop_delta	positive	0.005	Activation change threshold below which training is considered converged.
early_stop_patience	integer	3	Number of consecutive unchanged chunks before early stopping triggers.
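
The chunked early-stopping parameters can be read as the loop sketched below. This is a minimal illustration in Python, not the script's Praat code: train_chunk_fn is a hypothetical callback that runs n training iterations and returns one scalar summarizing the network's activations, and the comparison of consecutive chunk values is an assumption about how "activation change" is measured.

```python
def train_with_early_stopping(train_chunk_fn, training_iterations=1000,
                              train_chunk=100, early_stop_delta=0.005,
                              early_stop_patience=3):
    """Train in chunks and stop once the monitored value stops changing.

    train_chunk_fn(n) is a hypothetical callback: it runs n iterations and
    returns a scalar summary of the activations after that chunk.
    Returns (iterations_run, stopped_early).
    """
    previous = None
    stable_chunks = 0
    done = 0
    while done < training_iterations:
        n = min(train_chunk, training_iterations - done)
        current = train_chunk_fn(n)
        done += n
        if previous is not None and abs(current - previous) < early_stop_delta:
            stable_chunks += 1
            if stable_chunks >= early_stop_patience:
                return done, True
        else:
            # Any larger change resets the patience counter.
            stable_chunks = 0
        previous = current
    return done, False
```

With the defaults, training ends as soon as three consecutive chunks each change the monitored value by less than 0.005, or after 1000 iterations, whichever comes first.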

Smoothing & Segmentation

Name (GUI)	Type	Default	Description
smooth_ms	positive	20	Temporal smoothing window in milliseconds to prevent abrupt speed transitions.
change_tolerance	positive	0.03	Cumulative speed factor change required before writing a new duration point.
min_seg_ms	positive	35	Minimum segment duration in milliseconds.
max_gap_ms	positive	100	Maximum time gap before forcing a new duration point.
force_every_frames	integer	2	Force a duration point every N frames (when min_seg_ms is met).
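
One plausible reading of how these segmentation parameters combine is sketched below. The decision rule is an assumption, since the guide does not spell it out: a point is written once min_seg_ms has elapsed and either the speed has drifted by at least change_tolerance from the last written point or force_every_frames frames have passed, and a point is always written once max_gap_ms has elapsed. The function name is hypothetical.

```python
def select_duration_points(speeds, frame_step_seconds=0.005,
                           change_tolerance=0.03, min_seg_ms=35,
                           max_gap_ms=100, force_every_frames=2):
    """Pick (time_s, speed) points from a per-frame speed curve.

    Assumed rule, not confirmed by the guide: see the lead-in text.
    """
    frame_step_ms = round(frame_step_seconds * 1000.0, 6)
    points = [(0.0, speeds[0])]
    last_index = 0
    for i in range(1, len(speeds)):
        frames_since = i - last_index
        elapsed_ms = frames_since * frame_step_ms
        # Drift is measured against the last written point (an assumption).
        drift = abs(speeds[i] - points[-1][1])
        min_met = elapsed_ms >= min_seg_ms
        wants_point = (drift >= change_tolerance
                       or frames_since >= force_every_frames)
        if (min_met and wants_point) or elapsed_ms >= max_gap_ms:
            points.append((i * frame_step_ms / 1000.0, speeds[i]))
            last_index = i
    return points
```

Under this reading, the defaults (5 ms frames, min_seg_ms = 35, force_every_frames = 2) yield roughly one duration point every 35 ms, because the forcing condition is already satisfied by the time the minimum segment duration is reached.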

Adaptive Weighting

Name (GUI)	Type	Default	Description
confidence_threshold	positive	0.10	Minimum network activation required to apply speed modification (lower = more aggressive).
contrast_gain	positive	2.2	Amplifies speed modifications for high-confidence predictions.
temperature	positive	0.5	Softmax temperature for converting activations to probabilities (lower = sharper).
voiced_boost	positive	0.35	Additional weighting for voiced segments based on harmonicity and F0.

Feature Extraction Settings

Name (GUI)	Type	Default	Description
frame_step_seconds	positive	0.005	Time step between acoustic analysis frames (5 ms by default).
max_formant_hz	positive	5500	Maximum formant frequency for the Burg algorithm (about 5500 Hz is the usual setting for adult female voices, about 5000 Hz for adult male voices).
vowel_hnr_threshold	positive	5.0	Minimum harmonics-to-noise ratio (dB) for vowel classification.
fricative_hnr_max	positive	3.0	Maximum HNR (dB) for fricative classification.
silence_intensity_threshold	positive	45	Intensity threshold (dB) below which frames are classified as silence.

Playback

Name (GUI)	Type	Default	Description
play_result	boolean	0	Automatically play the output sound after processing.

Outputs

Object name: [OriginalName]_ffnet_speeded_adaptive_es
Type: Sound (same number of channels as input).
Feedback: Comprehensive training statistics printed to the Praat Info window, including:
- Number of frames analyzed and MFCC frames extracted
- Number of duration segments written
- Smoothing window size and segmentation parameters
- Speed multipliers for each phonetic class
- Training progress (chunks, epochs, early stopping status)
Playback: Optional automatic playback if play_result is enabled.
Note: All intermediate analysis objects (Pitch, Intensity, Formant, MFCC, Harmonicity, FFNet, etc.) are automatically cleaned up after processing.

Technical details

Feature Extraction

The script extracts 18 acoustic features per 5 ms frame:

MFCCs 1-12: Mel-frequency cepstral coefficients capturing spectral envelope
Formants F1, F2, F3: Vocal tract resonances (normalized to kHz)
Intensity: Frame energy (normalized: (dB - 60) / 20)
Harmonicity (HNR): Periodicity measure (normalized: dB / 20)
F0: Fundamental frequency/pitch (normalized: Hz / 500)
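
The normalizations listed above can be assembled into one 18-element vector, as in this sketch. The feature order and the function name are assumptions. The sketch also assumes the caller has already replaced undefined measurements (for example, F0 in unvoiced frames) with 0, and that the MFCCs are passed through unchanged.

```python
def build_feature_vector(mfcc, formants_hz, intensity_db, hnr_db, f0_hz):
    """Return the 18 normalized features for one analysis frame.

    mfcc: 12 coefficients (MFCCs 1-12), used as given.
    formants_hz: (F1, F2, F3) in Hz.
    The ordering of the features is an assumption for illustration.
    """
    if len(mfcc) != 12 or len(formants_hz) != 3:
        raise ValueError("expected 12 MFCCs and 3 formants")
    features = list(mfcc)
    features += [f / 1000.0 for f in formants_hz]   # Hz -> kHz
    features.append((intensity_db - 60.0) / 20.0)   # intensity
    features.append(hnr_db / 20.0)                  # harmonicity (HNR)
    features.append(f0_hz / 500.0)                  # pitch
    return features
```

For example, a frame at 70 dB with an HNR of 10 dB and an F0 of 200 Hz contributes 0.5, 0.5, and 0.4 as its last three features.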

Ground Truth Classification

Rule-based labeling creates training targets:

Silence: Intensity < 45 dB
Vowel: HNR > 5.0 dB, F0 > 0 Hz, F1 > 300 Hz
Fricative: Intensity > 45 dB, HNR < 3.0 dB, F0 = 0 Hz
Other: Everything else (stops, nasals, approximants)
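
The labeling rules translate directly into a small decision function. The sketch below assumes the rules are checked in the order listed and that unvoiced frames are represented by an F0 of 0. The function name is hypothetical, and the default thresholds are those from the Feature Extraction Settings table.

```python
def label_frame(intensity_db, hnr_db, f0_hz, f1_hz,
                silence_intensity_threshold=45.0,
                vowel_hnr_threshold=5.0,
                fricative_hnr_max=3.0):
    """Rule-based training label for one frame.

    Assumes the rules are evaluated in order (silence first) and that
    f0_hz is 0 for unvoiced frames.
    """
    if intensity_db < silence_intensity_threshold:
        return "silence"
    if hnr_db > vowel_hnr_threshold and f0_hz > 0 and f1_hz > 300:
        return "vowel"
    if (intensity_db > silence_intensity_threshold
            and hnr_db < fricative_hnr_max and f0_hz == 0):
        return "fricative"
    return "other"
```

A loud, noisy, unvoiced frame such as label_frame(60, 1, 0, 0) is labeled "fricative", while a voiced frame with F1 below 300 Hz falls through to "other".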

Neural Network Architecture

Single-hidden-layer feedforward network:

Input layer: 18 normalized features
Hidden layer: 24 neurons (configurable)
Output layer: 4 neurons (phonetic classes)
Training: Minimum squared error, early stopping based on activation convergence
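
To get a feel for the size of this network, the sketch below counts its trainable weights. It assumes full connectivity between layers and one bias weight per hidden and output unit. The function name is hypothetical.

```python
def ffnet_parameter_count(n_inputs=18, n_hidden=24, n_outputs=4):
    """Number of weights in a fully connected single-hidden-layer network.

    Assumes one bias per hidden and output unit.
    """
    hidden_weights = (n_inputs + 1) * n_hidden
    output_weights = (n_hidden + 1) * n_outputs
    return hidden_weights + output_weights


print(ffnet_parameter_count())
```

With the defaults this gives (18 + 1) × 24 + (24 + 1) × 4 = 556 weights, which is small enough to retrain on every run.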

Speed Application Algorithm

Softmax: Network activations converted to probabilities using temperature scaling
Weighted speed: weighted_speed = w₁×speed_vowel + w₂×speed_fric + w₃×speed_other + w₄×speed_silence, where w₁ to w₄ are the softmax probabilities for vowel, fricative, other, and silence
Voicing boost: Amplifies modifications for voiced segments based on HNR and F0
Confidence mixing: Blends with neutral speed (1.0) when network confidence is low
Contrast gain: Amplifies deviations from 1.0 for high-confidence predictions
Temporal smoothing: Moving average over the smooth_ms window (20 ms by default) prevents abrupt speed changes and clicks
Segment consolidation: Enforces minimum duration and maximum gap constraints
PSOLA resynthesis: Pitch-synchronous overlap-add via Manipulation object preserves pitch while modifying duration
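
The first two steps and the smoothing step can be sketched as follows. This is an illustrative Python version, not the script's Praat code. The voicing boost, confidence mixing, and contrast gain are left out because the guide does not give their formulas. The centered window in moving_average and the example activations are assumptions.

```python
import math


def softmax(activations, temperature=0.5):
    """Convert activations to probabilities; lower temperature is sharper."""
    scaled = [a / temperature for a in activations]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_speed(probs, speeds):
    """Probability-weighted mix of the per-class speed multipliers."""
    return sum(p * s for p, s in zip(probs, speeds))


def moving_average(values, half_width_frames=2):
    """Centered moving average, truncated at the edges (an assumption)."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - half_width_frames)
        hi = min(len(values), i + half_width_frames + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


# Class order: vowel, fricative, other, silence (default multipliers).
speeds = [3.2, 0.30, 2.4, 1.0]
probs = softmax([0.9, 0.1, 0.2, 0.0])  # hypothetical activations
frame_speed = weighted_speed(probs, speeds)
```

Because the weights sum to 1, the per-frame speed always lies between the smallest and largest multiplier. Lowering the temperature pushes it toward the multiplier of the most probable class.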

Use Cases

Accessibility: Slow down critical consonants for hearing-impaired listeners
Language learning: Emphasize difficult phonemes by reducing their speed
Accelerated listening: Speed up speech while maintaining intelligibility by preserving consonant timing
Speech research: Investigate perceptual effects of differential time-scaling on phonetic categories