Neural Phonetic Speed Mapper – User Guide
Uses a feedforward neural network to speed up or slow down different phonetic segments of speech based on acoustic features, applying differential time-scaling to vowels, fricatives, other consonants, and silence.
What this does
This Praat script implements adaptive speech speed modification: a neural network learns to classify phonetic segments, and each class is time-scaled by its own factor. Unlike uniform time-stretching, the script extracts 18 acoustic features (12 MFCCs, three formants, intensity, harmonicity, and pitch) every 5 milliseconds, trains a feedforward neural network to classify each frame as vowel, fricative, other consonant, or silence, and then applies an independent speed multiplier to each category. Speech can therefore be accelerated substantially while preserving intelligibility, because critical sounds (especially fricatives) receive more relative time. The script uses early-stopping training, voicing-adaptive boosting, confidence-based mixing, temporal smoothing, and PSOLA resynthesis to produce natural-sounding output. Per-class speeds are configurable, from strong slow-down (default 0.30 for fricatives) to roughly 3× acceleration (default 3.2 for vowels).
Quick start
- In Praat, select exactly one Sound object.
- Run script… → FFNet_Adaptive_Speed.praat.
- Adjust the four speed parameters (speed_vowel, speed_fric, speed_other, speed_silence) to control how each phonetic category is time-scaled.
- Optionally modify neural network settings (hidden_units, training_iterations) or smoothing parameters (smooth_ms, min_seg_ms).
- Click OK.
- The output object, named [OriginalName]_ffnet_speeded_adaptive_es, is created. Detailed training and processing statistics appear in the Info window.
Parameters (form fields)
Speed Multipliers
| Name (GUI) | Type | Default | Description |
|---|---|---|---|
| speed_vowel | positive | 3.2 | Time-scale factor for vowel segments. Values >1 speed up, <1 slow down. |
| speed_fric | positive | 0.30 | Time-scale factor for fricative consonants (s, f, sh, th, etc.). |
| speed_other | positive | 2.4 | Time-scale factor for other consonants (stops, nasals, approximants). |
| speed_silence | positive | 1.0 | Time-scale factor for silence/pauses (typically left at 1.0). |
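
To make the speed multipliers concrete, the sketch below converts each default into the relative duration it implies. It assumes that a speed factor maps to a duration factor by taking the reciprocal (Praat's DurationTier stores relative durations); the function name `duration_factor` is hypothetical and not taken from the script.

```python
def duration_factor(speed: float) -> float:
    """Relative duration implied by a speed multiplier (assumed to be 1 / speed)."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return 1.0 / speed


# Default speed multipliers from the table above.
defaults = {
    "speed_vowel": 3.2,
    "speed_fric": 0.30,
    "speed_other": 2.4,
    "speed_silence": 1.0,
}

for name, speed in defaults.items():
    print(f"{name}: speed {speed} -> duration x{duration_factor(speed):.3f}")
```

Under this assumption, the default vowel setting shortens vowels to about a third of their original length, while the default fricative setting stretches fricatives to more than three times their length.
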
Enable/Disable Phonetic Classes
| Name (GUI) | Type | Default | Description |
|---|---|---|---|
| enable_vowel | boolean | 1 | Include vowels in adaptive speed modification. |
| enable_fric | boolean | 1 | Include fricatives in adaptive speed modification. |
| enable_other | boolean | 1 | Include other consonants in adaptive speed modification. |
| enable_silence | boolean | 1 | Include silence in adaptive speed modification. |
Neural Network Configuration
| Name (GUI) | Type | Default | Description |
|---|---|---|---|
| hidden_units | integer | 24 | Number of neurons in the hidden layer of the feedforward network. |
| training_iterations | integer | 1000 | Maximum number of training epochs (subject to early stopping). |
| train_chunk | integer | 100 | Number of training iterations run per chunk (early stopping is evaluated per chunk). |
| learning_rate | positive | 0.001 | Neural network learning rate for backpropagation. |
| early_stop_delta | positive | 0.005 | Activation change threshold below which training is considered converged. |
| early_stop_patience | integer | 3 | Number of consecutive unchanged chunks before early stopping triggers. |
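
The interaction between training_iterations, train_chunk, early_stop_delta, and early_stop_patience can be sketched as a chunked training loop. This is an illustration only: `train_chunk_fn` is a hypothetical callback standing in for the real training call, and measuring convergence as the mean absolute change in activations is an assumption, since the guide does not state the exact metric.

```python
from typing import Callable, List, Optional


def train_with_early_stopping(
    train_chunk_fn: Callable[[int], List[float]],
    training_iterations: int = 1000,
    train_chunk: int = 100,
    early_stop_delta: float = 0.005,
    early_stop_patience: int = 3,
) -> int:
    """Train in chunks and return the number of iterations actually run.

    train_chunk_fn(n) is a hypothetical callback: it trains for n more
    iterations and returns the network's current output activations.
    """
    previous: Optional[List[float]] = None
    stable_chunks = 0
    done = 0
    while done < training_iterations:
        n = min(train_chunk, training_iterations - done)
        current = train_chunk_fn(n)
        done += n
        if previous is not None:
            # Assumed metric: mean absolute change in activations between chunks.
            change = sum(abs(a - b) for a, b in zip(current, previous)) / len(current)
            stable_chunks = stable_chunks + 1 if change < early_stop_delta else 0
            if stable_chunks >= early_stop_patience:
                break  # unchanged for enough consecutive chunks
        previous = current
    return done
```

With the defaults, training stops after at most 10 chunks of 100 iterations, or earlier once three consecutive chunks change the activations by less than 0.005.
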
Smoothing & Segmentation
| Name (GUI) | Type | Default | Description |
|---|---|---|---|
| smooth_ms | positive | 20 | Temporal smoothing window in milliseconds to prevent abrupt speed transitions. |
| change_tolerance | positive | 0.03 | Cumulative speed factor change required before writing a new duration point. |
| min_seg_ms | positive | 35 | Minimum segment duration in milliseconds. |
| max_gap_ms | positive | 100 | Maximum time gap before forcing a new duration point. |
| force_every_frames | integer | 2 | Force a duration point every N frames (when min_seg_ms is met). |
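
One plausible reading of how these parameters combine when deciding whether to write a new duration point is sketched below. The decision order, the function name `should_write_point`, and the use of the difference from the last written speed as the "cumulative change" are all assumptions; the script's actual logic may differ.

```python
def should_write_point(
    speed: float,
    last_written_speed: float,
    frames_since_last: int,
    frame_step_ms: float = 5.0,
    change_tolerance: float = 0.03,
    min_seg_ms: float = 35.0,
    max_gap_ms: float = 100.0,
    force_every_frames: int = 2,
) -> bool:
    """Decide whether the current frame should produce a duration point (assumed logic)."""
    elapsed_ms = frames_since_last * frame_step_ms
    if elapsed_ms >= max_gap_ms:
        return True   # gap too long: force a point
    if elapsed_ms < min_seg_ms:
        return False  # segment still shorter than the minimum
    if abs(speed - last_written_speed) >= change_tolerance:
        return True   # speed has drifted far enough to matter
    # Otherwise force a point every N frames once min_seg_ms is met.
    return frames_since_last % force_every_frames == 0
```

In this reading, min_seg_ms suppresses points that would create very short segments, while change_tolerance and max_gap_ms keep the duration tier from lagging behind the smoothed speed curve.
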
Adaptive Weighting
| Name (GUI) | Type | Default | Description |
|---|---|---|---|
| confidence_threshold | positive | 0.10 | Minimum network activation required to apply speed modification (lower = more aggressive). |
| contrast_gain | positive | 2.2 | Amplifies speed modifications for high-confidence predictions. |
| temperature | positive | 0.5 | Softmax temperature for converting activations to probabilities (lower = sharper). |
| voiced_boost | positive | 0.35 | Additional weighting for voiced segments based on harmonicity and F0. |
Feature Extraction Settings
| Name (GUI) | Type | Default | Description |
|---|---|---|---|
| frame_step_seconds | positive | 0.005 | Time step between acoustic analysis frames (5ms standard). |
| max_formant_hz | positive | 5500 | Maximum formant frequency for Burg algorithm (5500 Hz typical for adults). |
| vowel_hnr_threshold | positive | 5.0 | Minimum harmonics-to-noise ratio (dB) for vowel classification. |
| fricative_hnr_max | positive | 3.0 | Maximum HNR (dB) for fricative classification. |
| silence_intensity_threshold | positive | 45 | Intensity threshold (dB) below which frames are classified as silence. |
Playback
| Name (GUI) | Type | Default | Description |
|---|---|---|---|
| play_result | boolean | 0 | Automatically play the output sound after processing. |
Outputs
- Object name: [OriginalName]_ffnet_speeded_adaptive_es
- Type: Sound (same number of channels as input).
- Feedback: Comprehensive training statistics printed to the Praat Info window, including:
  - Number of frames analyzed and MFCC frames extracted
  - Number of duration segments written
  - Smoothing window size and segmentation parameters
  - Speed multipliers for each phonetic class
  - Training progress (chunks, epochs, early stopping status)
- Playback: Optional automatic playback if play_result is enabled.
- Note: All intermediate analysis objects (Pitch, Intensity, Formant, MFCC, Harmonicity, FFNet, etc.) are automatically cleaned up after processing.
Technical details
Feature Extraction
The script extracts 18 acoustic features per 5ms frame:
- MFCCs 1-12: Mel-frequency cepstral coefficients capturing spectral envelope
- Formants F1, F2, F3: Vocal tract resonances (normalized to kHz)
- Intensity: Frame energy (normalized: (dB - 60) / 20)
- Harmonicity (HNR): Periodicity measure (normalized: dB / 20)
- F0: Fundamental frequency/pitch (normalized: Hz / 500)
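
The normalizations listed above can be sketched as a function that assembles one 18-element feature vector. The function name and argument layout are hypothetical, and the MFCCs are passed through unscaled because the guide does not say how (or whether) they are normalized.

```python
from typing import List, Sequence


def build_feature_vector(
    mfcc: Sequence[float],
    formants_hz: Sequence[float],
    intensity_db: float,
    hnr_db: float,
    f0_hz: float,
) -> List[float]:
    """Assemble the 18 features for one frame using the stated normalizations."""
    if len(mfcc) != 12:
        raise ValueError("expected 12 MFCCs")
    if len(formants_hz) != 3:
        raise ValueError("expected F1, F2, F3")
    features = list(mfcc)                           # MFCC scaling unspecified: unchanged (assumption)
    features += [f / 1000.0 for f in formants_hz]   # formants in kHz
    features.append((intensity_db - 60.0) / 20.0)   # intensity: (dB - 60) / 20
    features.append(hnr_db / 20.0)                  # harmonicity: dB / 20
    features.append(f0_hz / 500.0)                  # pitch: Hz / 500
    return features
```

For example, a frame with formants at 500, 1500, and 2500 Hz, intensity 70 dB, HNR 10 dB, and F0 125 Hz yields the tail values 0.5, 1.5, 2.5, 0.5, 0.5, and 0.25.
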
Ground Truth Classification
Rule-based labeling creates training targets:
- Silence: Intensity < 45 dB
- Vowel: HNR > 5.0 dB, F0 > 0 Hz, F1 > 300 Hz
- Fricative: Intensity > 45 dB, HNR < 3.0 dB, F0 = 0 Hz
- Other: Everything else (stops, nasals, approximants)
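
The labeling rules above translate directly into a small classifier. The sketch assumes the rules are checked in the order listed (silence first) and that an unvoiced frame is represented by F0 = 0; the function name `label_frame` is hypothetical.

```python
def label_frame(
    intensity_db: float,
    hnr_db: float,
    f0_hz: float,
    f1_hz: float,
    silence_intensity_threshold: float = 45.0,
    vowel_hnr_threshold: float = 5.0,
    fricative_hnr_max: float = 3.0,
) -> str:
    """Return the rule-based training label for one frame."""
    if intensity_db < silence_intensity_threshold:
        return "silence"
    if hnr_db > vowel_hnr_threshold and f0_hz > 0 and f1_hz > 300:
        return "vowel"
    if (
        intensity_db > silence_intensity_threshold
        and hnr_db < fricative_hnr_max
        and f0_hz == 0
    ):
        return "fricative"
    return "other"  # stops, nasals, approximants, and anything ambiguous
```

Because these labels come from simple thresholds, the network is effectively trained to reproduce and smooth the rule-based decisions using the richer 18-feature input.
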
Neural Network Architecture
Single-hidden-layer feedforward network:
- Input layer: 18 normalized features
- Hidden layer: 24 neurons (configurable)
- Output layer: 4 neurons (phonetic classes)
- Training: Minimum squared error, early stopping based on activation convergence
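
As a rough illustration of the network's size and shape, the following pure-Python sketch builds an 18-24-4 network with random weights and runs one forward pass. Sigmoid units and one bias per unit are assumptions, and the code is not Praat's FFNet implementation.

```python
import math
import random
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in: int, n_out: int, rng: random.Random) -> List[List[float]]:
    """One row per unit: n_in weights followed by one bias (assumed layout)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward(layer: List[List[float]], inputs: List[float]) -> List[float]:
    """Weighted sum plus bias for each unit, passed through a sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1])
        for row in layer
    ]


rng = random.Random(0)
hidden = init_layer(18, 24, rng)   # 18 features -> 24 hidden units
output = init_layer(24, 4, rng)    # 24 hidden units -> 4 classes

activations = forward(output, forward(hidden, [0.0] * 18))
n_weights = sum(len(row) for row in hidden) + sum(len(row) for row in output)
print(len(activations), n_weights)  # prints: 4 556
```

With the default 24 hidden units, this layout has (18 + 1) × 24 + (24 + 1) × 4 = 556 trainable values, small enough to train per sound file.
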
Speed Application Algorithm
- Softmax: Network activations converted to probabilities using temperature scaling
- Weighted speed: weighted_speed = w₁×speed_vowel + w₂×speed_fric + w₃×speed_other + w₄×speed_silence
- Voicing boost: Amplifies modifications for voiced segments based on HNR and F0
- Confidence mixing: Blends with neutral speed (1.0) when network confidence is low
- Contrast gain: Amplifies deviations from 1.0 for high-confidence predictions
- Temporal smoothing: Moving average over 20ms window prevents clicks
- Segment consolidation: Enforces minimum duration and maximum gap constraints
- PSOLA resynthesis: Pitch-synchronous overlap-add via Manipulation object preserves pitch while modifying duration
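
The first two steps and the smoothing step can be sketched as follows. Only the temperature softmax, the weighted-speed formula, and a moving average are shown; the voicing boost, confidence mixing, and contrast gain are omitted because the guide does not give their exact formulas. Function names and the centered-window choice are assumptions.

```python
import math
from typing import List, Sequence


def softmax(activations: Sequence[float], temperature: float = 0.5) -> List[float]:
    """Convert activations to probabilities; lower temperature gives sharper output."""
    scaled = [a / temperature for a in activations]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_speed(
    probs: Sequence[float],
    speeds: Sequence[float] = (3.2, 0.30, 2.4, 1.0),  # vowel, fric, other, silence
) -> float:
    """weighted_speed = w1*speed_vowel + w2*speed_fric + w3*speed_other + w4*speed_silence."""
    return sum(p * s for p, s in zip(probs, speeds))


def moving_average(
    values: Sequence[float], smooth_ms: float = 20.0, frame_ms: float = 5.0
) -> List[float]:
    """Centered moving average over frames within +/- smooth_ms / 2 (assumed window shape)."""
    half = int(round(smooth_ms / frame_ms / 2))
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out
```

A frame whose four activations are equal gets probabilities of 0.25 each, so its weighted speed is simply the mean of the four multipliers; a confident vowel frame moves toward speed_vowel instead.
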
Use Cases
- Accessibility: Slow down critical consonants for hearing-impaired listeners
- Language learning: Emphasize difficult phonemes by reducing their speed
- Accelerated listening: Speed up speech while maintaining intelligibility by preserving consonant timing
- Speech research: Investigate perceptual effects of differential time-scaling on phonetic categories