Neural Phonetic Speed Mapper – User Guide

Uses a feedforward neural network to speed up or slow down individual phonetic segments of speech based on their acoustic features, applying a separate time-scaling factor to vowels, fricatives, other consonants, and silence.

Category: Speech Analysis / Processing
Praat Script: FFNet_Adaptive_Speed.praat
What this does

This Praat script modifies speech rate adaptively: a neural network learns to classify phonetic segments, and each class receives its own time-scaling factor. Unlike uniform time-stretching, the script extracts 18 acoustic features (MFCCs, formants, pitch, intensity, harmonicity) every 5 milliseconds, trains a feedforward neural network to classify each frame as vowel, fricative, other consonant, or silence, and then applies an independent speed multiplier to each category. Speech can therefore be accelerated substantially while preserving intelligibility, because critical sounds (especially fricatives) keep more relative time.

The script uses early-stopping training, voicing-adaptive boosting, confidence-based mixing, temporal smoothing, and PSOLA resynthesis to produce natural-sounding output. Each phonetic class has its own configurable speed, from strong slow-down to more than 3× acceleration (the default for vowels is 3.2).

Quick start

  1. In Praat, select exactly one Sound object.
  2. Choose Run script… and select FFNet_Adaptive_Speed.praat.
  3. Adjust the four speed parameters (speed_vowel, speed_fric, speed_other, speed_silence) to control how each phonetic category is time-scaled.
  4. Optionally modify neural network settings (hidden_units, training_iterations) or smoothing parameters (smooth_ms, min_seg_ms).
  5. Click OK.
  6. The output object, named [OriginalName]_ffnet_speeded_adaptive_es, is created. Detailed training and processing statistics appear in the Info window.

Parameters (form fields)

Speed Multipliers

Name (GUI)     Type      Default  Description
speed_vowel    positive  3.2      Time-scale factor for vowel segments. Values >1 speed up, <1 slow down.
speed_fric     positive  0.30     Time-scale factor for fricative consonants (s, f, sh, th, etc.).
speed_other    positive  2.4      Time-scale factor for other consonants (stops, nasals, approximants).
speed_silence  positive  1.0      Time-scale factor for silence/pauses (typically left at 1.0).
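
To make the direction of these factors concrete, the following minimal Python sketch converts a speed multiplier into a resulting segment duration. It assumes that a multiplier acts as a rate, so the duration is divided by it. The function name scaled_duration is hypothetical and is not part of the Praat script.

```python
def scaled_duration(duration_s: float, speed: float) -> float:
    """Return the duration of a segment after time-scaling.

    Assumption: `speed` is a rate multiplier, so a value above 1 shortens
    the segment and a value below 1 lengthens it.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return duration_s / speed
```

With the defaults, a 120 ms vowel (speed_vowel = 3.2) shrinks to 37.5 ms, while a 100 ms fricative (speed_fric = 0.30) stretches to roughly 333 ms.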

Enable/Disable Phonetic Classes

Name (GUI)      Type     Default  Description
enable_vowel    boolean  1        Include vowels in adaptive speed modification.
enable_fric     boolean  1        Include fricatives in adaptive speed modification.
enable_other    boolean  1        Include other consonants in adaptive speed modification.
enable_silence  boolean  1        Include silence in adaptive speed modification.

Neural Network Configuration

Name (GUI)           Type      Default  Description
hidden_units         integer   24       Number of neurons in the hidden layer of the feedforward network.
training_iterations  integer   1000     Maximum number of training epochs (subject to early stopping).
train_chunk          integer   100      Number of iterations per training chunk; convergence is checked after each chunk.
learning_rate        positive  0.001    Neural network learning rate for backpropagation.
early_stop_delta     positive  0.005    Activation change threshold below which training is considered converged.
early_stop_patience  integer   3        Number of consecutive unchanged chunks before early stopping triggers.
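
The interaction between training_iterations, train_chunk, early_stop_delta, and early_stop_patience can be sketched in Python as follows. This is an illustration under stated assumptions, not the script's code: train_chunk_fn is a hypothetical callback that trains for a given number of iterations and returns one scalar summarising the network's activations, and "unchanged" is taken to mean that this scalar moved by less than early_stop_delta since the previous chunk.

```python
from typing import Callable


def train_with_early_stopping(
    train_chunk_fn: Callable[[int], float],
    training_iterations: int = 1000,
    train_chunk: int = 100,
    early_stop_delta: float = 0.005,
    early_stop_patience: int = 3,
) -> int:
    """Train in chunks and stop once the activation summary stops changing.

    Returns the number of iterations actually run.
    """
    done = 0
    stable_chunks = 0
    previous = None
    while done < training_iterations:
        # Never run past the configured maximum.
        n = min(train_chunk, training_iterations - done)
        current = train_chunk_fn(n)  # hypothetical: trains n iterations
        done += n
        if previous is not None and abs(current - previous) < early_stop_delta:
            stable_chunks += 1
            if stable_chunks >= early_stop_patience:
                break
        else:
            # Any large change resets the patience counter.
            stable_chunks = 0
        previous = current
    return done
```

Under these assumptions, the defaults stop training after three consecutive chunks of 100 iterations with less than 0.005 change, and otherwise run the full 1000 iterations.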

Smoothing & Segmentation

Name (GUI)          Type      Default  Description
smooth_ms           positive  20       Temporal smoothing window in milliseconds to prevent abrupt speed transitions.
change_tolerance    positive  0.03     Cumulative speed factor change required before writing a new duration point.
min_seg_ms          positive  35       Minimum segment duration in milliseconds.
max_gap_ms          positive  100      Maximum time gap before forcing a new duration point.
force_every_frames  integer   2        Force a duration point every N frames (when min_seg_ms is met).
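
One way to picture how these parameters shape the per-frame speed curve is the Python sketch below. It is a simplified illustration, not the script's actual logic: moving_average uses a centred window of about smooth_ms, and thin_points is a hypothetical helper that compares each frame with the last written point (rather than tracking a cumulative change) and ignores force_every_frames.

```python
from typing import List, Sequence, Tuple


def moving_average(values: Sequence[float],
                   frame_step_s: float = 0.005,
                   smooth_ms: float = 20.0) -> List[float]:
    """Centred moving average; the window is truncated at both ends."""
    half = int(round(smooth_ms / 1000.0 / frame_step_s)) // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        window = values[lo:hi]
        out.append(sum(window) / len(window))
    return out


def thin_points(times_ms: Sequence[float],
                speeds: Sequence[float],
                change_tolerance: float = 0.03,
                min_seg_ms: float = 35.0,
                max_gap_ms: float = 100.0) -> List[Tuple[float, float]]:
    """Keep only the (time, speed) points worth writing (assumed rules).

    A point is written when the speed differs from the last written point
    by at least change_tolerance and min_seg_ms has elapsed, or when
    max_gap_ms has elapsed regardless of change.
    """
    if not times_ms:
        return []
    points = [(times_ms[0], speeds[0])]
    for t, s in zip(times_ms[1:], speeds[1:]):
        last_t, last_s = points[-1]
        gap_ms = t - last_t
        changed = abs(s - last_s) >= change_tolerance
        if (changed and gap_ms >= min_seg_ms) or gap_ms >= max_gap_ms:
            points.append((t, s))
    return points
```

In this sketch, a constant speed curve still yields one point every max_gap_ms, and a sudden jump is only written once min_seg_ms has passed since the previous point.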

Adaptive Weighting

Name (GUI)            Type      Default  Description
confidence_threshold  positive  0.10     Minimum network activation required to apply speed modification (lower = more aggressive).
contrast_gain         positive  2.2      Amplifies speed modifications for high-confidence predictions.
temperature           positive  0.5      Softmax temperature for converting activations to probabilities (lower = sharper).
voiced_boost          positive  0.35     Additional weighting for voiced segments based on harmonicity and F0.
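
The effect of the temperature parameter can be illustrated with a standard temperature-scaled softmax. The Python sketch below shows the general technique; how the script itself implements it in Praat is not shown in this guide.

```python
import math
from typing import List, Sequence


def softmax(activations: Sequence[float], temperature: float = 0.5) -> List[float]:
    """Convert raw activations to probabilities that sum to 1.

    Dividing by a temperature below 1 widens the gaps between activations,
    so the largest one receives a larger share of the probability mass.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [a / temperature for a in activations]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

For the same activations, softmax(..., temperature=0.5) gives the winning class a higher probability than softmax(..., temperature=1.0), which is what "lower = sharper" refers to.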

Feature Extraction Settings

Name (GUI)                   Type      Default  Description
frame_step_seconds           positive  0.005    Time step between acoustic analysis frames (5 ms standard).
max_formant_hz               positive  5500     Maximum formant frequency for Burg algorithm (5500 Hz typical for adults).
vowel_hnr_threshold          positive  5.0      Minimum harmonics-to-noise ratio (dB) for vowel classification.
fricative_hnr_max            positive  3.0      Maximum HNR (dB) for fricative classification.
silence_intensity_threshold  positive  45       Intensity threshold (dB) below which frames are classified as silence.

Playback

Name (GUI)   Type     Default  Description
play_result  boolean  0        Automatically play the output sound after processing.

Outputs

Technical details

Feature Extraction

The script extracts 18 acoustic features per 5 ms frame, drawn from MFCCs, formants, pitch, intensity, and harmonicity.

Ground Truth Classification

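Rule-based labeling creates the training targets. Each frame is assigned to one of the four classes using the thresholds from Feature Extraction Settings: silence_intensity_threshold for silence, vowel_hnr_threshold for vowels, and fricative_hnr_max for fricatives.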
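
A minimal Python sketch of such rule-based labeling, built only from the three thresholds listed under Feature Extraction Settings, might look like this. The order of the rules and the fallback to "other" for frames that match no rule are assumptions; the actual script may consult further features.

```python
def label_frame(intensity_db: float,
                hnr_db: float,
                vowel_hnr_threshold: float = 5.0,
                fricative_hnr_max: float = 3.0,
                silence_intensity_threshold: float = 45.0) -> str:
    """Assign one of four class labels to a single analysis frame."""
    # Assumed order: silence is checked first, then vowel, then fricative.
    if intensity_db < silence_intensity_threshold:
        return "silence"
    if hnr_db >= vowel_hnr_threshold:
        return "vowel"
    if hnr_db <= fricative_hnr_max:
        return "fricative"
    # Assumed fallback for frames between the two HNR thresholds.
    return "other"
```

With the default thresholds, a 60 dB frame with an HNR of 12 dB is labeled "vowel", and a 60 dB frame with an HNR of 1 dB is labeled "fricative".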

Neural Network Architecture

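Single-hidden-layer feedforward network: 18 input features, hidden_units hidden neurons (24 by default), and four outputs, one per class (vowel, fricative, other consonant, silence).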
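
To make the shape of such a network concrete, here is a small Python sketch of a forward pass through one hidden layer, with 18 inputs, 24 hidden units, and 4 outputs. It assumes sigmoid units and small random initial weights, and the helper names are hypothetical. It does not reproduce Praat's FFNet object or its training procedure.

```python
import math
import random
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in: int, n_out: int, rng: random.Random) -> List[List[float]]:
    """One weight row per output unit; the last entry of each row is the bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def forward_layer(weights: Sequence[Sequence[float]],
                  inputs: Sequence[float]) -> List[float]:
    """Weighted sum plus bias, passed through a sigmoid, for each unit."""
    return [sigmoid(sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1])
            for row in weights]


def forward(hidden: Sequence[Sequence[float]],
            output: Sequence[Sequence[float]],
            features: Sequence[float]) -> List[float]:
    """Map one feature frame to four output activations."""
    return forward_layer(output, forward_layer(hidden, features))


rng = random.Random(0)
hidden_weights = init_layer(18, 24, rng)   # 18 features -> 24 hidden units
output_weights = init_layer(24, 4, rng)    # 24 hidden units -> 4 classes
activations = forward(hidden_weights, output_weights, [0.0] * 18)
```

Each of the four activations lies between 0 and 1 under the sigmoid assumption.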

Speed Application Algorithm

  1. Softmax: Network activations are converted to probabilities using temperature scaling
  2. Weighted speed: weighted_speed = w₁×speed_vowel + w₂×speed_fric + w₃×speed_other + w₄×speed_silence, where w₁ to w₄ are the class probabilities from step 1
  3. Voicing boost: Amplifies modifications for voiced segments based on HNR and F0
  4. Confidence mixing: Blends with neutral speed (1.0) when network confidence is low
  5. Contrast gain: Amplifies deviations from 1.0 for high-confidence predictions
  6. Temporal smoothing: Moving average over the smooth_ms window (20 ms by default) prevents clicks
  7. Segment consolidation: Enforces minimum duration and maximum gap constraints
  8. PSOLA resynthesis: Pitch-synchronous overlap-add via Manipulation object preserves pitch while modifying duration
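
Steps 2 and 4 above can be sketched in Python as follows. The weighted sum follows the formula in step 2. The linear blend in mix_with_neutral is only one plausible reading of "blends with neutral speed", and both function names are hypothetical. The confidence value is assumed to be a raw network activation, as the confidence_threshold description suggests. The voicing boost and contrast gain steps are omitted because the guide does not give their formulas.

```python
from typing import Sequence


def weighted_speed(probs: Sequence[float], speeds: Sequence[float]) -> float:
    """Probability-weighted average of the per-class speed multipliers.

    Both sequences use the same class order:
    vowel, fricative, other consonant, silence.
    """
    if len(probs) != len(speeds):
        raise ValueError("probs and speeds must have the same length")
    return sum(p * s for p, s in zip(probs, speeds))


def mix_with_neutral(speed: float,
                     confidence: float,
                     confidence_threshold: float = 0.10) -> float:
    """Pull the speed toward 1.0 when confidence is below the threshold.

    Assumed rule: at or above the threshold the speed is used unchanged;
    below it, the deviation from 1.0 shrinks linearly with confidence.
    """
    if confidence >= confidence_threshold:
        return speed
    alpha = max(0.0, confidence) / confidence_threshold
    return 1.0 + alpha * (speed - 1.0)
```

For a frame whose four class probabilities are all 0.25, the default multipliers (3.2, 0.30, 2.4, 1.0) give a weighted speed of 1.725. A frame classified entirely as vowel gets 3.2, which this sketch would reduce to 2.1 if its confidence were 0.05.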

Use Cases