Neural Phonetic Harmonizer – User Guide

Uses a neural network classifier to identify phonetic classes (vowels, fricatives, other sounds, silence) frame-by-frame, then creates adaptive harmony by pitch-shifting and mixing four parallel voices with class-specific intervals and intensities.

Category: Synthesis / Processing
Praat Script: Neural Phonetic Harmonizer.praat

Contents:

- What this does
- Quick start
- Parameters
- Outputs

What this does

This Praat script classifies speech or other audio into four phonetic categories, frame by frame, and applies a differently pitch-shifted harmony voice to each category.

The script first extracts 18 acoustic features per frame: 12 MFCCs, three formant frequencies (F1, F2, F3), normalized intensity, harmonics-to-noise ratio (HNR), and fundamental frequency (F0). Each frame is then pre-classified into one of four categories using acoustic heuristics:
- Vowels: high HNR, pitched, prominent F1
- Fricatives: low HNR, unpitched, sufficient intensity
- Other sounds: everything else above the silence threshold
- Silence: very low intensity

A feedforward neural network (FFNet) with a configurable number of hidden units is then trained to learn these classifications from the feature vectors, using early stopping to prevent overtraining. After training, the network generates a per-frame probability distribution across the four classes. These probabilities are converted to mixing weights using a temperature-scaled softmax, with an optional voiced boost that increases vowel harmony during pitched, harmonic segments.

Four parallel copies of the original audio are created, each pitch-shifted by a user-defined interval in semitones using manipulation and overlap-add resynthesis. The harmony voices are then mixed dynamically: IntensityTiers modulate the amplitude of each voice according to the network's frame-by-frame classification confidence and the user-specified mix levels. With the default settings, vowel passages are harmonized a perfect fifth up (+7 semitones), fricatives a perfect fourth down (-5 semitones), and other sounds a major third up (+4 semitones), so the mix follows the changing phonetic content of the audio.
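The conversion from network outputs to mixing weights can be illustrated with a short Python sketch. This is not the script's own code: the function name `mixing_weights`, the `is_voiced` flag, and the way the voiced boost is applied (added to the vowel weight, then renormalized) are assumptions made for illustration.

```python
import math

CLASSES = ("vowel", "fricative", "other", "silence")

def mixing_weights(scores, temperature=0.3, voiced_boost=0.2, is_voiced=False):
    """Turn four per-class network outputs into mixing weights that sum to 1."""
    # Dividing by the temperature sharpens (T < 1) or flattens (T > 1) the result.
    scaled = [s / temperature for s in scores]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    if is_voiced:
        # Assumed boost rule: add to the vowel weight, then renormalize.
        weights[0] += voiced_boost
        total = sum(weights)
        weights = [w / total for w in weights]
    return dict(zip(CLASSES, weights))

frame_scores = [0.6, 0.2, 0.1, 0.1]  # hypothetical outputs for one frame
sharp = mixing_weights(frame_scores, temperature=0.3)
smooth = mixing_weights(frame_scores, temperature=2.0)
boosted = mixing_weights(frame_scores, temperature=0.3, is_voiced=True)
```

For these example scores the vowel weight is about 0.61 at temperature 0.3 and about 0.30 at temperature 2.0, which shows why lower temperatures give sharper class distinctions.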

Quick start

In Praat, select exactly one Sound object.
Run script… → Neural Phonetic Harmonizer.praat.
Set harmony intervals in semitones for each phonetic class (defaults: vowel +7, fricative -5, other +4, silence 0).
Adjust the mix levels (typically 0.0-2.0; values above 1.0 emphasize a harmony voice) to control how prominently each harmony voice appears.
Set confidence threshold (default 0.20) to filter which frames receive harmonization.
Configure neural network parameters if desired (defaults work well for most cases).
Enable create_stereo for side-by-side comparison (original left, harmonized right).
Enable play_result to hear the output immediately.
Click OK and wait for processing (from about 30 seconds to several minutes, depending on audio duration and training iterations).
The output object, named [OriginalName]_harmonized or [OriginalName]_harmonized_stereo, is created.
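To get a feel for the interval settings, it helps to see how semitones map to frequency ratios. The sketch below uses the standard equal-temperament relation (ratio = 2^(semitones/12)). The helper name `semitones_to_ratio` is illustrative and does not come from the script.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a pitch shift given in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)

# Default intervals from the form fields.
DEFAULT_INTERVALS = {"vowel": 7.0, "fricative": -5.0, "other": 4.0, "silence": 0.0}

for name, interval in DEFAULT_INTERVALS.items():
    ratio = semitones_to_ratio(interval)
    print(f"{name:9s} {interval:+5.1f} st -> F0 multiplied by {ratio:.4f}")
```

A shift of +7 semitones multiplies F0 by roughly 1.498 (close to the 3:2 ratio of a just perfect fifth), while -5 semitones multiplies it by roughly 0.749.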

Parameters (form fields)

Name (GUI)	Type	Default	Description
vowel_semitones	real	7.0	Pitch shift interval in semitones for vowel-classified frames (default: perfect fifth up).
fric_semitones	real	-5.0	Pitch shift interval in semitones for fricative-classified frames (default: perfect fourth down).
other_semitones	real	4.0	Pitch shift interval in semitones for other sound frames (default: major third up).
silence_semitones	real	0.0	Pitch shift interval in semitones for silence-classified frames (default: no shift).
vowel_mix	positive	1.2	Base mixing level (0.0-2.0+) for vowel harmony voice. Values above 1.0 emphasize the harmony; below 1.0 reduce it.
fric_mix	positive	0.9	Base mixing level for fricative harmony voice.
other_mix	positive	1.0	Base mixing level for other sounds harmony voice.
silence_mix	positive	0.2	Base mixing level for silence frames (typically kept low).
confidence_threshold	positive	0.20	Minimum classification probability (0.0-1.0) required for a frame to receive harmonization. Lower values harmonize more frames; higher values harmonize only frames with confident classifications.
temperature	positive	0.3	Softmax temperature parameter (0.1-2.0). Lower values create sharper class distinctions; higher values create smoother blending between classes.
voiced_boost	positive	0.2	Additional weight (0.0-1.0) applied to vowel harmony during voiced, harmonic segments. Increases vowel harmony prominence for pitched sounds.
hidden_units	integer	24	Number of hidden layer neurons in the feedforward neural network (typical range: 12-48).
training_iterations	integer	3000	Maximum number of training iterations for the neural network. Training may stop early if convergence is detected.
learning_rate	positive	0.001	Learning rate for neural network training (typical range: 0.0001-0.01).
create_stereo	boolean	1 (true)	If enabled, creates stereo output with original audio in left channel and harmonized audio in right channel. If disabled, outputs mono harmonized audio only.
play_result	boolean	1 (true)	If enabled, automatically plays the processed audio after completion.
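How the mix levels and the confidence threshold might combine for a single frame can be sketched as follows. This is a sketch under stated assumptions, not the script's actual logic: it assumes the threshold is applied per class (a voice is muted when its class weight is below the threshold) and that the gain is simply weight times mix level. The name `voice_gains` is hypothetical.

```python
# Default mix levels from the form fields.
DEFAULT_MIX = {"vowel": 1.2, "fricative": 0.9, "other": 1.0, "silence": 0.2}

def voice_gains(weights, mix_levels=DEFAULT_MIX, confidence_threshold=0.20):
    """Per-voice gain for one frame, given per-class mixing weights."""
    gains = {}
    for name, weight in weights.items():
        if weight >= confidence_threshold:
            # Assumed rule: confident classes contribute weight * mix level.
            gains[name] = weight * mix_levels[name]
        else:
            # Assumed rule: classes below the threshold are muted.
            gains[name] = 0.0
    return gains

# Hypothetical weights for a frame that is clearly a vowel.
frame = {"vowel": 0.70, "fricative": 0.15, "other": 0.10, "silence": 0.05}
gains = voice_gains(frame)
```

Under these assumptions only the vowel voice sounds in this frame, at a gain of 0.70 x 1.2 = 0.84. Raising confidence_threshold mutes more voices, and lowering it lets weaker classes through.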

Outputs

Object name: [OriginalName]_harmonized (mono) or [OriginalName]_harmonized_stereo (stereo)
Type: Sound (mono or stereo depending on create_stereo setting).
Feedback: A processing report is printed to the Praat Info window, including:
- Total frames analyzed
- Neural network training details (chunks, epochs, early stopping status)
- Configuration parameters (confidence threshold, temperature, voiced boost)
- Harmony intervals and mix levels for all four classes
- Frame counts showing how many frames were assigned to each phonetic class
- Total number of frames that received harmonization
- Progress messages during voice creation and mixing
Peak scaling: Output is automatically scaled to 99% peak amplitude to prevent clipping.
Playback: Plays automatically if play_result is enabled.
Processing time: May take 30 seconds to several minutes depending on audio duration and training iterations.
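The peak scaling step amounts to multiplying every sample by a single factor so that the largest absolute value becomes 0.99. A minimal Python sketch follows; the function name `scale_peak` is illustrative. Within Praat, the built-in Sound command "Scale peak..." performs this kind of scaling, though whether the script calls it directly is an assumption.

```python
def scale_peak(samples, target_peak=0.99):
    """Rescale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Silent or empty input: nothing to scale.
        return list(samples)
    factor = target_peak / peak
    return [s * factor for s in samples]

mixed = [0.5, -2.0, 1.0]  # hypothetical mix that exceeds full scale
scaled = scale_peak(mixed)
```

Because all samples share the same factor, relative levels between the voices are preserved and only the overall loudness changes.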