Neural Phonetic Harmonizer – User Guide

Uses a neural network classifier to identify phonetic classes (vowels, fricatives, other sounds, silence) frame-by-frame, then creates adaptive harmony by pitch-shifting and mixing four parallel voices with class-specific intervals and intensities.

Category: Synthesis / Processing
Praat Script: Neural Phonetic Harmonizer.praat

What this does

This Praat script classifies speech or other audio frame by frame into four phonetic categories and applies a differently pitch-shifted harmony voice to each.

The script first extracts 18 acoustic features per frame: 12 MFCC coefficients, three formant frequencies (F1, F2, F3), normalized intensity, harmonics-to-noise ratio (HNR), and fundamental frequency (F0).

Frames are then pre-classified with acoustic heuristics: vowels (high HNR, pitched, prominent F1), fricatives (low HNR, unpitched, sufficient intensity), other sounds (everything else above the silence threshold), and silence (very low intensity). A feedforward neural network (FFNet) with a configurable number of hidden units is trained to reproduce these labels from the feature vectors, with early stopping to prevent overtraining.

After training, the network outputs a per-frame probability distribution over the four classes. These probabilities are converted to mixing weights with a temperature-scaled softmax, plus an optional voiced boost that increases vowel harmony during pitched, harmonic segments.

Four parallel copies of the original audio are created, each pitch-shifted by a user-defined interval in semitones using a Manipulation object and overlap-add resynthesis. The harmony voices are mixed dynamically through IntensityTiers that scale each voice's amplitude by the network's frame-by-frame classification confidence and the user-specified mix level. With the default settings, vowel passages are harmonized a perfect fifth up (+7 semitones), fricatives a perfect fourth down (-5 semitones), and other sounds a major third up (+4 semitones), and the balance between the voices follows the changing phonetic content of the audio.
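The heuristic pre-classification step described above can be illustrated with a small Python sketch. It is not taken from the Praat script: the function name, the use of F0 = 0 to mark unpitched frames, and every threshold value are hypothetical placeholders chosen only to show the decision order (silence first, then vowel, then fricative, then other).

```python
def prelabel_frame(intensity_norm, hnr_db, f0_hz, f1_hz,
                   silence_floor=0.05, hnr_voiced_db=5.0,
                   f1_min_hz=250.0, fric_intensity=0.15):
    """Return 'silence', 'vowel', 'fricative', or 'other' for one frame.

    All thresholds are hypothetical placeholders, not values from the script.
    intensity_norm is assumed to be normalized to the range 0..1.
    """
    # Silence: very low intensity, checked before anything else.
    if intensity_norm < silence_floor:
        return "silence"

    # Assumption: an unpitched frame is reported with F0 = 0.
    pitched = f0_hz > 0.0

    # Vowel: high HNR, pitched, prominent F1.
    if pitched and hnr_db >= hnr_voiced_db and f1_hz >= f1_min_hz:
        return "vowel"

    # Fricative: low HNR, unpitched, sufficient intensity.
    if not pitched and hnr_db < hnr_voiced_db and intensity_norm >= fric_intensity:
        return "fricative"

    # Everything else above the silence floor.
    return "other"
```

Labels produced this way would serve as training targets for the network, which then generalizes from the 18-dimensional feature vectors rather than from the hard thresholds.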

Quick start

  1. In Praat, select exactly one Sound object.
  2. Choose Run script… and select Neural Phonetic Harmonizer.praat.
  3. Set harmony intervals in semitones for each phonetic class (defaults: vowel +7, fricative -5, other +4, silence 0).
  4. Adjust the mix levels to control how prominently each harmony voice appears (values above 1.0 emphasize a voice; values below 1.0 reduce it).
  5. Set confidence threshold (default 0.20) to filter which frames receive harmonization.
  6. Configure neural network parameters if desired (defaults work well for most cases).
  7. Enable create_stereo for side-by-side comparison (original left, harmonized right).
  8. Enable play_result to hear the output immediately.
  9. Click OK and wait for processing (may take a minute for longer audio files).
  10. The output object, named [OriginalName]_harmonized or [OriginalName]_harmonized_stereo, is created.
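
When choosing intervals in step 3, it can help to see what a semitone value means as a frequency ratio. The short Python sketch below uses the standard equal-temperament relation ratio = 2^(semitones/12); the function name is illustrative and the code is not part of the Praat script.

```python
def semitones_to_ratio(semitones):
    """Convert a pitch interval in semitones to a frequency ratio (12-TET)."""
    return 2.0 ** (semitones / 12.0)


# Default intervals from the form, per phonetic class.
defaults = {"vowel": 7.0, "fricative": -5.0, "other": 4.0, "silence": 0.0}

for name, st in defaults.items():
    print(f"{name:9s} {st:+5.1f} st -> x{semitones_to_ratio(st):.4f}")
```

A shift of +7 semitones multiplies F0 by roughly 1.498, close to the 3:2 ratio of a just perfect fifth, while 0 semitones leaves the pitch unchanged.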

Parameters (form fields)

| Name (GUI) | Type | Default | Description |
|---|---|---|---|
| vowel_semitones | real | 7.0 | Pitch shift interval in semitones for vowel-classified frames (default: perfect fifth up). |
| fric_semitones | real | -5.0 | Pitch shift interval in semitones for fricative-classified frames (default: perfect fourth down). |
| other_semitones | real | 4.0 | Pitch shift interval in semitones for other sound frames (default: major third up). |
| silence_semitones | real | 0.0 | Pitch shift interval in semitones for silence-classified frames (default: no shift). |
| vowel_mix | positive | 1.2 | Base mixing level (0.0-2.0+) for vowel harmony voice. Values above 1.0 emphasize the harmony; below 1.0 reduce it. |
| fric_mix | positive | 0.9 | Base mixing level for fricative harmony voice. |
| other_mix | positive | 1.0 | Base mixing level for other sounds harmony voice. |
| silence_mix | positive | 0.2 | Base mixing level for silence frames (typically kept low). |
| confidence_threshold | positive | 0.20 | Minimum classification probability (0.0-1.0) required for a frame to receive harmonization. Lower values harmonize more frames; higher values only harmonize confident classifications. |
| temperature | positive | 0.3 | Softmax temperature parameter (0.1-2.0). Lower values create sharper class distinctions; higher values create smoother blending between classes. |
| voiced_boost | positive | 0.2 | Additional weight (0.0-1.0) applied to vowel harmony during voiced, harmonic segments. Increases vowel harmony prominence for pitched sounds. |
| hidden_units | integer | 24 | Number of hidden layer neurons in the feedforward neural network (typical range: 12-48). |
| training_iterations | integer | 3000 | Maximum number of training iterations for the neural network. Training may stop early if convergence is detected. |
| learning_rate | positive | 0.001 | Learning rate for neural network training (typical range: 0.0001-0.01). |
| create_stereo | boolean | 1 (true) | If enabled, creates stereo output with original audio in left channel and harmonized audio in right channel. If disabled, outputs mono harmonized audio only. |
| play_result | boolean | 1 (true) | If enabled, automatically plays the processed audio after completion. |
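
To give a feel for how temperature, confidence_threshold, and the mix levels interact, here is a hedged Python sketch. The guide does not spell out how the script combines these values, so the gating rule (weight below threshold gives zero gain) and the product mix × weight are assumptions, as are the function names. The voiced boost is left out for simplicity.

```python
import math


def softmax_t(scores, temperature=0.3):
    """Temperature-scaled softmax: lower temperature gives a sharper distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def frame_gains(scores, mix_levels, temperature=0.3, confidence_threshold=0.20):
    """Per-voice gain for one frame.

    Assumption (not confirmed by the guide): a voice is muted when its weight
    falls below confidence_threshold, otherwise gain = mix level * weight.
    """
    weights = softmax_t(scores, temperature)
    return [mix * w if w >= confidence_threshold else 0.0
            for w, mix in zip(weights, mix_levels)]


# Hypothetical network outputs for one frame: vowel, fricative, other, silence.
scores = [0.7, 0.1, 0.1, 0.1]
mix_levels = [1.2, 0.9, 1.0, 0.2]  # form defaults

print(softmax_t(scores, temperature=0.1))  # sharp: vowel dominates
print(softmax_t(scores, temperature=2.0))  # smooth: weights nearly equal
print(frame_gains(scores, mix_levels))
```

Under these assumptions, lowering temperature concentrates the gain in the most likely class, while raising confidence_threshold mutes the voices for classes the network considers unlikely in that frame.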

Outputs