Noise Vocoder — User Guide

Multi-band spectral analysis and resynthesis: the script extracts the spectral envelope of a source sound with a Bark-scale filterbank and applies that envelope to a noise carrier, producing robotic, whispered, or synthetic vocal textures.

Author: Shai Cohen
Affiliation: Department of Music, Bar-Ilan University, Israel
Version: 0.1 (2025)
License: MIT License
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

Contents:

What this does
Quick start
Vocoder Theory
Parameters & Presets
Applications

What this does

This script implements a noise vocoder, a classic audio effect that analyzes the spectral envelope (formant structure) of a source sound and imposes it on a noise carrier. The source is divided into frequency bands (16 by default) spaced evenly on the Bark scale, which is approximately perceptually uniform. For each band the script:

1. Extracts the intensity contour of the source (envelope tracking)
2. Generates white noise and filters it to the same band
3. Multiplies the noise by the intensity contour (amplitude modulation)
4. Adds the result to the running sum of all bands

Typical results: robotic or whispered speech, synthetic vocal textures, gender-neutral voices, privacy-oriented speech (intelligible, with the speaker harder to recognize), and experimental sound design. Unlike vocoders that use a pitched carrier, this one uses noise only, so the output carries no pitch and has the characteristic "whispered robot" sound.

Key Features:

Bark-Scale Filterbank — Perceptually uniform frequency division (16-32 bands typical)
Intensity Envelope Tracking — Captures temporal dynamics per band
Noise Carrier — White noise replaces harmonic excitation
Band Summation — Reconstructs full spectrum from modulated bands
5 Presets — Default, More Bands, Wider Range, Stronger Noise, Smoother Filter
Adjustable Parameters — Band count, frequency range, smoothing, noise level

What is a vocoder? The name is short for "voice encoder". The vocoder was developed in the 1930s and 1940s by Homer Dudley at Bell Labs for speech compression and encryption. Principle: separate the excitation (pitch or noise source) from the filter (spectral envelope, formants). Analysis extracts the envelope from the modulator; synthesis applies that envelope to a carrier.

Types:

Channel vocoder: filterbank analysis (this script)
Phase vocoder: FFT-based (a different technique)
LPC vocoder: Linear Predictive Coding
Formant vocoder: tracks resonances directly

Musical history: 1970s electronic music (Kraftwerk, ELO), funk and disco "robot voices", modern EDM and pop vocal effects. This script uses a noise carrier, which gives an unvoiced, whispered character. A pitched carrier gives the "singing robot" sound (Daft Punk style) and requires two inputs.

Technical Implementation:

1. Bark-scale band calculation:
- Convert the frequency limits to the Bark scale (perceptually linear)
- Divide the range into number_of_bands equal steps in Bark
- Convert each band boundary back to Hz
- Result: narrower bands at low frequencies, wider bands at high frequencies (matches the ear's sensitivity)
2. Per-band processing (iterative):
- Filter the source audio to the band (Hann band-pass)
- Extract RMS and intensity contour (temporal envelope)
- Generate white noise at full duration
- Filter the noise to the same band
- Multiply the noise by the intensity contour (AM)
- Scale to match the source RMS (amplitude calibration)
- Add to the previous bands (accumulation)
3. Output: the sum of all modulated noise bands is the vocoded result.

Key insight: the spectral envelope is preserved and the pitch is removed. Intelligibility is maintained (formants carry the phonetic information), but much of the voice identity is lost.

Processing time: 2-10 seconds per band; total time grows with both the number of bands and the duration of the sound.

Quick start

In Praat, select exactly one Sound object (preferably voice or pitched instrument).
Run script… → Vocoding.praat.
Choose Preset: Default (16 bands), More Bands (24), Wider Frequency Range, Stronger Noise, or Smoother Filter.
If Custom: adjust number_of_bands (8-32), frequency limits (50-11000 Hz), filter_smoothing, noise_amplitude.
Click OK — processing analyzes bands, displays progress in Info window, auto-plays vocoded result.

Quick tip: Start with Default (16 bands) for classic vocoder sound. Try More Bands (24) for higher intelligibility and smoother sound. Use Wider Frequency Range for full-spectrum processing (20-15000 Hz). Stronger Noise increases carrier level (more "hiss"). Smoother Filter reduces band isolation (more blending). Processing time: ~5-20 seconds depending on duration and band count (each band processed sequentially). Info window shows band boundaries in Hz during processing. Output named "originalname_vocoded". Works best on speech or sustained tones — percussive material may lose definition.

Important: LONG PROCESSING TIME. Bands are processed one after another, so the result is not instant.

More bands = longer processing (linear scaling)
Very long files (>1 minute) may take minutes to process; consider trimming to a representative section while experimenting
Very few bands (<8) give poor intelligibility and robotic artifacts
Very many bands (>32) increase processing time with diminishing returns
Very low noise_amplitude (<0.05) gives weak output; very high (>0.3) overwhelms the envelope and sounds like pure noise
filter_smoothing affects band isolation: too low (<25 Hz) gives harsh transitions, too high (>150 Hz) makes bands overlap excessively
Best results come from clear, pitched, sustained material (speech, vocals, winds); percussive sounds lose their attack transients

Vocoder Theory

Source-Filter Model of Sound

Fundamental Concept

Any sound = Source × Filter

Source-Filter Theory (speech example):

SOURCE (excitation):
- Voiced sounds: vocal fold vibration (periodic, pitched)
- Unvoiced sounds: turbulent air (noise)
- Provides fundamental energy

FILTER (vocal tract):
- Resonant cavities (throat, mouth, nose)
- Shapes spectrum via formants
- Creates phonetic identity

SOUND = Source × Filter
- /a/ = buzz × /a/-shaped resonances
- /s/ = noise × /s/-shaped filtering

Vocoder principle: Separate source and filter, then recombine with a different source

Vocoder Operation

Analysis stage (modulator):

Divide spectrum into bands
Extract envelope (loudness over time) per band
Result: spectral envelope = filter characteristics

Synthesis stage (carrier + envelopes):

Generate carrier signal (noise, tone, or audio)
Divide carrier into same bands
Modulate each band by extracted envelope
Sum all bands = reconstructed sound

This script uses a noise carrier, so the pitched excitation of the source is replaced by noise.

Bark Scale Filterbank

Why Bark Scale?

Problem with linear Hz spacing:

Human hearing not linear in frequency
More sensitive to changes below 1000 Hz
Less sensitive above 1000 Hz
Linear bands waste resolution at high freq, underresolve low freq

Bark scale solution:

Perceptually uniform scale (1 Bark ≈ one critical band)
Linear in Bark = perceptually equal steps
Results in narrower Hz bands at low freq, wider at high freq
Matches ear's frequency resolution

🎵 Bark Scale Mathematics

Conversion formulas (approximate):

Hertz to Bark:
Bark = 13 × atan(0.00076 × Hz) + 3.5 × atan((Hz/7500)²)

Bark to Hertz:
Hz = 1960 × (Bark + 0.53) / (26.28 - Bark)

Example conversions (Hertz to Bark formula):
100 Hz ≈ 1.0 Bark
500 Hz ≈ 4.7 Bark
1000 Hz ≈ 8.5 Bark
5000 Hz ≈ 18.5 Bark
10000 Hz ≈ 22.4 Bark

Total audible range: ~24 Bark (20-16000 Hz)
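The conversion formulas above can be checked numerically. The following is a minimal Python sketch of the two formulas as written in this guide; the function names hz_to_bark and bark_to_hz are illustrative and are not taken from the script. The Bark-to-Hertz formula is not the algebraic inverse of the Hertz-to-Bark formula, so a round trip is only approximate.

```python
import math


def hz_to_bark(hz):
    """Hertz to Bark, using the arctangent approximation quoted in this guide."""
    return 13.0 * math.atan(0.00076 * hz) + 3.5 * math.atan((hz / 7500.0) ** 2)


def bark_to_hz(bark):
    """Bark to Hertz, using the rational approximation quoted in this guide."""
    return 1960.0 * (bark + 0.53) / (26.28 - bark)


# Print the example frequencies so they can be compared with the list above.
for hz in (100, 500, 1000, 5000, 10000):
    print(f"{hz:>6} Hz -> {hz_to_bark(hz):5.2f} Bark")
```

Because the two approximations differ, bark_to_hz(hz_to_bark(f)) returns a value near f rather than f itself.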

Band Distribution Example

16 bands, 50-11000 Hz (Default preset):

Bark range: 0.5 → 21.5 Bark (21 Bark total)
Step size: 21/16 = 1.31 Bark per band

Band	Bark Range	Hz Range	Bandwidth
1	0.5-1.81	50-196	146 Hz
2	1.81-3.12	196-360	164 Hz
3	3.12-4.44	360-548	188 Hz
4	4.44-5.75	548-764	216 Hz
5	5.75-7.06	764-1012	248 Hz
...	...	...	...
12	15.5-16.8	2884-3580	696 Hz
13	16.8-18.1	3580-4484	904 Hz
14	18.1-19.4	4484-5659	1175 Hz
15	19.4-20.7	5659-7337	1678 Hz
16	20.7-22.0	7337-9891	2554 Hz

Note: Bandwidth increases with frequency. Perceptually equal spacing in Bark means unequal spacing in Hz. The figures are approximate, and the upper rows do not line up exactly with the 1.31 Bark step or the 11000 Hz limit; for exact values, use the band boundaries printed in the Info window.
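The band layout can be reproduced in outline with a few lines of Python. This sketch assumes a Hertz-to-Bark formula that is the exact inverse of the Bark-to-Hertz formula quoted in this guide; the script itself may use a different conversion, so the printed edges will not match the table row for row. The function name band_edges is hypothetical.

```python
def hz_to_bark(hz):
    """Assumed conversion: the exact algebraic inverse of bark_to_hz below."""
    return 26.81 * hz / (1960.0 + hz) - 0.53


def bark_to_hz(bark):
    """Bark to Hertz, as quoted in this guide."""
    return 1960.0 * (bark + 0.53) / (26.28 - bark)


def band_edges(lower_hz, upper_hz, number_of_bands):
    """Return number_of_bands + 1 edges in Hz, equally spaced in Bark."""
    low_bark = hz_to_bark(lower_hz)
    high_bark = hz_to_bark(upper_hz)
    step = (high_bark - low_bark) / number_of_bands
    return [bark_to_hz(low_bark + i * step) for i in range(number_of_bands + 1)]


edges = band_edges(50, 11000, 16)
for band, (low, high) in enumerate(zip(edges[:-1], edges[1:]), start=1):
    print(f"band {band:2d}: {low:8.1f} - {high:8.1f} Hz  (width {high - low:7.1f} Hz)")
```

Whatever conversion is used, the pattern is the same: equal steps in Bark give bands that get wider in Hz as frequency rises.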

Intensity Envelope Extraction

What is Intensity?

Intensity = short-term sound energy in dB SPL, a rough correlate of loudness

Not instantaneous amplitude (too fast, includes pitch oscillations)
Smoothed energy over short windows (~10-100 ms)
Captures envelope = slow amplitude changes
Represents "how loud" over time, ignoring "what pitch"

Intensity calculation (per band):

1. Filter source to band (isolate frequency region)
2. Calculate intensity contour:
- Window size: related to minimum_pitch (100 Hz)
- Time step: 0.1 seconds (10 Hz sampling rate)
- Result: intensity values over time
3. Convert to IntensityTier:
- Time-stamped amplitude values
- Interpolated between points
- Used to modulate noise carrier
4. Multiply noise by intensity:
- Noise amplitude × intensity_envelope
- Preserves temporal dynamics
- Removes source pitch, keeps envelope
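The envelope steps above can be sketched in plain Python. This is a simplified illustration, not Praat's intensity algorithm: it assumes a rectangular window of 3.2 / minimum_pitch seconds (0.032 s for 100 Hz), works in linear amplitude instead of dB, and uses the hypothetical helper names rms_envelope, envelope_at, and apply_envelope.

```python
import math


def rms_envelope(samples, sample_rate, time_step=0.1, window=0.032):
    """Return (times, values): RMS in a window centered on each time step."""
    half = max(1, round(window * sample_rate / 2))
    hop = max(1, round(time_step * sample_rate))
    times, values = [], []
    for center in range(0, len(samples), hop):
        frame = samples[max(0, center - half):center + half]
        values.append(math.sqrt(sum(s * s for s in frame) / len(frame)))
        times.append(center / sample_rate)
    return times, values


def envelope_at(t, times, values):
    """Linear interpolation between envelope points, flat outside the range."""
    if t <= times[0]:
        return values[0]
    for i in range(1, len(times)):
        if t <= times[i]:
            frac = (t - times[i - 1]) / (times[i] - times[i - 1])
            return values[i - 1] + frac * (values[i] - values[i - 1])
    return values[-1]


def apply_envelope(noise, sample_rate, times, values):
    """Multiply each noise sample by the interpolated envelope value."""
    return [s * envelope_at(i / sample_rate, times, values)
            for i, s in enumerate(noise)]


# Demo: a 250 Hz tone that is loud for 0.5 s, then quiet for 0.5 s.
fs = 8000
tone = [(1.0 if i < fs // 2 else 0.2) * math.sin(2 * math.pi * 250 * i / fs)
        for i in range(fs)]
times, values = rms_envelope(tone, fs)
for t, v in zip(times, values):
    print(f"{t:4.1f} s  rms = {v:.3f}")
```

The envelope follows the loud-to-quiet change but contains no trace of the 250 Hz oscillation, which is why the source pitch does not survive the modulation step.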

Why This Preserves Intelligibility

Speech intelligibility depends on:

Formants: Spectral envelope peaks (vowel identity)
Temporal envelope: Amplitude changes (consonant onsets, rhythm)
Not pitch: Fundamental frequency is less critical for understanding (at least in non-tonal languages)

Vocoder preserves:

✓ Formant structure (via band envelopes)
✓ Temporal dynamics (via intensity contours)
✗ Pitch information (replaced by noise)

Result: Whispered but intelligible speech

Noise Carrier Generation

White Noise Characteristics

Properties:

Equal power at all frequencies (flat spectrum)
Random amplitude values (Gaussian distribution)
No periodic structure (unpitched)
Provides neutral excitation for filtering

Noise generation (per band):

1. Create noise sound:
- Duration: match source duration
- Sample rate: 2 × upper_freq + 1000
- Formula: randomGauss(0, noise_amplitude)
2. Filter to band:
- Same frequency limits as source band
- Hann band-pass filter (smooth edges)
- Smoothing: filter_smoothing parameter
3. Result: band-limited noise
- Only frequencies within band present
- Ready for envelope modulation
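A small Python sketch of the noise step, under stated assumptions: make_noise mirrors the sample-rate rule and the Gaussian formula listed above, and hann_band_gain is a generic raised-cosine band-pass shape with ramps of width smoothing centered on each band edge. Both names are hypothetical, and the gain curve is an illustration of "smooth edges", not necessarily the exact curve Praat uses.

```python
import math
import random


def make_noise(duration, upper_freq, noise_amplitude=0.1, seed=None):
    """Gaussian white noise at a sample rate of 2 * upper_freq + 1000."""
    sample_rate = 2 * upper_freq + 1000
    rng = random.Random(seed)
    count = int(round(duration * sample_rate))
    return sample_rate, [rng.gauss(0.0, noise_amplitude) for _ in range(count)]


def hann_band_gain(freq, low, high, smoothing):
    """Gain of 1 inside the band, 0 outside, with cosine ramps at the edges.

    Assumes the band is wider than `smoothing`.
    """
    half = smoothing / 2.0
    if freq <= low - half or freq >= high + half:
        return 0.0
    if freq < low + half:
        return 0.5 - 0.5 * math.cos(math.pi * (freq - (low - half)) / smoothing)
    if freq > high - half:
        return 0.5 - 0.5 * math.cos(math.pi * ((high + half) - freq) / smoothing)
    return 1.0


sample_rate, noise = make_noise(duration=0.5, upper_freq=11000, seed=1)
print(f"{len(noise)} samples at {sample_rate} Hz")
for freq in (150, 196, 221, 278, 360, 400):
    print(f"{freq:4d} Hz -> gain {hann_band_gain(freq, 196, 360, 50):.2f}")
```

A larger smoothing value widens the ramps, so neighbouring bands share more of each other's frequencies.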

Why Noise (Not Other Carriers)?

Noise advantages:

No inherent pitch = neutral
Broadband = provides energy for all band filters
Simple to generate (no pitch tracking needed)
Creates "whispered" character (unvoiced)

Alternative carriers (not this script):

Pitched tone: Singing robot (requires a pitched carrier input or pitch tracking)
Another audio: Cross-synthesis (modulator A, carrier B)
Mixed: Pitch + noise (voiced/unvoiced distinction)

Band Summation and Reconstruction

Additive Synthesis

Principle: Filtered bands are linearly combined

Band accumulation process:

Band 1: [modulated noise 50-196 Hz]
↓
Band 2: [modulated noise 196-360 Hz]
↓ (add to Band 1)
Combined: [Bands 1+2]
↓
Band 3: [modulated noise 360-548 Hz]
↓ (add to Bands 1+2)
Combined: [Bands 1+2+3]
↓
...
↓
Band 16: [modulated noise 7337-9891 Hz]
↓ (add to Bands 1-15)
FINAL: [All 16 bands summed]

Output = full spectrum reconstructed from parts

Amplitude Calibration

RMS matching per band:

Problem: Noise bands may have different RMS than source

Solution (per band):
1. Measure source band RMS: rms_SOURCE
2. Measure modulated noise RMS: rms_IS
3. Scale noise: noise × (rms_SOURCE / rms_IS)

Result: Each band has the same overall energy as the source band
- Preserves spectral balance across bands
- Prevents some bands from dominating
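The scaling step is a single gain per band. A minimal Python sketch follows; the variable names rms_source and rms_is echo the ones above, while the function names and the guard against a silent noise band are additions made for the example.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def match_rms(noise_band, source_band):
    """Scale noise_band so that its RMS equals the RMS of source_band."""
    rms_source = rms(source_band)
    rms_is = rms(noise_band)
    if rms_is == 0.0:
        # A silent band cannot be scaled up; return it unchanged.
        return list(noise_band)
    gain = rms_source / rms_is
    return [s * gain for s in noise_band]


source_band = [0.5, -0.5, 0.5, -0.5]
noise_band = [0.1, -0.3, 0.2, -0.1]
scaled = match_rms(noise_band, source_band)
print(f"source rms = {rms(source_band):.3f}, scaled noise rms = {rms(scaled):.3f}")
```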

Complete Processing Pipeline

FOR EACH BAND (1 to number_of_bands):

  STEP 1: Calculate band limits
    - Convert Bark boundaries to Hz
    - Add edge buffer for smoothing

  STEP 2: Analyze source
    - Filter source to band
    - Measure RMS (overall energy)
    - Extract intensity contour (envelope)

  STEP 3: Generate carrier
    - Create white noise (full duration)
    - Filter noise to same band

  STEP 4: Apply envelope
    - Multiply noise by intensity contour
    - Scale to match source RMS

  STEP 5: Accumulate
    - Add to previous bands
    - Store as "band'i'"

END FOR

RESULT: Sum of all modulated bands = vocoded output
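The loop above can be approximated end to end in a short, dependency-free Python sketch. It is a toy model under several assumptions that differ from the script: hard-edged bands selected in a naive DFT (no smoothing or edge buffer), a rectified moving-average envelope instead of an intensity contour, noise at the source sample rate, and band edges passed in directly. All function names are hypothetical.

```python
import cmath
import math
import random


def dft(x):
    """Naive O(n^2) discrete Fourier transform (fine for a short demo)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def band_signal(spectrum, fs, low, high):
    """Inverse DFT of only the bins whose frequency lies in [low, high)."""
    n = len(spectrum)
    kept = [(k, c) for k, c in enumerate(spectrum)
            if low <= min(k, n - k) * fs / n < high]
    return [sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in kept).real / n
            for t in range(n)]


def rms(x):
    return math.sqrt(sum(s * s for s in x) / len(x))


def envelope(x, width):
    """Rectify, then smooth with a moving average of about `width` samples."""
    rect = [abs(s) for s in x]
    half = width // 2
    out = []
    for t in range(len(rect)):
        frame = rect[max(0, t - half):t + half + 1]
        out.append(sum(frame) / len(frame))
    return out


def noise_vocode(source, fs, edges, noise_amplitude=0.1, seed=0):
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, noise_amplitude) for _ in source]
    source_spec, noise_spec = dft(source), dft(noise)
    output = [0.0] * len(source)
    for low, high in zip(edges[:-1], edges[1:]):               # STEP 1 (edges given)
        source_band = band_signal(source_spec, fs, low, high)  # STEP 2
        env = envelope(source_band, width=16)
        carrier = band_signal(noise_spec, fs, low, high)       # STEP 3
        modulated = [c * e for c, e in zip(carrier, env)]      # STEP 4
        mod_rms = rms(modulated)
        gain = rms(source_band) / mod_rms if mod_rms > 0 else 0.0
        output = [o + gain * m for o, m in zip(output, modulated)]  # STEP 5
    return output


# Demo: a short 300 Hz tone that fades out, vocoded with four made-up bands.
fs = 2000
n = 256
source = [(1.0 - t / n) * math.sin(2 * math.pi * 300 * t / fs) for t in range(n)]
edges = [100, 250, 400, 600, 900]
vocoded = noise_vocode(source, fs, edges)
print(f"source rms = {rms(source):.3f}, vocoded rms = {rms(vocoded):.3f}")
```

The naive transform keeps the example self-contained but is slow, so it is only practical for a few hundred samples; a real implementation would use an FFT.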

Comparison to Other Speech Effects

Effect	Method	Character	Intelligibility
Noise Vocoder	Envelope + noise	Whispered robot	High (formants preserved)
Pitch Vocoder	Envelope + tone	Singing robot	Very high
Whisper	Unvoiced excitation	Natural whisper	Moderate (no pitch cues)
Bitcrusher	Resolution reduction	Digital/lo-fi	Variable (artifact-dependent)
Ring Modulator	Multiplication by a carrier (sum and difference frequencies)	Metallic/inharmonic	Low (spectrum distorted)
Heavy EQ	Frequency filtering	Telephone/radio	High (if mids preserved)

Parameters & Presets

Preset Options

🎵 Default

Parameters: 16 bands, 50-11000 Hz, smoothing 50 Hz, noise 0.1

Character: Balanced vocoder, classic robotic speech

Best for: General use, speech vocoding, experimenting

📊 More Bands

Parameters: 24 bands (other defaults)

Character: Higher resolution, smoother, more intelligible

Best for: Clear speech, minimal artifacts, quality priority

🌊 Wider Frequency Range

Parameters: 20-15000 Hz (vs 50-11000 Hz default)

Character: Full spectrum, includes sub-bass and air

Best for: Music vocoding, full-bandwidth processing

💨 Stronger Noise

Parameters: Noise amplitude 0.2 (vs 0.1 default)

Character: More prominent "hiss," breathier quality

Best for: Emphasizing noise carrier, experimental textures

✨ Smoother Filter

Parameters: Smoothing 100 Hz (vs 50 Hz default)

Character: Blended bands, less isolation, softer edges

Best for: Reducing harshness, subtle vocoding
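The presets can be read as the Default values plus one override each. The Python sketch below records that reading as data, using the parameter names from this guide; the dictionary layout and the function resolve_preset are illustrative and do not describe how the script stores its presets.

```python
DEFAULTS = {
    "number_of_bands": 16,
    "lower_frequency_limit": 50,
    "upper_frequency_limit": 11000,
    "filter_smoothing": 50,
    "noise_amplitude": 0.1,
}

# Each preset lists only the values that differ from the Default preset.
PRESET_OVERRIDES = {
    "Default": {},
    "More Bands": {"number_of_bands": 24},
    "Wider Frequency Range": {"lower_frequency_limit": 20,
                              "upper_frequency_limit": 15000},
    "Stronger Noise": {"noise_amplitude": 0.2},
    "Smoother Filter": {"filter_smoothing": 100},
}


def resolve_preset(name):
    """Return the full parameter set for a preset name."""
    return {**DEFAULTS, **PRESET_OVERRIDES[name]}


for name in PRESET_OVERRIDES:
    print(name, resolve_preset(name))
```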

Custom Parameters

Parameter	Type	Default	Description
preset	option	Default	Choose preset configuration
number_of_bands	natural	16	Filterbank resolution (8-32 practical)
lower_frequency_limit	positive	50	Lowest band edge (Hz)
upper_frequency_limit	positive	11000	Highest band edge (Hz)
minimum_pitch	positive	100	For intensity extraction windowing
time_step	positive	0.1	Intensity contour sampling (seconds)
filter_smoothing	positive	50	Filter edge smoothness (Hz)
filter_edge_buffer	positive	25	Trim from band edges (Hz)
noise_amplitude	positive	0.1	Carrier noise level
play_after_processing	boolean	yes	Auto-play vocoded result
keep_intermediate_objects	boolean	no	Retain band objects for inspection
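Before a long run it can help to sanity-check custom values. The Python sketch below encodes the practical limits mentioned in this guide (8-32 bands, noise_amplitude 0.05-0.3, filter_smoothing 25-150 Hz) as warnings. The function check_parameters is hypothetical; the script itself is not known to perform these checks.

```python
def check_parameters(number_of_bands=16, lower_frequency_limit=50.0,
                     upper_frequency_limit=11000.0, filter_smoothing=50.0,
                     noise_amplitude=0.1):
    """Return a list of warnings; an empty list means nothing looks unusual."""
    warnings = []
    if lower_frequency_limit >= upper_frequency_limit:
        warnings.append("lower_frequency_limit should be below upper_frequency_limit")
    if number_of_bands < 8:
        warnings.append("fewer than 8 bands: poor intelligibility likely")
    elif number_of_bands > 32:
        warnings.append("more than 32 bands: long processing, diminishing returns")
    if noise_amplitude < 0.05:
        warnings.append("noise_amplitude below 0.05: weak output likely")
    elif noise_amplitude > 0.3:
        warnings.append("noise_amplitude above 0.3: may sound like pure noise")
    if filter_smoothing < 25:
        warnings.append("filter_smoothing below 25 Hz: harsh band transitions")
    elif filter_smoothing > 150:
        warnings.append("filter_smoothing above 150 Hz: excessive band overlap")
    return warnings


print(check_parameters())                                   # defaults
print(check_parameters(number_of_bands=4, noise_amplitude=0.5))
```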

Parameter Details

number_of_bands

Range: 8-32 (practical), 4-64 (extreme)

Default: 16

Effect:

8-12: Low resolution, robotic, obvious artifacts
12-20: Good balance, classic vocoder sound
20-28: High resolution, smooth, natural
>28: Diminishing returns, long processing time

Trade-off: More bands = better quality but slower processing

lower_frequency_limit & upper_frequency_limit

Range: 20-20000 Hz (audible range)

Defaults: 50 Hz (bass), 11000 Hz (treble)

Effect:

Narrow range (300-3000 Hz): Telephone bandwidth
Speech range (80-8000 Hz): Intelligibility-focused
Default (50-11000 Hz): Full speech + some music
Wide range (20-15000 Hz): Full spectrum, music vocoding

filter_smoothing

Range: 10-200 Hz

Default: 50 Hz

Effect:

Low (10-30 Hz): Sharp band edges, isolated bands, harsh
Medium (30-70 Hz): Balanced, some overlap
High (70-150 Hz): Smooth transitions, blended bands
Very high (>150 Hz): Excessive overlap, muddy