Audio Descriptions Analysis — User Guide
Comprehensive batch audio feature extraction: pitch, intensity, spectral characteristics, voice quality metrics, and perceptual descriptors for multiple sounds simultaneously.
What this does
This script performs comprehensive audio feature extraction on multiple sound files simultaneously, computing 19 quantitative descriptors covering pitch statistics, intensity dynamics, voice quality metrics (jitter, shimmer, harmonicity), and advanced spectral characteristics (centroid, spread, rolloff, flatness, roughness). Output: structured table with one row per sound file, ready for statistical analysis, machine learning, or comparative studies. Perfect for: corpus analysis, instrument classification, voice quality assessment, sound design comparison, experimental composition research. Automated batch processing — analyze hundreds of files in minutes without manual measurement.
Key Features:
- Batch Processing: Analyzes all sounds in the Objects window automatically
- 19 Features: Duration, 5 pitch metrics, 4 intensity metrics, 3 voice quality, 6 spectral
- Robust Error Handling: Handles undefined values, unvoiced sounds, edge cases
- CSV Export: Table format ready for Excel, R, Python, SPSS
- No Parameters: Runs without a dialog, using fixed defaults (for example a 75-600 Hz pitch range)
- Research-Ready: Metrics used in speech science, MIR, acoustics research
Technical Implementation:
(1) Auto-select all Sound objects in the workspace.
(2) Create an empty table with 19 column headers.
(3) For each sound:
- Extract pitch (autocorrelation, 75-600 Hz) and calculate pitch statistics (mean, min, max, median, stdev).
- Extract intensity (100 Hz smoothing) and calculate intensity statistics (max, min, median, variance).
- Create a point process for periodicity detection, then calculate jitter and shimmer (local perturbation).
- Calculate harmonicity (cepstral method).
- Perform spectral analysis (FFT).
- Calculate SPR (singing power ratio, 50-2000 Hz vs 2000-4000 Hz).
- Calculate spectral centroid and spread (weighted frequency distribution).
- Calculate spectral rolloff (85% energy threshold).
- Calculate spectral flatness (geometric/arithmetic mean ratio).
- Calculate spectral roughness (local irregularity).
(4) Fill the table row with all metrics.
(5) Export to CSV for external analysis.
Robustness: all undefined values are set to zero, unvoiced sounds are handled, and each calculation checks for sufficient data first.
Quick start
- Load multiple Sound files into Praat Objects window (Open → Read from file...).
- Run script… → descriptions.praat.
- No dialog: the script automatically analyzes all sounds.
- Processing displays progress in Info window.
- Result: Table Results appears in Objects window with all metrics.
- Export: Select table → Save as comma-separated file... → save as .csv.
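Once exported, the table can be read by any CSV-aware tool. The following is a minimal sketch using Python's standard `csv` module. The two data rows are made-up placeholders, and only the column names `SoundName`, `Pitch_min_Hz`, and `Pitch_max_Hz` are taken from this guide. In practice you would open your own exported file instead of the in-memory string.

```python
import csv
import io

# Made-up stand-in for an exported file; in practice pass an open file handle instead.
SAMPLE_CSV = """SoundName,Pitch_min_Hz,Pitch_max_Hz
voice_a,95.0,180.5
noise_b,0,0
"""

def read_results(handle):
    """Return one dict per sound, with every column except SoundName as float."""
    rows = []
    for record in csv.DictReader(handle):
        row = {"SoundName": record["SoundName"]}
        for key, value in record.items():
            if key != "SoundName":
                row[key] = float(value)
        rows.append(row)
    return rows

rows = read_results(io.StringIO(SAMPLE_CSV))
for row in rows:
    # Pitch range = max - min, as described under Pitch_min_Hz & Pitch_max_Hz.
    pitch_range = row["Pitch_max_Hz"] - row["Pitch_min_Hz"]
    print(row["SoundName"], "pitch range (Hz):", pitch_range)
```

Converting every feature column to `float` up front keeps later arithmetic simple. It assumes the export contains plain numbers, as the table description states.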
Feature Descriptions
Complete Feature Set (19 Metrics)
📊 Extracted Features
Temporal: 1 feature
Pitch: 5 features
Intensity: 4 features
Voice Quality: 3 features
Spectral: 6 features
1. Temporal Features
Duration_s
Unit: Seconds
Description: Total duration of sound file
Calculation: End time - start time
Interpretation:
- Simple temporal measurement
- Useful for normalizing other metrics
- Important for rhythm/timing studies
2. Pitch Features (F0 Statistics)
Pitch_mean_Hz
Unit: Hertz (Hz)
Description: Average fundamental frequency across voiced portions
Method: Autocorrelation (raw cc), 75-600 Hz range
Interpretation:
- Speech: ~120 Hz (male), ~220 Hz (female), ~300 Hz (child)
- Singing: Varies by register (bass ~100 Hz, soprano ~400 Hz)
- Instruments: Depends on tuning (A4 = 440 Hz standard)
- Zero: Indicates unpitched sound (noise, percussion, silence)
Pitch_min_Hz & Pitch_max_Hz
Unit: Hertz (Hz)
Description: Lowest and highest detected fundamental frequencies
Interpretation:
- Range = max - min: Pitch variability
- Large range: expressive speech, melodic singing, vibrato
- Small range: monotone speech, steady instruments
- Useful for: tessitura analysis, vocal range studies
Pitch_median_Hz
Unit: Hertz (Hz)
Description: 50th percentile (middle value) of pitch distribution
Advantage over mean: Robust to outliers (octave errors, glitches)
Interpretation:
- More reliable than mean for noisy pitch tracks
- Compare to mean: large difference indicates skewed distribution or errors
Pitch_stdev_Hz
Unit: Hertz (Hz)
Description: Standard deviation of fundamental frequency
Interpretation:
- Low stdev (<10 Hz): Stable pitch (monotone, sustained tone)
- Medium stdev (10-50 Hz): Moderate variation (expressive speech)
- High stdev (>50 Hz): Large pitch range (singing, melodic speech)
- Correlates with: Prosodic expressiveness, melodic activity
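To make the five pitch statistics concrete, here is a small sketch that summarises a frame-wise F0 track in Python. It is an illustration, not the script's own code. It assumes unvoiced frames are marked with 0 or `None`, and it uses the sample standard deviation, since the guide does not say which variant the script computes. The track values are made up, with one 220 Hz frame mimicking an octave error.

```python
import statistics

def pitch_statistics(f0_track):
    """Summarise a frame-wise F0 track (Hz); frames equal to 0 or None count as unvoiced."""
    voiced = [f for f in f0_track if f]  # drops 0 and None
    if not voiced:
        # Mirrors the guide's convention: undefined values are reported as zero.
        return {"mean": 0.0, "min": 0.0, "max": 0.0, "median": 0.0, "stdev": 0.0}
    return {
        "mean": statistics.fmean(voiced),
        "min": min(voiced),
        "max": max(voiced),
        "median": statistics.median(voiced),
        # Sample standard deviation needs at least two voiced frames.
        "stdev": statistics.stdev(voiced) if len(voiced) > 1 else 0.0,
    }

track = [0, 110.0, 112.0, 220.0, 111.0, 0]  # made-up values
stats = pitch_statistics(track)
print(stats)
```

The single outlier pulls the mean up to 138.25 Hz while the median stays at 111.5 Hz. This is the robustness to octave errors described for Pitch_median_Hz.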
3. Intensity Features (Loudness Dynamics)
Intensity_max_dB & Intensity_min_dB
Unit: Decibels (dB SPL)
Description: Peak and minimum intensity values
Smoothing: 100 Hz window (removes sample-level fluctuations)
Interpretation:
- Dynamic range = max - min: Loudness variation
- Large range: high dynamics (classical music, expressive speech)
- Small range: compressed (pop music, broadcast speech)
- Typical speech: 20-30 dB range
Intensity_median_dB
Unit: Decibels (dB SPL)
Description: Median intensity (50th percentile)
Interpretation:
- Representative "average loudness"
- More robust than mean for signals with silence/noise
Intensity_variance_dB
Unit: dB² (squared decibels)
Description: Variance of intensity (stdev²)
Interpretation:
- Measures spread of loudness distribution
- High variance: uneven dynamics, high contrast
- Low variance: steady loudness, compressed signal
4. Voice Quality Features (Perturbation Metrics)
Jitter_local
Unit: Ratio (dimensionless)
Description: Average absolute difference between consecutive pitch periods, divided by the average period
Formula: Jitter = [(1/(N-1)) Σ |T(i) - T(i-1)|] / mean(T), summed over i = 2..N
Interpretation:
- Healthy voice: <1% (0.01)
- Rough/hoarse voice: >1.5% (0.015)
- Pathological: >3% (0.03)
- Clinical use: Voice disorder diagnosis
- Music: Expressive techniques (vibrato has high jitter)
Shimmer_local
Unit: Ratio (dimensionless)
Description: Average absolute difference between consecutive peak amplitudes, divided by the average amplitude
Formula: Shimmer = [(1/(N-1)) Σ |A(i) - A(i-1)|] / mean(A), summed over i = 2..N
Interpretation:
- Healthy voice: <3% (0.03)
- Breathy voice: >5% (0.05)
- Pathological: >10% (0.1)
- Correlates with: Breathiness, irregular glottal closure
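Jitter_local and Shimmer_local share the same form: the mean absolute difference between consecutive cycles, divided by the mean value. The sketch below applies one helper to both. The function name `local_perturbation` and the input values are illustrative assumptions. In the script itself, the periods and peak amplitudes come from the point process.

```python
def local_perturbation(values):
    """Mean absolute difference between consecutive values, divided by the mean value."""
    if len(values) < 2:
        return 0.0  # not enough cycles; the guide reports undefined values as zero
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_value = sum(values) / len(values)
    if mean_value == 0:
        return 0.0
    # N values give N-1 consecutive differences, so average over len(diffs).
    return (sum(diffs) / len(diffs)) / mean_value

periods_s = [0.0100, 0.0102, 0.0099, 0.0101]  # made-up glottal periods (seconds)
amplitudes = [0.50, 0.52, 0.49, 0.51]         # made-up peak amplitudes
jitter = local_perturbation(periods_s)
shimmer = local_perturbation(amplitudes)
print(f"jitter = {jitter:.4f}, shimmer = {shimmer:.4f}")
```

Both results are ratios. Multiply by 100 to compare them with the percentage thresholds listed above.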
Harmonicity_dB (HNR)
Unit: Decibels (dB)
Description: Harmonics-to-Noise Ratio — signal periodicity measure
Method: Cepstral analysis
Interpretation:
- >20 dB: Clear, periodic signal (healthy voice, musical instrument)
- 10-20 dB: Moderate noise (normal conversational speech)
- <10 dB: Noisy signal (breathy voice, distorted sound)
- Negative: More noise than harmonic content
- Clinical: Voice pathology indicator
- Music: Distinguishes clean vs distorted tones
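The dB thresholds above are easier to read as energy fractions. By definition, HNR is ten times the base-10 logarithm of the ratio between the energy in the periodic part and the energy in the noise part. The sketch below only converts between the two views. It says nothing about how the script estimates HNR, and the function names are illustrative.

```python
import math

def hnr_db(fraction):
    """HNR in dB for a signal whose periodic part carries the given fraction of energy."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    return 10.0 * math.log10(fraction / (1.0 - fraction))

def harmonic_fraction(hnr):
    """Inverse mapping: fraction of energy in the periodic part for a given HNR (dB)."""
    ratio = 10.0 ** (hnr / 10.0)
    return ratio / (1.0 + ratio)

for level in (20.0, 10.0, 0.0, -3.0):
    print(f"{level:+.0f} dB -> {harmonic_fraction(level):.3f} of the energy is periodic")
```

At 0 dB the periodic and noise parts carry equal energy, which is why negative values indicate more noise than harmonic content.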
5. Spectral Features (Frequency Domain)
SPR_dB (Singing Power Ratio)
Unit: Decibels (dB)
Description: Difference between low band (50-2000 Hz) and high band (2000-4000 Hz) maxima
Formula: SPR = max(50-2000Hz) - max(2000-4000Hz)
Interpretation:
- Positive SPR: More low-frequency energy (typical speech, bass instruments)
- Near-zero SPR: Balanced spectrum
- Negative SPR: More high-frequency energy (cymbals, sibilants, brightness)
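- Singing application: Trained singers develop a "singer's formant" around 2500-3500 Hz, which raises the high-band peak and therefore lowers SPR as defined by the formula above
- Acoustic projection: Low SPR = relatively more energy in the 2000-4000 Hz band, which helps a voice carry over an orchestra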
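The Formula line above (low-band peak minus high-band peak) can be sketched as follows. The spectra are made-up dB levels at a handful of frequencies. Assigning exactly 2000 Hz to the high band is an assumption, since the guide does not say which band owns the boundary.

```python
def singing_power_ratio(freqs_hz, levels_db):
    """Low-band peak (50-2000 Hz) minus high-band peak (2000-4000 Hz), in dB."""
    low = [level for f, level in zip(freqs_hz, levels_db) if 50 <= f < 2000]
    high = [level for f, level in zip(freqs_hz, levels_db) if 2000 <= f <= 4000]
    if not low or not high:
        return 0.0  # undefined values are reported as zero in this guide
    return max(low) - max(high)

freqs = [100, 500, 1500, 2500, 3000, 3500]
speechlike = [60, 55, 45, 30, 28, 25]  # made-up: weak energy above 2 kHz
ringing = [60, 55, 45, 50, 52, 40]     # made-up: strong peak near 3 kHz

print(singing_power_ratio(freqs, speechlike))  # 60 - 30
print(singing_power_ratio(freqs, ringing))     # 60 - 52
```

With this sign convention, a stronger peak in the 2000-4000 Hz band gives a smaller SPR value.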
Spectral_centroid_Hz
Unit: Hertz (Hz)
Description: Center of gravity of spectrum — frequency-weighted mean
Formula: Centroid = Σ(f × magnitude) / Σ(magnitude)
Interpretation:
- Perceptual correlate: Brightness, sharpness
- Low centroid (<1000 Hz): Dark, warm, muffled (cello, male voice)
- Mid centroid (1000-3000 Hz): Balanced (piano, female voice)
- High centroid (>3000 Hz): Bright, sharp, harsh (cymbals, sibilants)
- MIR application: Instrument classification, timbre analysis
Spectral_spread_Hz
Unit: Hertz (Hz)
Description: Standard deviation of spectrum around centroid — bandwidth measure
Formula: Spread = √[Σ((f - centroid)² × magnitude) / Σ(magnitude)]
Interpretation:
- Low spread (<500 Hz): Narrow bandwidth (sine wave, whistle, flute)
- Mid spread (500-1500 Hz): Moderate bandwidth (voice, most instruments)
- High spread (>1500 Hz): Wide bandwidth (noise, crash cymbals)
- Combination with centroid: High centroid + low spread = pure high tone; Low centroid + high spread = rich low tone
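The centroid and spread formulas can be checked on a toy spectrum. This sketch weights by magnitude, as the two Formula lines are written. Whether the script weights by magnitude or by power is not stated in this guide, so treat that as an assumption. The three-bin spectra are made up.

```python
import math

def centroid_and_spread(freqs_hz, magnitudes):
    """Return (centroid, spread) in Hz for a magnitude spectrum."""
    total = sum(magnitudes)
    if total == 0:
        return 0.0, 0.0  # silent input: report zero rather than divide by zero
    centroid = sum(f * m for f, m in zip(freqs_hz, magnitudes)) / total
    variance = sum((f - centroid) ** 2 * m for f, m in zip(freqs_hz, magnitudes)) / total
    return centroid, math.sqrt(variance)

freqs = [1000.0, 2000.0, 3000.0]
narrow = [0.0, 1.0, 0.0]  # all energy in one bin
wide = [1.0, 1.0, 1.0]    # energy spread evenly

print(centroid_and_spread(freqs, narrow))
print(centroid_and_spread(freqs, wide))
```

Both spectra have the same centroid (2000 Hz) but very different spreads, which is why the two measures are reported together.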
Spectral_rolloff_Hz
Unit: Hertz (Hz)
Description: Frequency below which 85% of spectral energy is contained
Calculation: Cumulative energy threshold
Interpretation:
- Low rolloff (<2000 Hz): Energy concentrated in bass/mids (bass, kick drum)
- Mid rolloff (2000-5000 Hz): Balanced energy distribution (voice, guitar)
- High rolloff (>5000 Hz): Significant high-frequency content (cymbals, hi-hats)
- Complements centroid: Marks where the upper edge of the energy lies rather than its center, which helps separate dull from bright sounds
- MIR use: Genre classification (metal has higher rolloff than jazz)
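The cumulative energy threshold can be sketched as a running sum over the power spectrum. The function below returns the first bin frequency at which the running sum reaches 85% of the total. Returning the bin frequency without interpolation is a simplification, and the spectra are made up.

```python
def spectral_rolloff(freqs_hz, power, fraction=0.85):
    """Lowest bin frequency below which `fraction` of the total power is contained."""
    total = sum(power)
    if total <= 0:
        return 0.0
    threshold = fraction * total
    cumulative = 0.0
    for f, p in zip(freqs_hz, power):
        cumulative += p
        if cumulative >= threshold:
            return f
    return freqs_hz[-1]  # only reached through rounding; fall back to the top bin

freqs = [500.0, 1000.0, 2000.0, 4000.0, 8000.0]
bass_heavy = [8.0, 1.0, 0.5, 0.3, 0.2]  # made-up: energy concentrated at the bottom
bright = [2.0, 2.0, 2.0, 2.0, 2.0]      # made-up: energy spread evenly

print(spectral_rolloff(freqs, bass_heavy))
print(spectral_rolloff(freqs, bright))
```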
Spectral_flatness
Unit: Ratio 0-1 (dimensionless)
Description: Ratio of geometric mean to arithmetic mean of spectrum (80-5000 Hz)
Formula: Flatness = exp(mean(ln(power))) / mean(power)
Interpretation:
- 0 (tonal): Pure tones, harmonics (sine wave, sustained notes)
- ~0.3 (mixed): Combination of tones and noise (voiced fricatives, bowed strings)
- ~0.7-1.0 (noisy): White noise, unvoiced speech, crash cymbals
- Perceptual: Tonality vs noisiness
- Speech: Vowels ~0.1, fricatives ~0.6-0.9
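A short sketch of the flatness formula, assuming the caller has already restricted the power spectrum to the 80-5000 Hz band. Skipping zero-power bins is an assumption made here so that the logarithm stays defined. The guide does not say how the script handles them.

```python
import math

def spectral_flatness(power):
    """Geometric mean divided by arithmetic mean of the (positive) power values."""
    positive = [p for p in power if p > 0]  # ln(0) is undefined, so skip empty bins
    if not positive:
        return 0.0
    log_mean = sum(math.log(p) for p in positive) / len(positive)
    arithmetic_mean = sum(positive) / len(positive)
    return math.exp(log_mean) / arithmetic_mean

flat = [1.0, 1.0, 1.0, 1.0]        # made-up noise-like spectrum
peaky = [100.0, 0.01, 0.01, 0.01]  # made-up tonal spectrum with one dominant bin

print(spectral_flatness(flat))
print(spectral_flatness(peaky))
```

A perfectly even spectrum gives 1, while a single dominant peak pushes the ratio toward 0, matching the tonal and noisy ends of the scale above.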
Spectral_roughness
Unit: Arbitrary (relative measure)
Description: Average absolute deviation of each spectral bin from the mean of its two neighbouring bins (local irregularity)
Formula: Roughness = mean(|magnitude(f) - mean(magnitude(f±1))|)
Interpretation:
- Low roughness: Smooth spectrum (pure tones, low-pass filtered)
- High roughness: Irregular spectrum (distortion, inharmonicity, complex textures)
- Perceptual: Loosely related to perceived harshness; it measures spectral irregularity and is not a psychoacoustic roughness model
- Not directly comparable across sample rates (bin-dependent)
- Use: Within-corpus comparisons, roughness trends
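The roughness formula compares each bin with the mean of its two neighbours. Here is a minimal sketch on made-up magnitude values. The first and last bins are skipped because they have only one neighbour, which is an assumption about edge handling.

```python
def spectral_roughness(magnitudes):
    """Mean absolute deviation of each interior bin from the mean of its two neighbours."""
    if len(magnitudes) < 3:
        return 0.0  # no interior bins to compare
    deviations = [
        abs(magnitudes[i] - (magnitudes[i - 1] + magnitudes[i + 1]) / 2.0)
        for i in range(1, len(magnitudes) - 1)
    ]
    return sum(deviations) / len(deviations)

smooth = [1.0, 2.0, 3.0, 4.0, 5.0]  # steadily rising spectrum
jagged = [1.0, 5.0, 1.0, 5.0, 1.0]  # alternating peaks and dips

print(spectral_roughness(smooth))
print(spectral_roughness(jagged))
```

Because the result depends on the magnitude scale and the bin spacing, it is only meaningful when sounds are analysed with identical settings.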
Feature Summary Table
| Feature | Unit | Low Values | High Values |
|---|---|---|---|
| Duration_s | seconds | Short sounds | Long sounds |
| Pitch_mean_Hz | Hz | Low pitch (bass) | High pitch (treble) |
| Pitch_stdev_Hz | Hz | Monotone | Melodic variation |
| Intensity range | dB | Compressed dynamics | High dynamics |
| Jitter | ratio | Stable pitch | Irregular/rough |
| Shimmer | ratio | Stable amplitude | Breathy/irregular |
| Harmonicity | dB | Noisy | Clear/periodic |
| SPR | dB | Bright/high energy | Dark/low energy |
| Spectral_centroid | Hz | Dark timbre | Bright timbre |
| Spectral_spread | Hz | Narrow bandwidth | Wide bandwidth |
| Spectral_rolloff | Hz | Bass-heavy | Treble-rich |
| Spectral_flatness | 0-1 | Tonal | Noisy |
| Spectral_roughness | relative | Smooth spectrum | Irregular spectrum |
Interpreting Results
Reading the Results Table
Table structure:
- Each row = one sound file
- Column 1: SoundName (filename)
- Columns 2-20: 19 numerical features
- Values formatted as numbers (not scientific notation)
Zero Values: What They Mean
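As noted under Technical Implementation, the script sets all undefined values to zero. A zero in a pitch or perturbation column therefore usually marks an unpitched or unvoiced sound, not a measured value of 0. Before computing corpus statistics it can help to turn those placeholder zeros back into missing values. The sketch below does this in Python. The choice of columns in `ZERO_MEANS_MISSING` is an assumption, and the rows are made up.

```python
import statistics

# Columns where a zero is treated as "undefined"; this selection is an assumption.
ZERO_MEANS_MISSING = ("Pitch_mean_Hz", "Jitter_local", "Shimmer_local")

def mask_undefined(row):
    """Return a copy of the row with placeholder zeros replaced by None."""
    cleaned = dict(row)
    for column in ZERO_MEANS_MISSING:
        if cleaned.get(column) == 0:
            cleaned[column] = None
    return cleaned

rows = [  # made-up values
    {"SoundName": "vowel", "Pitch_mean_Hz": 120.0, "Jitter_local": 0.008, "Shimmer_local": 0.03},
    {"SoundName": "noise", "Pitch_mean_Hz": 0.0, "Jitter_local": 0.0, "Shimmer_local": 0.0},
    {"SoundName": "sung", "Pitch_mean_Hz": 220.0, "Jitter_local": 0.004, "Shimmer_local": 0.02},
]
cleaned = [mask_undefined(r) for r in rows]
voiced_pitch = [r["Pitch_mean_Hz"] for r in cleaned if r["Pitch_mean_Hz"] is not None]

print("mean pitch with zeros:", statistics.fmean(r["Pitch_mean_Hz"] for r in rows))
print("mean pitch, voiced only:", statistics.fmean(voiced_pitch))
```

Leaving the zeros in place drags the corpus mean down, so masking them first gives a figure that reflects only the sounds where pitch was actually detected.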
Typical Value Ranges by Sound Type
Male Speech
Pitch_stdev_Hz: 10-30
Intensity_max_dB: 60-80
Jitter: 0.005-0.015
Shimmer: 0.02-0.05
Harmonicity_dB: 10-20
Spectral_centroid_Hz: 800-1500
Spectral_flatness: 0.1-0.3
Female Speech
Pitch_stdev_Hz: 15-40
Intensity_max_dB: 60-80
Jitter: 0.005-0.015
Shimmer: 0.02-0.05
Harmonicity_dB: 10-20
Spectral_centroid_Hz: 1000-2000
Spectral_flatness: 0.1-0.3
Singing Voice (Trained)
Pitch_stdev_Hz: 30-100 (wide melodic range)
Jitter: 0.003-0.01 (lower than speech)
Shimmer: 0.01-0.03 (controlled)
Harmonicity_dB: 15-25 (clearer than speech)
SPR_dB: Lower (the singer's formant raises the 2000-4000 Hz peak)
Spectral_centroid_Hz: 1500-3000