Multi-Band Onset Detector — User Guide
Spectral event separation: isolates transient attacks from sustained content using multi-band energy analysis, with specialized modes for music and speech processing.
What this does
This script implements multi-band onset detection and separation — an advanced signal processing technique that isolates transient events from sustained content by analyzing energy changes across multiple frequency bands. Unlike simple amplitude-based detection, this method provides: (1) Multi-band analysis: Detects onsets across the entire frequency spectrum. (2) Energy-based detection: Identifies rapid energy increases characteristic of attacks. (3) Dual-mode operation: Music mode for percussive/tonal separation, Speech mode for consonant/vowel separation. (4) Temporal shaping: Adjustable attack and release windows for natural-sounding results. (5) Performance optimization: Configurable downsampling for efficient processing. Process analyzes audio across logarithmically-spaced frequency bands, computes combined energy envelope, detects onsets via differentiation, creates temporal mask, and separates transients from sustained content. Result: clean separation of attack transients from sustained resonances with applications in sound design, music analysis, and audio restoration.
Key Features:
- Dual Processing Modes — Music (normal) and Speech (swapped outputs)
- Multi-Band Analysis — Logarithmic frequency bands for full-spectrum coverage
- Adaptive Thresholding — Automatic level adjustment with user control
- Temporal Control — Adjustable attack and release windows
- Performance Optimization — Configurable downsampling for speed
- Professional Output — Normalized, full-quality separated components
Technical Implementation: (1) Preprocessing: Convert to mono and optional downsampling for performance. (2) Multi-band filtering: Split audio into logarithmically-spaced frequency bands. (3) Envelope extraction: Compute energy envelope for each band via rectification and smoothing. (4) Onset detection: Differentiate combined envelope to find energy increases. (5) Mask creation: Threshold onset function and apply temporal shaping. (6) Separation: Multiply original audio by mask and inverse mask. (7) Post-processing: Resample back to original rate and normalize. Key insight: Musical attacks and speech consonants create rapid energy increases across multiple frequency bands simultaneously, while sustained tones and vowels exhibit relatively stable energy distributions, enabling reliable separation based on temporal energy characteristics.
Quick start
- In Praat, select exactly one Sound object.
- Run script… →
multi_band_onset_detector.praat. - Set Transient_threshold for detection sensitivity (-20 to -40 dB typical).
- Adjust Attack_window and Release_window for temporal shaping.
- Set frequency range appropriate for your audio.
- Choose Number_of_bands (3-6 recommended).
- For long files, set Working_sample_rate for faster processing.
- Enable Swap_outputs_for_speech for vocal/consonant separation.
- Choose which outputs to create and enable normalization.
- Click OK — processing completes with separated components.
Onset Detection Theory
Multi-Band Energy Analysis
Spectral-Temporal Processing
Band-splitting and envelope extraction:
Why Multi-Band Approach?
Advantages over broadband detection:
- Frequency-specific detection: Identifies where in spectrum onsets occur
- Robustness: Less affected by masking between frequency regions
- Musical relevance: Matches human perception of attack brightness
- Flexibility: Can weight different frequency regions if needed
Onset Detection Mathematics
Energy-Based Detection
Computing the onset function:
Temporal Mask Properties
Attack and release shaping:
Initial mask: 1 where onset > threshold, 0 elsewhere
Creates abrupt transitions that sound artificial
Attack window (10-30ms typical):
Linear ramp from 0 to 1 over attack duration
Preserves sharpness of attack perception
Shorter = more precise timing, longer = smoother
Release window (30-100ms typical):
Exponential decay from 1 to 0 over release duration
Natural-sounding decay of transient energy
Shorter = tighter transients, longer = more sustain bleed
Convolution implementation:
Mask convolved with attack/release envelope
Efficient computation of shaped mask
Mathematically equivalent to sample-by-sample shaping
Perceptual benefits:
Smooth transitions prevent clicks and artifacts
Natural amplitude envelopes for separated components
Musical timing preservation
🎵 Perceptual Intuition
Transients (attacks):
Rapid energy increases across frequencies
Percussive hits, instrument attacks, consonants
Short duration, broadband energy
Sustain (resonance):
Stable or slowly changing energy
Instrument bodies, vowel sounds, reverberation
Long duration, tonal character
Detection principle:
Find moments when energy rapidly increases
Separate these from stable energy periods
Dual-Mode Operation
Music vs Speech Processing
Mode-dependent output assignment:
Why Dual Modes Matter
Perceptual differences:
- Music perception: Attacks define rhythm, sustain defines harmony
- Speech perception: Consonants define intelligibility, vowels define prosody
- Different goals: Music separation for creative effects, speech separation for analysis
- Signal characteristics: Musical attacks often sharper, speech consonants more complex
Complete Processing Pipeline
Processing Modes
Music Mode
🎵 Attack-Sustain Separation
Character: Normal operation for musical sounds
Output assignment: Transients = attacks, Sustain = resonances
Best for: Instruments, percussive sounds, general audio
Music mode applications:
| Sound Type | Transients Content | Sustain Content | Typical Use |
|---|---|---|---|
| Drums/Percussion | Attack transients, stick hits | Body resonance, ring | Drum replacement, transient shaping |
| Plucked Strings | Pick noise, string attack | String vibration, body resonance | Articulation control, re-amping |
| Piano/Keys | Hammer noise, key attack | String sustain, harmonic content | Dynamic control, note editing |
| Brass/Woodwinds | Tonguing, air bursts | Instrument body, tone sustain | Articulation analysis, phrase shaping |
Speech Mode
🗣️ Consonant-Vowel Separation
Character: Swapped outputs for speech analysis
Output assignment: Transients = consonants, Sustain = vowels
Best for: Voice, speech, vocal recordings
Speech mode applications:
| Speech Element | Transients Content | Sustain Content | Typical Use |
|---|---|---|---|
| Plosives (p,t,k,b,d,g) | Burst release, aspiration | Voicing (if present) | Speech analysis, consonant enhancement |
| Fricatives (f,v,s,z,sh) | Turbulent noise | Voicing (if present) | De-essing, noise reduction |
| Affricates (ch,j) | Stop + fricative portions | Transition regions | Articulation study |
| Vowels (a,e,i,o,u) | Onset/offset transitions | Vowel formants, voicing | Prosody analysis, vowel modification |
Mode Selection Guide
🎯 Choosing the Right Mode
Music mode: When you want to separate attacks from sustained tones
Speech mode: When you want to separate consonants from vowels
Consider content: What perceptual separation makes sense for your audio?
Experiment: Try both modes and listen to which produces more useful results
Parameters
Detection Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| Transient_threshold_(dB) | real | -30 | Sensitivity of onset detection |
| Attack_window_(ms) | real | 20 | Duration of attack ramp in mask |
| Release_window_(ms) | real | 50 | Duration of release decay in mask |
Frequency Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| Low_frequency_(Hz) | real | 100 | Lowest frequency band edge |
| High_frequency_(Hz) | real | 8000 | Highest frequency band edge |
| Number_of_bands | integer | 4 | Number of frequency bands (3-6 recommended) |
Performance Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| Working_sample_rate_(Hz) | real | 8000 | Processing sample rate for speed |
Output Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| Create_transient_sound | boolean | 1 (on) | Generate transient component output |
| Create_sustain_sound | boolean | 1 (on) | Generate sustain component output |
| Swap_outputs_for_speech | boolean | 0 (off) | Swap outputs for speech processing |
| Normalize_outputs | boolean | 1 (on) | Normalize output levels |
| Peak_amplitude | real | 0.99 | Target peak level for normalization |
Applications
Sound Design and Music Production
Use case: Creative processing and effect generation
Technique: Process transients and sustain separately with different effects
Example: Add reverb only to sustain, compression only to transients
Audio Restoration and Enhancement
Use case: Targeted processing of specific sound components
Technique: Isolate problematic elements for selective treatment
Workflow:
- Separate transients and sustain components
- Apply noise reduction only to sustain (preserves transients)
- Apply click removal only to transients (preserves sustain)
- Recombine processed components
Music Analysis and Education
Use case: Studying articulation and performance technique
Advantages:
- Reveals timing and intensity of attacks
- Shows sustain characteristics separate from attacks
- Enables detailed study of performance articulation
- Useful for instrumental pedagogy
Example: Analyze piano touch by examining transient-sustain relationship
Speech Processing and Analysis
Use case: Speech analysis and modification
Technique: Use speech mode for consonant-vowel separation
Application: Speech therapy, accent modification, voice transformation
Practical Workflow Examples
🎵 Drum Processing
Goal: Separate drum attacks from body resonance for individual processing
Settings:
- Mode: Music (swap disabled)
- Threshold: -25 dB (sensitive for sharp attacks)
- Attack: 15 ms (fast capture)
- Release: 40 ms (medium decay)
- Bands: 5 (good frequency resolution)
- Sample rate: 8000 Hz (balanced performance)
Result: Clean separation of drum hits from ring/resonance
🗣️ Speech De-essing
Goal: Reduce sibilance in vocal recordings
Settings:
- Mode: Speech (swap enabled)
- Threshold: -35 dB (sensitive for fricatives)
- Attack: 10 ms (fast consonant capture)
- Release: 30 ms (quick release)
- Bands: 4 (adequate for speech)
- Sample rate: 8000 Hz (speech bandwidth adequate)
Processing: Apply EQ/compression to transients (sibilance) only
🎻 Instrument Analysis
Goal: Study bowing/plucking technique in string instruments
Settings:
- Mode: Music (swap disabled)
- Threshold: -30 dB (balanced detection)
- Attack: 20 ms (capture bow attack)
- Release: 60 ms (capture initial decay)
- Bands: 6 (detailed frequency analysis)
- Sample rate: 16000 Hz (full quality for analysis)
Result: Clear view of articulation separate from tone sustain
Advanced Techniques
- Layered effects: Apply different effects to transients vs sustain
- Cross-synthesis: Use transients from one sound with sustain from another
- Rhythmic manipulation: Time-stretch sustain while keeping transients tight
- Spectral morphing: Process frequency content separately in each component
- Dynamic mixing: Automate blend between processed components
The separation enables entirely new processing possibilities
- Sharp attacks: Lower threshold, shorter attack window
- Gradual onsets: Higher threshold, longer attack window
- Percussive material: More bands for frequency resolution
- Tonal material: Fewer bands for smoother operation
- Fast processing: Lower working sample rate, fewer bands
- High quality: Higher working sample rate, more bands
- Clean separation: Higher threshold, shorter release
- Smooth transition: Lower threshold, longer release
Troubleshooting Common Issues
Cause: Threshold too low, capturing non-attack content
Solution: Increase threshold, check if speech mode should be enabled
Cause: Threshold too high, attack window too short
Solution: Decrease threshold, increase attack window
Cause: Abrupt mask transitions, insufficient smoothing
Solution: Increase attack/release windows, ensure normalization
Cause: Audio not suitable for attack-sustain model
Solution: Try different mode, adjust parameters, consider alternative approaches
Technical Deep Dive
Algorithm Performance
Computational Complexity
Processing time analysis:
Detection Accuracy Metrics
Evaluation criteria:
Ability to locate onset times accurately
Measured in milliseconds deviation from actual onset
This implementation: Typically 5-20 ms precision
Detection rate:
Percentage of actual onsets correctly detected
Depends on threshold setting and audio content
Typical: 80-95% with optimized parameters
False positive rate:
Percentage of detected onsets that are not actual onsets
Controlled by threshold setting
Lower threshold → higher detection but more false positives
Separation quality:
How cleanly transients are separated from sustain
Subjective but can be measured via spectral analysis
This method: Good for percussive sounds, fair for complex mixes
Comparative performance:
Multi-band vs single-band: Better for complex sounds
Energy-based vs spectral flux: More robust to timbre changes
This implementation: Balanced approach for general use