Convert speech rhythm and dynamics into MusicXML notation with automatic tempo detection, meter estimation, and quantized rhythmic notation for music composition and analysis.
This script converts speech rhythm into MusicXML notation by detecting syllable onsets, estimating tempo and meter, quantizing to musical note values, and generating standard MusicXML files that can be opened in notation software like MuseScore, Finale, or Sibelius. The system performs sophisticated acoustic analysis to extract rhythmic patterns from speech, converts them to quantized musical rhythms, preserves dynamic information as musical dynamics, and creates both MusicXML notation and detailed rhythm TextGrids for analysis. This enables composers, researchers, and musicians to transcribe speech rhythms into musical notation automatically.
Automatic Tempo Detection — Statistical analysis of inter-onset intervals
Meter Estimation — Beat and measure structure from accent patterns
Intelligent Quantization — Converts continuous timing to note values with dotted notes
Dynamics Extraction — Maps intensity contours to musical dynamics (pp-ff)
MusicXML Export — Standard format compatible with all notation software
Rhythm TextGrid — Multi-tier visualization of rhythm analysis
Why Convert Speech to Musical Rhythm?
Speech contains rich rhythmic patterns that can inspire musical composition and provide insights into prosody and timing. This conversion enables:
(1) Musical composition: use speech rhythms as compositional material.
(2) Prosody analysis: study speech timing through musical notation.
(3) Educational tools: visualize speech rhythm for language learning.
(4) Creative applications: generate rhythmic patterns from spoken word.
The system bridges speech analysis and music notation by:
(1) Detecting acoustic events: syllable onsets and intensity peaks.
(2) Statistical analysis: finding the tempo through histogram analysis.
(3) Musical mapping: converting continuous time to discrete note values.
(4) Preserving expressivity: maintaining dynamics and timing variations.
The result is musically meaningful notation that captures the essence of speech rhythm.
Technical Implementation:
(1) Silence detection: creates a TextGrid with silent/sounding intervals using an intensity threshold.
(2) Onset detection: three methods with peak refinement and prominence filtering.
(3) Tempo estimation: histogram analysis of inter-onset intervals with confidence scoring.
(4) Meter estimation: analyzes accent patterns to determine the time signature.
(5) Quantization: rounds onsets to the nearest musical grid position with note value optimization (see the sketch below).
(6) Dynamics mapping: converts intensity to musical dynamics markings.
(7) MusicXML generation: builds a valid MusicXML structure with measures, notes, rests, and metadata.
(8) TextGrid creation: multi-tier visualization of rhythm, dynamics, beats, and measures.
The complete pipeline runs within Praat using built-in analysis capabilities.
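Quantization (step 5) has no dedicated pseudocode section below, so here is a minimal Python sketch of the general idea: round each onset to the division grid implied by the tempo and Divisions_per_quarter, then snap inter-onset gaps to the nearest expressible note value. All names here (quantize_onsets, gap_to_note, NOTE_VALUES) are illustrative, not the script's own variables.

def quantize_onsets(onsets, tempo_bpm=120, divisions=8):
    quarter = 60.0 / tempo_bpm                # quarter-note duration in seconds
    grid = quarter / divisions                # smallest grid step in seconds
    return [round(t / grid) for t in onsets]  # onset positions in divisions

# Values expressible at divisions = 8 (32nd-note resolution);
# dotted values are 1.5x their plain counterparts.
NOTE_VALUES = {1: "32nd", 2: "16th", 3: "dotted 16th", 4: "eighth",
               6: "dotted eighth", 8: "quarter", 12: "dotted quarter",
               16: "half", 24: "dotted half", 32: "whole"}

def gap_to_note(gap_divs):
    # Snap a gap (in divisions) to the nearest expressible note value;
    # returns the note name and the residual quantization error in divisions.
    best = min(NOTE_VALUES, key=lambda v: abs(v - gap_divs))
    return NOTE_VALUES[best], abs(best - gap_divs)

positions = quantize_onsets([0.02, 0.27, 0.51, 1.02])
print(positions)                                 # [0, 4, 8, 16]
print(gap_to_note(positions[1] - positions[0]))  # ('eighth', 0)

Because dotted values fill the gaps between plain note values, enabling Allow_dotted_notes can reduce the residual quantization error.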
Quick start
In Praat, select exactly one Sound object (mono recommended).
Run script… → speech_to_musicxml_rhythm_v2.7.praat.
Enable Auto_detect_tempo (recommended for most cases).
Set Pulse_unit: "Quarter note" for typical speech rhythm.
Set Divisions_per_quarter: 8 for 32nd note resolution.
Enable Allow_dotted_notes for natural rhythm variations.
Enable Auto_detect_meter to automatically find time signature.
Choose Detection_method: "Intensity only" for speech, "Multi-feature" for music.
Set Min_onset_separation: 0.08s (80ms) for syllable-level detection.
Adjust Prominence_threshold: 2.5dB for clear onsets.
Set silence parameters: Min_silent_duration 0.10s, Silence_threshold -25dB.
Enable Extract_dynamics to preserve loudness variations.
Choose Output_pitch: "C4 (middle C)" for standard notation.
Enable Create_TextGrid for rhythm visualization.
Click OK — analysis, detection, quantization, and XML generation will run.
Copy MusicXML output from Info window and save as .xml file.
Open .xml file in MuseScore, Finale, Sibelius, or other notation software.
Quick tip: Start with clean speech recordings (1-10 seconds) for best results. For poetry or recited text, use "Intensity only" detection with Pulse_unit = "Quarter note". For musical or sung phrases, use "Multi-feature" detection. Auto tempo detection works best when speech has clear rhythmic patterns. Processing stages: (1) silence/speech segmentation (2-5 seconds), (2) onset detection (5-15 seconds), (3) tempo/meter estimation, (4) quantization, (5) MusicXML generation. For visual feedback, always enable Create_TextGrid to see detected onsets and rhythm mapping. The MusicXML output appears in Praat's Info window — copy everything from the opening <?xml declaration to the closing </score-partwise> tag and save it as a .xml file. Recommended software: MuseScore (free) for viewing and editing the notation.
Important:
Clean input required: background noise affects onset detection.
Minimum length: at least 2 seconds of speech with clear rhythm are needed.
Onset separation: too low may detect micro-variations; too high may miss syllables.
Tempo detection: works best with regular speech rhythm; irregular speech may produce an uncertain tempo.
Quantization error: speech timing doesn't exactly match the musical grid, so some error is inevitable.
Meter detection: based on accent patterns; may not match linguistic meter.
MusicXML compatibility: generated XML follows the MusicXML 3.1 standard.
Pitch setting: only affects visual notation, not audio.
Silence threshold: adjust based on the recording's noise floor.
Dynamics mapping: relative to the within-file intensity range, not absolute dB (see the sketch below).
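As a rough illustration of the last point, a relative dynamics mapping might look like the Python sketch below. The script's exact breakpoints are not documented here, so the six equal buckets are an assumption, and map_dynamics is an illustrative name.

DYNAMICS = ["pp", "p", "mp", "mf", "f", "ff"]

def map_dynamics(onset_intensities_db):
    # Scale each onset's intensity within the file's own min..max range,
    # then bucket into six equal bands (the equal split is an assumption).
    lo, hi = min(onset_intensities_db), max(onset_intensities_db)
    span = max(hi - lo, 1e-6)             # guard against a flat recording
    marks = []
    for db in onset_intensities_db:
        rel = (db - lo) / span            # 0..1, relative to this file only
        idx = min(int(rel * len(DYNAMICS)), len(DYNAMICS) - 1)
        marks.append(DYNAMICS[idx])
    return marks

print(map_dynamics([58.0, 63.5, 70.2, 74.9]))  # ['pp', 'p', 'f', 'ff']

Because the scale is relative, a quiet recording still spans pp through ff; comparing absolute loudness across files would require external calibration.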
# ONSET DETECTION IMPLEMENTATION DETAILS
# === METHOD 1: INTENSITY ONLY ===
# 1. Get intensity object (100Hz, 0.01s steps)
To Intensity: 100, 0.01, "yes"
# 2. Detect silent/sounding intervals
To TextGrid (silences): 100, 0, silence_threshold, min_silent_dur, min_sounding_dur, "silent", "sounding"
# 3. Scan each sounding interval
FOR each sounding interval:
t = interval_start + adaptive_window
adaptive_window = max(0.02, min(0.05, interval_dur/10))
WHILE t < interval_end - adaptive_window:
# Get intensity at three points for parabolic fit
int_val = Get value at time: t, "Cubic"
int_m1 = Get value at time: t - adaptive_window/2, "Cubic"
int_p1 = Get value at time: t + adaptive_window/2, "Cubic"
# Check for local maximum
IF int_val > int_m1 AND int_val > int_p1:
# Parabolic refinement
α = int_m1, β = int_val, γ = int_p1
IF (α - 2β + γ) ≠ 0:
p = 0.5 × (α - γ) / (α - 2β + γ)
refined_t = t + p × (adaptive_window/2)
refined_int = β - 0.25 × (α - γ) × p
ELSE:
refined_t = t
refined_int = int_val
# Calculate prominence (local median)
window_start = max(t - 0.15, interval_start)
window_end = min(t + 0.15, interval_end)
sample_vals[1..15] = intensity at 15 evenly spaced points
# Sort for median
FOR i FROM 1 TO 14:
FOR j FROM i+1 TO 15:
IF sample_vals[i] > sample_vals[j]:
SWAP sample_vals[i], sample_vals[j]
local_median = sample_vals[8] # 8th of 15 = median
prominence = refined_int - local_median
# Apply threshold and separation constraints
IF prominence ≥ prominence_threshold AND
refined_t - last_onset ≥ min_separation:
ADD ONSET: refined_t, refined_int
last_onset = refined_t
t = t + 0.005 # 5ms step
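To sanity-check the parabolic refinement above outside Praat, the same formulas can be run as a tiny self-contained Python function (the names are illustrative):

def refine_peak(alpha, beta, gamma, t, half_win):
    # Same formulas as the α/β/γ step above: fit a parabola through three
    # intensity samples and solve for its vertex.
    denom = alpha - 2 * beta + gamma
    if denom == 0:                       # degenerate (flat) case: keep raw peak
        return t, beta
    p = 0.5 * (alpha - gamma) / denom    # vertex offset, in half-window units
    return t + p * half_win, beta - 0.25 * (alpha - gamma) * p

# Peak pulled toward the right neighbour (gamma > alpha):
print(refine_peak(60.0, 66.0, 64.0, t=1.000, half_win=0.02))  # (1.005, 66.25)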
# === METHOD 2: MULTI-FEATURE ===
# Same as Method 1 PLUS spectral flux validation
# Additional step after intensity peak detection:
selectObject: spectrogram
slice_before = Get power at: refined_t - 0.02, 1000
slice_after = Get power at: refined_t + 0.01, 1000
IF slice_before ≠ undefined AND slice_after ≠ undefined:
flux = slice_after - slice_before
spectral_onset = (flux > 0) # True if spectral increase
# Both intensity peak AND spectral increase required
IF prominence ≥ prominence_threshold AND
refined_t - last_onset ≥ min_separation AND
spectral_onset:
ADD ONSET: refined_t, refined_int
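A numpy stand-in for this spectral-flux check is sketched below. It approximates Praat's "Get power at:" by taking the power of the FFT bin nearest the target frequency; that substitution, and the window length, are assumptions for illustration, not the script's exact computation.

import numpy as np

def spectral_flux_rises(signal, sr, t, freq_hz=1000, win=0.02):
    # signal: 1-D numpy array of samples; sr: sampling rate in Hz
    def band_power(center):
        i0 = max(int((center - win / 2) * sr), 0)
        frame = signal[i0: i0 + int(win * sr)]
        if frame.size == 0:
            return None                   # mirrors Praat's 'undefined'
        spec = np.abs(np.fft.rfft(frame * np.hanning(frame.size))) ** 2
        freqs = np.fft.rfftfreq(frame.size, d=1.0 / sr)
        return spec[np.argmin(np.abs(freqs - freq_hz))]
    before = band_power(t - 0.02)         # same offsets as the pseudocode
    after = band_power(t + 0.01)
    if before is None or after is None:
        return False
    return after > before                 # True if spectral energy increased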
# === METHOD 3: SYLLABLE NUCLEI ===
# 1. Get pitch object for voiced/unvoiced detection
To Pitch: 0, 75, 600
# 2. In each sounding interval:
selectObject: intensity
t = interval_start + 0.03
WHILE t < interval_end - 0.03:
# Find local maximum in intensity
local_max_t = Get time of maximum: t - 0.04, t + 0.04, "Parabolic"
local_max_int = Get maximum: t - 0.04, t + 0.04, "Parabolic"
IF local_max_t ≠ undefined AND abs(local_max_t - t) < 0.01:
# Check for voicing at this time
selectObject: pitch_obj
f0 = Get value at time: local_max_t, "Hertz", "Linear"
IF f0 ≠ undefined: # Voiced = syllable nucleus
IF local_max_t - last_onset ≥ min_separation:
ADD ONSET: local_max_t, local_max_int
last_onset = local_max_t
t = t + 0.02 # 20ms step
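Since the praat-parselmouth Python package exposes the same commands used above, Method 3's voicing check can be sketched nearly one-to-one outside the Praat GUI. "speech.wav" is a placeholder path and is_syllable_nucleus an illustrative helper, not part of the script:

import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("speech.wav")                 # placeholder file
intensity = call(snd, "To Intensity", 100, 0.01, "yes")
pitch = call(snd, "To Pitch", 0.0, 75.0, 600.0)

def is_syllable_nucleus(t):
    # Local intensity maximum near t, refined parabolically as in the script
    peak_t = call(intensity, "Get time of maximum", t - 0.04, t + 0.04, "Parabolic")
    # Praat returns undefined (NaN) for unvoiced frames
    f0 = call(pitch, "Get value at time", peak_t, "Hertz", "Linear")
    return f0 == f0   # NaN-safe check: voiced local maximum = likely nucleus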
# === PARAMETER EFFECTS ===
# min_separation: Higher = fewer onsets, more separation
# prominence_threshold: Higher = only strong onsets
# silence_threshold: Higher = more detected as silence
# adaptive_window: Automatically adjusts to interval length
# === PERFORMANCE CHARACTERISTICS ===
# Method 1: Fastest, good for clean speech
# Method 2: Slower (spectrogram), more robust
# Method 3: Medium speed, syllable-focused
Silence Detection & Segmentation
🔇 Smart Silence/Speech Segmentation
Purpose: Separate speech from silence/pauses
Method: Intensity threshold with duration constraints
Parameters: Silence threshold (dB), min silent/sounding durations
Output: TextGrid with "silent" and "sounding" intervals
Importance: Prevents false onsets in silent regions
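For orientation, a minimal numpy sketch of this style of segmentation follows. The script itself delegates the real work to Praat's "To TextGrid (silences)" command, so details such as how too-short intervals are merged are assumptions here, and segment_silences is an illustrative name.

import numpy as np

def segment_silences(intensity_db, times, silence_threshold=-25.0,
                     min_silent=0.10, min_sounding=0.10):
    # Threshold is relative to the file's peak intensity, as in Praat
    sounding = intensity_db > intensity_db.max() + silence_threshold
    # Collect raw runs of equal label
    intervals, start = [], 0
    for i in range(1, len(sounding) + 1):
        if i == len(sounding) or sounding[i] != sounding[start]:
            intervals.append((times[start], times[i - 1], bool(sounding[start])))
            start = i
    # Absorb runs shorter than the duration constraints into their predecessor
    kept = []
    for t0, t1, is_sound in intervals:
        min_dur = min_sounding if is_sound else min_silent
        if t1 - t0 < min_dur and kept:
            kept[-1] = (kept[-1][0], t1, kept[-1][2])
        else:
            kept.append((t0, t1, is_sound))
    return kept  # list of (start, end, is_sounding)

t = np.arange(0, 2, 0.01)
db = np.where((t > 0.5) & (t < 1.5), 70.0, 30.0)   # loud mid-section
print(segment_silences(db, t))   # silent / sounding / silent intervals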
Tempo & Meter Analysis
Automatic Tempo Detection
🎼 Statistical Tempo Estimation
Input: Inter-onset intervals (IOIs) from detected onsets
Method: Histogram analysis with peak detection
Output: Tempo (BPM), confidence score, dominant IOI
Pulse mapping: Maps IOI to musical note value (whole to 16th)
Validation: Checks for reasonable tempo range (30-300 BPM)
Tempo Detection Algorithm
# TEMPO DETECTION ALGORITHM
# 1. Calculate inter-onset intervals (IOIs)
FOR i FROM 1 TO onset_count - 1:
IOI[i] = onset_time[i + 1] - onset_time[i]
# 2. Determine histogram range
ioi_min = min(IOI[1..n_iois])
ioi_max = max(IOI[1..n_iois])
hist_min = max(0.05, ioi_min × 0.8) # Minimum 50ms
hist_max = min(2.0, ioi_max × 1.2) # Maximum 2 seconds
n_bins = 50
bin_width = (hist_max - hist_min) / n_bins
# 3. Build histogram
FOR b FROM 1 TO n_bins:
hist_count[b] = 0
hist_center[b] = hist_min + (b - 0.5) × bin_width
FOR i FROM 1 TO n_iois:
IF IOI[i] ≥ hist_min AND IOI[i] < hist_max:
bin_idx = floor((IOI[i] - hist_min) / bin_width) + 1
IF bin_idx ≥ 1 AND bin_idx ≤ n_bins:
hist_count[bin_idx] = hist_count[bin_idx] + 1
# 4. Find peak region
peak_bin = 1
peak_count = hist_count[1]
FOR b FROM 2 TO n_bins:
IF hist_count[b] > peak_count:
peak_count = hist_count[b]
peak_bin = b
# 5. Weighted average around peak (5-bin window)
weight_sum = 0
weighted_ioi = 0
FOR b FROM max(1, peak_bin - 2) TO min(n_bins, peak_bin + 2):
weight_sum = weight_sum + hist_count[b]
weighted_ioi = weighted_ioi + hist_count[b] × hist_center[b]
IF weight_sum > 0:
dominant_ioi = weighted_ioi / weight_sum
ELSE:
dominant_ioi = hist_center[peak_bin]
# 6. Confidence calculation
total_count = 0
FOR b FROM 1 TO n_bins:
total_count = total_count + hist_count[b]
peak_region_count = 0
FOR b FROM max(1, peak_bin - 2) TO min(n_bins, peak_bin + 2):
peak_region_count = peak_region_count + hist_count[b]
confidence = peak_region_count / (total_count + 0.001)
# 7. Map to musical pulse based on pulse_unit setting
CASE pulse_unit OF:
1: # Whole note
quarter_dur_est = dominant_ioi / 4
pulse_name$ = "whole note"
2: # Half note
quarter_dur_est = dominant_ioi / 2
pulse_name$ = "half note"
3: # Quarter note
quarter_dur_est = dominant_ioi
pulse_name$ = "quarter note"
4: # Eighth note
quarter_dur_est = dominant_ioi × 2
pulse_name$ = "eighth note"
5: # 16th note
quarter_dur_est = dominant_ioi × 4
pulse_name$ = "16th note"
ENDCASE
# 8. Calculate tempo
raw_tempo = 60.0 / quarter_dur_est
# 9. Validate and adjust tempo range
# Speech typically 60-180 BPM; music 40-240 BPM
IF raw_tempo < 40: # Too slow, maybe pulse is half note
raw_tempo = raw_tempo × 2
ELSIF raw_tempo > 240: # Too fast, maybe pulse is quarter note
raw_tempo = raw_tempo / 2
tempo = round(raw_tempo)
tempo = max(30, min(300, tempo)) # Clamp to reasonable range
# 10. Output statistics
# dominant_ioi: Most common interval between onsets (seconds)
# tempo: Estimated beats per minute
# confidence: 0-1, higher = more consistent rhythm
# pulse_name$: Musical note value that maps to dominant_ioi
# === INTERPRETATION EXAMPLES ===
# Dominant IOI = 0.5s, pulse_unit = quarter note:
# quarter_dur = 0.5s, tempo = 60/0.5 = 120 BPM
# Dominant IOI = 0.25s, pulse_unit = eighth note:
# quarter_dur = 0.25×2 = 0.5s, tempo = 60/0.5 = 120 BPM
# Dominant IOI = 0.333s, pulse_unit = quarter note:
# quarter_dur = 0.333s, tempo = 60/0.333 ≈ 180 BPM
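Steps 1-8 condense naturally into a few lines of numpy; this sketch mirrors the histogram, peak-region averaging, confidence, and tempo-clamping logic above. pulse_factor is an illustrative stand-in for the pulse_unit mapping (1.0 for a quarter-note pulse, 2.0 for eighth, 0.25 for whole).

import numpy as np

def estimate_tempo(onset_times, pulse_factor=1.0, n_bins=50):
    iois = np.diff(np.asarray(onset_times))         # inter-onset intervals
    lo = max(0.05, iois.min() * 0.8)                # minimum 50 ms
    hi = min(2.0, iois.max() * 1.2)                 # maximum 2 s
    counts, edges = np.histogram(iois, bins=n_bins, range=(lo, hi))
    centers = (edges[:-1] + edges[1:]) / 2
    peak = counts.argmax()
    sl = slice(max(0, peak - 2), min(n_bins, peak + 3))   # 5-bin peak region
    dominant_ioi = np.average(centers[sl], weights=counts[sl] + 1e-9)
    confidence = counts[sl].sum() / (counts.sum() + 0.001)
    raw = 60.0 / (dominant_ioi * pulse_factor)      # quarter-note tempo
    if raw < 40:                                    # pulse probably a half note
        raw *= 2
    elif raw > 240:                                 # pulse probably too short
        raw /= 2
    return int(np.clip(round(raw), 30, 300)), confidence

onsets = np.arange(0, 5, 0.5)                # perfectly regular 0.5 s IOIs
print(estimate_tempo(onsets))                # (120, ≈1.0)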
Meter Estimation
🎵 Time Signature Detection
Principle: Analyze accent patterns for metrical structure
Method: Intensity-based accent detection and interval analysis
Output: Beats per measure, beat type, compound meter flag
Common results: 2/4, 3/4, 4/4, 6/8 based on speech patterns
Fallback: Defaults to 4/4 if pattern unclear
Meter Detection Algorithm
# METER ESTIMATION ALGORITHM
# 1. Find accents (intensity peaks above mean)
int_sum = 0
FOR i FROM 1 TO onset_count:
int_sum = int_sum + onset_intensity[i]
int_mean = int_sum / onset_count
# 2. Detect accent positions
accent_count = 0
last_accent = 0
FOR i FROM 1 TO onset_count:
IF onset_intensity[i] > int_mean × 1.1: # 10% above mean
IF last_accent > 0:
accent_count = accent_count + 1
accent_interval[accent_count] = i - last_accent
ENDIF
last_accent = i
# 3. Analyze accent intervals
# Count occurrences of common metrical patterns
count_2 = 0 # 2-beat patterns (2/4, 2/2)
count_3 = 0 # 3-beat patterns (3/4)
count_4 = 0 # 4-beat patterns (4/4)
count_6 = 0 # 6-beat patterns (6/8 compound)
FOR i FROM 1 TO accent_count:
CASE accent_interval[i] OF:
2: count_2 = count_2 + 1 # Accents every 2 onsets
3: count_3 = count_3 + 1 # Accents every 3 onsets
4: count_4 = count_4 + 1 # Accents every 4 onsets
6: count_6 = count_6 + 1 # Accents every 6 onsets (compound)
ENDCASE
# 4. Determine meter based on strongest pattern
IF accent_count ≥ 3: # Need enough data
IF count_6 > count_4 AND count_6 > count_3 AND count_6 > count_2:
beats = 6
beat_type = 8
compound = 1 # Compound meter (6/8)
ELSIF count_3 > count_4 AND count_3 > count_2:
beats = 3
beat_type = 4
compound = 0 # Simple triple (3/4)
ELSIF count_2 > count_4:
beats = 2
beat_type = 4
compound = 0 # Simple duple (2/4)
ELSE:
beats = 4
beat_type = 4
compound = 0 # Default to 4/4
ELSE:
# Not enough accents, use defaults
beats = 4
beat_type = 4
compound = 0
# 5. Calculate measure structure
IF compound:
# Compound meter: beat unit = dotted quarter
# Example: 6/8 = 2 beats of dotted quarter
measure_dur = (beats / 3) × (beat_dur × 1.5)
divs_per_measure = beats × (divisions / 2)
ELSE:
# Simple meter: beat unit = quarter
measure_dur = beat_dur × beats
divs_per_measure = beats × divisions
# === METER INTERPRETATION EXAMPLES ===
# Accent pattern: X . . X . . (every 3 onsets)
# → count_3 highest → 3/4 time
# Accent pattern: X . X . (every 2 onsets)
# → count_2 highest → 2/4 time
# Accent pattern: X . . . X . . . (every 4 onsets)
# → count_4 highest → 4/4 time
# Accent pattern: X . . . . . X . . . . . (every 6 onsets)
# → count_6 highest → 6/8 time (compound duple)
# === COMPOUND METER DETECTION ===
# Compound meter (6/8, 9/8, 12/8) has accents grouping in 3s
# Accent interval of 6 means accents every 6 onsets
# This suggests 2 groups of 3 (6/8) or 3 groups of 3 (9/8)
# === LIMITATIONS ===
# Requires clear accent patterns
# May not match linguistic meter (poetic meter)
# Works best with rhythmic, accented speech
# Music with syncopation may confuse detection
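The accent-interval vote condenses to a short Python function; this sketch mirrors the logic above and returns (beats, beat_type, compound). The 10% accent margin and the ≥3-interval requirement are taken directly from the pseudocode.

from collections import Counter

def estimate_meter(onset_intensities):
    # Accents = onsets at least 10% louder than the mean onset intensity
    mean = sum(onset_intensities) / len(onset_intensities)
    accents = [i for i, v in enumerate(onset_intensities) if v > mean * 1.1]
    intervals = [b - a for a, b in zip(accents, accents[1:])]
    if len(intervals) < 3:
        return (4, 4, False)              # not enough data: default to 4/4
    votes = Counter(iv for iv in intervals if iv in (2, 3, 4, 6))
    if votes[6] > max(votes[2], votes[3], votes[4]):
        return (6, 8, True)               # compound duple (6/8)
    if votes[3] > max(votes[2], votes[4]):
        return (3, 4, False)              # simple triple (3/4)
    if votes[2] > votes[4]:
        return (2, 4, False)              # simple duple (2/4)
    return (4, 4, False)                  # default 4/4

# Accent on every 3rd onset -> 3/4:
print(estimate_meter([80, 60, 61, 80, 62, 60, 80, 61, 62, 80]))  # (3, 4, False)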
MusicXML Generation
MusicXML Structure
📄 Standard Music Notation Format
Format: MusicXML 3.1 Partwise DTD
Structure: Score-partwise with measures, attributes, notes
Compatibility: MuseScore, Finale, Sibelius, Dorico, etc.
Elements: Work info, identification, part list, measures
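For orientation, a minimal score-partwise file of the kind described here could look like the skeleton below (one 4/4 measure containing a quarter note on C4 plus a rest, with a dynamics direction). The actual generated file will contain more metadata, and the exact element ordering may differ.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE score-partwise PUBLIC "-//Recordare//DTD MusicXML 3.1 Partwise//EN"
    "http://www.musicxml.org/dtds/partwise.dtd">
<score-partwise version="3.1">
  <part-list>
    <score-part id="P1"><part-name>Speech Rhythm</part-name></score-part>
  </part-list>
  <part id="P1">
    <measure number="1">
      <attributes>
        <divisions>8</divisions>               <!-- divisions per quarter -->
        <time><beats>4</beats><beat-type>4</beat-type></time>
        <clef><sign>G</sign><line>2</line></clef>
      </attributes>
      <direction placement="below">
        <direction-type><dynamics><mf/></dynamics></direction-type>
      </direction>
      <note>
        <pitch><step>C</step><octave>4</octave></pitch>
        <duration>8</duration>                 <!-- one quarter note -->
        <type>quarter</type>
      </note>
      <note>
        <rest/>
        <duration>24</duration>                <!-- dotted-half rest fills the bar -->
        <type>half</type>
        <dot/>
      </note>
    </measure>
  </part>
</score-partwise>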
Parameter Recommendations
Musical/sung material: use Multi-feature detection, divisions = 12.
Noisy recordings: increase Silence_threshold and use Multi-feature detection.
Poetic/recited material: enable dotted notes and auto meter detection.
Always check the TextGrid output to verify onset detection quality.
Troubleshooting Common Issues
Problem: too many or too few onsets detected. Cause: Prominence_threshold too low/high, or Min_onset_separation wrong. Solution: adjust Prominence_threshold (2-4 dB is typical) and check the TextGrid visualization.
Problem: tempo detection unrealistic (too fast or too slow). Cause: Pulse_unit setting wrong for the speech rhythm. Solution: try a different Pulse_unit (quarter or eighth note usually works best).
Problem: quantization error high (>50 ms average). Cause: tempo doesn't match the speech rhythm, or divisions too low. Solution: adjust the tempo manually or increase Divisions_per_quarter.
Problem: MusicXML won't open in notation software. Cause: XML structure issue or software compatibility. Solution: make sure you copied the entire XML output, save it as .xml or .musicxml, and try MuseScore.
Problem: dynamics not showing in notation. Cause: Extract_dynamics disabled, or intensity range too small. Solution: enable Extract_dynamics and check that the recording has dynamic variation.
Integration with Other Tools
Complete speech-to-music workflow:
Pre-processing: Use FIR filter bank to clean speech recording
Rhythm extraction: This script to get MusicXML rhythm
Pitch extraction: Use Praat's pitch analysis for melodic contour