Speech to MusicXML Rhythm — User Guide

Convert speech rhythm and dynamics into MusicXML notation with automatic tempo detection, meter estimation, and quantized rhythmic notation for music composition and analysis.

Author: Shai Cohen
Affiliation: Department of Music, Bar-Ilan University, Israel
Version: 2.7 (Notation Edition, 2025)
License: MIT License
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

What this does

This script converts speech rhythm into MusicXML notation by detecting syllable onsets, estimating tempo and meter, quantizing to musical note values, and generating standard MusicXML files that can be opened in notation software like MuseScore, Finale, or Sibelius. The system performs sophisticated acoustic analysis to extract rhythmic patterns from speech, converts them to quantized musical rhythms, preserves dynamic information as musical dynamics, and creates both MusicXML notation and detailed rhythm TextGrids for analysis. This enables composers, researchers, and musicians to transcribe speech rhythms into musical notation automatically.

Why Convert Speech to Musical Rhythm?

Speech contains rich rhythmic patterns that can inspire musical composition and provide insight into prosody and timing. This conversion enables:

  1. Musical composition: use speech rhythms as compositional material.
  2. Prosody analysis: study speech timing through musical notation.
  3. Educational tools: visualize speech rhythm for language learning.
  4. Creative applications: generate rhythmic patterns from spoken word.

The system bridges speech analysis and music notation by:

  1. Detecting acoustic events: syllable onsets and intensity peaks.
  2. Statistical analysis: finding the tempo through histogram analysis.
  3. Musical mapping: converting continuous time to discrete note values.
  4. Preserving expressivity: maintaining dynamics and timing variations.

The result is musically meaningful notation that captures the essence of speech rhythm.

Technical Implementation:

  1. Silence detection: creates a TextGrid with silent/sounding intervals using an intensity threshold.
  2. Onset detection: three methods with peak refinement and prominence filtering.
  3. Tempo estimation: histogram analysis of inter-onset intervals with confidence scoring.
  4. Meter estimation: analyzes accent patterns to determine the time signature.
  5. Quantization: rounds onsets to the nearest musical grid position with note value optimization.
  6. Dynamics mapping: converts intensity to musical dynamics markings.
  7. MusicXML generation: builds a valid MusicXML structure with measures, notes, rests, and metadata.
  8. TextGrid creation: multi-tier visualization of rhythm, dynamics, beats, and measures.

The complete pipeline runs within Praat using built-in analysis capabilities; a compact sketch of the core arithmetic follows.
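To make the data flow concrete, here is a minimal Python sketch of stages 3-5 (IOIs, tempo, quantization). It is an illustration only, not the plugin's code; the onset times are invented, and a median stands in for the histogram-peak tempo estimate described later in this guide.

import numpy as np

# Hypothetical onset times (seconds), as produced by onset detection
onsets = np.array([0.00, 0.48, 1.02, 1.51, 2.00, 2.52, 3.01])

iois = np.diff(onsets)                        # inter-onset intervals
dominant_ioi = np.median(iois)                # simplified stand-in for the histogram peak
quarter_dur = dominant_ioi                    # assumes pulse_unit = quarter note
tempo = 60.0 / quarter_dur                    # BPM

divisions = 8                                 # divisions per quarter note
division_dur = quarter_dur / divisions
grid = np.round(onsets / division_dur)        # quantize onsets to the grid
error = np.abs(onsets - grid * division_dur)  # per-onset quantization error (s)

print(f"tempo ~ {tempo:.0f} BPM, mean quantization error {error.mean()*1000:.1f} ms")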

Quick start

  1. In Praat, select exactly one Sound object (mono recommended).
  2. Run the script speech_to_musicxml_rhythm_v2.7.praat (e.g., open it via Praat → Open Praat script…, then choose Run).
  3. Enable Auto_detect_tempo (recommended for most cases).
  4. Set Pulse_unit: "Quarter note" for typical speech rhythm.
  5. Set Divisions_per_quarter: 8 for 32nd note resolution.
  6. Enable Allow_dotted_notes for natural rhythm variations.
  7. Enable Auto_detect_meter to automatically find time signature.
  8. Choose Detection_method: "Intensity only" for speech, "Multi-feature" for music.
  9. Set Min_onset_separation: 0.08s (80ms) for syllable-level detection.
  10. Adjust Prominence_threshold: 2.5dB for clear onsets.
  11. Set silence parameters: Min_silent_duration 0.10s, Silence_threshold -25dB.
  12. Enable Extract_dynamics to preserve loudness variations.
  13. Choose Output_pitch: "C4 (middle C)" for standard notation.
  14. Enable Create_TextGrid for rhythm visualization.
  15. Click OK — analysis, detection, quantization, and XML generation will run.
  16. Copy MusicXML output from Info window and save as .xml file.
  17. Open .xml file in MuseScore, Finale, Sibelius, or other notation software.
Quick tip: Start with clean speech recordings (1-10 seconds) for best results. For poetry or recited text, use "Intensity only" detection with Pulse_unit = "Quarter note"; for musical or sung phrases, use "Multi-feature" detection. Auto tempo detection works best when the speech has clear rhythmic patterns. Processing stages: (1) silence/speech segmentation (2-5 seconds), (2) onset detection (5-15 seconds), (3) tempo/meter estimation, (4) quantization, (5) MusicXML generation. For visual feedback, always enable Create_TextGrid to see the detected onsets and the rhythm mapping. The MusicXML output appears in Praat's Info window; copy everything from <?xml through the closing </score-partwise> tag and save it as a .xml file. Recommended software: MuseScore (free) for viewing and editing the notation.
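Before importing, it can help to confirm that the copied text is well-formed XML. A minimal Python check (the file names here are hypothetical):

import xml.etree.ElementTree as ET

# Paste the Info window contents into a text file first (name is arbitrary)
xml_text = open("info_window_output.txt", encoding="utf-8").read()

root = ET.fromstring(xml_text)         # raises ParseError if the copy is incomplete
assert root.tag == "score-partwise"    # top-level element of partwise MusicXML

with open("speech_rhythm.musicxml", "w", encoding="utf-8") as f:
    f.write(xml_text)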
Important:
  • Clean input required: background noise affects onset detection.
  • Minimum length: at least 2 seconds of speech with clear rhythm.
  • Onset separation: too low may detect micro-variations; too high may miss syllables.
  • Tempo detection: works best with regular speech rhythm; irregular speech may produce an uncertain tempo.
  • Quantization error: speech timing doesn't exactly match a musical grid, so some error is inevitable.
  • Meter detection: based on accent patterns; may not match linguistic meter.
  • MusicXML compatibility: generated XML follows the MusicXML 3.1 standard.
  • Pitch setting: only affects visual notation, not audio.
  • Silence threshold: adjust based on the recording's noise floor.
  • Dynamics mapping: relative to the within-file intensity range, not absolute dB.

Rhythm Extraction Theory

Speech Rhythm Fundamentals

⏱️ From Continuous Time to Discrete Notation

Speech timing: Continuous, variable, context-dependent

Musical rhythm: Discrete, quantized, grid-based

Mapping challenge: Convert continuous onsets to note values

Key concepts: Onsets, inter-onset intervals (IOIs), tempo, meter

Musical elements: Notes, rests, dotted values, dynamics

Rhythmic Analysis Mathematics

# SPEECH RHYTHM ANALYSIS MATHEMATICS

# 1. ONSET DETECTION (time domain)
# Speech signal: s(t)
# Intensity envelope: I(t) = 10·log₁₀(∫|s(τ)|² dτ)
# An onset is a local maximum in I(t) with sufficient prominence.

# Peak refinement (parabolic interpolation):
# Given three points (t₁, I₁), (t₂, I₂), (t₃, I₃) with t₂ as the peak:
α = I₁, β = I₂, γ = I₃
p = 0.5 × (α - γ) / (α - 2β + γ)      # peak offset from center
refined_t = t₂ + p × Δt               # refined time
refined_I = β - 0.25 × (α - γ) × p    # refined intensity

# 2. PROMINENCE CALCULATION
# Local median over a window: M(t) = median{I(t-Δ) … I(t+Δ)}
# Prominence: P(t) = I(t) - M(t)
# Threshold: P(t) ≥ prominence_threshold (dB)

# 3. INTER-ONSET INTERVALS (IOIs)
# For N onsets at times t₁, t₂, ..., t_N:
IOIᵢ = tᵢ₊₁ - tᵢ    for i = 1 to N-1

# 4. TEMPO ESTIMATION (histogram method)
# Create a histogram of IOIs with bins b₁, b₂, ..., b_K
# Find the peak bin: b_peak = argmax count(b)
# Weighted average around the peak:
IOI_dominant = Σ_{b∈peak_region} count(b)·center(b) / Σ_{b∈peak_region} count(b)

# Map to a musical pulse based on the pulse_unit setting:
IF pulse_unit = 1 (whole note):   quarter_dur = IOI_dominant / 4
IF pulse_unit = 2 (half note):    quarter_dur = IOI_dominant / 2
IF pulse_unit = 3 (quarter note): quarter_dur = IOI_dominant
IF pulse_unit = 4 (eighth note):  quarter_dur = IOI_dominant × 2
IF pulse_unit = 5 (16th note):    quarter_dur = IOI_dominant × 4

# Tempo calculation:
tempo = 60 / quarter_dur    # BPM

# 5. CONFIDENCE SCORING
peak_region = bins within ±2 of the peak
total_count = Σ count(b) for all bins
peak_count = Σ count(b) for b ∈ peak_region
confidence = peak_count / total_count

# 6. QUANTIZATION
division_dur = quarter_dur / divisions_per_quarter
raw_divs = onset_time / division_dur
quantized_divs = round(raw_divs)
error = |raw_divs - quantized_divs| × division_dur

# 7. NOTE VALUE DURATIONS (in divisions)
whole_note = divisions × 4
half_note = divisions × 2
quarter_note = divisions
eighth_note = max(1, divisions / 2)
sixteenth_note = max(1, divisions / 4)
thirtysecond_note = max(1, divisions / 8)

# Dotted values:
dotted_half = half_note + quarter_note
dotted_quarter = quarter_note + eighth_note
dotted_eighth = eighth_note + sixteenth_note
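The parabolic refinement in step 1 translates directly to code. A minimal Python version of the formulas above (variable names chosen to match the math; the example numbers are invented):

def refine_peak(alpha, beta, gamma, t2, dt):
    """Parabolic interpolation of a peak from three samples.
    alpha, beta, gamma: intensity at t2 - dt, t2, t2 + dt (beta is the raw peak).
    Returns (refined_time, refined_intensity)."""
    denom = alpha - 2 * beta + gamma
    if denom == 0:                       # flat triple: keep the raw peak
        return t2, beta
    p = 0.5 * (alpha - gamma) / denom    # offset in units of dt
    return t2 + p * dt, beta - 0.25 * (alpha - gamma) * p

# Example: samples 58, 62, 60 dB around t2 = 1.00 s with dt = 0.025 s
t, i = refine_peak(58.0, 62.0, 60.0, 1.00, 0.025)
print(round(t, 4), round(i, 2))    # peak shifts slightly toward the louder neighbour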

Musical Grid System

Quantization & Note Value Selection

Converting continuous time to musical notation:

# MUSICAL QUANTIZATION ALGORITHM
# INPUT: target duration in divisions (T_divs)
# OUTPUT: best note value(s) to represent the duration

procedure getBestNoteValue: T_divs
    # Available note values (in divisions):
    whole = divisions × 4                 # 𝅝
    half = divisions × 2                  # 𝅗𝅥
    quarter = divisions                   # 𝅘𝅥
    eighth = max(1, divisions / 2)        # 𝅘𝅥𝅮
    sixteenth = max(1, divisions / 4)     # 𝅘𝅥𝅯
    thirtysecond = max(1, divisions / 8)  # 𝅘𝅥𝅰

    # Dotted values (if allow_dotted = 1):
    dotted_half = half + quarter
    dotted_quarter = quarter + eighth
    dotted_eighth = eighth + sixteenth

    # Greedy algorithm: use the largest note value that fits
    IF T_divs ≥ whole:
        duration = whole
        type$ = "whole"
        dotted = 0
    ELSIF T_divs ≥ dotted_half AND allow_dotted:
        duration = dotted_half
        type$ = "half"
        dotted = 1
    ELSIF T_divs ≥ half:
        duration = half
        type$ = "half"
        dotted = 0
    ELSIF T_divs ≥ dotted_quarter AND allow_dotted:
        duration = dotted_quarter
        type$ = "quarter"
        dotted = 1
    ELSIF T_divs ≥ quarter:
        duration = quarter
        type$ = "quarter"
        dotted = 0
    ELSIF T_divs ≥ dotted_eighth AND allow_dotted:
        duration = dotted_eighth
        type$ = "eighth"
        dotted = 1
    ELSIF T_divs ≥ eighth:
        duration = eighth
        type$ = "eighth"
        dotted = 0
    ELSIF T_divs ≥ sixteenth:
        duration = sixteenth
        type$ = "16th"
        dotted = 0
    ELSIF T_divs ≥ thirtysecond:
        duration = thirtysecond
        type$ = "32nd"
        dotted = 0
    ELSE:
        duration = 1    # minimum representable duration (one division)
        type$ = "32nd"
        dotted = 0
    ENDIF

    # If the selected duration overshoots the target, step down one value
    IF duration > T_divs:
        CASE type$ OF:
            "whole":
                duration = half
                type$ = "half"
            "half":
                IF dotted = 1:
                    duration = half
                    dotted = 0
                ELSE:
                    duration = quarter
                    type$ = "quarter"
                ENDIF
            "quarter":
                IF dotted = 1:
                    duration = quarter
                    dotted = 0
                ELSE:
                    duration = eighth
                    type$ = "eighth"
                ENDIF
            "eighth":
                IF dotted = 1:
                    duration = eighth
                    dotted = 0
                ELSE:
                    duration = sixteenth
                    type$ = "16th"
                ENDIF
            "16th":
                duration = thirtysecond
                type$ = "32nd"
            "32nd":
                duration = 1
                type$ = "32nd"
        ENDCASE
    ENDIF
endproc

# QUANTIZATION EXAMPLE:
# divisions = 8, T_divs = 10
# Available: whole=32, half=16, quarter=8, eighth=4, sixteenth=2, 32nd=1
# Dotted: dotted_half=24, dotted_quarter=12, dotted_eighth=6
# Greedy pass: dotted_quarter (12) overshoots, so quarter (8) is selected,
# leaving a remainder of 2 divisions, which maps to a sixteenth (2)
# Result: quarter + sixteenth (8 + 2 = 10)
# The algorithm picks the most efficient representation that fits

# MEASURE FILLING:
# Each measure has capacity: divs_per_measure = beats × divisions
# Notes are placed sequentially; measures fill automatically
# When measure capacity is reached, a new measure starts
# Rests are added to complete incomplete measures
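A runnable Python sketch of the same greedy selection, extended to decompose a full target duration into a sequence of note values. It assumes the division table above; it is an illustration, not the plugin's exact code:

def decompose(t_divs, divisions=8, allow_dotted=True):
    """Greedily split a duration (in divisions) into note values, largest first."""
    plain = [
        ("whole", divisions * 4, False),
        ("half", divisions * 2, False),
        ("quarter", divisions, False),
        ("eighth", max(1, divisions // 2), False),
        ("16th", max(1, divisions // 4), False),
        ("32nd", max(1, divisions // 8), False),
    ]
    dotted = [
        ("half", divisions * 3, True),                       # half + quarter
        ("quarter", divisions + divisions // 2, True),       # quarter + eighth
        ("eighth", divisions // 2 + divisions // 4, True),   # eighth + 16th
    ] if allow_dotted else []
    candidates = sorted(plain + dotted, key=lambda c: -c[1])

    result, remaining = [], t_divs
    while remaining > 0:
        # largest value that fits; fall back to the 1-division minimum
        pick = next((c for c in candidates if 0 < c[1] <= remaining), ("32nd", 1, False))
        result.append(pick)
        remaining -= pick[1]
    return result

print(decompose(10))    # [('quarter', 8, False), ('16th', 2, False)]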

Onset Detection Methods

Three Detection Algorithms

🔊 Method 1: Intensity Only

Principle: Detect peaks in intensity envelope

Processing: Parabolic interpolation for precise timing

Best for: Clear speech, poetry, recited text

Parameters: Prominence threshold, min separation

Advantages: Fast, reliable for most speech

📈 Method 2: Multi-Feature (Intensity + Spectral)

Principle: Combine intensity peaks with spectral flux

Processing: Requires both intensity peak AND spectral increase

Best for: Music, singing, noisy environments

Parameters: Same as Method 1 plus spectral validation

Advantages: More robust, fewer false positives

🎵 Method 3: Syllable Nuclei

Principle: Detect voiced syllable centers

Processing: Pitch-synchronous detection with F0 validation

Best for: Clear voiced speech, language analysis

Parameters: Pitch floor/ceiling, voicing threshold

Advantages: Captures syllable rhythm accurately

Onset Detection Implementation

# ONSET DETECTION IMPLEMENTATION DETAILS

# === METHOD 1: INTENSITY ONLY ===
# 1. Get an intensity object (100 Hz minimum pitch, 0.01 s steps)
To Intensity: 100, 0.01, "yes"

# 2. Detect silent/sounding intervals
To TextGrid (silences): 100, 0, silence_threshold, min_silent_dur, min_sounding_dur

# 3. Scan each sounding interval
FOR each sounding interval:
    adaptive_window = max(0.02, min(0.05, interval_dur / 10))
    t = interval_start + adaptive_window
    WHILE t < interval_end - adaptive_window:
        # Get intensity at three points for the parabolic fit
        int_val = Get value at time: t, "Cubic"
        int_m1 = Get value at time: t - adaptive_window/2, "Cubic"
        int_p1 = Get value at time: t + adaptive_window/2, "Cubic"

        # Check for a local maximum
        IF int_val > int_m1 AND int_val > int_p1:
            # Parabolic refinement
            α = int_m1, β = int_val, γ = int_p1
            IF (α - 2β + γ) ≠ 0:
                p = 0.5 × (α - γ) / (α - 2β + γ)
                refined_t = t + p × (adaptive_window/2)
                refined_int = β - 0.25 × (α - γ) × p
            ELSE:
                refined_t = t
                refined_int = int_val

            # Calculate prominence (local median)
            window_start = max(t - 0.15, interval_start)
            window_end = min(t + 0.15, interval_end)
            sample_vals[1..15] = intensity at 15 evenly spaced points
            # Sort for the median
            FOR i FROM 1 TO 14:
                FOR j FROM i+1 TO 15:
                    IF sample_vals[i] > sample_vals[j]:
                        SWAP sample_vals[i], sample_vals[j]
            local_median = sample_vals[8]    # 8th of 15 = median
            prominence = refined_int - local_median

            # Apply threshold and separation constraints
            IF prominence ≥ prominence_threshold AND refined_t - last_onset ≥ min_separation:
                ADD ONSET: refined_t, refined_int
                last_onset = refined_t
        t = t + 0.005    # 5 ms step

# === METHOD 2: MULTI-FEATURE ===
# Same as Method 1 PLUS spectral flux validation
# Additional step after intensity peak detection:
selectObject: spectrogram
slice_before = Get power at: refined_t - 0.02, 1000
slice_after = Get power at: refined_t + 0.01, 1000
IF slice_before ≠ undefined AND slice_after ≠ undefined:
    flux = slice_after - slice_before
    spectral_onset = (flux > 0)    # true if spectral increase

# Both an intensity peak AND a spectral increase are required
IF prominence ≥ prominence_threshold AND refined_t - last_onset ≥ min_separation AND spectral_onset:
    ADD ONSET: refined_t, refined_int

# === METHOD 3: SYLLABLE NUCLEI ===
# 1. Get a pitch object for voiced/unvoiced detection
To Pitch: 0, 75, 600

# 2. In each sounding interval:
selectObject: intensity
t = interval_start + 0.03
WHILE t < interval_end - 0.03:
    # Find a local maximum in intensity
    local_max_t = Get time of maximum: t - 0.04, t + 0.04, "Parabolic"
    local_max_int = Get maximum: t - 0.04, t + 0.04, "Parabolic"
    IF local_max_t ≠ undefined AND abs(local_max_t - t) < 0.01:
        # Check for voicing at this time
        selectObject: pitch_obj
        f0 = Get value at time: local_max_t, "Hertz", "Linear"
        IF f0 ≠ undefined:
            # Voiced = syllable nucleus
            IF local_max_t - last_onset ≥ min_separation:
                ADD ONSET: local_max_t, local_max_int
                last_onset = local_max_t
    t = t + 0.02    # 20 ms step

# === PARAMETER EFFECTS ===
# min_separation: higher = fewer onsets, more separation
# prominence_threshold: higher = only strong onsets
# silence_threshold: higher = more detected as silence
# adaptive_window: automatically adjusts to interval length

# === PERFORMANCE CHARACTERISTICS ===
# Method 1: fastest, good for clean speech
# Method 2: slower (spectrogram), more robust
# Method 3: medium speed, syllable-focused
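For experimentation outside Praat, the Method 1 logic can be approximated on a uniformly sampled intensity envelope. A simplified Python sketch (no parabolic refinement or silence gating; the envelope would come from your own analysis code, and the demo signal is synthetic):

import numpy as np

def detect_onsets(times, intensity_db, prominence_db=2.5, min_sep=0.08, med_win=0.15):
    """Local intensity maxima whose prominence over the local median exceeds a threshold."""
    onsets, last = [], -np.inf
    for i in range(1, len(intensity_db) - 1):
        if not (intensity_db[i] > intensity_db[i - 1] and intensity_db[i] > intensity_db[i + 1]):
            continue                                   # not a local maximum
        window = np.abs(times - times[i]) <= med_win   # +/-150 ms neighbourhood
        prominence = intensity_db[i] - np.median(intensity_db[window])
        if prominence >= prominence_db and times[i] - last >= min_sep:
            onsets.append(times[i])
            last = times[i]
    return onsets

# Tiny synthetic demo: two 8 dB bumps on a 50 dB floor, 10 ms frames
t = np.arange(0, 1.0, 0.01)
env = 50 + 8 * np.exp(-((t - 0.3) / 0.03) ** 2) + 8 * np.exp(-((t - 0.7) / 0.03) ** 2)
print(detect_onsets(t, env))    # approximately [0.3, 0.7]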

Silence Detection & Segmentation

🔇 Smart Silence/Speech Segmentation

Purpose: Separate speech from silence/pauses

Method: Intensity threshold with duration constraints

Parameters: Silence threshold (dB), min silent/sounding durations

Output: TextGrid with "silent" and "sounding" intervals

Importance: Prevents false onsets in silent regions
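A simplified Python sketch of this segmentation, assuming a uniformly sampled intensity envelope in dB. The threshold is taken relative to the file's peak intensity, mirroring Praat's silence detection, and too-short runs are absorbed in a single pass for brevity:

import numpy as np

def segment_silences(times, intensity_db, silence_threshold=-25.0,
                     min_silent=0.10, min_sounding=0.08):
    """Label frames silent/sounding, then absorb runs shorter than the minimums."""
    cutoff = np.max(intensity_db) + silence_threshold    # dB relative to the file's peak
    labels = intensity_db > cutoff                       # True = sounding

    # Collect runs as [start_time, end_time, is_sounding]
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append([times[start], times[i - 1], bool(labels[start])])
            start = i

    # Single pass (a simplification): relabel too-short runs, then merge neighbours
    for run in runs:
        min_dur = min_sounding if run[2] else min_silent
        if run[1] - run[0] < min_dur:
            run[2] = not run[2]
    merged = [runs[0]]
    for run in runs[1:]:
        if run[2] == merged[-1][2]:
            merged[-1][1] = run[1]    # extend the previous run
        else:
            merged.append(run)
    return merged    # e.g. [[0.0, 0.42, True], [0.42, 0.61, False], ...]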

Tempo & Meter Analysis

Automatic Tempo Detection

🎼 Statistical Tempo Estimation

Input: Inter-onset intervals (IOIs) from detected onsets

Method: Histogram analysis with peak detection

Output: Tempo (BPM), confidence score, dominant IOI

Pulse mapping: Maps IOI to musical note value (whole to 16th)

Validation: Checks for reasonable tempo range (30-300 BPM)

Tempo Detection Algorithm

# TEMPO DETECTION ALGORITHM

# 1. Calculate inter-onset intervals (IOIs)
FOR i FROM 1 TO onset_count - 1:
    IOI[i] = onset_time[i + 1] - onset_time[i]

# 2. Determine histogram range
ioi_min = min(IOI[1..n_iois])
ioi_max = max(IOI[1..n_iois])
hist_min = max(0.05, ioi_min × 0.8)    # minimum 50 ms
hist_max = min(2.0, ioi_max × 1.2)     # maximum 2 seconds
n_bins = 50
bin_width = (hist_max - hist_min) / n_bins

# 3. Build the histogram
FOR b FROM 1 TO n_bins:
    hist_count[b] = 0
    hist_center[b] = hist_min + (b - 0.5) × bin_width
FOR i FROM 1 TO n_iois:
    IF IOI[i] ≥ hist_min AND IOI[i] < hist_max:
        bin_idx = floor((IOI[i] - hist_min) / bin_width) + 1
        IF bin_idx ≥ 1 AND bin_idx ≤ n_bins:
            hist_count[bin_idx] = hist_count[bin_idx] + 1

# 4. Find the peak region
peak_bin = 1
peak_count = hist_count[1]
FOR b FROM 2 TO n_bins:
    IF hist_count[b] > peak_count:
        peak_count = hist_count[b]
        peak_bin = b

# 5. Weighted average around the peak (±2-bin window)
weight_sum = 0
weighted_ioi = 0
FOR b FROM max(1, peak_bin - 2) TO min(n_bins, peak_bin + 2):
    weight_sum = weight_sum + hist_count[b]
    weighted_ioi = weighted_ioi + hist_count[b] × hist_center[b]
IF weight_sum > 0:
    dominant_ioi = weighted_ioi / weight_sum
ELSE:
    dominant_ioi = hist_center[peak_bin]

# 6. Confidence calculation
total_count = 0
FOR b FROM 1 TO n_bins:
    total_count = total_count + hist_count[b]
peak_region_count = 0
FOR b FROM max(1, peak_bin - 2) TO min(n_bins, peak_bin + 2):
    peak_region_count = peak_region_count + hist_count[b]
confidence = peak_region_count / (total_count + 0.001)

# 7. Map to a musical pulse based on the pulse_unit setting
CASE pulse_unit OF:
    1:    # Whole note
        quarter_dur_est = dominant_ioi / 4
        pulse_name$ = "whole note"
    2:    # Half note
        quarter_dur_est = dominant_ioi / 2
        pulse_name$ = "half note"
    3:    # Quarter note
        quarter_dur_est = dominant_ioi
        pulse_name$ = "quarter note"
    4:    # Eighth note
        quarter_dur_est = dominant_ioi × 2
        pulse_name$ = "eighth note"
    5:    # 16th note
        quarter_dur_est = dominant_ioi × 4
        pulse_name$ = "16th note"
ENDCASE

# 8. Calculate tempo
raw_tempo = 60.0 / quarter_dur_est

# 9. Validate and adjust the tempo range
# Speech is typically 60-180 BPM; music 40-240 BPM
IF raw_tempo < 40:
    # Too slow; the pulse may really be a half note
    raw_tempo = raw_tempo × 2
ELSIF raw_tempo > 240:
    # Too fast; the pulse may really be a quarter note
    raw_tempo = raw_tempo / 2
tempo = round(raw_tempo)
tempo = max(30, min(300, tempo))    # clamp to a reasonable range

# 10. Output statistics
# dominant_ioi: most common interval between onsets (seconds)
# tempo: estimated beats per minute
# confidence: 0-1; higher = more consistent rhythm
# pulse_name$: musical note value that maps to dominant_ioi

# === INTERPRETATION EXAMPLES ===
# Dominant IOI = 0.5 s, pulse_unit = quarter note:
#   quarter_dur = 0.5 s, tempo = 60/0.5 = 120 BPM
# Dominant IOI = 0.25 s, pulse_unit = eighth note:
#   quarter_dur = 0.25 × 2 = 0.5 s, tempo = 60/0.5 = 120 BPM
# Dominant IOI = 0.333 s, pulse_unit = quarter note:
#   quarter_dur = 0.333 s, tempo = 60/0.333 ≈ 180 BPM
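The same histogram procedure condensed into runnable Python, using NumPy's histogram in place of the explicit binning loop (the demo onset times are invented):

import numpy as np

PULSE_TO_QUARTER = {"whole": 0.25, "half": 0.5, "quarter": 1.0, "eighth": 2.0, "16th": 4.0}

def estimate_tempo(onset_times, pulse_unit="quarter", n_bins=50):
    iois = np.diff(np.asarray(onset_times))
    lo = max(0.05, iois.min() * 0.8)
    hi = min(2.0, iois.max() * 1.2)
    counts, edges = np.histogram(iois, bins=n_bins, range=(lo, hi))
    centers = 0.5 * (edges[:-1] + edges[1:])

    peak = int(counts.argmax())
    region = slice(max(0, peak - 2), min(n_bins, peak + 3))    # +/-2 bins around the peak
    dominant_ioi = float(np.average(centers[region], weights=counts[region] + 1e-12))
    confidence = counts[region].sum() / (counts.sum() + 1e-3)

    quarter_dur = dominant_ioi * PULSE_TO_QUARTER[pulse_unit]
    tempo = 60.0 / quarter_dur
    if tempo < 40:                     # pulse probably one level too slow
        tempo *= 2
    elif tempo > 240:                  # pulse probably one level too fast
        tempo /= 2
    return int(round(min(300, max(30, tempo)))), round(float(confidence), 2)

print(estimate_tempo([0.0, 0.51, 1.0, 1.52, 2.01, 2.49]))    # roughly 120 BPM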

Meter Estimation

🎵 Time Signature Detection

Principle: Analyze accent patterns for metrical structure

Method: Intensity-based accent detection and interval analysis

Output: Beats per measure, beat type, compound meter flag

Common results: 2/4, 3/4, 4/4, 6/8 based on speech patterns

Fallback: Defaults to 4/4 if pattern unclear

Meter Detection Algorithm

# METER ESTIMATION ALGORITHM

# 1. Find accents (intensity peaks above the mean)
int_sum = 0
FOR i FROM 1 TO onset_count:
    int_sum = int_sum + onset_intensity[i]
int_mean = int_sum / onset_count

# 2. Detect accent positions
accent_count = 0
last_accent = 0
FOR i FROM 1 TO onset_count:
    IF onset_intensity[i] > int_mean × 1.1:    # 10% above the mean
        IF last_accent > 0:
            accent_count = accent_count + 1
            accent_interval[accent_count] = i - last_accent
        ENDIF
        last_accent = i

# 3. Analyze accent intervals
# Count occurrences of common metrical patterns
count_2 = 0    # 2-beat patterns (2/4, 2/2)
count_3 = 0    # 3-beat patterns (3/4)
count_4 = 0    # 4-beat patterns (4/4)
count_6 = 0    # 6-beat patterns (6/8 compound)
FOR i FROM 1 TO accent_count:
    CASE accent_interval[i] OF:
        2: count_2 = count_2 + 1    # accents every 2 onsets
        3: count_3 = count_3 + 1    # accents every 3 onsets
        4: count_4 = count_4 + 1    # accents every 4 onsets
        6: count_6 = count_6 + 1    # accents every 6 onsets (compound)
    ENDCASE

# 4. Determine the meter from the strongest pattern
IF accent_count ≥ 3:    # need enough data
    IF count_6 > count_4 AND count_6 > count_3 AND count_6 > count_2:
        beats = 6
        beat_type = 8
        compound = 1    # compound meter (6/8)
    ELSIF count_3 > count_4 AND count_3 > count_2:
        beats = 3
        beat_type = 4
        compound = 0    # simple triple (3/4)
    ELSIF count_2 > count_4:
        beats = 2
        beat_type = 4
        compound = 0    # simple duple (2/4)
    ELSE:
        beats = 4
        beat_type = 4
        compound = 0    # default to 4/4
ELSE:
    # Not enough accents, use defaults
    beats = 4
    beat_type = 4
    compound = 0

# 5. Calculate the measure structure
IF compound:
    # Compound meter: beat unit = dotted quarter
    # Example: 6/8 = 2 beats of a dotted quarter
    measure_dur = (beats / 3) × (beat_dur × 1.5)
    divs_per_measure = beats × (divisions / 2)
ELSE:
    # Simple meter: beat unit = quarter
    measure_dur = beat_dur × beats
    divs_per_measure = beats × divisions

# === METER INTERPRETATION EXAMPLES ===
# Accent pattern: X . . X . . (every 3 onsets)
#   → count_3 highest → 3/4 time
# Accent pattern: X . X . (every 2 onsets)
#   → count_2 highest → 2/4 time
# Accent pattern: X . . . X . . . (every 4 onsets)
#   → count_4 highest → 4/4 time
# Accent pattern: X . . . . . X . . . . . (every 6 onsets)
#   → count_6 highest → 6/8 time (compound duple)

# === COMPOUND METER DETECTION ===
# Compound meter (6/8, 9/8, 12/8) has accents grouping in 3s
# An accent interval of 6 means accents every 6 onsets
# This suggests 2 groups of 3 (6/8) or 3 groups of 3 (9/8)

# === LIMITATIONS ===
# Requires clear accent patterns
# May not match linguistic (poetic) meter
# Works best with rhythmic, accented speech
# Music with syncopation may confuse detection
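A direct Python transcription of the accent-interval voting (the demo intensities are invented; in the plugin they come from the detected onsets):

from collections import Counter

def estimate_meter(onset_intensity):
    """Return (beats, beat_type, compound) from accent spacing, defaulting to 4/4."""
    mean = sum(onset_intensity) / len(onset_intensity)
    accents = [i for i, v in enumerate(onset_intensity) if v > 1.1 * mean]
    gaps = Counter(b - a for a, b in zip(accents, accents[1:]))

    if sum(gaps.values()) < 3:                 # not enough accent intervals
        return 4, 4, False
    c2, c3, c4, c6 = gaps[2], gaps[3], gaps[4], gaps[6]
    if c6 > c4 and c6 > c3 and c6 > c2:
        return 6, 8, True                      # compound duple (6/8)
    if c3 > c4 and c3 > c2:
        return 3, 4, False                     # simple triple (3/4)
    if c2 > c4:
        return 2, 4, False                     # simple duple (2/4)
    return 4, 4, False

# Accent every 3rd onset: X . . X . . X . . X . .
print(estimate_meter([75, 60, 60, 75, 60, 60, 75, 60, 60, 75, 60, 60]))  # (3, 4, False)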

MusicXML Generation

MusicXML Structure

📄 Standard Music Notation Format

Format: MusicXML 3.1 Partwise DTD

Structure: Score-partwise with measures, attributes, notes

Compatibility: MuseScore, Finale, Sibelius, Dorico, etc.

Elements: Work info, identification, part list, measures

Note data: Pitch, duration, type, dots, dynamics

MusicXML Implementation

# MUSICXML GENERATION IMPLEMENTATION

# ===== HEADER =====
xml$ = "<?xml version=""1.0"" encoding=""UTF-8""?>" + newline$
xml$ = xml$ + "<!DOCTYPE score-partwise PUBLIC ""-//Recordare//DTD MusicXML 3.1 Partwise//EN"" ""http://www.musicxml.org/dtds/partwise.dtd"">" + newline$
xml$ = xml$ + "<score-partwise version=""3.1"">" + newline$

# Optional analysis comments (if include_comments = 1)
IF include_comments:
    xml$ = xml$ + "  <!-- ... -->" + newline$
    xml$ = xml$ + "  <!-- ... -->" + newline$
    IF auto_tempo:
        xml$ = xml$ + "  <!-- ... -->" + newline$
        xml$ = xml$ + "  <!-- ... -->" + newline$

# ===== WORK INFORMATION =====
xml$ = xml$ + "  <work>" + newline$
xml$ = xml$ + "    <work-title>" + sound_name$ + " - Rhythm</work-title>" + newline$
xml$ = xml$ + "  </work>" + newline$

# ===== IDENTIFICATION =====
xml$ = xml$ + "  <identification>" + newline$
xml$ = xml$ + "    <creator type=""composer"">Speech Analysis</creator>" + newline$
xml$ = xml$ + "    <encoding>" + newline$
xml$ = xml$ + "      <software>Praat AudioTools v2.7</software>" + newline$
xml$ = xml$ + "    </encoding>" + newline$
xml$ = xml$ + "  </identification>" + newline$

# ===== PART LIST =====
xml$ = xml$ + "  <part-list>" + newline$
xml$ = xml$ + "    <score-part id=""P1"">" + newline$
xml$ = xml$ + "      <part-name>Speech Rhythm</part-name>" + newline$
xml$ = xml$ + "    </score-part>" + newline$
xml$ = xml$ + "  </part-list>" + newline$

# ===== PART & MEASURES =====
xml$ = xml$ + "  <part id=""P1"">" + newline$

# Measure 1 with attributes
xml$ = xml$ + "    <measure number=""1"">" + newline$
xml$ = xml$ + "      <attributes>" + newline$
xml$ = xml$ + "        <divisions>" + string$(divisions) + "</divisions>" + newline$
xml$ = xml$ + "        <time>" + newline$
xml$ = xml$ + "          <beats>" + string$(beats) + "</beats>" + newline$
xml$ = xml$ + "          <beat-type>" + string$(beat_type) + "</beat-type>" + newline$
xml$ = xml$ + "        </time>" + newline$
xml$ = xml$ + "        <clef>" + newline$
xml$ = xml$ + "          <sign>G</sign>" + newline$
xml$ = xml$ + "          <line>2</line>" + newline$
xml$ = xml$ + "        </clef>" + newline$
xml$ = xml$ + "      </attributes>" + newline$

# Tempo indication
xml$ = xml$ + "      <direction placement=""above"">" + newline$
xml$ = xml$ + "        <direction-type>" + newline$
xml$ = xml$ + "          <metronome>" + newline$
xml$ = xml$ + "            <beat-unit>quarter</beat-unit>" + newline$
xml$ = xml$ + "            <per-minute>" + string$(tempo) + "</per-minute>" + newline$
xml$ = xml$ + "          </metronome>" + newline$
xml$ = xml$ + "        </direction-type>" + newline$
xml$ = xml$ + "      </direction>" + newline$

# ===== NOTE/REST GENERATION =====
# For each item in the note list:
FOR n FROM 1 TO note_list_count:
    IF note_type[n] = 0:
        type$ = "rest"
    ELSE:
        type$ = "note"

    # Emit the XML element
    xml$ = xml$ + "      <note>" + newline$
    IF type$ = "rest":
        xml$ = xml$ + "        <rest/>" + newline$
    ELSE:
        xml$ = xml$ + "        <pitch>" + newline$
        xml$ = xml$ + "          <step>" + pitch_step$ + "</step>" + newline$
        xml$ = xml$ + "          <octave>" + string$(pitch_octave) + "</octave>" + newline$
        xml$ = xml$ + "        </pitch>" + newline$
    xml$ = xml$ + "        <duration>" + string$(note_duration[n]) + "</duration>" + newline$
    xml$ = xml$ + "        <type>" + note_value_type$[n] + "</type>" + newline$
    IF note_dotted[n]:
        xml$ = xml$ + "        <dot/>" + newline$

    # Dynamics (if note and extract_dynamics)
    IF type$ = "note" AND extract_dynamics:
        # Map normalized dynamics to musical markings
        dyn_norm = note_dynamics[n]
        IF dyn_norm ≥ 0.9:
            marking$ = "ff"
        ELSIF dyn_norm ≥ 0.75:
            marking$ = "f"
        ELSIF dyn_norm ≥ 0.55:
            marking$ = "mf"
        ELSIF dyn_norm ≥ 0.40:
            marking$ = "mp"
        ELSIF dyn_norm ≥ 0.25:
            marking$ = "p"
        ELSE:
            marking$ = "pp"
        IF marking$ ≠ "":
            xml$ = xml$ + "        <notations>" + newline$
            xml$ = xml$ + "          <dynamics><" + marking$ + "/></dynamics>" + newline$
            xml$ = xml$ + "        </notations>" + newline$
    xml$ = xml$ + "      </note>" + newline$

# ===== FOOTER =====
xml$ = xml$ + "    </measure>" + newline$
xml$ = xml$ + "  </part>" + newline$
xml$ = xml$ + "</score-partwise>"

# ===== MUSICXML ELEMENTS EXPLANATION =====
# <divisions>: number of divisions per quarter note
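The string building above can be mirrored compactly outside Praat. This hypothetical Python helper emits one note or rest element with the same fields (pitch, duration, type, dot, dynamics); the header, attributes, and footer from the listing above would wrap a sequence of these:

def note_xml(duration, note_type, dotted=False, rest=False,
             step="C", octave=4, dynamic=None):
    """One MusicXML <note> element, following the structure described above."""
    lines = ["    <note>"]
    if rest:
        lines.append("      <rest/>")
    else:
        lines.append(f"      <pitch><step>{step}</step><octave>{octave}</octave></pitch>")
    lines.append(f"      <duration>{duration}</duration>")
    lines.append(f"      <type>{note_type}</type>")
    if dotted:
        lines.append("      <dot/>")
    if dynamic:
        lines.append(f"      <notations><dynamics><{dynamic}/></dynamics></notations>")
    lines.append("    </note>")
    return "\n".join(lines)

# A dotted quarter at forte, then an eighth rest (divisions = 8)
print(note_xml(12, "quarter", dotted=True, dynamic="f"))
print(note_xml(4, "eighth", rest=True))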

Dynamics Mapping

🎚️ Intensity to Musical Dynamics

Input: Normalized intensity values (0-1)

Mapping: 6-level dynamic scale: pp, p, mp, mf, f, ff

Smoothing: Dynamics_smoothing parameter reduces fluctuations

Output: Standard musical dynamics markings

Relative scaling: Based on file's intensity range, not absolute dB
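In code, the mapping reduces to smoothing, normalizing to the file's own intensity range, and bucketing. A Python sketch using the thresholds from the MusicXML generation section (the moving-average smoothing is an assumption; the script only exposes a smoothing time constant):

import numpy as np

def dynamics_marking(norm):
    """Map a 0-1 normalized intensity to a marking (thresholds from the XML generator)."""
    for threshold, marking in [(0.9, "ff"), (0.75, "f"), (0.55, "mf"),
                               (0.40, "mp"), (0.25, "p")]:
        if norm >= threshold:
            return marking
    return "pp"

def map_dynamics(intensity_db, frame_dur=0.01, smoothing=0.1):
    smooth_n = max(1, int(round(smoothing / frame_dur)))
    kernel = np.ones(smooth_n) / smooth_n
    smoothed = np.convolve(intensity_db, kernel, mode="same")    # moving average
    lo, hi = smoothed.min(), smoothed.max()
    norm = (smoothed - lo) / max(hi - lo, 1e-9)                  # relative to file range
    return [dynamics_marking(v) for v in norm]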

Rhythm TextGrid Creation

📊 Multi-Tier Rhythm Visualization

Tier 1: Note/Rest values with durations

Tier 2: Dynamics markings (pp-ff)

Tier 3: Beat positions within measure

Tier 4: Measure numbers

Purpose: Visual verification and analysis of rhythm mapping

Parameters & Specifications

Tempo & Quantization

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| Auto_detect_tempo | boolean | 1 (yes) | 0/1 | Automatically estimate tempo from speech rhythm |
| Manual_tempo_(BPM) | positive | 120 | 30 - 300 | Manual tempo setting (if auto-detect is off) |
| Pulse_unit | option | Quarter note | 5 options | Musical note value that maps to the dominant IOI |
| Divisions_per_quarter | positive | 8 | 1 - 32 | Resolution (higher = finer quantization) |
| Allow_dotted_notes | boolean | 1 (yes) | 0/1 | Allow dotted note values for better rhythm representation |

Time Signature

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| Auto_detect_meter | boolean | 1 (yes) | 0/1 | Automatically estimate the time signature from accent patterns |
| Time_signature_beats | positive | 4 | 1 - 12 | Beats per measure (if auto-detect is off) |
| Time_signature_type | positive | 4 | 1 - 32 | Beat unit (if auto-detect is off) |
| Compound_meter | boolean | 0 (no) | 0/1 | Use compound meter (6/8, 9/8, 12/8) (if auto-detect is off) |

Onset Detection

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| Detection_method | option | Intensity only | 3 options | Algorithm for detecting syllable/musical onsets |
| Min_onset_separation_(s) | positive | 0.08 | 0.02 - 0.30 | Minimum time between detected onsets |
| Prominence_threshold_(dB) | positive | 2.5 | 1.0 - 10.0 | Minimum intensity prominence for onset detection |

Silence Detection

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| Min_silent_duration_(s) | positive | 0.10 | 0.02 - 0.50 | Minimum duration to classify as silence |
| Min_sounding_duration_(s) | positive | 0.08 | 0.02 - 0.30 | Minimum duration to classify as speech |
| Silence_threshold_(dB) | integer | -25 | -60 - 0 | Intensity threshold for silence detection |

Dynamics

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| Extract_dynamics | boolean | 1 (yes) | 0/1 | Extract and include dynamic markings in the notation |
| Dynamics_smoothing_(s) | positive | 0.1 | 0.02 - 0.50 | Time constant for intensity smoothing for dynamics |

Output

| Parameter | Type | Default | Description |
|---|---|---|---|
| Output_pitch | option | C4 (middle C) | Pitch for notes in the notation (visual only) |
| Include_comments | boolean | 1 (yes) | Include analysis comments in the MusicXML file |
| Create_TextGrid | boolean | 1 (yes) | Create a rhythm analysis TextGrid for visualization |

Performance Characteristics

| Characteristic | Typical value | Depends on | Notes |
|---|---|---|---|
| Processing time | 5-30 seconds | File length, detection method | Method 2 (multi-feature) is slowest |
| Onset detection rate | 3-10 Hz | Speech rate, prominence threshold | Typical syllable rate |
| Tempo confidence | 0.3-0.9 | Rhythmic regularity | >0.7 = good rhythm detection |
| Quantization error | 10-40 ms | Tempo accuracy, divisions | Lower with higher divisions |
| MusicXML file size | 5-50 KB | Number of notes/rests | Compresses well for sharing |

Note Value Mapping

| Divisions per quarter | Shortest note | Note values available | Typical use |
|---|---|---|---|
| 4 | 16th note | Whole, half, quarter, eighth, 16th | Simple rhythms, educational |
| 8 | 32nd note | Whole, half, quarter, eighth, 16th, 32nd | Standard speech rhythm |
| 12 | 32nd triplet | All standard values + possible triplets | Complex rhythms, music |
| 16 | 64th note | Very fine resolution | Precise speech timing |

Detection Method Comparison

| Method | Speed | Accuracy | Best for | Parameters to adjust |
|---|---|---|---|---|
| Intensity only | Fastest | Good for clear speech | Poetry, recited text, clean recordings | Prominence threshold, min separation |
| Multi-feature | Slowest | Most robust | Music, singing, noisy environments | Same as intensity, plus spectral validation |
| Syllable nuclei | Medium | Best for syllable timing | Language analysis, clear voiced speech | Pitch parameters, voicing threshold |

Applications

Musical Composition

Use case: Generate rhythmic material from spoken word

Speech & Language Analysis

Use case: Study speech rhythm and timing patterns

Educational Tools

Use case: Teach rhythm and timing through speech

Practical Workflow Examples

🎤 Poetry to Rhythm Notation

Goal: Convert recited poetry to musical rhythm

Settings:

  • Detection_method: Intensity only
  • Auto_detect_tempo: Yes
  • Pulse_unit: Quarter note
  • Divisions_per_quarter: 8
  • Allow_dotted_notes: Yes
  • Auto_detect_meter: Yes
  • Min_onset_separation: 0.10s
  • Extract_dynamics: Yes
  • Create_TextGrid: Yes

Result: Musical notation capturing poetic rhythm and dynamics

🎵 Rap/Rhythmic Speech Analysis

Goal: Analyze rap vocals for rhythmic patterns

Settings:

  • Detection_method: Multi-feature
  • Auto_detect_tempo: Yes
  • Pulse_unit: Eighth note (for faster rhythms)
  • Divisions_per_quarter: 12 (for triplet feel)
  • Allow_dotted_notes: Yes
  • Auto_detect_meter: Yes
  • Min_onset_separation: 0.06s (faster detection)
  • Prominence_threshold: 3.0dB (stronger onsets)
  • Extract_dynamics: Yes

Result: Detailed rhythm notation with syncopation and accents

🔬 Language Rhythm Comparison

Goal: Compare rhythmic patterns across languages

Settings:

  • Detection_method: Syllable nuclei
  • Auto_detect_tempo: Yes
  • Pulse_unit: Quarter note
  • Divisions_per_quarter: 8
  • Allow_dotted_notes: No (simpler analysis)
  • Auto_detect_meter: No, set to 4/4
  • Min_onset_separation: 0.08s
  • Extract_dynamics: No (focus on timing only)
  • Create_TextGrid: Yes (for analysis)

Result: Comparable rhythm notation for cross-linguistic study

Advanced Techniques

Post-processing in notation software:
  • Cleaning up: Remove small rests, simplify complex rhythms
  • Adding harmony: Set chord progressions under speech rhythm
  • Orchestration: Assign different rhythms to different instruments
  • Development: Use as motif for musical development
  • Combining: Layer multiple speech rhythms polyrhythmically

MusicXML import preserves all rhythmic information for further musical development

Parameter optimization for different speech types:
  • Slow, clear speech: Min_separation = 0.12s, prominence_threshold = 2.0dB
  • Fast, conversational: Min_separation = 0.06s, prominence_threshold = 3.0dB
  • Musical/sung: Use multi-feature detection, divisions = 12
  • Noisy recordings: Increase silence_threshold, use multi-feature
  • Poetic/recited: Enable dotted notes, auto meter detection

Always check TextGrid output to verify onset detection quality
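These recommendations can be kept as a small preset table. The values below are taken from this section, except the noisy-recording silence threshold: the text only says to raise it, so -20 dB is an illustrative choice.

# Illustrative presets collecting the recommendations above (Python dict form)
PRESETS = {
    "slow_clear_speech":   {"min_onset_separation": 0.12, "prominence_threshold": 2.0},
    "fast_conversational": {"min_onset_separation": 0.06, "prominence_threshold": 3.0},
    "musical_sung":        {"detection_method": "Multi-feature", "divisions_per_quarter": 12},
    "noisy_recording":     {"detection_method": "Multi-feature", "silence_threshold": -20},
    "poetic_recited":      {"allow_dotted_notes": True, "auto_detect_meter": True},
}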

Troubleshooting Common Issues

Problem: Too many/few onsets detected
Cause: Prominence_threshold too low/high, or min_separation wrong
Solution: Adjust prominence_threshold (2-4dB typical), check TextGrid visualization
Problem: Tempo detection unrealistic (too fast/slow)
Cause: Pulse_unit setting wrong for speech rhythm
Solution: Try different pulse_unit (quarter or eighth note usually best)
Problem: Quantization error high (>50ms average)
Cause: Tempo doesn't match speech rhythm, or divisions too low
Solution: Adjust tempo manually, increase divisions_per_quarter
Problem: MusicXML won't open in notation software
Cause: XML structure issue or software compatibility
Solution: Copy the entire XML output (from <?xml to the closing </score-partwise> tag), save it as .xml or .musicxml, and try MuseScore first
Problem: Dynamics not showing in notation
Cause: Extract_dynamics disabled, or intensity range too small
Solution: Enable extract_dynamics, check recording has dynamic variation

Integration with Other Tools

Complete speech-to-music workflow:
  1. Pre-processing: Use FIR filter bank to clean speech recording
  2. Rhythm extraction: This script to get MusicXML rhythm
  3. Pitch extraction: Use Praat's pitch analysis for melodic contour
  4. Notation software: Import MusicXML, add harmony, orchestrate
  5. Performance: Use MIDI export or live performance
  6. Recording: Record instrumental/vocal performance based on notation

This creates a complete pipeline from speech to performed music