Convert speech rhythm and dynamics into MusicXML notation with automatic tempo detection, meter estimation, and quantized rhythmic notation for music composition and analysis.
This script converts speech rhythm into MusicXML notation by detecting syllable onsets, estimating tempo and meter, quantizing to musical note values, and generating standard MusicXML files that can be opened in notation software like MuseScore, Finale, or Sibelius. The system performs sophisticated acoustic analysis to extract rhythmic patterns from speech, converts them to quantized musical rhythms, preserves dynamic information as musical dynamics, and creates both MusicXML notation and detailed rhythm TextGrids for analysis. This enables composers, researchers, and musicians to transcribe speech rhythms into musical notation automatically.
Automatic Tempo Detection — Statistical analysis of inter-onset intervals
Meter Estimation — Beat and measure structure from accent patterns
Intelligent Quantization — Converts continuous timing to note values with dotted notes
Dynamics Extraction — Maps intensity contours to musical dynamics (pp-ff)
MusicXML Export — Standard format compatible with all notation software
Rhythm TextGrid — Multi-tier visualization of rhythm analysis
Why Convert Speech to Musical Rhythm?
Speech contains rich rhythmic patterns that can inspire musical composition and provide insights into prosody and timing. This conversion enables:
(1) Musical composition: use speech rhythms as compositional material.
(2) Prosody analysis: study speech timing through musical notation.
(3) Educational tools: visualize speech rhythm for language learning.
(4) Creative applications: generate rhythmic patterns from spoken word.
The system bridges speech analysis and music notation by:
(1) Detecting acoustic events: syllable onsets and intensity peaks.
(2) Statistical analysis: finding the tempo through histogram analysis.
(3) Musical mapping: converting continuous time to discrete note values.
(4) Preserving expressivity: maintaining dynamics and timing variations.
The result is musically meaningful notation that captures the essence of speech rhythm.
Technical Implementation:
(1) Silence detection: creates a TextGrid with silent/sounding intervals using an intensity threshold.
(2) Onset detection: three methods with peak refinement and prominence filtering.
(3) Tempo estimation: histogram analysis of inter-onset intervals with confidence scoring.
(4) Meter estimation: analyzes accent patterns to determine the time signature.
(5) Quantization: rounds onsets to the nearest musical grid position with note value optimization (see the sketch below).
(6) Dynamics mapping: converts intensity to musical dynamics markings.
(7) MusicXML generation: builds a valid MusicXML structure with measures, notes, rests, and metadata.
(8) TextGrid creation: multi-tier visualization of rhythm, dynamics, beats, and measures.
The complete pipeline runs within Praat using built-in analysis capabilities.
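Quantization (step 5) has no dedicated pseudocode section below, so here is a minimal Python sketch of the general idea: round each onset to the division grid implied by the tempo and Divisions_per_quarter, then snap inter-onset gaps to the nearest expressible note value. All names here (quantize_onsets, gap_to_note, NOTE_VALUES) are illustrative, not the script's own variables.

def quantize_onsets(onsets, tempo_bpm=120, divisions=8):
    quarter = 60.0 / tempo_bpm                # quarter-note duration in seconds
    grid = quarter / divisions                # smallest grid step in seconds
    return [round(t / grid) for t in onsets]  # onset positions in divisions

# Values expressible at divisions = 8 (32nd-note resolution);
# dotted values are 1.5x their plain counterparts.
NOTE_VALUES = {1: "32nd", 2: "16th", 3: "dotted 16th", 4: "eighth",
               6: "dotted eighth", 8: "quarter", 12: "dotted quarter",
               16: "half", 24: "dotted half", 32: "whole"}

def gap_to_note(gap_divs):
    # Snap a gap (in divisions) to the nearest expressible note value;
    # returns the note name and the residual quantization error in divisions.
    best = min(NOTE_VALUES, key=lambda v: abs(v - gap_divs))
    return NOTE_VALUES[best], abs(best - gap_divs)

positions = quantize_onsets([0.02, 0.27, 0.51, 1.02])
print(positions)                                 # [0, 4, 8, 16]
print(gap_to_note(positions[1] - positions[0]))  # ('eighth', 0)

Because dotted values fill the gaps between plain note values, enabling Allow_dotted_notes can reduce the residual quantization error.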
Quick start
In Praat, select exactly one Sound object (mono recommended).
Run script… → speech_to_musicxml_rhythm_v2.7.praat.
Enable Auto_detect_tempo (recommended for most cases).
Set Pulse_unit: "Quarter note" for typical speech rhythm.
Set Divisions_per_quarter: 8 for 32nd note resolution.
Enable Allow_dotted_notes for natural rhythm variations.
Enable Auto_detect_meter to automatically find time signature.
Choose Detection_method: "Intensity only" for speech, "Multi-feature" for music.
Set Min_onset_separation: 0.08s (80ms) for syllable-level detection.
Adjust Prominence_threshold: 2.5dB for clear onsets.
Set silence parameters: Min_silent_duration 0.10s, Silence_threshold -25dB.
Enable Extract_dynamics to preserve loudness variations.
Choose Output_pitch: "C4 (middle C)" for standard notation.
Enable Create_TextGrid for rhythm visualization.
Click OK — analysis, detection, quantization, and XML generation will run.
Copy MusicXML output from Info window and save as .xml file.
Open .xml file in MuseScore, Finale, Sibelius, or other notation software.
Quick tip: Start with clean speech recordings (1-10 seconds) for best results. For poetry or recited text, use "Intensity only" detection with Pulse_unit = "Quarter note". For musical or sung phrases, use "Multi-feature" detection. Auto tempo detection works best when speech has clear rhythmic patterns. Processing stages: (1) silence/speech segmentation (2-5 seconds), (2) onset detection (5-15 seconds), (3) tempo/meter estimation, (4) quantization, (5) MusicXML generation. For visual feedback, always enable Create_TextGrid to see detected onsets and rhythm mapping. The MusicXML output appears in Praat's Info window — copy everything from the opening <?xml declaration to the closing </score-partwise> tag and save it as a .xml file. Recommended software: MuseScore (free) for viewing and editing the notation.
Important:
Clean input required: background noise affects onset detection.
Minimum length: at least 2 seconds of speech with clear rhythm are needed.
Onset separation: too low may detect micro-variations; too high may miss syllables.
Tempo detection: works best with regular speech rhythm; irregular speech may produce an uncertain tempo.
Quantization error: speech timing doesn't exactly match the musical grid, so some error is inevitable.
Meter detection: based on accent patterns; may not match linguistic meter.
MusicXML compatibility: generated XML follows the MusicXML 3.1 standard.
Pitch setting: only affects visual notation, not audio.
Silence threshold: adjust based on the recording's noise floor.
Dynamics mapping: relative to the within-file intensity range, not absolute dB (see the sketch below).
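As a rough illustration of the last point, a relative dynamics mapping might look like the Python sketch below. The script's exact breakpoints are not documented here, so the six equal buckets are an assumption, and map_dynamics is an illustrative name.

DYNAMICS = ["pp", "p", "mp", "mf", "f", "ff"]

def map_dynamics(onset_intensities_db):
    # Scale each onset's intensity within the file's own min..max range,
    # then bucket into six equal bands (the equal split is an assumption).
    lo, hi = min(onset_intensities_db), max(onset_intensities_db)
    span = max(hi - lo, 1e-6)             # guard against a flat recording
    marks = []
    for db in onset_intensities_db:
        rel = (db - lo) / span            # 0..1, relative to this file only
        idx = min(int(rel * len(DYNAMICS)), len(DYNAMICS) - 1)
        marks.append(DYNAMICS[idx])
    return marks

print(map_dynamics([58.0, 63.5, 70.2, 74.9]))  # ['pp', 'p', 'f', 'ff']

Because the scale is relative, a quiet recording still spans pp through ff; comparing absolute loudness across files would require external calibration.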
# ONSET DETECTION IMPLEMENTATION DETAILS
# === METHOD 1: INTENSITY ONLY ===
# 1. Get intensity object (100Hz, 0.01s steps)
To Intensity: 100, 0.01, "yes"
# 2. Detect silent/sounding intervals
To TextGrid (silences): 100, 0, silence_threshold, min_silent_dur, min_sounding_dur, "silent", "sounding"
# 3. Scan each sounding interval
FOR each sounding interval:
t = interval_start + adaptive_window
adaptive_window = max(0.02, min(0.05, interval_dur/10))
WHILE t < interval_end - adaptive_window:
# Get intensity at three points for parabolic fit
int_val = Get value at time: t, "Cubic"
int_m1 = Get value at time: t - adaptive_window/2, "Cubic"
int_p1 = Get value at time: t + adaptive_window/2, "Cubic"
# Check for local maximum
IF int_val > int_m1 AND int_val > int_p1:
# Parabolic refinement
α = int_m1, β = int_val, γ = int_p1
IF (α - 2β + γ) ≠ 0:
p = 0.5 × (α - γ) / (α - 2β + γ)
refined_t = t + p × (adaptive_window/2)
refined_int = β - 0.25 × (α - γ) × p
ELSE:
refined_t = t
refined_int = int_val
# Calculate prominence (local median)
window_start = max(t - 0.15, interval_start)
window_end = min(t + 0.15, interval_end)
sample_vals[1..15] = intensity at 15 evenly spaced points
# Sort for median
FOR i FROM 1 TO 14:
FOR j FROM i+1 TO 15:
IF sample_vals[i] > sample_vals[j]:
SWAP sample_vals[i], sample_vals[j]
local_median = sample_vals[8] # 8th of 15 = median
prominence = refined_int - local_median
# Apply threshold and separation constraints
IF prominence ≥ prominence_threshold AND
refined_t - last_onset ≥ min_separation:
ADD ONSET: refined_t, refined_int
last_onset = refined_t
t = t + 0.005 # 5ms step
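To sanity-check the parabolic refinement above outside Praat, the same formulas can be run as a tiny self-contained Python function (the names are illustrative):

def refine_peak(alpha, beta, gamma, t, half_win):
    # Same formulas as the α/β/γ step above: fit a parabola through three
    # intensity samples and solve for its vertex.
    denom = alpha - 2 * beta + gamma
    if denom == 0:                       # degenerate (flat) case: keep raw peak
        return t, beta
    p = 0.5 * (alpha - gamma) / denom    # vertex offset, in half-window units
    return t + p * half_win, beta - 0.25 * (alpha - gamma) * p

# Peak pulled toward the right neighbour (gamma > alpha):
print(refine_peak(60.0, 66.0, 64.0, t=1.000, half_win=0.02))  # (1.005, 66.25)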
# === METHOD 2: MULTI-FEATURE ===
# Same as Method 1 PLUS spectral flux validation
# Additional step after intensity peak detection:
selectObject: spectrogram
slice_before = Get power at: refined_t - 0.02, 1000
slice_after = Get power at: refined_t + 0.01, 1000
IF slice_before ≠ undefined AND slice_after ≠ undefined:
flux = slice_after - slice_before
spectral_onset = (flux > 0) # True if spectral increase
# Both intensity peak AND spectral increase required
IF prominence ≥ prominence_threshold AND
refined_t - last_onset ≥ min_separation AND
spectral_onset:
ADD ONSET: refined_t, refined_int
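A numpy stand-in for this spectral-flux check is sketched below. It approximates Praat's "Get power at:" by taking the power of the FFT bin nearest the target frequency; that substitution, and the window length, are assumptions for illustration, not the script's exact computation.

import numpy as np

def spectral_flux_rises(signal, sr, t, freq_hz=1000, win=0.02):
    # signal: 1-D numpy array of samples; sr: sampling rate in Hz
    def band_power(center):
        i0 = max(int((center - win / 2) * sr), 0)
        frame = signal[i0: i0 + int(win * sr)]
        if frame.size == 0:
            return None                   # mirrors Praat's 'undefined'
        spec = np.abs(np.fft.rfft(frame * np.hanning(frame.size))) ** 2
        freqs = np.fft.rfftfreq(frame.size, d=1.0 / sr)
        return spec[np.argmin(np.abs(freqs - freq_hz))]
    before = band_power(t - 0.02)         # same offsets as the pseudocode
    after = band_power(t + 0.01)
    if before is None or after is None:
        return False
    return after > before                 # True if spectral energy increased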
# === METHOD 3: SYLLABLE NUCLEI ===
# 1. Get pitch object for voiced/unvoiced detection
To Pitch: 0, 75, 600
# 2. In each sounding interval:
selectObject: intensity
t = interval_start + 0.03
WHILE t < interval_end - 0.03:
# Find local maximum in intensity
local_max_t = Get time of maximum: t - 0.04, t + 0.04, "Parabolic"
local_max_int = Get maximum: t - 0.04, t + 0.04, "Parabolic"
IF local_max_t ≠ undefined AND abs(local_max_t - t) < 0.01:
# Check for voicing at this time
selectObject: pitch_obj
f0 = Get value at time: local_max_t, "Hertz", "Linear"
IF f0 ≠ undefined: # Voiced = syllable nucleus
IF local_max_t - last_onset ≥ min_separation:
ADD ONSET: local_max_t, local_max_int
last_onset = local_max_t
t = t + 0.02 # 20ms step
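Since the praat-parselmouth Python package exposes the same commands used above, Method 3's voicing check can be sketched nearly one-to-one outside the Praat GUI. "speech.wav" is a placeholder path and is_syllable_nucleus an illustrative helper, not part of the script:

import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("speech.wav")                 # placeholder file
intensity = call(snd, "To Intensity", 100, 0.01, "yes")
pitch = call(snd, "To Pitch", 0.0, 75.0, 600.0)

def is_syllable_nucleus(t):
    # Local intensity maximum near t, refined parabolically as in the script
    peak_t = call(intensity, "Get time of maximum", t - 0.04, t + 0.04, "Parabolic")
    # Praat returns undefined (NaN) for unvoiced frames
    f0 = call(pitch, "Get value at time", peak_t, "Hertz", "Linear")
    return f0 == f0   # NaN-safe check: voiced local maximum = likely nucleus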
# === PARAMETER EFFECTS ===
# min_separation: Higher = fewer onsets, more separation
# prominence_threshold: Higher = only strong onsets
# silence_threshold: Higher = more detected as silence
# adaptive_window: Automatically adjusts to interval length
# === PERFORMANCE CHARACTERISTICS ===
# Method 1: Fastest, good for clean speech
# Method 2: Slower (spectrogram), more robust
# Method 3: Medium speed, syllable-focused
Silence Detection & Segmentation
🔇 Smart Silence/Speech Segmentation
Purpose: Separate speech from silence/pauses
Method: Intensity threshold with duration constraints
Parameters: Silence threshold (dB), min silent/sounding durations
Output: TextGrid with "silent" and "sounding" intervals
Importance: Prevents false onsets in silent regions
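For orientation, a minimal numpy sketch of this style of segmentation follows. The script itself delegates the real work to Praat's "To TextGrid (silences)" command, so details such as how too-short intervals are merged are assumptions here, and segment_silences is an illustrative name.

import numpy as np

def segment_silences(intensity_db, times, silence_threshold=-25.0,
                     min_silent=0.10, min_sounding=0.10):
    # Threshold is relative to the file's peak intensity, as in Praat
    sounding = intensity_db > intensity_db.max() + silence_threshold
    # Collect raw runs of equal label
    intervals, start = [], 0
    for i in range(1, len(sounding) + 1):
        if i == len(sounding) or sounding[i] != sounding[start]:
            intervals.append((times[start], times[i - 1], bool(sounding[start])))
            start = i
    # Absorb runs shorter than the duration constraints into their predecessor
    kept = []
    for t0, t1, is_sound in intervals:
        min_dur = min_sounding if is_sound else min_silent
        if t1 - t0 < min_dur and kept:
            kept[-1] = (kept[-1][0], t1, kept[-1][2])
        else:
            kept.append((t0, t1, is_sound))
    return kept  # list of (start, end, is_sounding)

t = np.arange(0, 2, 0.01)
db = np.where((t > 0.5) & (t < 1.5), 70.0, 30.0)   # loud mid-section
print(segment_silences(db, t))   # silent / sounding / silent intervals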
Tempo & Meter Analysis
Automatic Tempo Detection
🎼 Statistical Tempo Estimation
Input: Inter-onset intervals (IOIs) from detected onsets
Method: Histogram analysis with peak detection
Output: Tempo (BPM), confidence score, dominant IOI
Pulse mapping: Maps IOI to musical note value (whole to 16th)
Validation: Checks for reasonable tempo range (30-300 BPM)
Tempo Detection Algorithm
# TEMPO DETECTION ALGORITHM
# 1. Calculate inter-onset intervals (IOIs)
FOR i FROM 1 TO onset_count - 1:
IOI[i] = onset_time[i + 1] - onset_time[i]
# 2. Determine histogram range
ioi_min = min(IOI[1..n_iois])
ioi_max = max(IOI[1..n_iois])
hist_min = max(0.05, ioi_min × 0.8) # Minimum 50ms
hist_max = min(2.0, ioi_max × 1.2) # Maximum 2 seconds
n_bins = 50
bin_width = (hist_max - hist_min) / n_bins
# 3. Build histogram
FOR b FROM 1 TO n_bins:
hist_count[b] = 0
hist_center[b] = hist_min + (b - 0.5) × bin_width
FOR i FROM 1 TO n_iois:
IF IOI[i] ≥ hist_min AND IOI[i] < hist_max:
bin_idx = floor((IOI[i] - hist_min) / bin_width) + 1
IF bin_idx ≥ 1 AND bin_idx ≤ n_bins:
hist_count[bin_idx] = hist_count[bin_idx] + 1
# 4. Find peak region
peak_bin = 1
peak_count = hist_count[1]
FOR b FROM 2 TO n_bins:
IF hist_count[b] > peak_count:
peak_count = hist_count[b]
peak_bin = b
# 5. Weighted average around peak (5-bin window)
weight_sum = 0
weighted_ioi = 0
FOR b FROM max(1, peak_bin - 2) TO min(n_bins, peak_bin + 2):
weight_sum = weight_sum + hist_count[b]
weighted_ioi = weighted_ioi + hist_count[b] × hist_center[b]
IF weight_sum > 0:
dominant_ioi = weighted_ioi / weight_sum
ELSE:
dominant_ioi = hist_center[peak_bin]
# 6. Confidence calculation
total_count = 0
FOR b FROM 1 TO n_bins:
total_count = total_count + hist_count[b]
peak_region_count = 0
FOR b FROM max(1, peak_bin - 2) TO min(n_bins, peak_bin + 2):
peak_region_count = peak_region_count + hist_count[b]
confidence = peak_region_count / (total_count + 0.001)
# 7. Map to musical pulse based on pulse_unit setting
CASE pulse_unit OF:
1: # Whole note
quarter_dur_est = dominant_ioi / 4
pulse_name$ = "whole note"
2: # Half note
quarter_dur_est = dominant_ioi / 2
pulse_name$ = "half note"
3: # Quarter note
quarter_dur_est = dominant_ioi
pulse_name$ = "quarter note"
4: # Eighth note
quarter_dur_est = dominant_ioi × 2
pulse_name$ = "eighth note"
5: # 16th note
quarter_dur_est = dominant_ioi × 4
pulse_name$ = "16th note"
ENDCASE
# 8. Calculate tempo
raw_tempo = 60.0 / quarter_dur_est
# 9. Validate and adjust tempo range
# Speech typically 60-180 BPM; music 40-240 BPM
IF raw_tempo < 40: # Too slow, maybe pulse is half note
raw_tempo = raw_tempo × 2
ELSIF raw_tempo > 240: # Too fast, maybe pulse is quarter note
raw_tempo = raw_tempo / 2
tempo = round(raw_tempo)
tempo = max(30, min(300, tempo)) # Clamp to reasonable range
# 10. Output statistics
# dominant_ioi: Most common interval between onsets (seconds)
# tempo: Estimated beats per minute
# confidence: 0-1, higher = more consistent rhythm
# pulse_name$: Musical note value that maps to dominant_ioi
# === INTERPRETATION EXAMPLES ===
# Dominant IOI = 0.5s, pulse_unit = quarter note:
# quarter_dur = 0.5s, tempo = 60/0.5 = 120 BPM
# Dominant IOI = 0.25s, pulse_unit = eighth note:
# quarter_dur = 0.25×2 = 0.5s, tempo = 60/0.5 = 120 BPM
# Dominant IOI = 0.333s, pulse_unit = quarter note:
# quarter_dur = 0.333s, tempo = 60/0.333 ≈ 180 BPM
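Steps 1-8 condense naturally into a few lines of numpy; this sketch mirrors the histogram, peak-region averaging, confidence, and tempo-clamping logic above. pulse_factor is an illustrative stand-in for the pulse_unit mapping (1.0 for a quarter-note pulse, 2.0 for eighth, 0.25 for whole).

import numpy as np

def estimate_tempo(onset_times, pulse_factor=1.0, n_bins=50):
    iois = np.diff(np.asarray(onset_times))         # inter-onset intervals
    lo = max(0.05, iois.min() * 0.8)                # minimum 50 ms
    hi = min(2.0, iois.max() * 1.2)                 # maximum 2 s
    counts, edges = np.histogram(iois, bins=n_bins, range=(lo, hi))
    centers = (edges[:-1] + edges[1:]) / 2
    peak = counts.argmax()
    sl = slice(max(0, peak - 2), min(n_bins, peak + 3))   # 5-bin peak region
    dominant_ioi = np.average(centers[sl], weights=counts[sl] + 1e-9)
    confidence = counts[sl].sum() / (counts.sum() + 0.001)
    raw = 60.0 / (dominant_ioi * pulse_factor)      # quarter-note tempo
    if raw < 40:                                    # pulse probably a half note
        raw *= 2
    elif raw > 240:                                 # pulse probably too short
        raw /= 2
    return int(np.clip(round(raw), 30, 300)), confidence

onsets = np.arange(0, 5, 0.5)                # perfectly regular 0.5 s IOIs
print(estimate_tempo(onsets))                # (120, ≈1.0)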
Meter Estimation
🎵 Time Signature Detection
Principle: Analyze accent patterns for metrical structure
Method: Intensity-based accent detection and interval analysis
Output: Beats per measure, beat type, compound meter flag
Common results: 2/4, 3/4, 4/4, 6/8 based on speech patterns
Fallback: Defaults to 4/4 if pattern unclear
Meter Detection Algorithm
# METER ESTIMATION ALGORITHM
# 1. Find accents (intensity peaks above mean)
int_sum = 0
FOR i FROM 1 TO onset_count:
int_sum = int_sum + onset_intensity[i]
int_mean = int_sum / onset_count
# 2. Detect accent positions
accent_count = 0
last_accent = 0
FOR i FROM 1 TO onset_count:
IF onset_intensity[i] > int_mean × 1.1: # 10% above mean
IF last_accent > 0:
accent_count = accent_count + 1
accent_interval[accent_count] = i - last_accent
ENDIF
last_accent = i
# 3. Analyze accent intervals
# Count occurrences of common metrical patterns
count_2 = 0 # 2-beat patterns (2/4, 2/2)
count_3 = 0 # 3-beat patterns (3/4)
count_4 = 0 # 4-beat patterns (4/4)
count_6 = 0 # 6-beat patterns (6/8 compound)
FOR i FROM 1 TO accent_count:
CASE accent_interval[i] OF:
2: count_2 = count_2 + 1 # Accents every 2 onsets
3: count_3 = count_3 + 1 # Accents every 3 onsets
4: count_4 = count_4 + 1 # Accents every 4 onsets
6: count_6 = count_6 + 1 # Accents every 6 onsets (compound)
ENDCASE
# 4. Determine meter based on strongest pattern
IF accent_count ≥ 3: # Need enough data
IF count_6 > count_4 AND count_6 > count_3 AND count_6 > count_2:
beats = 6
beat_type = 8
compound = 1 # Compound meter (6/8)
ELSIF count_3 > count_4 AND count_3 > count_2:
beats = 3
beat_type = 4
compound = 0 # Simple triple (3/4)
ELSIF count_2 > count_4:
beats = 2
beat_type = 4
compound = 0 # Simple duple (2/4)
ELSE:
beats = 4
beat_type = 4
compound = 0 # Default to 4/4
ELSE:
# Not enough accents, use defaults
beats = 4
beat_type = 4
compound = 0
# 5. Calculate measure structure
IF compound:
# Compound meter: beat unit = dotted quarter
# Example: 6/8 = 2 beats of dotted quarter
measure_dur = (beats / 3) × (beat_dur × 1.5)
divs_per_measure = beats × (divisions / 2)
ELSE:
# Simple meter: beat unit = quarter
measure_dur = beat_dur × beats
divs_per_measure = beats × divisions
# === METER INTERPRETATION EXAMPLES ===
# Accent pattern: X . . X . . (every 3 onsets)
# → count_3 highest → 3/4 time
# Accent pattern: X . X . (every 2 onsets)
# → count_2 highest → 2/4 time
# Accent pattern: X . . . X . . . (every 4 onsets)
# → count_4 highest → 4/4 time
# Accent pattern: X . . . . . X . . . . . (every 6 onsets)
# → count_6 highest → 6/8 time (compound duple)
# === COMPOUND METER DETECTION ===
# Compound meter (6/8, 9/8, 12/8) has accents grouping in 3s
# Accent interval of 6 means accents every 6 onsets
# This suggests 2 groups of 3 (6/8) or 3 groups of 3 (9/8)
# === LIMITATIONS ===
# Requires clear accent patterns
# May not match linguistic meter (poetic meter)
# Works best with rhythmic, accented speech
# Music with syncopation may confuse detection
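The accent-interval vote condenses to a short Python function; this sketch mirrors the logic above and returns (beats, beat_type, compound). The 10% accent margin and the ≥3-interval requirement are taken directly from the pseudocode.

from collections import Counter

def estimate_meter(onset_intensities):
    # Accents = onsets at least 10% louder than the mean onset intensity
    mean = sum(onset_intensities) / len(onset_intensities)
    accents = [i for i, v in enumerate(onset_intensities) if v > mean * 1.1]
    intervals = [b - a for a, b in zip(accents, accents[1:])]
    if len(intervals) < 3:
        return (4, 4, False)              # not enough data: default to 4/4
    votes = Counter(iv for iv in intervals if iv in (2, 3, 4, 6))
    if votes[6] > max(votes[2], votes[3], votes[4]):
        return (6, 8, True)               # compound duple (6/8)
    if votes[3] > max(votes[2], votes[4]):
        return (3, 4, False)              # simple triple (3/4)
    if votes[2] > votes[4]:
        return (2, 4, False)              # simple duple (2/4)
    return (4, 4, False)                  # default 4/4

# Accent on every 3rd onset -> 3/4:
print(estimate_meter([80, 60, 61, 80, 62, 60, 80, 61, 62, 80]))  # (3, 4, False)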
MusicXML Generation
MusicXML Structure
📄 Standard Music Notation Format
Format: MusicXML 3.1 Partwise DTD
Structure: Score-partwise with measures, attributes, notes
Compatibility: MuseScore, Finale, Sibelius, Dorico, etc.
Elements: Work info, identification, part list, measures
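For orientation, a minimal score-partwise file of the kind described here could look like the skeleton below (one 4/4 measure containing a quarter note on C4 plus a rest, with a dynamics direction). The actual generated file will contain more metadata, and the exact element ordering may differ.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE score-partwise PUBLIC "-//Recordare//DTD MusicXML 3.1 Partwise//EN"
    "http://www.musicxml.org/dtds/partwise.dtd">
<score-partwise version="3.1">
  <part-list>
    <score-part id="P1"><part-name>Speech Rhythm</part-name></score-part>
  </part-list>
  <part id="P1">
    <measure number="1">
      <attributes>
        <divisions>8</divisions>               <!-- divisions per quarter -->
        <time><beats>4</beats><beat-type>4</beat-type></time>
        <clef><sign>G</sign><line>2</line></clef>
      </attributes>
      <direction placement="below">
        <direction-type><dynamics><mf/></dynamics></direction-type>
      </direction>
      <note>
        <pitch><step>C</step><octave>4</octave></pitch>
        <duration>8</duration>                 <!-- one quarter note -->
        <type>quarter</type>
      </note>
      <note>
        <rest/>
        <duration>24</duration>                <!-- dotted-half rest fills the bar -->
        <type>half</type>
        <dot/>
      </note>
    </measure>
  </part>
</score-partwise>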
Parameter Recommendations
Musical/sung material: use Multi-feature detection, divisions = 12.
Noisy recordings: increase Silence_threshold and use Multi-feature detection.
Poetic/recited material: enable dotted notes and auto meter detection.
Always check the TextGrid output to verify onset detection quality.
Troubleshooting Common Issues
Problem: too many or too few onsets detected. Cause: Prominence_threshold too low/high, or Min_onset_separation wrong. Solution: adjust Prominence_threshold (2-4 dB is typical) and check the TextGrid visualization.
Problem: tempo detection unrealistic (too fast or too slow). Cause: Pulse_unit setting wrong for the speech rhythm. Solution: try a different Pulse_unit (quarter or eighth note usually works best).
Problem: quantization error high (>50 ms average). Cause: tempo doesn't match the speech rhythm, or divisions too low. Solution: adjust the tempo manually or increase Divisions_per_quarter.
Problem: MusicXML won't open in notation software. Cause: XML structure issue or software compatibility. Solution: make sure you copied the entire XML output, save it as .xml or .musicxml, and try MuseScore.
Problem: dynamics not showing in notation. Cause: Extract_dynamics disabled, or intensity range too small. Solution: enable Extract_dynamics and check that the recording has dynamic variation.
Integration with Other Tools
Complete speech-to-music workflow:
Pre-processing: Use FIR filter bank to clean speech recording
Rhythm extraction: This script to get MusicXML rhythm
Pitch extraction: Use Praat's pitch analysis for melodic contour