Pitch & Loudness Comparison Tool — User Guide

Dual-track analysis: compares pitch and loudness contours between two audio files frame-by-frame, designed for teacher-student comparison in vocal training and language learning.

Author: Shai Cohen
Version: 2.0 (Fixed)
License: MIT License
Category: Educational & Comparative Analysis

Contents:

What this does
Quick start
Loudness Comparison
Pitch Comparison
Result Interpretation
Educational Applications

What this does

This script implements frame-by-frame audio comparison: a specialized tool for comparing two audio recordings along pitch and loudness dimensions. The tool is designed for educational contexts where a "teacher" recording (model) is compared with a "student" recording (attempt). It performs:

(1) Temporal Alignment: Aligns the two recordings frame by frame, based on the analysis time step.
(2) Loudness Analysis: Computes intensity contours, compares decibel levels frame-by-frame, calculates statistical differences.
(3) Pitch Analysis: Extracts fundamental frequency contours, compares them in semitone units, assesses pitch matching accuracy.
(4) Comprehensive Metrics: Generates multiple difference measures including average difference, RMS difference, maximum difference, dynamic range comparison, and consistency measures.

Key Features:

Three Analysis Modes — Loudness only, Pitch only, or Both
Frame-by-Frame Comparison — Precise temporal alignment and comparison
Educational Focus — Designed for teacher-student vocal training
Multiple Difference Metrics — Average, RMS, max, min differences
Dynamic Range Analysis — Compares overall loudness variation
Consistency Measurement — Standard deviation comparison
Intelligent Handling — Skips unvoiced frames in pitch analysis
Clear Output — Formatted results in Praat Info window

Why frame-by-frame comparison?

Traditional audio comparison: Overall statistics (mean, range).
Frame-by-frame comparison: Temporal evolution matching.

Advantages:
(1) Temporal precision: Identifies where differences occur in time.
(2) Pattern recognition: Detects matching of contours, not just averages.
(3) Educational value: Shows students exactly where they deviate from the model.
(4) Detailed feedback: Multiple metrics provide a comprehensive assessment.
(5) Tolerant of length mismatch: Recordings of different durations are compared over the frames they share. The script does not time-warp the recordings, so the two performances should still have similar timing.

Use cases: Language learning (pronunciation coaching), singing training (pitch matching), speech therapy (prosody training), audio forensics (recording comparison), voice analysis (style imitation).

Technical Implementation: The script processes two sounds in parallel:

(1) Intensity Extraction: Converts each sound to an intensity contour with a 75 Hz pitch floor and the user-defined time step.
(2) Pitch Extraction: Converts each sound to a pitch contour with the user-defined floor/ceiling.
(3) Frame Alignment: Uses the smaller of the two analyses' frame counts.
(4) Difference Calculation: For each aligned frame, computes the absolute difference (dB for loudness, semitones for pitch).
(5) Statistical Aggregation: Computes average, RMS, min, max differences across all frames.
(6) Additional Metrics: Dynamic range difference (max-min), consistency difference (std deviation).

Key insight: The comparison is symmetric but assumes Sound1 as reference/teacher, Sound2 as student/attempt.

Quick start

In Praat, select EXACTLY TWO Sound objects.
Important: First selected = Teacher/Reference, Second selected = Student/Attempt.
Run script… → compare_pitch_and_loudness_FIXED_STDDEV_ARGS.praat.
Choose analysis_type: Loudness_only, Pitch_only, or Both_pitch_and_loudness.
Set time_step (default 0.01s = 100 Hz frame rate).
For pitch analysis: Set pitch_floor_hz and pitch_ceiling_hz.
Click OK — results appear in Info window.
Review comparison metrics for feedback.

Quick tips:

For language learning, use Both_pitch_and_loudness to analyze prosody.
Set time_step = 0.01 for detailed analysis (100 frames/sec).
For singing practice, adjust pitch_floor_hz and pitch_ceiling_hz to match the vocal range (e.g., 80-400 Hz for typical singing).
Ensure recordings are of similar duration: the longer one is truncated to the shorter one's length for comparison.
The script aligns by frame number, not by absolute time, so slight timing differences are okay.
For best results, normalize both recordings to similar loudness before comparison.
Watch the "Voiced frames analyzed" count in the pitch results: low counts indicate poor pitch detection (the pitch range may need adjustment).
Results show both absolute differences and relative patterns.

Important:

EXACTLY TWO SOUNDS REQUIRED: The script will fail with 0, 1, or 3+ sounds.
Order matters: First selected = Teacher, Second selected = Student. Reversing the order reverses the interpretation.
Different durations handled: Comparison uses the minimum number of frames, so the longer recording is truncated to match the shorter.
Pitch analysis requires voiced sounds: Unvoiced frames (fricatives, silence) are skipped in pitch comparison.
Intensity in dB: Values are relative to Praat's reference sound pressure of 2 × 10⁻⁵ Pa (the nominal threshold of hearing, roughly 1e-12 W/m² in air). They are not a measure of perceptual loudness.
Semitone calculation uses a logarithmic formula: 12 × ln(f2/f1)/ln(2).
Time step trade-off: Smaller = more frames = more detailed but slower.
Pitch floor/ceiling must bracket the actual pitch: incorrect settings cause pitch detection failures.
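As a rough illustration of the truncation rule above, the following Python sketch shows how many frames end up being compared when the two analyses differ in length. The function name `compared_span` and the frame counts are made up for this example. Praat's real frame counts depend on the analysis window, so the seconds figure is only approximate.

```python
def compared_span(frames_teacher, frames_student, time_step=0.01):
    """Return (frames compared, approximate seconds compared, frames ignored).

    Hypothetical helper: it only mirrors the "use the minimum number of
    frames" rule described in this guide.
    """
    n = min(frames_teacher, frames_student)
    ignored = max(frames_teacher, frames_student) - n
    return n, n * time_step, ignored


# Teacher analysis has 452 frames, student analysis has 500 frames.
n, seconds, ignored = compared_span(452, 500)
print(n, round(seconds, 2), ignored)
```

In this example the last 48 frames of the longer recording (about half a second at the default time step) play no part in the comparison, which is why recordings of similar duration are recommended.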

Loudness Comparison

🔊 Loudness Analysis Pipeline

Step 1: Convert each sound to intensity contour (75 Hz pitch floor)

Step 2: Extract frame-by-frame dB values

Step 3: Align frames (use minimum number)

Step 4: Compute differences for each aligned frame

Step 5: Calculate aggregate statistics

Step 6: Compute additional loudness metrics

Intensity Extraction Parameters

Parameter	Default	Effect	Recommended Range
time_step	0.01 s	Frame rate (100 Hz)	0.005-0.02 s
Pitch floor (implied)	75 Hz	Sets the intensity analysis window length (lower floor = longer window, smoother contour)	50-100 Hz
Subtract mean	yes	Removes DC component	Always yes

Loudness Difference Metrics

FRAME-BY-FRAME DIFFERENCES:
For each frame i (1 to N):
  db1[i] = intensity of teacher at frame i
  db2[i] = intensity of student at frame i
  if both defined: diff[i] = |db1[i] - db2[i]|

AGGREGATE STATISTICS:
1. AVERAGE DIFFERENCE: avg_diff = (sum of diff[i]) / N
2. RMS DIFFERENCE: rms_diff = sqrt( (sum of diff[i]²) / N )
3. MAXIMUM DIFFERENCE: max_diff = maximum(diff[i])
4. MINIMUM DIFFERENCE: min_diff = minimum(diff[i])

ADDITIONAL METRICS:
5. DYNAMIC RANGE DIFFERENCE:
  dr1 = max(db1) - min(db1)
  dr2 = max(db2) - min(db2)
  dr_diff = |dr1 - dr2|
6. CONSISTENCY DIFFERENCE:
  std1 = standard_deviation(db1)
  std2 = standard_deviation(db2)
  consistency_diff = |std1 - std2|
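The metrics above can be sketched in Python as follows. This is an illustration of the formulas, not the script's own code. It makes three assumptions: undefined frames are represented by `None`, the dynamic range and standard deviation are computed over the compared frames only, and the sample standard deviation is used. The actual Praat script may differ on each point.

```python
import math
import statistics


def loudness_metrics(db1, db2):
    """Compare two intensity contours given as lists of dB values per frame.

    db1 is the teacher contour, db2 the student contour. None marks an
    undefined frame (an assumption of this sketch).
    """
    n = min(len(db1), len(db2))  # align on the shorter contour
    pairs = [(a, b) for a, b in zip(db1[:n], db2[:n])
             if a is not None and b is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two frames defined in both contours")

    diffs = [abs(a - b) for a, b in pairs]
    teacher = [a for a, _ in pairs]
    student = [b for _, b in pairs]
    return {
        "frames": len(diffs),
        "avg_diff": sum(diffs) / len(diffs),
        "rms_diff": math.sqrt(sum(d * d for d in diffs) / len(diffs)),
        "max_diff": max(diffs),
        "min_diff": min(diffs),
        "dr_diff": abs((max(teacher) - min(teacher))
                       - (max(student) - min(student))),
        "consistency_diff": abs(statistics.stdev(teacher)
                                - statistics.stdev(student)),
    }


metrics = loudness_metrics([60, 62, 64, 66], [61, 60, 64, 70, 99])
print(metrics["avg_diff"], metrics["max_diff"], metrics["dr_diff"])
```

The fifth student value (99) is ignored because the teacher contour has only four frames. Note that the RMS difference can never be smaller than the average difference. A large gap between the two points to a few frames with big deviations.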

Interpreting Loudness Results

Metric	Excellent	Good	Needs Work	Interpretation
Average dB difference	< 2 dB	2-5 dB	> 5 dB	Overall loudness matching
RMS dB difference	< 3 dB	3-6 dB	> 6 dB	Consistent matching
Max dB difference	< 10 dB	10-20 dB	> 20 dB	Worst-case deviation
Dynamic range diff	< 3 dB	3-8 dB	> 8 dB	Expression matching
Consistency diff	< 2 dB	2-4 dB	> 4 dB	Steadiness matching

Educational Guidelines for Loudness

🎤 Loudness Training Scenarios

Scenario 1: Volume Matching

Goal: Match overall loudness level
Target: Average difference < 3 dB
Focus: Average and RMS differences

Scenario 2: Dynamic Expression

Goal: Match loudness contours/expression
Target: Dynamic range difference < 5 dB
Focus: Dynamic range and consistency differences

Scenario 3: Loudness Control

Goal: Maintain steady volume
Target: Consistency difference < 2 dB
Focus: Standard deviation comparison

Common Loudness Issues and Solutions

ISSUE: Large average difference (> 5 dB)
CAUSE: Overall volume mismatch
SOLUTION: Normalize student recording to match teacher's RMS level

ISSUE: Large dynamic range difference (> 8 dB)
CAUSE: Different expressive patterns
SOLUTION: Practice crescendo/decrescendo matching

ISSUE: Large max difference (> 20 dB)
CAUSE: Sudden loud spikes or drops
SOLUTION: Identify and smooth extreme variations

ISSUE: Inconsistent matching (RMS difference much higher than average difference)
CAUSE: Erratic volume changes
SOLUTION: Work on steady breath control

Pitch Comparison

🎵 Pitch Analysis Pipeline

Step 1: Convert each sound to pitch contour (user-defined range)

Step 2: Extract frame-by-frame F0 values (Hz)

Step 3: Align frames (use minimum number)

Step 4: Compute semitone differences for voiced frames only

Step 5: Calculate aggregate pitch statistics

Pitch Extraction Parameters

Parameter	Default	Effect	Recommended Range
time_step	0.01 s	Frame rate (100 Hz)	0.005-0.02 s
pitch_floor_hz	75 Hz	Minimum F0 to search	Male: 75-100, Female: 150-200
pitch_ceiling_hz	600 Hz	Maximum F0 to search	Male: 300-400, Female: 500-600

Pitch Difference Calculation

SEMITONE CALCULATION:
For two frequencies f1 and f2 (in Hz):
  semitone_difference = |12 × ln(f2 / f1) / ln(2)|
Where:
• ln = natural logarithm
• 12 semitones per octave
• ln(2) ≈ 0.693147

Examples:
  f1 = 220 Hz (A3), f2 = 440 Hz (A4) → 12.00 semitones (octave)
  f1 = 261.6 Hz (C4), f2 = 293.7 Hz (D4) → 2.00 semitones (whole step)
  f1 = 100 Hz, f2 = 105 Hz → 0.84 semitones (a little under one semitone)

FRAME PROCESSING:
For each frame i:
  f1 = teacher pitch at frame i (Hz)
  f2 = student pitch at frame i (Hz)
  If f1 > 0 AND f2 > 0 (both voiced):
    semitone_diff[i] = |12 × ln(f2/f1)/ln(2)|
    Include in statistics
  Else (unvoiced in either):
    Skip frame

AGGREGATE STATISTICS:
• Average semitone difference = mean(semitone_diff[i])
• Maximum semitone difference = max(semitone_diff[i])
• Voiced frames count = number of frames where both voiced
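The semitone formula and the voiced-frame rule can be checked with a short Python sketch. It assumes that unvoiced frames are encoded as 0 Hz, matching the `f1 > 0 AND f2 > 0` test in the description. The function names are illustrative and do not come from the script.

```python
import math


def semitone_diff(f1, f2):
    """Absolute pitch distance between two frequencies (Hz) in semitones."""
    return abs(12.0 * math.log(f2 / f1) / math.log(2.0))


def pitch_metrics(f0_teacher, f0_student):
    """Compare two F0 contours (Hz per frame); 0 marks an unvoiced frame."""
    n = min(len(f0_teacher), len(f0_student))  # align on the shorter contour
    diffs = [semitone_diff(a, b)
             for a, b in zip(f0_teacher[:n], f0_student[:n])
             if a > 0 and b > 0]  # keep frames voiced in both recordings
    if not diffs:
        return {"frames": n, "voiced": 0, "avg_st": None, "max_st": None}
    return {
        "frames": n,
        "voiced": len(diffs),
        "avg_st": sum(diffs) / len(diffs),
        "max_st": max(diffs),
    }


print(round(semitone_diff(220.0, 440.0), 2))   # octave
print(round(semitone_diff(261.6, 293.7), 2))   # whole step
print(round(semitone_diff(100.0, 105.0), 2))   # a little under one semitone
print(pitch_metrics([220.0, 0.0, 100.0, 200.0], [440.0, 150.0, 105.0]))
```

Because the formula works on the ratio f2/f1, the same interval gives the same semitone distance in any register. This makes the measure suitable for comparing speakers or singers with different vocal ranges.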

Interpreting Pitch Results

Metric	Excellent	Good	Needs Work	Interpretation
Average semitone difference	< 0.5 st	0.5-1.5 st	> 1.5 st	Overall pitch accuracy
Max semitone difference	< 3 st	3-6 st	> 6 st	Worst-case pitch error
Voiced frames analyzed	> 80%	50-80%	< 50%	Pitch detection reliability

Pitch Detection and Voicing

VOICED FRAME DETECTION:
A frame is considered "voiced" if: pitch > 0 Hz (Praat's pitch detector found F0)

Typical voiced percentages:
• Vowels: 95-100% voiced
• Nasals: 80-95% voiced
• Voiced fricatives: 50-80% voiced
• Unvoiced sounds: 0-20% voiced
• Silence: 0% voiced

IMPORTANT NOTES:
• Only voiced frames included in pitch statistics
• Unvoiced frames (fricatives, stops, silence) skipped
• Low voiced percentage may indicate:
  - Incorrect pitch floor/ceiling
  - Very breathy/whispered voice
  - Background noise
  - Technical issues with recording

Educational Guidelines for Pitch

🎶 Pitch Training Scenarios

Scenario 1: Absolute Pitch Matching

Goal: Match exact pitch frequencies
Target: Average difference < 0.5 semitones
Exercises: Sustained note matching, pitch glide following

Scenario 2: Relative Pitch Contours

Goal: Match pitch patterns/melodies
Target: Average difference < 1.5 semitones with good contour
Exercises: Sentence intonation, musical phrase imitation

Scenario 3: Pitch Stability

Goal: Maintain steady pitch
Target: Max difference < 3 semitones during sustained notes
Exercises: Long tone practice, vibrato control

Common Pitch Issues and Solutions

ISSUE: High average semitone difference (> 2 st)
CAUSE: Systematic pitch offset (singing in wrong key)
SOLUTION: Practice with reference tones, use pitch pipe

ISSUE: Large max difference (> 6 st) but low average
CAUSE: Occasional large errors (cracking, slips)
SOLUTION: Identify problematic notes, practice transitions

ISSUE: Low voiced frames percentage (< 50%)
CAUSE: Breathy voice, incorrect pitch range settings
SOLUTION: Adjust pitch floor/ceiling, improve vocal fold closure

ISSUE: Inconsistent matching (high max, medium average)
CAUSE: Erratic pitch control
SOLUTION: Slower practice, focus on steady tones first

Result Interpretation

📊 Comprehensive Assessment Framework

Holistic View: Consider all metrics together, not individually

Context Matters: Different goals require different metric priorities

Progress Tracking: Compare metrics across practice sessions

Threshold Guidelines: Use suggested ranges as starting points

Combined Pitch and Loudness Assessment

Performance Level	Pitch (Avg st diff)	Loudness (Avg dB diff)	Typical Profile
Expert	< 0.5	< 2	Near-perfect matching
Advanced	0.5-1.0	2-3	Good matching, minor variations
Intermediate	1.0-2.0	3-5	Recognizable pattern, noticeable differences
Beginner	2.0-4.0	5-8	Basic contour followed, significant errors
Novice	> 4.0	> 8	Little pattern matching
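The table above can be turned into a simple lookup, sketched here in Python. Two choices in this sketch are assumptions that the guide does not specify: a value exactly on a boundary is assigned to the weaker level, and the overall level is taken as the weaker of the pitch and loudness levels.

```python
LEVELS = ["Expert", "Advanced", "Intermediate", "Beginner", "Novice"]
PITCH_BOUNDS_ST = [0.5, 1.0, 2.0, 4.0]      # average semitone difference
LOUDNESS_BOUNDS_DB = [2.0, 3.0, 5.0, 8.0]   # average dB difference


def level_index(value, bounds):
    """Index into LEVELS of the first band whose upper bound exceeds value."""
    for i, upper in enumerate(bounds):
        if value < upper:
            return i
    return len(bounds)  # beyond the last bound: Novice


def assess(pitch_avg_st, loudness_avg_db):
    """Return (pitch level, loudness level, overall level)."""
    p = level_index(pitch_avg_st, PITCH_BOUNDS_ST)
    l = level_index(loudness_avg_db, LOUDNESS_BOUNDS_DB)
    overall = max(p, l)  # assumption: the weaker dimension decides
    return LEVELS[p], LEVELS[l], LEVELS[overall]


print(assess(1.843, 4.237))
print(assess(0.4, 6.0))
```

A student with excellent pitch but a 6 dB loudness mismatch is rated Beginner overall under this rule. Averaging the two levels would be an equally defensible choice, depending on the training goal.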

Example Output Interpretation

SAMPLE OUTPUT (Both analysis):

=== LOUDNESS COMPARISON ===
Frames compared: 452
Average dB difference: 4.237 dB
RMS dB difference: 5.812 dB
Max dB difference: 18.943 dB
Min dB difference: 0.012 dB
Dynamic range difference: 7.456 dB
Consistency difference (std): 3.128 dB

=== PITCH COMPARISON ===
Voiced frames analyzed: 387 out of 452
Average semitone difference: 1.843
Max semitone difference: 8.237

INTERPRETATION:
1. Loudness: Moderate matching (4.2 dB average diff)
  - Dynamic expression differs (7.5 dB range diff)
  - Some consistency issues (3.1 dB std diff)
2. Pitch: Intermediate level (1.8 st average diff)
  - Some large errors (8.2 st max diff)
  - Good voiced detection (387/452 = 86%)
3. Overall: Student follows general patterns
  - Needs work on pitch accuracy and loudness control
  - Focus on reducing maximum errors

Progress Tracking Over Time

TRACKING TEMPLATE:

Session 1: Pitch avg: 3.2 st | Loudness avg: 6.5 dB
  Issues: Large pitch errors, volume mismatch
Session 2: Pitch avg: 2.1 st | Loudness avg: 4.8 dB
  Progress: Both improved, still significant differences
Session 3: Pitch avg: 1.4 st | Loudness avg: 3.2 dB
  Progress: Good improvement, approaching target ranges
Session 4: Pitch avg: 0.9 st | Loudness avg: 2.4 dB
  Goal achieved: Good matching, minor variations

KEY INSIGHTS:
• Track relative improvement, not just absolute values
• Different metrics may improve at different rates
• Plateaus are normal in skill development
• Some sessions may show regression (normal in learning)
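To track relative improvement rather than absolute values, the session averages can be converted to percent reductions. The Python sketch below uses the figures from the template above. The helper name `percent_reduction` is made up for this example, and it assumes that no session average is exactly zero.

```python
def percent_reduction(values):
    """Percent reduction of each session relative to the previous one.

    Positive numbers mean the difference shrank (improvement); negative
    numbers mean it grew (regression).
    """
    return [100.0 * (prev - cur) / prev
            for prev, cur in zip(values, values[1:])]


pitch_avg_st = [3.2, 2.1, 1.4, 0.9]
loudness_avg_db = [6.5, 4.8, 3.2, 2.4]

print([round(p, 1) for p in percent_reduction(pitch_avg_st)])
print([round(p, 1) for p in percent_reduction(loudness_avg_db)])
```

Looking at the per-session percentages makes it easier to see whether pitch and loudness are improving at different rates, and whether a session is a plateau or a regression.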

Limitations and Considerations

Not a complete assessment: Only measures pitch and loudness. It doesn't assess timbre, articulation, vowel quality, or timing precision.
Frame alignment assumption: Assumes similar timing. Large tempo differences will cause misalignment.
Pitch detection limitations: Praat's pitch detector can fail on breathy, creaky, or noisy voices.
Loudness vs perceived loudness: dB measurements don't account for equal-loudness contours (Fletcher-Munson).
No normalization: Doesn't automatically adjust for different recording levels, so pre-normalization is recommended.
Binary voiced/unvoiced: Simplified model. Real voice has degrees of voicing.
Semitone calculation assumes equal temperament: May not match perception for microtonal music.

Educational Applications

Language Learning and Pronunciation

🗣️ Pronunciation and Intonation Training

Goal: Improve pronunciation accuracy through prosody matching

Workflow:

Teacher records model sentence with target intonation
Student records imitation attempt
Run Both_pitch_and_loudness analysis
Review metrics together
Identify specific problem areas
Practice with focused exercises

Target metrics: Pitch average < 1.5 st, Loudness average < 4 dB

Singing and Vocal Training

🎤 Vocal Pitch Accuracy Training

Goal: Improve pitch accuracy for singers

Workflow:

Select Pitch_only analysis
Teacher sings target phrase
Student imitates
Analyze pitch matching
Focus on reducing average and maximum differences
Track progress over practice sessions

Target metrics: Professional: < 0.3 st, Student: < 1.0 st

Speech Therapy and Voice Rehabilitation

🏥 Voice Therapy Applications

Goal: Monitor voice parameter changes in therapy

Workflow:

Record baseline (healthy model or patient's best effort)
Record current performance
Compare to track changes
Use metrics as objective measures of progress
Adjust therapy based on results

Applications: Parkinson's voice therapy, vocal fold paralysis, pitch therapy for transgender voice

Accent Reduction and Dialect Training

🌍 Prosody Pattern Acquisition

Goal: Acquire target language/dialect prosody patterns

Workflow:

Native speaker records target prosody patterns
Learner records imitation
Compare using Both analysis
Identify which aspects (pitch vs loudness) need most work
Practice specific problematic patterns

Example: English question intonation, Mandarin tones, Japanese pitch accent

Advanced Educational Techniques

Segmented Practice Approach:

Isolate: Practice single words or short phrases
Analyze: Get immediate feedback on each attempt
Correct: Adjust based on specific metrics
Integrate: Combine into longer phrases
Automate: Practice until metrics reach target ranges

Example progression: Single vowel → Word → Phrase → Sentence → Paragraph

Gamification Strategies:

Score system: Convert metrics to points (lower differences = higher scores)
Level progression: Unlock harder material as metrics improve
Challenge modes: Try to beat previous best scores
Multiplayer: Compare scores among students (healthy competition)
Achievements: Unlock badges for reaching metric milestones
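One possible way to convert metrics to points is a linear mapping, sketched below in Python. The whole scheme is hypothetical: the 0-100 scale, the "worst case" values of 4 st and 8 dB (borrowed from the Novice thresholds in this guide), and the equal weighting of pitch and loudness are illustrative choices, not part of the script.

```python
def score(diff, worst):
    """Map a difference to 0-100 points: 0 difference gives 100,
    a difference at or beyond `worst` gives 0."""
    if worst <= 0:
        raise ValueError("worst must be positive")
    return round(max(0.0, 1.0 - diff / worst) * 100)


def session_score(pitch_avg_st, loudness_avg_db, worst_st=4.0, worst_db=8.0):
    """Average of the pitch score and the loudness score (equal weights)."""
    return round((score(pitch_avg_st, worst_st)
                  + score(loudness_avg_db, worst_db)) / 2)


print(score(2.0, 4.0))          # halfway to the worst case
print(session_score(1.0, 2.0))  # combined score for one attempt
```

A linear scale is easy to explain to students. A teacher who wants to reward the last few tenths of a semitone more strongly could swap in a non-linear mapping instead.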

Troubleshooting Common Issues

Problem: Very low voiced frame count in pitch analysis
Causes: Incorrect pitch range, breathy voice, background noise, recording issues
Solutions: Adjust pitch_floor_hz and pitch_ceiling_hz, improve recording quality, use pop filter

Problem: Extremely high dB differences (> 30 dB)
Causes: Different recording levels, clipping, normalization needed
Solutions: Normalize both recordings to similar RMS levels before comparison
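The idea behind RMS normalization can be sketched in Python on raw sample values. This is a conceptual illustration only: it assumes the recordings are available as lists of floating-point samples, and the function names are made up. In practice the same result is usually obtained with an audio editor or Praat's own scaling commands.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def level_difference_db(teacher, student):
    """Level of the teacher relative to the student, in dB."""
    return 20.0 * math.log10(rms(teacher) / rms(student))


def match_rms(student, teacher):
    """Scale the student samples so their RMS equals the teacher's."""
    student_rms = rms(student)
    if student_rms == 0:
        raise ValueError("student recording is silent")
    gain = rms(teacher) / student_rms
    return [x * gain for x in student]


teacher = [0.5, -0.5, 0.5, -0.5]
student = [0.1, -0.1, 0.1, -0.1]
print(round(level_difference_db(teacher, student), 2))
print(round(rms(match_rms(student, teacher)), 3))
```

Normalizing removes the constant level offset between the recordings, so the remaining dB differences reflect how the loudness contour changes over time rather than the microphone gain. Scaling a quiet recording up by a large factor can clip the samples, so check the peaks afterwards.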

Problem: Inconsistent results across attempts
Causes: Variable performance, different recording conditions, analysis parameter changes
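Solutions: Keep recording conditions and analysis parameters the same across attempts, and compare several attempts rather than relying on a single one

Problem: Analysis takes very long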
Causes: Long recordings, very small time_step
Solutions: Use shorter segments for practice, increase time_step (e.g., 0.02s), focus on key phrases

Integration with Other Tools

Tool	Integration	Enhanced Capability
Praat TextGrid	Segment recordings by phoneme/word	Compare specific speech segments
Spreadsheet software	Export metrics for tracking	Long-term progress visualization
Audio editor	Pre-process recordings	Normalize, trim, filter before analysis
Recording app	Standardize recording setup	Consistent input quality
Learning management system	Embed in online courses	Distance learning applications
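For the spreadsheet row above, one simple route is to write the session metrics to CSV. The Python sketch below assumes that the averages have been copied by hand from the Praat Info window. The column names and the helper `metrics_to_csv` are illustrative, and the script itself is not described as having an export feature.

```python
import csv
import io


def metrics_to_csv(sessions):
    """Return CSV text for a list of (pitch_avg_st, loudness_avg_db) pairs."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["session", "pitch_avg_st", "loudness_avg_db"])
    for number, (pitch, loudness) in enumerate(sessions, start=1):
        writer.writerow([number, pitch, loudness])
    return buffer.getvalue()


text = metrics_to_csv([(3.2, 6.5), (2.1, 4.8), (1.4, 3.2), (0.9, 2.4)])
print(text)
```

Saving the returned text to a `.csv` file lets any spreadsheet program open it and plot the session-by-session trend.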

Citation: Audio Tools Team. (2024). Praat AudioTools: Pitch & Loudness Comparison Tool. Educational tool for vocal and pronunciation training.

Further Reading:

Pronunciation teaching: Celce-Murcia, M., Brinton, D. M., & Goodwin, J. M. (2010). Teaching Pronunciation. Cambridge University Press.

Voice science: Titze, I. R. (2000). Principles of Voice Production. National Center for Voice and Speech.

Prosody analysis: Hirst, D. & Di Cristo, A. (1998). Intonation Systems: A Survey of Twenty Languages. Cambridge University Press.

Computer-assisted pronunciation: Neri, A., Cucchiarini, C., & Strik, H. (2008). The effectiveness of computer-based speech corrective feedback. Computer Assisted Language Learning.