Hidden Markov Model Timbre Sequencer — User Guide
True Hidden Markov Model (HMM) for timbre-based sequence generation with Gaussian emission models, Viterbi decoding, and comprehensive visualization.
What this does
This script implements a true Hidden Markov Model (HMM) for timbre-based sequence generation. Unlike a basic Markov chain, it models hidden states (timbre classes) with Gaussian emission distributions, learns transition probabilities from the input, and uses Viterbi decoding for state inference.
- Hidden States: Discovered timbre classes (via k-means initialization)
- Observations: 4D feature vectors (intensity, pitch, centroid, slope)
- Emission Model: Gaussian distributions per state
- Transition Model: Learned state-to-state probabilities
- Decoding: Viterbi algorithm to find most likely state path
- Generation: Sample states → sample observations → synthesize
Key Features:
- True HMM with emission probabilities
- Viterbi decoding for state inference
- Comprehensive 6-panel visualization
- Emission-based probabilistic frame selection
- State statistics and diagnostics
- Match input duration option
- Stereo output support
- Optimized crossfade processing
Technical Implementation:
1. Frame-based analysis: Divide audio into overlapping frames.
2. Feature extraction: Extract 4 timbre features per frame: intensity (RMS), pitch (F0), spectral centroid (brightness), spectral slope (high/low balance).
3. Normalization: Z-score normalization (mean=0, std=1).
4. K-means initialization: Group frames into K timbre states.
5. Emission modeling: Compute Gaussian parameters (mean, std) for each state.
6. Transition learning: Build probability matrix from state sequences.
7. Viterbi decoding: Find optimal state path through input.
8. Sequence generation: Generate new state sequence using transition probabilities.
9. Probabilistic synthesis: Sample frames based on emission probabilities.
10. Crossfade concatenation: Smooth grain assembly with configurable overlap.
Quick start
- In Praat Objects window, select a single Sound object (5-30 seconds recommended).
- Open script: HMM_Timbre_Sequencing.praat
- Choose a Preset or use Custom:
- Custom — Use your own parameter values
- Fine Grain — Subtle, 12 states (10ms frames, 5ms hop)
- Coarse Grain — Bold, 5 states (100ms frames, 50ms hop)
- Textural — Dense, 16 states (20ms frames, 10ms hop)
- Rhythmic — Pulse, 8 states (30ms frames, 15ms hop)
- Experimental — Glitchy, 24 states (8ms frames, 4ms hop)
- Set Feature Extraction parameters:
- Frame_size_ms: 20ms (analysis window)
- Frame_hop_ms: 10ms (50% overlap)
- Configure HMM Parameters:
- Number_of_states_K: 8 (timbre clusters)
- Max_kmeans_iterations: 50 (clustering iterations)
- Set Sequence Generation options:
- Match_input_duration: ON (match source length) or OFF (use custom length)
- Output_length_frames: 200 (only used if Match_input_duration is OFF)
- Configure Output settings:
- Crossfade_ms: 5ms (grain crossfade)
- Stereo_output: ON (stereo) or OFF (mono)
- Draw_visualization: ON (6-panel visualization)
- Show_info: ON (detailed statistics)
- Click OK — script analyzes, learns HMM, generates sequence
- Output appears in Objects window with preset name (e.g., "FineGrain_HMM_Sequence")
- Check Info window for detailed HMM statistics (if Show_info enabled)
- View Picture window for 6-panel visualization (if Draw_visualization enabled)
About Presets
The script includes 6 presets that configure all parameters for different sonic results:
- Fine Grain: High resolution (12 states, 10ms frames, 5ms hop) — captures subtle timbral nuances, smooth transitions, good for detailed textures
- Coarse Grain: Low resolution (5 states, 100ms frames, 50ms hop) — bold timbral shifts, rhythmic chunks, good for dramatic contrasts
- Textural: Dense (16 states, 20ms frames, 10ms hop) — complex timbral palette, rich textures, 600 frames output for extended morphing
- Rhythmic: Pulsed (8 states, 30ms frames, 15ms hop) — balanced resolution, 128 frames output creates clear rhythmic patterns
- Experimental: Extreme (24 states, 8ms frames, 4ms hop) — maximum timbral detail, granular, glitchy, unpredictable sequences
- Custom: Manual control — set all parameters yourself for specific needs
When you select a preset, all frame size, hop, K, output length, and crossfade parameters are automatically configured. You can still modify individual parameters after selecting a preset.
HMM Theory
What is a Hidden Markov Model?
A Hidden Markov Model is a statistical model that describes a system with hidden states that cannot be directly observed. Instead, we observe outputs (emissions) that are probabilistically related to the hidden states.
- States (S): Hidden timbre classes (e.g., "bright", "dark", "noisy", etc.)
- Observations (O): Measurable features extracted from audio frames
- Transition Probabilities P(St+1|St): Probability of moving from state i to state j
- Emission Probabilities P(O|S): Probability of observing features given a state
- Initial Probabilities P(S1): Starting state distribution
HMM vs Simple Markov Chain
| Aspect | Simple Markov Chain | Hidden Markov Model |
|---|---|---|
| States | Directly observed | Hidden, inferred from observations |
| Observations | States themselves | Separate from states |
| Emission model | None | Probabilistic (Gaussian in this script) |
| State inference | Direct assignment | Viterbi algorithm |
| Generation | Follow transitions | Sample states, then sample observations |
| Flexibility | Rigid state assignments | Probabilistic, handles uncertainty |
Key Algorithms
K-means clustering: Groups similar feature vectors into K clusters to initialize the hidden states. Each cluster becomes a timbre class.
Emission modeling: For each state k, compute Gaussian parameters:
- Mean μk = average feature vector for all frames in state k
- Std σk = standard deviation for each feature dimension
Emission probability: P(observation | state k) ~ N(μk, σk²)
Transition learning: Count transitions from state i to state j across the input sequence, then normalize each row to get probabilities.
Viterbi decoding: Finds the most likely state sequence given the observations using dynamic programming. For each frame, it computes the most probable path to each state, considering both transition and emission probabilities.
Sequence generation: Generates new state sequences by sampling from the transition matrix, starting from a random state and following the transition probabilities.
Probabilistic frame selection: For each generated state, selects audio frames weighted by their emission probability (how "typical" they are for that state). This favors frames that are typical of the state and tends to sound more natural than uniform random selection.
Mathematical Formulation
HMM Parameters
A = {aij} transition probabilities, aij = P(St+1 = j | St = i)
B = {bj(o)} emission probabilities, bj(o) = P(Ot = o | St = j)
π = {πi} initial state probabilities, πi = P(S1 = i)
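Gaussian Emission
$$b_j(o) = \prod_{d=1}^{4} \frac{1}{\sqrt{2\pi}\,\sigma_{j,d}} \exp\left(-\frac{(o_d - \mu_{j,d})^2}{2\sigma_{j,d}^2}\right)$$
where μj = mean feature vector for state j (components μj,d)
σj = standard deviation for state j, one value per feature dimension (components σj,d)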
Viterbi Algorithm
$$\delta_t(j) = \max_i \left[\delta_{t-1}(i)\, a_{ij}\right] b_j(o_t)$$
where δt(j) = probability of most likely path ending in state j at time t
Timbre Features
The script extracts four timbre descriptors from each audio frame to create a 4-dimensional feature space:
1. Intensity (RMS Energy)
What it measures: Overall loudness/energy of the frame
Calculation: Root Mean Square of amplitude values
Timbral meaning: Distinguishes loud vs quiet passages, attacks vs decays
Normalization: Z-score normalized across all frames
2. Pitch (Fundamental Frequency)
What it measures: Perceived pitch/F0 in Hertz
Calculation: Autocorrelation-based F0 estimation (Praat's pitch analysis)
Timbral meaning: Harmonic content, spectral structure, perceived height
Handling undefined pitch: Frames without clear pitch are assigned 0 Hz before normalization
Normalization: Z-score normalized across all frames
3. Spectral Centroid
What it measures: "Center of mass" of the spectrum — brightness
Calculation: Weighted average of frequency bins by their magnitudes
$$\text{centroid} = \frac{\sum_f f \, M(f)}{\sum_f M(f)}$$
where f = frequency, M(f) = magnitude at f
Timbral meaning: Bright (high centroid) vs dark (low centroid) sounds
Normalization: Z-score normalized across all frames
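The weighted-average calculation described above can be sketched in a few lines of Python. This is an illustration only, not the script's code (the script obtains the centroid from Praat's spectrum queries); the function name `spectral_centroid` and the plain-list inputs are assumptions made for the example.

```python
def spectral_centroid(freqs, mags):
    """Magnitude-weighted mean frequency of one spectrum.

    freqs: bin centre frequencies in Hz; mags: magnitude per bin.
    Returns 0.0 for a silent frame (assumed convention for this sketch).
    """
    total = sum(mags)
    if total == 0:
        return 0.0
    return sum(f * m for f, m in zip(freqs, mags)) / total
```

For example, bins at 100 Hz and 300 Hz with magnitudes 3 and 1 give a centroid of 150 Hz, pulled toward the stronger low bin.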
4. Spectral Slope
What it measures: Balance between high and low frequencies
Calculation: Linear regression of log-magnitude spectrum
Timbral meaning: Steeply negative slope = energy concentrated in low frequencies (dull); slope closer to zero or positive = relatively more high-frequency energy (bright)
Normalization: Z-score normalized across all frames
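As a hedged sketch of the linear-regression calculation, the Python function below fits a least-squares line to the magnitude spectrum in dB against frequency in Hz and returns its slope. The dB scaling, the `floor` guard against log of zero, and the name `spectral_slope` are assumptions of this example; the script may scale or weight the spectrum differently.

```python
import math

def spectral_slope(freqs, mags, floor=1e-12):
    """Least-squares slope of magnitude (dB) against frequency (Hz)."""
    # Convert to dB, clamping tiny magnitudes so log10 stays defined.
    db = [20.0 * math.log10(max(m, floor)) for m in mags]
    n = len(freqs)
    mean_f = sum(freqs) / n
    mean_db = sum(db) / n
    cov = sum((f - mean_f) * (d - mean_db) for f, d in zip(freqs, db))
    var = sum((f - mean_f) ** 2 for f in freqs)
    # slope < 0 means magnitude falls as frequency rises.
    return cov / var if var > 0 else 0.0
```

A spectrum that drops 20 dB per 1000 Hz yields a slope of about -0.02 dB/Hz, and a flat spectrum yields 0.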
Feature Space Visualization
The 6-panel visualization includes a 2D projection of the 4D feature space (Pitch vs Centroid), showing:
- Small colored dots: Individual frames colored by their state assignment
- Large circles with black outlines: Emission means (μ) for each state
- State labels: Numbers inside the mean circles
This reveals how well-separated the timbre classes are and how the Gaussian emission model represents each state in feature space.
Z-score Normalization
All features are normalized to have mean = 0 and standard deviation = 1 before k-means clustering:
$$z = \frac{x - \mu}{\sigma}$$
where μ = mean across all frames
σ = standard deviation across all frames
This ensures all features contribute equally to clustering regardless of their original scale.
Algorithm Details
Processing Pipeline
Step 1 (Frame-based analysis): Divide the sound into overlapping frames using a Hann window:
- Frame size: User-defined (e.g., 20ms)
- Hop size: User-defined (e.g., 10ms = 50% overlap)
- Number of frames: ⌊(duration - frame_size) / hop_size⌋ + 1
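The frame-count formula above is easy to check with a small Python helper. The function name `frame_count` is hypothetical, and working in milliseconds is a choice made here to avoid floating-point rounding at the floor step.

```python
def frame_count(duration_ms, frame_ms, hop_ms):
    """Number of full frames: floor((duration - frame_size) / hop_size) + 1."""
    if duration_ms < frame_ms:
        return 0  # sound is shorter than a single frame
    return int((duration_ms - frame_ms) // hop_ms) + 1
```

With the default 20 ms frames and 10 ms hop, a 10-second sound gives 999 frames; with the Coarse Grain settings (100 ms frames, 50 ms hop) the same sound gives 199.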
Step 2 (Feature extraction): For each frame, extract:
- Intensity via Get energy
- Pitch via To Pitch (ac) — autocorrelation method
- Spectral centroid via To Spectrum → Get centre of gravity
- Spectral slope via linear regression on log-magnitude spectrum
Store in 4×N feature matrix where N = number of frames
Step 3 (Normalization): Apply z-score normalization to each feature dimension:
- Compute mean μ and std σ across all N frames
- Transform: z = (x - μ) / σ
- Result: All features have mean 0, std 1
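A minimal Python sketch of this normalization for one feature dimension follows. It assumes the population standard deviation (dividing by N) and maps a constant feature to all zeros; both conventions are assumptions of the example, and the script may handle them differently.

```python
import math

def zscore(values):
    """Return (x - mean) / std for each value in one feature dimension."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        # Constant feature: no variation to scale, so return zeros.
        return [0.0] * n
    return [(v - mean) / std for v in values]
```

Applying `zscore` separately to the intensity, pitch, centroid, and slope columns puts all four on a common scale before clustering.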
Step 4 (K-means initialization): Initialize K timbre states using k-means:
- Random initialization: Pick K random frames as initial centroids
- Assignment step: Assign each frame to nearest centroid (Euclidean distance in 4D)
- Update step: Recompute centroids as mean of assigned frames
- Repeat assignment-update until convergence or max iterations
- Result: Each frame has a state label ∈ {1, 2, ..., K}
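The assignment and update loop above can be sketched in plain Python as follows. This is a generic k-means, not a transcription of the script: the function name, the `seed` argument, the choice to leave an empty cluster's centroid unchanged, and the 0-based labels (the guide numbers states from 1) are all assumptions of the example.

```python
import random

def kmeans(points, k, max_iter=50, tol=1e-3, seed=0):
    """Cluster feature vectors; returns (labels, centroids) with 0-based labels."""
    rng = random.Random(seed)
    # Random initialization: pick k distinct frames as starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        shift = 0.0
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                continue  # empty cluster keeps its previous centroid
            new = [sum(dim) / len(members) for dim in zip(*members)]
            shift = max(shift, max(abs(a - b) for a, b in zip(new, centroids[c])))
            centroids[c] = new
        if shift < tol:
            break  # centroids have stabilized
    return labels, centroids
```

Because the starting centroids are random, different seeds can give different clusterings on less clearly separated data, which is why repeated runs of the script may produce different state assignments.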
Step 5 (Emission modeling): For each state k, compute Gaussian parameters:
- Find all frames assigned to state k
- Compute mean μk = average feature vector
- Compute std σk = standard deviation for each dimension
- Store as emission model: bk(o) ~ N(μk, σk²)
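As an illustrative Python sketch, the per-state mean and standard deviation can be estimated as below, together with a log-density that treats the feature dimensions as independent Gaussians. The function names, the `min_std` floor (which avoids division by zero for states with one frame), and the use of `None` for empty states are assumptions of this example rather than details confirmed by the script.

```python
import math

def emission_params(features, labels, k, min_std=1e-3):
    """Per-state (mean, std) lists; None for a state with no frames."""
    params = []
    for state in range(k):
        members = [f for f, lab in zip(features, labels) if lab == state]
        if not members:
            params.append(None)
            continue
        dims = list(zip(*members))  # one tuple of values per feature dimension
        mean = [sum(d) / len(d) for d in dims]
        std = [
            max(math.sqrt(sum((x - m) ** 2 for x in d) / len(d)), min_std)
            for d, m in zip(dims, mean)
        ]
        params.append((mean, std))
    return params

def log_emission(obs, mean, std):
    """Log of the product of independent 1-D Gaussian densities."""
    return sum(
        -0.5 * math.log(2 * math.pi * s * s) - (o - m) ** 2 / (2 * s * s)
        for o, m, s in zip(obs, mean, std)
    )
```

Working with log-densities rather than raw probabilities avoids numerical underflow when many frames are multiplied together.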
Step 6 (Transition learning): Build the K×K transition matrix A:
- Initialize count matrix C[i][j] = 0
- For each consecutive pair (St, St+1), increment C[St][St+1]
- Normalize rows: A[i][j] = C[i][j] / Σk C[i][k]
- Add smoothing for zero-probability transitions (optional)
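The counting and row normalization above can be sketched in Python like this. The `smoothing` argument (a constant added to every count) and the uniform fallback for a state that never occurs as a source are assumptions of the example; the guide only says that smoothing is optional.

```python
def transition_matrix(states, k, smoothing=0.0):
    """Row-normalized transition probabilities from a 0-based state sequence."""
    counts = [[smoothing] * k for _ in range(k)]
    # Count each consecutive pair (S_t, S_t+1).
    for prev, nxt in zip(states, states[1:]):
        counts[prev][nxt] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        if total > 0:
            matrix.append([c / total for c in row])
        else:
            matrix.append([1.0 / k] * k)  # unseen state: uniform row (assumed)
    return matrix
```

For the sequence `[0, 0, 1, 0, 1, 1]` with K = 2, row 0 becomes [1/3, 2/3] and row 1 becomes [0.5, 0.5].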
Step 7 (Viterbi decoding): Find the optimal state path through the input observations:
- Initialize: δ1(j) = πj × bj(o1) for all states j
- Recursion: δt(j) = max over i of [δt-1(i) × aij] × bj(ot)
- Termination: Best path probability = max over j of δT(j)
- Backtrack to recover state sequence
This gives a "clean" state sequence that respects both transition and emission probabilities.
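The initialize, recurse, terminate, and backtrack steps above can be sketched in Python as follows. The sketch works in the log domain (sums instead of products) to avoid underflow on long inputs, which is a common practice and an assumption here, not a confirmed detail of the script. The function name and the 0-based state indices are also choices of this example.

```python
import math

def viterbi(log_pi, log_a, log_b):
    """Most likely state path.

    log_pi[j]: log initial probability of state j
    log_a[i][j]: log transition probability i -> j
    log_b[t][j]: log emission density of observation t under state j
    """
    n_states = len(log_pi)
    delta = [log_pi[j] + log_b[0][j] for j in range(n_states)]
    backptr = []
    for t in range(1, len(log_b)):
        new_delta, ptr = [], []
        for j in range(n_states):
            # Best predecessor i for state j at time t.
            best_i = max(range(n_states), key=lambda i: delta[i] + log_a[i][j])
            ptr.append(best_i)
            new_delta.append(delta[best_i] + log_a[best_i][j] + log_b[t][j])
        delta = new_delta
        backptr.append(ptr)
    # Termination: best final state, then backtrack through the pointers.
    state = max(range(n_states), key=lambda j: delta[j])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

With strongly self-transitioning states, a single ambiguous frame in the middle of a run is assigned to the surrounding state rather than flipping, which is the smoothing effect decoding adds over the raw k-means labels.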
Step 8 (Sequence generation): Generate a new state sequence of the desired length:
- Choose random initial state (or use input distribution)
- For each position: Sample next state from P(St+1|St) using transition matrix row
- Result: New state sequence with learned transition probabilities
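Sampling a state sequence from the transition matrix can be sketched in Python as below. The function name `generate_states`, the optional `start` argument, and passing in a `random.Random` instance for reproducibility are assumptions of this example; states are 0-based here.

```python
import random

def generate_states(matrix, length, rng, start=None):
    """Sample a state sequence by following rows of the transition matrix."""
    if length <= 0:
        return []
    k = len(matrix)
    # Random initial state unless the caller fixes one.
    state = rng.randrange(k) if start is None else start
    sequence = [state]
    for _ in range(length - 1):
        # Draw the next state with probabilities from the current state's row.
        state = rng.choices(range(k), weights=matrix[state])[0]
        sequence.append(state)
    return sequence
```

Usage: `generate_states(matrix, 200, random.Random())` draws a fresh 200-frame sequence on each call, while passing `random.Random(42)` makes the result repeatable.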
Step 9 (Probabilistic synthesis): For each generated state, select a representative audio frame:
- Find all source frames assigned to this state
- Compute emission probability for each: P(frame features | state)
- Sample frame weighted by emission probabilities (higher = more likely)
- Extract corresponding audio segment from source
This creates more "typical" outputs compared to random frame selection.
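A hedged Python sketch of the weighted selection: given the frame indices assigned to a state and a log emission density for each, draw one index with probability proportional to its density. Subtracting the largest log value before exponentiating is a standard trick to keep the weights in a safe numeric range; that detail, the function name, and the argument layout are assumptions of this example.

```python
import math
import random

def select_frame(candidates, log_probs, rng):
    """Pick one frame index, weighted by emission density.

    candidates: source frame indices assigned to the state
    log_probs: log emission density of each candidate under that state
    """
    peak = max(log_probs)
    # exp(lp - peak) is proportional to the density and never overflows.
    weights = [math.exp(lp - peak) for lp in log_probs]
    return rng.choices(candidates, weights=weights)[0]
```

Frames near the state's mean get the largest weights, so outlier frames can still be chosen but only rarely.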
Step 10 (Crossfade concatenation): Concatenate the selected frames with crossfading:
- For each frame, extract audio with configured crossfade overlap
- Apply crossfade: linear fade-out on previous frame, fade-in on current frame
- Sum overlapping regions
- If stereo output: Generate two independent sequences (different random seeds)
- Result: Smooth concatenation without clicks
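The overlap-and-sum step can be sketched in Python on plain lists of samples. This example assumes linear fades whose gains sum to 1 across the overlap and an overlap given in samples (roughly Crossfade_ms / 1000 × sample rate); the function name and these conventions are assumptions, and the script's own fade shape may differ.

```python
def crossfade_concat(grains, overlap):
    """Join sample lists, blending `overlap` samples at each junction."""
    out = list(grains[0])
    for grain in grains[1:]:
        n = min(overlap, len(out), len(grain))
        base = len(out) - n  # start of the overlap region in the output
        for i in range(n):
            fade_in = (i + 1) / (n + 1)
            # Previous grain fades out while the new grain fades in.
            out[base + i] = out[base + i] * (1.0 - fade_in) + grain[i] * fade_in
        out.extend(grain[n:])
    return out
```

Each junction shortens the result by `overlap` samples, so two 4-sample grains joined with an overlap of 2 produce 6 samples, and a constant signal stays constant through the fade.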
Optimization Techniques
- Global analysis objects: Create Intensity, Pitch, Spectrum once per source sound, reuse for all frames
- Vectorized operations: Store features in arrays for batch processing
- Early convergence: Stop k-means when centroids stabilize (< 0.001 change)
- Efficient frame extraction: Use Extract part for frame isolation instead of copying entire sound
- Memory management: Remove temporary objects immediately after use
Parameters Guide
Preset Selection
Preset: Optionmenu (6 options)
- Custom — Manual parameter control
- Fine Grain (subtle, 12 states) — 10ms frames, 5ms hop, 12 states, 400 frames output, 3ms crossfade
- Coarse Grain (bold, 5 states) — 100ms frames, 50ms hop, 5 states, 80 frames output, 10ms crossfade
- Textural (dense, 16 states) — 20ms frames, 10ms hop, 16 states, 600 frames output, 4ms crossfade
- Rhythmic (pulse, 8 states) — 30ms frames, 15ms hop, 8 states, 128 frames output, 5ms crossfade
- Experimental (glitchy, 24 states) — 8ms frames, 4ms hop, 24 states, 500 frames output, 1ms crossfade
When a preset is selected, all parameters below are automatically set. You can still manually adjust them afterward.
Feature Extraction
Frame_size_ms: Positive real (default: 20)
Duration of each analysis frame in milliseconds.
- Smaller values (5-15ms): High time resolution, captures transients, percussive details. More frames = slower processing.
- Medium values (20-40ms): Balanced time-frequency resolution. Good for most sounds.
- Larger values (50-200ms): Better frequency resolution, smoother features. Captures sustained timbres well.
Rule of thumb: Use smaller frames for fast-changing sounds (drums, speech), larger frames for sustained sounds (drones, pads).
Frame_hop_ms: Positive real (default: 10)
Time step between consecutive frames in milliseconds.
- Smaller hop: More overlap, smoother analysis, more frames (slower). 50% overlap is common (hop = 0.5 × frame_size).
- Larger hop: Less overlap, faster processing, fewer frames, may miss rapid changes.
- No overlap: hop = frame_size. Fastest but may create discontinuities.
Rule of thumb: Use hop = 0.5 × frame_size for most cases. Use smaller hop (0.25×) for very smooth analysis.
HMM Parameters
Number_of_states_K: Positive integer (default: 8)
Number of hidden timbre states (clusters) to discover.
- Fewer states (3-6): Coarse timbral categories, bold contrasts, simpler model, faster processing.
- Medium states (7-12): Balanced detail, captures most timbral variation without overfitting.
- More states (13-20+): Fine timbral distinctions, risk of overfitting, slower processing.
Rule of thumb: Use K ≈ 0.3 × sqrt(number_of_frames) or 5-12 for most sounds. If output sounds too "samey", increase K. If too random/incoherent, decrease K.
Max_kmeans_iterations: Positive integer (default: 50)
Maximum iterations for k-means clustering convergence.
- Fewer iterations (10-30): Faster, may not fully converge, less stable clustering.
- More iterations (50-100): Better convergence, more stable results, slightly slower.
Rule of thumb: 50 is usually sufficient. The algorithm often converges earlier. Increase if you see warnings about non-convergence in Info window.
Sequence Generation
Match_input_duration: Boolean (default: ON)
Automatically match output duration to input duration.
- ON: Output length = number of input frames. Convenient for creating variations of same length.
- OFF: Use manual Output_length_frames setting. Allows shorter or longer outputs.
Use case: Turn ON for quick variations. Turn OFF when you need precise control over output length.
Output_length_frames: Positive integer (default: 200)
Number of frames in generated sequence (only used if Match_input_duration is OFF).
- Fewer frames (50-100): Short sequences, good for testing, loops, or brief textures.
- Medium frames (150-300): Typical output length, allows full HMM behavior to emerge.
- More frames (400+): Long sequences, extended evolution, may become repetitive.
Duration calculation: Output duration ≈ Output_length_frames × Frame_hop_ms / 1000 seconds
Example: 200 frames × 10ms = 2.0 seconds
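The duration estimate above, and its inverse for choosing Output_length_frames from a target duration, can be written as two small Python helpers. Both function names are hypothetical, and the result is approximate because the frame size and crossfade also affect the final length slightly.

```python
def output_duration_s(n_frames, hop_ms):
    """Approximate output duration in seconds."""
    return n_frames * hop_ms / 1000.0

def frames_for_duration(target_s, hop_ms):
    """Approximate Output_length_frames needed for a target duration."""
    return round(target_s * 1000.0 / hop_ms)
```

For instance, a 6-second output at the Rhythmic preset's 15 ms hop needs about 400 frames.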
Output Settings
Crossfade_ms: Positive real (default: 5)
Crossfade duration between consecutive frames in milliseconds.
- No crossfade (0ms): Potential clicks, abrupt transitions. Not recommended.
- Short crossfade (1-5ms): Minimal smoothing, preserves transients, granular texture.
- Medium crossfade (5-15ms): Smooth transitions, natural sound, default choice.
- Long crossfade (20-50ms): Very smooth, blurred, may lose detail.
Rule of thumb: Use 5-10ms for most sounds. Decrease for percussive sounds to preserve attack. Increase for legato/sustained sounds.
Stereo_output: Boolean (default: ON)
Generate stereo (2-channel) or mono (1-channel) output.
- ON: Two independent sequences with different random seeds. Wider stereo field, richer sound, twice the computation.
- OFF: Single mono sequence. Faster processing, smaller file size.
Use case: Use stereo for final output, spatial music, or immersive textures. Use mono for testing or when stereo is not needed.
Draw_visualization: Boolean (default: ON)
Generate 6-panel comprehensive visualization in Picture window.
- ON: Creates detailed visual analysis showing: (1) Original state sequence, (2) Transition matrix heatmap, (3) Generated state sequence, (4) Input/output waveforms, (5) State distribution histograms, (6) Feature space projection
- OFF: No visualization. Faster processing.
Panels explained:
- Original State Sequence: State assignments over time in input sound
- Transition Matrix: Learned probabilities as color-coded grid (brighter = higher probability)
- Generated Sequence: New state sequence over time
- Waveforms: Input (top) and output (bottom) amplitude over time
- State Distributions: Histogram comparing state frequencies in input vs output
- Feature Space: 2D projection (Pitch vs Centroid) showing state clusters and emission means
Use case: Turn ON for analysis, debugging, or presentation. Turn OFF for batch processing or when visualization is not needed.
Show_info: Boolean (default: ON)
Display detailed statistics and diagnostics in Info window.
- ON: Prints comprehensive information including: HMM model structure, state statistics, emission parameters, transition probabilities, generation details, convergence info
- OFF: Minimal output. Only shows completion message.
Use case: Turn ON for first runs, debugging, or when you need technical details. Turn OFF for quiet batch processing.
Parameter Interactions
- Frame_size vs Frame_hop: Typical ratio is 2:1 (e.g., 20ms frame, 10ms hop). Larger ratios (e.g., 4:1) = more overlap = smoother but slower.
- K vs Frame count: More frames support more states. Don't use K > 30-40% of frame count.
- Output_length vs Frame_hop: Longer hop = longer real-time duration for same frame count.
- Crossfade vs Frame_hop: Crossfade should be ≤ Frame_hop to avoid excessive blurring.
- Stereo vs Processing time: Stereo roughly doubles generation time (two independent sequences).
Applications
1. Algorithmic Composition
Generate musical material that evolves according to learned timbral patterns:
- Variation generation: Create multiple versions of a musical phrase with different temporal orderings but same timbral palette
- Motivic development: Extract timbral "motifs" from one sound and apply to another
- Form building: Generate long sequences that naturally develop timbre over time
- Orchestration ideas: Analyze orchestral textures to discover typical timbre successions
2. Sound Design
Create new textures and timbral evolutions:
- Texture synthesis: Generate evolving ambient textures from field recordings
- Glitch effects: Use high K and short frames for granular, glitchy results
- Morphing sequences: Smooth timbral transitions learned from source material
- Rhythmic patterns: Extract and recombine rhythmic elements from loops
3. Audio Analysis
Use HMM as an analysis tool:
- Timbre segmentation: Identify distinct timbral regions in recordings
- Style analysis: Compare transition matrices between different genres/performers
- Performance analysis: Study how performers navigate timbral space over time
- Similarity metrics: Use emission models to measure timbral distance
4. Generative Music
Real-time or offline generation systems:
- Live performance: Pre-compute HMMs from live input, generate variations on the fly
- Installation art: Continuously evolving soundscapes based on environmental recordings
- Interactive systems: User selects K, frame size, etc. to explore parameter space
- Crossfading installations: Generate multiple sequences and crossfade between them
5. Music Information Retrieval
Extract meaningful timbral information:
- Instrument recognition: Train HMMs on different instruments, compare likelihoods
- Audio fingerprinting: Use transition matrices as compact timbral signatures
- Cover song detection: Compare timbral evolution patterns across versions
- Genre classification: Different genres may have characteristic transition patterns
6. Educational Uses
Teaching timbre, probability, and signal processing:
- Timbre perception: Demonstrate how timbre can be quantified and manipulated
- Markov models: Concrete audio example of abstract probabilistic models
- Feature extraction: Visualize spectral features in musical context
- Experimental composition: Students create pieces using HMM-generated material
Creative Workflows
Variations on a single source:
- Select a musical phrase or texture as source
- Use Match_input_duration = ON to keep same length
- Generate 5-10 variations with different random seeds (run script multiple times)
- Arrange variations in a DAW to create evolving section
- Layer variations for complex textures
Hybrid texture from two sources:
- Prepare two contrasting source sounds (e.g., water, fire)
- Analyze each with same K value
- Manually blend transition matrices (external processing)
- Generate sequence from blended matrix
- Result: Hybrid texture with elements of both sources
Improvisation as compositional sketch:
- Record short improvisations exploring different timbres
- Analyze with high K (12-16) to capture nuances
- Generate long sequences (400+ frames) to develop ideas
- Use generated sequences as compositional sketches
- Refine and orchestrate based on generated material
Complete Workflow
Beginner Workflow: First Steps
- Prepare source sound:
- Open or record a sound in Praat (5-20 seconds recommended)
- Listen to it — understand its timbral content
- Optionally trim to most interesting section
- Run script with defaults:
- Select your Sound object
- Run HMM_Timbre_Sequencing.praat
- Keep all defaults (or try a preset like "Fine Grain")
- Click OK
- Examine output:
- Play the generated sound
- Look at the 6-panel visualization in Picture window
- Read the Info window statistics
- Understand what happened:
- Original state sequence (top-left): Shows timbral evolution in source
- Transition matrix (top-center): Shows which states follow others
- Generated sequence (top-right): Shows new temporal ordering
- Waveforms (middle-left): Compare input and output amplitude envelopes
- State distributions (middle-right): Compare state frequencies
- Feature space (bottom): Shows timbre clusters in 2D
- Experiment:
- Try different presets to hear their effects
- Adjust K up and down to change granularity
- Toggle Stereo_output to hear mono vs stereo differences
Intermediate Workflow: Targeted Results
- Define your goal:
- What kind of output do you want? (smooth, glitchy, rhythmic, textural, etc.)
- What should it preserve from source? (harmonic content, rhythmic feel, spectral character)
- What should it change? (temporal order, density, evolution)
- Choose appropriate preset as starting point:
- Smooth, subtle: Fine Grain
- Bold, chunky: Coarse Grain
- Dense, complex: Textural
- Pulsed, metric: Rhythmic
- Extreme, granular: Experimental
- Fine-tune parameters:
- Adjust K based on source complexity (more variety = higher K)
- Adjust frame size for time resolution (transients = smaller frames)
- Adjust crossfade for smoothness vs detail
- Iterate:
- Generate multiple versions with slight parameter variations
- Compare outputs to find optimal settings
- Use Show_info to diagnose issues (e.g., states with zero counts)
- Post-process:
- Export to audio editor or DAW
- Layer multiple generations for richness
- Apply effects (reverb, EQ, compression) to taste
- Combine with other sounds in larger composition
Advanced Workflow: Maximum Control
- Source preparation:
- Select source with clear timbral variety
- Optionally: Pre-process with EQ or dynamics to emphasize certain timbres
- Optionally: Concatenate multiple sources to learn from diverse material
- Parameter optimization:
- Run with Show_info = ON to examine state statistics
- Check if any states have very few frames (< 5) — if so, reduce K
- Check k-means convergence iterations — if hitting max, increase max_iterations
- Examine transition matrix in visualization — are there clear patterns or just noise?
- Feature analysis:
- Study feature space plot to see state separation
- If states overlap heavily, consider reducing K or different frame parameters
- If states are very sparse, consider increasing K
- Generation control:
- For specific duration, set Match_input_duration = OFF and Output_length_frames precisely
- For stereo variety, generate with Stereo_output = ON and compare left/right channels
- For multiple takes, run script repeatedly (different random seeds each time)
- Batch processing:
- Script multiple sources with same parameters for consistency
- Disable Draw_visualization and Show_info for faster processing
- Use Praat's scripting to automate parameter sweeps
- External analysis:
- Export transition matrix (copy from Info window if Show_info = ON)
- Analyze in Python/R for deeper statistical understanding
- Manually modify matrix and use in future generations (requires script modification)
Troubleshooting Common Issues
Problem: Output sounds too random or incoherent
Causes:
- K too high (over-clustering)
- Source too varied (no clear patterns)
- Frame size too small (noisy features)
Solutions:
- Reduce K to 5-8
- Use longer frame size (30-50ms)
- Choose source with clearer timbral structure
Problem: Output sounds too repetitive or "samey"
Causes:
- K too low (under-clustering)
- Transition matrix too diagonal (strong self-transitions)
- Source too homogeneous
Solutions:
- Increase K to 12-16
- Use more varied source material
- Try different random seed (run again)
Problem: Clicks or discontinuities in the output
Causes:
- Crossfade too short or zero
- Frame size/hop mismatch
- Very short frames (< 5ms)
Solutions:
- Increase Crossfade_ms to 10-20ms
- Use frame hop = 0.5 × frame size
- Increase frame size to 15-20ms minimum
Problem: Processing is slow
Causes:
- Very long source sound (> 60s)
- Very small frame hop (many frames)
- High K (> 20)
- Visualization enabled
Solutions:
- Trim source to 10-30 seconds
- Increase frame hop (reduce overlap)
- Reduce K to 8-12
- Disable Draw_visualization
- Use mono input instead of stereo
Problem: Some states have few or no frames
Causes:
- K too high for available data
- K-means initialization unlucky
- Some clusters too small
Solutions:
- Reduce K
- Increase max_kmeans_iterations for better convergence
- Run again (different initialization)
Problem: Left and right stereo channels sound identical
Causes:
- Not expected in normal operation, since each channel uses a different random seed
- Possible script error or very deterministic transition matrix
Solutions:
- Check if transition matrix has only one non-zero entry per row (fully deterministic)
- Try different source or parameters to create more probabilistic matrix
Troubleshooting
Error Messages
Cause: No Sound selected, or multiple Sounds selected, or wrong object type selected.
Solution: Select exactly one Sound object in the Objects window before running the script.
Performance Issues
1. Reduce frame count:
- Shorter source sounds
- Larger frame_hop (less overlap)
- Longer frame_size with a proportionally longer frame_hop (fewer frames total)
2. Optimize clustering:
- Smaller K (fewer clusters)
- Fewer max_kmeans_iterations
- Disable Show_info and Draw_visualization for faster runs
3. Memory management:
- Close unused Praat objects before running
- Use mono sounds instead of stereo (half the data)
- Lower sample rate if possible (22050 Hz often sufficient)
4. Batch processing:
- Process multiple sounds in sequence
- Save results and clear Objects window between runs
- Use script automation for parameter sweeps
Limitations and Workarounds
1. First-order Markov assumption:
- Limitation: Only considers immediate previous state, no long-term memory
- Workaround: Use longer frame sizes to capture more context
- Alternative: Generate multiple short sequences and manually arrange
2. Fixed feature set:
- Limitation: Only 4 timbre features (intensity, pitch, centroid, slope)
- Workaround: Modify script to add custom features (requires Praat scripting knowledge)
- Alternative: Pre-process source to emphasize desired timbral aspects
3. Gaussian emission assumption:
- Limitation: Assumes features are normally distributed within each state
- Workaround: Feature normalization helps, but non-Gaussian distributions may not be captured well
- Alternative: Use sources with relatively consistent timbres within each state
4. K-means initialization:
- Limitation: K-means can converge to local optima, results may vary between runs
- Workaround: Run script multiple times and choose best result
- Alternative: Increase max_kmeans_iterations for better convergence
5. Probabilistic frame selection:
- Limitation: Random frame selection weighted by emission probability, not guaranteed optimal
- Workaround: Generate multiple outputs and select best
- Alternative: Modify script for sequential or deterministic selection
Getting Help
1. Documentation:
- This user guide
- Script header comments in HMM_Timbre_Sequencing.praat
- Example files and presets in repository
2. Community:
- Praat user mailing list (users-list@praat.org)
- GitHub issues for bug reports: github.com/ShaiCohen-ops/Praat-plugin_AudioTools/issues
- Audio programming forums (KVR, Lines, etc.)
3. Further Development:
- Fork GitHub repository for custom modifications
- Contribute improvements via pull requests
- Request features via GitHub issues
- Share your results and creative applications
4. Academic References:
- Hidden Markov Models: Rabiner (1989) tutorial paper
- Timbre features: Peeters (2004) CUIDADO project
- Audio descriptor analysis: McAdams (2013) CMJ paper
- Granular synthesis: Roads (2001) Microsound book