Hidden Markov Model Timbre Sequencer — User Guide

True Hidden Markov Model (HMM) for timbre-based sequence generation with Gaussian emission models, Viterbi decoding, and comprehensive visualization.

Author: Shai Cohen
Affiliation: Department of Music, Bar-Ilan University, Israel
Version: 1.0 (2025) - True HMM Implementation
License: MIT License
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

What this does

This script implements a True Hidden Markov Model (HMM) for timbre-based sequence generation. Unlike basic Markov chains, this is a complete HMM with hidden states (timbre classes), Gaussian emission models, learned transition probabilities, and Viterbi decoding for state inference.

HMM Components:
  • Hidden States: Discovered timbre classes (via k-means initialization)
  • Observations: 4D feature vectors (intensity, pitch, centroid, slope)
  • Emission Model: Gaussian distributions per state
  • Transition Model: Learned state-to-state probabilities
  • Decoding: Viterbi algorithm to find most likely state path
  • Generation: Sample states → sample observations → synthesize

Key Features:

Improvements in v1.0: This version implements a true Hidden Markov Model rather than a simple Markov chain. Key enhancements:

  1. Gaussian emission models for each state, with a mean and standard deviation per feature.
  2. Viterbi decoding to find the most likely state sequence given the observations.
  3. Emission-based probabilistic sampling for more natural frame selection.
  4. 6-panel visualization system showing the original state sequence, transition matrix, generated sequence, input/output waveforms, state distributions, and feature space.
  5. Comprehensive state statistics, including state counts, durations, and emission parameters.
  6. Match input duration option for automatic length matching.

Technical Implementation:

  1. Frame-based analysis: Divide audio into overlapping frames.
  2. Feature extraction: Extract 4 timbre features per frame: intensity (RMS), pitch (F0), spectral centroid (brightness), spectral slope (high/low balance).
  3. Normalization: Z-score normalization (mean=0, std=1).
  4. K-means initialization: Group frames into K timbre states.
  5. Emission modeling: Compute Gaussian parameters (mean, std) for each state.
  6. Transition learning: Build probability matrix from state sequences.
  7. Viterbi decoding: Find optimal state path through input.
  8. Sequence generation: Generate new state sequence using transition probabilities.
  9. Probabilistic synthesis: Sample frames based on emission probabilities.
  10. Crossfade concatenation: Smooth grain assembly with configurable overlap.

Quick start

  1. In the Praat Objects window, select a single Sound object (5-30 seconds recommended).
  2. Open script: HMM_Timbre_Sequencing.praat
  3. Choose a Preset or use Custom:
    • Custom — Use your own parameter values
    • Fine Grain — Subtle, 12 states (10ms frames, 5ms hop)
    • Coarse Grain — Bold, 5 states (100ms frames, 50ms hop)
    • Textural — Dense, 16 states (20ms frames, 10ms hop)
    • Rhythmic — Pulse, 8 states (30ms frames, 15ms hop)
    • Experimental — Glitchy, 24 states (8ms frames, 4ms hop)
  4. Set Feature Extraction parameters:
    • Frame_size_ms: 20ms (analysis window)
    • Frame_hop_ms: 10ms (50% overlap)
  5. Configure HMM Parameters:
    • Number_of_states_K: 8 (timbre clusters)
    • Max_kmeans_iterations: 50 (clustering iterations)
  6. Set Sequence Generation options:
    • Match_input_duration: ON (match source length) or OFF (use custom length)
    • Output_length_frames: 200 (only used if Match_input_duration is OFF)
  7. Configure Output settings:
    • Crossfade_ms: 5ms (grain crossfade)
    • Stereo_output: ON (stereo) or OFF (mono)
    • Draw_visualization: ON (6-panel visualization)
    • Show_info: ON (detailed statistics)
  8. Click OK — script analyzes, learns HMM, generates sequence
  9. Output appears in Objects window with preset name (e.g., "FineGrain_HMM_Sequence")
  10. Check Info window for detailed HMM statistics (if Show_info enabled)
  11. View Picture window for 6-panel visualization (if Draw_visualization enabled)

Quick tip: Use Match_input_duration = ON to generate sequences with the same length as your input sound. This is convenient for creating variations while maintaining duration. If you want a specific length, turn it OFF and set Output_length_frames manually. Remember: output duration (in milliseconds) = Output_length_frames × Frame_hop_ms. For example, 200 frames × 10ms = 2000ms = 2.0 seconds. Stereo_output creates a richer spatial experience by generating two independent channels. Draw_visualization produces a 6-panel view of the complete HMM analysis and generation process.

Important: Select exactly one Sound object before running the script. The script will exit if you select zero or multiple sounds. For best results, use sounds between 5 and 30 seconds long. Very short sounds (< 2 seconds) may not provide enough data for meaningful HMM learning. Very long sounds (> 60 seconds) may slow down processing significantly.

About Presets

The script includes 6 presets that configure all parameters for different sonic results:

  • Fine Grain: High resolution (12 states, 10ms frames, 5ms hop) — captures subtle timbral nuances, smooth transitions, good for detailed textures
  • Coarse Grain: Low resolution (5 states, 100ms frames, 50ms hop) — bold timbral shifts, rhythmic chunks, good for dramatic contrasts
  • Textural: Dense (16 states, 20ms frames, 10ms hop) — complex timbral palette, rich textures, 600 frames output for extended morphing
  • Rhythmic: Pulsed (8 states, 30ms frames, 15ms hop) — balanced resolution, 128 frames output creates clear rhythmic patterns
  • Experimental: Extreme (24 states, 8ms frames, 4ms hop) — maximum timbral detail, granular, glitchy, unpredictable sequences
  • Custom: Manual control — set all parameters yourself for specific needs

When you select a preset, all frame size, hop, K, output length, and crossfade parameters are automatically configured. You can still modify individual parameters after selecting a preset.

HMM Theory

What is a Hidden Markov Model?

A Hidden Markov Model is a statistical model that describes a system with hidden states that cannot be directly observed. Instead, we observe outputs (emissions) that are probabilistically related to the hidden states.

HMM Structure:
  • States (S): Hidden timbre classes (e.g., "bright", "dark", "noisy", etc.)
  • Observations (O): Measurable features extracted from audio frames
  • Transition Probabilities P(St+1 = j | St = i): Probability of moving from state i to state j
  • Emission Probabilities P(O|S): Probability of observing features given a state
  • Initial Probabilities P(S1): Starting state distribution

HMM vs Simple Markov Chain

Aspect          | Simple Markov Chain     | Hidden Markov Model
----------------+-------------------------+----------------------------------------
States          | Directly observed       | Hidden, inferred from observations
Observations    | States themselves       | Separate from states
Emission model  | None                    | Probabilistic (Gaussian in this script)
State inference | Direct assignment       | Viterbi algorithm
Generation      | Follow transitions      | Sample states, then sample observations
Flexibility     | Rigid state assignments | Probabilistic, handles uncertainty

Key Algorithms

1. K-means Initialization

Groups similar feature vectors into K clusters to initialize hidden states. Each cluster becomes a timbre class.

2. Emission Modeling

For each state k, compute Gaussian parameters:

  • Mean μk = average feature vector for all frames in state k
  • Std σk = standard deviation for each feature dimension

Emission probability: P(observation | state k) ~ N(μk, σk²)

3. Transition Matrix Learning

Count transitions from state i to state j across the input sequence, then normalize to get probabilities.

P(j|i) = count(i→j) / Σ count(i→k) for all k

4. Viterbi Decoding

Finds the most likely state sequence given the observations using dynamic programming. For each frame, computes the most probable path to each state considering both transition and emission probabilities.

5. Sequence Generation

Generates new state sequences by sampling from the transition matrix, starting from a random state and following transition probabilities.

6. Probabilistic Frame Selection

For each generated state, selects audio frames weighted by their emission probability (how "typical" they are for that state). This creates more natural-sounding output than random selection.

Mathematical Formulation

HMM Parameters

λ = (A, B, π)
A = {aij} transition probabilities, aij = P(St+1 = j | St = i)
B = {bj(o)} emission probabilities, bj(o) = P(Ot = o | St = j)
π = {πi} initial state probabilities, πi = P(S1 = i)

Gaussian Emission

bj(o) = (1/√(2πσj²)) × exp(-(o - μj)² / (2σj²))
where μj = mean feature vector for state j
σj = standard deviation for state j
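
The density above is written for a single feature value. As a minimal Python sketch (not taken from the script), a 4D observation can be scored by treating the four features as independent and summing the per-feature log-densities. The independence assumption and the name log_emission are illustrative only; working in logs avoids numerical underflow.

```python
import math

def log_emission(obs, means, stds):
    """Log of the Gaussian density of `obs` under one state.

    Assumption (not confirmed by the script): features are independent,
    so the joint log-density is the sum of per-feature log-densities.
    """
    total = 0.0
    for o, m, s in zip(obs, means, stds):
        total += -0.5 * math.log(2 * math.pi * s * s) - (o - m) ** 2 / (2 * s * s)
    return total

# An observation at the state mean scores higher than one far from it.
at_mean = log_emission([0.0, 0.0, 0.0, 0.0], [0.0] * 4, [1.0] * 4)
far_away = log_emission([3.0, 0.0, 0.0, 0.0], [0.0] * 4, [1.0] * 4)
print(at_mean > far_away)
```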

Viterbi Algorithm

δt(j) = max_i [δt-1(i) × aij] × bj(ot)
where δt(j) = probability of most likely path ending in state j at time t

Timbre Features

The script extracts four timbre descriptors from each audio frame to create a 4-dimensional feature space:

1. Intensity (RMS Energy)

What it measures: Overall loudness/energy of the frame

Calculation: Root Mean Square of amplitude values

RMS = sqrt(Σ(x²) / N)

Timbral meaning: Distinguishes loud vs quiet passages, attacks vs decays

Normalization: Z-score normalized across all frames
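
For illustration, the RMS formula can be written in a few lines of Python. This is a sketch of the formula only; the script itself relies on Praat's own measurements, and the function name rms is hypothetical.

```python
import math

def rms(samples):
    """Root mean square of a frame: sqrt(sum(x^2) / N)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))

# A full-scale square wave has RMS 1; halving the amplitude halves the RMS.
print(rms([1.0, -1.0, 1.0, -1.0]), rms([0.5, -0.5, 0.5, -0.5]))
```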

2. Pitch (Fundamental Frequency)

What it measures: Perceived pitch/F0 in Hertz

Calculation: Autocorrelation-based F0 estimation (Praat's pitch analysis)

Timbral meaning: Harmonic content, spectral structure, perceived height

Handling undefined pitch: Frames without clear pitch are assigned 0 Hz before normalization

Normalization: Z-score normalized across all frames

3. Spectral Centroid

What it measures: "Center of mass" of the spectrum — brightness

Calculation: Weighted average of frequency bins by their magnitudes

Centroid = Σ(f × M(f)) / Σ M(f)
where f = frequency, M(f) = magnitude at f

Timbral meaning: Bright (high centroid) vs dark (low centroid) sounds

Normalization: Z-score normalized across all frames
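
The centroid formula can be sketched in Python as follows. The inputs (parallel lists of bin frequencies and magnitudes) and the function name are assumptions made for illustration; the script obtains this value from Praat's spectrum analysis rather than from code like this.

```python
def spectral_centroid(freqs, mags):
    """Magnitude-weighted mean frequency: sum(f * M(f)) / sum(M(f))."""
    total = sum(mags)
    if total == 0:
        return 0.0  # silent frame: no meaningful centroid
    return sum(f * m for f, m in zip(freqs, mags)) / total

# More energy in the upper bin pulls the centroid upward (brighter sound).
print(spectral_centroid([100.0, 300.0], [1.0, 1.0]))  # 200.0
print(spectral_centroid([100.0, 300.0], [1.0, 3.0]))  # 250.0
```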

4. Spectral Slope

What it measures: Balance between high and low frequencies

Calculation: Linear regression of log-magnitude spectrum

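Slope = regression coefficient (gradient) of log(magnitude) against frequency

Timbral meaning: A steeply negative slope means energy falls off quickly with frequency (low-heavy, dull); a slope closer to zero or positive means more high-frequency energy (high-heavy, bright)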

Normalization: Z-score normalized across all frames
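
A least-squares fit of log-magnitude against frequency can be sketched in Python as below. The magnitude floor (to avoid log(0)) and the function name are illustrative assumptions, not details taken from the script.

```python
import math

def spectral_slope(freqs, mags, floor=1e-12):
    """Least-squares slope of log(magnitude) versus frequency."""
    logs = [math.log(max(m, floor)) for m in mags]
    n = len(freqs)
    mean_f = sum(freqs) / n
    mean_l = sum(logs) / n
    sxx = sum((f - mean_f) ** 2 for f in freqs)
    if sxx == 0:
        return 0.0  # all bins at the same frequency: slope undefined
    sxy = sum((f - mean_f) * (l - mean_l) for f, l in zip(freqs, logs))
    return sxy / sxx

# A spectrum that decays exponentially with frequency has a negative slope.
freqs = [0.0, 1000.0, 2000.0, 3000.0]
decaying = [math.exp(-0.001 * f) for f in freqs]
print(spectral_slope(freqs, decaying) < 0)
```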

Feature Space Visualization

The 6-panel visualization includes a 2D projection of the 4D feature space (Pitch vs Centroid), showing the state clusters and their emission means.

This reveals how well-separated the timbre classes are and how the Gaussian emission model represents each state in feature space.

Z-score Normalization

All features are normalized to have mean = 0 and standard deviation = 1 before k-means clustering:

z = (x - μ) / σ
where μ = mean across all frames
σ = standard deviation across all frames

This ensures all features contribute equally to clustering regardless of their original scale.
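
As a minimal Python sketch of the normalization above, applied to one feature column at a time. The use of the population standard deviation (dividing by N) and the handling of constant features are assumptions; the script may differ on both.

```python
import math

def zscore(values):
    """Return (x - mean) / std for each value in one feature column."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0] * n  # constant feature: nothing to scale
    return [(v - mean) / std for v in values]

# Pitch in Hz and intensity in Pa end up on the same scale after normalization.
print(zscore([110.0, 220.0, 330.0, 440.0]))
print(zscore([0.01, 0.02, 0.03, 0.04]))
```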

Algorithm Details

Processing Pipeline

Step 1: Frame Extraction

Divide sound into overlapping frames using Hann window:

  • Frame size: User-defined (e.g., 20ms)
  • Hop size: User-defined (e.g., 10ms = 50% overlap)
  • Number of frames: ⌊(duration - frame_size) / hop_size⌋ + 1
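
The frame-count formula can be checked with a short Python helper. All arguments are in milliseconds to keep the arithmetic exact; the function name is hypothetical.

```python
import math

def frame_count(duration_ms, frame_ms, hop_ms):
    """floor((duration - frame_size) / hop_size) + 1, or 0 if the sound is too short."""
    if duration_ms < frame_ms:
        return 0
    return math.floor((duration_ms - frame_ms) / hop_ms) + 1

# A 10-second sound with 20ms frames and a 10ms hop yields 999 frames.
print(frame_count(10000, 20, 10))
```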

Step 2: Feature Extraction

For each frame, extract:

  1. Intensity via Get energy
  2. Pitch via To Pitch (ac) — autocorrelation method
  3. Spectral centroid via To Spectrum → Get centre of gravity
  4. Spectral slope via linear regression on log-magnitude spectrum

Store in 4×N feature matrix where N = number of frames

Step 3: Feature Normalization

Z-score normalization for each feature dimension:

  1. Compute mean μ and std σ across all N frames
  2. Transform: z = (x - μ) / σ
  3. Result: All features have mean 0, std 1

Step 4: K-means Clustering

Initialize K timbre states using k-means:

  1. Random initialization: Pick K random frames as initial centroids
  2. Assignment step: Assign each frame to nearest centroid (Euclidean distance in 4D)
  3. Update step: Recompute centroids as mean of assigned frames
  4. Repeat assignment-update until convergence or max iterations
  5. Result: Each frame has a state label ∈ {1, 2, ..., K}
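
The clustering steps above can be sketched in plain Python as follows. This is an illustrative version, not the script's code: labels are 0-based here (the guide uses 1..K), the seed parameter is an addition for reproducibility, and empty clusters simply keep their previous centroid.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=50, seed=0):
    """Return (labels, centroids) after k-means on a list of feature vectors."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # K random frames
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every frame.
        new_labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # converged: assignments no longer change
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

frames = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(frames, k=2)
print(labels)
```

Because the initial centroids are random, the numbering of the clusters can differ between seeds even when the grouping is the same.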

Step 5: Emission Model Estimation

For each state k, compute Gaussian parameters:

  1. Find all frames assigned to state k
  2. Compute mean μk = average feature vector
  3. Compute std σk = standard deviation for each dimension
  4. Store as emission model: bk(o) ~ N(μk, σk²)
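
A possible Python rendering of this estimation step is shown below. The min_std floor (which keeps single-frame states from having zero variance), the population standard deviation, and the use of None for empty states are assumptions added for the sketch.

```python
import math

def estimate_emissions(features, labels, k, min_std=1e-3):
    """Per-state (means, stds) lists; None for a state with no frames."""
    params = []
    for state in range(k):
        members = [f for f, lab in zip(features, labels) if lab == state]
        if not members:
            params.append(None)
            continue
        columns = list(zip(*members))  # one tuple of values per feature
        means = [sum(col) / len(col) for col in columns]
        stds = [max(math.sqrt(sum((x - m) ** 2 for x in col) / len(col)), min_std)
                for col, m in zip(columns, means)]
        params.append((means, stds))
    return params

features = [(0.0, 0.0), (2.0, 4.0), (10.0, 10.0)]
print(estimate_emissions(features, labels=[0, 0, 1], k=3))
```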

Step 6: Transition Matrix Learning

Build K×K transition matrix A:

  1. Initialize count matrix C[i][j] = 0
  2. For each consecutive pair (St, St+1), increment C[St][St+1]
  3. Normalize rows: A[i][j] = C[i][j] / Σk C[i][k]
  4. Add smoothing for zero-probability transitions (optional)
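
The counting and normalization steps can be sketched in Python like this. States are 0-based, the smoothing parameter is an additive pseudo-count, and the uniform fallback for states that never occur as a source is an assumption rather than documented script behavior.

```python
def learn_transitions(states, k, smoothing=0.0):
    """Row-normalized K x K matrix of transition probabilities."""
    counts = [[smoothing] * k for _ in range(k)]
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        if total > 0:
            matrix.append([c / total for c in row])
        else:
            matrix.append([1.0 / k] * k)  # unseen state: assume uniform
    return matrix

# Sequence 0,0,1,0,1,1 has transitions 0->0 (1x), 0->1 (2x), 1->0 (1x), 1->1 (1x).
print(learn_transitions([0, 0, 1, 0, 1, 1], k=2))
```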

Step 7: Viterbi Decoding (Optional)

Find optimal state path through input observations:

  1. Initialize: δ1(j) = πj × bj(o1) for all states j
  2. Recursion: δt(j) = max[δt-1(i) × aij] × bj(ot)
  3. Termination: Best path probability = max[δT(j)]
  4. Backtrack to recover state sequence

This gives a "clean" state sequence that respects both transition and emission probabilities.
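
The recursion can be sketched in Python as follows. This version works with log-probabilities (sums instead of products) to avoid underflow on long sequences, which is a common choice but not necessarily what the script does. Emission probabilities are passed in as a precomputed table, and all names are illustrative.

```python
import math

def to_log(p):
    """log(p), with log(0) mapped to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(pi, A, B):
    """Most likely state path. B[t][j] is the emission probability b_j(o_t)."""
    k = len(pi)
    delta = [to_log(pi[j]) + to_log(B[0][j]) for j in range(k)]  # initialization
    backpointers = []
    for t in range(1, len(B)):
        new_delta, ptr = [], []
        for j in range(k):
            best_i = max(range(k), key=lambda i: delta[i] + to_log(A[i][j]))
            ptr.append(best_i)
            new_delta.append(delta[best_i] + to_log(A[best_i][j]) + to_log(B[t][j]))
        delta = new_delta
        backpointers.append(ptr)
    state = max(range(k), key=lambda j: delta[j])  # termination
    path = [state]
    for ptr in reversed(backpointers):  # backtrack
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]
# The middle frame weakly favors state 1, but the sticky transitions win.
print(viterbi(pi, A, [[0.9, 0.1], [0.4, 0.6], [0.9, 0.1]]))  # [0, 0, 0]
```

The example shows the smoothing effect described above: a single ambiguous frame does not force a state change when self-transitions are strong.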

Step 8: Sequence Generation

Generate new state sequence of desired length:

  1. Choose random initial state (or use input distribution)
  2. For each position: Sample next state from P(St+1|St) using transition matrix row
  3. Result: New state sequence with learned transition probabilities
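
A short Python sketch of this sampling loop, using the standard library's weighted choice. The function and parameter names are hypothetical, and states are 0-based.

```python
import random

def generate_states(A, length, rng, start=None):
    """Sample a state sequence by repeatedly drawing from the current state's row of A."""
    if length <= 0:
        return []
    k = len(A)
    state = rng.randrange(k) if start is None else start
    sequence = [state]
    for _ in range(length - 1):
        state = rng.choices(range(k), weights=A[state])[0]
        sequence.append(state)
    return sequence

rng = random.Random(42)
print(generate_states([[0.8, 0.2], [0.3, 0.7]], length=12, rng=rng))
```

Passing a seeded random.Random makes a run repeatable; a different seed gives a different variation from the same matrix.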

Step 9: Probabilistic Frame Selection

For each generated state, select representative audio frame:

  1. Find all source frames assigned to this state
  2. Compute emission probability for each: P(frame features | state)
  3. Sample frame weighted by emission probabilities (higher = more likely)
  4. Extract corresponding audio segment from source

This creates more "typical" outputs compared to random frame selection.
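
The weighted selection can be sketched in Python as below. It assumes each candidate frame already has a log emission probability under the target state; subtracting the maximum before exponentiating is a numerical-stability step added for the sketch, not something the guide specifies.

```python
import math
import random

def pick_frame(frame_indices, log_probs, rng):
    """Choose one source frame, weighted by its emission probability."""
    peak = max(log_probs)
    weights = [math.exp(lp - peak) for lp in log_probs]  # best frame has weight 1
    return rng.choices(frame_indices, weights=weights)[0]

rng = random.Random(7)
# Frame 12 is the most typical for this state, so it is picked most often.
picks = [pick_frame([12, 40, 95], [-1.0, -3.0, -6.0], rng) for _ in range(1000)]
print(picks.count(12) > picks.count(40) > picks.count(95))
```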

Step 10: Audio Reconstruction

Concatenate selected frames with crossfading:

  1. For each frame, extract audio with configured crossfade overlap
  2. Apply crossfade: linear fade-out on previous frame, fade-in on current frame
  3. Sum overlapping regions
  4. If stereo output: Generate two independent sequences (different random seeds)
  5. Result: Smooth concatenation without clicks
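
As an illustration of the linear crossfade, here is a small Python sketch operating on lists of samples. The fade length is given in samples rather than milliseconds, and the function name is hypothetical; the script performs this step on Praat Sound objects.

```python
def crossfade_concat(grains, fade_len):
    """Join grains, overlapping each boundary by up to fade_len samples."""
    if not grains:
        return []
    out = list(grains[0])
    for grain in grains[1:]:
        n = min(fade_len, len(out), len(grain))
        for i in range(n):
            w = (i + 1) / (n + 1)  # fade-in weight; fade-out weight is 1 - w
            pos = len(out) - n + i
            out[pos] = out[pos] * (1 - w) + grain[i] * w
        out.extend(grain[n:])
    return out

# Two constant grains stay at constant level through the overlap.
print(crossfade_concat([[1.0] * 4, [1.0] * 4], fade_len=2))
```

Because the fade-out and fade-in weights sum to one at every sample, a steady signal keeps its level across the join.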

Optimization Techniques

Parameters Guide

Preset Selection

Preset: Optionmenu (6 options)

  • Custom — Manual parameter control
  • Fine Grain (subtle, 12 states) — 10ms frames, 5ms hop, 12 states, 400 frames output, 3ms crossfade
  • Coarse Grain (bold, 5 states) — 100ms frames, 50ms hop, 5 states, 80 frames output, 10ms crossfade
  • Textural (dense, 16 states) — 20ms frames, 10ms hop, 16 states, 600 frames output, 4ms crossfade
  • Rhythmic (pulse, 8 states) — 30ms frames, 15ms hop, 8 states, 128 frames output, 5ms crossfade
  • Experimental (glitchy, 24 states) — 8ms frames, 4ms hop, 24 states, 500 frames output, 1ms crossfade

When a preset is selected, all parameters below are automatically set. You can still manually adjust them afterward.

Feature Extraction

Frame_size_ms: Positive real (default: 20)

Duration of each analysis frame in milliseconds.

  • Smaller values (5-15ms): High time resolution, captures transients, percussive details. More frames = slower processing.
  • Medium values (20-40ms): Balanced time-frequency resolution. Good for most sounds.
  • Larger values (50-200ms): Better frequency resolution, smoother features. Captures sustained timbres well.

Rule of thumb: Use smaller frames for fast-changing sounds (drums, speech), larger frames for sustained sounds (drones, pads).

Frame_hop_ms: Positive real (default: 10)

Time step between consecutive frames in milliseconds.

  • Smaller hop: More overlap, smoother analysis, more frames (slower). 50% overlap is common (hop = 0.5 × frame_size).
  • Larger hop: Less overlap, faster processing, fewer frames, may miss rapid changes.
  • No overlap: hop = frame_size. Fastest but may create discontinuities.

Rule of thumb: Use hop = 0.5 × frame_size for most cases. Use smaller hop (0.25×) for very smooth analysis.

HMM Parameters

Number_of_states_K: Positive integer (default: 8)

Number of hidden timbre states (clusters) to discover.

  • Fewer states (3-6): Coarse timbral categories, bold contrasts, simpler model, faster processing.
  • Medium states (7-12): Balanced detail, captures most timbral variation without overfitting.
  • More states (13-20+): Fine timbral distinctions, risk of overfitting, slower processing.

Rule of thumb: Use K ≈ 0.3 × sqrt(number_of_frames) or 5-12 for most sounds. If output sounds too "samey", increase K. If too random/incoherent, decrease K.
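
The rule of thumb can be turned into a small Python helper. Clamping the result to the 5-12 range is one interpretation of the guidance above, not a rule built into the script, and the function name is hypothetical.

```python
import math

def suggest_k(n_frames, low=5, high=12):
    """0.3 * sqrt(number_of_frames), kept within [low, high]."""
    return max(low, min(high, round(0.3 * math.sqrt(n_frames))))

# Roughly 1000 frames (10 s at a 10ms hop) suggests about 9 states.
print(suggest_k(999))
```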

Max_kmeans_iterations: Positive integer (default: 50)

Maximum iterations for k-means clustering convergence.

  • Fewer iterations (10-30): Faster, may not fully converge, less stable clustering.
  • More iterations (50-100): Better convergence, more stable results, slightly slower.

Rule of thumb: 50 is usually sufficient. The algorithm often converges earlier. Increase if you see warnings about non-convergence in Info window.

Sequence Generation

Match_input_duration: Boolean (default: ON)

Automatically match output duration to input duration.

  • ON: Output length = number of input frames. Convenient for creating variations of same length.
  • OFF: Use manual Output_length_frames setting. Allows shorter or longer outputs.

Use case: Turn ON for quick variations. Turn OFF when you need precise control over output length.

Output_length_frames: Positive integer (default: 200)

Number of frames in generated sequence (only used if Match_input_duration is OFF).

  • Fewer frames (50-100): Short sequences, good for testing, loops, or brief textures.
  • Medium frames (150-300): Typical output length, allows full HMM behavior to emerge.
  • More frames (400+): Long sequences, extended evolution, may become repetitive.

Duration calculation: Output duration ≈ Output_length_frames × Frame_hop_ms / 1000 seconds

Example: 200 frames × 10ms = 2.0 seconds
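
The same arithmetic in Python, with the inverse calculation for choosing Output_length_frames from a target duration. Both helpers are illustrative and give the approximate duration only.

```python
def output_duration_s(n_frames, hop_ms):
    """Approximate output duration in seconds."""
    return n_frames * hop_ms / 1000.0

def frames_for_duration(duration_s, hop_ms):
    """Approximate Output_length_frames needed for a target duration."""
    return max(1, round(duration_s * 1000.0 / hop_ms))

print(output_duration_s(200, 10))    # 2.0 (default settings)
print(output_duration_s(600, 10))    # 6.0 (Textural preset)
print(frames_for_duration(2.0, 10))  # 200
```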

Output Settings

Crossfade_ms: Positive real (default: 5)

Crossfade duration between consecutive frames in milliseconds.

  • No crossfade (0ms): Potential clicks, abrupt transitions. Not recommended.
  • Short crossfade (1-5ms): Minimal smoothing, preserves transients, granular texture.
  • Medium crossfade (5-15ms): Smooth transitions, natural sound, default choice.
  • Long crossfade (20-50ms): Very smooth, blurred, may lose detail.

Rule of thumb: Use 5-10ms for most sounds. Decrease for percussive sounds to preserve attack. Increase for legato/sustained sounds.

Stereo_output: Boolean (default: ON)

Generate stereo (2-channel) or mono (1-channel) output.

  • ON: Two independent sequences with different random seeds. Wider stereo field, richer sound, twice the computation.
  • OFF: Single mono sequence. Faster processing, smaller file size.

Use case: Use stereo for final output, spatial music, or immersive textures. Use mono for testing or when stereo is not needed.

Draw_visualization: Boolean (default: ON)

Generate 6-panel comprehensive visualization in Picture window.

  • ON: Creates detailed visual analysis showing: (1) Original state sequence, (2) Transition matrix heatmap, (3) Generated state sequence, (4) Input/output waveforms, (5) State distribution histograms, (6) Feature space projection
  • OFF: No visualization. Faster processing.

Panels explained:

  1. Original State Sequence: State assignments over time in input sound
  2. Transition Matrix: Learned probabilities as color-coded grid (brighter = higher probability)
  3. Generated Sequence: New state sequence over time
  4. Waveforms: Input (top) and output (bottom) amplitude over time
  5. State Distributions: Histogram comparing state frequencies in input vs output
  6. Feature Space: 2D projection (Pitch vs Centroid) showing state clusters and emission means

Use case: Turn ON for analysis, debugging, or presentation. Turn OFF for batch processing or when visualization is not needed.

Show_info: Boolean (default: ON)

Display detailed statistics and diagnostics in Info window.

  • ON: Prints comprehensive information including: HMM model structure, state statistics, emission parameters, transition probabilities, generation details, convergence info
  • OFF: Minimal output. Only shows completion message.

Use case: Turn ON for first runs, debugging, or when you need technical details. Turn OFF for quiet batch processing.

Parameter Interactions

Applications

1. Algorithmic Composition

Generate musical material that evolves according to learned timbral patterns.

2. Sound Design

Create new textures and timbral evolutions.

3. Audio Analysis

Use the HMM as an analysis tool.

4. Generative Music

Build real-time or offline generation systems.

5. Music Information Retrieval

Extract meaningful timbral information.

6. Educational Uses

Teach timbre, probability, and signal processing.

Creative Workflows

Workflow 1: Variation Generator
  1. Select a musical phrase or texture as source
  2. Use Match_input_duration = ON to keep same length
  3. Generate 5-10 variations with different random seeds (run script multiple times)
  4. Arrange variations in a DAW to create evolving section
  5. Layer variations for complex textures

Workflow 2: Texture Morphing
  1. Prepare two contrasting source sounds (e.g., water, fire)
  2. Analyze each with same K value
  3. Manually blend transition matrices (external processing)
  4. Generate sequence from blended matrix
  5. Result: Hybrid texture with elements of both sources

Workflow 3: Timbral Sketching
  1. Record short improvisations exploring different timbres
  2. Analyze with high K (12-16) to capture nuances
  3. Generate long sequences (400+ frames) to develop ideas
  4. Use generated sequences as compositional sketches
  5. Refine and orchestrate based on generated material

Complete Workflow

Beginner Workflow: First Steps

  1. Prepare source sound:
    • Open or record a sound in Praat (5-20 seconds recommended)
    • Listen to it — understand its timbral content
    • Optionally trim to most interesting section
  2. Run script with defaults:
    • Select your Sound object
    • Run HMM_Timbre_Sequencing.praat
    • Keep all defaults (or try a preset like "Fine Grain")
    • Click OK
  3. Examine output:
    • Play the generated sound
    • Look at the 6-panel visualization in Picture window
    • Read the Info window statistics
  4. Understand what happened:
    • Original state sequence (top-left): Shows timbral evolution in source
    • Transition matrix (top-center): Shows which states follow others
    • Generated sequence (top-right): Shows new temporal ordering
    • Waveforms (middle-left): Compare input and output amplitude envelopes
    • State distributions (middle-right): Compare state frequencies
    • Feature space (bottom): Shows timbre clusters in 2D
  5. Experiment:
    • Try different presets to hear their effects
    • Adjust K up and down to change granularity
    • Toggle Stereo_output to hear mono vs stereo differences

Intermediate Workflow: Targeted Results

  1. Define your goal:
    • What kind of output do you want? (smooth, glitchy, rhythmic, textural, etc.)
    • What should it preserve from source? (harmonic content, rhythmic feel, spectral character)
    • What should it change? (temporal order, density, evolution)
  2. Choose appropriate preset as starting point:
    • Smooth, subtle: Fine Grain
    • Bold, chunky: Coarse Grain
    • Dense, complex: Textural
    • Pulsed, metric: Rhythmic
    • Extreme, granular: Experimental
  3. Fine-tune parameters:
    • Adjust K based on source complexity (more variety = higher K)
    • Adjust frame size for time resolution (transients = smaller frames)
    • Adjust crossfade for smoothness vs detail
  4. Iterate:
    • Generate multiple versions with slight parameter variations
    • Compare outputs to find optimal settings
    • Use Show_info to diagnose issues (e.g., states with zero counts)
  5. Post-process:
    • Export to audio editor or DAW
    • Layer multiple generations for richness
    • Apply effects (reverb, EQ, compression) to taste
    • Combine with other sounds in larger composition

Advanced Workflow: Maximum Control

  1. Source preparation:
    • Select source with clear timbral variety
    • Optionally: Pre-process with EQ or dynamics to emphasize certain timbres
    • Optionally: Concatenate multiple sources to learn from diverse material
  2. Parameter optimization:
    • Run with Show_info = ON to examine state statistics
    • Check if any states have very few frames (< 5) — if so, reduce K
    • Check k-means convergence iterations — if hitting max, increase max_iterations
    • Examine transition matrix in visualization — are there clear patterns or just noise?
  3. Feature analysis:
    • Study feature space plot to see state separation
    • If states overlap heavily, consider reducing K or different frame parameters
    • If states are very sparse, consider increasing K
  4. Generation control:
    • For specific duration, set Match_input_duration = OFF and Output_length_frames precisely
    • For stereo variety, generate with Stereo_output = ON and compare left/right channels
    • For multiple takes, run script repeatedly (different random seeds each time)
  5. Batch processing:
    • Script multiple sources with same parameters for consistency
    • Disable Draw_visualization and Show_info for faster processing
    • Use Praat's scripting to automate parameter sweeps
  6. External analysis:
    • Export transition matrix (copy from Info window if Show_info = ON)
    • Analyze in Python/R for deeper statistical understanding
    • Manually modify matrix and use in future generations (requires script modification)

Troubleshooting Common Issues

Problem: Output sounds too random/incoherent

Causes:

  • K too high (over-clustering)
  • Source too varied (no clear patterns)
  • Frame size too small (noisy features)

Solutions:

  • Reduce K to 5-8
  • Use longer frame size (30-50ms)
  • Choose source with clearer timbral structure

Problem: Output sounds too similar to input

Causes:

  • K too low (under-clustering)
  • Transition matrix too diagonal (strong self-transitions)
  • Source too homogeneous

Solutions:

  • Increase K to 12-16
  • Use more varied source material
  • Try different random seed (run again)

Problem: Clicks or artifacts in output

Causes:

  • Crossfade too short or zero
  • Frame size/hop mismatch
  • Very short frames (< 5ms)

Solutions:

  • Increase Crossfade_ms to 10-20ms
  • Use frame hop = 0.5 × frame size
  • Increase frame size to 15-20ms minimum

Problem: Script runs very slowly

Causes:

  • Very long source sound (> 60s)
  • Very small frame hop (many frames)
  • High K (> 20)
  • Visualization enabled

Solutions:

  • Trim source to 10-30 seconds
  • Increase frame hop (reduce overlap)
  • Reduce K to 8-12
  • Disable Draw_visualization
  • Use mono input instead of stereo

Problem: Some states have zero counts

Causes:

  • K too high for available data
  • K-means initialization unlucky
  • Some clusters too small

Solutions:

  • Reduce K
  • Increase max_kmeans_iterations for better convergence
  • Run again (different initialization)

Problem: Stereo output is identical in both channels

Causes:

  • Not expected in normal use: each channel uses a different random seed
  • A fully deterministic transition matrix (or a script error) can still make both channels follow the same state path

Solutions:

  • Check if transition matrix has only one non-zero entry per row (fully deterministic)
  • Try different source or parameters to create more probabilistic matrix

Troubleshooting

Error Messages

"Please select exactly one Sound object"

Cause: No Sound selected, or multiple Sounds selected, or wrong object type selected.

Solution: Select exactly one Sound object in the Objects window before running the script.

Performance Issues

Speed and Memory Tips:

1. Reduce frame count:

  • Shorter source sounds
  • Larger frame_hop (less overlap)
  • Longer frame_size (fewer frames total)

2. Optimize clustering:

  • Smaller K (fewer clusters)
  • Fewer max_kmeans_iterations
  • Disable Show_info and Draw_visualization for faster runs

3. Memory management:

  • Close unused Praat objects before running
  • Use mono sounds instead of stereo (half the data)
  • Lower sample rate if possible (22050 Hz often sufficient)

4. Batch processing:

  • Process multiple sounds in sequence
  • Save results and clear Objects window between runs
  • Use script automation for parameter sweeps

Limitations and Workarounds

Current Limitations:

1. First-order Markov assumption:

  • Limitation: Only considers immediate previous state, no long-term memory
  • Workaround: Use longer frame sizes to capture more context
  • Alternative: Generate multiple short sequences and manually arrange

2. Fixed feature set:

  • Limitation: Only 4 timbre features (intensity, pitch, centroid, slope)
  • Workaround: Modify script to add custom features (requires Praat scripting knowledge)
  • Alternative: Pre-process source to emphasize desired timbral aspects

3. Gaussian emission assumption:

  • Limitation: Assumes features are normally distributed within each state
  • Workaround: Feature normalization helps, but non-Gaussian distributions may not be captured well
  • Alternative: Use sources with relatively consistent timbres within each state

4. K-means initialization:

  • Limitation: K-means can converge to local optima, results may vary between runs
  • Workaround: Run script multiple times and choose best result
  • Alternative: Increase max_kmeans_iterations for better convergence

5. Probabilistic frame selection:

  • Limitation: Random frame selection weighted by emission probability, not guaranteed optimal
  • Workaround: Generate multiple outputs and select best
  • Alternative: Modify script for sequential or deterministic selection

Getting Help

Resources and Support:

1. Documentation:

  • This user guide
  • Script header comments in HMM_Timbre_Sequencing.praat
  • Example files and presets in repository

2. Community:

3. Further Development:

  • Fork GitHub repository for custom modifications
  • Contribute improvements via pull requests
  • Request features via GitHub issues
  • Share your results and creative applications

4. Academic References:

  • Hidden Markov Models: Rabiner (1989) tutorial paper
  • Timbre features: Peeters (2004) CUIDADO project
  • Audio descriptor analysis: McAdams (2013) CMJ paper
  • Granular synthesis: Roads (2001) Microsound book