Hidden Markov Model Timbre Sequencer — User Guide

True Hidden Markov Model (HMM) for timbre-based sequence generation with Gaussian emission models, Viterbi decoding, and comprehensive visualization.

Author: Shai Cohen
Affiliation: Department of Music, Bar-Ilan University, Israel
Version: 1.0 (2025) - True HMM Implementation
License: MIT License
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

Contents:

What this does
Quick start
HMM Theory
Timbre Features
Algorithm Details
Parameters Guide
Applications
Complete Workflow
Troubleshooting

What this does

This script implements a True Hidden Markov Model (HMM) for timbre-based sequence generation. Unlike basic Markov chains, this is a complete HMM with hidden states (timbre classes), Gaussian emission models, learned transition probabilities, and Viterbi decoding for state inference.

HMM Components:

Hidden States: Discovered timbre classes (via k-means initialization)
Observations: 4D feature vectors (intensity, pitch, centroid, slope)
Emission Model: Gaussian distributions per state
Transition Model: Learned state-to-state probabilities
Decoding: Viterbi algorithm to find most likely state path
Generation: Sample states → sample observations → synthesize

Key Features:

True HMM with emission probabilities
Viterbi decoding for state inference
Comprehensive 6-panel visualization
Emission-based probabilistic frame selection
State statistics and diagnostics
Match input duration option
Stereo output support
Optimized crossfade processing

Improvements in v1.0: This version implements a true Hidden Markov Model rather than a simple Markov chain. Key enhancements:

Gaussian emission models for each state, with a mean and standard deviation per feature
Viterbi decoding to find the most likely state sequence given the observations
Emission-based probabilistic sampling for more natural frame selection
6-panel visualization showing the original state sequence, transition matrix, generated sequence, input/output waveforms, state distributions, and feature space
Comprehensive state statistics, including state counts, durations, and emission parameters
Match input duration option for automatic length matching

Technical Implementation:

(1) Frame-based analysis: Divide audio into overlapping frames
(2) Feature extraction: Extract 4 timbre features per frame: intensity (RMS), pitch (F0), spectral centroid (brightness), spectral slope (high/low balance)
(3) Normalization: Z-score normalization (mean=0, std=1)
(4) K-means initialization: Group frames into K timbre states
(5) Emission modeling: Compute Gaussian parameters (mean, std) for each state
(6) Transition learning: Build probability matrix from state sequences
(7) Viterbi decoding: Find optimal state path through input
(8) Sequence generation: Generate new state sequence using transition probabilities
(9) Probabilistic synthesis: Sample frames based on emission probabilities
(10) Crossfade concatenation: Smooth grain assembly with configurable overlap

Quick start

In Praat Objects window, select a single Sound object (5-30 seconds recommended).
Open script: HMM_Timbre_Sequencing.praat
Choose a Preset or use Custom:
- Custom — Use your own parameter values
- Fine Grain — Subtle, 12 states (10ms frames, 5ms hop)
- Coarse Grain — Bold, 5 states (100ms frames, 50ms hop)
- Textural — Dense, 16 states (20ms frames, 10ms hop)
- Rhythmic — Pulse, 8 states (30ms frames, 15ms hop)
- Experimental — Glitchy, 24 states (8ms frames, 4ms hop)
Set Feature Extraction parameters:
- Frame_size_ms: 20ms (analysis window)
- Frame_hop_ms: 10ms (50% overlap)
Configure HMM Parameters:
- Number_of_states_K: 8 (timbre clusters)
- Max_kmeans_iterations: 50 (clustering iterations)
Set Sequence Generation options:
- Match_input_duration: ON (match source length) or OFF (use custom length)
- Output_length_frames: 200 (only used if Match_input_duration is OFF)
Configure Output settings:
- Crossfade_ms: 5ms (grain crossfade)
- Stereo_output: ON (stereo) or OFF (mono)
- Draw_visualization: ON (6-panel visualization)
- Show_info: ON (detailed statistics)
Click OK — script analyzes, learns HMM, generates sequence
Output appears in Objects window with preset name (e.g., "FineGrain_HMM_Sequence")
Check Info window for detailed HMM statistics (if Show_info enabled)
View Picture window for 6-panel visualization (if Draw_visualization enabled)

Quick tip: Set Match_input_duration = ON to generate a sequence the same length as the input sound, which is convenient for creating variations of equal duration. For a specific length, turn it OFF and set Output_length_frames manually; output duration ≈ Output_length_frames × Frame_hop_ms, so 200 frames × 10ms = 2.0 seconds. Stereo_output generates two independent sequences, one per channel, for a wider stereo image. Draw_visualization produces a 6-panel view of the HMM analysis and generation process.

Important: Select exactly one Sound object before running the script. The script will exit if you select zero or multiple sounds. For best results, use sounds between 5-30 seconds in length. Very short sounds (< 2 seconds) may not provide enough data for meaningful HMM learning. Very long sounds (> 60 seconds) may slow down processing significantly.

About Presets

The script includes 6 presets that configure all parameters for different sonic results:

Fine Grain: High resolution (12 states, 10ms frames, 5ms hop) — captures subtle timbral nuances, smooth transitions, good for detailed textures
Coarse Grain: Low resolution (5 states, 100ms frames, 50ms hop) — bold timbral shifts, rhythmic chunks, good for dramatic contrasts
Textural: Dense (16 states, 20ms frames, 10ms hop) — complex timbral palette, rich textures, 600 frames output for extended morphing
Rhythmic: Pulsed (8 states, 30ms frames, 15ms hop) — balanced resolution, 128 frames output creates clear rhythmic patterns
Experimental: Extreme (24 states, 8ms frames, 4ms hop) — maximum timbral detail, granular, glitchy, unpredictable sequences
Custom: Manual control — set all parameters yourself for specific needs

When you select a preset, all frame size, hop, K, output length, and crossfade parameters are automatically configured. You can still modify individual parameters after selecting a preset.

HMM Theory

What is a Hidden Markov Model?

A Hidden Markov Model is a statistical model that describes a system with hidden states that cannot be directly observed. Instead, we observe outputs (emissions) that are probabilistically related to the hidden states.

HMM Structure:

States (S): Hidden timbre classes (e.g., "bright", "dark", "noisy", etc.)
Observations (O): Measurable features extracted from audio frames
Transition Probabilities P(St+1|St): Probability of moving from the current state to the next state (from state i to state j, written aij)
Emission Probabilities P(O|S): Probability of observing features given a state
Initial Probabilities P(S1): Starting state distribution

HMM vs Simple Markov Chain

Aspect	Simple Markov Chain	Hidden Markov Model
States	Directly observed	Hidden, inferred from observations
Observations	States themselves	Separate from states
Emission model	None	Probabilistic (Gaussian in this script)
State inference	Direct assignment	Viterbi algorithm
Generation	Follow transitions	Sample states, then sample observations
Flexibility	Rigid state assignments	Probabilistic, handles uncertainty

Key Algorithms

1. K-means Initialization

Groups similar feature vectors into K clusters to initialize hidden states. Each cluster becomes a timbre class.

2. Emission Modeling

For each state k, compute Gaussian parameters:

Mean μk = average feature vector for all frames in state k
Std σk = standard deviation for each feature dimension

Emission probability: P(observation | state k) ~ N(μk, σk²)

3. Transition Matrix Learning

Count transitions from state i to state j across the input sequence, then normalize to get probabilities.

P(j|i) = count(i→j) / Σ count(i→k) for all k

4. Viterbi Decoding

Finds the most likely state sequence given the observations using dynamic programming. For each frame, computes the most probable path to each state considering both transition and emission probabilities.

5. Sequence Generation

Generates new state sequences by sampling from the transition matrix, starting from a random state and following transition probabilities.

6. Probabilistic Frame Selection

For each generated state, selects audio frames weighted by their emission probability (how "typical" they are for that state). This creates more natural-sounding output than random selection.

Mathematical Formulation

HMM Parameters

λ = (A, B, π)
A = {aij} transition probabilities, aij = P(St+1 = j | St = i)
B = {bj(o)} emission probabilities, bj(o) = P(Ot = o | St = j)
π = {πi} initial state probabilities, πi = P(S1 = i)

Gaussian Emission

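bj(o) = (1/√(2πσj²)) × exp(-(o - μj)² / (2σj²))
where μj = mean of the feature for state j
σj = standard deviation of the feature for state j

The formula above is written for a single feature. Because each state stores a separate mean and standard deviation per feature dimension, the natural reading is that the density of a full 4D observation is the product of the four per-feature densities (independent dimensions, i.e. a diagonal covariance).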

Viterbi Algorithm

δt(j) = max_i [δt-1(i) × aij] × bj(ot)
where δt(j) = probability of the most likely path ending in state j at time t, and the maximum is taken over all previous states i

Timbre Features

The script extracts four timbre descriptors from each audio frame to create a 4-dimensional feature space:

1. Intensity (RMS Energy)

What it measures: Overall loudness/energy of the frame

Calculation: Root Mean Square of amplitude values

RMS = sqrt(Σ(x²) / N)

Timbral meaning: Distinguishes loud vs quiet passages, attacks vs decays

Normalization: Z-score normalized across all frames

2. Pitch (Fundamental Frequency)

What it measures: Perceived pitch/F0 in Hertz

Calculation: Autocorrelation-based F0 estimation (Praat's pitch analysis)

Timbral meaning: Harmonic content, spectral structure, perceived height

Handling undefined pitch: Frames without clear pitch are assigned 0 Hz before normalization

Normalization: Z-score normalized across all frames

3. Spectral Centroid

What it measures: "Center of mass" of the spectrum — brightness

Calculation: Weighted average of frequency bins by their magnitudes

Centroid = Σ(f × M(f)) / Σ M(f)
where f = frequency, M(f) = magnitude at f

Timbral meaning: Bright (high centroid) vs dark (low centroid) sounds

Normalization: Z-score normalized across all frames
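
As an illustration of the centroid formula above, here is a minimal Python sketch. It is not the script's Praat code (the script obtains the value from Praat's spectrum analysis), and the function name and toy spectra are hypothetical.

```python
def spectral_centroid(freqs_hz, magnitudes):
    """Magnitude-weighted mean frequency: sum(f * M(f)) / sum(M(f))."""
    total = sum(magnitudes)
    if total == 0:
        return 0.0  # silent frame: no meaningful centroid (assumed fallback)
    return sum(f * m for f, m in zip(freqs_hz, magnitudes)) / total

freqs = [100.0, 200.0, 300.0]
dark = spectral_centroid(freqs, [3.0, 2.0, 1.0])    # energy concentrated low
bright = spectral_centroid(freqs, [1.0, 2.0, 3.0])  # energy concentrated high
```

With these toy spectra the second centroid is higher than the first, matching the "bright vs dark" reading.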

4. Spectral Slope

What it measures: Balance between high and low frequencies

Calculation: Linear regression of log-magnitude spectrum

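Slope = regression coefficient (gradient) of log(magnitude) against frequency

Timbral meaning: A steeply negative slope means energy falls off quickly with frequency (low-heavy, dull); a shallow or positive slope means relatively more high-frequency energy (high-heavy, bright)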

Normalization: Z-score normalized across all frames
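
The regression behind the slope feature can be sketched in plain Python as an ordinary least-squares fit of log-magnitude against frequency. This is an illustrative sketch, not the script's Praat code; the function name, the magnitude floor, and the test spectra are assumptions.

```python
import math

def spectral_slope(freqs_hz, magnitudes, floor=1e-12):
    """Least-squares gradient of log(magnitude) against frequency."""
    logs = [math.log(max(m, floor)) for m in magnitudes]  # floor avoids log(0)
    n = len(freqs_hz)
    mean_f = sum(freqs_hz) / n
    mean_l = sum(logs) / n
    cov = sum((f - mean_f) * (l - mean_l) for f, l in zip(freqs_hz, logs))
    var = sum((f - mean_f) ** 2 for f in freqs_hz)
    return cov / var

freqs = [100.0, 200.0, 300.0, 400.0]
falling = [math.exp(-0.01 * f) for f in freqs]   # magnitude decays with frequency
rising = [math.exp(0.005 * f) for f in freqs]    # magnitude grows with frequency
slope_falling = spectral_slope(freqs, falling)
slope_rising = spectral_slope(freqs, rising)
```

A spectrum whose magnitude decays with frequency yields a negative gradient, and one that grows with frequency yields a positive gradient.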

Feature Space Visualization

The 6-panel visualization includes a 2D projection of the 4D feature space (Pitch vs Centroid), showing:

Small colored dots: Individual frames colored by their state assignment
Large circles with black outlines: Emission means (μ) for each state
State labels: Numbers inside the mean circles

This reveals how well-separated the timbre classes are and how the Gaussian emission model represents each state in feature space.

Z-score Normalization

All features are normalized to have mean = 0 and standard deviation = 1 before k-means clustering:

z = (x - μ) / σ
where μ = mean across all frames
σ = standard deviation across all frames

This ensures all features contribute equally to clustering regardless of their original scale.
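
A minimal Python sketch of the z-score step for one feature column follows. It is illustrative only: the function name is hypothetical, and the use of the population standard deviation (dividing by N rather than N-1) is an assumption about a detail the guide does not specify.

```python
import math

def zscore(values):
    """Return values rescaled to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population std (assumed)
    if std == 0:
        return [0.0] * n  # constant feature: nothing to scale
    return [(v - mean) / std for v in values]

pitch_hz = [0.0, 220.0, 440.0, 220.0]  # 0 Hz stands for an unvoiced frame
z = zscore(pitch_hz)
```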

Algorithm Details

Processing Pipeline

Step 1: Frame Extraction

Divide sound into overlapping frames using Hann window:

Frame size: User-defined (e.g., 20ms)
Hop size: User-defined (e.g., 10ms = 50% overlap)
Number of frames: ⌊(duration - frame_size) / hop_size⌋ + 1
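
The frame-count formula can be checked with a short Python sketch (illustrative only; the function name is hypothetical). All durations are passed in milliseconds so that the arithmetic stays exact.

```python
def frame_count(duration_ms, frame_size_ms, frame_hop_ms):
    """floor((duration - frame_size) / hop_size) + 1, or 0 if no frame fits."""
    if duration_ms < frame_size_ms:
        return 0
    return int((duration_ms - frame_size_ms) // frame_hop_ms) + 1

n_default = frame_count(10000, 20, 10)   # 10 s sound, 20ms frames, 10ms hop
n_coarse = frame_count(10000, 100, 50)   # same sound, Coarse Grain framing
```

For a 10-second sound this gives 999 frames at the default settings and 199 frames with 100ms frames and a 50ms hop.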

Step 2: Feature Extraction

For each frame, extract:

Intensity via Get energy
Pitch via To Pitch (ac) — autocorrelation method
Spectral centroid via To Spectrum → Get centre of gravity
Spectral slope via linear regression on log-magnitude spectrum

Store in 4×N feature matrix where N = number of frames

Step 3: Feature Normalization

Z-score normalization for each feature dimension:

Compute mean μ and std σ across all N frames
Transform: z = (x - μ) / σ
Result: All features have mean 0, std 1

Step 4: K-means Clustering

Initialize K timbre states using k-means:

Random initialization: Pick K random frames as initial centroids
Assignment step: Assign each frame to nearest centroid (Euclidean distance in 4D)
Update step: Recompute centroids as mean of assigned frames
Repeat assignment-update until convergence or max iterations
Result: Each frame has a state label ∈ {1, 2, ..., K}
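
The assignment-update loop above can be sketched in plain Python as follows. This is a generic k-means sketch under stated assumptions, not the script's Praat implementation: the function name, the seed handling, and the rule for empty clusters are hypothetical, the labels are 0-based rather than 1-based, and the toy points are 2D instead of 4D for brevity.

```python
import random

def kmeans(points, k, max_iter=50, tol=1e-3, seed=0):
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # K random frames as seeds
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for n, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[n] = dists.index(min(dists))
        # Update step: each centroid becomes the mean of its assigned points.
        shift = 0.0
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                continue  # empty cluster keeps its old centroid (assumed rule)
            new = [sum(col) / len(members) for col in zip(*members)]
            shift = max(shift, max(abs(a - b) for a, b in zip(new, centroids[j])))
            centroids[j] = new
        if shift < tol:  # early convergence: centroids have stabilized
            break
    return labels, centroids

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
labels, centroids = kmeans(points, k=2)
```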

Step 5: Emission Model Estimation

For each state k, compute Gaussian parameters:

Find all frames assigned to state k
Compute mean μk = average feature vector
Compute std σk = standard deviation for each dimension
Store as emission model: bk(o) ~ N(μk, σk²)
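
A hedged Python sketch of this step is shown below. It assumes independent feature dimensions (a diagonal Gaussian), 0-based state labels, and a small floor on the standard deviation to avoid division by zero; the function names, the floor, and the fallback for empty states are illustrative and not taken from the script.

```python
import math

def fit_emissions(features, labels, k, min_std=1e-3):
    """Per-state mean and std for each feature dimension."""
    dims = len(features[0])
    means, stds = [], []
    for j in range(k):
        members = [f for f, lab in zip(features, labels) if lab == j]
        if not members:  # empty state: fall back to a unit Gaussian (assumed)
            means.append([0.0] * dims)
            stds.append([1.0] * dims)
            continue
        mu = [sum(col) / len(members) for col in zip(*members)]
        sd = [max(min_std, math.sqrt(sum((x - m) ** 2 for x in col) / len(members)))
              for col, m in zip(zip(*members), mu)]
        means.append(mu)
        stds.append(sd)
    return means, stds

def log_emission(obs, mu, sd):
    """Log of the product of independent 1-D Gaussian densities."""
    return sum(-0.5 * math.log(2 * math.pi * s * s) - (o - m) ** 2 / (2 * s * s)
               for o, m, s in zip(obs, mu, sd))

features = [(0.0, 0.0), (2.0, 0.0), (10.0, 1.0), (12.0, 1.0)]
labels = [0, 0, 1, 1]
means, stds = fit_emissions(features, labels, k=2)
```

Working with log densities avoids numerical underflow when many small probabilities are multiplied.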

Step 6: Transition Matrix Learning

Build K×K transition matrix A:

Initialize count matrix C[i][j] = 0
For each consecutive pair (St, St+1), increment C[St][St+1]
Normalize rows: A[i][j] = C[i][j] / Σk C[i][k]
Add smoothing for zero-probability transitions (optional)
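
The counting and normalization steps above can be sketched in Python as follows. The function name, the additive `smoothing` parameter, the uniform fallback row, and the 0-based state labels are assumptions made for illustration.

```python
def transition_matrix(states, k, smoothing=0.0):
    counts = [[smoothing] * k for _ in range(k)]
    for a, b in zip(states, states[1:]):  # consecutive pairs (S_t, S_t+1)
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state with no outgoing transitions gets a uniform row (assumed rule).
        matrix.append([c / total for c in row] if total > 0 else [1.0 / k] * k)
    return matrix

states = [0, 0, 1, 0, 1, 1, 0]
A = transition_matrix(states, k=2)
```

In this toy sequence state 0 is followed by state 0 once and by state 1 twice, so the first row is [1/3, 2/3].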

Step 7: Viterbi Decoding (Optional)

Find optimal state path through input observations:

Initialize: δ1(j) = πj × bj(o1) for all states j
Recursion: δt(j) = max[δt-1(i) × aij] × bj(ot)
Termination: Best path probability = max[δT(j)]
Backtrack to recover state sequence

This gives a "clean" state sequence that respects both transition and emission probabilities.
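
The recursion and backtracking can be sketched in Python as below. The sketch works with log probabilities, a common way to avoid underflow on long sequences; whether the script does the same is not stated in this guide. The function names and the toy model are hypothetical, and states are 0-based.

```python
import math

def viterbi(log_pi, log_A, log_B):
    """log_B[t][j] is the log emission of frame t under state j. Returns the best path."""
    T, K = len(log_B), len(log_pi)
    delta = [log_pi[j] + log_B[0][j] for j in range(K)]  # initialization
    back = []
    for t in range(1, T):
        new_delta, ptr = [], []
        for j in range(K):
            best_i = max(range(K), key=lambda i: delta[i] + log_A[i][j])
            ptr.append(best_i)
            new_delta.append(delta[best_i] + log_A[best_i][j] + log_B[t][j])
        delta = new_delta
        back.append(ptr)
    state = max(range(K), key=lambda j: delta[j])  # termination
    path = [state]
    for ptr in reversed(back):  # backtrack through the stored pointers
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

def logs(row):
    return [math.log(p) for p in row]

log_pi = logs([0.6, 0.4])
log_A = [logs([0.9, 0.1]), logs([0.1, 0.9])]  # strongly self-transitioning states
log_B = [logs(r) for r in [[0.9, 0.1], [0.9, 0.1], [0.4, 0.6], [0.9, 0.1]]]
path = viterbi(log_pi, log_A, log_B)
```

In this example the third frame slightly favors state 1 on emission alone, but the strong self-transitions keep the decoded path in state 0 throughout, which is the smoothing effect described above.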

Step 8: Sequence Generation

Generate new state sequence of desired length:

Choose random initial state (or use input distribution)
For each position: Sample next state from P(St+1|St) using transition matrix row
Result: New state sequence with learned transition probabilities
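
A minimal Python sketch of this sampling loop follows. The function name and the optional `start` argument are illustrative assumptions; states are 0-based.

```python
import random

def generate_states(A, length, rng, start=None):
    k = len(A)
    state = rng.randrange(k) if start is None else start  # random initial state
    seq = [state]
    for _ in range(length - 1):
        # Sample the next state from the current state's transition row.
        state = rng.choices(range(k), weights=A[state])[0]
        seq.append(state)
    return seq

# A fully deterministic matrix makes the behavior easy to verify by eye.
A = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
seq = generate_states(A, 7, random.Random(1), start=0)
```

With one non-zero entry per row the output is a fixed cycle; a learned matrix with several non-zero entries per row produces a different sequence on each run.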

Step 9: Probabilistic Frame Selection

For each generated state, select representative audio frame:

Find all source frames assigned to this state
Compute emission probability for each: P(frame features | state)
Sample frame weighted by emission probabilities (higher = more likely)
Extract corresponding audio segment from source

This creates more "typical" outputs compared to random frame selection.
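
Emission-weighted selection can be sketched in Python as below. The sketch assumes the emission values are available as log densities and converts them to sampling weights; the function name and the example frame indices are hypothetical.

```python
import math
import random

def pick_frame(candidates, log_probs, rng):
    """candidates: frame indices in this state; log_probs: their log emission values."""
    peak = max(log_probs)
    # Subtracting the peak before exponentiating avoids underflow to zero weights.
    weights = [math.exp(lp - peak) for lp in log_probs]
    return rng.choices(candidates, weights=weights)[0]

rng = random.Random(0)
frames = [12, 40, 77]             # source frames assigned to one state
log_probs = [-1.0, -1.5, -50.0]   # frame 77 is a strong outlier for this state
picks = [pick_frame(frames, log_probs, rng) for _ in range(1000)]
```

Frames close to the state's emission mean are chosen most often, while outliers such as frame 77 are almost never selected.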

Step 10: Audio Reconstruction

Concatenate selected frames with crossfading:

For each frame, extract audio with configured crossfade overlap
Apply crossfade: linear fade-out on previous frame, fade-in on current frame
Sum overlapping regions
If stereo output: Generate two independent sequences (different random seeds)
Result: Smooth concatenation without clicks
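
The crossfade-and-sum idea can be sketched in Python on plain lists of samples. This is an illustrative sketch rather than the script's Praat code; the function name is hypothetical, and the overlap is given in samples (for example, round(Crossfade_ms / 1000 × sample rate)).

```python
def crossfade_concat(grains, overlap):
    """Join sample lists, overlapping neighbours by `overlap` samples with linear fades."""
    out = list(grains[0])
    for grain in grains[1:]:
        n = min(overlap, len(out), len(grain))
        for i in range(n):
            fade_in = (i + 1) / (n + 1)   # rises linearly across the overlap
            tail = len(out) - n + i
            # Fade out the previous grain, fade in the current one, and sum.
            out[tail] = out[tail] * (1.0 - fade_in) + grain[i] * fade_in
        out.extend(grain[n:])
    return out

a = [1.0] * 6
b = [1.0] * 6
joined = crossfade_concat([a, b], overlap=2)
```

Because the two fade gains always sum to 1, a constant signal passes through the join unchanged, and each join shortens the total length by the overlap.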

Optimization Techniques

Global analysis objects: Create Intensity, Pitch, Spectrum once per source sound, reuse for all frames
Vectorized operations: Store features in arrays for batch processing
Early convergence: Stop k-means when centroids stabilize (< 0.001 change)
Efficient frame extraction: Use Extract part for frame isolation instead of copying entire sound
Memory management: Remove temporary objects immediately after use

Parameters Guide

Preset Selection

Preset: Optionmenu (6 options)

Custom — Manual parameter control
Fine Grain (subtle, 12 states) — 10ms frames, 5ms hop, 12 states, 400 frames output, 3ms crossfade
Coarse Grain (bold, 5 states) — 100ms frames, 50ms hop, 5 states, 80 frames output, 10ms crossfade
Textural (dense, 16 states) — 20ms frames, 10ms hop, 16 states, 600 frames output, 4ms crossfade
Rhythmic (pulse, 8 states) — 30ms frames, 15ms hop, 8 states, 128 frames output, 5ms crossfade
Experimental (glitchy, 24 states) — 8ms frames, 4ms hop, 24 states, 500 frames output, 1ms crossfade

When a preset is selected, all parameters below are automatically set. You can still manually adjust them afterward.

Feature Extraction

Frame_size_ms: Positive real (default: 20)

Duration of each analysis frame in milliseconds.

Smaller values (5-15ms): High time resolution, captures transients, percussive details. More frames = slower processing.
Medium values (20-40ms): Balanced time-frequency resolution. Good for most sounds.
Larger values (50-200ms): Better frequency resolution, smoother features. Captures sustained timbres well.

Rule of thumb: Use smaller frames for fast-changing sounds (drums, speech), larger frames for sustained sounds (drones, pads).

Frame_hop_ms: Positive real (default: 10)

Time step between consecutive frames in milliseconds.

Smaller hop: More overlap, smoother analysis, more frames (slower). 50% overlap is common (hop = 0.5 × frame_size).
Larger hop: Less overlap, faster processing, fewer frames, may miss rapid changes.
No overlap: hop = frame_size. Fastest but may create discontinuities.

Rule of thumb: Use hop = 0.5 × frame_size for most cases. Use smaller hop (0.25×) for very smooth analysis.

HMM Parameters

Number_of_states_K: Positive integer (default: 8)

Number of hidden timbre states (clusters) to discover.

Fewer states (3-6): Coarse timbral categories, bold contrasts, simpler model, faster processing.
Medium states (7-12): Balanced detail, captures most timbral variation without overfitting.
More states (13-20+): Fine timbral distinctions, risk of overfitting, slower processing.

Rule of thumb: Use K ≈ 0.3 × sqrt(number_of_frames) or 5-12 for most sounds. If output sounds too "samey", increase K. If too random/incoherent, decrease K.

Max_kmeans_iterations: Positive integer (default: 50)

Maximum iterations for k-means clustering convergence.

Fewer iterations (10-30): Faster, may not fully converge, less stable clustering.
More iterations (50-100): Better convergence, more stable results, slightly slower.

Rule of thumb: 50 is usually sufficient. The algorithm often converges earlier. Increase if you see warnings about non-convergence in Info window.

Sequence Generation

Match_input_duration: Boolean (default: ON)

Automatically match output duration to input duration.

ON: Output length = number of input frames. Convenient for creating variations of same length.
OFF: Use manual Output_length_frames setting. Allows shorter or longer outputs.

Use case: Turn ON for quick variations. Turn OFF when you need precise control over output length.

Output_length_frames: Positive integer (default: 200)

Number of frames in generated sequence (only used if Match_input_duration is OFF).

Fewer frames (50-100): Short sequences, good for testing, loops, or brief textures.
Medium frames (150-300): Typical output length, allows full HMM behavior to emerge.
More frames (400+): Long sequences, extended evolution, may become repetitive.

Duration calculation: Output duration ≈ Output_length_frames × Frame_hop_ms / 1000 seconds

Example: 200 frames × 10ms = 2.0 seconds
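
The same arithmetic, and its inverse for choosing a frame count from a target duration, as a small Python sketch (the function names are hypothetical):

```python
def output_duration_s(output_length_frames, frame_hop_ms):
    """Approximate output duration in seconds."""
    return output_length_frames * frame_hop_ms / 1000.0

def frames_for_duration(target_s, frame_hop_ms):
    """Frame count that gives approximately target_s seconds of output."""
    return round(target_s * 1000.0 / frame_hop_ms)

default_s = output_duration_s(200, 10)       # defaults: 200 frames, 10ms hop
rhythmic_n = frames_for_duration(3.0, 15)    # 3 seconds at the Rhythmic preset's hop
```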

Output Settings

Crossfade_ms: Positive real (default: 5)

Crossfade duration between consecutive frames in milliseconds.

No crossfade (0ms): Potential clicks, abrupt transitions. Not recommended.
Short crossfade (1-5ms): Minimal smoothing, preserves transients, granular texture.
Medium crossfade (5-15ms): Smooth transitions, natural sound, default choice.
Long crossfade (20-50ms): Very smooth, blurred, may lose detail.

Rule of thumb: Use 5-10ms for most sounds. Decrease for percussive sounds to preserve attack. Increase for legato/sustained sounds.

Stereo_output: Boolean (default: ON)

Generate stereo (2-channel) or mono (1-channel) output.

ON: Two independent sequences with different random seeds. Wider stereo field, richer sound, twice the computation.
OFF: Single mono sequence. Faster processing, smaller file size.

Use case: Use stereo for final output, spatial music, or immersive textures. Use mono for testing or when stereo is not needed.

Draw_visualization: Boolean (default: ON)

Generate 6-panel comprehensive visualization in Picture window.

ON: Creates detailed visual analysis showing: (1) Original state sequence, (2) Transition matrix heatmap, (3) Generated state sequence, (4) Input/output waveforms, (5) State distribution histograms, (6) Feature space projection
OFF: No visualization. Faster processing.

Panels explained:

Original State Sequence: State assignments over time in input sound
Transition Matrix: Learned probabilities as color-coded grid (brighter = higher probability)
Generated Sequence: New state sequence over time
Waveforms: Input (top) and output (bottom) amplitude over time
State Distributions: Histogram comparing state frequencies in input vs output
Feature Space: 2D projection (Pitch vs Centroid) showing state clusters and emission means

Use case: Turn ON for analysis, debugging, or presentation. Turn OFF for batch processing or when visualization is not needed.

Show_info: Boolean (default: ON)

Display detailed statistics and diagnostics in Info window.

ON: Prints comprehensive information including: HMM model structure, state statistics, emission parameters, transition probabilities, generation details, convergence info
OFF: Minimal output. Only shows completion message.

Use case: Turn ON for first runs, debugging, or when you need technical details. Turn OFF for quiet batch processing.

Parameter Interactions

Frame_size vs Frame_hop: Typical ratio is 2:1 (e.g., 20ms frame, 10ms hop). Larger ratios (e.g., 4:1) = more overlap = smoother but slower.
K vs Frame count: More frames support more states. Don't use K > 30-40% of frame count.
Output_length vs Frame_hop: Longer hop = longer real-time duration for same frame count.
Crossfade vs Frame_hop: Crossfade should be ≤ Frame_hop to avoid excessive blurring.
Stereo vs Processing time: Stereo roughly doubles generation time (two independent sequences).

Applications

1. Algorithmic Composition

Generate musical material that evolves according to learned timbral patterns:

Variation generation: Create multiple versions of a musical phrase with different temporal orderings but same timbral palette
Motivic development: Extract timbral "motifs" from one sound and apply to another
Form building: Generate long sequences that naturally develop timbre over time
Orchestration ideas: Analyze orchestral textures to discover typical timbre successions

2. Sound Design

Create new textures and timbral evolutions:

Texture synthesis: Generate evolving ambient textures from field recordings
Glitch effects: Use high K and short frames for granular, glitchy results
Morphing sequences: Smooth timbral transitions learned from source material
Rhythmic patterns: Extract and recombine rhythmic elements from loops

3. Audio Analysis

Use HMM as an analysis tool:

Timbre segmentation: Identify distinct timbral regions in recordings
Style analysis: Compare transition matrices between different genres/performers
Performance analysis: Study how performers navigate timbral space over time
Similarity metrics: Use emission models to measure timbral distance

4. Generative Music

Real-time or offline generation systems:

Live performance: Pre-compute HMMs from live input, generate variations on the fly
Installation art: Continuously evolving soundscapes based on environmental recordings
Interactive systems: User selects K, frame size, etc. to explore parameter space
Crossfading installations: Generate multiple sequences and crossfade between them

5. Music Information Retrieval

Extract meaningful timbral information:

Instrument recognition: Train HMMs on different instruments, compare likelihoods
Audio fingerprinting: Use transition matrices as compact timbral signatures
Cover song detection: Compare timbral evolution patterns across versions
Genre classification: Different genres may have characteristic transition patterns

6. Educational Uses

Teaching timbre, probability, and signal processing:

Timbre perception: Demonstrate how timbre can be quantified and manipulated
Markov models: Concrete audio example of abstract probabilistic models
Feature extraction: Visualize spectral features in musical context
Experimental composition: Students create pieces using HMM-generated material

Creative Workflows

Workflow 1: Variation Generator

Select a musical phrase or texture as source
Use Match_input_duration = ON to keep same length
Generate 5-10 variations with different random seeds (run script multiple times)
Arrange variations in a DAW to create evolving section
Layer variations for complex textures

Workflow 2: Texture Morphing

Prepare two contrasting source sounds (e.g., water, fire)
Analyze each with same K value
Manually blend transition matrices (external processing)
Generate sequence from blended matrix
Result: Hybrid texture with elements of both sources

Workflow 3: Timbral Sketching

Record short improvisations exploring different timbres
Analyze with high K (12-16) to capture nuances
Generate long sequences (400+ frames) to develop ideas
Use generated sequences as compositional sketches
Refine and orchestrate based on generated material

Complete Workflow

Beginner Workflow: First Steps

Prepare source sound:
- Open or record a sound in Praat (5-20 seconds recommended)
- Listen to it — understand its timbral content
- Optionally trim to most interesting section
Run script with defaults:
- Select your Sound object
- Run HMM_Timbre_Sequencing.praat
- Keep all defaults (or try a preset like "Fine Grain")
- Click OK
Examine output:
- Play the generated sound
- Look at the 6-panel visualization in Picture window
- Read the Info window statistics
Understand what happened:
- Original state sequence (top-left): Shows timbral evolution in source
- Transition matrix (top-center): Shows which states follow others
- Generated sequence (top-right): Shows new temporal ordering
- Waveforms (middle-left): Compare input and output amplitude envelopes
- State distributions (middle-right): Compare state frequencies
- Feature space (bottom): Shows timbre clusters in 2D
Experiment:
- Try different presets to hear their effects
- Adjust K up and down to change granularity
- Toggle Stereo_output to hear mono vs stereo differences

Intermediate Workflow: Targeted Results

Define your goal:
- What kind of output do you want? (smooth, glitchy, rhythmic, textural, etc.)
- What should it preserve from source? (harmonic content, rhythmic feel, spectral character)
- What should it change? (temporal order, density, evolution)
Choose appropriate preset as starting point:
- Smooth, subtle: Fine Grain
- Bold, chunky: Coarse Grain
- Dense, complex: Textural
- Pulsed, metric: Rhythmic
- Extreme, granular: Experimental
Fine-tune parameters:
- Adjust K based on source complexity (more variety = higher K)
- Adjust frame size for time resolution (transients = smaller frames)
- Adjust crossfade for smoothness vs detail
Iterate:
- Generate multiple versions with slight parameter variations
- Compare outputs to find optimal settings
- Use Show_info to diagnose issues (e.g., states with zero counts)
Post-process:
- Export to audio editor or DAW
- Layer multiple generations for richness
- Apply effects (reverb, EQ, compression) to taste
- Combine with other sounds in larger composition

Advanced Workflow: Maximum Control

Source preparation:
- Select source with clear timbral variety
- Optionally: Pre-process with EQ or dynamics to emphasize certain timbres
- Optionally: Concatenate multiple sources to learn from diverse material
Parameter optimization:
- Run with Show_info = ON to examine state statistics
- Check if any states have very few frames (< 5) — if so, reduce K
- Check k-means convergence iterations — if hitting max, increase max_iterations
- Examine transition matrix in visualization — are there clear patterns or just noise?
Feature analysis:
- Study feature space plot to see state separation
- If states overlap heavily, consider reducing K or different frame parameters
- If states are very sparse, consider increasing K
Generation control:
- For specific duration, set Match_input_duration = OFF and Output_length_frames precisely
- For stereo variety, generate with Stereo_output = ON and compare left/right channels
- For multiple takes, run script repeatedly (different random seeds each time)
Batch processing:
- Script multiple sources with same parameters for consistency
- Disable Draw_visualization and Show_info for faster processing
- Use Praat's scripting to automate parameter sweeps
External analysis:
- Export transition matrix (copy from Info window if Show_info = ON)
- Analyze in Python/R for deeper statistical understanding
- Manually modify matrix and use in future generations (requires script modification)

Troubleshooting Common Issues

Problem: Output sounds too random/incoherent

Causes:

K too high (over-clustering)
Source too varied (no clear patterns)
Frame size too small (noisy features)

Solutions:

Reduce K to 5-8
Use longer frame size (30-50ms)
Choose source with clearer timbral structure

Problem: Output sounds too similar to input

Causes:

K too low (under-clustering)
Transition matrix too diagonal (strong self-transitions)
Source too homogeneous

Solutions:

Increase K to 12-16
Use more varied source material
Try different random seed (run again)

Problem: Clicks or artifacts in output

Causes:

Crossfade too short or zero
Frame size/hop mismatch
Very short frames (< 5ms)

Solutions:

Increase Crossfade_ms to 10-20ms
Use frame hop = 0.5 × frame size
Increase frame size to 15-20ms minimum

Problem: Script runs very slowly

Causes:

Very long source sound (> 60s)
Very small frame hop (many frames)
High K (> 20)
Visualization enabled

Solutions:

Trim source to 10-30 seconds
Increase frame hop (reduce overlap)
Reduce K to 8-12
Disable Draw_visualization
Use mono input instead of stereo

Problem: Some states have zero counts

Causes:

K too high for available data
K-means initialization unlucky
Some clusters too small

Solutions:

Reduce K
Increase max_kmeans_iterations for better convergence
Run again (different initialization)

Problem: Stereo output is identical in both channels

Causes:

This shouldn't happen — each channel uses different random seed
Possible script error or very deterministic transition matrix

Solutions:

Check if transition matrix has only one non-zero entry per row (fully deterministic)
Try different source or parameters to create more probabilistic matrix

Troubleshooting

Error Messages

"Please select exactly one Sound object"

Cause: No Sound selected, or multiple Sounds selected, or wrong object type selected.

Solution: Select exactly one Sound object in the Objects window before running the script.

Performance Issues

Speed and Memory Tips:

1. Reduce frame count:

Shorter source sounds
Larger frame_hop (less overlap)
Longer frame_size together with a proportionally longer frame_hop (the hop, not the frame size, mainly sets the frame count)

2. Optimize clustering:

Smaller K (fewer clusters)
Fewer max_kmeans_iterations
Disable Show_info and Draw_visualization for faster runs

3. Memory management:

Close unused Praat objects before running
Use mono sounds instead of stereo (half the data)
Lower sample rate if possible (22050 Hz often sufficient)

4. Batch processing:

Process multiple sounds in sequence
Save results and clear Objects window between runs
Use script automation for parameter sweeps

Limitations and Workarounds

Current Limitations:

1. First-order Markov assumption:

Limitation: Only considers immediate previous state, no long-term memory
Workaround: Use longer frame sizes to capture more context
Alternative: Generate multiple short sequences and manually arrange

2. Fixed feature set:

Limitation: Only 4 timbre features (intensity, pitch, centroid, slope)
Workaround: Modify script to add custom features (requires Praat scripting knowledge)
Alternative: Pre-process source to emphasize desired timbral aspects

3. Gaussian emission assumption:

Limitation: Assumes features are normally distributed within each state
Workaround: Feature normalization helps, but non-Gaussian distributions may not be captured well
Alternative: Use sources with relatively consistent timbres within each state

4. K-means initialization:

Limitation: K-means can converge to local optima, results may vary between runs
Workaround: Run script multiple times and choose best result
Alternative: Increase max_kmeans_iterations for better convergence

5. Probabilistic frame selection:

Limitation: Random frame selection weighted by emission probability, not guaranteed optimal
Workaround: Generate multiple outputs and select best
Alternative: Modify script for sequential or deterministic selection

Getting Help

Resources and Support:

1. Documentation:

This user guide
Script header comments in HMM_Timbre_Sequencing.praat
Example files and presets in repository

2. Community:

Praat user mailing list (users-list@praat.org)
GitHub issues for bug reports: github.com/ShaiCohen-ops/Praat-plugin_AudioTools/issues
Audio programming forums (KVR, Lines, etc.)

3. Further Development:

Fork GitHub repository for custom modifications
Contribute improvements via pull requests
Request features via GitHub issues
Share your results and creative applications

4. Academic References:

Hidden Markov Models: Rabiner (1989) tutorial paper
Timbre features: Peeters (2004) CUIDADO project
Audio descriptor analysis: McAdams (2013) CMJ paper
Granular synthesis: Roads (2001) Microsound book