Markov Soundscape Weaver — User Guide

AI-driven temporal modeling: deconstructs audio into grains, learns texture states via clustering, models temporal grammar with Markov chains, and generates infinite streams that follow the natural flow of the original.

Author: Shai Cohen
Affiliation: Department of Music, Bar-Ilan University, Israel
Version: 0.1 (2025)
License: MIT License
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

Contents:

What this does
Quick start
Markov Chain Theory
Analysis Phase
Synthesis Phase
Applications

What this does

This script implements Markov soundscape weaving, a temporal modeling approach that analyzes both the spectral content and the temporal evolution of the source audio. The process has four stages:

(1) Granular decomposition: Splits the audio into discrete, non-overlapping grains.
(2) Texture state learning: K-means clustering groups the grains into spectral states.
(3) Temporal grammar modeling: Markov chain analysis learns the transition probabilities between states.
(4) Generative weaving: New sequences are generated by sampling a grain from the current state, then moving to the next state according to the learned probabilities.

Result: streams of any requested length that preserve both the spectral vocabulary and the temporal flow patterns of the original source.

Key Features:

Dual Learning — Spectral states + temporal transitions
Markov Modeling — First-order probability transitions
Non-overlapping Grains — Discrete temporal units
State-Based Synthesis — Texture-appropriate grain selection
Infinite Generation — Follows natural source flow indefinitely
Memory Management — Block-based processing for long outputs

What are Markov soundscapes? Traditional granular synthesis recombines grains at random. The Markov approach instead learns "what tends to follow what" in the original audio.

Benefits:
(1) Temporal coherence: Generated sequences follow natural progression patterns.
(2) Source authenticity: Preserves both sound quality and flow.
(3) Controllable variation: Same states, different sequences.
(4) Musical structure: Captures local phrasing and development (longer-range structure is beyond a first-order model).
(5) Adaptive learning: Different sources yield different Markov personalities.

Use cases: Generative music systems, soundscape composition, algorithmic accompaniment, interactive audio, music analysis, computational creativity.

Technical Implementation:

(1) Preprocessing: Mono conversion, duration validation.
(2) Feature extraction: Non-overlapping grain analysis (spectral centroid, bandwidth, pitch, harmonicity).
(3) Clustering: K-means groups grains into k texture states.
(4) Markov learning: Analyze state sequences to build a k×k transition probability matrix.
(5) Generative synthesis: Start at a random state; while generating, sample a grain from the current state and use the Markov matrix to determine the next state.
(6) Block management: Process in blocks of 50 grains for memory efficiency.
(7) Output: Concatenate blocks, normalize, name "originalname_MarkovWeave".

Processing time scales with source duration and state count.

Quick start

In Praat, select exactly one Sound object.
Run script… → markov_soundscape_weaver.praat.
Set grain_size_ms (80ms default, smaller = more granular).
Choose number_of_states (5 default, higher = more texture variety).
Set output_duration_sec (e.g., 15.0 for 15-second output).
Enable play_result to audition immediately.
Click OK — soundscape generated, named "originalname_MarkovWeave".

Quick tip: Sources with clear temporal patterns yield the most musical results: music with clear phrases, environmental sounds with natural cycles, speech with sentence structure. Grain sizes of 50-150 ms work well (smaller for fine temporal control, larger for smoother evolution), and 5-8 states are typically sufficient for complex sources. While running, the script reports its stages: "Analyzing audio structure..." → "Learning states (Clustering)..." → "Learning grammar (Markov Chain)..." → "Weaving soundscape..." → "Finalizing...".

Important: SOURCE TEMPORALITY CRITICAL. The script works best with sources that have clear temporal patterns (music, speech, environmental cycles). Static/ambient sources yield less interesting Markov models.

Minimum source duration: At least 4× grain size for meaningful analysis.
State count balance: Too few states = oversimplified model, too many = overfitting to specific moments.
Random starting point: Each generation begins at a random state → different initial character.
Markov memory: First-order chains (only the previous state matters) capture local patterns but not long-term structure.
Grain boundaries: Non-overlapping grains can create clicks, so a Hanning window is applied during extraction for smoothness.

Markov Chain Theory

Markov Process Fundamentals

First-Order Markov Chains

Mathematical definition:

A Markov chain is a stochastic process satisfying:

    P(Xₙ₊₁ = x | X₁ = x₁, X₂ = x₂, ..., Xₙ = xₙ) = P(Xₙ₊₁ = x | Xₙ = xₙ)

Where:
    Xₙ = state at time n
    P = transition probability
    "Memoryless" property: the future depends only on the present state

In our context:
    States = texture clusters (1 to k)
    Time steps = grain positions
    Transitions = how textures evolve over time

Why Markov for Audio?

Audio as temporal process:

Musical phrases: Notes/chords follow predictable sequences
Speech patterns: Phonemes follow language rules
Environmental sounds: Natural events have temporal relationships
Emotional arcs: Dynamics and tension follow narrative patterns

Markov advantages:

Captures local structure: What typically follows what
Preserves style: Generated sequences sound "in style" of original
Controllable randomness: Probabilistic but not completely random
Computationally simple: Easy to implement and understand

Transition Matrix Mathematics

Matrix Structure

k×k probability matrix:

Let k = number_of_states
Transition matrix T = [tᵢⱼ] where:
    tᵢⱼ = P(next state = j | current state = i)

Properties:
    1. 0 ≤ tᵢⱼ ≤ 1 for all i,j
    2. ∑ⱼ tᵢⱼ = 1 for each i (rows sum to 1)

Example (k=3):
              State1  State2  State3
    State1 [   0.2     0.5     0.3  ]
    State2 [   0.7     0.1     0.2  ]
    State3 [   0.4     0.4     0.2  ]

Interpretation: If currently in State1:
    - 20% chance stay in State1
    - 50% chance go to State2
    - 30% chance go to State3

Matrix Construction

From observed sequences:

INPUT: State sequence s₁, s₂, ..., sₙ

STEP 1: Count transitions
    FOR i = 1 to n-1:
        current = sᵢ
        next = sᵢ₊₁
        count[current, next] += 1

STEP 2: Normalize rows
    FOR each state i = 1 to k:
        row_sum = ∑ⱼ count[i,j]
        IF row_sum > 0:
            FOR each state j = 1 to k:
                T[i,j] = count[i,j] / row_sum
        ELSE:
            // Dead state - uniform probabilities
            T[i,j] = 1/k for all j

OUTPUT: Transition matrix T
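The counting and normalization steps above can be sketched in a few lines of Python. This is only an illustration, not the script's Praat code: the function name build_transition_matrix and the sample sequence are made up for the example, and states are assumed to be numbered 1 to k as in the text.

```python
def build_transition_matrix(states, k):
    """Estimate a k x k row-stochastic matrix from a 1-based state sequence."""
    counts = [[0] * k for _ in range(k)]
    # STEP 1: count each observed transition s_i -> s_(i+1)
    for current, nxt in zip(states, states[1:]):
        counts[current - 1][nxt - 1] += 1

    matrix = []
    # STEP 2: normalize each row; dead states get uniform probabilities
    for row in counts:
        total = sum(row)
        if total > 0:
            matrix.append([c / total for c in row])
        else:
            matrix.append([1.0 / k] * k)
    return matrix


# Hypothetical sequence; state 4 never occurs, so its row is a dead state.
sequence = [1, 2, 1, 2, 3, 1, 2]
T = build_transition_matrix(sequence, k=4)
for row in T:
    print(row)
```

Here state 1 is always followed by state 2, state 2 is followed by state 1 or state 3 equally often, and the unused state 4 falls back to 1/k for every target.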

State Sequence Generation

Markov Chain Simulation

Generative algorithm:

INPUT: Transition matrix T, initial state s₀, length L
OUTPUT: State sequence s₁, s₂, ..., sₗ

s_current = s₀
FOR time = 1 to L:
    STEP 1: Get current row probabilities
        probs = T[s_current, :]   // row s_current
    STEP 2: Sample next state
        roll = random(0,1)
        cum_sum = 0
        s_next = k   // fallback in case rounding leaves cum_sum slightly below roll
        FOR j = 1 to k:
            cum_sum += probs[j]
            IF roll ≤ cum_sum:
                s_next = j
                BREAK
    STEP 3: Update and continue
        s_current = s_next
        OUTPUT s_current
END FOR
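As a rough Python sketch of the simulation loop above (the names sample_next_state and simulate_chain are illustrative, not taken from the script): the roll is drawn from [0, 1), so a strict "<" comparison is used here to make sure a zero-probability state can never be selected, and the last state acts as a fallback against floating-point rounding.

```python
import random


def sample_next_state(row, rng):
    """Pick a 1-based next state from one row of the transition matrix."""
    roll = rng.random()  # uniform in [0, 1)
    cumulative = 0.0
    for j, p in enumerate(row, start=1):
        cumulative += p
        if roll < cumulative:
            return j
    return len(row)  # guard: rounding left the cumulative sum just below roll


def simulate_chain(T, start_state, length, rng):
    """Generate `length` states by repeatedly sampling from the current row."""
    state = start_state
    sequence = []
    for _ in range(length):
        state = sample_next_state(T[state - 1], rng)
        sequence.append(state)
    return sequence


# The k=3 example matrix from the Matrix Structure section.
T = [[0.2, 0.5, 0.3],
     [0.7, 0.1, 0.2],
     [0.4, 0.4, 0.2]]
path = simulate_chain(T, start_state=1, length=20, rng=random.Random(0))
print(path)
```

Passing an explicit random.Random instance is a design choice for reproducibility: the same seed yields the same state sequence, which is handy when comparing parameter settings.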

Why This Works for Audio

Temporal coherence preservation:

🎵 Musical Interpretation

States as musical elements:

State 1: Quiet, sparse texture
State 2: Building intensity
State 3: Climactic moment
State 4: Resolution
State 5: Transitional material

Markov matrix captures:

Quiet → Building (high probability)
Building → Climax (high probability)
Climax → Resolution (high probability)
Resolution → Quiet (moderate probability)
Unexpected jumps (low but non-zero probability)

Generated sequences follow natural dramatic arcs

Mathematical Properties

Stationary Distribution

Long-term behavior:

For ergodic Markov chains, there exists a stationary distribution π such that:

    π = πT

where π is a probability vector satisfying ∑ᵢ πᵢ = 1, πᵢ ≥ 0.

Interpretation: After many steps, the probability of being in state i approaches πᵢ.
In audio terms: Long generated sequences will spend a proportion πᵢ of their time in texture state i.
Calculation: Solve for the left eigenvector of T with eigenvalue 1, then normalize it to sum to 1.
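One simple way to approximate π without an eigenvector solver is power iteration: start from a uniform vector and repeatedly multiply by T until it stops changing. The sketch below assumes an ergodic chain (otherwise the iteration may not settle) and is not part of the script; the function name and tolerance are illustrative choices.

```python
def stationary_distribution(T, iterations=1000, tol=1e-12):
    """Approximate pi with pi = pi T by repeated left-multiplication."""
    k = len(T)
    pi = [1.0 / k] * k  # start from the uniform distribution
    for _ in range(iterations):
        new_pi = [sum(pi[i] * T[i][j] for i in range(k)) for j in range(k)]
        if max(abs(a - b) for a, b in zip(new_pi, pi)) < tol:
            return new_pi
        pi = new_pi
    return pi


# The k=3 example matrix from the Matrix Structure section.
T = [[0.2, 0.5, 0.3],
     [0.7, 0.1, 0.2],
     [0.4, 0.4, 0.2]]
pi = stationary_distribution(T)
print([round(p, 4) for p in pi])
```

The resulting vector gives the long-run share of grains drawn from each texture state, which can be compared against how often each state occurs in the source.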

Chain Classification

Types of Markov chains:

IRREDUCIBLE: Every state reachable from every other state
    Audio interpretation: All textures eventually appear
APERIODIC: No deterministic cycles
    Audio interpretation: No locked repetitive patterns
ERGODIC: Irreducible + aperiodic
    Guaranteed stationary distribution exists
    Ideal for infinite audio generation

Our implementation: May not be ergodic if the source produces absorbing states
Handling: Uniform probabilities are assigned to dead states (states with no observed outgoing transitions)
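A learned matrix can be checked for these properties before generating long outputs. The following Python sketch tests irreducibility with a simple graph search and lists absorbing states (self-transition probability exactly 1). It is a diagnostic idea under the definitions above, not a feature of the script, and the function names are hypothetical.

```python
def reachable_from(T, start):
    """Return the set of 0-based states reachable from `start` via nonzero transitions."""
    seen = {start}
    stack = [start]
    while stack:
        i = stack.pop()
        for j, p in enumerate(T[i]):
            if p > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen


def is_irreducible(T):
    """True if every state can reach every other state."""
    k = len(T)
    return all(len(reachable_from(T, i)) == k for i in range(k))


def absorbing_states(T):
    """1-based indices of states that can never be left."""
    return [i + 1 for i, row in enumerate(T) if row[i] == 1.0]


stuck = [[0.5, 0.5],
         [0.0, 1.0]]  # state 2 is absorbing
print(is_irreducible(stuck), absorbing_states(stuck))
```

If the check fails, the generated stream may eventually lock into a subset of textures; using more varied source material or a different state count is one way to respond.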

Analysis Phase

Granular Decomposition

Non-overlapping Grain Strategy

Discrete temporal units:

Parameters:
    grain_size_ms = 80 (default)
    No overlap between grains

Calculation:
    grain_size_sec = grain_size_ms / 1000
    nGrains = floor(total_duration / grain_size_sec)

Extraction:
    FOR i = 1 to nGrains:
        start_time = (i-1) × grain_size_sec
        end_time = i × grain_size_sec
        Extract grain: start_time to end_time, Hanning window

Why non-overlapping?
    Creates discrete time steps for Markov analysis
    Each grain represents one "moment" in the sequence
    Overlap would create temporal ambiguity
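To make the grain arithmetic concrete, here is a small Python sketch that computes the grain boundaries and a window of the standard raised-cosine (Hann) shape. It assumes the usual textbook Hann formula, which may differ in detail from the window Praat applies, and the helper names are illustrative.

```python
import math


def grain_boundaries(total_duration_sec, grain_size_ms):
    """Return (start, end) times in seconds for each non-overlapping grain."""
    grain_size_sec = grain_size_ms / 1000.0
    # Small epsilon so an exact multiple is not lost to floating-point rounding.
    n_grains = math.floor(total_duration_sec / grain_size_sec + 1e-9)
    return [((i - 1) * grain_size_sec, i * grain_size_sec)
            for i in range(1, n_grains + 1)]


def hann_window(n_samples):
    """Raised-cosine window: 0 at both ends, 1 in the middle."""
    if n_samples < 2:
        return [1.0] * n_samples
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (n_samples - 1)))
            for n in range(n_samples)]


bounds = grain_boundaries(1.0, 80)  # a 1-second source with default grains
print(len(bounds), bounds[0], bounds[-1])
window = hann_window(5)
```

A 1-second source at 80 ms yields 12 full grains; the trailing 40 ms is shorter than one grain and is not analyzed.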

Feature Extraction

Four-dimensional feature space:

1. SPECTRAL CENTROID
   Measures brightness: higher = more high-frequency content
   Critical for timbral characterization
2. SPECTRAL BANDWIDTH
   Measures spectral spread: higher = noisier, lower = more focused
   Distinguishes noisy vs tonal textures
3. PITCH (F0)
   Fundamental frequency: higher = higher pitch
   Zero/undefined for unpitched segments
   Groups similar pitch ranges
4. HARMONICITY (HNR)
   Harmonics-to-noise ratio: higher = more tonal
   Primary indicator of tonal vs noisy content

Together: Capture timbre, pitch, noisiness → comprehensive texture description

Clustering Phase

K-Means for Texture States

State learning process:

INPUT: nGrains × 4 feature matrix (normalized)

STEP 1: Initialize
    k = number_of_states
    Randomly select k grains as initial centroids

STEP 2: Cluster assignment (E-step)
    FOR each grain i:
        Calculate distance to each centroid
        Assign to nearest centroid (min Euclidean distance)

STEP 3: Centroid update (M-step)
    FOR each cluster c:
        Recalculate centroid as mean of assigned grains

STEP 4: Convergence check
    Repeat until no reassignments OR max iterations (10)

OUTPUT:
    - Cluster assignments for each grain
    - Final centroids (state prototypes)
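The four steps above map onto a compact, dependency-free Python sketch. It is meant only to illustrate the loop structure; the function names, the toy two-dimensional features, and the choice to keep an old centroid when a cluster empties are assumptions for the example, not details confirmed by the script.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(features, k, max_iterations=10, rng=None):
    """Return (1-based cluster label per grain, list of centroids)."""
    rng = rng or random.Random()
    # STEP 1: pick k distinct grains as initial centroids
    centroids = [list(f) for f in rng.sample(features, k)]
    assignments = [None] * len(features)
    for _ in range(max_iterations):
        # STEP 2: assign every grain to its nearest centroid
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(f, centroids[c]))
            for f in features
        ]
        # STEP 4: stop when nothing was reassigned
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # STEP 3: move each centroid to the mean of its members
        for c in range(k):
            members = [f for f, a in zip(features, assignments) if a == c]
            if members:  # assumption: an emptied cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return [a + 1 for a in assignments], centroids


# Toy normalized features: two well-separated groups of three grains each.
grains = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
          [9.0, 9.0], [9.2, 9.1], [9.1, 9.3]]
labels, prototypes = kmeans(grains, k=2, rng=random.Random(0))
print(labels)
```

The label list is exactly the state sequence that the Markov learning phase consumes, one state per grain in time order.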

State Interpretation

What each state represents:

Example: 5-state analysis of piano piece

State 1: Low centroid, medium bandwidth, low pitch, high harmonicity
→ Deep, tonal bass notes

State 2: Medium centroid, low bandwidth, medium pitch, high harmonicity
→ Clear mid-range melodies

State 3: High centroid, high bandwidth, high pitch, medium harmonicity
→ Bright, noisy high register

State 4: Low centroid, high bandwidth, undefined pitch, low harmonicity
→ Percussive attacks, noise bursts

State 5: Medium centroid, medium bandwidth, medium pitch, medium harmonicity
→ Transitional, ambiguous textures

Each state captures a distinct "textural character"

Markov Learning Phase

Transition Probability Calculation

From state sequence to probabilities:

INPUT: State sequence s₁, s₂, ..., sₙ (from clustering)

STEP 1: Count transitions
    Create count matrix C[k×k] initialized to 0
    FOR i = 1 to n-1:
        current = sᵢ
        next = sᵢ₊₁
        C[current, next] += 1

STEP 2: Handle edge cases
    FOR each state i:
        IF row_sum(C[i,:]) = 0:
            // Dead state - no outgoing transitions observed
            // Assign uniform probabilities to avoid getting stuck
            C[i,j] = 1 for all j   // Will normalize to 1/k

STEP 3: Normalize to probabilities
    FOR each state i:
        total = sum(C[i,:])
        FOR each state j:
            T[i,j] = C[i,j] / total

OUTPUT: Transition probability matrix T[k×k]

Matrix Interpretation

Reading the Markov personality:

Example: Environmental recording (forest)
States: 1=Wind, 2=Birds, 3=Leaves, 4=Silence, 5=Rain

Transition Matrix:
       1     2     3     4     5
1 [ 0.60  0.05  0.25  0.08  0.02 ] Wind
2 [ 0.10  0.40  0.30  0.15  0.05 ] Birds
3 [ 0.20  0.20  0.35  0.20  0.05 ] Leaves
4 [ 0.15  0.25  0.30  0.20  0.10 ] Silence
5 [ 0.05  0.05  0.10  0.10  0.70 ] Rain

Interpretation:
- Wind tends to persist (0.60) or transition to leaves (0.25)
- Birds often continue (0.40) or move to leaves (0.30)
- Rain strongly persists (0.70): a "sticky" state, though not a truly absorbing one (that would require a self-transition probability of 1.00)
- Silence leads to various activities (distributed)
This matrix captures the "ecology" of the forest soundscape
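One way to quantify how "sticky" each state is: in a first-order chain, a self-transition probability tᵢᵢ implies an expected run of 1 / (1 − tᵢᵢ) consecutive grains in state i (the mean of a geometric distribution). The Python sketch below applies this to the forest matrix; the function name is illustrative and the calculation is an analysis aid, not something the script reports.

```python
def expected_dwell_grains(T):
    """Expected number of consecutive grains spent in each state once entered."""
    dwell = []
    for i, row in enumerate(T):
        stay = row[i]
        # A self-transition of exactly 1 means the state is never left.
        dwell.append(float("inf") if stay >= 1.0 else 1.0 / (1.0 - stay))
    return dwell


forest = [[0.60, 0.05, 0.25, 0.08, 0.02],   # Wind
          [0.10, 0.40, 0.30, 0.15, 0.05],   # Birds
          [0.20, 0.20, 0.35, 0.20, 0.05],   # Leaves
          [0.15, 0.25, 0.30, 0.20, 0.10],   # Silence
          [0.05, 0.05, 0.10, 0.10, 0.70]]   # Rain
names = ["Wind", "Birds", "Leaves", "Silence", "Rain"]
for name, grains in zip(names, expected_dwell_grains(forest)):
    print(f"{name}: about {grains:.2f} grains per visit")
```

Multiplying these values by the grain size gives the typical duration of each texture in the generated stream; rain episodes last longest here, at roughly 3.3 grains per visit.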

Synthesis Phase

Generative Weaving

State-Based Grain Selection

Texture-appropriate sampling:

For each time step:

STEP 1: Get current state s

STEP 2: Access grain pool for state s
    grains_in_state = all grains assigned to cluster s

STEP 3: Random selection
    IF grains_in_state not empty:
        random_index = random(1, count_in_state_s)
        selected_grain = grains_in_state[random_index]
    ELSE:
        // Fallback: select from any state
        selected_grain = random grain from any state

STEP 4: Extract and process grain
    Extract audio segment for selected_grain
    Apply Hanning window for smooth boundaries

Why state-based selection?
    Ensures spectral coherence within each moment
    Generated audio always "makes sense" timbrally
    Preserves source texture vocabulary
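Steps 2 and 3 can be sketched in Python as a lookup table from state to grain indices plus a random pick with fallback. The names build_grain_pools and pick_grain are hypothetical; the sketch works on grain indices only and leaves the audio extraction of Step 4 to the host environment.

```python
import random


def build_grain_pools(assignments, k):
    """Map each state (1..k) to the 1-based indices of the grains assigned to it."""
    pools = {state: [] for state in range(1, k + 1)}
    for grain_index, state in enumerate(assignments, start=1):
        pools[state].append(grain_index)
    return pools


def pick_grain(pools, state, n_grains, rng):
    """Choose a grain from the current state's pool, or any grain if the pool is empty."""
    pool = pools.get(state, [])
    if pool:
        return rng.choice(pool)
    return rng.randint(1, n_grains)  # fallback: any grain from the source


# Hypothetical clustering result for five grains; state 4 has no members.
assignments = [1, 2, 1, 3, 2]
pools = build_grain_pools(assignments, k=4)
rng = random.Random(0)
print(pools[1], pick_grain(pools, 1, len(assignments), rng))
```

Building the pools once, before generation starts, avoids rescanning the assignment list for every generated grain.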

Markov State Transition

Probabilistic progression:

INPUT: Current state s, transition matrix T

STEP 1: Get transition probabilities
    probs = T[s, :]   // row s of transition matrix

STEP 2: Cumulative distribution
    cum_probs[1] = probs[1]
    FOR j = 2 to k:
        cum_probs[j] = cum_probs[j-1] + probs[j]

STEP 3: Random sampling
    roll = random(0, 1)
    next_state = k   // fallback in case rounding leaves cum_probs[k] slightly below roll
    FOR j = 1 to k:
        IF roll ≤ cum_probs[j]:
            next_state = j
            BREAK

OUTPUT: next_state (for next time step)

Note: This ensures transitions follow the learned probabilities. Some transitions are likely, others rare but possible, which creates a natural, source-like progression.

Block Management System

Memory-Efficient Processing

Why block-based synthesis?

Problem: Long outputs require many grains
    Example: 60 s output ÷ 100 ms grains = 600 grains
    Each grain = temporary Sound object
    600 objects = memory issues in Praat

Solution: Process in blocks
    block_size = 50 grains
    Process block → concatenate → store → clear memory
    Repeat until output duration reached

Calculation:
    grains_needed = ceil(output_duration / grain_size_sec)
    n_blocks = ceil(grains_needed / block_size)

Memory usage: ~50 temporary objects at once vs 600
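The block arithmetic above is easy to check with a short Python sketch. The function name plan_blocks is made up for the example, and the small epsilon is an added guard against floating-point rounding rather than something taken from the script.

```python
import math


def plan_blocks(output_duration_sec, grain_size_ms, block_size=50):
    """Return (grains_needed, n_blocks, grains per block)."""
    grain_size_sec = grain_size_ms / 1000.0
    # Epsilon keeps an exact multiple from being rounded up by one grain.
    grains_needed = math.ceil(output_duration_sec / grain_size_sec - 1e-9)
    n_blocks = math.ceil(grains_needed / block_size)
    sizes = [min(block_size, grains_needed - b * block_size)
             for b in range(n_blocks)]
    return grains_needed, n_blocks, sizes


print(plan_blocks(60, 100))  # the 60 s / 100 ms example from the text
print(plan_blocks(15, 80))   # the default settings
```

With the defaults (15 s output, 80 ms grains) this gives 188 grains in 4 blocks, the last block holding the remaining 38 grains.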

Block Processing Algorithm

Initialize:
    total_grains_generated = 0
    block_list = []

WHILE total_grains_generated < grains_needed:
    STEP 1: Generate block
        current_block = []
        remaining_grains = grains_needed - total_grains_generated
        FOR i = 1 to min(block_size, remaining_grains):
            // State-based grain selection + Markov transition
            grain = generate_grain(current_state)
            current_block.append(grain)
            total_grains_generated += 1
    STEP 2: Process block
        concatenate all grains in current_block → block_sound
        block_list.append(block_sound)
        clear temporary grains

STEP 3: Final assembly (after the loop ends)
    concatenate all blocks in block_list → final_output
    normalize final_output
    rename final_output

Output: Single Sound object of desired duration

Complete Synthesis Pipeline

INPUT: Learned states + Markov matrix + Source audio

INITIALIZATION:
    current_state = random(1, k)
    grains_needed = ceil(output_duration / grain_size)
    Initialize block system

GENERATION LOOP:
    WHILE grains_generated < grains_needed:
        // Grain generation
        grain_index = random selection from state_current grains
        Extract grain: (grain_index-1)×grain_size to grain_index×grain_size
        Apply Hanning window
        Add to current block

        // State transition
        roll = random(0,1)
        cum_prob = 0
        FOR next_state = 1 to k:
            cum_prob += Markov[current_state, next_state]
            IF roll ≤ cum_prob:
                current_state = next_state
                BREAK
        grains_generated += 1

        // Block management
        IF block full OR generation complete:
            Concatenate block → store → clear

FINALIZATION:
    Concatenate all blocks → output
    Normalize peak to 0.99
    Rename to "originalname_MarkovWeave"

Parameters & Settings

Analysis Parameters

Parameter	Type	Default	Description
grain_size_ms	positive	80	Duration of analysis grains
number_of_states	integer	5	K-means cluster count

Synthesis Parameters

Parameter	Type	Default	Description
output_duration_sec	positive	15.0	Duration of generated output

Output Parameters

Parameter	Type	Default	Description
play_result	boolean	1	Auto-play after generation

Parameter Guidance

Grain size selection:

20-50ms: Very granular, abstract, good for micro-sounds
80-150ms: Balanced, captures musical phrases, recommended
200-500ms: Macroscopic, preserves longer patterns
>500ms: Sectional, for very long-form structure

State count strategy:

3-5 states: Broad characterization, good for simple sources
6-10 states: Detailed modeling, for complex musical pieces
11-15 states: Fine-grained analysis, for very diverse sources
>15 states: Over-segmentation, usually unnecessary

Rule of thumb: nStates ≈ sqrt(nGrains/10)
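The rule of thumb can be turned into a quick calculation. In the Python sketch below, the function name and the clamping range of 2 to 15 states are assumptions for illustration (the upper bound follows the guidance that more than 15 states is usually unnecessary); treat the result as a starting point, not a prescription.

```python
import math


def suggest_state_count(source_duration_sec, grain_size_ms, minimum=2, maximum=15):
    """Apply nStates ~ sqrt(nGrains / 10), clamped to an assumed practical range."""
    n_grains = math.floor(source_duration_sec * 1000.0 / grain_size_ms + 1e-9)
    estimate = round(math.sqrt(n_grains / 10.0))
    return max(minimum, min(maximum, estimate))


print(suggest_state_count(120, 120))  # 2-minute source, 120 ms grains -> 1000 grains
print(suggest_state_count(15, 80))    # 15-second source, 80 ms grains -> 187 grains
```

A 2-minute source at 120 ms has 1000 grains, giving sqrt(100) = 10 states; a 15-second source at 80 ms has 187 grains, giving about 4.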

Output duration considerations:

Short (5-15s): Quick testing, demonstration
Medium (30-60s): Musical phrases, complete ideas
Long (2-5min): Extended compositions, ambient beds
Very long (>5min): Installation pieces, requires patience

Applications

Generative Music Systems

Use case: Creating endless music in the style of a composer

Technique: Analyze existing compositions, generate new sequences

Example: Bach chorales → infinite Baroque-style counterpoint

Soundscape Composition

Use case: Generating realistic environmental soundscapes

Technique: Analyze field recordings, generate infinite variations

Examples: Forest, city, ocean, rainforest soundscapes

Algorithmic Accompaniment

Use case: Generating responsive background textures for live performance

Technique: Analyze performer's style, generate complementary material

Advantages: Always stylistically appropriate, never repeats exactly

Music Analysis Tool

Use case: Understanding compositional style through Markov modeling

Technique: Analyze transition matrices to identify stylistic fingerprints

Insights: Repetition patterns, phrase structure, dramatic arcs

Practical Workflow Examples

🎹 Infinite Piano Music

Goal: Generate endless music in specific composer's style

Settings:

Source: 2-minute piano piece
Grain size: 120ms (captures notes/phrases)
States: 8 (detailed style capture)
Output: 300.0 (5-minute generation)

Result: 5-minute piece in same style as source

🌳 Dynamic Forest Soundscape

Goal: Generate realistic forest that never repeats

Settings:

Source: 10-minute forest recording
Grain size: 200ms (environmental events)
States: 6 (bird, wind, leaves, etc.)
Output: 1800.0 (30-minute soundscape)

Result: Half-hour forest soundscape with natural flow

🎭 Dramatic Arc Generation

Goal: Create music with narrative tension/release

Settings:

Source: Film score with clear emotional arc
Grain size: 100ms (emotional micro-moments)
States: 7 (calm, building, tension, climax, etc.)
Output: 120.0 (2-minute dramatic piece)

Result: New composition with similar emotional journey

Advanced Techniques

Multi-scale Markov modeling:

Stage 1: Analyze with large grains (500ms) for macro-structure
Stage 2: Analyze with small grains (50ms) for micro-structure
Stage 3: Combine models for hierarchical generation
Result: Coherent at both phrase and note levels

Interactive Markov chains:

Real-time modification: Adjust transition probabilities during generation
State forcing: Manually select next state for compositional control
Probability morphing: Interpolate between different Markov matrices
Result: Hybrid human-AI composition system
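Probability morphing can be sketched as a linear blend of two transition matrices with the same number of states. Because each output row is a convex combination of two rows that sum to 1, the blended rows also sum to 1. This is an illustration of the idea, not a feature of the current script; morph_matrices and alpha are hypothetical names.

```python
def morph_matrices(A, B, alpha):
    """Blend two k x k transition matrices: alpha=0 gives A, alpha=1 gives B."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("matrices must have the same shape")
    return [[(1.0 - alpha) * a + alpha * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(A, B)]


static = [[1.0, 0.0],
          [0.0, 1.0]]     # every state repeats forever
alternating = [[0.0, 1.0],
               [1.0, 0.0]]  # states strictly alternate
halfway = morph_matrices(static, alternating, 0.5)
print(halfway)
```

Sweeping alpha from 0 to 1 during generation would move the output gradually from one temporal behavior to the other, provided the two matrices refer to comparable states.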

Troubleshooting Common Issues

Problem: Output too repetitive/stuck
Cause: Markov matrix has absorbing states or high self-transition probabilities
Solution: Increase state count, use source with more variation, adjust matrix manually

Problem: Output too random/chaotic
Cause: Source lacks clear temporal patterns, too many states
Solution: Use more structured source, reduce state count, increase grain size

Problem: Clicks at grain boundaries
Cause: Spectral discontinuity between adjacent grains
Solution: Ensure Hanning window applied, consider small overlap (modify script)

Problem: Memory errors during long generation
Cause: Too many temporary objects, insufficient RAM
Solution: Reduce block size, use shorter output, close other applications

Algorithmic Extensions

Higher-Order Markov Chains

Beyond First-Order

N-th order Markov processes:

Second-order Markov:
    P(Xₙ₊₁ = x | Xₙ = xₙ, Xₙ₋₁ = xₙ₋₁)
Third-order Markov:
    P(Xₙ₊₁ = x | Xₙ = xₙ, Xₙ₋₁ = xₙ₋₁, Xₙ₋₂ = xₙ₋₂)

Implementation:
    States become tuples: (current, previous, ...)
    State space grows exponentially: kⁿ for order n

Advantage: Captures longer-term dependencies
    Musical phrases, harmonic progressions
    Narrative arcs, dramatic development

Disadvantage: Data requirements grow rapidly
    Need much longer training sequences
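A second-order model can be sketched in Python by keying the counts on pairs of states. Contexts are stored here as (previous, current) tuples in a dictionary, so only contexts that actually occur take up space. The function name and sample sequence are illustrative, and this extension is not implemented in the current script.

```python
from collections import defaultdict


def build_second_order_model(states):
    """Map each (previous, current) context to a dict of next-state probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    # Slide a window of three states over the sequence.
    for prev, cur, nxt in zip(states, states[1:], states[2:]):
        counts[(prev, cur)][nxt] += 1

    model = {}
    for context, next_counts in counts.items():
        total = sum(next_counts.values())
        model[context] = {s: c / total for s, c in next_counts.items()}
    return model


sequence = [1, 2, 3, 1, 2, 1, 1, 2, 3]
model = build_second_order_model(sequence)
print(model[(1, 2)])
```

In this sequence the context (1, 2) occurs three times, followed twice by state 3 and once by state 1. Contexts never seen in the source have no entry, so a generator would need a fallback, for example dropping back to the first-order matrix.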

Variable-Length Markov Models

Adaptive Context Length

Context-tree weighting:

Idea: Use longer context when informative, shorter when not

Implementation:
    Build context tree of variable-length sequences
    Each node = conditional probability distribution
    Prune branches with insufficient data

Advantages:
    Adapts to source complexity
    Captures both short and long patterns
    More efficient than fixed-order models

Application:
    Music with mixed temporal scales
    Speech with phrase-level and word-level structure

Hidden Markov Models

Latent State Modeling

Beyond observable states:

HMM components:
    Hidden states: True underlying process (unobserved)
    Observations: Measurable features (what we see/hear)
    Transition probabilities: Between hidden states
    Emission probabilities: From hidden states to observations

In our context:
    Hidden states = abstract musical intentions
    Observations = audio features (centroid, pitch, etc.)

Learning: Baum-Welch algorithm

Advantage: Models underlying structure, not just surface
    Captures musical intention behind acoustic surface
    More robust to performance variations