Markov Soundscape Weaver — User Guide
AI-driven temporal modeling: deconstructs audio into grains, learns texture states via clustering, models temporal grammar with Markov chains, and generates infinite streams that follow the natural flow of the original.
What this does
This script implements Markov soundscape weaving: a temporal modeling approach that analyzes both the spectral content and the temporal evolution of source audio. The process has four stages:
(1) Granular decomposition: Splits audio into discrete, non-overlapping grains.
(2) Texture state learning: K-means clustering groups grains into spectral states.
(3) Temporal grammar modeling: Markov chain analysis learns transition probabilities between states.
(4) Generative weaving: New sequences are generated by sampling grains from the current state and transitioning via the learned probabilities.
Result: streams of any requested length that preserve both the spectral vocabulary and the temporal flow patterns of the original source.
Key Features:
- Dual Learning — Spectral states + temporal transitions
- Markov Modeling — First-order probability transitions
- Non-overlapping Grains — Discrete temporal units
- State-Based Synthesis — Texture-appropriate grain selection
- Infinite Generation — Follows natural source flow indefinitely
- Memory Management — Block-based processing for long outputs
What are Markov soundscapes?
Traditional granular synthesis recombines grains at random. The Markov approach models time: it learns "what tends to follow what" in the original audio.
Benefits:
- Temporal coherence: Generated sequences follow natural progression patterns.
- Source authenticity: Preserves both sound quality and flow.
- Controllable variation: Same states, different sequences.
- Musical structure: Captures local phrasing and flow (first-order chains do not model long-term form).
- Adaptive learning: Different sources yield different Markov personalities.
Use cases: Generative music systems, soundscape composition, algorithmic accompaniment, interactive audio, music analysis, computational creativity.
Technical Implementation:
(1) Preprocessing: Mono conversion, duration validation.
(2) Feature extraction: Non-overlapping grain analysis (spectral centroid, bandwidth, pitch, harmonicity).
(3) Clustering: K-means groups grains into k texture states.
(4) Markov learning: Analyze state sequences to build a k×k transition probability matrix.
(5) Generative synthesis: Start at a random state; while generating, sample a grain from the current state, then use the Markov matrix to determine the next state.
(6) Block management: Process in blocks of 50 grains for memory efficiency.
(7) Output: Concatenate blocks, normalize, name "originalname_MarkovWeave".
Processing time scales with source duration and number of states.
Quick start
- In Praat, select exactly one Sound object.
- Run script… → markov_soundscape_weaver.praat.
- Set grain_size_ms (80ms default, smaller = more granular).
- Choose number_of_states (5 default, higher = more texture variety).
- Set output_duration_sec (e.g., 15.0 for 15-second output).
- Enable play_result to audition immediately.
- Click OK — soundscape generated, named "originalname_MarkovWeave".
Quick tip: Use rhythmic/temporal sources: music with clear phrases, environmental sounds with natural cycles, speech with sentence structure. Sources with clear temporal patterns yield the most musical results.
- Grain size: 50-150ms works well; smaller for fine temporal control, larger for smoother evolution.
- States: 5-8 are typically sufficient for complex sources.
- Progress messages: "Analyzing audio structure..." → "Learning states (Clustering)..." → "Learning grammar (Markov Chain)..." → "Weaving soundscape..." → "Finalizing...".
- Output: appears as "originalname_MarkovWeave".
Important:
- Source temporality is critical: the script works best with sources that have clear temporal patterns (music, speech, environmental cycles). Static/ambient sources yield less interesting Markov models.
- Minimum source duration: At least 4× grain size for meaningful analysis.
- State count balance: Too few states = oversimplified model, too many = overfitting to specific moments.
- Random starting point: Each generation begins at a random state, so the initial character differs between runs.
- Markov memory: First-order chains (only the previous state matters) capture local patterns but not long-term structure.
- Grain boundaries: Non-overlapping grains can create clicks; a Hanning window is applied during extraction for smoothness.
Markov Chain Theory
Markov Process Fundamentals
First-Order Markov Chains
Mathematical definition:
A Markov chain is a stochastic process satisfying:
P(Xₙ₊₁ = x | X₁ = x₁, X₂ = x₂, ..., Xₙ = xₙ) = P(Xₙ₊₁ = x | Xₙ = xₙ)
Where:
Xₙ = state at time n
P = transition probability
"Memoryless" property: future depends only on present state
In our context:
States = texture clusters (1 to k)
Time steps = grain positions
Transitions = how textures evolve over time
Why Markov for Audio?
Audio as temporal process:
- Musical phrases: Notes/chords follow predictable sequences
- Speech patterns: Phonemes follow language rules
- Environmental sounds: Natural events have temporal relationships
- Emotional arcs: Dynamics and tension follow narrative patterns
Markov advantages:
- Captures local structure: What typically follows what
- Preserves style: Generated sequences sound "in style" of original
- Controllable randomness: Probabilistic but not completely random
- Computationally simple: Easy to implement and understand
Transition Matrix Mathematics
Matrix Structure
k×k probability matrix:
Let k = number_of_states
Transition matrix T = [tᵢⱼ] where:
tᵢⱼ = P(next state = j | current state = i)
Properties:
1. 0 ≤ tᵢⱼ ≤ 1 for all i,j
2. ∑ⱼ tᵢⱼ = 1 for each i (rows sum to 1)
Example (k=3):
          State1  State2  State3
State1 [   0.2     0.5     0.3  ]
State2 [   0.7     0.1     0.2  ]
State3 [   0.4     0.4     0.2  ]
Interpretation:
If currently in State1:
- 20% chance stay in State1
- 50% chance go to State2
- 30% chance go to State3
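The two matrix properties above can be checked mechanically. The following is a minimal Python sketch, not part of the Praat script; the helper name `is_row_stochastic` and the tolerance are assumptions chosen for illustration.

```python
def is_row_stochastic(matrix, tol=1e-9):
    """Return True if every entry lies in [0, 1] and every row sums to 1."""
    for row in matrix:
        if any(p < 0.0 or p > 1.0 for p in row):
            return False
        if abs(sum(row) - 1.0) > tol:
            return False
    return True


# The k=3 example matrix from the text (rows = current state).
T = [
    [0.2, 0.5, 0.3],
    [0.7, 0.1, 0.2],
    [0.4, 0.4, 0.2],
]

print(is_row_stochastic(T))
print(T[0][1])  # P(next = State2 | current = State1)
```

Note that Python lists are 0-based, so State1 is row 0 here.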
Matrix Construction
From observed sequences:
INPUT: State sequence s₁, s₂, ..., sₙ
STEP 1: Count transitions
FOR i = 1 to n-1:
current = sᵢ
next = sᵢ₊₁
count[current, next] += 1
STEP 2: Normalize rows
FOR each state i = 1 to k:
row_sum = ∑ⱼ count[i,j]
IF row_sum > 0:
FOR each state j = 1 to k:
T[i,j] = count[i,j] / row_sum
ELSE:
// Dead state - uniform probabilities
T[i,j] = 1/k for all j
OUTPUT: Transition matrix T
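As an illustration, the counting and normalization steps above might look as follows in Python. This is a sketch of the pseudocode only, not the Praat implementation; the function name `build_transition_matrix` and the sample sequence are made up for the example.

```python
def build_transition_matrix(states, k):
    """Estimate a k x k first-order transition matrix from a state sequence.

    `states` holds integers in 1..k, as in the pseudocode above.
    """
    # STEP 1: count observed transitions between consecutive states.
    counts = [[0] * k for _ in range(k)]
    for current, nxt in zip(states, states[1:]):
        counts[current - 1][nxt - 1] += 1

    # STEP 2: normalize each row; dead states get uniform probabilities.
    matrix = []
    for row in counts:
        row_sum = sum(row)
        if row_sum > 0:
            matrix.append([c / row_sum for c in row])
        else:
            matrix.append([1.0 / k] * k)
    return matrix


sequence = [1, 2, 2, 1, 2, 3]  # hypothetical clustering result
T = build_transition_matrix(sequence, k=3)
for row in T:
    print(row)
```

In this sequence State 3 appears only as the final grain, so it has no observed outgoing transition and receives the uniform row.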
State Sequence Generation
Markov Chain Simulation
Generative algorithm:
INPUT: Transition matrix T, initial state s₀, length L
OUTPUT: State sequence s₁, s₂, ..., s_L
s_current = s₀
FOR time = 1 to L:
STEP 1: Get current row probabilities
probs = T[s_current, :] // row s_current
STEP 2: Sample next state
roll = random(0,1)
cum_sum = 0
s_next = k // fallback in case rounding leaves the final cum_sum just below roll
FOR j = 1 to k:
cum_sum += probs[j]
IF roll ≤ cum_sum:
s_next = j
BREAK
STEP 3: Update and continue
s_current = s_next
OUTPUT s_current
END FOR
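The simulation loop above can be sketched in Python as follows. The names `next_state` and `simulate` are hypothetical, and the sketch uses a strict `<` comparison because Python's `random()` returns values in [0, 1); that detail is a choice made here, not something taken from the script.

```python
import random


def next_state(matrix, current, rng):
    """Sample the next state (1-based) from row `current` of `matrix`."""
    roll = rng.random()  # uniform in [0, 1)
    cum_sum = 0.0
    row = matrix[current - 1]
    for j, p in enumerate(row, start=1):
        cum_sum += p
        if roll < cum_sum:
            return j
    return len(row)  # guard: rounding may leave cum_sum slightly below 1


def simulate(matrix, start, length, seed=None):
    """Generate `length` states by repeatedly sampling from the chain."""
    rng = random.Random(seed)
    sequence = []
    current = start
    for _ in range(length):
        current = next_state(matrix, current, rng)
        sequence.append(current)
    return sequence


T = [
    [0.2, 0.5, 0.3],
    [0.7, 0.1, 0.2],
    [0.4, 0.4, 0.2],
]
print(simulate(T, start=1, length=12, seed=1))
```

Passing a seed makes a run repeatable, which can help when comparing parameter settings.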
Why This Works for Audio
Temporal coherence preservation:
🎵 Musical Interpretation
States as musical elements:
- State 1: Quiet, sparse texture
- State 2: Building intensity
- State 3: Climactic moment
- State 4: Resolution
- State 5: Transitional material
Markov matrix captures:
- Quiet → Building (high probability)
- Building → Climax (high probability)
- Climax → Resolution (high probability)
- Resolution → Quiet (moderate probability)
- Unexpected jumps (low but non-zero probability)
Generated sequences follow natural dramatic arcs
Mathematical Properties
Stationary Distribution
Long-term behavior:
For ergodic Markov chains, there exists a unique stationary distribution π such that:
π = πT
Where π is a probability vector satisfying:
∑ᵢ πᵢ = 1, πᵢ ≥ 0
Interpretation: After many steps, probability of being in state i approaches πᵢ
In audio terms: Long generated sequences will spend proportion πᵢ of time in each texture state
Calculation: π is the left eigenvector of T for eigenvalue 1 (equivalently, an eigenvector of Tᵀ), normalized so its entries sum to 1
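One simple way to approximate π without an eigenvector solver is power iteration: start from a uniform vector and multiply by T until it stops changing. The sketch below assumes an ergodic chain (otherwise it may not converge); the function name, iteration cap, and tolerance are illustrative choices.

```python
def stationary_distribution(matrix, iterations=1000, tol=1e-12):
    """Approximate pi satisfying pi = pi T by repeated multiplication."""
    k = len(matrix)
    pi = [1.0 / k] * k
    for _ in range(iterations):
        # Row vector times matrix: new_pi[j] = sum_i pi[i] * T[i][j].
        new_pi = [sum(pi[i] * matrix[i][j] for i in range(k)) for j in range(k)]
        if max(abs(a - b) for a, b in zip(pi, new_pi)) < tol:
            return new_pi
        pi = new_pi
    return pi


T = [
    [0.2, 0.5, 0.3],
    [0.7, 0.1, 0.2],
    [0.4, 0.4, 0.2],
]
pi = stationary_distribution(T)
print([round(p, 4) for p in pi])
```

Each entry of the result estimates the long-run share of grains drawn from that texture state.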
Chain Classification
Types of Markov chains:
IRREDUCIBLE: Every state reachable from every other state
Audio interpretation: All textures eventually appear
APERIODIC: No deterministic cycles
Audio interpretation: No locked repetitive patterns
ERGODIC: Irreducible + aperiodic
Guaranteed stationary distribution exists
Ideal for infinite audio generation
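Our implementation: The learned chain may not be ergodic, e.g. if a state is only ever followed by itself in the source (absorbing state) or has no observed outgoing transitions (dead state)
Handling: Dead states receive uniform transition probabilities; this fallback does not change absorbing states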
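A learned matrix can be inspected for these properties before generating long outputs. The following Python sketch checks irreducibility with a breadth-first search over transitions of non-zero probability and lists absorbing states; the function names are hypothetical and state indices are 0-based here.

```python
from collections import deque


def reachable_from(matrix, start):
    """Indices of states reachable from `start` (including `start` itself)."""
    seen = {start}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        for j, p in enumerate(matrix[i]):
            if p > 0 and j not in seen:
                seen.add(j)
                queue.append(j)
    return seen


def is_irreducible(matrix):
    """True if every state can reach every other state."""
    k = len(matrix)
    return all(len(reachable_from(matrix, s)) == k for s in range(k))


def absorbing_states(matrix):
    """States whose self-transition probability is exactly 1."""
    return [i for i, row in enumerate(matrix) if row[i] == 1.0]


stuck = [
    [1.0, 0.0],  # state 0 never leaves itself
    [0.5, 0.5],
]
print(is_irreducible(stuck), absorbing_states(stuck))
```

This checks irreducibility only; testing aperiodicity would need an additional step that is not shown.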
Analysis Phase
Granular Decomposition
Non-overlapping Grain Strategy
Discrete temporal units:
Parameters:
grain_size_ms = 80 (default)
No overlap between grains
Calculation:
grain_size_sec = grain_size_ms / 1000
nGrains = floor(total_duration / grain_size_sec)
Extraction:
FOR i = 1 to nGrains:
start_time = (i-1) × grain_size_sec
end_time = i × grain_size_sec
Extract grain: start_time to end_time, Hanning window
Why non-overlapping?
Creates discrete time steps for Markov analysis
Each grain represents one "moment" in sequence
Overlap would create temporal ambiguity
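For illustration, the grain count and boundary times above can be computed as in this Python sketch. The function name `grain_boundaries` is an assumption, and the windowed extraction itself (done by Praat in the script) is not reproduced.

```python
import math


def grain_boundaries(total_duration_sec, grain_size_ms):
    """Return (start, end) times in seconds for each non-overlapping grain."""
    grain_size_sec = grain_size_ms / 1000.0
    n_grains = math.floor(total_duration_sec / grain_size_sec)
    return [
        ((i - 1) * grain_size_sec, i * grain_size_sec)
        for i in range(1, n_grains + 1)
    ]


bounds = grain_boundaries(1.0, 80)
print(len(bounds))  # a 1 s source at 80 ms yields 12 whole grains
print(bounds[0], bounds[-1])
```

Any remainder shorter than one grain at the end of the source (here the last 40 ms) is left out of the analysis.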
Feature Extraction
Four-dimensional feature space:
1. SPECTRAL CENTROID
Measures brightness: higher = more high-frequency content
Critical for timbral characterization
2. SPECTRAL BANDWIDTH
Measures spectral spread: higher = noisier, lower = more focused
Distinguishes noisy vs tonal textures
3. PITCH (F0)
Fundamental frequency: higher = higher pitch
Zero/undefined for unpitched segments
Groups similar pitch ranges
4. HARMONICITY (HNR)
Harmonic-to-noise ratio: higher = more tonal
Primary indicator of musicality vs noisiness
Together: Capture timbre, pitch, noisiness → comprehensive texture description
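To make the first two features concrete, here is a small Python sketch that computes a magnitude-weighted spectral centroid and bandwidth from a naive DFT. It is an illustration under stated assumptions: the weighting by magnitude is a choice made here, Praat's own spectral measures may be defined differently (for example with power weighting), and pitch and harmonicity are not covered.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Naive DFT magnitudes for bins 0..N/2 (adequate for short test signals)."""
    n = len(samples)
    mags = []
    for b in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * cmath.pi * b * t / n)
            for t, x in enumerate(samples)
        )
        mags.append(abs(acc))
    return mags


def centroid_and_bandwidth(samples, sample_rate):
    """Magnitude-weighted mean frequency and spread around it, in Hz."""
    mags = magnitude_spectrum(samples)
    n = len(samples)
    freqs = [b * sample_rate / n for b in range(len(mags))]
    total = sum(mags)
    if total == 0:
        return 0.0, 0.0  # silent grain
    centroid = sum(f * m for f, m in zip(freqs, mags)) / total
    variance = sum(m * (f - centroid) ** 2 for f, m in zip(freqs, mags)) / total
    return centroid, math.sqrt(variance)


sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(64)]
centroid, bandwidth = centroid_and_bandwidth(tone, sample_rate)
print(round(centroid, 1), round(bandwidth, 3))
```

A pure 1000 Hz tone gives a centroid close to 1000 Hz and a bandwidth near zero; a noisy grain would show a much larger bandwidth.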
Clustering Phase
K-Means for Texture States
State learning process:
INPUT: nGrains × 4 feature matrix (normalized)
STEP 1: Initialize
k = number_of_states
Randomly select k grains as initial centroids
STEP 2: Cluster assignment (E-step)
FOR each grain i:
Calculate distance to each centroid
Assign to nearest centroid (min Euclidean distance)
STEP 3: Centroid update (M-step)
FOR each cluster c:
Recalculate centroid as mean of assigned grains
STEP 4: Convergence check
Repeat until no reassignments OR max iterations (10)
OUTPUT:
- Cluster assignments for each grain
- Final centroids (state prototypes)
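The clustering loop above can be sketched in plain Python as follows. This is an illustrative version, not the script's code: the function names, the seeded random initialization, and the rule of keeping an old centroid when a cluster becomes empty are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=10, seed=None):
    """Cluster feature vectors; returns (labels, centroids), labels are 0-based."""
    rng = random.Random(seed)
    # STEP 1: pick k distinct grains as initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iterations):
        # STEP 2: assign each grain to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # STEP 4: stop when no grain changed cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # STEP 3: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Two well-separated groups of toy 2-D feature vectors.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2, seed=0)
print(labels)
```

The same function accepts four-dimensional vectors (centroid, bandwidth, pitch, harmonicity); features on very different scales would need normalizing first so that no single one dominates the distance.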
State Interpretation
What each state represents:
Example: 5-state analysis of piano piece
State 1: Low centroid, medium bandwidth, low pitch, high harmonicity
→ Deep, tonal bass notes
State 2: Medium centroid, low bandwidth, medium pitch, high harmonicity
→ Clear mid-range melodies
State 3: High centroid, high bandwidth, high pitch, medium harmonicity
→ Bright, noisy high register
State 4: Low centroid, high bandwidth, undefined pitch, low harmonicity
→ Percussive attacks, noise bursts
State 5: Medium centroid, medium bandwidth, medium pitch, medium harmonicity
→ Transitional, ambiguous textures
Each state captures a distinct "textural character"
Markov Learning Phase
Transition Probability Calculation
From state sequence to probabilities:
INPUT: State sequence s₁, s₂, ..., sₙ (from clustering)
STEP 1: Count transitions
Create count matrix C[k×k] initialized to 0
FOR i = 1 to n-1:
current = sᵢ
next = sᵢ₊₁
C[current, next] += 1
STEP 2: Handle edge cases
FOR each state i:
IF row_sum(C[i,:]) = 0:
// Dead state - no outgoing transitions observed
// Assign uniform probabilities to avoid getting stuck
C[i,j] = 1 for all j // Will normalize to 1/k
STEP 3: Normalize to probabilities
FOR each state i:
total = sum(C[i,:])
FOR each state j:
T[i,j] = C[i,j] / total
OUTPUT: Transition probability matrix T[k×k]
Matrix Interpretation
Reading the Markov personality:
Example: Environmental recording (forest)
States: 1=Wind, 2=Birds, 3=Leaves, 4=Silence, 5=Rain
Transition Matrix (rows = current state, columns = next state):
        1     2     3     4     5
1 [   0.60  0.05  0.25  0.08  0.02 ]  Wind
2 [   0.10  0.40  0.30  0.15  0.05 ]  Birds
3 [   0.20  0.20  0.35  0.20  0.05 ]  Leaves
4 [   0.15  0.25  0.30  0.20  0.10 ]  Silence
5 [   0.05  0.05  0.10  0.10  0.70 ]  Rain
Interpretation:
- Wind tends to persist (0.60) or transition to leaves (0.25)
- Birds often continue (0.40) or move to leaves (0.30)
- Rain strongly persists (0.70): a "sticky" state, though not absorbing, since it is still left with probability 0.30
- Silence leads to various activities (distributed)
This matrix captures the "ecology" of the forest soundscape
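Self-transition probabilities translate directly into how long a texture tends to last. For a state with self-transition probability p, the number of consecutive grains spent there is geometrically distributed with mean 1 / (1 - p). The Python sketch below applies this to three diagonal entries of the forest example; the function name and the 200 ms grain size are assumptions for illustration.

```python
def expected_dwell_grains(self_transition_probability):
    """Mean number of consecutive grains spent in a state before leaving it."""
    if self_transition_probability >= 1.0:
        return float("inf")  # a state with p = 1 is never left
    return 1.0 / (1.0 - self_transition_probability)


grain_size_sec = 0.2  # assumed grain size for this example
for name, p in [("Wind", 0.60), ("Birds", 0.40), ("Rain", 0.70)]:
    grains = expected_dwell_grains(p)
    print(f"{name}: {grains:.2f} grains, about {grains * grain_size_sec:.2f} s")
```

With these numbers, rain episodes average a little over three grains, so they persist longer than wind or birds but still end.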
Synthesis Phase
Generative Weaving
State-Based Grain Selection
Texture-appropriate sampling:
For each time step:
STEP 1: Get current state s
STEP 2: Access grain pool for state s
grains_in_state = all grains assigned to cluster s
STEP 3: Random selection
IF grains_in_state not empty:
random_index = random(1, count_in_state_s)
selected_grain = grains_in_state[random_index]
ELSE:
// Fallback: select from any state
selected_grain = random grain from any state
STEP 4: Extract and process grain
Extract audio segment for selected_grain
Apply Hanning window for smooth boundaries
Why state-based selection?
Ensures spectral coherence within each moment
Generated audio always "makes sense" timbrally
Preserves source texture vocabulary
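The selection steps above, including the empty-pool fallback, might be sketched in Python like this. `build_grain_pools` and `select_grain` are hypothetical names, and the audio extraction and windowing are left to Praat.

```python
import random


def build_grain_pools(labels, k):
    """Map each state (1..k) to the 1-based indices of grains assigned to it."""
    pools = {state: [] for state in range(1, k + 1)}
    for grain_index, state in enumerate(labels, start=1):
        pools[state].append(grain_index)
    return pools


def select_grain(pools, state, n_grains, rng):
    """Pick a random grain from the state's pool, or any grain if it is empty."""
    pool = pools.get(state, [])
    if pool:
        return rng.choice(pool)
    return rng.randint(1, n_grains)  # fallback: select from any state


labels = [1, 2, 1, 3, 2]  # hypothetical state per source grain
pools = build_grain_pools(labels, k=4)
rng = random.Random(0)
print(pools)
print(select_grain(pools, 1, len(labels), rng))
```

Building the pools once after clustering avoids rescanning all grain labels at every time step.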
Markov State Transition
Probabilistic progression:
INPUT: Current state s, transition matrix T
STEP 1: Get transition probabilities
probs = T[s, :] // row s of transition matrix
STEP 2: Cumulative distribution
cum_probs[1] = probs[1]
FOR j = 2 to k:
cum_probs[j] = cum_probs[j-1] + probs[j]
STEP 3: Random sampling
roll = random(0, 1)
FOR j = 1 to k:
IF roll ≤ cum_probs[j]:
next_state = j
BREAK
OUTPUT: next_state (for next time step)
Note: Ensures transitions follow learned probabilities
Some transitions likely, others rare but possible
Creates natural, source-like progression
Block Management System
Memory-Efficient Processing
Why block-based synthesis?
Problem: Long outputs require many grains
Example: 60s output ÷ 100ms grains = 600 grains
Each grain = temporary Sound object
600 objects = memory issues in Praat
Solution: Process in blocks
block_size = 50 grains
Process block → concatenate → store → clear memory
Repeat until output duration reached
Calculation:
grains_needed = ceil(output_duration / grain_size_sec)
n_blocks = ceil(grains_needed / block_size)
Memory usage: ~50 temporary objects at once vs 600
Block Processing Algorithm
Initialize:
total_grains_generated = 0
block_list = []
WHILE total_grains_generated < grains_needed:
STEP 1: Generate block
current_block = []
FOR i = 1 to min(block_size, grains_needed - total_grains_generated):
// State-based grain selection + Markov transition
grain = generate_grain(current_state)
current_block.append(grain)
total_grains_generated += 1
STEP 2: Process block
concatenate all grains in current_block → block_sound
block_list.append(block_sound)
clear temporary grains
STEP 3: Final assembly
concatenate all blocks in block_list → final_output
normalize final_output
rename final_output
Output: Single Sound object of desired duration
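The block arithmetic can be illustrated with a short Python sketch. It assumes grain_size_ms is a whole number of milliseconds and does the ceiling division on integers, because dividing floating-point seconds can land just above or below a whole number and shift the result by one grain. The function name `plan_blocks` is hypothetical.

```python
def plan_blocks(output_duration_sec, grain_size_ms, block_size=50):
    """Return (grains_needed, list of block sizes) for a requested duration."""
    output_ms = round(output_duration_sec * 1000)
    grains_needed = -(-output_ms // grain_size_ms)  # integer ceiling division
    sizes = []
    remaining = grains_needed
    while remaining > 0:
        size = min(block_size, remaining)  # last block may be shorter
        sizes.append(size)
        remaining -= size
    return grains_needed, sizes


print(plan_blocks(60.0, 100))
print(plan_blocks(15.0, 80))
```

For the defaults (15 s output, 80 ms grains) this gives 188 grains in four blocks, the last holding 38 grains.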
Complete Synthesis Pipeline
INPUT: Learned states + Markov matrix + Source audio
INITIALIZATION:
current_state = random(1, k)
grains_needed = ceil(output_duration / grain_size)
Initialize block system
GENERATION LOOP:
WHILE grains_generated < grains_needed:
// Grain generation
grain_index = random selection from state_current grains
Extract grain: (grain_index-1)×grain_size to grain_index×grain_size
Apply Hanning window
Add to current block
// State transition
roll = random(0,1)
cum_prob = 0
FOR next_state = 1 to k:
cum_prob += Markov[current_state, next_state]
IF roll ≤ cum_prob:
current_state = next_state
BREAK
grains_generated += 1
// Block management
IF block full OR generation complete:
Concatenate block → store → clear
FINALIZATION:
Concatenate all blocks → output
Normalize peak to 0.99
Rename to "originalname_MarkovWeave"
Parameters & Settings
Analysis Parameters
| Parameter | Type | Default | Description |
| grain_size_ms | positive | 80 | Duration of analysis grains |
| number_of_states | integer | 5 | K-means cluster count |
Synthesis Parameters
| Parameter | Type | Default | Description |
| output_duration_sec | positive | 15.0 | Duration of generated output |
Output Parameters
| Parameter | Type | Default | Description |
| play_result | boolean | 1 | Auto-play after generation |
Parameter Guidance
Grain size selection:
- 20-50ms: Very granular, abstract, good for micro-sounds
- 80-150ms: Balanced, captures note-level events and short gestures, recommended
- 200-500ms: Macroscopic, preserves longer patterns
- >500ms: Sectional, for very long-form structure
State count strategy:
- 3-5 states: Broad characterization, good for simple sources
- 6-10 states: Detailed modeling, for complex musical pieces
- 11-15 states: Fine-grained analysis, for very diverse sources
- >15 states: Over-segmentation, usually unnecessary
Rule of thumb: nStates ≈ sqrt(nGrains/10)
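As a worked example of the rule of thumb, the Python sketch below derives a suggested state count from source duration and grain size. The function name and the lower bound of 2 states are assumptions added for the example; the result is a starting point, not a prescription.

```python
import math


def suggested_states(source_duration_sec, grain_size_ms, minimum=2):
    """Apply nStates ~ sqrt(nGrains / 10), rounded to the nearest integer."""
    n_grains = math.floor(source_duration_sec * 1000 / grain_size_ms)
    return max(minimum, round(math.sqrt(n_grains / 10)))


# 15 s source, 80 ms grains: 187 grains, sqrt(18.7) is about 4.3
print(suggested_states(15.0, 80))
# 120 s source, 120 ms grains: 1000 grains, sqrt(100) = 10
print(suggested_states(120.0, 120))
```

Longer sources support more states because each state then still has enough grains and transitions to estimate its row of the matrix.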
Output duration considerations:
- Short (5-15s): Quick testing, demonstration
- Medium (30-60s): Musical phrases, complete ideas
- Long (2-5min): Extended compositions, ambient beds
- Very long (>5min): Installation pieces, requires patience
Applications
Generative Music Systems
Use case: Creating endless music in the style of a composer
Technique: Analyze existing compositions, generate new sequences
Example: Bach chorales → infinite Baroque-style counterpoint
Soundscape Composition
Use case: Generating realistic environmental soundscapes
Technique: Analyze field recordings, generate infinite variations
Examples: Forest, city, ocean, rainforest soundscapes
Algorithmic Accompaniment
Use case: Generating responsive background textures for live performance
Technique: Analyze performer's style, generate complementary material
Advantages: Always stylistically appropriate, never repeats exactly
Music Analysis Tool
Use case: Understanding compositional style through Markov modeling
Technique: Analyze transition matrices to identify stylistic fingerprints
Insights: Repetition patterns, phrase structure, dramatic arcs
Practical Workflow Examples
🎹 Infinite Piano Music
Goal: Generate endless music in specific composer's style
Settings:
- Source: 2-minute piano piece
- Grain size: 120ms (captures notes/phrases)
- States: 8 (detailed style capture)
- Output: 300.0 (5-minute generation)
Result: 5-minute piece in same style as source
🌳 Dynamic Forest Soundscape
Goal: Generate realistic forest that never repeats
Settings:
- Source: 10-minute forest recording
- Grain size: 200ms (environmental events)
- States: 6 (bird, wind, leaves, etc.)
- Output: 1800.0 (30-minute soundscape)
Result: Half-hour forest soundscape with natural flow
🎭 Dramatic Arc Generation
Goal: Create music with narrative tension/release
Settings:
- Source: Film score with clear emotional arc
- Grain size: 100ms (emotional micro-moments)
- States: 7 (calm, building, tension, climax, etc.)
- Output: 120.0 (2-minute dramatic piece)
Result: New composition with similar emotional journey
Advanced Techniques
Multi-scale Markov modeling:
- Stage 1: Analyze with large grains (500ms) for macro-structure
- Stage 2: Analyze with small grains (50ms) for micro-structure
- Stage 3: Combine models for hierarchical generation
- Result: Coherent at both phrase and note levels
Interactive Markov chains:
- Real-time modification: Adjust transition probabilities during generation
- State forcing: Manually select next state for compositional control
- Probability morphing: Interpolate between different Markov matrices
- Result: Hybrid human-AI composition system
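Probability morphing can be sketched as a weighted average of two matrices of the same size. Because each row of both inputs sums to 1, every row of the blend also sums to 1, so the result is still a valid transition matrix. The Python below is an illustration only; `morph_matrices` is a hypothetical name and this feature is not part of the script as described.

```python
def morph_matrices(a, b, weight):
    """Blend two k x k transition matrices; weight=0 gives a, weight=1 gives b."""
    return [
        [(1.0 - weight) * pa + weight * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


static = [[1.0, 0.0], [0.0, 1.0]]       # always stay in the current state
alternating = [[0.0, 1.0], [1.0, 0.0]]  # always switch state
blend = morph_matrices(static, alternating, 0.25)
print(blend)
```

Sweeping the weight from 0 to 1 over the course of a generation would move gradually from one matrix's behavior to the other's.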
Troubleshooting Common Issues
Problem: Output too repetitive/stuck
Cause: Markov matrix has absorbing states or high self-transition probabilities
Solution: Increase state count, use source with more variation, adjust matrix manually
Problem: Output too random/chaotic
Cause: Source lacks clear temporal patterns, too many states
Solution: Use more structured source, reduce state count, increase grain size
Problem: Clicks at grain boundaries
Cause: Spectral discontinuity between adjacent grains
Solution: Ensure Hanning window applied, consider small overlap (modify script)
Problem: Memory errors during long generation
Cause: Too many temporary objects, insufficient RAM
Solution: Reduce block size (requires editing the script; the 50-grain block size is not a form parameter), use shorter output, close other applications
Algorithmic Extensions
Higher-Order Markov Chains
Beyond First-Order
N-th order Markov processes:
Second-order Markov:
P(Xₙ₊₁ = x | Xₙ = xₙ, Xₙ₋₁ = xₙ₋₁)
Third-order Markov:
P(Xₙ₊₁ = x | Xₙ = xₙ, Xₙ₋₁ = xₙ₋₁, Xₙ₋₂ = xₙ₋₂)
Implementation:
States become tuples: (current, previous, ...)
State space grows exponentially: kⁿ for order n
Advantage: Captures longer-term dependencies
Musical phrases, harmonic progressions
Narrative arcs, dramatic development
Disadvantage: Data requirements grow rapidly
Need much longer training sequences
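A second-order model can be sketched by keying the transition counts on the pair (previous, current) instead of the current state alone. The Python below is a minimal illustration with a made-up sequence; `build_second_order_model` is a hypothetical name, and contexts never seen in the data are simply absent from the result.

```python
from collections import defaultdict


def build_second_order_model(states):
    """Map each (previous, current) context to next-state probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, cur, nxt in zip(states, states[1:], states[2:]):
        counts[(prev, cur)][nxt] += 1

    model = {}
    for context, nexts in counts.items():
        total = sum(nexts.values())
        model[context] = {s: c / total for s, c in nexts.items()}
    return model


sequence = [1, 2, 3, 1, 2, 1, 1, 2, 3]
model = build_second_order_model(sequence)
for context, probs in sorted(model.items()):
    print(context, probs)
```

With k states there are up to k² such contexts, which is why the amount of source material needed grows quickly with the order.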
Variable-Length Markov Models
Adaptive Context Length
Context-tree modeling:
Idea: Use longer context when informative, shorter when not
Implementation:
Build context tree of variable-length sequences
Each node = conditional probability distribution
Prune branches with insufficient data
Advantages:
Adapts to source complexity
Captures both short and long patterns
More efficient than fixed-order models
Application:
Music with mixed temporal scales
Speech with phrase-level and word-level structure
Hidden Markov Models
Latent State Modeling
Beyond observable states:
HMM components:
Hidden states: True underlying process (unobserved)
Observations: Measurable features (what we see/hear)
Transition probabilities: Between hidden states
Emission probabilities: From hidden states to observations
In our context:
Hidden states = abstract musical intentions
Observations = audio features (centroid, pitch, etc.)
Learning: Baum-Welch algorithm
Advantage: Models underlying structure, not just surface
Captures musical intention behind acoustic surface
More robust to performance variations