Neural Audio Mosaic — User Guide
Content-based audio reconstruction: uses neural feature matching to reconstruct Target audio using only grains from Source audio, creating hybrid sounds that preserve Target structure with Source texture.
What this does
This script implements an audio mosaic system that reconstructs one audio file (Target) using only short grains taken from another audio file (Source). For each Target grain, it compares MFCC and pitch features against a random sample of Source grains and keeps the closest match. The result preserves the structure of the Target (timing, phrasing, duration) while taking on the texture of the Source.
Key Features:
- Dual-Input Processing – Target provides structure, Source provides texture
- 13-Dimensional Feature Analysis – MFCC coefficients + pitch for comprehensive matching
- Stochastic Neural Search – Efficient probabilistic matching with adjustable search depth
- Weighted Feature Matching – Separate control over spectral and pitch similarity
- Memory-Efficient Granular Synthesis – Block-based processing for stability
- Content-Based Reconstruction – Preserves Target timing and structure
- Professional Output – Peak-normalized results with smooth grain concatenation
Technical Implementation:
(1) Dual Analysis: Extract 12 MFCC coefficients + logarithmic pitch from both Target and Source.
(2) Feature Normalization: Apply min-max scaling to all features for balanced comparison.
(3) Stochastic Search: For each Target grain, probe multiple random Source grains to find best match.
(4) Weighted Distance: Combine spectral and pitch similarity with adjustable weights.
(5) Block Assembly: Reconstruct Target timeline using matched Source grains with memory-safe concatenation.
Key insight: The "neural" aspect comes from using MFCC features (originally developed for speech recognition) that approximate human auditory perception, enabling perceptually meaningful audio matching.
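The block assembly step can be pictured with a minimal Python sketch. This is an illustration only, not the script's Praat code: the function name `assemble_in_blocks` and the `block_size` value are hypothetical, and grains are assumed to be plain lists of samples. The idea is to join grains into intermediate blocks first, so that no single step has to handle every grain at once.

```python
def assemble_in_blocks(grains, block_size=100):
    """Concatenate grains (lists of samples) block by block.

    block_size is a hypothetical setting: the number of grains joined
    into one intermediate block before the blocks are joined together.
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")

    # Pass 1: join each run of block_size grains into one block.
    blocks = []
    for start in range(0, len(grains), block_size):
        block = []
        for grain in grains[start:start + block_size]:
            block.extend(grain)
        blocks.append(block)

    # Pass 2: join the (much fewer) blocks into the final mosaic.
    mosaic = []
    for block in blocks:
        mosaic.extend(block)
    return mosaic
```

The output is identical to a single-pass concatenation; only the number of objects handled at any one time changes.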
Quick start
- In Praat, select exactly two Sound objects in this order:
  - First selection: Target sound (provides structure/timing)
  - Second selection: Source sound (provides texture/grains)
- Run script… → neural_audio_mosaic.praat.
- Granular Parameters:
  - grain_size_ms – Duration of each analysis grain (default: 50ms)
  - overlap_ratio – Overlap between grains (default: 0.0 = no overlap)
- Search Parameters:
  - search_probes – Number of random Source grains to test per Target grain (default: 50)
  - pitch_weight – Importance of pitch matching (default: 0.3)
  - spectral_weight – Importance of spectral/timbre matching (default: 1.0)
- Output Options:
  - normalize_volume – Apply peak normalization (recommended: 1)
  - play_result – Auto-play after processing
- Click OK – processing begins with progress updates in Info window.
- The script executes five phases:
  - Setup & Validation – Check file compatibility
  - Feature Extraction – Analyze both Target and Source
  - Neural Matching – Find best Source matches for Target grains
  - Granular Reconstruction – Build mosaic from matched grains
  - Final Assembly – Combine blocks into final result
- Result appears as "TargetName_Mosaic" with the structure of Target and texture of Source.
Mosaic Concept
Dual-Role Audio Processing
🎵 Target + Source = Mosaic
Target Role: Provides structural blueprint – timing, phrasing, duration
Source Role: Provides sonic material – timbre, texture, grain content
Mosaic Result: Target's structure reconstructed with Source's texture
Creative Potential: Infinite hybrid combinations from existing audio
Musical Analogy
Target = Musical Score
- Defines the structure: timing, rhythm, phrasing
- Provides the "what" and "when"
- Like sheet music defining notes and durations
Source = Instrument/Voice
- Provides the sound quality: timbre, texture, character
- Provides the "how" and "with what sound"
- Like choosing violin vs trumpet for the same score
Mosaic = Performance
- Same structure performed with different sound
- Target's composition with Source's instrumentation
- Like playing Beethoven on electric guitar
Ideal Material Combinations
| Target Material | Source Material | Result Character | Parameter Tips |
|---|---|---|---|
| Speech | Speech | Voice transformation | High pitch_weight, medium grain_size |
| Speech | Music | Singing speech | Balanced weights, small grain_size |
| Music | Environmental | Nature music | High spectral_weight, varied grain_size |
| Music | Music | Instrument transfer | Pitch_weight ~0.5, medium search_probes |
| Percussion | Noise textures | Textural rhythms | High spectral_weight, low overlap |
Feature Analysis
MFCC Feature Extraction
🔬 Perceptual Feature Analysis
MFCCs: Mel-Frequency Cepstral Coefficients – psychoacoustically motivated features
Pitch: Logarithmic fundamental frequency for musical matching
Normalization: Min-max scaling for fair feature comparison
Dimensionality: 13 features total (12 MFCC + 1 pitch)
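To see why pitch is stored logarithmically, consider the minimal Python sketch below. It is an illustration, not the script's Praat code. The function name, the base-2 logarithm, and the fallback value for unvoiced grains are all assumptions.

```python
import math


def log_pitch_feature(f0_hz, unvoiced_value=0.0):
    """Return log2 of the fundamental frequency.

    unvoiced_value is an assumed placeholder for grains where no pitch
    was detected (f0 missing or not positive).
    """
    if f0_hz is None or f0_hz <= 0:
        return unvoiced_value
    return math.log2(f0_hz)


# An octave is the same distance anywhere on the log scale:
low_octave = log_pitch_feature(220.0) - log_pitch_feature(110.0)
high_octave = log_pitch_feature(440.0) - log_pitch_feature(220.0)
```

On a linear Hz scale the two octaves would differ by 110 Hz and 220 Hz; on the log scale both intervals are 1, which matches how musical intervals are heard.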
MFCC Mathematics
Mel-frequency cepstral analysis:
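As a rough illustration of the standard MFCC recipe (not the script's Praat code), the Python sketch below shows the two core steps: mapping frequency to the mel scale, and taking a discrete cosine transform (DCT-II) of the log mel-band energies. The mel formula used here is one common variant and is an assumption; Praat's own analysis may use different constants. The function names and the choice to skip coefficient 0 are also assumptions.

```python
import math


def hz_to_mel(frequency_hz):
    """Map Hz to mels using one widely used formula (an assumption here)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def cepstral_coefficients(log_band_energies, n_coeffs=12):
    """DCT-II of log mel-band energies, skipping coefficient 0 (overall level)."""
    n_bands = len(log_band_energies)
    coeffs = []
    for k in range(1, n_coeffs + 1):
        total = 0.0
        for j, energy in enumerate(log_band_energies):
            total += energy * math.cos(math.pi * k * (j + 0.5) / n_bands)
        coeffs.append(total)
    return coeffs


# A perfectly flat spectrum has no envelope shape, so every coefficient
# from k = 1 upward comes out as (numerically) zero.
flat = cepstral_coefficients([3.0] * 26)
```

Each coefficient index k measures how strongly the spectral envelope varies at a given rate across the mel bands: low k captures broad shape, high k captures fine ripple.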
13-Dimensional Feature Space
| Feature Group | Features | Description | Perceptual Correlate |
|---|---|---|---|
| MFCC 1-2 | 2 features | Overall spectral shape | Brightness, spectral tilt |
| MFCC 3-5 | 3 features | Medium-scale spectral envelope shape | Formant structure, resonance |
| MFCC 6-8 | 3 features | Finer spectral envelope detail | Noise character, frication |
| MFCC 9-12 | 4 features | Fine spectral detail | Timbre nuances, articulation |
| Pitch | 1 feature | Logarithmic fundamental frequency | Perceived pitch, tonality |
Feature Normalization
Min-Max Scaling
Manual normalization implementation:
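Min-max scaling maps each feature to the range 0 to 1 using x' = (x - min) / (max - min). A minimal Python sketch follows, as an illustration and not the script's Praat code. The function name is hypothetical, and whether the script computes the minimum and maximum over Target and Source together or separately is not stated here, so the sketch simply normalizes whatever frames it is given.

```python
def min_max_normalize(frames):
    """Scale each feature column of `frames` (a list of feature lists) to 0..1."""
    if not frames:
        return []
    n_features = len(frames[0])
    lows = [min(frame[i] for frame in frames) for i in range(n_features)]
    highs = [max(frame[i] for frame in frames) for i in range(n_features)]

    normalized = []
    for frame in frames:
        row = []
        for value, low, high in zip(frame, lows, highs):
            span = high - low
            # A constant feature carries no information; map it to 0.0
            # instead of dividing by zero.
            row.append((value - low) / span if span > 0 else 0.0)
        normalized.append(row)
    return normalized
```

Without this step, a feature with a large numeric range (such as the first MFCC) would dominate the distance calculation regardless of the chosen weights.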
Neural Matching
Stochastic Search Algorithm
🎯 Efficient Probabilistic Matching
Search strategy: Probe multiple random Source grains per Target grain
Efficiency: Avoids exhaustive search through large Source collections
Quality control: More probes = better matches but slower processing
Weighted distance: Customizable spectral vs pitch importance
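The search strategy can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script's Praat code: the function name is hypothetical, probes are drawn without replacement, and a plain squared Euclidean distance is used unless another distance function is passed in.

```python
import random


def stochastic_best_match(target_feat, source_feats, search_probes=50,
                          distance=None, rng=random):
    """Return the index of the best Source grain among a random subset."""
    if not source_feats:
        raise ValueError("source_feats must not be empty")
    if distance is None:
        # Default: squared Euclidean distance over all features.
        def distance(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

    # Never probe more grains than exist, and always probe at least one.
    n_probes = max(1, min(search_probes, len(source_feats)))
    candidates = rng.sample(range(len(source_feats)), n_probes)
    return min(candidates, key=lambda i: distance(target_feat, source_feats[i]))
```

When `search_probes` is at least the number of Source grains, the sketch degenerates into an exhaustive search; with fewer probes it trades match quality for speed.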
Matching Mathematics
Weighted Euclidean distance calculation:
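One plausible form of the weighted distance is sketched below in Python. It is an assumption, not the script's confirmed formula: the weights are taken to scale the squared spectral and squared pitch differences before the square root, and each feature vector is assumed to hold 12 normalized MFCCs followed by 1 normalized pitch value.

```python
import math


def weighted_distance(target, source, spectral_weight=1.0, pitch_weight=0.3,
                      n_mfcc=12):
    """Weighted Euclidean distance between two 13-dimensional feature vectors."""
    spectral = sum((t - s) ** 2
                   for t, s in zip(target[:n_mfcc], source[:n_mfcc]))
    pitch = (target[n_mfcc] - source[n_mfcc]) ** 2
    return math.sqrt(spectral_weight * spectral + pitch_weight * pitch)


# Two grains with identical timbre but maximally different pitch:
same_timbre_a = [0.0] * 13
same_timbre_b = [0.0] * 12 + [1.0]
```

With `pitch_weight=0.0` the two example grains are a perfect match; with `spectral_weight=0.0, pitch_weight=1.0` they are one full unit apart. The default weights fall in between.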
Search Quality vs Speed
Probe Count Optimization
| Search Probes | Matching Quality | Processing Time | Recommended Use |
|---|---|---|---|
| 10-20 | Basic, approximate | Very fast | Quick experiments, very large files |
| 30-50 | Good, musical | Moderate | General purpose, most applications |
| 60-100 | Very good, precise | Slow | High-quality results, important projects |
| 100-200 | Excellent, near-optimal | Very slow | Critical applications, small files |
| 200+ | Diminishing returns | Extremely slow | Special cases only |
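The diminishing returns in the table can be explained with a short calculation. If probes are treated as independent uniform draws (a simplifying assumption), the chance that at least one of k probes lands among the best fraction p of Source grains is 1 - (1 - p)^k. The Python sketch below uses a hypothetical function name and an illustrative p of 1%.

```python
def chance_of_good_probe(search_probes, top_fraction=0.01):
    """Probability that at least one probe hits the top `top_fraction` of grains."""
    return 1.0 - (1.0 - top_fraction) ** search_probes


# Gain from adding 50 probes at two different starting points.
gain_50_to_100 = chance_of_good_probe(100) - chance_of_good_probe(50)
gain_150_to_200 = chance_of_good_probe(200) - chance_of_good_probe(150)
```

Under these assumptions, 50 probes give roughly a 39% chance of hitting a top-1% grain. Adding another 50 probes helps considerably more at that point than the same 50 probes added on top of 150.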
Weight Optimization
Balancing Spectral vs Pitch Matching
spectral_weight = 1.0, pitch_weight = 0.0
- Matches based only on timbre/spectral character
- Ignores pitch completely
- Good for: Noise textures, unpitched sounds
- Result: Timbre transfer regardless of pitch
spectral_weight = 1.0, pitch_weight = 0.3
- Primarily spectral matching with pitch guidance
- Default balanced setting
- Good for: Most musical and vocal material
- Result: Natural-sounding hybrids
spectral_weight = 0.5, pitch_weight = 1.0
- Primarily pitch matching with spectral guidance
- Emphasizes melodic/harmonic content
- Good for: Pitched instruments, melodic lines
- Result: Pitch-preserving texture transfer
spectral_weight = 0.0, pitch_weight = 1.0
- Matches based only on pitch
- Ignores timbre completely
- Good for: Pure pitch-based experiments
- Result: Melodic reconstruction with random timbres
Parameters
Granular Parameters
| Parameter | Type | Range | Default | Description |
|---|---|---|---|---|
| grain_size_ms | positive | 20-200 ms | 50 ms | Duration of each analysis and synthesis grain |
| overlap_ratio | real | 0.0-0.9 | 0.0 | Overlap between consecutive grains (0.0 = no overlap) |
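The two granular parameters together determine how many grains are analyzed. The Python sketch below assumes the usual relationship hop = grain size × (1 - overlap_ratio) and counts only grains that fit entirely inside the sound. The script's exact handling of the final partial grain is not documented here, and the function name is hypothetical.

```python
import math


def grain_layout(duration_s, grain_size_ms=50.0, overlap_ratio=0.0):
    """Return (hop in seconds, number of whole grains) for a sound."""
    if not 0.0 <= overlap_ratio < 1.0:
        raise ValueError("overlap_ratio must be in [0.0, 1.0)")
    grain_s = grain_size_ms / 1000.0
    hop_s = grain_s * (1.0 - overlap_ratio)
    if duration_s < grain_s:
        return hop_s, 0
    # Small epsilon guards against floating-point results like 198.9999999.
    n_grains = int(math.floor((duration_s - grain_s) / hop_s + 1e-9)) + 1
    return hop_s, n_grains
```

For a 10-second Target, the defaults give 200 grains, while an overlap_ratio of 0.5 nearly doubles that to 399, with a matching increase in processing time.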
Search Parameters
| Parameter | Type | Range | Default | Description |
|---|---|---|---|---|
| search_probes | integer | 10-500 | 50 | Number of random Source grains tested per Target grain |
| pitch_weight | positive | 0.0-2.0 | 0.3 | Importance of pitch matching in similarity calculation |
| spectral_weight | positive | 0.0-2.0 | 1.0 | Importance of spectral/timbre matching in similarity calculation |
Output Parameters
| Parameter | Type | Range | Default | Description |
|---|---|---|---|---|
| normalize_volume | boolean | 0/1 | 1 | Apply peak normalization to output (recommended) |
| play_result | boolean | 0/1 | 1 | Auto-play result after processing |
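Peak normalization scales the whole signal by one gain factor so that its largest absolute sample reaches a chosen level. A minimal Python sketch follows, as an illustration and not the script's Praat code. The target level of 0.99 and the function name are assumptions.

```python
def peak_normalize(samples, target_peak=0.99):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(x) for x in samples), default=0.0)
    if peak == 0.0:
        # Silence (or an empty signal) cannot be scaled; return it unchanged.
        return list(samples)
    gain = target_peak / peak
    return [x * gain for x in samples]
```

Because a single gain is applied to every sample, the relative dynamics of the mosaic are unchanged; only the overall level moves.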
Parameter Effects Guide
grain_size_ms Effects:
20-40 ms: Very granular, micro-sound character
40-80 ms: Standard granular synthesis
80-120 ms: Smooth, more continuous results
120-200 ms: Almost sampling-like, clear source identity
overlap_ratio Effects:
0.0: No overlap, potentially choppy
0.1-0.3: Slight overlap, some smoothness
0.3-0.5: Moderate overlap, smooth transitions
0.5-0.7: Heavy overlap, very smooth
0.7-0.9: Extreme overlap, almost continuous
search_probes Effects:
10-20: Fast but approximate matches
30-50: Good balance of speed/quality
60-100: High quality, noticeable improvement
100+: Excellent quality, diminishing returns
Weight Ratio Effects (pitch:spectral):
0.0:1.0 → Pure timbre matching
0.3:1.0 → Balanced (default)
0.7:1.0 → Pitch-emphasized
1.0:0.0 → Pure pitch matching
Applications
Voice Transformation
Use case: Transforming one speaker's voice to sound like another
Technique: Use speech as both Target and Source with different speakers
Examples: Voice conversion, accent modification, vocal character transfer
Instrument Hybridization
Use case: Creating new hybrid instruments from existing ones
Technique: Use instrumental recordings as Target and Source
Results: Piano that sounds like guitar, violin with flute character, etc.
Textural Composition
Use case: Applying environmental textures to musical structures
Technique: Use music as Target, environmental sounds as Source
Applications: Nature music, soundscape composition, textural scores
Creative Sound Design
Use case: Generating novel sounds from existing audio material
Technique: Experiment with contrasting Target/Source combinations
Results: Unconventional hybrids, experimental textures, sound art
Practical Workflow Examples
🎤 Voice Transformation
Goal: Make Speaker A sound like Speaker B
Settings:
- Target: Speaker A recording
- Source: Speaker B recording
- grain_size_ms: 40 (capture phoneme detail)
- search_probes: 80 (high quality for voice)
- pitch_weight: 0.4, spectral_weight: 1.0
- overlap_ratio: 0.2 (smooth speech)
Result: Speaker A's speech with Speaker B's vocal character and timbre.
🎹 Piano to String Ensemble
Goal: Transform piano piece into string ensemble version
Settings:
- Target: Piano recording
- Source: String ensemble recording
- grain_size_ms: 60 (musical phrase capture)
- search_probes: 60 (good musical matching)
- pitch_weight: 0.6, spectral_weight: 1.0
- overlap_ratio: 0.3 (smooth musical lines)
Result: Piano composition performed with string ensemble timbre and texture.
🌊 Ocean Rhythm
Goal: Apply ocean texture to drum rhythm
Settings:
- Target: Drum loop
- Source: Ocean waves recording
- grain_size_ms: 80 (waveform capture)
- search_probes: 40 (texture doesn't need precision)
- pitch_weight: 0.1, spectral_weight: 1.0
- overlap_ratio: 0.0 (rhythmic clarity)
Result: Drum rhythm patterns expressed through ocean wave sounds.
Advanced Techniques
- Cross-domain hybrids: Speech Target + Music Source = singing speech
- Texture layering: Process same Target with multiple Sources and mix
- Progressive transformation: Use intermediate Sources for gradual changes
- Extreme contrasts: Combine very different materials for experimental results
Tip: Experiment with unconventional Source materials (field recordings, noise, processed sounds) for unique creative outcomes.
- Extreme grain sizes: 20ms for granular clouds, 150ms for clear sampling
- Weight extremes: Pure spectral or pure pitch matching for special effects
- Overlap variations: From completely separate to completely smooth grains
- Search depth: Low for chaotic results, high for precise reconstruction
Troubleshooting Common Issues
Problem: Poor match quality
Cause: Source too different from Target, insufficient search_probes
Solution: Use more similar materials, increase search_probes, adjust weights
Problem: Slow processing
Cause: Long files, high search_probes, many grains
Solution: Use shorter selections, reduce search_probes, increase grain size
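A rough cost model shows why these remedies work. Assuming one distance calculation per probe (an assumption about the script's inner loop), the matching work is about Target grains × search_probes, compared with Target grains × Source grains for an exhaustive search. The function name in this Python sketch is hypothetical.

```python
def matching_cost(n_target_grains, n_source_grains, search_probes=50):
    """Return (stochastic comparisons, exhaustive comparisons)."""
    probes = min(search_probes, n_source_grains)
    stochastic = n_target_grains * probes
    exhaustive = n_target_grains * n_source_grains
    return stochastic, exhaustive


# Example: 60 s Target and 5 min Source, both cut into 50 ms grains.
stochastic, exhaustive = matching_cost(1200, 6000, search_probes=50)
```

In this example the stochastic search needs 60,000 comparisons instead of 7,200,000. Halving search_probes or doubling the grain size each cuts the stochastic figure roughly in half.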
Problem: Choppy or clicking output
Cause: No overlap between grains, abrupt transitions
Solution: Increase overlap_ratio, use smaller grain sizes, try Hanning windows
Problem: Memory errors or crashes
Cause: Very long files, too many grains, system limits
Solution: Use shorter Target files, increase grain size, close other applications