Neural Audio Mosaic — User Guide

Content-based audio reconstruction: uses neural feature matching to reconstruct Target audio using only grains from Source audio, creating hybrid sounds that preserve Target structure with Source texture.

Author: Shai Cohen
Affiliation: Department of Music, Bar-Ilan University, Israel
Version: 0.1 (2025)
License: MIT License
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

What this does

This script implements a neural audio mosaic system that reconstructs one audio file (Target) using only small grains from another audio file (Source). Through sophisticated feature analysis and stochastic search, it finds the best-matching Source grains for each Target grain, creating a hybrid audio result that preserves the structural characteristics of the Target while adopting the textural qualities of the Source.


What is neural audio mosaicing?

Traditional audio mosaicing replaces Target segments with similar-sounding Source segments. Neural audio mosaicing uses machine-learning features to find the most perceptually similar Source grains for each Target grain.

The system works through four phases:
  1. Feature extraction – analyze both Target and Source using MFCC and pitch features
  2. Normalization – scale features for fair comparison
  3. Neural matching – find the best Source matches for each Target grain
  4. Granular reconstruction – assemble the matching Source grains in Target sequence

Advantages:
  • Content-aware: matches based on perceptual features
  • Structure preservation: maintains Target timing and phrasing
  • Texture transfer: applies Source sonic character to Target content
  • Creative hybridization: creates entirely new sonic identities from existing material

Technical Implementation:
  1. Dual Analysis – extract 12 MFCC coefficients + logarithmic pitch from both Target and Source
  2. Feature Normalization – apply min-max scaling to all features for balanced comparison
  3. Stochastic Search – for each Target grain, probe multiple random Source grains to find the best match
  4. Weighted Distance – combine spectral and pitch similarity with adjustable weights
  5. Block Assembly – reconstruct the Target timeline from matched Source grains with memory-safe concatenation

Key insight: the "neural" aspect comes from using MFCC features (originally developed for speech recognition) that approximate human auditory perception, enabling perceptually meaningful audio matching.
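The "memory-safe concatenation" in step 5 can be sketched in Python (a hypothetical illustration with grains as plain sample lists, not the plugin's actual Praat code):

```python
def assemble_blocks(grains, block_size=100):
    """Concatenate many short grains via intermediate blocks.

    Appending thousands of tiny grains to one ever-growing result
    re-copies that result on every step; grouping grains into
    fixed-size blocks first keeps each single concatenation small.
    """
    blocks = []
    for start in range(0, len(grains), block_size):
        block = []
        for grain in grains[start:start + block_size]:
            block.extend(grain)  # small, bounded concatenation
        blocks.append(block)
    result = []
    for block in blocks:  # final pass: one concatenation per block
        result.extend(block)
    return result

# toy usage: 250 one-sample "grains"
out = assemble_blocks([[float(i)] for i in range(250)])
print(len(out), out[0], out[-1])  # 250 0.0 249.0
```

The same two-level idea applies in Praat, where repeatedly concatenating one growing Sound object with each new grain is far slower than concatenating grains into blocks and then blocks into the result.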

Quick start

  1. In Praat, select exactly two Sound objects in this order:
    • First selection: Target sound (provides structure/timing)
    • Second selection: Source sound (provides texture/grains)
  2. Run neural_audio_mosaic.praat (Praat menu: Run script…).
  3. Granular Parameters:
    • grain_size_ms – Duration of each analysis grain (default: 50ms)
    • overlap_ratio – Overlap between grains (default: 0.0 = no overlap)
  4. Search Parameters:
    • search_probes – Number of random Source grains to test per Target grain (default: 50)
    • pitch_weight – Importance of pitch matching (default: 0.3)
    • spectral_weight – Importance of spectral/timbre matching (default: 1.0)
  5. Output Options:
    • normalize_volume – Apply peak normalization (recommended: 1)
    • play_result – Auto-play after processing
  6. Click OK – processing begins with progress updates in Info window.
  7. The script executes five phases:
    • Setup & Validation – Check file compatibility
    • Feature Extraction – Analyze both Target and Source
    • Neural Matching – Find best Source matches for Target grains
    • Granular Reconstruction – Build mosaic from matched grains
    • Final Assembly – Combine blocks into final result
  8. Result appears as "TargetName_Mosaic" with the structure of Target and texture of Source.
Quick tip:
  • Choose compatible Target and Source pairs – speech with speech, music with music, or contrasting materials for creative effects.
  • Use 50-80 ms grain sizes for most material.
  • Higher search_probes (100-200) improve matching quality but increase processing time.
  • Balance pitch_weight and spectral_weight based on your materials – for pitch-heavy content, increase pitch_weight; for textural content, increase spectral_weight.
  • The algorithm works best when the Source material is richer and more varied than the Target material.
Important:
  • SELECTION ORDER MATTERS – the first selected sound becomes the Target (structure), the second becomes the Source (texture).
  • Sampling rate compatibility: both files must have the same sampling rate.
  • Source richness: the Source should contain diverse sonic material for effective matching.
  • Processing time: can be significant for long files or high search_probes values.
  • Memory requirements: very long outputs may require substantial memory – use shorter Target files if you experience issues.
  • Stereo handling: both files are converted to mono for analysis – process stereo files by channel if needed.

Mosaic Concept

Dual-Role Audio Processing

🎵 Target + Source = Mosaic

Target Role: Provides structural blueprint – timing, phrasing, duration

Source Role: Provides sonic material – timbre, texture, grain content

Mosaic Result: Target's structure reconstructed with Source's texture

Creative Potential: Infinite hybrid combinations from existing audio

Musical Analogy

Think of it as musical translation:

Target = Musical Score
- Defines the structure: timing, rhythm, phrasing
- Provides the "what" and "when"
- Like sheet music defining notes and durations

Source = Instrument/Voice
- Provides the sound quality: timbre, texture, character
- Provides the "how" and "with what sound"
- Like choosing violin vs trumpet for the same score

Mosaic = Performance
- Same structure performed with different sound
- Target's composition with Source's instrumentation
- Like playing Beethoven on electric guitar

Ideal Material Combinations

Target Material | Source Material | Result Character  | Parameter Tips
Speech          | Speech          | Voice transformation | High pitch_weight, medium grain_size
Speech          | Music           | Singing speech    | Balanced weights, small grain_size
Music           | Environmental   | Nature music      | High spectral_weight, varied grain_size
Music           | Music           | Instrument transfer | pitch_weight ~0.5, medium search_probes
Percussion      | Noise textures  | Textural rhythms  | High spectral_weight, low overlap

Feature Analysis

MFCC Feature Extraction

🔬 Perceptual Feature Analysis

MFCCs: Mel-Frequency Cepstral Coefficients – psychoacoustically motivated features

Pitch: Logarithmic fundamental frequency for musical matching

Normalization: Min-max scaling for fair feature comparison

Dimensionality: 13 features total (12 MFCC + 1 pitch)

MFCC Mathematics

Mel-frequency cepstral analysis:

MFCC extraction pipeline:
  1. Frame blocking – divide the signal into short frames
  2. Windowing – apply a window function (Hanning)
  3. FFT – convert to the frequency domain
  4. Mel filterbank – warp to the mel frequency scale
  5. Log compression – compress the dynamic range
  6. DCT – decorrelate the features → MFCCs

Mel frequency scale: mel(f) = 2595 × log₁₀(1 + f/700)

This approximates human frequency perception:
  • Linear below 1 kHz
  • Logarithmic above 1 kHz
  • Emphasizes perceptually important frequencies

Script implementation: To MFCC: 12, 0.025, stepSec, 100, 100, 0 – extracts 12 coefficients per grain
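The mel formula is easy to check numerically; a minimal Python sketch:

```python
import math

def hz_to_mel(f_hz):
    """Mel scale as defined above: mel(f) = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# roughly linear below 1 kHz, strongly compressed above it
print(round(hz_to_mel(500)))   # 607
print(round(hz_to_mel(1000)))  # 1000
print(round(hz_to_mel(8000)))  # 2840
```

Note how a 500 Hz step near the bottom of the scale (0 → 500 Hz) spans about 607 mel, while the entire 1-8 kHz range spans under 2000 mel: perceptually important low and mid frequencies get proportionally more resolution.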

13-Dimensional Feature Space

Feature Group | Features  | Description                      | Perceptual Correlate
MFCC 1-2      | 2 features | Overall spectral shape          | Brightness, spectral tilt
MFCC 3-5      | 3 features | Mid-frequency characteristics   | Formant structure, resonance
MFCC 6-8      | 3 features | High-frequency detail           | Noise character, frication
MFCC 9-12     | 4 features | Fine spectral detail            | Timbre nuances, articulation
Pitch         | 1 feature  | Logarithmic fundamental frequency | Perceived pitch, tonality

Feature Normalization

Min-Max Scaling

Manual normalization implementation:

FOR each sound (Target and Source):
    FOR each feature column c from 1 to 13:
        // Find min and max
        min_val = very_large_number
        max_val = very_small_number
        FOR each grain r from 1 to n_grains:
            val = Get value: r, c
            IF val < min_val: min_val = val
            IF val > max_val: max_val = val
        // Calculate range
        range = max_val - min_val
        IF range = 0: range = 1 (avoid division by zero)
        // Apply normalization
        FOR each grain r from 1 to n_grains:
            val = Get value: r, c
            normalized = (val - min_val) / range
            Set value: r, c, normalized

This ensures:
  • All features are scaled to the 0-1 range
  • Equal contribution to distance calculations
  • Robustness to different feature value ranges
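The same normalization in runnable form (a Python sketch of the pseudocode, not the plugin's Praat code; the feature matrix here is a hypothetical list of per-grain rows):

```python
def minmax_normalize(features):
    """Column-wise min-max scaling of a grains-by-features matrix,
    mirroring the pseudocode above. Returns a new matrix."""
    n_cols = len(features[0])
    out = [row[:] for row in features]  # copy; leave the input untouched
    for c in range(n_cols):
        col = [row[c] for row in features]
        lo, hi = min(col), max(col)
        rng = hi - lo or 1.0  # constant column: avoid division by zero
        for r in range(len(out)):
            out[r][c] = (out[r][c] - lo) / rng
    return out

# toy matrix: 3 grains x 2 features; second column is constant
feats = [[10.0, 5.0], [20.0, 5.0], [30.0, 5.0]]
print(minmax_normalize(feats))  # [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]
```

A constant column maps to all zeros rather than raising an error, which matches the pseudocode's range = 1 fallback.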

Neural Matching

Stochastic Search Algorithm

Matching Mathematics

Weighted Euclidean distance calculation:

FOR each Target grain i from 1 to n_Target:
    // Get Target feature vector
    FOR c from 1 to 12:
        t_mfcc_c = MFCC_c[i]
    t_pitch = pitch[i]

    // Stochastic search
    best_distance = very_large_number
    best_match = 1
    FOR probe from 1 to search_probes:
        random_source_idx = random(1, n_Source)

        // Get Source feature vector
        FOR c from 1 to 12:
            s_mfcc_c = MFCC_c[random_source_idx]
        s_pitch = pitch[random_source_idx]

        // Calculate weighted distance
        distance = 0

        // Spectral distance (MFCC 1-12)
        FOR c from 1 to 12:
            diff = t_mfcc_c - s_mfcc_c
            distance = distance + (diff × diff × spectral_weight)

        // Pitch distance (if both have pitch)
        IF t_pitch > 0 AND s_pitch > 0:
            pitch_diff = t_pitch - s_pitch
            distance = distance + (pitch_diff × pitch_diff × pitch_weight)

        // Update best match
        IF distance < best_distance:
            best_distance = distance
            best_match = random_source_idx

    // Store best match
    mosaic_sequence[i] = best_match
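The inner search loop in runnable form (a hedged Python sketch with hypothetical feature arrays; the plugin itself performs this in Praat scripting):

```python
import random

def best_match(t_vec, t_pitch, src_vecs, src_pitches,
               probes=50, spectral_weight=1.0, pitch_weight=0.3, rng=None):
    """Probe `probes` random Source grains; return the index of the one
    with the smallest weighted squared distance to the Target grain."""
    rng = rng or random.Random()
    best_d, best_i = float("inf"), 0
    for _ in range(probes):
        i = rng.randrange(len(src_vecs))
        # spectral term: squared MFCC differences
        d = spectral_weight * sum((t - s) ** 2
                                  for t, s in zip(t_vec, src_vecs[i]))
        # pitch term only when both grains are voiced (pitch > 0)
        if t_pitch > 0 and src_pitches[i] > 0:
            d += pitch_weight * (t_pitch - src_pitches[i]) ** 2
        if d < best_d:
            best_d, best_i = d, i
    return best_i

# toy usage: grain 1 is a perfect spectral match for the target
src = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
idx = best_match([0.5, 0.5], 0.0, src, [0.0, 0.0, 0.0],
                 probes=200, rng=random.Random(1))
print(idx)  # 1 (200 probes over 3 grains samples every index)
```

Because the search is stochastic, a probed index may miss the true best grain when search_probes is small relative to the number of Source grains, which is exactly the quality/speed trade-off discussed next.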

Search Quality vs Speed

Probe Count Optimization

Search Probes | Matching Quality        | Processing Time | Recommended Use
10-20         | Basic, approximate      | Very fast       | Quick experiments, very large files
30-50         | Good, musical           | Moderate        | General purpose, most applications
60-100        | Very good, precise      | Slow            | High-quality results, important projects
100-200       | Excellent, near-optimal | Very slow       | Critical applications, small files
200+          | Diminishing returns     | Extremely slow  | Special cases only

Weight Optimization

Balancing Spectral vs Pitch Matching

spectral_weight = 1.0, pitch_weight = 0.0
- Matches based only on timbre/spectral character
- Ignores pitch completely
- Good for: Noise textures, unpitched sounds
- Result: Timbre transfer regardless of pitch

spectral_weight = 1.0, pitch_weight = 0.3
- Primarily spectral matching with pitch guidance
- Default balanced setting
- Good for: Most musical and vocal material
- Result: Natural-sounding hybrids

spectral_weight = 0.5, pitch_weight = 1.0
- Primarily pitch matching with spectral guidance
- Emphasizes melodic/harmonic content
- Good for: Pitched instruments, melodic lines
- Result: Pitch-preserving texture transfer

spectral_weight = 0.0, pitch_weight = 1.0
- Matches based only on pitch
- Ignores timbre completely
- Good for: Pure pitch-based experiments
- Result: Melodic reconstruction with random timbres

Parameters

Granular Parameters

Parameter     | Type     | Range     | Default | Description
grain_size_ms | positive | 20-200 ms | 50 ms   | Duration of each analysis and synthesis grain
overlap_ratio | real     | 0.0-0.9   | 0.0     | Overlap between consecutive grains (0.0 = no overlap)
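These two parameters jointly determine the grain hop and the grain count. The helper below illustrates the usual granular convention, hop = grain_size × (1 − overlap_ratio); it is a hypothetical sketch, not code from the script, which may compute grain positions differently:

```python
def grain_schedule(duration_s, grain_size_ms, overlap_ratio):
    """Grain hop (seconds) and grain count under the common convention
    hop = grain_size * (1 - overlap_ratio)."""
    grain_s = grain_size_ms / 1000.0
    hop_s = grain_s * (1.0 - overlap_ratio)
    # grains that start early enough to fit entirely within the duration
    n_grains = round((duration_s - grain_s) / hop_s) + 1
    return hop_s, n_grains

# 10 s Target, 50 ms grains: no overlap vs 50% overlap
print(grain_schedule(10.0, 50.0, 0.0))  # (0.05, 200)
print(grain_schedule(10.0, 50.0, 0.5))  # (0.025, 399)
```

Under this convention, raising overlap_ratio from 0.0 to 0.5 roughly doubles the number of grains, and therefore the number of matching searches and the processing time.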

Search Parameters

Parameter       | Type     | Range   | Default | Description
search_probes   | integer  | 10-500  | 50      | Number of random Source grains tested per Target grain
pitch_weight    | positive | 0.0-2.0 | 0.3     | Importance of pitch matching in the similarity calculation
spectral_weight | positive | 0.0-2.0 | 1.0     | Importance of spectral/timbre matching in the similarity calculation

Output Parameters

Parameter        | Type    | Range | Default | Description
normalize_volume | boolean | 0/1   | 1       | Apply peak normalization to the output (recommended)
play_result      | boolean | 0/1   | 1       | Auto-play the result after processing

Parameter Effects Guide

grain_size_ms Effects:
20-40 ms: Very granular, micro-sound character
40-80 ms: Standard granular synthesis
80-120 ms: Smooth, more continuous results
120-200 ms: Almost sampling-like, clear source identity

overlap_ratio Effects:
0.0: No overlap, potentially choppy
0.1-0.3: Slight overlap, some smoothness
0.3-0.5: Moderate overlap, smooth transitions
0.5-0.7: Heavy overlap, very smooth
0.7-0.9: Extreme overlap, almost continuous

search_probes Effects:
10-20: Fast but approximate matches
30-50: Good balance of speed/quality
60-100: High quality, noticeable improvement
100+: Excellent quality, diminishing returns

Weight Ratio Effects (pitch:spectral):
0.0:1.0 → Pure timbre matching
0.3:1.0 → Balanced (default)
0.7:1.0 → Pitch-emphasized
1.0:0.0 → Pure pitch matching

Applications

Voice Transformation

Use case: Transforming one speaker's voice to sound like another

Technique: Use speech as both Target and Source with different speakers

Examples: Voice conversion, accent modification, vocal character transfer

Instrument Hybridization

Use case: Creating new hybrid instruments from existing ones

Technique: Use instrumental recordings as Target and Source

Results: Piano that sounds like guitar, violin with flute character, etc.

Textural Composition

Use case: Applying environmental textures to musical structures

Technique: Use music as Target, environmental sounds as Source

Applications: Nature music, soundscape composition, textural scores

Creative Sound Design

Use case: Generating novel sounds from existing audio material

Technique: Experiment with contrasting Target/Source combinations

Results: Unconventional hybrids, experimental textures, sound art

Practical Workflow Examples

🎤 Voice Transformation

Goal: Make Speaker A sound like Speaker B

Settings:

  • Target: Speaker A recording
  • Source: Speaker B recording
  • grain_size_ms: 40 (capture phoneme detail)
  • search_probes: 80 (high quality for voice)
  • pitch_weight: 0.4, spectral_weight: 1.0
  • overlap_ratio: 0.2 (smooth speech)

Result: Speaker A's speech with Speaker B's vocal character and timbre.

🎹 Piano to String Ensemble

Goal: Transform piano piece into string ensemble version

Settings:

  • Target: Piano recording
  • Source: String ensemble recording
  • grain_size_ms: 60 (musical phrase capture)
  • search_probes: 60 (good musical matching)
  • pitch_weight: 0.6, spectral_weight: 1.0
  • overlap_ratio: 0.3 (smooth musical lines)

Result: Piano composition performed with string ensemble timbre and texture.

🌊 Ocean Rhythm

Goal: Apply ocean texture to drum rhythm

Settings:

  • Target: Drum loop
  • Source: Ocean waves recording
  • grain_size_ms: 80 (waveform capture)
  • search_probes: 40 (texture doesn't need precision)
  • pitch_weight: 0.1, spectral_weight: 1.0
  • overlap_ratio: 0.0 (rhythmic clarity)

Result: Drum rhythm patterns expressed through ocean wave sounds.

Advanced Techniques

Creative material combinations:
  • Cross-domain hybrids: Speech Target + Music Source = singing speech
  • Texture layering: Process same Target with multiple Sources and mix
  • Progressive transformation: Use intermediate Sources for gradual changes
  • Extreme contrasts: Combine very different materials for experimental results

Experiment with unconventional Source materials – field recordings, noise, processed sounds – for unique creative outcomes.

Parameter experimentation:
  • Extreme grain sizes: 20ms for granular clouds, 150ms for clear sampling
  • Weight extremes: Pure spectral or pure pitch matching for special effects
  • Overlap variations: From completely separate to completely smooth grains
  • Search depth: Low for chaotic results, high for precise reconstruction

Troubleshooting Common Issues

Problem: Mosaic sounds chaotic/unrecognizable
Cause: Source too different from Target, insufficient search_probes
Solution: Use more similar materials, increase search_probes, adjust weights

Problem: Processing extremely slow
Cause: Long files, high search_probes, many grains
Solution: Use shorter selections, reduce search_probes, increase grain size

Problem: Output has clicks/pops
Cause: No overlap between grains, abrupt transitions
Solution: Increase overlap_ratio, use smaller grain sizes, try Hanning windows

Problem: Memory errors during processing
Cause: Very long files, too many grains, system limits
Solution: Use shorter Target files, increase grain size, close other applications