Neural Audio Mosaic — User Guide

Content-based audio reconstruction: uses neural feature matching to reconstruct Target audio using only grains from Source audio, creating hybrid sounds that preserve Target structure with Source texture.

Author: Shai Cohen
Affiliation: Department of Music, Bar-Ilan University, Israel
Version: 0.1 (2025)
License: MIT License
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

What this does

This script implements a neural audio mosaic system that reconstructs one audio file (Target) using only small grains from another audio file (Source). Through sophisticated feature analysis and stochastic search, it finds the best-matching Source grains for each Target grain, creating a hybrid audio result that preserves the structural characteristics of the Target while adopting the textural qualities of the Source.


What is neural audio mosaicing?

Traditional audio mosaicing replaces Target segments with similar-sounding Source segments. Neural audio mosaicing uses machine-learning features to find the most perceptually similar Source grains for each Target grain.

The system works through four phases:
  1. Feature extraction – analyze both Target and Source using MFCC and pitch features
  2. Normalization – scale features for fair comparison
  3. Neural matching – find the best Source matches for each Target grain
  4. Granular reconstruction – assemble the matching Source grains in Target sequence

Advantages:
  • Content-aware: matches based on perceptual features
  • Structure preservation: maintains Target timing and phrasing
  • Texture transfer: applies Source sonic character to Target content
  • Creative hybridization: creates entirely new sonic identities from existing material

Technical Implementation:
  1. Dual Analysis – extract 12 MFCC coefficients + logarithmic pitch from both Target and Source
  2. Feature Normalization – apply min-max scaling to all features for balanced comparison
  3. Stochastic Search – for each Target grain, probe multiple random Source grains to find the best match
  4. Weighted Distance – combine spectral and pitch similarity with adjustable weights
  5. Block Assembly – reconstruct the Target timeline from matched Source grains with memory-safe concatenation

Key insight: the "neural" aspect comes from using MFCC features (originally developed for speech recognition) that approximate human auditory perception, enabling perceptually meaningful audio matching.
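The "memory-safe concatenation" in step 5 can be sketched in Python (a hypothetical illustration with grains as plain sample lists, not the plugin's actual Praat code):

```python
def assemble_blocks(grains, block_size=100):
    """Concatenate many short grains via intermediate blocks.

    Appending thousands of tiny grains to one ever-growing result
    re-copies that result on every step; grouping grains into
    fixed-size blocks first keeps each single concatenation small.
    """
    blocks = []
    for start in range(0, len(grains), block_size):
        block = []
        for grain in grains[start:start + block_size]:
            block.extend(grain)  # small, bounded concatenation
        blocks.append(block)
    result = []
    for block in blocks:  # final pass: one concatenation per block
        result.extend(block)
    return result

# toy usage: 250 one-sample "grains"
out = assemble_blocks([[float(i)] for i in range(250)])
print(len(out), out[0], out[-1])  # 250 0.0 249.0
```

The same two-level idea applies in Praat, where repeatedly concatenating one growing Sound object with each new grain is far slower than concatenating grains into blocks and then blocks into the result.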

Quick start

  1. In Praat, select exactly two Sound objects in this order:
    • First selection: Target sound (provides structure/timing)
    • Second selection: Source sound (provides texture/grains)
  2. Run neural_audio_mosaic.praat (Praat menu: Run script…).
  3. Granular Parameters:
    • grain_size_ms – Duration of each analysis grain (default: 50ms)
    • overlap_ratio – Overlap between grains (default: 0.0 = no overlap)
  4. Search Parameters:
    • search_probes – Number of random Source grains to test per Target grain (default: 50)
    • pitch_weight – Importance of pitch matching (default: 0.3)
    • spectral_weight – Importance of spectral/timbre matching (default: 1.0)
  5. Output Options:
    • normalize_volume – Apply peak normalization (recommended: 1)
    • play_result – Auto-play after processing
  6. Click OK – processing begins with progress updates in Info window.
  7. The script executes five phases:
    • Setup & Validation – Check file compatibility
    • Feature Extraction – Analyze both Target and Source
    • Neural Matching – Find best Source matches for Target grains
    • Granular Reconstruction – Build mosaic from matched grains
    • Final Assembly – Combine blocks into final result
  8. Result appears as "TargetName_Mosaic" with the structure of Target and texture of Source.
Quick tip:
  • Choose compatible Target and Source pairs – speech with speech, music with music, or contrasting materials for creative effects.
  • Use 50-80 ms grain sizes for most material.
  • Higher search_probes (100-200) improve matching quality but increase processing time.
  • Balance pitch_weight and spectral_weight based on your materials – for pitch-heavy content, increase pitch_weight; for textural content, increase spectral_weight.
  • The algorithm works best when the Source material is richer and more varied than the Target material.
Important:
  • SELECTION ORDER MATTERS – the first selected sound becomes the Target (structure), the second becomes the Source (texture).
  • Sampling rate compatibility: both files must have the same sampling rate.
  • Source richness: the Source should contain diverse sonic material for effective matching.
  • Processing time: can be significant for long files or high search_probes values.
  • Memory requirements: very long outputs may require substantial memory – use shorter Target files if you experience issues.
  • Stereo handling: both files are converted to mono for analysis – process stereo files by channel if needed.

Mosaic Concept

Dual-Role Audio Processing

🎵 Target + Source = Mosaic

Target Role: Provides structural blueprint – timing, phrasing, duration

Source Role: Provides sonic material – timbre, texture, grain content

Mosaic Result: Target's structure reconstructed with Source's texture

Creative Potential: Infinite hybrid combinations from existing audio

Musical Analogy

Think of it as musical translation:

Target = Musical Score
- Defines the structure: timing, rhythm, phrasing
- Provides the "what" and "when"
- Like sheet music defining notes and durations

Source = Instrument/Voice
- Provides the sound quality: timbre, texture, character
- Provides the "how" and "with what sound"
- Like choosing violin vs trumpet for the same score

Mosaic = Performance
- Same structure performed with different sound
- Target's composition with Source's instrumentation
- Like playing Beethoven on electric guitar

Ideal Material Combinations

Target Material | Source Material | Result Character  | Parameter Tips
Speech          | Speech          | Voice transformation | High pitch_weight, medium grain_size
Speech          | Music           | Singing speech    | Balanced weights, small grain_size
Music           | Environmental   | Nature music      | High spectral_weight, varied grain_size
Music           | Music           | Instrument transfer | pitch_weight ~0.5, medium search_probes
Percussion      | Noise textures  | Textural rhythms  | High spectral_weight, low overlap

Feature Analysis

MFCC Feature Extraction

🔬 Perceptual Feature Analysis

MFCCs: Mel-Frequency Cepstral Coefficients – psychoacoustically motivated features

Pitch: Logarithmic fundamental frequency for musical matching

Normalization: Min-max scaling for fair feature comparison

Dimensionality: 13 features total (12 MFCC + 1 pitch)

MFCC Mathematics

Mel-frequency cepstral analysis:

MFCC extraction pipeline:
  1. Frame blocking – divide the signal into short frames
  2. Windowing – apply a window function (Hanning)
  3. FFT – convert to the frequency domain
  4. Mel filterbank – warp to the mel frequency scale
  5. Log compression – compress the dynamic range
  6. DCT – decorrelate the features → MFCCs

Mel frequency scale: mel(f) = 2595 × log₁₀(1 + f/700)

This approximates human frequency perception:
  • Linear below 1 kHz
  • Logarithmic above 1 kHz
  • Emphasizes perceptually important frequencies

Script implementation: To MFCC: 12, 0.025, stepSec, 100, 100, 0 – extracts 12 coefficients per grain
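The mel formula is easy to check numerically; a minimal Python sketch:

```python
import math

def hz_to_mel(f_hz):
    """Mel scale as defined above: mel(f) = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# roughly linear below 1 kHz, strongly compressed above it
print(round(hz_to_mel(500)))   # 607
print(round(hz_to_mel(1000)))  # 1000
print(round(hz_to_mel(8000)))  # 2840
```

Note how a 500 Hz step near the bottom of the scale (0 → 500 Hz) spans about 607 mel, while the entire 1-8 kHz range spans under 2000 mel: perceptually important low and mid frequencies get proportionally more resolution.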

13-Dimensional Feature Space

Feature Group | Features  | Description                      | Perceptual Correlate
MFCC 1-2      | 2 features | Overall spectral shape          | Brightness, spectral tilt
MFCC 3-5      | 3 features | Mid-frequency characteristics   | Formant structure, resonance
MFCC 6-8      | 3 features | High-frequency detail           | Noise character, frication
MFCC 9-12     | 4 features | Fine spectral detail            | Timbre nuances, articulation
Pitch         | 1 feature  | Logarithmic fundamental frequency | Perceived pitch, tonality

Feature Normalization

Min-Max Scaling

Manual normalization implementation:

FOR each sound (Target and Source):
    FOR each feature column c from 1 to 13:
        // Find min and max
        min_val = very_large_number
        max_val = very_small_number
        FOR each grain r from 1 to n_grains:
            val = Get value: r, c
            IF val < min_val: min_val = val
            IF val > max_val: max_val = val
        // Calculate range
        range = max_val - min_val
        IF range = 0: range = 1 (avoid division by zero)
        // Apply normalization
        FOR each grain r from 1 to n_grains:
            val = Get value: r, c
            normalized = (val - min_val) / range
            Set value: r, c, normalized

This ensures:
  • All features are scaled to the 0-1 range
  • Equal contribution to distance calculations
  • Robustness to different feature value ranges
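The same normalization in runnable form (a Python sketch of the pseudocode, not the plugin's Praat code; the feature matrix here is a hypothetical list of per-grain rows):

```python
def minmax_normalize(features):
    """Column-wise min-max scaling of a grains-by-features matrix,
    mirroring the pseudocode above. Returns a new matrix."""
    n_cols = len(features[0])
    out = [row[:] for row in features]  # copy; leave the input untouched
    for c in range(n_cols):
        col = [row[c] for row in features]
        lo, hi = min(col), max(col)
        rng = hi - lo or 1.0  # constant column: avoid division by zero
        for r in range(len(out)):
            out[r][c] = (out[r][c] - lo) / rng
    return out

# toy matrix: 3 grains x 2 features; second column is constant
feats = [[10.0, 5.0], [20.0, 5.0], [30.0, 5.0]]
print(minmax_normalize(feats))  # [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]
```

A constant column maps to all zeros rather than raising an error, which matches the pseudocode's range = 1 fallback.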

Neural Matching

Stochastic Search Algorithm

Matching Mathematics

Weighted Euclidean distance calculation:

FOR each Target grain i from 1 to n_Target:
    // Get Target feature vector
    FOR c from 1 to 12:
        t_mfcc_c = MFCC_c[i]
    t_pitch = pitch[i]

    // Stochastic search
    best_distance = very_large_number
    best_match = 1
    FOR probe from 1 to search_probes:
        random_source_idx = random(1, n_Source)

        // Get Source feature vector
        FOR c from 1 to 12:
            s_mfcc_c = MFCC_c[random_source_idx]
        s_pitch = pitch[random_source_idx]

        // Calculate weighted distance
        distance = 0

        // Spectral distance (MFCC 1-12)
        FOR c from 1 to 12:
            diff = t_mfcc_c - s_mfcc_c
            distance = distance + (diff × diff × spectral_weight)

        // Pitch distance (if both have pitch)
        IF t_pitch > 0 AND s_pitch > 0:
            pitch_diff = t_pitch - s_pitch
            distance = distance + (pitch_diff × pitch_diff × pitch_weight)

        // Update best match
        IF distance < best_distance:
            best_distance = distance
            best_match = random_source_idx

    // Store best match
    mosaic_sequence[i] = best_match
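The inner search loop in runnable form (a hedged Python sketch with hypothetical feature arrays; the plugin itself performs this in Praat scripting):

```python
import random

def best_match(t_vec, t_pitch, src_vecs, src_pitches,
               probes=50, spectral_weight=1.0, pitch_weight=0.3, rng=None):
    """Probe `probes` random Source grains; return the index of the one
    with the smallest weighted squared distance to the Target grain."""
    rng = rng or random.Random()
    best_d, best_i = float("inf"), 0
    for _ in range(probes):
        i = rng.randrange(len(src_vecs))
        # spectral term: squared MFCC differences
        d = spectral_weight * sum((t - s) ** 2
                                  for t, s in zip(t_vec, src_vecs[i]))
        # pitch term only when both grains are voiced (pitch > 0)
        if t_pitch > 0 and src_pitches[i] > 0:
            d += pitch_weight * (t_pitch - src_pitches[i]) ** 2
        if d < best_d:
            best_d, best_i = d, i
    return best_i

# toy usage: grain 1 is a perfect spectral match for the target
src = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
idx = best_match([0.5, 0.5], 0.0, src, [0.0, 0.0, 0.0],
                 probes=200, rng=random.Random(1))
print(idx)  # 1 (200 probes over 3 grains samples every index)
```

Because the search is stochastic, a probed index may miss the true best grain when search_probes is small relative to the number of Source grains, which is exactly the quality/speed trade-off discussed next.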

Search Quality vs Speed

Probe Count Optimization

Search Probes | Matching Quality        | Processing Time | Recommended Use
10-20         | Basic, approximate      | Very fast       | Quick experiments, very large files
30-50         | Good, musical           | Moderate        | General purpose, most applications
60-100        | Very good, precise      | Slow            | High-quality results, important projects
100-200       | Excellent, near-optimal | Very slow       | Critical applications, small files
200+          | Diminishing returns     | Extremely slow  | Special cases only

Weight Optimization

Balancing Spectral vs Pitch Matching

spectral_weight = 1.0, pitch_weight = 0.0
- Matches based only on timbre/spectral character
- Ignores pitch completely
- Good for: Noise textures, unpitched sounds
- Result: Timbre transfer regardless of pitch

spectral_weight = 1.0, pitch_weight = 0.3
- Primarily spectral matching with pitch guidance
- Default balanced setting
- Good for: Most musical and vocal material
- Result: Natural-sounding hybrids

spectral_weight = 0.5, pitch_weight = 1.0
- Primarily pitch matching with spectral guidance
- Emphasizes melodic/harmonic content
- Good for: Pitched instruments, melodic lines
- Result: Pitch-preserving texture transfer

spectral_weight = 0.0, pitch_weight = 1.0
- Matches based only on pitch
- Ignores timbre completely
- Good for: Pure pitch-based experiments
- Result: Melodic reconstruction with random timbres

Parameters

Granular Parameters

Parameter     | Type     | Range     | Default | Description
grain_size_ms | positive | 20-200 ms | 50 ms   | Duration of each analysis and synthesis grain
overlap_ratio | real     | 0.0-0.9   | 0.0     | Overlap between consecutive grains (0.0 = no overlap)
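These two parameters jointly determine the grain hop and the grain count. The helper below illustrates the usual granular convention, hop = grain_size × (1 − overlap_ratio); it is a hypothetical sketch, not code from the script, which may compute grain positions differently:

```python
def grain_schedule(duration_s, grain_size_ms, overlap_ratio):
    """Grain hop (seconds) and grain count under the common convention
    hop = grain_size * (1 - overlap_ratio)."""
    grain_s = grain_size_ms / 1000.0
    hop_s = grain_s * (1.0 - overlap_ratio)
    # grains that start early enough to fit entirely within the duration
    n_grains = round((duration_s - grain_s) / hop_s) + 1
    return hop_s, n_grains

# 10 s Target, 50 ms grains: no overlap vs 50% overlap
print(grain_schedule(10.0, 50.0, 0.0))  # (0.05, 200)
print(grain_schedule(10.0, 50.0, 0.5))  # (0.025, 399)
```

Under this convention, raising overlap_ratio from 0.0 to 0.5 roughly doubles the number of grains, and therefore the number of matching searches and the processing time.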

Search Parameters

Parameter       | Type     | Range   | Default | Description
search_probes   | integer  | 10-500  | 50      | Number of random Source grains tested per Target grain
pitch_weight    | positive | 0.0-2.0 | 0.3     | Importance of pitch matching in the similarity calculation
spectral_weight | positive | 0.0-2.0 | 1.0     | Importance of spectral/timbre matching in the similarity calculation

Output Parameters

Parameter        | Type    | Range | Default | Description
normalize_volume | boolean | 0/1   | 1       | Apply peak normalization to the output (recommended)
play_result      | boolean | 0/1   | 1       | Auto-play the result after processing

Parameter Effects Guide

grain_size_ms Effects:
20-40 ms: Very granular, micro-sound character
40-80 ms: Standard granular synthesis
80-120 ms: Smooth, more continuous results
120-200 ms: Almost sampling-like, clear source identity

overlap_ratio Effects:
0.0: No overlap, potentially choppy
0.1-0.3: Slight overlap, some smoothness
0.3-0.5: Moderate overlap, smooth transitions
0.5-0.7: Heavy overlap, very smooth
0.7-0.9: Extreme overlap, almost continuous

search_probes Effects:
10-20: Fast but approximate matches
30-50: Good balance of speed/quality
60-100: High quality, noticeable improvement
100+: Excellent quality, diminishing returns

Weight Ratio Effects (pitch:spectral):
0.0:1.0 → Pure timbre matching
0.3:1.0 → Balanced (default)
0.7:1.0 → Pitch-emphasized
1.0:0.0 → Pure pitch matching

Applications

Voice Transformation

Use case: Transforming one speaker's voice to sound like another

Technique: Use speech as both Target and Source with different speakers

Examples: Voice conversion, accent modification, vocal character transfer

Instrument Hybridization

Use case: Creating new hybrid instruments from existing ones

Technique: Use instrumental recordings as Target and Source

Results: Piano that sounds like guitar, violin with flute character, etc.

Textural Composition

Use case: Applying environmental textures to musical structures

Technique: Use music as Target, environmental sounds as Source

Applications: Nature music, soundscape composition, textural scores

Creative Sound Design

Use case: Generating novel sounds from existing audio material

Technique: Experiment with contrasting Target/Source combinations

Results: Unconventional hybrids, experimental textures, sound art

Practical Workflow Examples

🎤 Voice Transformation

Goal: Make Speaker A sound like Speaker B

Settings:

  • Target: Speaker A recording
  • Source: Speaker B recording
  • grain_size_ms: 40 (capture phoneme detail)
  • search_probes: 80 (high quality for voice)
  • pitch_weight: 0.4, spectral_weight: 1.0
  • overlap_ratio: 0.2 (smooth speech)

Result: Speaker A's speech with Speaker B's vocal character and timbre.

🎹 Piano to String Ensemble

Goal: Transform piano piece into string ensemble version

Settings:

  • Target: Piano recording
  • Source: String ensemble recording
  • grain_size_ms: 60 (musical phrase capture)
  • search_probes: 60 (good musical matching)
  • pitch_weight: 0.6, spectral_weight: 1.0
  • overlap_ratio: 0.3 (smooth musical lines)

Result: Piano composition performed with string ensemble timbre and texture.

🌊 Ocean Rhythm

Goal: Apply ocean texture to drum rhythm

Settings:

  • Target: Drum loop
  • Source: Ocean waves recording
  • grain_size_ms: 80 (waveform capture)
  • search_probes: 40 (texture doesn't need precision)
  • pitch_weight: 0.1, spectral_weight: 1.0
  • overlap_ratio: 0.0 (rhythmic clarity)

Result: Drum rhythm patterns expressed through ocean wave sounds.

Advanced Techniques

Creative material combinations:
  • Cross-domain hybrids: Speech Target + Music Source = singing speech
  • Texture layering: Process same Target with multiple Sources and mix
  • Progressive transformation: Use intermediate Sources for gradual changes
  • Extreme contrasts: Combine very different materials for experimental results

Experiment with unconventional Source materials – field recordings, noise, processed sounds – for unique creative outcomes.

Parameter experimentation:
  • Extreme grain sizes: 20ms for granular clouds, 150ms for clear sampling
  • Weight extremes: Pure spectral or pure pitch matching for special effects
  • Overlap variations: From completely separate to completely smooth grains
  • Search depth: Low for chaotic results, high for precise reconstruction

Troubleshooting Common Issues

Problem: Mosaic sounds chaotic/unrecognizable
Cause: Source too different from Target, insufficient search_probes
Solution: Use more similar materials, increase search_probes, adjust weights

Problem: Processing extremely slow
Cause: Long files, high search_probes, many grains
Solution: Use shorter selections, reduce search_probes, increase grain size

Problem: Output has clicks/pops
Cause: No overlap between grains, abrupt transitions
Solution: Increase overlap_ratio, use smaller grain sizes, try Hanning windows

Problem: Memory errors during processing
Cause: Very long files, too many grains, system limits
Solution: Use shorter Target files, increase grain size, close other applications