MotionControl — Gesture-to-Sound Transformation

Offline motion-controlled sound transformation: webcam captures hand movement, extracts three control streams (energy, vertical position, horizontal position), and applies parallel amplitude, pitch, and spectral brightness modulations.

Author: Shai Cohen
Affiliation: Department of Music, Bar-Ilan University, Israel
Version: 1.0 (2026)
License: MIT License
Repo: https://github.com/ShaiCohen-ops/Praat-plugin_AudioTools

Contents:

What this does
Quick start
Control mappings
Parameters & Presets
Capture pipeline
Applications

What this does

This script implements motion-controlled sound transformation — a pipeline that uses a webcam to capture free-hand gestures and maps them to three parallel audio modulations. A Python worker opens the camera, records 10 seconds of motion (after 2 seconds of background calibration), and extracts three normalized control channels via frame differencing and motion-weighted centroid tracking. Praat then applies three offline transformations to the selected Sound:

Motion energy → amplitude envelope (AmplitudeTier multiplication)
Vertical hand position → pitch contour (Manipulation + PitchTier)
Horizontal hand position → spectral brightness (time-varying HPF modulation)

The pipeline is entirely file-based (CSV plus marker files). Praat and Python never stream data to each other in real time, so all processing is offline and a captured gesture can be reapplied with the same result each time.

What are motion-controlled transformations? Traditional sound processing uses fixed parameters or drawn automation. MotionControl translates physical gestures into control signals: raise your hand for higher pitch, move right for brighter timbre, increase gesture energy for volume swells.

Advantages:
(1) Expressive: natural human gesture shapes sound.
(2) Reproducible: same gesture sequence = same transformation.
(3) Three dimensions: simultaneous control of amplitude, pitch, and brightness.
(4) No sensors: only a standard webcam.
(5) Offline: capture once, apply to any sound.

Technical implementation:
(1) Python worker: Opens the webcam, captures CAL_SEC (2 s) for background modelling, then CAPTURE_SEC (10 s) of free motion. Extracts per-frame motion energy and motion-weighted centroid (X, Y).
(2) Control smoothing: EMA filter, percentile stretch, deadband, hysteresis.
(3) CSV export: time, motion_energy, vertical_pos, horizontal_pos (all 0..1).
(4) Amplitude mapping: energy -> AmplitudeTier with user-defined min/max, multiplied with the sound.
(5) Pitch mapping: vertical pos -> semitone shift (-range..+range). The original pitch is extracted via Manipulation, the PitchTier is shifted, and the sound is resynthesized.
(6) Brightness mapping: horizontal pos -> high-pass filtered copy added/subtracted (right = brighter, left = darker).

Key insight: three independent gesture channels produce rich, multidimensional sound transformation with no real-time constraints.

Quick start

In Praat, select exactly one Sound object.
Run script… → MotionControl.praat.
Choose a preset (Subtle gesture, Expressive performer, Wild motion, Meditative) or select "Custom".
Adjust parameters: pitch range (semitones), amplitude min/max, brightness range, smoothing frames, control fps.
A webcam preview window opens. Hold still for 2 seconds (calibration), then move freely for 10 seconds.
After capture, the script applies the three transformations automatically. The output Sound is named originalname_motion.

Quick tip: Start with Expressive performer preset for balanced results. Ensure good lighting and a plain background for best tracking. Enable Draw_visualization to see control curves and waveforms after processing. If the webcam fails, the script writes neutral fallback data (positions at 0.5, moderate energy) so processing continues. Python dependencies: numpy and opencv-python — install via pip install numpy opencv-python.

Important: The script changes amplitude, pitch, and spectral content together. All three transformations are written to a new Sound; the original Sound is preserved. Brightness modulation via HPF can cause clipping if brightness_range > 1.0; the script automatically scales the peak to 0.97. Low tracking confidence (<30%) indicates poor lighting or insufficient motion; consider adjusting the camera position. The Python worker must complete before Praat continues, so do not close the preview window manually. If the camera is unavailable, fallback mode lets processing continue with neutral control data.
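The peak scaling mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name `limit_peak` is hypothetical, and treating "clipping" as an absolute peak above 1.0 is an assumption, not something the script documents.

```python
import numpy as np

def limit_peak(samples, ceiling=0.97):
    """Scale the signal down only when its absolute peak exceeds 1.0.

    Assumption: 'clipping' means |sample| > 1.0; 0.97 is the documented target.
    """
    samples = np.asarray(samples, dtype=float)
    peak = float(np.max(np.abs(samples))) if samples.size else 0.0
    if peak > 1.0:
        return samples * (ceiling / peak)
    return samples
```

A signal that already fits within the range is returned unchanged, so quiet results are not boosted.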

Control mappings

1. Motion energy → Amplitude envelope

Mapping: energy (0..1) -> amplitude (amplitude_min .. amplitude_max)
Linear interpolation: A = min + energy × (max - min)
Example: amplitude_min=0.20, amplitude_max=1.00
energy=0.0 → amplitude=0.20 (soft)
energy=0.5 → amplitude=0.60 (medium)
energy=1.0 → amplitude=1.00 (full)
Applied via AmplitudeTier and Multiply: sample × envelope(t)
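The linear interpolation can be checked with a small Python sketch. The function name `energy_to_amplitude` is hypothetical, and clamping the input to 0..1 is an added safety assumption; the actual mapping in the script is done in Praat on an AmplitudeTier.

```python
def energy_to_amplitude(energy, amp_min=0.20, amp_max=1.00):
    """Map a normalized energy value (0..1) linearly onto amp_min..amp_max."""
    energy = min(1.0, max(0.0, energy))  # assumption: out-of-range input is clamped
    return amp_min + energy * (amp_max - amp_min)

if __name__ == "__main__":
    for e in (0.0, 0.5, 1.0):
        print(f"energy={e:.1f} -> amplitude={energy_to_amplitude(e):.2f}")
```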

2. Vertical position → Pitch contour

Mapping: vertical_pos (0..1) -> semitone shift (-range .. +range)
neutral at 0.5 → shift = 0 semitones
top (1.0) → shift = +range_st
bottom (0.0) → shift = -range_st
Shift formula: semitones = (vertical_pos - 0.5) × 2 × pitch_range_st
Pitch factor = 2^(semitones / 12)
Original F0 values are extracted via Manipulation → PitchTier, then multiplied by the factor. Unvoiced regions use a reference F0 (150 Hz) to avoid gaps.
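The shift formula and pitch factor can be sketched in Python as follows. The function names are hypothetical, and applying the same shift to the 150 Hz reference in unvoiced regions is an assumption; the script itself performs this step on a Praat PitchTier.

```python
def vertical_to_semitones(vertical_pos, pitch_range_st=6.0):
    """Map vertical position (0..1, neutral 0.5) to a shift in semitones."""
    return (vertical_pos - 0.5) * 2.0 * pitch_range_st

def semitones_to_factor(semitones):
    """Convert a semitone shift to a frequency ratio (12 semitones = one octave)."""
    return 2.0 ** (semitones / 12.0)

def shifted_f0(f0_hz, vertical_pos, pitch_range_st=6.0, reference_f0=150.0):
    """Return the shifted F0; unvoiced frames (None or 0) use the reference F0.

    Assumption: the reference F0 is shifted by the same factor as voiced frames.
    """
    base = f0_hz if f0_hz and f0_hz > 0 else reference_f0
    return base * semitones_to_factor(vertical_to_semitones(vertical_pos, pitch_range_st))
```

With pitch_range_st = 12, a hand at the top of the frame doubles the frequency (one octave up) and a hand at the bottom halves it.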

3. Horizontal position → Spectral brightness (HPF modulation)

Mapping: horizontal_pos (0..1) -> high-pass filter gain
neutral (0.5) → no change (HPF gain = 0)
right (1.0) → add HPF × brightness_range (brighter)
left (0.0) → subtract HPF × brightness_range (darker)
Implementation:
pos_gain = max(0, (horizontal - 0.5) × 2 × brightness_range)
neg_gain = max(0, (0.5 - horizontal) × 2 × brightness_range)
output = original + pos_gain×HPF - neg_gain×HPF
HPF cutoff = max(1000 Hz, sampling_rate / 22). The HPF copy is created once.
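A minimal NumPy sketch of the gain formulas and the mix is given here for illustration. The function names are hypothetical, and `hpf` stands for an already high-pass-filtered copy of the signal with the same length as `original`; the filtering itself is not shown.

```python
import numpy as np

def brightness_gains(horizontal, brightness_range=0.80):
    """Return (pos_gain, neg_gain) for horizontal positions in 0..1."""
    h = np.asarray(horizontal, dtype=float)
    pos_gain = np.maximum(0.0, (h - 0.5) * 2.0 * brightness_range)
    neg_gain = np.maximum(0.0, (0.5 - h) * 2.0 * brightness_range)
    return pos_gain, neg_gain

def hpf_cutoff(sampling_rate):
    """Cutoff rule from the text: at least 1000 Hz, else sampling_rate / 22."""
    return max(1000.0, sampling_rate / 22.0)

def mix_brightness(original, hpf, horizontal, brightness_range=0.80):
    """output = original + pos_gain * HPF - neg_gain * HPF (per sample)."""
    pos_gain, neg_gain = brightness_gains(horizontal, brightness_range)
    return np.asarray(original, dtype=float) + (pos_gain - neg_gain) * np.asarray(hpf, dtype=float)
```

At most one of the two gains is non-zero at any moment, so the mix either adds or subtracts the filtered copy, never both.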

Motion-weighted centroid

For each video frame, after background subtraction and frame differencing:

Energy = mean(diff)

Vertical centroid = Σ(diff × y) / Σ(diff) (inverted: top=1, bottom=0)

Horizontal centroid = Σ(diff × x) / Σ(diff) (left=0, right=1)

If total motion is below threshold, positions snap to neutral (0.5). This avoids jitter when still.
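The centroid formulas above can be sketched with NumPy as follows. The function name `motion_features`, the threshold value, and the normalization of pixel coordinates by (size - 1) are assumptions for illustration; `diff` is assumed to be a 2-D array of non-negative difference values after noise-floor subtraction.

```python
import numpy as np

def motion_features(diff, threshold=1e-6):
    """Return (energy, vertical, horizontal) for one difference image.

    Positions are normalized to 0..1; vertical is inverted so top = 1.
    The threshold is a hypothetical placeholder value.
    """
    diff = np.asarray(diff, dtype=float)
    height, width = diff.shape
    energy = float(diff.mean())
    total = float(diff.sum())
    if total <= threshold:
        return energy, 0.5, 0.5  # too little motion: neutral positions
    ys, xs = np.mgrid[0:height, 0:width]
    cx = float((diff * xs).sum() / total) / max(width - 1, 1)
    cy = float((diff * ys).sum() / total) / max(height - 1, 1)
    return energy, 1.0 - cy, cx
```

A single bright pixel in the top-right corner yields vertical = 1.0 and horizontal = 1.0, which matches the orientation described above.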

Parameters & Presets

Common Parameters

Parameter	Type	Default	Description
Preset	optionmenu	Expressive performer	Subtle, Expressive, Wild, Meditative, or Custom
Pitch_range_st	real	6.0	Semitone shift range (±)
Amplitude_min	real	0.20	Minimum amplitude (energy=0)
Amplitude_max	real	1.00	Maximum amplitude (energy=1)
Brightness_range	real	0.80	Max HPF gain at extremes
Smooth_frames	integer	5	EMA window for control smoothing
Control_fps	integer	25	Output control rate (frames/sec)
Draw_visualization	boolean	yes	Generate Praat picture with curves
Play_result	boolean	yes	Auto-play after processing

Built-in presets

Preset	Pitch range (st)	Amp min/max	Brightness range	Smooth frames	Character
Subtle gesture	3.0	0.50–1.00	0.40	9	Gentle, refined movements, narrow pitch range
Expressive performer	6.0	0.20–1.00	0.80	5	Balanced, musical — recommended default
Wild motion	12.0	0.08–1.00	1.20	3	Dramatic, snappy, extreme modulation
Meditative	2.0	0.55–0.90	0.30	18	Slow inertia, narrow range — drones & sustained tones

Parameter clamping: Pitch_range_st 0–24, amplitude_min 0–1, amplitude_max 0–2 (must be ≥ min), brightness_range 0–2, smooth_frames 1–50, control_fps 10–100. Values outside ranges are automatically clamped.
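The clamping rules can be expressed as a short Python sketch. The function names and dictionary keys are hypothetical, and raising a too-small amplitude_max up to amplitude_min is an assumption about how the "must be ≥ min" rule is enforced.

```python
def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))

def clamp_parameters(params):
    """Return a copy of params with every field forced into its documented range."""
    out = dict(params)
    out["pitch_range_st"] = clamp(float(params["pitch_range_st"]), 0.0, 24.0)
    out["amplitude_min"] = clamp(float(params["amplitude_min"]), 0.0, 1.0)
    # Assumption: an amplitude_max below amplitude_min is raised to amplitude_min.
    out["amplitude_max"] = clamp(float(params["amplitude_max"]), out["amplitude_min"], 2.0)
    out["brightness_range"] = clamp(float(params["brightness_range"]), 0.0, 2.0)
    out["smooth_frames"] = int(clamp(int(params["smooth_frames"]), 1, 50))
    out["control_fps"] = int(clamp(int(params["control_fps"]), 10, 100))
    return out
```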

Capture & processing pipeline

Stage 1: Python dependency check
Verify numpy and opencv-python are installed. Exit with a helpful message if either is missing.

Stage 2: Webcam calibration (2 seconds)
Capture frames while the user holds still. Build per-pixel background mean and standard deviation. bg_std is floored at 1.0 to prevent over-suppression.

Stage 3: Motion capture (10 seconds)
Optional preview window with motion heat overlay and countdown timer. Each frame: convert to grayscale, store with timestamp.

Stage 4: Feature extraction
For each consecutive pair: frame differencing, subtract noise floor (1.5×bg_std). Compute motion energy (mean diff) and motion-weighted centroid (X, Y). Vertical is inverted so top of frame = 1.0.

Stage 5: Resampling & smoothing
Raw features at camera fps → interpolate to a uniform control_fps grid.
Exponential moving average (window = smooth_frames).
Percentile stretch (5th–95th) to map the full range to [0,1].
Deadband: energy < DEADBAND (0.04) → positions glide to neutral (0.5).
Hysteresis: lazy follower on positions to reduce zipper noise.

Stage 6: Write control CSV + stats
CSV: time,motion_energy,vertical_pos,horizontal_pos
Stats: duration, camera_fps, n_raw_frames, n_ctrl_frames, tracking_confidence, warnings

Stage 7: Praat offline transformations
Amplitude: Create AmplitudeTier from energy → Multiply with sound.
Pitch: Extract original PitchTier via Manipulation, create shifted PitchTier (vertical → semitones), replace and resynthesize.
Brightness: Create HPF copy (band-pass Hann: cutoff to Nyquist), create two AmplitudeTiers (positive and negative gains), Multiply and combine: original + brightHPF - darkHPF.
Normalize peak to 0.97 if clipping.
Output: "originalname_motion" appears in the Objects window.
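The smoothing steps of Stage 5 can be sketched with NumPy as shown here. This is an illustration under stated assumptions, not the worker's actual code: the function names are hypothetical, the EMA coefficient uses the common alpha = 2 / (window + 1) convention, a flat signal is mapped to 0.5 by the stretch, and the deadband and hysteresis are combined into a single lazy follower that glides toward 0.5 while energy is below DEADBAND.

```python
import numpy as np

DEADBAND = 0.04    # documented energy threshold
HYSTERESIS = 0.35  # documented lazy-follower blend alpha

def ema(x, window):
    """Exponential moving average; alpha = 2 / (window + 1) is an assumed convention."""
    x = np.asarray(x, dtype=float)
    if x.size == 0:
        return x
    alpha = 2.0 / (window + 1.0)
    out = np.empty_like(x)
    acc = x[0]
    for i, value in enumerate(x):
        acc = alpha * value + (1.0 - alpha) * acc
        out[i] = acc
    return out

def percentile_stretch(x, lo=5, hi=95):
    """Map the lo..hi percentile span onto 0..1 and clip the tails."""
    x = np.asarray(x, dtype=float)
    p_lo, p_hi = np.percentile(x, [lo, hi])
    if p_hi - p_lo < 1e-9:
        return np.full(x.shape, 0.5)  # assumption: a flat signal becomes neutral
    return np.clip((x - p_lo) / (p_hi - p_lo), 0.0, 1.0)

def lazy_follow(position, energy, alpha=HYSTERESIS, deadband=DEADBAND):
    """Follow the position lazily; glide toward 0.5 while energy is in the deadband."""
    position = np.asarray(position, dtype=float)
    energy = np.asarray(energy, dtype=float)
    out = np.empty_like(position)
    current = 0.5
    for i in range(position.size):
        target = 0.5 if energy[i] < deadband else position[i]
        current += alpha * (target - current)
        out[i] = current
    return out
```

Because the follower moves only a fraction of the remaining distance per frame, sudden jumps in the tracked position are spread over several control frames.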

Fallback mechanism: If the webcam cannot be opened (missing camera, permission denied, or hardware error), the Python worker writes neutral control data for the entire duration: energy = 0.25, vertical = 0.5, horizontal = 0.5. The done marker contains "fallback" instead of "ok". Praat proceeds with these values, so pitch and brightness stay neutral, while the constant energy of 0.25 holds the amplitude at amplitude_min + 0.25 × (amplitude_max - amplitude_min) under the linear mapping described in Control mappings (0.40 with the default 0.20–1.00 range). A missing camera therefore does not abort the run.
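Writing the neutral fallback data could look roughly like the following sketch. The function name `write_fallback_csv`, the time format, and the frame count (capture_sec × control_fps) are assumptions; only the column names and the neutral values come from the description above.

```python
import csv

def write_fallback_csv(path, capture_sec=10.0, control_fps=25):
    """Write constant neutral control rows and return the number of rows written."""
    n_frames = int(round(capture_sec * control_fps))  # assumed frame count
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["time", "motion_energy", "vertical_pos", "horizontal_pos"])
        for i in range(n_frames):
            # Documented neutral values: energy 0.25, both positions 0.5.
            writer.writerow([f"{i / control_fps:.4f}", 0.25, 0.5, 0.5])
    return n_frames
```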

Applications

Expressive performance capture

Use case: Record a gestural performance and apply it to any audio source — transform a simple pad into an expressive lead, or add human nuance to synthesized textures.

Technique: Use "Expressive performer" preset. Move your whole arm; vertical gestures affect pitch, horizontal gestures affect brightness, energy controls dynamics.

Experimental composition & sound design

Use case: Create evolving, gesture-driven soundscapes where amplitude, pitch, and timbre are coupled to physical motion.

Technique: "Wild motion" preset with wide pitch range and high brightness range. Record chaotic gestures, then apply to granular textures or field recordings.

Psychoacoustics research

Use case: Reproducible gesture-to-sound mappings for perception studies.

Advantages: Exact specification via preset parameters, identical across trials, documented transformation chain, no subjective variability.

Pedagogical tool

Use case: Demonstrate embodied music interaction and sensorimotor mapping in classrooms.

Learning outcomes: Understand the relationship between gesture and sound parameters, explore frame-based feature extraction, and learn about offline audio processing pipelines.

Practical workflow example: Swell + pitch rise + brightening

Gesture: Start with hand low and left, move diagonally upward to the right while increasing gesture speed.

Resulting transformation: Amplitude swells (energy increase), pitch rises (vertical up), brightness increases (horizontal right). Creates a dramatic "sweep" effect.

Settings: Wild motion preset or custom: pitch_range=12, brightness_range=1.2, amplitude_min=0.1, amplitude_max=1.0.

Troubleshooting common issues:
• Python not found or missing packages: Install numpy and opencv-python: pip install numpy opencv-python. Verify with python -c "import numpy, cv2".
• Webcam preview black / no motion detected: Improve lighting, avoid moving background, wear contrasting clothing. Calibration requires 2 seconds of stillness.
• Low tracking confidence warning (<30%): Increase gesture amplitude, move closer to camera, or reduce distance to background.
• Output clipping: The script auto-scales peak to 0.97. If still clipping, reduce brightness_range or amplitude_max.
• Unexpected pitch changes: Original sound must have clear pitch content (voiced sounds, tonal instruments). For noisy sounds, consider using a different source.

Advanced: Customizing the Python worker

Parameters passed from Praat to Python:
python motion_control.py control_csv stats_txt done_marker capture_sec control_fps smooth_frames show_preview

Modifiable constants in the Python script:
CAL_SEC = 2 # calibration duration (seconds)
DEADBAND = 0.04 # energy threshold for neutral snap
HYSTERESIS = 0.35 # lazy follower blend alpha
CAM_INDEX = 0 # webcam device index

Advanced users can edit these directly in motion_control.py.
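For readers adapting the worker, the seven positional arguments could be parsed with argparse along these lines. This is a sketch only: the function name `parse_worker_args` is hypothetical, and the argument types (float seconds, integer rates, a 0/1 preview flag) are assumptions, since the documentation lists only the names and their order.

```python
import argparse

def parse_worker_args(argv):
    """Parse the positional arguments in the documented order."""
    parser = argparse.ArgumentParser(prog="motion_control.py")
    parser.add_argument("control_csv")                 # output CSV path
    parser.add_argument("stats_txt")                   # output stats path
    parser.add_argument("done_marker")                 # marker file path
    parser.add_argument("capture_sec", type=float)     # assumed: seconds as float
    parser.add_argument("control_fps", type=int)       # assumed: integer rate
    parser.add_argument("smooth_frames", type=int)     # assumed: integer window
    parser.add_argument("show_preview", type=int)      # assumed: 0 or 1
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_worker_args(["ctrl.csv", "stats.txt", "done.txt", "10", "25", "5", "1"])
    print(args.capture_sec, args.control_fps, bool(args.show_preview))
```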