Apple researchers have developed StereoFoley, a new framework for generating stereo audio from video that is semantically aligned, temporally synchronized, and spatially accurate. The system addresses a key limitation of existing models, the lack of object-aware stereo imaging, and overcomes the shortage of suitable training data with a synthetic data generation pipeline. This pipeline combines video analysis, object tracking, and audio synthesis with dynamic panning and distance controls to produce realistic soundscapes, setting a new benchmark for video-to-audio generation.
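The dynamic panning and distance controls mentioned above can be illustrated with a minimal sketch. This is an assumption-laden toy example (a constant-power pan law driven by an object's horizontal frame position, plus inverse-distance attenuation), not StereoFoley's actual implementation; all function names here are hypothetical:

```python
import math

def pan_gains(x_norm: float) -> tuple[float, float]:
    """Constant-power stereo gains for an object at horizontal
    position x_norm in [0, 1] (0 = frame left, 1 = frame right)."""
    theta = x_norm * math.pi / 2  # map position onto a 0..90 degree pan angle
    return math.cos(theta), math.sin(theta)

def distance_gain(depth: float, ref: float = 1.0) -> float:
    """Inverse-distance attenuation relative to a reference depth."""
    return ref / max(depth, ref)

def spatialize(sample: float, x_norm: float, depth: float) -> tuple[float, float]:
    """Pan a mono sample into left/right channels and attenuate with depth."""
    g_left, g_right = pan_gains(x_norm)
    g = distance_gain(depth)
    return sample * g_left * g, sample * g_right * g
```

Updating `x_norm` and `depth` per frame from a tracked object's position would yield the kind of moving, distance-aware stereo image the paper describes.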
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT: Sets a new benchmark for generating spatially accurate stereo audio from video content.
RANK_REASON: This is a research paper detailing a new framework for audio generation from video.