Apple researchers have developed StereoFoley, a framework for generating stereo audio from video that is semantically aligned, temporally synchronized, and spatially accurate. The system addresses limitations in existing models by producing object-aware stereo imaging, and it overcomes the lack of suitable datasets through a synthetic data generation pipeline. This pipeline combines video analysis, object tracking, and audio synthesis with dynamic panning and distance controls to produce realistic soundscapes, setting a new benchmark for video-to-audio generation.
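To make "dynamic panning and distance controls" concrete, here is a minimal sketch of one common way to place a mono sound in a stereo field from a tracked object's position. It is not taken from StereoFoley: the constant-power pan law, the inverse-distance attenuation, and the names `pan_gains`, `spatialize`, `x_norm`, and `ref_distance` are all assumptions chosen for illustration.

```python
import math


def pan_gains(x_norm, distance, ref_distance=1.0):
    """Return (left, right) gains for one sound source.

    x_norm: horizontal position in the frame, 0.0 = left edge, 1.0 = right edge
            (hypothetical output of an object tracker).
    distance: relative source distance; values below ref_distance are clamped.
    """
    x = min(max(x_norm, 0.0), 1.0)
    theta = x * math.pi / 2  # 0 -> hard left, pi/2 -> hard right
    # Inverse-distance attenuation, capped at unity gain (assumed model).
    atten = ref_distance / max(distance, ref_distance)
    # Constant-power pan law: left^2 + right^2 == atten^2 at every position.
    return math.cos(theta) * atten, math.sin(theta) * atten


def spatialize(mono, positions, distances):
    """Apply per-sample pan and distance gains to a mono signal.

    positions and distances are per-sample trajectories, for example
    interpolated from per-frame object tracks.
    """
    if not (len(mono) == len(positions) == len(distances)):
        raise ValueError("mono, positions, and distances must have equal length")
    left, right = [], []
    for sample, x, d in zip(mono, positions, distances):
        gain_l, gain_r = pan_gains(x, d)
        left.append(sample * gain_l)
        right.append(sample * gain_r)
    return left, right


# A source moving left to right while receding from the camera.
left, right = spatialize(
    mono=[1.0, 1.0, 1.0],
    positions=[0.0, 0.5, 1.0],
    distances=[1.0, 2.0, 4.0],
)
```

The constant-power law keeps perceived loudness roughly steady as the source crosses the frame, so any change in level comes from the distance term alone.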
Impact: Sets a new benchmark for generating spatially accurate stereo audio from video content.
Ranking rationale: This is a research paper detailing a new framework for audio generation from video.
Read at Apple Machine Learning Research →
- Alessandro Toso
- Apple
- NeurIPS
- ImmerseDiffusion
- Joshua Atkins
- Kuan-Lin Chen
- Mehrez Souden
- Mojtaba Heydari
- ICASSP
- Robert Henzel
- StereoFoley
- Tornike Karchkhadze
- UC San Diego
AI-generated summary · Google Gemini · from 1 source.