New SSNAPS method uses diffusion for audio-visual speech separation

By PulseAugur Editorial · [1 source] · 2026-06-16 04:00

Researchers have developed SSNAPS, an unsupervised audio-visual method for separating speech from background noise in single-microphone recordings. The approach uses diffusion inverse sampling: clean speech and ambient noise are modeled with distinct diffusion priors, which are then used to reconstruct all sources. The method achieves a lower word error rate than supervised baselines across various noisy conditions, including scenarios with multiple speakers and off-screen separation. The separated noise component is also of high enough fidelity to support downstream acoustic scene detection.
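To make the idea of separating a mixture with two distinct priors concrete, here is a minimal toy sketch. It is not the SSNAPS algorithm: it assumes one-dimensional sources with analytic Gaussian priors standing in for learned diffusion priors, a nearly noise-free additive mixture, and plain unadjusted Langevin dynamics instead of a diffusion sampler. The function name `langevin_separate` and all of its parameters are hypothetical.

```python
import math
import random


def langevin_separate(y, mu_s, var_s, mu_n, var_n, var_y=0.01,
                      step=0.002, n_steps=200_000, burn_in=5_000, seed=0):
    """Toy posterior sampler for the mixture y = s + n + small error.

    Assumption: s ~ N(mu_s, var_s) and n ~ N(mu_n, var_n) play the role of
    the speech and noise priors. Their scores are analytic here; a real
    system would use learned score networks instead.
    Returns the estimated posterior means of s and n.
    """
    rng = random.Random(seed)
    s, n = mu_s, mu_n                      # start both chains at the prior means
    noise_scale = math.sqrt(2.0 * step)    # Langevin noise standard deviation
    sum_s = sum_n = 0.0
    kept = 0
    for k in range(n_steps):
        # Likelihood score: pulls s + n toward the observed mixture y.
        residual = (y - s - n) / var_y
        # Posterior score = prior score + likelihood score, per source.
        score_s = -(s - mu_s) / var_s + residual
        score_n = -(n - mu_n) / var_n + residual
        s += step * score_s + noise_scale * rng.gauss(0.0, 1.0)
        n += step * score_n + noise_scale * rng.gauss(0.0, 1.0)
        if k >= burn_in:                   # average only after burn-in
            sum_s += s
            sum_n += n
            kept += 1
    return sum_s / kept, sum_n / kept


if __name__ == "__main__":
    s_hat, n_hat = langevin_separate(y=3.0, mu_s=2.0, var_s=1.0,
                                     mu_n=0.0, var_n=0.25)
    print(f"estimated source: {s_hat:.3f}, estimated noise: {n_hat:.3f}")
```

In this toy setting the mixture is split according to how much variance each prior allows, so the broader "speech" prior absorbs most of the gap between the observation and the prior means. Both components are recovered together, which mirrors the summary's point that the noise estimate is itself a usable output.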



AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.LG TIER_1 English(EN) · Yochai Yemini, Yoav Ellinson, Rami Ben-Ari, Sharon Gannot, Ethan Fetaya · 2026-06-16 04:00

SSNAPS: Audio-Visual Separation of Speech and Background Noise with Diffusion Inverse Sampling

arXiv:2602.01394v2 Announce Type: replace-cross Abstract: This paper addresses the challenge of audio-visual single-microphone speech separation and enhancement in the presence of real-world environmental noise. Our approach is based on generative inverse sampling, where we model…
