SSNAPS: Audio-Visual Separation of Speech and Background Noise with Diffusion Inverse Sampling
Researchers have developed SSNAPS, an unsupervised method for separating speech from background noise using audio-visual cues. The approach uses diffusion inverse sampling: clean speech and ambient noise are each modeled with a separate diffusion prior, and both priors are used to reconstruct all sources in the mixture. Despite being unsupervised, the method achieves a better word error rate (WER) than supervised baselines across the noisy conditions tested. It also handles multiple speakers and off-screen speaker separation. The separated noise component is of high enough fidelity to support downstream acoustic scene detection.
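
As a rough illustration of the two-prior idea, the sketch below is a toy and not the SSNAPS sampler. It assumes scalar Gaussian stand-ins for the learned speech and noise diffusion priors, an additive mixture y = s + n with Gaussian observation noise, and plain Langevin updates in place of the inverse sampler the method actually uses. The names `gaussian_score` and `separate` and all parameter values are hypothetical.

```python
import numpy as np


def gaussian_score(x, mean, std):
    """Score (gradient of the log density) of N(mean, std^2) at x.

    Stand-in for a learned diffusion prior's score network (assumption).
    """
    return (mean - x) / std**2


def separate(y, speech_prior, noise_prior, obs_std=0.5,
             n_chains=2000, n_steps=1500, step=0.02, seed=0):
    """Draw approximate posterior samples of (speech, noise) given y = s + n.

    Each prior is a (mean, std) pair. The posterior score for each source
    is its own prior score plus the shared likelihood gradient.
    """
    rng = np.random.default_rng(seed)
    s = np.zeros(n_chains)  # one speech sample per chain
    n = np.zeros(n_chains)  # one noise sample per chain
    for _ in range(n_steps):
        # Gradient of log p(y | s, n) with respect to s (and, equally, n).
        lik = (y - s - n) / obs_std**2
        grad_s = gaussian_score(s, *speech_prior) + lik
        grad_n = gaussian_score(n, *noise_prior) + lik
        # Unadjusted Langevin step: drift along the score plus injected noise.
        s = s + step * grad_s + np.sqrt(2 * step) * rng.standard_normal(n_chains)
        n = n + step * grad_n + np.sqrt(2 * step) * rng.standard_normal(n_chains)
    return s, n


if __name__ == "__main__":
    speech, noise = separate(0.5, (1.0, 1.0), (-2.0, 2.0))
    print("speech estimate:", speech.mean())
    print("noise estimate:", noise.mean())
```

Because both priors here are Gaussian, the posterior means have a closed form, so the chain averages can be checked against it. With learned priors no such closed form exists, which is why a sampler is needed in the first place.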