Brief · PulseAugur

TOOL · arXiv cs.LG English(EN) · 8h

SSNAPS: Audio-Visual Separation of Speech and Background Noise with Diffusion Inverse Sampling

Researchers have developed SSNAPS, an unsupervised method that uses audio-visual cues to separate speech from background noise. The approach relies on diffusion inverse sampling: it models clean speech and ambient noise with separate diffusion priors and reconstructs all sources from the mixture. It achieves a lower word error rate than supervised baselines across a range of noisy conditions, and it also handles multiple speakers and off-screen separation. The separated noise component has high enough fidelity to support downstream acoustic scene detection.
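The idea of sampling several sources from a mixture under separate priors can be illustrated with a toy sketch. This is not the SSNAPS implementation: the learned audio-visual diffusion priors are replaced here by zero-mean Gaussian priors with analytic scores, the sampler is plain unadjusted Langevin dynamics rather than a diffusion sampler, and the additive mixture model, the function names (`gaussian_score`, `separate`), and all parameter values are assumptions chosen for illustration.

```python
import numpy as np


def gaussian_score(x, mean, std):
    """Score (gradient of the log-density) of an isotropic Gaussian prior."""
    return -(x - mean) / std**2


def separate(y, speech_std=1.0, noise_std=0.5, obs_std=0.1,
             step=2e-3, n_steps=3000, n_chains=2000, seed=0):
    """Draw posterior samples of (speech, noise) given the mixture y.

    Toy assumptions (not from the paper): y = speech + noise + a small
    observation error, and both priors are zero-mean Gaussians standing in
    for learned diffusion priors.
    """
    rng = np.random.default_rng(seed)
    # Start every chain from a draw of the corresponding prior.
    s = rng.normal(size=(n_chains, y.size)) * speech_std
    n = rng.normal(size=(n_chains, y.size)) * noise_std
    for _ in range(n_steps):
        # Likelihood gradient: identical for both sources because y = s + n.
        residual = (y - s - n) / obs_std**2
        grad_s = gaussian_score(s, 0.0, speech_std) + residual
        grad_n = gaussian_score(n, 0.0, noise_std) + residual
        # Unadjusted Langevin step: drift along the posterior score plus noise.
        s = s + step * grad_s + np.sqrt(2 * step) * rng.normal(size=s.shape)
        n = n + step * grad_n + np.sqrt(2 * step) * rng.normal(size=n.shape)
    return s, n


y = np.array([1.0, -2.0, 0.5, 3.0])  # hypothetical observed mixture
speech_samples, noise_samples = separate(y)
speech_estimate = speech_samples.mean(axis=0)
noise_estimate = noise_samples.mean(axis=0)
```

Each source is pulled toward its own prior while the shared residual term keeps the two estimates consistent with the observed mixture. Because both sources are sampled jointly, the noise estimate comes out as a by-product and does not need a separate pass.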

SSNAPS
Diffusion Inverse Sampling
Yochai Yemini