Whisfusion uses masked diffusion for faster, more accurate speech recognition

By PulseAugur Editorial Team · [1 source] · 2026-06-10 04:00

Researchers have developed Whisfusion, a non-autoregressive automatic speech recognition (ASR) system built on masked diffusion models. The approach aims to match the accuracy of autoregressive models while cutting inference latency. Whisfusion trains a diffusion decoder on top of frozen Whisper-large-v3 audio embeddings, which enables parallel decoding, and is reported to outperform existing models in both speed and accuracy across multiple languages.

Impact: Establishes masked diffusion as a viable, high-throughput alternative for multilingual ASR, potentially accelerating real-time transcription applications.

Ranking rationale: This is a research paper detailing a new model architecture for ASR.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI TIER_1 English(EN) · Taeyoun Kwon, Junhyuk Ahn, Taegeun Yun, Heeju Jwa, Yoonchae Choi, Siwon Park, Jongchan Kim, Hyungon Ryu, Hyuk-Jae Lee, Nam-Joon Kim · 2026-06-10 04:00

Whisfusion: Parallel ASR Decoding with Masked Diffusion

arXiv:2508.07048v2 Announce Type: replace-cross Abstract: Autoregressive (AR) encoder-decoder models dominate high-quality multilingual ASR, but their left-to-right decoders make inference latency scale with transcript length. A natural alternative, CTC-style non-autoregressive (…
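
The latency point in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: `toy_denoiser` is a hypothetical stand-in for a diffusion decoder conditioned on audio embeddings (here it simply reads a reference transcript and attaches a random confidence), and the confidence-ranked unmasking schedule is an assumption borrowed from mask-predict style decoders. The sketch only shows why the number of sequential model calls can stay fixed while the transcript grows.

```python
import math
import random

MASK = "<mask>"


def toy_denoiser(tokens, target, rng):
    """Hypothetical stand-in for an audio-conditioned diffusion decoder.

    Returns one (predicted_token, confidence) pair per position. A real model
    would predict from audio embeddings; this toy version copies the reference
    transcript and assigns a random confidence to each position.
    """
    return [(target[i], rng.random()) for i in range(len(tokens))]


def parallel_decode(target, num_steps=4, seed=0):
    """Iteratively unmask a fully masked sequence in a fixed number of steps.

    Returns the decoded tokens and the number of denoiser calls made.
    """
    rng = random.Random(seed)
    length = len(target)
    tokens = [MASK] * length
    calls = 0
    for step in range(1, num_steps + 1):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        # One call scores every position at once (the parallel part).
        predictions = toy_denoiser(tokens, target, rng)
        calls += 1
        # Linear schedule: total tokens that should be revealed after this step.
        reveal_total = math.ceil(length * step / num_steps)
        k = reveal_total - (length - len(masked))
        # Commit only the k most confident predictions among masked positions.
        ranked = sorted(masked, key=lambda i: predictions[i][1], reverse=True)
        for i in ranked[:k]:
            tokens[i] = predictions[i][0]
    return tokens, calls


if __name__ == "__main__":
    reference = "the quick brown fox jumps over the lazy dog".split()
    decoded, calls = parallel_decode(reference, num_steps=4)
    print(decoded, calls)
```

An autoregressive decoder would need one sequential step per output token (nine for the sentence above), whereas this loop makes at most `num_steps` denoiser calls regardless of transcript length.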
