Brief · PulseAugur

TOOL · arXiv cs.AI English(EN) · 8h

Speech Meets ELF: Audio Conditional Continuous-Target Diffusion for Speech Recognition and Translation

Researchers have introduced ELF-S2T, a speech-to-text model that generates output in a continuous latent space rather than over discrete text tokens. Built on the Embedded Language Flows (ELF) backbone, the model uses audio conditioning and flow-matching denoising for both speech recognition and speech translation. Experiments on standard datasets show competitive performance and indicate that recognition and translation errors stem from similar confusions within this continuous latent space.
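To make "flow-matching denoising in a continuous latent space" concrete, here is a minimal generic sketch in NumPy. It is not the ELF-S2T implementation: the function names, the straight-line noise-to-target path, the Euler integrator, and the `oracle_velocity` stand-in for a trained audio-conditioned network are all assumptions chosen for illustration.

```python
import numpy as np

def flow_matching_pair(x1, x0, t):
    """Point on the straight path from noise x0 to target x1 at time t,
    plus the velocity a network would be trained to regress (assumed linear path)."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

def euler_sample(velocity_fn, x0, cond, steps=10):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x

def oracle_velocity(x, t, cond):
    # Hypothetical stand-in for a trained model: it treats the conditioning
    # as if it already were the target latent, which a real audio encoder would not be.
    return (cond - x) / (1.0 - t)

rng = np.random.default_rng(0)
target_latent = rng.normal(size=(4, 8))  # hypothetical text latents: 4 positions, 8 dims
noise = rng.normal(size=(4, 8))

# Training-side quantities at t = 0.5.
x_half, v_half = flow_matching_pair(target_latent, noise, 0.5)

# Sampling-side: denoise from pure noise toward the target latent.
decoded = euler_sample(oracle_velocity, noise, cond=target_latent, steps=10)
```

In a real system the velocity function would be a learned network conditioned on audio features, and the final latent would still need to be mapped back to text; both steps are outside this sketch.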

IMPACT The work suggests that speech recognition and translation can share a single continuous-latent model, which could simplify future model development.

Whisper
LibriSpeech
ELF-S2T
CoVoST2
Embedded Language Flows (ELF)