A new comprehensive test suite, Hearing to Translate, has been developed to evaluate the effectiveness of integrating the speech modality directly into Large Language Models (LLMs) for speech-to-text translation. The study benchmarks six state-of-the-art SpeechLLMs against sixteen cascaded systems, analyzing performance across 16 benchmarks, 13 language pairs, and 9 challenging conditions. Findings indicate that while cascaded systems remain the most reliable overall, recent SpeechLLMs can match or surpass them in specific scenarios, whereas standalone Speech Foundation Models (SFMs) generally lag behind.
IMPACT: New benchmarks for SpeechLLMs may accelerate research into more efficient and accurate speech translation systems.
RANK_REASON: This is a research paper introducing a new benchmark suite for evaluating SpeechLLMs.