HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec
Researchers have introduced HybridCodec, a neural audio codec designed for speech tokenization in multimodal large language models. The architecture unifies two existing approaches: it uses separate semantic and acoustic branches, and it also distills semantic information from self-supervised learning (SSL) representations. The resulting model achieves strong semantic disentanglement without needing an SSL model at inference time, and it shows superior semantic specialization with competitive reconstruction quality. HybridCodec is also reported to run 3x faster than previous dual-stream models and to remain robust in out-of-domain and zero-shot cross-lingual scenarios.
AI IMPACT: Enhances speech tokenization for multimodal LLMs, potentially improving cross-lingual and zero-shot capabilities.