X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs
Researchers have developed X-OPD, a framework for improving the capabilities of speech-based Large Language Models (LLMs). It targets the performance gap between end-to-end speech LLMs and their text-based counterparts, a gap that standard training techniques fail to close. In X-OPD, the speech LLM (the student) generates its own responses, and a text-based teacher model provides feedback on those on-policy explorations, distilling the teacher's knowledge into the student's multi-modal representations. Experiments show that X-OPD significantly narrows the gap on complex tasks while retaining the speech LLM's inherent abilities.
AI IMPACT: This framework could lead to more capable and better-aligned speech-based AI systems by reducing their performance disparity with text-only models.
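
The summary does not say which objective X-OPD optimizes, so the following is only a minimal sketch of the general on-policy distillation idea it describes, under stated assumptions: the student samples a response, both models score each sampled position, and the loss is the mean per-token reverse KL divergence, KL(student || teacher), which is a common choice for on-policy distillation. The function names (`softmax`, `reverse_kl`, `on_policy_distill_loss`) and the toy logits are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) over the vocabulary at one token position."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return sum(
        s * (math.log(s) - math.log(t))
        for s, t in zip(p_s, p_t)
        if s > 0.0
    )


def on_policy_distill_loss(student_steps, teacher_steps):
    """Mean per-token reverse KL over one student-sampled response.

    Assumption: student_steps[i] holds the speech student's logits at
    position i of its own sampled response, and teacher_steps[i] holds the
    text teacher's logits for the same position (given the text prompt).
    """
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


# Toy example: a 3-token vocabulary and a 2-token sampled response.
student = [[2.0, 0.5, 0.1], [0.3, 0.3, 1.5]]
teacher = [[2.5, 0.2, 0.1], [0.1, 0.2, 2.0]]
loss = on_policy_distill_loss(student, teacher)
```

In a real training loop the logits would come from the two models and the loss would be backpropagated through the student only. The loss is zero when the student's distribution matches the teacher's at every sampled position, and positive otherwise.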