OmniOPSD: Rationale-Privileged On-Policy Self-Distillation for Affective Computing
Researchers have introduced OmniOPSD, a framework for improving reinforcement learning in multimodal large language models (MLLMs), particularly on complex reasoning tasks where sparse rewards are a significant challenge. The approach uses rationale-privileged on-policy self-distillation: generated rationales serve as privileged evidence for a teacher model rather than as direct imitation targets for the student model. On the MER-UniBench benchmark, OmniOPSD achieved state-of-the-art performance with an average score of 84.19, which supports the effectiveness of rationale-privileged teacher guidance.
AI IMPACT: This framework could improve the reasoning capabilities of multimodal LLMs in complex, human-centered tasks by addressing reward sparsity and the cost of annotations.
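The summary does not give the training objective, so the following is only a minimal sketch of what a rationale-privileged self-distillation signal could look like, under stated assumptions. It assumes the teacher is the same model scoring the student's own sampled tokens while additionally seeing the rationale, and that the loss is a per-token KL divergence from teacher to student. The function names, the KL direction, and the toy logits are all hypothetical and are not taken from the OmniOPSD paper.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def self_distillation_loss(student_logits, teacher_logits):
    """Mean per-token KL(teacher || student) over one student-sampled response.

    student_logits[t]: next-token logits given the input only.
    teacher_logits[t]: next-token logits from the same model when the
    rationale is also in context (the "privileged" view; an assumption).
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)


# Toy example: a 2-token response over a 3-word vocabulary (made-up numbers).
student_logits = [[1.0, 0.5, 0.0], [0.2, 0.2, 0.2]]
teacher_logits = [[2.0, 0.5, -1.0], [0.2, 0.2, 0.2]]
loss = self_distillation_loss(student_logits, teacher_logits)
```

In this sketch the rationale never appears as a target sequence: it only changes the teacher's distribution, so the student receives a dense per-token signal on its own samples instead of a single sparse reward.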