
TOOL · arXiv cs.CV English(EN) · 8h

OmniOPSD: Rationale-Privileged On-Policy Self-Distillation for Affective Computing

Researchers have introduced OmniOPSD, a framework for improving reinforcement learning in multimodal large language models (MLLMs), particularly on complex reasoning tasks where sparse rewards are a significant challenge. The approach uses rationale-privileged on-policy self-distillation: generated rationales serve as privileged evidence for a teacher model rather than as direct imitation targets for the student model. On the MER-UniBench benchmark, OmniOPSD achieved state-of-the-art performance with an average score of 84.19, supporting the effectiveness of rationale-privileged teacher guidance.
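To make the idea of a rationale-privileged teacher more concrete, here is a minimal sketch, not the paper's implementation. It assumes that the teacher and student are the same model, that the teacher additionally sees the rationale in its context, and that the training signal is a per-token KL divergence from teacher to student over a response sampled by the student. The function names, the choice of KL direction, and the toy logits are all hypothetical; the brief does not specify the actual objective.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) at a single token position."""
    return sum(
        t * (math.log(t + eps) - math.log(s + eps))
        for t, s in zip(teacher_probs, student_probs)
    )


def self_distillation_loss(student_logits, teacher_logits):
    """Mean per-token KL over one student-sampled (on-policy) response.

    student_logits[i]: logits at position i given the prompt only.
    teacher_logits[i]: logits at position i from the same model when the
        prompt also contains the rationale (the privileged evidence).
    Both are assumed (hypothetically) to be scored on the same tokens.
    """
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must cover the same positions")
    kls = [
        token_kl(softmax(t), softmax(s))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(kls) / len(kls)


# Toy example: two token positions over a three-token vocabulary.
student = [[2.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
teacher = [[3.0, 0.5, 0.0], [0.0, 2.0, 0.0]]

loss = self_distillation_loss(student, teacher)
same = self_distillation_loss(student, student)
```

In this sketch the rationale never appears as a target sequence: it only changes the teacher's distribution, which then gives the student a dense per-token signal. That is one way such a signal could complement a sparse task reward, under the assumptions stated above.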

IMPACT This framework could improve the reasoning capabilities of multimodal LLMs in complex, human-centered tasks by addressing reward sparsity and the cost of annotations.

Hugging Face
arXiv
OmniOPSD
MER-UniBench