Towards Accurate Emotion-Attributed Video Captioning via Fine-grained Emotion-Cause Pair Extraction
Researchers have developed a new framework for generating more accurate and emotionally rich video captions. The approach extracts fine-grained emotion-cause pairs from videos rather than relying on global visual features, which can introduce redundant information. The proposed method enhances visual features by incorporating scene, object, and motion concepts, and refines emotional features using visual temporal dynamics and VAD-vector constraints. Experiments on three datasets showed significant improvements, including a 4.4% increase in BLEU-2 and a 5.4% increase in ROUGE-L on the EVC-MSVD dataset.
AI IMPACT: Introduces a novel method for improving the accuracy and emotional depth of video captioning, potentially benefiting content analysis and accessibility tools.