MERVIN: A Unified Framework for Multimodal Event Retrieval in Vietnamese News Videos
Researchers have developed MERVIN, a unified multimodal framework for event retrieval in Vietnamese news videos. The system integrates visual features, transcripts, and video summaries; it uses Gemini 1.5 Flash to improve transcript quality and a Perception Encoder to encode visual data. MERVIN achieved high scores in the AI Challenge HCMC 2025, successfully retrieving all query results in the final round.
AI IMPACT: This framework could improve how users search for and retrieve specific events in large archives of Vietnamese news videos.
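
To make the idea of combining visual features, transcripts, and summaries concrete, here is a minimal sketch of weighted late fusion over per-modality similarity scores. It is an illustration only, not MERVIN's actual method: the function names (`min_max_normalize`, `fuse_scores`), the clip IDs, the scores, and the weights are all hypothetical, and the summary above does not say how MERVIN combines its modalities.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one modality's scores to [0, 1] so modalities are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every candidate as equally relevant.
        return {clip_id: 1.0 for clip_id in scores}
    return {clip_id: (s - lo) / (hi - lo) for clip_id, s in scores.items()}


def fuse_scores(
    per_modality: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Weighted sum of normalized scores; clips missing from a modality add 0."""
    fused: Dict[str, float] = {}
    for modality, scores in per_modality.items():
        w = weights.get(modality, 0.0)
        for clip_id, s in min_max_normalize(scores).items():
            fused[clip_id] = fused.get(clip_id, 0.0) + w * s
    # Sort by descending fused score, breaking ties by clip ID for determinism.
    ranked = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Hypothetical query-to-clip similarity scores (made-up values, not real results).
example_scores = {
    "visual": {"clip_a": 0.2, "clip_b": 0.8, "clip_c": 0.5},
    "transcript": {"clip_a": 0.9, "clip_b": 0.3},
    "summary": {"clip_b": 0.6, "clip_c": 0.4},
}
# Assumed weights; in practice these would be tuned on validation queries.
example_weights = {"visual": 0.5, "transcript": 0.3, "summary": 0.2}

if __name__ == "__main__":
    for clip_id, score in fuse_scores(example_scores, example_weights):
        print(f"{clip_id}\t{score:.2f}")
```

Normalizing each modality before summing keeps one score scale from dominating the others. A real system might instead learn the fusion weights or use rank-based fusion.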