CASTLE2026 Team WDL Technical Report
The CASTLE Challenge @ EgoVis 2026 evaluated long-form egocentric video question answering on more than 600 hours of recordings. The winning system, developed by the CASTLE2026 Team WDL, uses a multimodal reasoning pipeline built on the Qwen model. The pipeline parses question hints, retrieves relevant audio transcriptions, and integrates auxiliary images and video frames to answer questions that require evidence from multiple sources. Techniques such as LoRA and frame sampling significantly improved performance, and the system ranked first in the challenge.
AI IMPACT: Demonstrates multimodal reasoning for egocentric video understanding, which could inform future AI systems for video analysis and QA.
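The summary names frame sampling as one of the techniques used but does not say which strategy the team applied. As a minimal sketch, assuming plain uniform sampling (an illustrative choice, not confirmed by the report), the function below picks evenly spaced frame indices from a clip. The name `uniform_frame_indices` and its parameters are hypothetical.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Return evenly spaced frame indices, one from the middle of each segment.

    Assumption: uniform sampling. The report summary does not specify the
    sampling strategy actually used by the team.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    # Never request more frames than the clip contains.
    n = min(num_samples, total_frames)
    # Split the clip into n equal segments and take each segment's midpoint.
    step = total_frames / n
    return [int(step * i + step / 2) for i in range(n)]


if __name__ == "__main__":
    # A 100-frame clip reduced to 4 representative frames.
    print(uniform_frame_indices(100, 4))
```

Taking the midpoint of each segment, rather than its first frame, avoids always sampling frame 0 and keeps the selected frames centred in the stretch of video they represent.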