PulseAugur

CASTLE2026 Team WDL wins video QA challenge with Qwen-based system

The CASTLE Challenge @ EgoVis 2026 evaluates long-form egocentric video question answering over more than 600 hours of multi-perspective recordings. Each question has four answer choices and requires evidence from several sources, including videos, transcripts, and auxiliary photos. The winning system, developed by CASTLE2026 Team WDL, is a multimodal reasoning pipeline built on the Qwen model: it parses question hints, retrieves relevant audio transcriptions, and combines them with auxiliary images and video frames to answer each question. LoRA fine-tuning and frame sampling improved performance, and the system ranked first in the challenge.
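Two of the steps named in the summary, frame sampling and transcript retrieval, can be illustrated with a minimal Python sketch. This is not the team's implementation: the function names, the evenly spaced sampling strategy, and the word-overlap scoring are all assumptions chosen for illustration, and the report may use different methods for both steps.

```python
import re


def sample_frame_indices(num_frames, num_samples):
    """Pick frame indices spread evenly over a clip (midpoint of each segment).

    Hypothetical helper: uniform sampling is an assumption, not the
    sampling scheme described in the report.
    """
    if num_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, num_frames)
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_transcripts(question, segments, top_k=2):
    """Rank transcript segments by word overlap with the question.

    Hypothetical helper: a real system would more likely use embeddings
    or a stronger retriever; overlap counting keeps the sketch dependency-free.
    """
    question_words = tokenize(question)
    scored = [(len(question_words & tokenize(seg)), i)
              for i, seg in enumerate(segments)]
    scored = [(score, i) for score, i in scored if score > 0]
    # Highest overlap first; ties keep the original segment order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [segments[i] for _, i in scored[:top_k]]


if __name__ == "__main__":
    segments = [
        "I will make coffee in the kitchen",
        "Let us play a board game",
        "The kitchen is a mess",
    ]
    print(sample_frame_indices(100, 4))
    print(retrieve_transcripts("Who made coffee in the kitchen?", segments))
```

In a pipeline like the one summarized above, the selected frames and retrieved transcript segments would then be passed, together with the question and its answer choices, to the multimodal model.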

IMPACT Demonstrates advanced multimodal reasoning for egocentric video understanding, potentially improving future AI systems for video analysis and QA.

RANK_REASON The cluster describes a technical report detailing a system that won a specific challenge, which falls under research achievements.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.CV TIER_1 English (EN) · Zhengyang Li, Zhenglin Du, Yi Wen, Fang Liu, Shuo Li, Xu Liu

    CASTLE2026 Team WDL Technical Report

    arXiv:2606.00712v1. Abstract: The CASTLE Challenge @ EgoVis 2026 evaluates long-form egocentric video question answering over 600+ hours of multi-perspective recordings. Each four-choice question requires evidence from videos, transcripts, auxiliary photos, peop…