ScoutVLA: UAV-Centric Active Perception via a Dual-Expert VLA Model for Open-World Embodied Question Answering
Researchers have introduced ScoutVLA, a dual-expert vision-language-action (VLA) model for aerial embodied question answering. It addresses a limitation of existing systems by letting unmanned aerial vehicles (UAVs) actively adjust their viewpoints to gather fine-grained evidence, an approach inspired by the 'waggle dance' of scout bees. ScoutVLA uses a decoupled architecture with separate experts for semantic intent inference and continuous trajectory generation, trained with a knowledge insulation mechanism to preserve multimodal reasoning. Field studies and simulations show that ScoutVLA significantly outperforms current state-of-the-art methods, achieving a 10.48x higher average strict success rate and 7.72x higher average QA correctness.
AI IMPACT: Introduces a new model architecture for embodied AI that could improve robotic perception and task completion in complex environments.