A Benchmark for Omni-Modal Reasoning in Long Videos
Researchers have introduced LongShOTBench, a new benchmark designed to evaluate omni-modal reasoning capabilities in long videos. This benchmark integrates vision, speech, and ambient audio, offering detailed rubrics for diagnostic evaluation. Alongside the benchmark, they developed LongShOTAgent, a training-free agent that demonstrates strong performance on the new testbed, outperforming current multi-modal large language models.