Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins
Researchers have developed OR3, a novel text-to-video retrieval system designed to enhance operating room safety by accurately identifying specific surgical events. Unlike previous methods that relied on global embeddings, OR3 converts video clips into action-driven digital twins (ActDTs), which group subject-action-object triplets within temporal intervals. This approach allows for imagination-based retrieval, where a large language model generates hypothetical ActDTs from queries, enabling more precise intra-modal matching. The system was tested on a benchmark of robotic knee procedures, demonstrating superior performance in fine-grained discrimination between visually similar clips.
AI IMPACT: Enhances operating room safety by enabling precise retrieval of surgical events through advanced AI reasoning.
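To make the ActDT idea concrete, here is a minimal Python sketch of the structure described above: a clip is represented as a list of temporal intervals, each holding a set of subject-action-object triplets, and a query is matched by comparing a hypothetical ("imagined") ActDT against each clip's ActDT. Everything in it is an illustrative assumption rather than OR3's actual implementation: the names `Interval`, `overlap_score`, and `retrieve` are hypothetical, the imagined ActDT is written by hand instead of being generated by a large language model, and the Jaccard overlap over triplets is a stand-in for whatever matching function the system really uses.

```python
from dataclasses import dataclass

# A (subject, action, object) triplet, e.g. ("nurse", "passes", "saw").
Triplet = tuple[str, str, str]


@dataclass(frozen=True)
class Interval:
    """One temporal interval of a clip and the triplets observed in it."""
    start: float  # seconds (assumed unit)
    end: float
    triplets: frozenset[Triplet]


# Hypothetical representation: an ActDT is an ordered list of intervals.
ActDT = list[Interval]


def all_triplets(dt: ActDT) -> frozenset[Triplet]:
    """Collect every triplet in an ActDT, ignoring timing."""
    return frozenset(t for interval in dt for t in interval.triplets)


def overlap_score(imagined: ActDT, candidate: ActDT) -> float:
    """Jaccard overlap between triplet sets (an assumed stand-in metric)."""
    a, b = all_triplets(imagined), all_triplets(candidate)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(imagined: ActDT, library: dict[str, ActDT]) -> list[str]:
    """Rank clip ids by how well their ActDT matches the imagined one."""
    return sorted(
        library,
        key=lambda clip_id: overlap_score(imagined, library[clip_id]),
        reverse=True,
    )


# Two clips that share the same cutting action and differ only in who
# passes the saw, which is the kind of distinction triplets can capture.
cutting = Interval(5.0, 10.0, frozenset({("surgeon", "cuts", "bone")}))
library = {
    "clip_a": [Interval(0.0, 5.0, frozenset({("nurse", "passes", "saw")})), cutting],
    "clip_b": [Interval(0.0, 5.0, frozenset({("surgeon", "passes", "saw")})), cutting],
}

# Hand-written stand-in for the LLM-generated hypothetical ActDT of the
# query "the nurse passes the saw".
imagined = [Interval(0.0, 5.0, frozenset({("nurse", "passes", "saw")}))]

ranking = retrieve(imagined, library)
```

Because both the query and the clips are expressed in the same triplet form, the comparison is intra-modal: `clip_a` shares the imagined triplet and ranks first, while `clip_b` differs only in the subject and scores zero under this simple metric.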