HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling
Researchers have developed HiDe, a training-free framework that improves the performance of Multimodal Large Language Models (MLLMs) on high-resolution images. HiDe identifies background interference, rather than object size, as the primary cause of performance degradation. The framework uses Token-wise Attention Decoupling (TAD) and Layout-Preserving Decoupling (LPD) to isolate key visual information and eliminate distracting background elements. This approach achieves state-of-the-art results on benchmarks including V*Bench, HRBench4K, and HRBench8K, significantly boosting models such as Qwen2.5-VL 7B and InternVL3 8B.
IMPACT: Enhances MLLM capabilities for high-resolution image analysis, potentially improving applications in fields like medical imaging and satellite imagery.