Brief · PulseAugur

TOOL · arXiv cs.CV · English (EN) · 7h

HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

Researchers have introduced HiDe, a training-free framework that improves the performance of Multimodal Large Language Models (MLLMs) on high-resolution images. The work argues that background interference, rather than object size, is the primary cause of performance degradation on such images. HiDe uses Token-wise Attention Decoupling (TAD) and Layout-Preserving Decoupling (LPD) to isolate key visual information and remove distracting background elements. The approach is reported to achieve state-of-the-art results on V*Bench, HRBench4K, and HRBench8K, with substantial gains for models such as Qwen2.5-VL 7B and InternVL3 8B.
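The general idea of selecting attention-relevant regions and discarding background while keeping spatial order can be illustrated with a small sketch. This is not the paper's implementation: the function names (`select_patches`, `compact_layout`, `decouple`), the relative threshold `ratio`, and the toy attention grid are all assumptions made for illustration, and the actual TAD and LPD procedures may differ substantially.

```python
from typing import List, Optional, Tuple

Grid = List[List[float]]


def select_patches(attn: Grid, ratio: float = 0.5) -> List[List[bool]]:
    """Mark patches whose attention is at least `ratio` times the peak value.

    Hypothetical stand-in for an attention-based selection step.
    """
    peak = max(max(row) for row in attn)
    cutoff = peak * ratio
    return [[score >= cutoff for score in row] for row in attn]


def compact_layout(mask: List[List[bool]]) -> Tuple[List[int], List[int]]:
    """Return the rows and columns that contain a selected patch.

    Original ordering is kept, so relative positions (left of, above)
    survive after the background rows and columns are dropped.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows, cols


def decouple(attn: Grid, ratio: float = 0.5) -> List[List[Optional[Tuple[int, int]]]]:
    """Build a compact grid of original patch coordinates.

    Cells that were background in the original grid are left as None.
    """
    mask = select_patches(attn, ratio)
    rows, cols = compact_layout(mask)
    return [[(r, c) if mask[r][c] else None for c in cols] for r in rows]


# Toy 4x4 attention grid (made-up values) with two salient patches.
example_attn = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.9, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.8],
]
compact = decouple(example_attn)
```

In this toy case the 4x4 grid shrinks to 2x2, and the two salient patches keep their diagonal arrangement. A relative threshold is used here only because it needs no tuning on the example; a real system would likely work from model attention maps and image crops rather than a hand-written grid.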

IMPACT Improves MLLM performance on high-resolution image analysis, which could benefit applications in fields such as medical imaging and satellite imagery.

Multimodal Large Language Models
V*Bench
Xianjie Liu
Qwen2.5-VL 7B
InternVL3 8B
HiDe
HRBench8K
HRBench4K