Brief · PulseAugur

TOOL · arXiv cs.AI · English (EN) · 8h

Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning

Researchers have proposed Cross-modal Identity Mapping (CIM), a reinforcement learning framework for improving image captioning in Large Vision-Language Models (LVLMs). CIM estimates how much information a caption loses by using the generated caption as a text query to retrieve images, then measuring how similar the retrieved images are to the original image. Because this signal comes from retrieval rather than from labels, the model can be trained to reduce information loss without additional annotations, which yields more precise descriptions. In the reported experiments, CIM improves captioning performance, including a 20% gain in relation reasoning for Qwen2.5-VL-7B on the COCO-LN500 benchmark.
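The retrieval-based signal described above can be illustrated with a toy sketch. This is not the paper's implementation: the embedding vectors, the gallery, the top-k cutoff, and the function names (`cosine`, `retrieve_top_k`, `retrieval_reward`) are all assumptions made for illustration, and the brief does not specify how CIM actually scores retrieved images. The sketch only shows the general idea that a caption which retrieves images close to the original should earn a higher reward than one that does not.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(caption_vec, gallery, k):
    """Hypothetical text-to-image search: rank gallery images by similarity to the caption."""
    ranked = sorted(gallery, key=lambda vec: cosine(caption_vec, vec), reverse=True)
    return ranked[:k]


def retrieval_reward(caption_vec, original_image_vec, gallery, k=2):
    """Assumed reward: mean similarity between the retrieved images and the original image."""
    retrieved = retrieve_top_k(caption_vec, gallery, k)
    if not retrieved:
        return 0.0
    return sum(cosine(original_image_vec, vec) for vec in retrieved) / len(retrieved)


# Toy 3-dimensional embeddings standing in for a shared text-image space (made up).
gallery = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
original = [1.0, 0.0, 0.0]
precise_caption = [0.95, 0.05, 0.0]  # caption that preserves the image content
vague_caption = [0.0, 0.7, 0.7]      # caption that loses the image content

precise_reward = retrieval_reward(precise_caption, original, gallery)
vague_reward = retrieval_reward(vague_caption, original, gallery)
print(precise_reward > vague_reward)
```

In a reinforcement learning setup, a score of this kind would serve as the reward for the captioning policy, which is why no human caption annotations are needed; how the actual framework combines its retrieval signals is not stated in this brief.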

IMPACT The method offers an annotation-free way to make LVLM-generated image descriptions more accurate, which could lead to more reliable multimodal AI systems.

reinforcement learning
Large Vision-Language Models
Qwen2.5-VL-7B
Cross-modal Identity Mapping
Gallery Representation Consistency
Query-gallery Image Relevance
COCO-LN500
Haonan Jia