ToolFG: Towards Well-Grounded Fine-Grained Image Classification
Researchers have introduced ToolFG, a framework for fine-grained image classification (FGIC) that pairs multimodal large language models (MLLMs) with external tools. The MLLM autonomously calls tools to interact with an image and gather verifiable visual cues, which makes distinctions between highly similar categories more reliable. The framework uses a knowledge distillation mechanism guided by Monte Carlo tree search (MCTS) and a model-tool co-evolution process to refine both the tools and the model's tool-use policy for specialized FGIC tasks.
IMPACT: Introduces a new method for fine-grained image classification by integrating MLLMs with external tools, potentially improving accuracy in distinguishing similar visual categories.
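The summary does not say how ToolFG's MCTS search is set up, so the following is only a minimal sketch of the generic selection and backpropagation steps used in textbook MCTS (the UCT rule), applied to a tree whose edges stand for tool calls. The `Node` class, the action labels such as "crop" and "zoom", and the exploration constant `c=1.4` are hypothetical choices for illustration, not details taken from the paper.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Hypothetical label for a tool call, e.g. "crop" or "zoom" (not from the paper).
    action: str
    visits: int = 0
    total_reward: float = 0.0
    children: List["Node"] = field(default_factory=list)


def uct_score(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT score: mean reward plus an exploration bonus."""
    if child.visits == 0:
        # Unvisited children are tried first.
        return math.inf
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore


def select_child(parent: Node, c: float = 1.4) -> Node:
    """Pick the child with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct_score(ch, parent.visits, c))


def backpropagate(path: List[Node], reward: float) -> None:
    """Add one visit and the observed reward to every node on the path."""
    for node in path:
        node.visits += 1
        node.total_reward += reward


# Toy tree: two tool calls tried equally often, one with higher average reward.
root = Node("root", visits=10, children=[
    Node("crop", visits=5, total_reward=4.0),
    Node("zoom", visits=5, total_reward=1.0),
])
best = select_child(root)
```

In a distillation setting, trajectories found by such a search would typically serve as training targets for the model's tool-use policy; how ToolFG scores trajectories and builds those targets is not described in this summary.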