PulseAugur

SIEVES method boosts multimodal LLM coverage on visual tasks with evidence scoring

Researchers have developed SIEVES, a method for improving the reliability of multimodal large language models (MLLMs) in out-of-distribution (OOD) scenarios. SIEVES learns to estimate the quality of the visual evidence that a reasoning model provides and uses that estimate for selective prediction, in which the system answers only when the evidence is judged good enough and abstains otherwise. The reported result is an increase in coverage, the share of questions the model answers rather than abstains on, of up to three times on challenging benchmarks. SIEVES can also be applied to proprietary models such as Gemini-3-Pro without access to their internal weights or logits.
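To make the coverage metric concrete, here is a minimal sketch of how coverage at a fixed risk level is commonly computed in selective prediction. It is not the SIEVES implementation: the function name `coverage_at_risk`, the toy scores, and the correctness labels are all hypothetical, and the scores simply stand in for whatever evidence-quality estimate a selector produces.

```python
from typing import Sequence


def coverage_at_risk(
    scores: Sequence[float],
    correct: Sequence[bool],
    max_risk: float,
) -> float:
    """Return the largest fraction of questions that can be answered while
    keeping the error rate among answered questions at or below max_risk.

    Questions are answered in order of descending score; everything below
    the chosen score threshold is treated as an abstention.
    """
    if len(scores) != len(correct):
        raise ValueError("scores and correct must have the same length")
    n = len(scores)
    if n == 0:
        return 0.0

    # Rank questions from most to least confident.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)

    best_answered = 0
    errors = 0
    for k, i in enumerate(order, start=1):
        if not correct[i]:
            errors += 1
        # A score threshold cannot separate tied scores, so only evaluate
        # cut points where the next score is strictly lower.
        at_boundary = k == n or scores[order[k]] < scores[i]
        if at_boundary and errors / k <= max_risk:
            best_answered = k
    return best_answered / n


# Hypothetical evidence-quality scores and answer correctness for 5 questions.
toy_scores = [0.95, 0.90, 0.80, 0.60, 0.40]
toy_correct = [True, True, False, True, False]

print(coverage_at_risk(toy_scores, toy_correct, max_risk=0.0))   # 0.4
print(coverage_at_risk(toy_scores, toy_correct, max_risk=0.25))  # 0.8
```

In this toy data, tolerating no errors allows only the two highest-scored questions to be answered (coverage 0.4), while tolerating a 25% error rate allows four of five (coverage 0.8). A better scorer ranks wrong answers lower, which raises coverage at the same risk.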

IMPACT Enhances MLLM reliability in real-world scenarios by improving selective prediction and generalization to unseen data.

RANK_REASON Academic paper introducing a new method for multimodal LLM generalization.

AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

  1. Hugging Face Daily Papers TIER_1 English(EN) ·

    SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

    Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world out-of-distribution (OOD) …

  2. arXiv cs.CV TIER_1 English(EN) · Hector G. Rodriguez, Marcus Rohrbach ·

    SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

    arXiv:2604.25855v1 · Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering benchmarks approach saturation, reliable deployment requires satisfying low error tol…

  3. arXiv cs.CV TIER_1 English(EN) · Marcus Rohrbach ·

    SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

    Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world out-of-distribution (OOD) …