Two research papers present new methods for training long-context visual document understanding models, achieving state-of-the-art performance on benchmarks such as MMLongBenchDoc. The first studies continued pretraining, supervised finetuning, and preference optimization for models of up to 32B parameters, finding that training context lengths should match evaluation lengths and that page indices significantly improve performance. The second introduces a synthetic data pipeline for reasoning in long-document understanding, using 'think' traces and 'cot' control tokens to internalize reasoning; with this approach, a 32B-parameter model surpassed a much larger one on MMLongBenchDoc.
AI IMPACT These advancements could significantly improve AI's ability to process and understand lengthy documents in enterprise, legal, and scientific applications.
RANK_REASON Two research papers published on arXiv detailing new methods for training long-context visual document understanding models.
- arXiv
- Austin Veselka
- Hugging Face
- Mistral Small 3.1 24B
- MMLBD-C
- MMLongBenchDoc
- Qwen3 VL
- Qwen3 VL 235B
- Qwen3 VL 32B
AI-generated summary · Google Gemini · from 2 sources.