Source-Grounded Data Generation for Text-to-JSON Learning
Researchers have introduced STAGE, a pipeline for generating training data for text-to-JSON conversion. It uses large language models to synthesize reports and JSON schemas, while the ground-truth values are validated against the underlying spreadsheets. On STAGE-Eval, a new benchmark dataset, training on STAGE data significantly improves the Qwen3-4B model's exact-match and value accuracy.
AI IMPACT: Enhances structured data extraction capabilities, potentially improving efficiency in industries that rely on document analysis.
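To make the two reported metrics concrete, here is a minimal Python sketch of how exact match and value accuracy could be scored for a predicted JSON object against a reference. The summary does not give STAGE-Eval's actual definitions, so the function names (`flatten`, `exact_match`, `value_accuracy`), the leaf-path comparison, and the sample record are all assumptions made for illustration.

```python
def flatten(obj, prefix=""):
    """Flatten nested JSON into a {path: leaf_value} mapping."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}/{key}"))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}/{index}"))
    else:
        items[prefix] = obj
    return items


def exact_match(predicted, reference):
    """Assumed definition: the whole predicted object equals the reference."""
    return predicted == reference


def value_accuracy(predicted, reference):
    """Assumed definition: fraction of reference leaf values reproduced at the same path."""
    ref_leaves = flatten(reference)
    pred_leaves = flatten(predicted)
    if not ref_leaves:
        # No leaf values to check; score 1.0 only if the prediction is also empty.
        return 1.0 if not pred_leaves else 0.0
    correct = sum(
        1
        for path, value in ref_leaves.items()
        if path in pred_leaves and pred_leaves[path] == value
    )
    return correct / len(ref_leaves)


# Hypothetical record: the reference would come from the source spreadsheet,
# the prediction from the model's parsed JSON output.
reference = {"company": "Acme", "revenue": 1200, "regions": ["EU", "US"]}
predicted = {"company": "Acme", "revenue": 1250, "regions": ["EU", "US"]}

print(exact_match(predicted, reference))     # False
print(value_accuracy(predicted, reference))  # 0.75
```

Under these assumed definitions, one wrong field fails exact match outright, while value accuracy still credits the three of four leaf values that were extracted correctly. That difference is one plausible reason to report both.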