PulseAugur

New pipeline unlocks visual data in materials science literature

Researchers have developed MatMMExtract, an open-source pipeline that extracts the visual data held in materials science literature. The system decomposes complex scientific figures into individual sub-panels and generates structured annotations using a large language model and a specialized taxonomy. Applied to over 14,000 articles, it produced MatSciFig, a dataset of nearly 400,000 image-text pairs, each with detailed categorization and summaries. The project also introduced MaterialScope, a detection dataset that improved a YOLO12-m model's accuracy at localizing figure panels, and found Gemini 3.1 Flash Lite to be the most cost-effective LLM for generating annotations.
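To make the decomposition step concrete, here is a minimal sketch of how detected panel boxes might be turned into ordered records ready for annotation. It is not the MatMMExtract implementation: the `PanelRecord` type, the `panels_in_reading_order` function, the `row_tolerance` parameter, and the field names are all hypothetical, and the bounding boxes are assumed to come from some upstream detector.

```python
from dataclasses import dataclass


@dataclass
class PanelRecord:
    """Hypothetical record for one sub-panel of a compound figure."""
    figure_id: str
    label: str
    box: tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    category: str = ""  # to be filled in by an annotation step
    summary: str = ""   # to be filled in by an annotation step


def panels_in_reading_order(figure_id, boxes, row_tolerance=20):
    """Order detected boxes top-to-bottom, then left-to-right.

    Boxes whose top edges lie within `row_tolerance` pixels of the first
    box in a row are treated as belonging to that row (an assumption made
    for this sketch, not a rule taken from the paper).
    """
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    # Within each row, sort by the left edge.
    ordered = [b for row in rows for b in sorted(row, key=lambda b: b[0])]
    return [
        PanelRecord(figure_id, f"panel_{i + 1}", box)
        for i, box in enumerate(ordered)
    ]


# Example: three boxes from a hypothetical detector, given out of order.
records = panels_in_reading_order(
    "fig_001",
    [(210, 12, 400, 200), (0, 10, 200, 200), (0, 220, 200, 400)],
)
```

Each record leaves `category` and `summary` empty, which is where LLM-generated annotations of the kind described above could be attached.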

IMPACT Enables large-scale AI analysis of visual data in scientific literature, potentially accelerating materials science discovery.

RANK_REASON The cluster describes a new dataset and pipeline for processing scientific literature, fitting the research category.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.AI · TIER_1 · English (EN) · Subham Ghosh, Shubham Tiwari, Mohammad Ibrahim, Abhishek Tewari

    Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature

    arXiv:2606.29667v1 Announce Type: cross Abstract: The materials science literature encodes decades of experimental knowledge in figures, yet this visual record remains locked away and inaccessible to AI at scale. The core difficulty is structural: most scientific figures are comp…