Jolia model enhances 3D CT analysis with concept-level vision-language alignment

By PulseAugur Editorial · [3 sources] · 2026-06-23 13:35

Researchers have introduced Jolia, a 3D CT foundation model that improves vision-language alignment for medical imaging. Unlike standard CLIP-style pretraining, Jolia uses a method called ConQuer (Concept Queries) to align image content locally with specific concepts in radiological reports. This lets the model capture more detail from lengthy reports and provides built-in spatial interpretability by generating an attention map for each concept. Jolia outperforms baseline models on benchmarks for tasks such as classification and report generation.
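To make the idea of per-concept alignment concrete, here is a minimal NumPy sketch of concept queries attending over 3D patch tokens. It is not the paper's implementation: the function names (`concept_query_pool`, `concept_alignment_scores`), the single-head dot-product attention, the tensor shapes, and the random inputs are all assumptions chosen for illustration. It only shows how one query per concept could yield both a concept-specific embedding and a spatial attention map.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def concept_query_pool(patch_tokens, concept_queries, grid_shape):
    """Hypothetical concept-query pooling (illustrative, not the paper's code).

    patch_tokens:    (N, d) visual tokens, with N = D * H * W patches
    concept_queries: (K, d) one query vector per concept
    grid_shape:      (D, H, W) layout of the patches in the volume
    Returns per-concept embeddings (K, d) and attention maps (K, D, H, W).
    """
    n, d = patch_tokens.shape
    assert n == int(np.prod(grid_shape)), "token count must match the grid"
    # Scaled dot-product scores between each concept and each patch.
    scores = concept_queries @ patch_tokens.T / np.sqrt(d)  # (K, N)
    attn = softmax(scores, axis=-1)                          # each row sums to 1
    concept_embeddings = attn @ patch_tokens                 # (K, d)
    attention_maps = attn.reshape(len(concept_queries), *grid_shape)
    return concept_embeddings, attention_maps


def concept_alignment_scores(concept_embeddings, text_embeddings):
    # Cosine similarity between each concept's visual and text embedding.
    v = concept_embeddings / np.linalg.norm(concept_embeddings, axis=-1, keepdims=True)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=-1, keepdims=True)
    return (v * t).sum(axis=-1)  # (K,)


# Toy data: random stand-ins for encoder outputs (assumed shapes).
rng = np.random.default_rng(0)
grid_shape = (4, 8, 8)   # depth, height, width in patches
d, k = 32, 3             # embedding size, number of concepts
patch_tokens = rng.normal(size=(int(np.prod(grid_shape)), d))
concept_queries = rng.normal(size=(k, d))
text_embeddings = rng.normal(size=(k, d))

concept_embeddings, attention_maps = concept_query_pool(
    patch_tokens, concept_queries, grid_shape
)
scores = concept_alignment_scores(concept_embeddings, text_embeddings)
```

In this sketch each attention map is a probability distribution over patch positions, so it can be overlaid on the volume to show where a given concept draws its evidence. A contrastive objective would then be applied to the per-concept similarity scores rather than to a single global image-report score.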

IMPACT This research could lead to more accurate and interpretable AI tools for medical diagnosis and report generation.

RANK_REASON The cluster describes a new research paper detailing a novel AI model and method for medical imaging analysis.

AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

arXiv cs.CV TIER_1 English(EN) · Jianpeng Zhang · 2026-06-24 08:24

Disease-Centric Vision-Language Pretraining with Hybrid Visual Encoding for 3D Computed Tomography

Vision-language pre-training (VLP) holds great promise for general-purpose medical AI by leveraging radiology reports as rich textual supervision, yet existing methods struggle with 3D CT imaging due to inefficient visual backbones and coarse semantic alignment. To address these …

arXiv cs.CV TIER_1 English(EN) · Julien Khlaut, Charles Corbière, Baptiste Callard, Amaury Prat, Leo Butsanets, Antoine Saporta, Théo Danielou, Leo Machado, Korentin Le Floch, Tom Boeken, Pierre Manceron, Corentin Dancette · 2026-06-24 04:00

Jolia: Concept-Level Vision-Language Alignment for 3D CT Contrastive Learning

arXiv:2606.24570v1. Vision-language contrastive pretraining has become the dominant recipe for 3D medical foundation models, leveraging the large volumes of paired scans and reports produced in clinical practice. However, medical images usually span do…

arXiv cs.CV TIER_1 English(EN) · Corentin Dancette · 2026-06-23 13:35

Jolia: Concept-Level Vision-Language Alignment for 3D CT Contrastive Learning

Vision-language contrastive pretraining has become the dominant recipe for 3D medical foundation models, leveraging the large volumes of paired scans and reports produced in clinical practice. However, medical images usually span dozens of organs, and radiological reports are muc…
