New methods enhance multimodal LLM continual learning

By PulseAugur Editorial · [9 sources] · 2026-06-01 17:11

Several new papers address multimodal continual instruction tuning (MCIT), in which a multimodal large language model (MLLM) learns a sequence of new tasks without losing earlier ones. One approach, CRAM, combines centroid routing with an adaptive Mixture of Experts (MoE) to isolate task-specific patterns and allocate parameters efficiently, mitigating catastrophic forgetting. Another method, ProtoAda, uses format-aware task prototypes to guide adaptive adapter expansion, routing, and parameter consolidation. A third framework, PROXY-MIX, learns a dynamic replay controller on a small proxy model and transfers it to larger models to preserve capabilities and alignment behavior during continual tuning.
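
As a rough illustration of the general idea behind centroid-based routing, the sketch below keeps one running mean embedding per task and sends a new input to the task whose centroid is most similar. It is a generic nearest-centroid router written for this summary, not CRAM's actual algorithm; the class name `CentroidRouter`, the task labels, and the two-dimensional embeddings are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CentroidRouter:
    """Hypothetical nearest-centroid router: one running mean embedding per task."""

    def __init__(self):
        self.centroids = {}  # task name -> (mean vector, number of samples seen)

    def update(self, task, embedding):
        """Fold one embedding into the task's running mean."""
        if task not in self.centroids:
            self.centroids[task] = (list(embedding), 1)
            return
        mean, count = self.centroids[task]
        count += 1
        mean = [m + (x - m) / count for m, x in zip(mean, embedding)]
        self.centroids[task] = (mean, count)

    def route(self, embedding):
        """Return the task whose centroid is most similar to the embedding."""
        if not self.centroids:
            raise ValueError("no centroids registered yet")
        return max(
            self.centroids,
            key=lambda task: cosine(self.centroids[task][0], embedding),
        )


# Toy usage with made-up 2-D embeddings for two made-up tasks.
router = CentroidRouter()
for emb in ([1.0, 0.0], [0.9, 0.1]):
    router.update("captioning", emb)
for emb in ([0.0, 1.0], [0.1, 0.9]):
    router.update("vqa", emb)

chosen = router.route([0.8, 0.2])  # selects the "captioning" centroid
```

In a continual-learning setting, the routed task would typically select which expert or adapter handles the input, so parameters learned for earlier tasks are left untouched when a new task arrives.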

IMPACT These advancements aim to make multimodal LLMs more adaptable and efficient in real-world applications by improving their ability to learn new tasks without forgetting previous ones.

RANK_REASON Multiple research papers introducing novel methods for multimodal continual instruction tuning.

AI-generated summary · Google Gemini · from 9 sources.

COVERAGE [9]

arXiv cs.AI TIER_1 English(EN) · Wayner Barrios, Andrés Villa, Juan León Alcázar, SouYoung Jin, Bernard Ghanem · 2026-06-08 04:00

MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs

arXiv:2506.01850v2 Announce Type: replace-cross Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in instruction-following tasks by integrating pretrained visual encoders with large language models (LLMs). However, existing approaches often strug…
arXiv cs.CL TIER_1 English(EN) · Luis Palacios, Lorenzo Basile, Diego Doimo, Alberto Cazzaniga · 2026-06-03 04:00

Visual Instruction Tuning Aligns Modalities through Abstraction

arXiv:2606.03871v1 Announce Type: cross Abstract: Visual instruction tuning effectively adapts a pre-trained Large Language Model (LLM) to process image information alongside text. Yet, it remains unclear how visual features are embedded into the layer-wise hierarchy of abstracti…
arXiv cs.CL TIER_1 English(EN) · Alberto Cazzaniga · 2026-06-02 16:42

Visual Instruction Tuning Aligns Modalities through Abstraction

Visual instruction tuning effectively adapts a pre-trained Large Language Model (LLM) to process image information alongside text. Yet, it remains unclear how visual features are embedded into the layer-wise hierarchy of abstractions of the LLM backbone. Across a diverse set of v…
arXiv cs.CL TIER_1 English(EN) · Jun-Tao Tang, Zhen-Hao Xie, Yu-Cheng Shi, Da-Wei Zhou · 2026-06-02 04:00

CRAM: Centroid-Routing and Adaptive MoE for Multimodal Continual Instruction Tuning

arXiv:2606.02502v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) unify heterogeneous vision-language tasks under a shared generative framework via instruction tuning, yet real-world deployment demands continuous capability expansion, making Multimodal Cont…
arXiv cs.LG TIER_1 English(EN) · Ibne Farabi Shihab, Fariya Afrin, Anuj Sharma · 2026-06-02 04:00

Dynamic Proxy-Mixing: Transferring Replay Controllers from Small to Large Models for Continual Instruction Tuning

arXiv:2606.00400v1 Announce Type: new Abstract: Continual instruction tuning updates a language model through a sequence of new domains, yet each update can progressively erode previously learned capabilities and alignment behavior. Replay is the standard mitigation, but fixed re…
arXiv cs.LG TIER_1 English(EN) · Yu-Cheng Shi, Zhen-Hao Xie, Jun-Tao Tang, Da-Wei Zhou · 2026-06-02 04:00

ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning

arXiv:2606.02576v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually acquire new vision-language capabilities, making Multimodal Continual Instructi…
arXiv cs.LG TIER_1 English(EN) · Da-Wei Zhou · 2026-06-01 17:59

ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning

Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually acquire new vision-language capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. To reduce inter-task i…
arXiv cs.CL TIER_1 English(EN) · Da-Wei Zhou · 2026-06-01 17:11

CRAM: Centroid-Routing and Adaptive MoE for Multimodal Continual Instruction Tuning

Multimodal Large Language Models (MLLMs) unify heterogeneous vision-language tasks under a shared generative framework via instruction tuning, yet real-world deployment demands continuous capability expansion, making Multimodal Continual Instruction Tuning (MCIT) essential. Exist…
arXiv cs.CV TIER_1 English(EN) · Ziqi Wang, Chang Che, Qi Wang, Hui Ma, Zenglin Shi, Cees G. M. Snoek, Meng Wang · 2026-06-05 04:00

Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs

arXiv:2511.20158v2 Announce Type: replace Abstract: While continual visual instruction tuning (CVIT) has shown promise in adapting multimodal large language models (MLLMs), existing studies predominantly focus on models without safety alignment. This critical oversight ignores th…
