New AI research tackles multimodal finetuning, image editing, and verification

By PulseAugur Editorial · [24 sources] · 2025-05-29 00:00

Researchers have developed TRACER, a method for robust multimodal finetuning that addresses catastrophic forgetting by using a Weighted Moving Average (WMA) teacher. The approach improves out-of-distribution accuracy and calibration in models like CLIP. Separately, OmniVerifier-M1 introduces a multimodal meta-verifier that uses symbolic rationales for more reliable and fine-grained verification in foundation models. Additionally, BlazeEdit offers an efficient, compact image-to-image diffusion model for on-device editing, and Alterbute enables editing of intrinsic object attributes such as color and shape while preserving the object's identity.
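To make the "moving-average teacher" idea concrete, here is a minimal sketch of how a teacher copy of a model can track a student's weights during finetuning. It is not TRACER's actual update rule: the summary does not spell out the weighting scheme, so the fixed `decay`, the `update_teacher` helper, and the dictionary-of-lists parameter layout are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical parameter layout: layer name -> flat list of weights.
Params = Dict[str, List[float]]


def update_teacher(teacher: Params, student: Params, decay: float = 0.99) -> Params:
    """Blend teacher and student weights: decay * teacher + (1 - decay) * student.

    A fixed-decay average is assumed here as a stand-in; the real method
    may weight past checkpoints differently.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must be in [0, 1]")
    return {
        name: [
            decay * t + (1.0 - decay) * s
            for t, s in zip(teacher[name], student[name])
        ]
        for name in teacher
    }


# Toy example: the teacher moves halfway toward the student.
teacher = {"w": [1.0, 2.0]}
student = {"w": [3.0, 0.0]}
teacher = update_teacher(teacher, student, decay=0.5)
print(teacher)  # {'w': [2.0, 1.0]}
```

In a setup like this, a decay close to 1 keeps the teacher near the pretrained weights, which is the usual intuition for why such a teacher can act as a regularizer against forgetting.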

IMPACT Introduces new methods for robust model training, efficient on-device AI, and improved verification, potentially accelerating deployment and capability.

RANK_REASON Multiple research papers on novel AI techniques and models.

AI-generated summary · Google Gemini · from 24 sources.

COVERAGE [24]

arXiv cs.AI TIER_1 Română(RO) · Hesam Asadollahzadeh, Feng Liu, Christopher Leckie, Sarah M. Erfani · 2026-05-29 04:00

TRACER: Persistent Regularization for Robust Multimodal Finetuning

arXiv:2605.29380v1 Announce Type: cross Abstract: Mainstream strategies for finetuning pretrained multimodal models often degrade out-of-distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. In this paper, we develop a theoretical framework for multimodal …
Hugging Face Daily Papers TIER_1 Română(RO) · 2026-05-28 05:34

TRACER: Persistent Regularization for Robust Multimodal Finetuning

Mainstream strategies for finetuning pretrained multimodal models often degrade out-of-distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. In this paper, we develop a theoretical framework for multimodal contrastive finetuning, yielding closed-form solut…
arXiv cs.CL TIER_1 English(EN) · Li Lei, Madalina Ciobanu, Qingqing Mao, Ritankar Das · 2026-05-28 04:00

Interpretability-Guided Layer Selection over Subspace Projection: SAEs as Stethoscopes, Not Scalpels, for Raw Task Vector Model Editing

arXiv:2605.28649v1 Announce Type: cross Abstract: LLMs increasingly require surgical model editing to enhance domain-specific capabilities without incurring the computational cost or catastrophic forgetting associated with full fine-tuning. Sparse Autoencoders (SAEs) have emerged…
arXiv cs.AI TIER_1 English(EN) · Fei Deng, Yanwu Xu, Zhipeng Bao, Zhixing Zhang, Haolin Jia, Karthik Raveendran, Jianing Wei · 2026-05-28 04:00

BlazeEdit: Generalist Image Editing on Mobile Devices with Image-to-Image Diffusion Models

arXiv:2605.28067v1 Announce Type: new Abstract: The remarkable generation quality of modern diffusion models often comes at the cost of massive parameter counts, which necessitate server-side inference with significant computational costs and potential privacy risks. Consequently…
arXiv cs.AI TIER_1 English(EN) · Xinchen Zhang, Bowei Liu, Jiale Liu, Chufan Shi, Yizhen Zhang, Junhong Liu, Youliang Zhang, Zhiheng Li, Yujiu Yang, Ling Yang · 2026-05-28 04:00

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

arXiv:2605.28805v1 Announce Type: cross Abstract: Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verificat…
arXiv cs.CL TIER_1 English(EN) · Ritankar Das · 2026-05-27 15:52

Interpretability-Guided Layer Selection over Subspace Projection: SAEs as Stethoscopes, Not Scalpels, for Raw Task Vector Model Editing

LLMs increasingly require surgical model editing to enhance domain-specific capabilities without incurring the computational cost or catastrophic forgetting associated with full fine-tuning. Sparse Autoencoders (SAEs) have emerged as a promising tool in this setting, in principle…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-27 00:00

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

Multimodal meta-verification using symbolic rationales and decoupled reinforcement learning enables robust visual verification and fine-grained error localization in generalist foundation models.
arXiv cs.AI TIER_1 English(EN) · Ziyang Liu · 2026-05-26 04:00

Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing

arXiv:2604.18170v2 Announce Type: replace-cross Abstract: LLMs edit text and code by autoregressively regenerating the full output, even when most tokens appear verbatim in the input. We study Copy-as-Decode, a decoding-layer mechanism that recasts edit generation as structured d…
arXiv cs.AI TIER_1 English(EN) · Xiangyuan Wang, Honghao Cai, Yunhao Bai, Chao Hui, Tianze Zhou, Haohua Chen, Hao Shi, Yuling Wu, Yao Hu, Xu Tang, Yibo Chen, Wei Zhu · 2026-05-26 04:00

EditCaption: Human-Refined SFT and HAE-DPO for Image Editing Instruction Synthesis

arXiv:2604.08213v2 Announce Type: replace-cross Abstract: High-quality source-target image pairs with precise editing instructions are essential for instruction-guided image editing, yet constructing such training triplets at scale remains costly. Recent pipelines often rely on v…
arXiv cs.CV TIER_1 English(EN) · Tal Reiss, Daniel Winter, Matan Cohen, Alex Rav-Acha, Yael Pritch, Ariel Shamir, Yedid Hoshen · 2026-05-28 04:00

Alterbute: Editing Intrinsic Attributes of Objects in Images

arXiv:2601.10714v2 Announce Type: replace Abstract: We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and …
arXiv cs.CV TIER_1 English(EN) · Ling Yang · 2026-05-27 17:56

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verification, which leverages verifier-generated rationales…
arXiv cs.CV TIER_1 English(EN) · Han Zou, Yan Zhang, Ruiqi Yu, Cong Xie, Jie Huang, Zhenpeng Zhan · 2026-05-27 04:00

SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing

arXiv:2512.14140v2 Announce Type: replace Abstract: Sketch editing requires jointly handling high-level semantic changes and precise local redrawing, a combination that is particularly challenging for sparse, style-sensitive line art. Unlike natural images, sketches rely on minim…
arXiv cs.CV TIER_1 English(EN) · Tong Wang, Meng Zou, Chengjing Wu, Xiaochao Qu, Luoqi Liu, Xiaolin Hu, Ting Liu · 2026-05-27 04:00

MiVE: Multiscale Vision-language features for reference-guided video Editing

arXiv:2605.14664v2 Announce Type: replace Abstract: Reference-guided video editing takes a source video, a text instruction, and a reference image as inputs, requiring the model to faithfully apply the instructed edits while preserving original motion and unedited content. Existi…
arXiv cs.CV TIER_1 English(EN) · Yuanye Liu, Siyuan Zhou, Ke Zhang, Lei Li, Wei Chen, Xiahai Zhuang · 2026-05-26 04:00

X-Edit: Exact, Explicit, and Explainable Null-Space Editing for Medical Vision Transformers

arXiv:2605.24932v1 Announce Type: new Abstract: Pre-trained Vision Transformers (ViTs) are increasingly deployed for medical image classification. However, correcting their inevitable failure cases in dynamic clinical scenarios poses a critical challenge. Conventional fine-tuning…
arXiv cs.CV TIER_1 English(EN) · Yuke Li, Lianli Gao, Ji Zhang, Pengpeng Zeng, Lichuan Xiang, Hongkai Wen, Heng Tao Shen, Jingkuan Song · 2026-05-26 04:00

Reversible Inversion for Training-Free Exemplar-guided Image Editing

arXiv:2512.01382v4 Announce Type: replace Abstract: Exemplar-guided Image Editing (EIE) aims to modify a source image according to a visual reference. Existing approaches often require large-scale pre-training to learn relationships between the source and reference images, incurr…
arXiv cs.CV TIER_1 English(EN) · Mingyi Xu, Jinpeng Lin, Min Zhou, Tiezheng Ge, Ming Zeng · 2026-05-26 04:00

Rethinking Scribble-Guided Image Editing: Generalization, Instruction Adherence, and Multi-Tasking

arXiv:2605.25568v1 Announce Type: new Abstract: Scribble-guided image editing allows users to combine simple scribble annotations with text prompts to specify both where and how an image should be edited, enabling flexible interaction with precise spatial control. However, existi…
arXiv cs.CV TIER_1 English(EN) · Yumeng He, Xiaoying Wang, Peihao Li, Yanjia Huang, Joe Masterjohn, Jiajun Wu, Leonidas Guibas, Yin Yang, Ying Jiang, Chenfanfu Jiang · 2026-05-26 04:00

Fishbone: From One 3D Asset to a Million Controllable Edits

arXiv:2605.24805v1 Announce Type: new Abstract: Large-scale controllable 3D assets are critical for computer graphics, embodied AI, robotics, and interactive content creation, yet creating diverse 3D assets remains challenging due to the high cost of manual modeling and rigging. …
arXiv cs.CV TIER_1 English(EN) · Zhizhou Chen, Shanyan Guan, Zhanxin Gao, En Ci, Yanhao Ge, Wei Li, Zhenyu Zhang, Jian Yang, Ying Tai · 2026-05-25 04:00

VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset

arXiv:2605.23518v1 Announce Type: new Abstract: Directly editing ultra-high-resolution (UHR) images is valuable but underexplored, primarily due to the lack of high-quality data and the challenge in modeling high-frequency texture details. We introduce VINS-120K, the first large-…
arXiv cs.CV TIER_1 English(EN) · Dian Zheng, Manyuan Zhang, Hongyu Li, Hongbo Liu, Kai Zou, Kaituo Feng, Hongsheng Li · 2026-05-25 04:00

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

arXiv:2605.21487v2 Announce Type: replace Abstract: Currently, enhancing Unified Multimodal Models (UMMs) with image understanding, generation, and editing capabilities mainly relies on mixed multi-task training. Due to inherent task conflicts, such strategy requires complex mult…
arXiv cs.CV TIER_1 English(EN) · Ying Tai · 2026-05-22 11:33

VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset

Directly editing ultra-high-resolution (UHR) images is valuable but underexplored, primarily due to the lack of high-quality data and the challenge in modeling high-frequency texture details. We introduce VINS-120K, the first large-scale dataset for instruction-based UHR image ed…
arXiv cs.CV TIER_1 English(EN) · Hongzhu Yi, Yujia Yang, Yuanxiang Wang, Tong Li, Zhenyu Guan, Tianyu Zong, Jiahuan Chen, Chenxi Bao, Tiankun Yang, Haopeng Jin, Yixuan Yuan, Xinming Wang, Tao Yu, Ruilin Gao, Ruiwen Tao, Haijin Liang, Jin Ma, Jinwen Luo, Yeshani, Xinyu Zuo, Jungang Xu · 2026-05-22 04:00

VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

arXiv:2602.00122v2 Announce Type: replace Abstract: In recent years, image editing models have made significant progress, enabling users to manipulate visual content in a flexible and interactive manner through natural language instructions. However, an important yet underexplore…
arXiv cs.CV TIER_1 English(EN) · Huanyu Zhang, Xuehai Bai, Chengzu Li, Chen Liang, Haochen Tian, Haodong Li, Ruichuan An, Yifan Zhang, Anna Korhonen, Zhang Zhang, Liang Wang, Tieniu Tan · 2026-05-22 04:00

How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing

arXiv:2602.01851v2 Announce Type: replace Abstract: Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructio…
雷峰网 (Leiphone) TIER_1 中文(ZH) · 2026-05-29 07:13

CVPR 2026 Image Editing Trends: From Referencing One Image to Fusing the Entire Visual World

Together AI blog TIER_1 English(EN) · 2025-05-29 00:00

FLUX.1 Kontext models: Character consistency and precise image editing without fine-tuning
