EditCaption: Human-Refined SFT and HAE-DPO for Image Editing Instruction Synthesis

New AI research focuses on multimodal fine-tuning, image editing, and verification

By PulseAugur Editorial · [24 sources] · 2025-05-29 00:00

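Researchers have developed TRACER, a novel method for robust multimodal fine-tuning that addresses catastrophic forgetting by using a weighted moving average (WMA) teacher. The method improves out-of-distribution accuracy and calibration for models such as CLIP. Separately, OmniVerifier-M1 introduces a multimodal meta-verifier that uses symbolic outputs to make foundation-model verification more reliable and fine-grained. In addition, BlazeEdit provides an efficient, compact image-to-image diffusion model for on-device editing, while Alterbute edits intrinsic object attributes such as color and shape while preserving identity.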

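Impact: Introduces new methods for robust model training, efficient on-device AI, and improved verification, potentially accelerating deployment and capability gains.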

Ranking rationale: Multiple research papers on novel AI techniques and models.

AI-generated summary · Google Gemini · from 24 sources.

Sources [24]

arXiv cs.AI TIER_1 Română(RO) · Hesam Asadollahzadeh, Feng Liu, Christopher Leckie, Sarah M. Erfani · 2026-05-29 04:00

TRACER: Persistent Regularization for Robust Multimodal Fine-Tuning

arXiv:2605.29380v1 Announce Type: cross Abstract: Mainstream strategies for finetuning pretrained multimodal models often degrade out-of-distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. In this paper, we develop a theoretical framework for multimodal …
Hugging Face Daily Papers TIER_1 Română(RO) · 2026-05-28 05:34

TRACER: Persistent Regularization for Robust Multimodal Fine-Tuning

Mainstream strategies for finetuning pretrained multimodal models often degrade out-of-distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. In this paper, we develop a theoretical framework for multimodal contrastive finetuning, yielding closed-form solut…
arXiv cs.CL TIER_1 English(EN) · Li Lei, Madalina Ciobanu, Qingqing Mao, Ritankar Das · 2026-05-28 04:00

Interpretability-Informed Layer Selection for Subspace Projection: SAEs as a Stethoscope, Not a Scalpel, for Raw Task-Vector Model Editing

arXiv:2605.28649v1 Announce Type: cross Abstract: LLMs increasingly require surgical model editing to enhance domain-specific capabilities without incurring the computational cost or catastrophic forgetting associated with full fine-tuning. Sparse Autoencoders (SAEs) have emerged…
arXiv cs.AI TIER_1 English(EN) · Fei Deng, Yanwu Xu, Zhipeng Bao, Zhixing Zhang, Haolin Jia, Karthik Raveendran, Jianing Wei · 2026-05-28 04:00

BlazeEdit: General-Purpose Image Editing on Mobile Devices with an Image-to-Image Diffusion Model

arXiv:2605.28067v1 Announce Type: new Abstract: The remarkable generation quality of modern diffusion models often comes at the cost of massive parameter counts, which necessitate server-side inference with significant computational costs and potential privacy risks. Consequently…
arXiv cs.AI TIER_1 English(EN) · Xinchen Zhang, Bowei Liu, Jiale Liu, Chufan Shi, Yizhen Zhang, Junhong Liu, Youliang Zhang, Zhiheng Li, Yujiu Yang, Ling Yang · 2026-05-28 04:00

OmniVerifier-M1: A Multimodal Meta-Verifier with Explicit Structured Calibration

arXiv:2605.28805v1 Announce Type: cross Abstract: Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verificat…
arXiv cs.CL TIER_1 English(EN) · Ritankar Das · 2026-05-27 15:52

Interpretability-Informed Layer Selection for Subspace Projection: SAEs as a Stethoscope, Not a Scalpel, for Raw Task-Vector Model Editing

LLMs increasingly require surgical model editing to enhance domain-specific capabilities without incurring the computational cost or catastrophic forgetting associated with full fine-tuning. Sparse Autoencoders (SAEs) have emerged as a promising tool in this setting, in principle…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-27 00:00

OmniVerifier-M1: A Multimodal Meta-Verifier with Explicit Structured Recalibration

Multimodal meta-verification using symbolic rationales and decoupled reinforcement learning enables robust visual verification and fine-grained error localization in generalist foundation models.
arXiv cs.AI TIER_1 English(EN) · Ziyang Liu · 2026-05-26 04:00

Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing

arXiv:2604.18170v2 Announce Type: replace-cross Abstract: LLMs edit text and code by autoregressively regenerating the full output, even when most tokens appear verbatim in the input. We study Copy-as-Decode, a decoding-layer mechanism that recasts edit generation as structured d…
arXiv cs.AI TIER_1 English(EN) · Xiangyuan Wang, Honghao Cai, Yunhao Bai, Chao Hui, Tianze Zhou, Haohua Chen, Hao Shi, Yuling Wu, Yao Hu, Xu Tang, Yibo Chen, Wei Zhu · 2026-05-26 04:00

EditCaption: Human-Refined SFT and HAE-DPO for Image Editing Instruction Synthesis

arXiv:2604.08213v2 Announce Type: replace-cross Abstract: High-quality source-target image pairs with precise editing instructions are essential for instruction-guided image editing, yet constructing such training triplets at scale remains costly. Recent pipelines often rely on v…
arXiv cs.CV TIER_1 English(EN) · Tal Reiss, Daniel Winter, Matan Cohen, Alex Rav-Acha, Yael Pritch, Ariel Shamir, Yedid Hoshen · 2026-05-28 04:00

Alterbute: Editing Intrinsic Attributes of Objects in Images

arXiv:2601.10714v2 Announce Type: replace Abstract: We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and …
arXiv cs.CV TIER_1 English(EN) · Ling Yang · 2026-05-27 17:56

OmniVerifier-M1: A Multimodal Meta-Verifier with Explicit Structured Calibration

Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verification, which leverages verifier-generated rationales…
arXiv cs.CV TIER_1 English(EN) · Han Zou, Yan Zhang, Ruiqi Yu, Cong Xie, Jie Huang, Zhenpeng Zhan · 2026-05-27 04:00

SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing

arXiv:2512.14140v2 Announce Type: replace Abstract: Sketch editing requires jointly handling high-level semantic changes and precise local redrawing, a combination that is particularly challenging for sparse, style-sensitive line art. Unlike natural images, sketches rely on minim…
arXiv cs.CV TIER_1 English(EN) · Tong Wang, Meng Zou, Chengjing Wu, Xiaochao Qu, Luoqi Liu, Xiaolin Hu, Ting Liu · 2026-05-27 04:00

MiVE: Multiscale Vision-Language Features for Reference-Guided Video Editing

arXiv:2605.14664v2 Announce Type: replace Abstract: Reference-guided video editing takes a source video, a text instruction, and a reference image as inputs, requiring the model to faithfully apply the instructed edits while preserving original motion and unedited content. Existi…
arXiv cs.CV TIER_1 English(EN) · Yuanye Liu, Siyuan Zhou, Ke Zhang, Lei Li, Wei Chen, Xiahai Zhuang · 2026-05-26 04:00

X-Edit: Precise, Explicit, and Interpretable Null-Space Editing for Medical Vision Transformers

arXiv:2605.24932v1 Announce Type: new Abstract: Pre-trained Vision Transformers (ViTs) are increasingly deployed for medical image classification. However, correcting their inevitable failure cases in dynamic clinical scenarios poses a critical challenge. Conventional fine-tuning…
arXiv cs.CV TIER_1 English(EN) · Yuke Li, Lianli Gao, Ji Zhang, Pengpeng Zeng, Lichuan Xiang, Hongkai Wen, Heng Tao Shen, Jingkuan Song · 2026-05-26 04:00

Reversible Inversion for Training-Free Exemplar-Guided Image Editing

arXiv:2512.01382v4 Announce Type: replace Abstract: Exemplar-guided Image Editing (EIE) aims to modify a source image according to a visual reference. Existing approaches often require large-scale pre-training to learn relationships between the source and reference images, incurr…
arXiv cs.CV TIER_1 English(EN) · Mingyi Xu, Jinpeng Lin, Min Zhou, Tiezheng Ge, Ming Zeng · 2026-05-26 04:00

Rethinking Scribble-Guided Image Editing: Generalization, Instruction Following, and Multitasking

arXiv:2605.25568v1 Announce Type: new Abstract: Scribble-guided image editing allows users to combine simple scribble annotations with text prompts to specify both where and how an image should be edited, enabling flexible interaction with precise spatial control. However, existi…
arXiv cs.CV TIER_1 English(EN) · Yumeng He, Xiaoying Wang, Peihao Li, Yanjia Huang, Joe Masterjohn, Jiajun Wu, Leonidas Guibas, Yin Yang, Ying Jiang, Chenfanfu Jiang · 2026-05-26 04:00

Fishbone: From One 3D Asset to a Million Controllable Edits

arXiv:2605.24805v1 Announce Type: new Abstract: Large-scale controllable 3D assets are critical for computer graphics, embodied AI, robotics, and interactive content creation, yet creating diverse 3D assets remains challenging due to the high cost of manual modeling and rigging. …
arXiv cs.CV TIER_1 English(EN) · Zhizhou Chen, Shanyan Guan, Zhanxin Gao, En Ci, Yanhao Ge, Wei Li, Zhenyu Zhang, Jian Yang, Ying Tai · 2026-05-25 04:00

VINS-120K: Ultra-High-Resolution Image Editing with a Large-Scale Dataset

arXiv:2605.23518v1 Announce Type: new Abstract: Directly editing ultra-high-resolution (UHR) images is valuable but underexplored, primarily due to the lack of high-quality data and the challenge in modeling high-frequency texture details. We introduce VINS-120K, the first large-…
arXiv cs.CV TIER_1 English(EN) · Dian Zheng, Manyuan Zhang, Hongyu Li, Hongbo Liu, Kai Zou, Kaituo Feng, Hongsheng Li · 2026-05-25 04:00

Uni-Edit: Intelligent Editing Is a General Task for Unified Model Tuning

arXiv:2605.21487v2 Announce Type: replace Abstract: Currently, enhancing Unified Multimodal Models (UMMs) with image understanding, generation, and editing capabilities mainly relies on mixed multi-task training. Due to inherent task conflicts, such strategy requires complex mult…
arXiv cs.CV TIER_1 English(EN) · Ying Tai · 2026-05-22 11:33

VINS-120K: Ultra-High-Resolution Image Editing with a Large-Scale Dataset

Directly editing ultra-high-resolution (UHR) images is valuable but underexplored, primarily due to the lack of high-quality data and the challenge in modeling high-frequency texture details. We introduce VINS-120K, the first large-scale dataset for instruction-based UHR image ed…
arXiv cs.CV TIER_1 English(EN) · Hongzhu Yi, Yujia Yang, Yuanxiang Wang, Tong Li, Zhenyu Guan, Tianyu Zong, Jiahuan Chen, Chenxi Bao, Tiankun Yang, Haopeng Jin, Yixuan Yuan, Xinming Wang, Tao Yu, Ruilin Gao, Ruiwen Tao, Haijin Liang, Jin Ma, Jinwen Luo, Yeshani, Xinyu Zuo, Jungang Xu · 2026-05-22 04:00

VDE Bench: Evaluating the Ability of Image Editing Models to Modify Visual Documents

arXiv:2602.00122v2 Announce Type: replace Abstract: In recent years, image editing models have made significant progress, enabling users to manipulate visual content in a flexible and interactive manner through natural language instructions. However, an important yet underexplore…
arXiv cs.CV TIER_1 English(EN) · Huanyu Zhang, Xuehai Bai, Chengzu Li, Chen Liang, Haochen Tian, Haodong Li, Ruichuan An, Yifan Zhang, Anna Korhonen, Zhang Zhang, Liang Wang, Tieniu Tan · 2026-05-22 04:00

How Well Can Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing

arXiv:2602.01851v2 Announce Type: replace Abstract: Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructio…
雷峰网 (Leiphone) TIER_1 Chinese(ZH) · 2026-05-29 07:13

CVPR 2026 Image Editing Trends: From Referencing a Single Image to Fusing the Entire Visual World

Together AI blog TIER_1 English(EN) · 2025-05-29 00:00

FLUX.1 Kontext Models: Character Consistency and Precise Image Editing Without Fine-Tuning
