New benchmarks and models advance AI image editing capabilities

By PulseAugur Editorial · [12 sources] · 2025-05-29 00:00

Researchers have introduced new benchmarks and datasets for evaluating image editing models, targeting gaps in current systems. VINS-120K offers a large-scale dataset for instruction-based ultra-high-resolution (UHR) image editing, while VDE Bench focuses on modifying visual documents with dense text in multiple languages. VIBE, another benchmark, assesses models' ability to follow visual instructions, revealing that proprietary models currently outperform open-source alternatives but still struggle with complex tasks. Additionally, Together AI has announced availability of the FLUX.1 Kontext models, developed by Black Forest Labs, which enable in-context image generation and editing using both text and image prompts without requiring fine-tuning.

IMPACT New benchmarks and models are pushing the boundaries of AI image editing, enabling more precise control and higher resolutions.

RANK_REASON The cluster contains multiple research papers introducing new benchmarks and datasets for image editing, alongside a product launch of new models.

AI-generated summary · Google Gemini · from 12 sources.

COVERAGE [12]

arXiv cs.AI TIER_1 English(EN) · Xiangyuan Wang, Honghao Cai, Yunhao Bai, Chao Hui, Tianze Zhou, Haohua Chen, Hao Shi, Yuling Wu, Yao Hu, Xu Tang, Yibo Chen, Wei Zhu · 2026-05-26 04:00

EditCaption: Human-Refined SFT and HAE-DPO for Image Editing Instruction Synthesis

arXiv:2604.08213v2 Announce Type: replace-cross Abstract: High-quality source-target image pairs with precise editing instructions are essential for instruction-guided image editing, yet constructing such training triplets at scale remains costly. Recent pipelines often rely on v…
arXiv cs.AI TIER_1 English(EN) · Ziyang Liu · 2026-05-26 04:00

Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing

arXiv:2604.18170v2 Announce Type: replace-cross Abstract: LLMs edit text and code by autoregressively regenerating the full output, even when most tokens appear verbatim in the input. We study Copy-as-Decode, a decoding-layer mechanism that recasts edit generation as structured d…
arXiv cs.CV TIER_1 English(EN) · Yuke Li, Lianli Gao, Ji Zhang, Pengpeng Zeng, Lichuan Xiang, Hongkai Wen, Heng Tao Shen, Jingkuan Song · 2026-05-26 04:00

Reversible Inversion for Training-Free Exemplar-guided Image Editing

arXiv:2512.01382v4 Announce Type: replace Abstract: Exemplar-guided Image Editing (EIE) aims to modify a source image according to a visual reference. Existing approaches often require large-scale pre-training to learn relationships between the source and reference images, incurr…
arXiv cs.CV TIER_1 English(EN) · Yumeng He, Xiaoying Wang, Peihao Li, Yanjia Huang, Joe Masterjohn, Jiajun Wu, Leonidas Guibas, Yin Yang, Ying Jiang, Chenfanfu Jiang · 2026-05-26 04:00

Fishbone: From One 3D Asset to a Million Controllable Edits

arXiv:2605.24805v1 Announce Type: new Abstract: Large-scale controllable 3D assets are critical for computer graphics, embodied AI, robotics, and interactive content creation, yet creating diverse 3D assets remains challenging due to the high cost of manual modeling and rigging. …
arXiv cs.CV TIER_1 English(EN) · Yuanye Liu, Siyuan Zhou, Ke Zhang, Lei Li, Wei Chen, Xiahai Zhuang · 2026-05-26 04:00

X-Edit: Exact, Explicit, and Explainable Null-Space Editing for Medical Vision Transformers

arXiv:2605.24932v1 Announce Type: new Abstract: Pre-trained Vision Transformers (ViTs) are increasingly deployed for medical image classification. However, correcting their inevitable failure cases in dynamic clinical scenarios poses a critical challenge. Conventional fine-tuning…
arXiv cs.CV TIER_1 English(EN) · Mingyi Xu, Jinpeng Lin, Min Zhou, Tiezheng Ge, Ming Zeng · 2026-05-26 04:00

Rethinking Scribble-Guided Image Editing: Generalization, Instruction Adherence, and Multi-Tasking

arXiv:2605.25568v1 Announce Type: new Abstract: Scribble-guided image editing allows users to combine simple scribble annotations with text prompts to specify both where and how an image should be edited, enabling flexible interaction with precise spatial control. However, existi…
arXiv cs.CV TIER_1 English(EN) · Zhizhou Chen, Shanyan Guan, Zhanxin Gao, En Ci, Yanhao Ge, Wei Li, Zhenyu Zhang, Jian Yang, Ying Tai · 2026-05-25 04:00

VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset

arXiv:2605.23518v1 Announce Type: new Abstract: Directly editing ultra-high-resolution (UHR) images is valuable but underexplored, primarily due to the lack of high-quality data and the challenge in modeling high-frequency texture details. We introduce VINS-120K, the first large-…
arXiv cs.CV TIER_1 English(EN) · Dian Zheng, Manyuan Zhang, Hongyu Li, Hongbo Liu, Kai Zou, Kaituo Feng, Hongsheng Li · 2026-05-25 04:00

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

arXiv:2605.21487v2 Announce Type: replace Abstract: Currently, enhancing Unified Multimodal Models (UMMs) with image understanding, generation, and editing capabilities mainly relies on mixed multi-task training. Due to inherent task conflicts, such strategy requires complex mult…
arXiv cs.CV TIER_1 English(EN) · Ying Tai · 2026-05-22 11:33

VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset

Directly editing ultra-high-resolution (UHR) images is valuable but underexplored, primarily due to the lack of high-quality data and the challenge in modeling high-frequency texture details. We introduce VINS-120K, the first large-scale dataset for instruction-based UHR image ed…
arXiv cs.CV TIER_1 English(EN) · Huanyu Zhang, Xuehai Bai, Chengzu Li, Chen Liang, Haochen Tian, Haodong Li, Ruichuan An, Yifan Zhang, Anna Korhonen, Zhang Zhang, Liang Wang, Tieniu Tan · 2026-05-22 04:00

How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing

arXiv:2602.01851v2 Announce Type: replace Abstract: Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructio…
arXiv cs.CV TIER_1 English(EN) · Hongzhu Yi, Yujia Yang, Yuanxiang Wang, Tong Li, Zhenyu Guan, Tianyu Zong, Jiahuan Chen, Chenxi Bao, Tiankun Yang, Haopeng Jin, Yixuan Yuan, Xinming Wang, Tao Yu, Ruilin Gao, Ruiwen Tao, Haijin Liang, Jin Ma, Jinwen Luo, Yeshani, Xinyu Zuo, Jungang Xu · 2026-05-22 04:00

VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

arXiv:2602.00122v2 Announce Type: replace Abstract: In recent years, image editing models have made significant progress, enabling users to manipulate visual content in a flexible and interactive manner through natural language instructions. However, an important yet underexplore…
Together AI blog TIER_1 English(EN) · 2025-05-29 00:00

FLUX.1 Kontext models: Character consistency and precise image editing without fine-tuning