Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Tuna-2 drops the vision encoder in favor of direct pixel embeddings and reaches state-of-the-art results

By the PulseAugur editorial team · [3 sources] · 2026-04-27 17:59

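Researchers have developed Tuna-2, a new unified multimodal model that bypasses conventional vision encoders for both visual understanding and generation. By operating directly on pixel embeddings, Tuna-2 simplifies the architecture and enables end-to-end optimization from raw pixels. Experiments show that this pixel-space approach achieves state-of-the-art results on multimodal benchmarks, outperforms latent-space methods at generating high-quality images, and delivers strong multimodal understanding, especially on tasks that require detailed visual perception.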
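To make "pixel embeddings" concrete, here is a minimal sketch of one common way to turn raw pixels into tokens without a pretrained vision encoder: cut the image into non-overlapping patches and apply a single linear projection to each. This is an illustrative assumption, not Tuna-2's actual embedding layer, which the summarized sources do not describe; the names `patchify` and `embed_patches`, the patch size, and the embedding width are all hypothetical.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])  # append the C channel values of one pixel
            patches.append(flat)
    return patches


def embed_patches(patches, weights, bias):
    """Apply one linear projection (weights[out][in], bias[out]) to each flattened patch."""
    return [
        [sum(p * w for p, w in zip(patch, row)) + b for row, b in zip(weights, bias)]
        for patch in patches
    ]


# Hypothetical sizes: an 8x8 RGB image, 4x4 patches, 16-dimensional tokens.
rng = random.Random(0)
image = [[[rng.random() for _ in range(3)] for _ in range(8)] for _ in range(8)]
patches = patchify(image, patch=4)  # 4 patches, each 4 * 4 * 3 = 48 values
weights = [[rng.gauss(0, 0.02) for _ in range(48)] for _ in range(16)]
bias = [0.0] * 16
tokens = embed_patches(patches, weights, bias)
print(len(tokens), len(tokens[0]))  # 4 tokens of width 16
```

In a sketch like this the projection weights would be trained jointly with the rest of the model, which is what makes optimization end-to-end from raw pixels possible in principle.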

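Impact: Removes the need for a pretrained vision encoder in multimodal models, which may simplify architectures and improve performance.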

Ranking rationale: This is a research paper describing a new model and its benchmark performance.

Read on arXiv cs.CV →

AI-generated summary · Google Gemini · from 3 sources. How we write summaries →

Sources [3]

Hugging Face Daily Papers TIER_1 English(EN) · 2026-04-27 17:59

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native…
arXiv cs.CV TIER_1 English(EN) · Zhiheng Liu, Weiming Ren, Xiaoke Huang, Shoufa Chen, Tianhong Li, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, Tao Xiang, Wenhu Chen, Ping Luo, Luke Zettlemoyer, Yuren Cong · 2026-04-28 04:00

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

arXiv:2604.24763v1. Abstract: Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimizatio…
arXiv cs.CV TIER_1 English(EN) · Yuren Cong · 2026-04-27 17:59

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native…
