Introducing Gemma 4 12B: a unified, encoder-free multimodal model

Google DeepMind releases Gemma 4 12B, a multimodal model built for laptops

By the PulseAugur editorial team · [20 sources] · 2026-06-08 08:08

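Google DeepMind has released Gemma 4 12B, a new multimodal model designed to run locally on laptops with 16GB of VRAM. The model uses a novel unified architecture that feeds audio and vision inputs directly into the LLM backbone with no separate encoders, which reduces latency and memory use. Gemma 4 12B aims to bring advanced agentic multimodal capabilities to everyday hardware. Its performance approaches that of the larger 26B MoE variant, and an open license plus integration with popular tools give it broad developer support.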
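The 16GB VRAM target can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: it takes "12B" as exactly 12e9 parameters, treats 16GB as 16 GiB, and counts weights alone (no KV cache, activations, or runtime overhead). The summary does not say which numeric precision the model ships in, so the bit widths listed are assumptions, and `weight_memory_gib` is a hypothetical helper name.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


PARAMS = 12e9   # assumption: "12B" taken as exactly 12e9 parameters
VRAM_GIB = 16   # assumption: the 16GB budget treated as 16 GiB

# Assumed precisions; the source does not state which one is used.
for bits in (16, 8, 4):
    need = weight_memory_gib(PARAMS, bits)
    verdict = "fits" if need < VRAM_GIB else "does not fit"
    print(f"{bits}-bit weights: {need:.1f} GiB ({verdict} in {VRAM_GIB} GiB)")
```

Under these assumptions, 16-bit weights alone would exceed the budget, while 8-bit or 4-bit weights would leave headroom for the rest of the runtime state.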

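Impact: This release brings advanced multimodal capabilities to consumer hardware and could accelerate the development and use of local AI agents.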

Ranking rationale: Frontier-lab model release, accompanied by a system card.

Read at Google DeepMind →

AI-generated summary · Google Gemini · from 20 sources.

Sources [20]

Google DeepMind TIER_1 English(EN) · 2026-06-09 14:10

Introducing Gemma 4 12B: a unified, encoder-free multimodal model
arXiv cs.AI TIER_1 Italiano(IT) · Huyen Vo, Isabel Valera · 2026-06-12 04:00

Hellinger Multimodal Variational Autoencoders

arXiv:2601.06572v4 Announce Type: replace-cross Abstract: Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods aggregate unimodal inference distributions using either a product of expert…
arXiv cs.AI TIER_1 English(EN) · Guozhen Zhang, Xuerui Qiu, Yutao Cui, Tianhui Song, Changlin Li, Junzhe Li, Tao Huang, Xiao Zhang, Yang Li, Jianbing Wu, Miles Yang, Zhao Zhong, Liefeng Bo, Limin Wang · 2026-06-12 04:00

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

arXiv:2606.13289v1 Announce Type: cross Abstract: Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video to…
arXiv cs.AI TIER_1 Dansk(DA) · Atindra Jha, Naomi Sagan, Keisuke Kamahori, Irmak Sivgin, Rohan Sanda, Steven Gao, Mark Horowitz, Luke Zettlemoyer, Olivia Hsu, Jure Leskovec, Baris Kasikci, Stephanie Wang · 2026-06-12 04:00

M*: A Modular, Extensible, Serving System for Multimodal Models

arXiv:2606.12688v1 Announce Type: cross Abstract: We are entering a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. Such …
arXiv cs.AI TIER_1 English(EN) · Limin Wang · 2026-06-11 12:46

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT…
arXiv cs.AI TIER_1 English(EN) · Hui Wang, Tianyu Ren, Joseph Butler, Christopher Baker, Karen Rafferty, Simon McDade · 2026-06-11 04:00

Latent World Recovery for Multimodal Learning with Missing Modalities

arXiv:2606.12362v1 Announce Type: cross Abstract: We study multimodal learning under missing modalities, with particular motivation from bioscience applications in which heterogeneous modalities are often only partially available when decisions need to be made. We propose Latent …
arXiv cs.LG TIER_1 English(EN) · Jiaqi Luo · 2026-06-11 04:00

Parameter-Efficient Adapter Tuning for Tabular-Image Multimodal Learning

arXiv:2606.11682v1 Announce Type: cross Abstract: Tabular-image multimodal learning aims to improve predictive modeling by jointly using structured tabular attributes and visual data. Although pretrained encoders provide strong modality-specific representations, full fine-tuning …
arXiv cs.AI TIER_1 English(EN) · Zequn Yang, Yake Wei, Haotian Ni, Zhihao Xu, Di Hu · 2026-06-11 04:00

Information-Theoretic Decomposition for Multimodal Interaction Learning

arXiv:2606.11614v1 Announce Type: cross Abstract: Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A critical yet underexplored challenge is that these implicit interac…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-11 00:00

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

HYDRA-X presents a unified multimodal model that integrates image and video tokenization within a single Vision Transformer, addressing spatiotemporal reconstruction and semantic awareness through causal temporal attention and hierarchical compression.
arXiv cs.AI TIER_1 English(EN) · Simon McDade · 2026-06-10 17:31

Latent World Recovery for Multimodal Learning with Missing Modalities

We study multimodal learning under missing modalities, with particular motivation from bioscience applications in which heterogeneous modalities are often only partially available when decisions need to be made. We propose Latent World Recovery (LWR), a framework built on two key…
arXiv cs.LG TIER_1 English(EN) · Konstantinos Kontras, Teodora Gagaleska, Thomas Strypsteen, Christos Chatzichristos, Matthew Blaschko, Maarten De Vos, Paul Pu Liang · 2026-06-10 04:00

SynIB: Informational Bottleneck for Maximizing Synergy in Multimodal Learning

arXiv:2606.09853v1 Announce Type: new Abstract: A central objective in multimodal learning is to capture synergy: task-relevant information that arises only from the joint use of multiple modalities, and is not available from any single modality alone. While most approaches opera…
arXiv cs.AI TIER_1 English(EN) · Lingyi Meng, Zecong Tang, Haoran Li, Tengju Ru, Zhejun Cui, Weitong Lian, Qi Kang, Hangshuo Cao, Yichen Zhu, Yechi Liu, Kaixuan Wang, Yu-Jie Yuan, Chunwei Wang, Yu Zhang, Bo Dai · 2026-06-09 04:00

IMUG-Bench: A Unified Multimodal Model Benchmark for Interleaved Understanding and Generation

arXiv:2606.09169v1 Announce Type: new Abstract: In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-09 00:00

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

ARM demonstrates a unified autoregressive framework for image understanding, generation, and editing through discrete semantic tokenization and reinforcement learning optimization.
arXiv cs.CV TIER_1 English(EN) · Peixi Wu, Biao Yang, Feipeng Ma, Bosong Chai, Bo Lin, Wei Yuan, Fan Yang, Tingting Gao, Hebei Li, Xiaoyan Sun · 2026-06-12 04:00

LaME: Learning to Think in Latent Space for Multimodal Embedding via Information Bottleneck

arXiv:2606.13061v1 Announce Type: new Abstract: Reasoning-driven universal multimodal embedding has advanced rapidly by introducing Chain-of-Thought (CoT) reasoning into the embedding pipeline. Despite the strong performance across both general and complex tasks, this paradigm su…
arXiv cs.CV TIER_1 English(EN) · Garvita Allabadi, Matteo Sodano, Roberto Estevão, Yuxiong Wang, Vikram Adve, Emre Kiciman, Ranveer Chandra · 2026-06-12 04:00

GRIP: Feedback-Guided Prompt Retrieval for Large Multimodal Models

arXiv:2606.12744v1 Announce Type: new Abstract: In-Context Learning (ICL) has become a powerful mechanism for adapting Large Language Models (LLMs) to new tasks without fine-tuning. Extending this concept to Large Multimodal Models (LMMs), Multimodal In-Context Learning (M-ICL) r…
arXiv cs.CV TIER_1 English(EN) · Xiaoyan Sun · 2026-06-11 08:47

LaME: Learning to Think in Latent Space for Multimodal Embedding via Information Bottleneck

Reasoning-driven universal multimodal embedding has advanced rapidly by introducing Chain-of-Thought (CoT) reasoning into the embedding pipeline. Despite the strong performance across both general and complex tasks, this paradigm suffers from two core limitations: (i) autoregress…
arXiv cs.CV TIER_1 English(EN) · Omkar Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan · 2026-06-11 04:00

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

arXiv:2511.16672v4 Announce Type: replace Abstract: Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting…
arXiv cs.CV TIER_1 English(EN) · Junke Wang, Xiao Wang, Jiacheng Pan, Xuefeng Hu, Feng Li, Jingxiang Sun, Chaorui Deng, Zilong Chen, Yunpeng Chen, Kaibin Tian, Matthew Gwilliam, Hao Chen, Danhui Guan, Kun Xu, Weilin Huang, Zuxuan Wu, Haoqi Fan, Yu-Gang Jiang, Zhenheng Yang · 2026-06-10 04:00

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

arXiv:2606.11188v1 Announce Type: new Abstract: This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a dis…
arXiv cs.CV TIER_1 English(EN) · Zhenheng Yang · 2026-06-09 17:59

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a discrete semantic visual tokenizer that maps images…
arXiv cs.CV TIER_1 English(EN) · Bo Dai · 2026-06-08 08:08

IMUG-Bench: A Unified Multimodal Model Benchmark for Interleaved Understanding and Generation

In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real-world applications. However, existing benchmark…
