Google DeepMind releases Gemma 4 12B multimodal model for laptops

By PulseAugur Editorial · [20 sources] · 2026-06-08 08:08

Google DeepMind has released Gemma 4 12B, a new multimodal model designed for local execution on laptops with 16GB of VRAM. This model features a novel unified architecture that integrates audio and vision inputs directly into the LLM backbone without separate encoders, reducing latency and memory usage. Gemma 4 12B aims to bring advanced agentic multimodal capabilities to everyday hardware, with performance nearing its larger 26B MoE counterpart and broad developer support through open licensing and integration with popular tools.

IMPACT This release brings advanced multimodal capabilities to consumer hardware, potentially accelerating local AI agent development and use.

RANK_REASON Frontier-lab model release with system card

AI-generated summary · Google Gemini · from 20 sources.

COVERAGE [20]

Google DeepMind TIER_1 English(EN) · 2026-06-09 14:10

Introducing Gemma 4 12B: a unified, encoder-free multimodal model
arXiv cs.AI TIER_1 English(EN) · Huyen Vo, Isabel Valera · 2026-06-12 04:00

Hellinger Multimodal Variational Autoencoders

arXiv:2601.06572v4 Announce Type: replace-cross Abstract: Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods aggregate unimodal inference distributions using either a product of expert…
arXiv cs.AI TIER_1 English(EN) · Guozhen Zhang, Xuerui Qiu, Yutao Cui, Tianhui Song, Changlin Li, Junzhe Li, Tao Huang, Xiao Zhang, Yang Li, Jianbing Wu, Miles Yang, Zhao Zhong, Liefeng Bo, Limin Wang · 2026-06-12 04:00

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

arXiv:2606.13289v1 Announce Type: cross Abstract: Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video to…
arXiv cs.AI TIER_1 English(EN) · Atindra Jha, Naomi Sagan, Keisuke Kamahori, Irmak Sivgin, Rohan Sanda, Steven Gao, Mark Horowitz, Luke Zettlemoyer, Olivia Hsu, Jure Leskovec, Baris Kasikci, Stephanie Wang · 2026-06-12 04:00

M*: A Modular, Extensible, Serving System for Multimodal Models

arXiv:2606.12688v1 Announce Type: cross Abstract: We are entering a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. Such …
arXiv cs.AI TIER_1 English(EN) · Limin Wang · 2026-06-11 12:46

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT…
arXiv cs.AI TIER_1 English(EN) · Hui Wang, Tianyu Ren, Joseph Butler, Christopher Baker, Karen Rafferty, Simon McDade · 2026-06-11 04:00

Latent World Recovery for Multimodal Learning with Missing Modalities

arXiv:2606.12362v1 Announce Type: cross Abstract: We study multimodal learning under missing modalities, with particular motivation from bioscience applications in which heterogeneous modalities are often only partially available when decisions need to be made. We propose Latent …
arXiv cs.LG TIER_1 English(EN) · Jiaqi Luo · 2026-06-11 04:00

Parameter-Efficient Adapter Tuning for Tabular-Image Multimodal Learning

arXiv:2606.11682v1 Announce Type: cross Abstract: Tabular-image multimodal learning aims to improve predictive modeling by jointly using structured tabular attributes and visual data. Although pretrained encoders provide strong modality-specific representations, full fine-tuning …
arXiv cs.AI TIER_1 English(EN) · Zequn Yang, Yake Wei, Haotian Ni, Zhihao Xu, Di Hu · 2026-06-11 04:00

Information-Theoretic Decomposition for Multimodal Interaction Learning

arXiv:2606.11614v1 Announce Type: cross Abstract: Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A critical yet underexplored challenge is that these implicit interac…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-11 00:00

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

HYDRA-X presents a unified multimodal model that integrates image and video tokenization within a single Vision Transformer, addressing spatiotemporal reconstruction and semantic awareness through causal temporal attention and hierarchical compression.
arXiv cs.AI TIER_1 English(EN) · Simon McDade · 2026-06-10 17:31

Latent World Recovery for Multimodal Learning with Missing Modalities

We study multimodal learning under missing modalities, with particular motivation from bioscience applications in which heterogeneous modalities are often only partially available when decisions need to be made. We propose Latent World Recovery (LWR), a framework built on two key…
arXiv cs.LG TIER_1 English(EN) · Konstantinos Kontras, Teodora Gagaleska, Thomas Strypsteen, Christos Chatzichristos, Matthew Blaschko, Maarten De Vos, Paul Pu Liang · 2026-06-10 04:00

SynIB: Informational Bottleneck for Maximizing Synergy in Multimodal Learning

arXiv:2606.09853v1 Announce Type: new Abstract: A central objective in multimodal learning is to capture synergy: task-relevant information that arises only from the joint use of multiple modalities, and is not available from any single modality alone. While most approaches opera…
arXiv cs.AI TIER_1 English(EN) · Lingyi Meng, Zecong Tang, Haoran Li, Tengju Ru, Zhejun Cui, Weitong Lian, Qi Kang, Hangshuo Cao, Yichen Zhu, Yechi Liu, Kaixuan Wang, Yu-Jie Yuan, Chunwei Wang, Yu Zhang, Bo Dai · 2026-06-09 04:00

IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

arXiv:2606.09169v1 Announce Type: new Abstract: In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-09 00:00

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

ARM demonstrates a unified autoregressive framework for image understanding, generation, and editing through discrete semantic tokenization and reinforcement learning optimization.
arXiv cs.CV TIER_1 English(EN) · Peixi Wu, Biao Yang, Feipeng Ma, Bosong Chai, Bo Lin, Wei Yuan, Fan Yang, Tingting Gao, Hebei Li, Xiaoyan Sun · 2026-06-12 04:00

LaME: Learning to Think in Latent Space for Multimodal Embedding via Information Bottleneck

arXiv:2606.13061v1 Announce Type: new Abstract: Reasoning-driven universal multimodal embedding has advanced rapidly by introducing Chain-of-Thought (CoT) reasoning into the embedding pipeline. Despite the strong performance across both general and complex tasks, this paradigm su…
arXiv cs.CV TIER_1 English(EN) · Garvita Allabadi, Matteo Sodano, Roberto Estevão, Yuxiong Wang, Vikram Adve, Emre Kiciman, Ranveer Chandra · 2026-06-12 04:00

GRIP: Feedback-Guided Prompt Retrieval for Large Multimodal Models

arXiv:2606.12744v1 Announce Type: new Abstract: In-Context Learning (ICL) has become a powerful mechanism for adapting Large Language Models (LLMs) to new tasks without fine-tuning. Extending this concept to Large Multimodal Models (LMMs), Multimodal In-Context Learning (M-ICL) r…
arXiv cs.CV TIER_1 English(EN) · Xiaoyan Sun · 2026-06-11 08:47

LaME: Learning to Think in Latent Space for Multimodal Embedding via Information Bottleneck

Reasoning-driven universal multimodal embedding has advanced rapidly by introducing Chain-of-Thought (CoT) reasoning into the embedding pipeline. Despite the strong performance across both general and complex tasks, this paradigm suffers from two core limitations: (i) autoregress…
arXiv cs.CV TIER_1 English(EN) · Omkar Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan · 2026-06-11 04:00

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

arXiv:2511.16672v4 Announce Type: replace Abstract: Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting…
arXiv cs.CV TIER_1 English(EN) · Junke Wang, Xiao Wang, Jiacheng Pan, Xuefeng Hu, Feng Li, Jingxiang Sun, Chaorui Deng, Zilong Chen, Yunpeng Chen, Kaibin Tian, Matthew Gwilliam, Hao Chen, Danhui Guan, Kun Xu, Weilin Huang, Zuxuan Wu, Haoqi Fan, Yu-Gang Jiang, Zhenheng Yang · 2026-06-10 04:00

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

arXiv:2606.11188v1 Announce Type: new Abstract: This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a dis…
arXiv cs.CV TIER_1 English(EN) · Zhenheng Yang · 2026-06-09 17:59

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a discrete semantic visual tokenizer that maps images…
arXiv cs.CV TIER_1 English(EN) · Bo Dai · 2026-06-08 08:08

IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real-world applications. However, existing benchmark…
