Google DeepMind releases Gemma 4 12B multimodal model for laptops
By PulseAugur Editorial · [20 sources]
Google DeepMind has released Gemma 4 12B, a new multimodal model designed to run locally on laptops with 16GB of VRAM. The model uses a novel unified architecture that feeds audio and vision inputs directly into the LLM backbone without separate encoders, which reduces latency and memory usage. Gemma 4 12B aims to bring advanced agentic multimodal capabilities to everyday hardware, with performance approaching that of its larger 26B MoE counterpart. Developer support comes through open licensing and integration with popular tools.
AI
IMPACT
This release brings advanced multimodal capabilities to consumer hardware, potentially accelerating local AI agent development and use.
RANK_REASON
Frontier-lab model release with system card
arXiv:2601.06572v4 Announce Type: replace-cross Abstract: Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods aggregate unimodal inference distributions using either a product of expert…
arXiv cs.AI
TIER_1 · English (EN) · Guozhen Zhang, Xuerui Qiu, Yutao Cui, Tianhui Song, Changlin Li, Junzhe Li, Tao Huang, Xiao Zhang, Yang Li, Jianbing Wu, Miles Yang, Zhao Zhong, Liefeng Bo, Limin Wang
arXiv:2606.13289v1 Announce Type: cross Abstract: Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video to…
arXiv cs.AI
TIER_1 · Danish (DA) · Atindra Jha, Naomi Sagan, Keisuke Kamahori, Irmak Sivgin, Rohan Sanda, Steven Gao, Mark Horowitz, Luke Zettlemoyer, Olivia Hsu, Jure Leskovec, Baris Kasikci, Stephanie Wang
arXiv:2606.12688v1 Announce Type: cross Abstract: We are entering a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. Such …
arXiv cs.AI
TIER_1 · English (EN) · Hui Wang, Tianyu Ren, Joseph Butler, Christopher Baker, Karen Rafferty, Simon McDade
arXiv:2606.12362v1 Announce Type: cross Abstract: We study multimodal learning under missing modalities, with particular motivation from bioscience applications in which heterogeneous modalities are often only partially available when decisions need to be made. We propose Latent …
arXiv:2606.11682v1 Announce Type: cross Abstract: Tabular-image multimodal learning aims to improve predictive modeling by jointly using structured tabular attributes and visual data. Although pretrained encoders provide strong modality-specific representations, full fine-tuning …
arXiv cs.AI
TIER_1 · English (EN) · Zequn Yang, Yake Wei, Haotian Ni, Zhihao Xu, Di Hu
arXiv:2606.11614v1 Announce Type: cross Abstract: Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A critical yet underexplored challenge is that these implicit interac…
HYDRA-X presents a unified multimodal model that integrates image and video tokenization within a single Vision Transformer, addressing spatiotemporal reconstruction and semantic awareness through causal temporal attention and hierarchical compression.
We study multimodal learning under missing modalities, with particular motivation from bioscience applications in which heterogeneous modalities are often only partially available when decisions need to be made. We propose Latent World Recovery (LWR), a framework built on two key…
arXiv cs.LG
TIER_1 · English (EN) · Konstantinos Kontras, Teodora Gagaleska, Thomas Strypsteen, Christos Chatzichristos, Matthew Blaschko, Maarten De Vos, Paul Pu Liang
arXiv:2606.09853v1 Announce Type: new Abstract: A central objective in multimodal learning is to capture synergy: task-relevant information that arises only from the joint use of multiple modalities, and is not available from any single modality alone. While most approaches opera…
arXiv:2606.09169v1 Announce Type: new Abstract: In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real…
ARM demonstrates a unified autoregressive framework for image understanding, generation, and editing through discrete semantic tokenization and reinforcement learning optimization.
arXiv:2606.13061v1 Announce Type: new Abstract: Reasoning-driven universal multimodal embedding has advanced rapidly by introducing Chain-of-Thought (CoT) reasoning into the embedding pipeline. Despite the strong performance across both general and complex tasks, this paradigm su…
arXiv:2606.12744v1 Announce Type: new Abstract: In-Context Learning (ICL) has become a powerful mechanism for adapting Large Language Models (LLMs) to new tasks without fine-tuning. Extending this concept to Large Multimodal Models (LMMs), Multimodal In-Context Learning (M-ICL) r…
arXiv cs.CV
TIER_1 · English (EN) · Omkar Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan
arXiv:2511.16672v4 Announce Type: replace Abstract: Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting…
arXiv:2606.11188v1 Announce Type: new Abstract: This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a dis…