Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt?
Users report problems with Google's Gemma 4 12B unified model, which is designed to process audio, vision, and text together. The model responds well to audio when paired with a short text prompt, but it appears to stop attending to speech when given a large, dense system prompt. The behavior has been observed across multiple serving frameworks, which suggests the cause may lie in the model's architecture or attention mechanisms when inputs compete for attention, rather than in any single serving stack.
IMPACT: Highlights potential limitations of unified multimodal models when processing long contexts, which could affect voice assistant development.
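One way to check whether the reported behavior reproduces is to hold the audio clip fixed, grow the system prompt, and measure how often words known to be spoken in the clip show up in the reply. The sketch below is framework-agnostic and assumes a hypothetical `generate(system_prompt, audio)` callable that you would wire to your own serving stack; `audio_attention_sweep` and `fake_generate` are illustrative names, not part of any Gemma or serving-framework API, and keyword matching is only a rough proxy for "attending to speech".

```python
from typing import Callable, Dict, List

# Hypothetical signature: (system_prompt, audio_bytes) -> model reply text.
# Replace with a thin wrapper around whatever serving framework you use.
GenerateFn = Callable[[str, bytes], str]


def audio_attention_sweep(
    generate: GenerateFn,
    audio: bytes,
    spoken_keywords: List[str],
    base_prompt: str,
    filler: str,
    repeats: List[int],
) -> Dict[int, float]:
    """Map system-prompt length (in characters) to the fraction of
    spoken keywords that appear in the model's reply."""
    results: Dict[int, float] = {}
    for n in repeats:
        # Pad the system prompt with n copies of the filler text.
        system_prompt = base_prompt + ("\n" + filler) * n
        reply = generate(system_prompt, audio).lower()
        hits = sum(1 for kw in spoken_keywords if kw.lower() in reply)
        results[len(system_prompt)] = hits / len(spoken_keywords)
    return results


def fake_generate(system_prompt: str, audio: bytes) -> str:
    # Stand-in for a real model call (an assumption for demonstration only):
    # it imitates the reported failure by ignoring the audio once the
    # system prompt exceeds 200 characters.
    if len(system_prompt) < 200:
        return "You asked about the weather in Lisbon."
    return "How can I help you today?"


if __name__ == "__main__":
    scores = audio_attention_sweep(
        fake_generate,
        audio=b"",
        spoken_keywords=["weather", "Lisbon"],
        base_prompt="You are a voice assistant.",
        filler="Rule: be concise and polite.",
        repeats=[0, 50],
    )
    for length, score in scores.items():
        print(f"system prompt chars={length}: keyword recall={score:.2f}")
```

Running the same sweep against more than one serving framework, with the same clip and prompts, would help separate a model-level effect from a framework-specific one.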