New MoA FFN Design Enhances LLM Expressivity and Scaling

By PulseAugur Editorial · [2 sources] · 2026-05-26 07:30

Researchers have introduced a feedforward network (FFN) design for large language models (LLMs) called Mixture of Activations (MoA). MoA uses token-adaptive activation mixing: lightweight, input-dependent gates decide how the available activation functions are combined for each token, so different tokens can pass through different nonlinearities. In theory, this offers greater expressivity than both fixed-activation FFNs and learnable activations (LA). In empirical evaluations on models ranging from 0.12B to 2B parameters, MoA consistently reaches lower terminal loss and shows better scaling behavior, with minimal overhead.
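The gating idea can be illustrated with a minimal sketch. It is not the paper's implementation: the softmax gate, the choice of ReLU, GELU and SiLU as candidates, and every name below (moa_ffn, W_gate, ACTIVATIONS) are assumptions made for illustration, since the summary does not specify the exact formulation.

```python
import math
import random

# Candidate activations (assumed set; the source does not list which ones MoA mixes).
def relu(z):
    return max(0.0, z)

def gelu(z):
    # Common tanh approximation of GELU.
    return 0.5 * z * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)))

def silu(z):
    return z / (1.0 + math.exp(-z))

ACTIVATIONS = [relu, gelu, silu]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def moa_ffn(x, W_in, W_out, W_gate):
    """Hypothetical MoA-style FFN for one token vector x.

    The gate is a small linear map of the token followed by a softmax
    (an assumption), giving one mixing weight per candidate activation.
    """
    gate = softmax(matvec(W_gate, x))   # token-dependent mixing weights
    hidden = matvec(W_in, x)            # up-projection
    mixed = [sum(g * act(z) for g, act in zip(gate, ACTIVATIONS)) for z in hidden]
    return matvec(W_out, mixed), gate   # down-projection, plus the gate for inspection

rng = random.Random(0)
d_model, d_hidden = 4, 8
W_in = rand_matrix(d_hidden, d_model, rng)
W_out = rand_matrix(d_model, d_hidden, rng)
W_gate = rand_matrix(len(ACTIVATIONS), d_model, rng)

token_a = [1.0, -0.5, 0.3, 2.0]
token_b = [-1.2, 0.8, 0.0, -0.4]
out_a, gate_a = moa_ffn(token_a, W_in, W_out, W_gate)
out_b, gate_b = moa_ffn(token_b, W_in, W_out, W_gate)
```

In this sketch the gate adds only len(ACTIVATIONS) × d_model parameters per layer, which is one plausible reading of "lightweight". Because gate_a and gate_b are computed from the tokens themselves, the two tokens are passed through different effective nonlinearities while sharing the same projection weights.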

IMPACT This new FFN design could lead to more efficient and powerful LLMs by improving their nonlinear expressivity and scaling behavior.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv stat.ML TIER_1 English(EN) · Mingze Wang, Jinbo Wang, Yikuan Xia, Kai Shen, Shu Zhong · 2026-05-27 04:00

More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations

arXiv:2605.26647v1 Announce Type: cross Abstract: Feedforward network (FFN) layers account for a large fraction of parameters and nonlinear expressivity in Transformer-based large language models (LLMs). Despite the evolution from ReLU and GELU to gated variants such as SwiGLU, m…

arXiv stat.ML TIER_1 English(EN) · Shu Zhong · 2026-05-26 07:30

More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations

Feedforward network (FFN) layers account for a large fraction of parameters and nonlinear expressivity in Transformer-based large language models (LLMs). Despite the evolution from ReLU and GELU to gated variants such as SwiGLU, most FFN designs still use a single fixed activatio…
