transformers
PulseAugur coverage of transformers — every cluster mentioning transformers across labs, papers, and developer communities, ranked by signal.
- used by KV cache 90%
- used by vLLM 70%
- used by llama.cpp 70%
- used by Ollama 70%
- competes with State space models: Univariate representation of a multivariate model, partial interpolation and periodic convergence 70%
- used by CNNs 70%
- used by AdamW 70%
- competes with State Space Models 70%
- instance of grokking 70%
- used by llama-cpp-python 70%
- used by functional magnetic resonance imaging 70%
- used by SGD 70%
- 2026-05-13 (research milestone): A paper was published analyzing the impact of data representation and tokenization on Transformer context effectiveness.
- LLMs' arithmetic skills boosted by pedagogy and geometric analysis
Researchers are exploring how to improve the arithmetic capabilities of large language models (LLMs) through novel training methods and geometric analysis. One approach uses Indonesian mathematics pedagogy to train a small GP…
- AI training framed as Hamilton-Jacobi PDE problem
Researchers have formulated neural network training as a Hamilton-Jacobi initial-value problem. This framework connects gradient steps to solving viscous Hamilton-Jacobi equations, revealing shared mathematical structur…
- JetBrains releases Mellum2 reasoning model with 131K context
JetBrains has released its Mellum2 model family, including the Mellum2-12B-A2.5B-Thinking variant, which is designed for complex reasoning tasks. This model utilizes a Mixture-of-Experts architecture with a large contex…
- New Interdomain Attention Merges Transformers and SSMs
Researchers have introduced Interdomain Attention, a novel mechanism that merges the strengths of Transformers and deep state space models (SSMs). This new approach integrates an SSM into an attention module using kerne…
- Transformers learn Spanish morphome differently than humans
Researchers investigated whether transformers can learn the Spanish L-shaped morphome, an irregular morphological pattern, by training models on varying frequencies of irregular verbs. The study found that while transfo…
- Krause Attention improves Transformers with localized interactions
Researchers have introduced Krause Attention, a novel mechanism designed to improve Transformer models by addressing issues like representation collapse and attention sinks. This new approach replaces global aggregation…
- Researchers Uncover How Transformers Achieve Analogical Reasoning
Two new research papers explore the mechanisms behind analogical reasoning in Transformer models. The first paper formalizes analogy as inferring correspondences between categories, identifying geometric alignment and f…
- New method identifies attention-head circuits in transformers
Researchers have developed a novel three-step method called Spectral Probe-Circuits to identify specific computational circuits within pretrained transformer models. This technique uses a spectral signal to rank attenti…
- Deep Learning Models Compared for Skin Cancer Detection
Researchers have conducted a comprehensive evaluation of twelve deep learning models for detecting skin cancer using a unified approach on the PAD-UFES-20 dataset. The study compared convolutional neural networks (CNNs)…
- NextLat Transformers Learn Compact World Models for Better Generalization
Researchers have developed a new training method called Next-Latent Prediction (NextLat) for transformers, which encourages them to build more compact internal world models. This approach adds a self-supervised objectiv…
- Certification Hard for Transformers and Circuits
A new research paper explores the difficulty of certifying the exact behavior of neural networks, particularly Transformers and circuits, even with minimal overparametrization. The study demonstrates that adding even a …
- Tensor Cache enhances Transformer long-context memory
Researchers have developed a novel memory system called Tensor Cache for Transformers, designed to enhance their ability to handle long contexts. This system combines a sliding-window cache with a second-level fast-weig…
- Stepfun AI releases 198B parameter multimodal MoE model
Stepfun AI has released Step 3.7 Flash, a 198-billion parameter sparse Mixture-of-Experts (MoE) vision-language model. This model is optimized for agentic workflows, coding, and multimodal tasks, activating approximatel…
- Google DeepMind releases multimodal Gemma 4 12B models
Google DeepMind has released several variants of its Gemma 4 models, including the 12B parameter versions. These models are multimodal, capable of processing text, image, audio, and video inputs, with a focus on efficie…
- Fixing local LLM OOM errors by optimizing KV cache and quantization
Running large open-source language models locally can lead to out-of-memory errors, even if the model's weights seem to fit within the available VRAM. This is primarily due to the significant memory required for the KV …
- New CODA paper reframes Transformers as math problems
A new research paper introduces CODA, a novel approach to Transformers that reframes them as mathematical problems. This method aims to potentially revolutionize the architecture of neural networks. The paper is availab…
- Transformers struggle with state-based decisions in search, new paper finds
Researchers have identified a critical limitation in how transformer models process serialized trajectory data during backtracking search. These models can struggle with 'scattered retrieval,' where state features are d…
- LLM Pretraining Creates Generalizable Manifold for Time Series Forecasting
A new research paper explores how large language models (LLMs) pretrained on text can be effectively used for time-series forecasting. The study demonstrates that language pretraining equips transformers with a reusable…
- LLMs and new frameworks boost GPU kernel optimization
Researchers are exploring novel ways to optimize GPU kernel performance for large language models. One approach uses language models as surrogates to predict kernel performance, significantly increasing the number of ca…
- Paper reveals graph tokenization trade-offs for Transformer expressivity
A new paper explores the critical role of graph tokenization in applying Transformers to graph learning tasks. Researchers demonstrate that the method used to convert graph structures into tokens significantly impacts a…
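
For the local-LLM out-of-memory item above, the KV cache cost can be estimated with back-of-the-envelope arithmetic. The sketch below is not taken from the linked article. It assumes a standard decoder that caches one key vector and one value vector per layer, per KV head, per token, and it ignores framework overhead and allocator fragmentation. The function name `kv_cache_bytes` and the model shape (32 layers, 8 KV heads, head dimension 128) are hypothetical, chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Rough KV cache size in bytes for a dense (non-paged) cache.

    Assumption: one key and one value vector of length head_dim is stored
    per layer, per KV head, per token. bytes_per_elem=2 corresponds to
    16-bit storage; 1 approximates an 8-bit cache.
    """
    keys_and_values = 2  # one K tensor and one V tensor
    return (keys_and_values * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape, for illustration only.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for ctx in (4096, 32768, 131072):
    gib = kv_cache_bytes(seq_len=ctx, **cfg) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.2f} GiB")
```

Under these assumptions the cache costs 128 KiB per token, which works out to 0.5 GiB at 4,096 tokens, 4 GiB at 32,768 tokens, and 16 GiB at 131,072 tokens. The cache grows linearly with context length, so a model whose weights fit in VRAM can still run out of memory at long contexts. Passing `bytes_per_elem=1` halves the estimate.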