GSM8K
PulseAugur coverage of GSM8K — every cluster mentioning GSM8K across labs, papers, and developer communities, ranked by signal.
Sentiment data available for 9 days
-
Claude Sonnet with self-consistency beats Opus on math, code tasks
A recent analysis demonstrates that employing a self-consistency technique with Anthropic's Claude Sonnet model can outperform a single call to the more powerful Claude Opus model on specific tasks. This method involves…
-
New method steers LLM attention to correct reasoning errors
Researchers have developed Manifold-Guided Attention Steering (MAGS), a novel method to improve the reasoning capabilities of large language models. MAGS identifies deviations from a 'correctness manifold' in the model'…
-
X-Token method enhances knowledge distillation for mismatched tokenizers
Researchers have developed X-Token, a novel knowledge distillation technique designed to improve student models by learning from teacher models with different tokenizers. The method addresses limitations in existing log…
-
New 'Distillation Game' framework reveals model imitation risks
Researchers have developed a new framework called "The Distillation Game" to study the trade-off between model utility and imitation risk. This framework models the interaction as a minimax game between a teacher model …
-
New research frames LLM post-training around state distributions, not just tokens
Researchers have proposed a new perspective on large language model post-training, focusing on the distribution of states rather than just tokens. Their study suggests that the source and locality of training states can…
-
New RL methods tackle LLM training issues
Two new research papers introduce methods to improve the training of large language models using reinforcement learning. One paper addresses the issue of "advantage collapse" in Group Relative Policy Optimization (GRPO)…
-
New Reflector framework boosts LLM safety against jailbreaks
Researchers have developed a new framework called Reflector to enhance the safety of Large Language Models (LLMs) against sophisticated jailbreak attacks. This two-stage approach first uses teacher-guided generation for…
-
HRM-Text model drastically cuts LLM pretraining costs
Researchers have developed HRM-Text, a novel Hierarchical Recurrent Model that significantly reduces the computational resources and training data required for pretraining large language models. By decoupling computatio…
-
Small LLMs use positional copying shortcut for arithmetic, bypassing CoT logic
A new research paper reveals a significant shortcut in how small language models perform arithmetic tasks using chain-of-thought (CoT) prompting. Instead of relying on logical sequencing, these models tend to copy the n…
-
LLM benchmark costs analyzed: $0.12 for 3 tasks
Running three LLM benchmarks (GSM8K, HellaSwag, and TruthfulQA) on a single T4 GPU costs approximately $0.12. The analysis reveals that generative tasks are the primary cost driver, while log-likelihood…
-
Evaluate LLMs for under $1 using Qwen2.5-0.5B
This post details a cost-effective method for evaluating large language models, demonstrating that comprehensive benchmarks can be run for under a dollar. The author used a free Google Colab T4 instance to test the Qwen…
-
CANTANTE framework optimizes LLM multi-agent systems via credit attribution
Researchers have developed CANTANTE, a new framework designed to optimize the configuration of large language model-based multi-agent systems. This system addresses the challenge of assigning credit for performance when…
-
New Yoked Feature Preference Optimization enhances LLM math reasoning
Researchers have introduced Yoked Feature Preference Optimization (YFPO), a novel framework designed to enhance the mathematical reasoning capabilities of large language models. Unlike existing methods that rely solely …
-
AI reasoning studies flawed by focus on final answer, not computation
A new research paper identifies a significant flaw in chain-of-thought (CoT) corruption studies, which are used to evaluate the faithfulness of AI reasoning. The study found that these evaluations often mistakenly ident…
-
New RL algorithm fix boosts GSM8K accuracy by 45 points
Researchers have identified a critical issue in the Group Relative Policy Optimization (GRPO) algorithm when applied to binary rewards, leading to "gradient starvation." This occurs when all responses in a group are eit…
-
New research reveals "coupling tax" limits LLM reasoning accuracy
A new research paper introduces the concept of a "coupling tax" in large language models, highlighting how shared token budgets for reasoning and final answers can hinder accuracy. The study found that for certain tasks…
-
LLM framework CIKA pinpoints causally relevant math concepts
Researchers have developed a new framework called CIKA to improve large language model (LLM) mathematical reasoning by identifying causally relevant concepts. Unlike previous methods that struggled with spurious associa…
-
LoRA rank allocation fails in RL fine-tuning, study finds
A new study on the Qwen 2.5 1.5B model reveals that adaptive rank allocation techniques, effective in supervised fine-tuning, do not translate to reinforcement learning with Group Relative Policy Optimization (GRPO). Re…
-
AI models use policy-guided routing for cost-effective reasoning on math tasks
Researchers have developed a new method for cost-effective reasoning in large language models by implementing a policy-guided stepwise model routing system. This approach formulates the routing of intermediate chain-of-…
-
QKVShare framework enables efficient quantized KV-cache handoff for on-device LLMs
Researchers have developed QKVShare, a framework designed to improve the efficiency of transferring latent context between agents in multi-agent LLM systems operating on edge devices. This approach utilizes quantized KV…