PulseAugur
research · [36 sources]

LLM research explores new methods for training, evaluation, and understanding model behavior

Researchers are developing new methods to improve LLM capabilities across several domains. One study introduces MemCoE, a cognition-inspired framework that teaches LLM agents to organize and update long-term user memory for better personalization. Another paper, ReLay, explores personalized LLM-generated plain-language summaries, finding that while personalization improves comprehension, it also introduces risks of bias and hallucination. Additionally, a new benchmark, ClassEval-Pro, evaluates LLMs on class-level code generation and reveals significant performance gaps among current frontier models.

Summary written by gemini-2.5-flash-lite from 36 sources. How we write summaries →

IMPACT Advances in LLM memory, personalization, and code generation benchmarks will drive further research and development in AI agents and software engineering.

RANK_REASON Multiple arXiv papers introduce new methodologies, benchmarks, and datasets for LLM research.

Read on Practical AI →


COVERAGE [36]

  1. OpenAI News TIER_1 ·

    Generative models

    This post describes four projects that share a common theme of enhancing or using generative models, a branch of unsupervised learning techniques in machine learning. In addition to describing our work, this post will tell you a bit more about generative models: what they are, wh…

  2. arXiv cs.LG TIER_1 · Wanru Zhao, Yihong Chen, Yuzhi Tang, Wentao Ma, Shengchao Hu, Shell Xu Hu, Alex Iacob, Abhinav Mehrotra, Nicholas D. Lane ·

    Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods

    arXiv:2605.05227v1 Announce Type: new Abstract: Data curation is a critical yet under-explored area in large language model (LLM) training. Existing methods, such as data selection and mixing, operate in an offline paradigm, detaching themselves from training. This separation int…
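    The online-reweighting idea the abstract points at can be sketched as a multiplicative-weights update of per-domain sampling probabilities driven by a held-out loss signal. This is a toy illustration of the general technique, not the paper's actual update rule; the function name, learning rate, and domain setup are all illustrative.

    ```python
    import math

    def update_domain_weights(weights, heldout_losses, lr=0.1):
        """Exponentiated-gradient reweighting: domains with higher held-out
        loss get sampled more, steering training toward under-fit data
        online instead of fixing a data mix offline."""
        scaled = [w * math.exp(lr * loss) for w, loss in zip(weights, heldout_losses)]
        total = sum(scaled)
        return [s / total for s in scaled]

    # Start uniform over three hypothetical domains (web, code, math),
    # then shift mass toward the domain with the highest held-out loss.
    w = [1 / 3] * 3
    w = update_domain_weights(w, heldout_losses=[2.1, 3.5, 2.8])
    ```

    After the update the weights still sum to 1, and the second domain (highest loss) receives the largest sampling share.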

  3. arXiv cs.AI TIER_1 · Chengda Lu, Xiaoyu Fan, Wei Xu ·

    HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory

    arXiv:2605.05741v1 Announce Type: new Abstract: While Large Language Models (LLMs) achieve strong performance across diverse tasks, their inference dynamics remain poorly understood because of the limited resolution of existing analysis tools. In this work, we identify an intrins…

  4. arXiv cs.CL TIER_1 · Yuan Sui, Bryan Hooi ·

    Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation

    arXiv:2601.21464v2 Announce Type: replace Abstract: Training large language models (LLMs) for non-verifiable tasks, such as creative writing, dialogue, and ethical reasoning, remains challenging due to the absence of ground-truth labels. While LLM-as-Judge approaches offer a scal…

  5. arXiv cs.CL TIER_1 · Hang Chen, Jiaying Zhu, Hongyang Chen, Hongxu Liu, Xinyu Yang, Wenya Wang ·

    Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

    arXiv:2605.06076v1 Announce Type: new Abstract: The "Locate-then-Update" paradigm has become a predominant approach in the post-training of large language models (LLMs), identifying critical components via mechanistic interpretability for targeted parameter updates. However, this…

  6. arXiv cs.CL TIER_1 · Wenya Wang ·

    Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

    The "Locate-then-Update" paradigm has become a predominant approach in the post-training of large language models (LLMs), identifying critical components via mechanistic interpretability for targeted parameter updates. However, this paradigm rests on a fundamental yet unverified …

  7. arXiv cs.LG TIER_1 · Hengyu Shi, Tianyang Han, Peizhe Wang, Zhiling Wang, Xu Yang, Junhao Su ·

    Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

    arXiv:2605.04913v1 Announce Type: cross Abstract: LLM post-training typically propagates task gradients through the full depth of the model. Although this end-to-end structure is simple and general, it couples task adaptation to full-depth activation storage, long-range backward …

  8. arXiv cs.LG TIER_1 · Pere Martra ·

    Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

    arXiv:2512.22671v2 Announce Type: replace-cross Abstract: Structured width pruning of GLU-MLP layers, guided by the Maximum Absolute Weight (MAW) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. While performance …
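    The Maximum Absolute Weight (MAW) criterion mentioned above can be illustrated with a toy sketch (shapes, names, and the scoring details are illustrative, not the paper's code): score each hidden channel of a GLU-MLP by the largest absolute weight it touches across the gate, up, and down projections, then keep only the top-scoring channels to shrink the expansion ratio.

    ```python
    def prune_glu_mlp_width(w_gate, w_up, w_down, keep_ratio=0.5):
        """Structured width pruning: rank hidden channels by max |weight|
        across gate/up rows and down columns, keep the top fraction."""
        hidden = len(w_gate)
        scores = []
        for i in range(hidden):
            scores.append(max(max(abs(v) for v in w_gate[i]),
                              max(abs(v) for v in w_up[i]),
                              max(abs(row[i]) for row in w_down)))
        n_keep = max(1, int(hidden * keep_ratio))
        keep = sorted(sorted(range(hidden), key=lambda i: scores[i])[-n_keep:])
        return ([w_gate[i] for i in keep],
                [w_up[i] for i in keep],
                [[row[i] for i in keep] for row in w_down])

    # Toy GLU-MLP with 4 hidden channels and model width 2.
    w_gate = [[0.9, -0.1], [0.05, 0.02], [0.4, 0.3], [0.01, 0.03]]
    w_up   = [[0.2, 0.1], [0.03, 0.01], [0.5, 0.2], [0.02, 0.02]]
    w_down = [[0.3, 0.04, 0.6, 0.05],
              [0.1, 0.02, 0.2, 0.01]]
    g2, u2, d2 = prune_glu_mlp_width(w_gate, w_up, w_down, keep_ratio=0.5)
    ```

    Channels 0 and 2 carry the largest weights, so they survive while the two near-zero channels are dropped, halving the MLP's expansion ratio.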

  9. arXiv cs.CL TIER_1 · Junhao Su ·

    Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

    LLM post-training typically propagates task gradients through the full depth of the model. Although this end-to-end structure is simple and general, it couples task adaptation to full-depth activation storage, long-range backward dependencies and direct task-gradient access to pr…

  10. arXiv cs.AI TIER_1 · Xiyuan Zhou, Xinlei Wang, Yirui He, Yang Wu, Ruixi Zou, Yuheng Cheng, Yulu Xie, Wenxuan Liu, Huan Zhao, Yan Xu, Jinjin Gu, Junhua Zhao ·

    EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving

    arXiv:2509.17677v2 Announce Type: replace Abstract: Large language models (LLMs) have shown strong performance on mathematical reasoning under well-defined conditions. However, real-world engineering problems involve uncertainty, context, and open-ended settings that extend beyon…

  11. arXiv cs.AI TIER_1 · YoungBin Kim ·

    Before Forgetting, Learn to Remember: Revisiting Foundational Learning Failures in LVLM Unlearning Benchmarks

    While Large Vision-Language Models (LVLMs) offer powerful capabilities, they pose privacy risks by unintentionally memorizing sensitive personal information. Current unlearning benchmarks attempt to mitigate this using fictitious identities but overlook a critical stage 1 failure…

  12. arXiv cs.AI TIER_1 · Shouyu Yin, Zhao Tian, Junjie Chen, Shikai Guo ·

    Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning

    arXiv:2605.00433v1 Announce Type: cross Abstract: Code generation, which aims to automatically generate source code from given programming requirements, has the potential to substantially improve software development efficiency. With the rapid advancement of large language models…
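    The curriculum idea in the title can be sketched as a staged schedule that orders coding problems by a difficulty proxy and unlocks harder ones as training progresses. This is a generic illustration under assumed data shapes (a list of problems with a "requirements" field), not the paper's actual curriculum.

    ```python
    def curriculum_stages(problems, n_stages=3):
        """Toy curriculum schedule: sort problems by requirement count
        (a difficulty proxy) and release them cumulatively in stages, so
        RL fine-tuning sees simple specs before compound ones."""
        ordered = sorted(problems, key=lambda p: len(p["requirements"]))
        per_stage = max(1, len(ordered) // n_stages)
        stages = [ordered[: per_stage * (s + 1)] for s in range(n_stages)]
        stages[-1] = ordered  # final stage unlocks the full problem set
        return stages

    probs = [{"id": 1, "requirements": ["a", "b", "c"]},
             {"id": 2, "requirements": ["a"]},
             {"id": 3, "requirements": ["a", "b"]}]
    stages = curriculum_stages(probs, n_stages=3)
    ```

    The first stage contains only the single-requirement problem; each later stage keeps the earlier problems and adds harder ones, which is the usual cumulative form of a curriculum.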

  13. arXiv cs.CL TIER_1 · Michael J. Parker, Maria G. Zavala-Cerna ·

    What Don't You Understand? Using Large Language Models to Identify and Characterize Student Misconceptions About Challenging Topics

    arXiv:2605.00294v1 Announce Type: new Abstract: This study presents a systematic approach to identifying and characterizing student misconceptions in online learning environments through a novel combination of quantitative performance analysis and large language model (LLM) asses…

  14. arXiv cs.CL TIER_1 · Derong Xu, Shuochen Liu, Pengfei Luo, Pengyue Jia, Yingyi Zhang, Yi Wen, Yimin Deng, Wenlin Zhang, Enhong Chen, Xiangyu Zhao, Tong Xu ·

    Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

    arXiv:2605.00702v1 Announce Type: new Abstract: Large language model (LLM) agents require long-term user memory for consistent personalization, but limited context windows hinder tracking evolving preferences over long interactions. Existing memory systems mainly rely on static, …

  15. arXiv cs.CL TIER_1 · Joey Chan, Yikun Han, Jingyuan Chen, Samuel Fang, Lauren D. Gryboski, Alexandra Lee, Sheel Tanna, Qingqing Zhu, Zhiyong Lu, Lucy Lu Wang, Yue Guo ·

    ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?

    arXiv:2605.00468v1 Announce Type: new Abstract: Plain Language Summaries (PLS) aim to make research accessible to lay readers, but they are typically written in a one-size-fits-all style that ignores differences in readers' information needs and comprehension. In health contexts,…

  16. arXiv cs.CL TIER_1 · Tong Xu ·

    Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

    Large language model (LLM) agents require long-term user memory for consistent personalization, but limited context windows hinder tracking evolving preferences over long interactions. Existing memory systems mainly rely on static, hand-crafted update rules; although reinforcemen…

  17. arXiv cs.CL TIER_1 · Yue Guo ·

    ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?

    Plain Language Summaries (PLS) aim to make research accessible to lay readers, but they are typically written in a one-size-fits-all style that ignores differences in readers' information needs and comprehension. In health contexts, this limitation is particularly important becau…

  18. arXiv cs.AI TIER_1 · Shikai Guo ·

    Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning

    Code generation, which aims to automatically generate source code from given programming requirements, has the potential to substantially improve software development efficiency. With the rapid advancement of large language models (LLMs), LLM-based code generation has attracted w…

  19. arXiv cs.AI TIER_1 · Chao Fei, Hongcheng Guo, Yanghua Xiao ·

    When Agents Evolve, Institutions Follow

    arXiv:2604.27691v1 Announce Type: new Abstract: Across millennia, complex societies have faced the same coordination problem of how to organize collective action among cognitively bounded and informationally incomplete individuals. Different civilizations developed different poli…

  20. arXiv cs.CL TIER_1 · Maria G. Zavala-Cerna ·

    What Don't You Understand? Using Large Language Models to Identify and Characterize Student Misconceptions About Challenging Topics

    This study presents a systematic approach to identifying and characterizing student misconceptions in online learning environments through a novel combination of quantitative performance analysis and large language model (LLM) assessment. We analyzed data from 9 course periods ac…

  21. arXiv cs.CL TIER_1 · Yeheng Chen, Chaoxiang Xie, Yuling Shi, Wenhao Zeng, Yongpan Wang, Hongyu Zhang, Xiaodong Gu ·

    ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

    arXiv:2604.26923v1 Announce Type: cross Abstract: LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes -- compositional code creation, i.e., building a complete, inte…

  22. arXiv cs.CL TIER_1 · Xiaodong Gu ·

    ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

    LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes -- compositional code creation, i.e., building a complete, internally structured class from a specification -- re…

  23. arXiv cs.AI TIER_1 · Eduardo Oliveira, Michael Fu, Patanamon Thongtanunam, Sonsoles López-Pernas, Mohammed Saqr ·

    AI-Assisted Code Review as a Scaffold for Code Quality and Self-Regulated Learning: An Experience Report

    arXiv:2604.23251v1 Announce Type: cross Abstract: Code review is central to software engineering education but hard to scale in capstone projects due to tight deadlines, uneven peer feedback, and limited prior experience. We investigate an LLM-as-reviewer integrated directly into…

  24. arXiv cs.AI TIER_1 · Haoxuan Zhang, Ruochi Li, Yang Zhang, Zhenni Liang, Junhua Ding, Ting Xiao, Haihua Chen ·

    MetaGAI: A Large-Scale and High-Quality Benchmark for Generative AI Model and Data Card Generation

    arXiv:2604.23539v1 Announce Type: new Abstract: The rapid proliferation of Generative AI necessitates rigorous documentation standards for transparency and governance. However, manual creation of Model and Data Cards is not scalable, while automated approaches lack large-scale, h…

  25. METR (Model Evaluation & Threat Research) TIER_1 ·

    Response to NIST Draft Generative AI Profile

    Comments on NIST’s draft document “AI Risk Management Framework: Generative AI Profile.”

  26. arXiv cs.CV TIER_1 · JuneHyoung Kwon, MiHyeon Kim, Eunju Lee, JungMin Yun, Byeonggeuk Lim, YoungBin Kim ·

    Before Forgetting, Learn to Remember: Revisiting Foundational Learning Failures in LVLM Unlearning Benchmarks

    arXiv:2605.03759v1 Announce Type: new Abstract: While Large Vision-Language Models (LVLMs) offer powerful capabilities, they pose privacy risks by unintentionally memorizing sensitive personal information. Current unlearning benchmarks attempt to mitigate this using fictitious id…

  27. arXiv stat.ML TIER_1 · Yizheng Huang, Wenjun Zeng, Aditi Kumaresan, Zi Wang ·

    ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

    arXiv:2604.23099v1 Announce Type: cross Abstract: Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that levera…

  28. arXiv cs.CV TIER_1 · Nivetha Jayakumar, Swakshar Deb, Bahram Jafrasteh, Qingyu Zhao, Miaomiao Zhang ·

    Generative Modeling of Neurodegenerative Brain Anatomy with 4D Longitudinal Diffusion Model

    arXiv:2604.22700v1 Announce Type: new Abstract: Understanding and predicting the progression of neurodegenerative diseases remains a major challenge in medical AI, with significant implications for early diagnosis, disease monitoring, and treatment planning. However, most availab…

  29. arXiv stat.ML TIER_1 · Zi Wang ·

    ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

    Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate perf…

  30. arXiv cs.CV TIER_1 · Miaomiao Zhang ·

    Generative Modeling of Neurodegenerative Brain Anatomy with 4D Longitudinal Diffusion Model

    Understanding and predicting the progression of neurodegenerative diseases remains a major challenge in medical AI, with significant implications for early diagnosis, disease monitoring, and treatment planning. However, most available longitudinal neuroimaging datasets are tempor…

  31. Chip Huyen TIER_1 ·

    Common pitfalls when building generative AI applications

    As we’re still in the early days of building applications with foundation models, it’s normal to make mistakes. This is a quick note with examples of some of the most common pitfalls that I’ve seen, both from public case studies and from my personal experience. Because …

  32. Chip Huyen TIER_1 ·

    Building A Generative AI Platform

    After studying how companies deploy generative AI applications, I noticed many similarities in their platforms. This post outlines the common components of a generative AI platform, what they do, and how they are implemented. I try my best to keep the architecture general, but…

  33. Chip Huyen TIER_1 ·

    Generative AI Strategy

    I had a lot of fun preparing the talk “Leadership needs us to do generative AI. What do we do?” for Fully Connected (https://fullyconnected.com/). The idea for the talk came from many conversations I’ve had recently with friends who need to figure out the…

  34. Practical AI TIER_1 · Practical AI LLC ·

    Generative models: exploration to deployment

    What is the model lifecycle like for experimenting with and then deploying generative AI models? Although there are some similarities, this lifecycle differs somewhat from previous data science practices in that models are typically not trained from scratch (or even fine-tuned…

  35. Practical AI TIER_1 · Practical AI LLC ·

    From ML to AI to Generative AI

    Chris and Daniel take a step back to look at how generative AI fits into the wider landscape of ML/AI and data science. They talk through the differences in how one approaches “traditional” supervised learning and how practitioners are approaching generative AI based solutions…

  36. Medium — fine-tuning tag TIER_1 · praveenreddy_c ·

    How LLMs Learn to Think: Inside DeepSeek’s GRPO Technique

    https://medium.com/@mailpraveenreddy.c/how-llms-learn-to-think-inside-deepseeks-grpo-technique-c2acf34aa6e1?source=rss------fine_tuning-5
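    DeepSeek's GRPO, named in this post's title, replaces PPO's learned critic with group-relative advantages: sample several completions per prompt and score each against its own group's reward statistics. A minimal sketch of that advantage computation (this is the published GRPO normalization in general form, not code from the Medium post):

    ```python
    from statistics import mean, pstdev

    def grpo_advantages(rewards, eps=1e-8):
        """GRPO's group-relative advantage: normalize each completion's
        reward by the mean and std of its sampling group, removing the
        need for a separate value (critic) network."""
        mu, sigma = mean(rewards), pstdev(rewards)
        return [(r - mu) / (sigma + eps) for r in rewards]

    # Four completions sampled for one prompt, scored by a reward model.
    adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
    ```

    Advantages within a group sum to zero: the best completion is pushed up, the worst down, and average completions are left roughly untouched.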