PulseAugur

Google Research evaluates LLM alignment and improves factuality

Google Research has developed a new framework to evaluate the behavioral alignment of large language models with human social inclinations. This approach adapts established psychological questionnaires into large-scale situational judgment tests, allowing for the quantification of model tendencies in realistic scenarios. The research identifies gaps where model behaviors deviate from human consensus or fail to capture the range of human opinions, aiming to improve LLM navigation of social dynamics. Separately, Google Research also introduced SLED, a novel decoding strategy that enhances LLM factuality by utilizing all model layers instead of just the final one, without requiring external data or fine-tuning.
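
The two kinds of gap mentioned above (deviating from human consensus, and failing to capture the range of human opinions) can be illustrated with a small sketch. This is not Google Research's metric: the function names, the three-option scenario, the made-up response counts, and the use of total variation distance are all assumptions chosen for illustration.

```python
from collections import Counter

def choice_distribution(choices, options):
    """Return the share of responses that picked each option."""
    counts = Counter(choices)
    total = len(choices)
    return {opt: counts.get(opt, 0) / total for opt in options}

def alignment_gap(human_choices, model_choices, options):
    """Compare model answers with human answers for one scenario (hypothetical metric)."""
    human = choice_distribution(human_choices, options)
    model = choice_distribution(model_choices, options)
    # Consensus check: does the model's most frequent answer match the humans'?
    human_top = max(options, key=lambda o: human[o])
    model_top = max(options, key=lambda o: model[o])
    # Total variation distance: 0 means identical distributions, 1 means disjoint.
    tv = 0.5 * sum(abs(human[o] - model[o]) for o in options)
    return {"matches_consensus": human_top == model_top, "tv_distance": tv}

# Illustrative data only: ten human raters and ten sampled model answers.
options = ["A", "B", "C"]
human_choices = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
model_choices = ["A"] * 10  # the model always picks the majority answer

result = alignment_gap(human_choices, model_choices, options)
print(result)
```

In this toy case the model agrees with the human majority, yet the distance between the two distributions is well above zero because the model never produces the minority answers. That is one simple way to separate "deviates from consensus" from "fails to capture the range of opinions".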

IMPACT New methods for evaluating LLM alignment and improving factuality could lead to more trustworthy and socially adept AI systems.

RANK_REASON The cluster contains two research papers from Google Research detailing new methods for evaluating LLM alignment and improving LLM factuality.

AI-generated summary · Google Gemini · from 598 sources.

COVERAGE [598]

  1. Google AI / Research TIER_1 English(EN) ·

    Evaluating alignment of behavioral dispositions in LLMs

    Generative AI

  2. Google AI / Research TIER_1 English(EN) ·

    Making LLMs more accurate by using all of their layers

    Algorithms & Theory

  3. Hugging Face Blog TIER_1 English(EN) ·

    Consilium: When Multiple LLMs Collaborate

  4. Hugging Face Blog TIER_1 English(EN) ·

    Mastering Long Contexts in LLMs with KVPress

  5. Hugging Face Blog TIER_1 English(EN) ·

    Judge Arena: Benchmarking LLMs as Evaluators

  6. Hugging Face Blog TIER_1 English(EN) ·

    Expert Support case study: Bolstering a RAG app with LLM-as-a-Judge

  7. Hugging Face Blog TIER_1 English(EN) ·

    CodeGemma - an official Google release for code LLMs

  8. Hugging Face Blog TIER_1 English(EN) ·

    Introducing the Open Ko-LLM Leaderboard: Leading the Korean LLM Evaluation Ecosystem

  9. Hugging Face Blog TIER_1 English(EN) ·

    Open-source LLMs as LangChain Agents

  10. Hugging Face Blog TIER_1 English(EN) ·

    Introducing Agents.js: Give tools to your LLMs using JavaScript

  11. arXiv cs.LG TIER_1 English(EN) · Advik Raj Basani, Anshuman Chhabra ·

    Exposing the Illusion of Erasure in Knowledge Editing for LLMs

    arXiv:2606.23276v2 Announce Type: replace Abstract: Knowledge Editing (KE) has emerged as a frontier for updating specific facts in LLMs without costly retraining, but its reliability and underlying mechanisms remain poorly understood. In this work, we examine KE from an adversar…

  12. arXiv cs.AI TIER_1 English(EN) · Anshuman Chhabra ·

    Exposing the Illusion of Erasure in Knowledge Editing for LLMs

    Knowledge Editing (KE) has emerged as a frontier for updating specific facts in LLMs without costly retraining, but its reliability and underlying mechanisms remain poorly understood. In this work, we examine KE from an adversarial elicitation perspective, revealing that edited k…

  13. arXiv cs.CL TIER_1 English(EN) · Haw-Shiuan Chang, Jeffrey Gomez, Mehul Patwari, Aryan Sajith, Hamed Zamani ·

    Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users

    arXiv:2606.20482v1 Announce Type: new Abstract: To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict the human preference based on the response text. These existing methods have two key limitations. First…

  14. arXiv cs.AI TIER_1 English(EN) · Zunchen Huang, Songgaojun Deng ·

    Analyzing the Narration Gap in LLM-Solver Loops

    arXiv:2606.19588v1 Announce Type: new Abstract: Formal tools such as SAT and SMT solvers are increasingly embedded in language model reasoning pipelines when a safety or security critical question can be formulated in logic. Unlike chain of thought whose steps are sampled from th…

  15. arXiv cs.CL TIER_1 English(EN) · Hamed Zamani ·

    Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users

    To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict the human preference based on the response text. These existing methods have two key limitations. First, the users rarely provide explicit feedback for…

  16. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Sankalp Nayak ·

    Heterogeneous LLM Debate Under Adversarial Peers: Honest Gains, Replacement Costs, and Resilience

    Heterogeneous LLM debate is motivated by the promise that diverse peers correct one another, but the same exchange that carries correction also carries adversarial influence. We measure which dominates by tracking how a heterogeneous peer changes the honest agents' revision behav…

  17. arXiv cs.CL TIER_1 English(EN) · Naihao Deng, Yiming Feng, Chimaobi Okite, Kaijian Zou, Lu Wang, Rada Mihalcea, Yulong Chen ·

    The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

    arXiv:2606.18656v1 Announce Type: new Abstract: Warning: This paper studies stereotypes and biases, and contains potentially disturbing examples, used for illustration purposes only. Our findings should not be interpreted as an argument against alignment. Instead, this paper high…

  18. arXiv cs.AI TIER_1 English(EN) · Sunnie S. Y. Kim, Margit Bowler, Leon A Gatys ·

    Examining Human-Like Behaviors in LLMs: A Multi-Dimensional Analysis of Model Behaviors, User Factors, and System Prompts

    arXiv:2606.18258v1 Announce Type: cross Abstract: Large language models (LLMs) exhibit a wide range of human-like behaviors, from expressing thoughts and emotions, to engaging in relationship-building with users, to refusing requests and maintaining boundaries. Despite their prev…

  19. arXiv cs.LG TIER_1 English(EN) · Zilong Zhang, Yi-Ting Hung, Lei Ding, Chi-Kuang Yeh ·

    Quantifying and Auditing LLM Evaluation via Positive-Unlabeled Learning

    arXiv:2606.19057v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly used as judges for scalable evaluation, yet such LLM-as-a-Judge systems exhibit systematic biases that are decoupled from semantic quality, most notably verbosity bias. Meanwhile, h…

  20. arXiv cs.AI TIER_1 English(EN) · Xi Fang, Weijie Xu, Yuchong Zhang, Stephanie Eckman, Scott Nickleach, Chandan K. Reddy ·

    The Personalization Trap: How User Memory Alters Emotional Reasoning in LLMs

    arXiv:2510.09905v2 Announce Type: replace Abstract: When an AI assistant remembers that Sarah is a single mother working two jobs, does it interpret her stress differently than if she were a wealthy executive? As personalized AI systems increasingly incorporate long-term user mem…

  21. arXiv cs.AI TIER_1 English(EN) · Mika Mäntylä, Patricia Matsubara, Katia Romero Felizardo, Miikka Kuutila, Marco Gerosa, Savio de Sousa Sampaio, Tayana Conte, Igor Steinmacher ·

    Understanding LLMs in Title-Abstract Screening: From Disagreements to Recommendations

    arXiv:2606.17588v1 Announce Type: cross Abstract: Several studies have examined the use of large language models (LLMs) for title-abstract screening in systematic reviews (SRs), reporting mixed accuracy. However, questions of reliability remain largely unaddressed. In this study,…

  22. arXiv cs.CL TIER_1 English(EN) · Hyungwon Kim, Kandarp Joshi, Lillian Zhou, Pavel Golik, Petar Aleksic ·

    Are you speaking my languages? On spoken language adherence in multimodal LLMs

    arXiv:2606.17281v1 Announce Type: new Abstract: While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To pr…

  23. arXiv cs.CL TIER_1 English(EN) · Ali Marashian, Alexis Palmer, Katharina von der Wense ·

    Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation

    arXiv:2606.17234v1 Announce Type: new Abstract: The rapid rise in popularity of large language models (LLMs) for translation calls for a thorough study of the reliability of their confidence in their own outputs. Unlike many generation tasks, translation errors and confidence lev…

  24. arXiv cs.LG TIER_1 English(EN) · SongEun Kim, Seungyoo Lee, Edwin Fong, Hyungi Lee, Juho Lee ·

    From Drift to Coherence: Stabilizing Beliefs in LLMs

    arXiv:2606.17832v1 Announce Type: new Abstract: Large language models (LLMs) are often hypothesized to perform implicit Bayesian inference, yet a key coherence condition, the martingale property of predictive beliefs, has been shown to fail in controlled synthetic in-context lear…

  25. arXiv cs.CL TIER_1 English(EN) · Omar Sharif, Eftekhar Hossain, Nikhil Singh, Patrick Ng ·

    Disentangling Perception and Reasoning in Multimodal LLMs via Reward Design

    arXiv:2601.00215v2 Announce Type: replace-cross Abstract: Reinforcement learning with verifiable rewards has driven major gains in LLM reasoning, and it is intuitive to assume this recipe will transfer well to multimodal models. However, multimodal models do two things: first, pe…

  26. arXiv cs.CL TIER_1 English(EN) · Rui Wen, Lu Sun, Jiayang Liu, Zesheng Xu, Tianshuo Cong, Zheng Li ·

    The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer

    arXiv:2606.17609v1 Announce Type: new Abstract: Compressing large language models reduces memory use and inference cost, but it can also create failures that standard benchmarks miss. A pruned model may still perform well on multiple-choice evaluations, yet fail to answer the sam…

  27. arXiv cs.CL TIER_1 English(EN) · Ramaravind Kommiya Mothilal, Terry Jingchen Zhang, Raiyan Ahmed, Zhijing Jin, Shion Guha, Syed Ishtiaque Ahmed ·

    Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement

    arXiv:2606.17506v1 Announce Type: new Abstract: Evaluations of social bias in LLMs largely focus on whether models generate or imply biased content. However, as LLMs are increasingly used as judges of bias, they may exhibit social biases in subtler ways in how they evaluate biase…

  28. arXiv cs.CL TIER_1 English(EN) · Yulong Chen ·

    The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

    Warning: This paper studies stereotypes and biases, and contains potentially disturbing examples, used for illustration purposes only. Our findings should not be interpreted as an argument against alignment. Instead, this paper highlights the need for principled approaches to mor…

  29. arXiv cs.LG TIER_1 English(EN) · Juho Lee ·

    From Drift to Coherence: Stabilizing Beliefs in LLMs

    Large language models (LLMs) are often hypothesized to perform implicit Bayesian inference, yet a key coherence condition, the martingale property of predictive beliefs, has been shown to fail in controlled synthetic in-context learning settings. We revisit this question in a mor…

  30. arXiv cs.CL TIER_1 English(EN) · Zheng Li ·

    The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer

    Compressing large language models reduces memory use and inference cost, but it can also create failures that standard benchmarks miss. A pruned model may still perform well on multiple-choice evaluations, yet fail to answer the same question in open generation. We ask what pruni…

  31. arXiv cs.CL TIER_1 English(EN) · Syed Ishtiaque Ahmed ·

    Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement

    Evaluations of social bias in LLMs largely focus on whether models generate or imply biased content. However, as LLMs are increasingly used as judges of bias, they may exhibit social biases in subtler ways in how they evaluate biased content, which current methods do not systemat…

  32. arXiv cs.CL TIER_1 English(EN) · Pan Wang ·

    REFLEX: Reflective Evolution from LLM Experience

    arXiv:2606.16496v1 Announce Type: new Abstract: Large multimodal language models (LLMs) have emerged as powerful tools for guiding evolutionary search toward interpretable programmatic policies. However, existing frameworks rely on a monolithic model call to simultaneously interp…

  33. arXiv cs.LG TIER_1 English(EN) · Violet Xiang, Amrith Setlur, Chase Blagden, Nick Haber, Aviral Kumar ·

    ExpRL: Exploratory RL for LLM Mid-Training

    arXiv:2606.17024v1 Announce Type: new Abstract: Sparse reward reinforcement learning (RL) has become a standard tool for improving LLM reasoning, but its success depends critically on the coverage present in the base model. In practice, models are often primed for RL through \emp…

  34. arXiv cs.CL TIER_1 English(EN) · Katharina Trinley, Jesujoba O. Alabi, Dietrich Klakow, Vagrant Gautam ·

    A Mechanistic Understanding of Pronoun Fidelity in LLMs

    arXiv:2606.16407v1 Announce Type: new Abstract: Faithful and robust pronoun use is important for fair and coherent generations, yet large language models largely fail when multiple referents use different pronouns. To study the interplay of reasoning, repetition, and bias in this…

  35. arXiv cs.CL TIER_1 English(EN) · Xuran Li, Guanqin Zhang, Imran Razzak, Hakim Hacid, Eleanna Kafeza, Hao Xue, Flora D. Salim ·

    Evaluating LLM Personalization via Semantic Constraint Verification

    arXiv:2606.16368v1 Announce Type: new Abstract: Current evaluation paradigms for Large Language Model (LLM) personalization rely heavily on brittle surface-matching metrics or computationally expensive LLM-as-a-judge protocols, both of which lack interpretability. To address thes…

  36. arXiv cs.CL TIER_1 English(EN) · Nafiseh Nikeghbal, Amir Hossein Kargaran, Shaghayegh Kolli, Jana Diesner ·

    Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs

    arXiv:2606.16011v1 Announce Type: new Abstract: Standard accuracy benchmarks are designed to test how closely large language models (LLMs) approach correct answers, but are not suitable for testing whether LLMs stick with a correct answer when that answer is challenged by a plaus…

  37. arXiv cs.AI TIER_1 English(EN) · Erica Zhang, Fangzhao Zhang, Aneesh Pappu, Batu El, Jose Blanchet, Susan Athey, Jiashuo Liu, James Zou ·

    TERMS-Bench: Diagnosing LLM Negotiation Agents Beyond Deal Rate

    arXiv:2605.13909v2 Announce Type: replace-cross Abstract: Negotiation is a central mechanism of economic exchange, shaping markets, procurement, labor agreements, and resource allocation. It is also a canonical testbed for agentic language models, requiring multi-turn interaction…

  38. arXiv cs.AI TIER_1 English(EN) · Zhen Yang, Mingyang Zhang, Feng Chen, Ganggui Ding, Liang Hou, Xin Tao, Ying-Cong Chen ·

    Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention

    arXiv:2510.13940v4 Announce Type: replace-cross Abstract: Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a si…

  39. arXiv cs.AI TIER_1 English(EN) · Uljad Berdica, Fernando Acero, Anton Ipsen, Parisa Zehtabi, Michael Cashmore, Manuela Veloso ·

    When Do We Need LLMs? A Diagnostic for Language-Driven Bandits

    arXiv:2604.05859v2 Announce Type: replace Abstract: We study Contextual Multi-Armed Bandits (CMABs) for non-episodic decision-making problems where the context includes both textual and numerical information (e.g., recommendation systems, dynamic portfolio adjustments, offer sele…

  40. arXiv cs.AI TIER_1 English(EN) · Louie Hong Yao, Nicholas Jarvis, Tiffany Zhan, Saptarshi Ghosh, Linfeng Liu, Tianyu Jiang ·

    JE-IRT: A Geometric Lens on LLM Abilities through Joint Embedding Item Response Theory

    arXiv:2509.22888v2 Announce Type: replace Abstract: Standard LLM evaluation practices compress diverse abilities into single scores, obscuring their inherently multidimensional nature. We present JE-IRT, a geometric item-response framework that embeds both LLMs and questions in a…

  41. arXiv cs.AI TIER_1 English(EN) · Ismail Hossain, Sai Puppala, Md Jahangir Alam, Tanzim Ahad, Sajedul Talukder ·

    SkillVetBench: LLM-as-Judge for Multi-Dimensional Security Risk Evaluation in Open-Source LLM Agent Skills

    arXiv:2606.15899v1 Announce Type: cross Abstract: Open-source LLM agent ecosystems are growing rapidly, yet the security of community-contributed skills - modular tool definitions that extend agent capabilities - remains largely unvetted. The gap we fill: existing scanners operat…

  42. arXiv cs.AI TIER_1 English(EN) · Hiroyasu Usami, Keisuke Hara, Ayato Tsuboi, Naohiko Matsuda ·

    LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation

    arXiv:2606.15610v1 Announce Type: cross Abstract: LLM-as-a-judge systems are now routinely used for open-ended model evaluation, where human preference annotation is costly, slow, and difficult to reproduce. Yet these judges are often reported as scalar accuracy, win-rate, or agr…

  43. arXiv cs.AI TIER_1 English(EN) · Aina Vila Pons, Ioannis Tzachristas, Constantinos Antoniou ·

    LLMs on Tabular Data with Limited Semantics: Evidence from Industrial Car Retrofit Prediction

    arXiv:2606.15314v1 Announce Type: cross Abstract: Industrial retrofit planning depends on structured operational data rather than free text: planners must estimate whether a newly registered prototype will require a retrofit, which retrofit package it will need, and how long the …

  44. arXiv cs.AI TIER_1 English(EN) · Olivia Peiyu Wang, Sanna Wong-Toropainen, Daneshvar Amrollahi, Ryan Bai, Tashvi Bansal, Arush Garg, Leilani H. Gilpin ·

    Know Your Limits: On the Faithfulness of LLMs as Solvers and Autoformalizers in Legal Reasoning

    arXiv:2606.16118v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve strong performance on reasoning tasks, but whether this reflects faithful logical inference or heuristic approximation remains unclear. We study this question in legal entailment by comparing thr…

  45. arXiv cs.AI TIER_1 English(EN) · Yitao Li ·

    Who Drifted: the System or the Judge? Anytime-Valid Attribution in LLM Evaluation Pipelines

    arXiv:2606.15474v1 Announce Type: new Abstract: Continuous evaluation of LLM products relies on a strong LLM judge treated as ground truth: a cheap monitor scores every interaction and a team is paged when the score drifts down. But the judge is itself a model behind an API, and …

  46. arXiv cs.AI TIER_1 English(EN) · Alyssa Unell, Natalie Dullerud, Naomi Boneh, Meena Jagadeesan, Tatsu Hashimoto, Nigam Shah, Sanmi Koyejo ·

    Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability

    arXiv:2606.15029v1 Announce Type: new Abstract: LLM judges are used to reduce the need for costly human labor in evaluating open-ended text generation. However, the reliability of these judges depends critically on their alignment with human raters -- a property that itself depen…

  47. arXiv cs.AI TIER_1 English(EN) · Louis Mahon, Elliot Ford, Callum Hackett ·

    A Definition of Good Explanations and the Challenges Explaining LLM Outputs

    arXiv:2606.14838v1 Announce Type: new Abstract: How to define a good explanation is a long-standing philosophical debate which has found recent renewed interest in the context of AI outputs. Explainability is crucial for AI adoption in many contexts, but in order to produce good …

  48. arXiv cs.CL TIER_1 English(EN) · Petar Aleksic ·

    Are you speaking my languages? On spoken language adherence in multimodal LLMs

    While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To preserve flexibility and code-switching capabiliti…

  49. arXiv cs.CL TIER_1 English(EN) · Katharina von der Wense ·

    Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation

    The rapid rise in popularity of large language models (LLMs) for translation calls for a thorough study of the reliability of their confidence in their own outputs. Unlike many generation tasks, translation errors and confidence levels can be useful at different levels of granula…

  50. arXiv cs.LG TIER_1 English(EN) · Aviral Kumar ·

    ExpRL: Exploratory RL for LLM Mid-Training

    Sparse reward reinforcement learning (RL) has become a standard tool for improving LLM reasoning, but its success depends critically on the coverage present in the base model. In practice, models are often primed for RL through mid-training on curated reasoning traces that…

  51. arXiv cs.CL TIER_1 English(EN) · Pan Wang ·

    REFLEX: Reflective Evolution from LLM Experience

    Large multimodal language models (LLMs) have emerged as powerful tools for guiding evolutionary search toward interpretable programmatic policies. However, existing frameworks rely on a monolithic model call to simultaneously interpret visual behavioral evidence and synthesize co…

  52. arXiv cs.CL TIER_1 English(EN) · Vagrant Gautam ·

    A Mechanistic Understanding of Pronoun Fidelity in LLMs

    Faithful and robust pronoun use is important for fair and coherent generations, yet large language models largely fail when multiple referents use different pronouns. To study the interplay of reasoning, repetition, and bias in this task, prior work relies exclusively on behaviou…

  53. arXiv cs.CL TIER_1 English(EN) · Flora D. Salim ·

    Evaluating LLM Personalization via Semantic Constraint Verification

    Current evaluation paradigms for Large Language Model (LLM) personalization rely heavily on brittle surface-matching metrics or computationally expensive LLM-as-a-judge protocols, both of which lack interpretability. To address these limitations, we introduce Natural Language Inf…

  54. arXiv cs.AI TIER_1 English(EN) · Yash Pulse, Yong-Bin Kang, Abhik Banerjee, Abdur Forkan, Prem Prakash Jayaraman ·

    FactoryLLM: A Safe and Open-Source AI Playground for Evaluating LLMs in Smart Factories

    arXiv:2606.14119v1 Announce Type: new Abstract: Fault diagnostics and recovery in smart factories is challenging because critical information is dispersed across manuals of multiple machines which are interconnected through the manufacturing process. Large Language Models (LLMs) …

  55. arXiv cs.CL TIER_1 English(EN) · Li Zhang, Yuzhen Shi, Yiran Hu, Jingwen Zhang, Wenbo Lv, Yubo Ma, Wei Wang, Rongyao Shi, Yuanyang Qiu, Xinran Xu, Yuemeng Qi, Linlin Miao, Jaromir Savelka, Yun Liu, Kevin Ashley, Bing Zhao, Hu Wei, Lin Qu ·

    DLawBench: Evaluating LLMs Through Multi-Turn Legal Consultation

    arXiv:2606.13931v1 Announce Type: new Abstract: Lawyer-client consultation is a critical starting point for legal services. Effective legal assistance hinges on eliciting sufficient and truthful information from clients in order to devise strategies that best protect their intere…

  56. arXiv cs.AI TIER_1 English(EN) · Toni J. B. Liu, Baran Zadeoğlu, Nicolas Boullé, Raphaël Sarfati, Gurbir Arora, Christopher J. Earls ·

    Jacobian Scopes: token-level causal attributions in LLMs

    arXiv:2601.16407v3 Announce Type: replace-cross Abstract: Large language models (LLMs) make next-token predictions based on clues present in their context, such as semantic descriptions and in-context examples. Yet, elucidating which prior tokens most strongly influence a given p…

  57. arXiv cs.AI TIER_1 English(EN) · Shaun Feakins, Ibrahim Habli, Kim Littler, Robert Palin ·

    I'm Sorry Driver, I'm Afraid I Can't Do That: Appraising the Safety of LLMs within Automotive Contexts

    arXiv:2606.14327v1 Announce Type: cross Abstract: This paper appraises recent frameworks within AI development to integrate LLMs into control tasks in automotive contexts from the perspective of safety assurance. This work has built upon the rapid integration of LLMs across autom…

  58. arXiv cs.AI TIER_1 English(EN) · Abel Yagubyan ·

    The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

    arXiv:2606.13685v1 Announce Type: cross Abstract: LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks spanni…

  59. arXiv cs.LG TIER_1 English(EN) · Angira Sharma, Christian Schroeder de Witt, Philip Torr, Anisoara Calinescu, Jialin Yu ·

    A Low-Rank Subspace Analysis of LLM Interventions

    arXiv:2606.14388v1 Announce Type: new Abstract: Interventions designed to modify a particular behavior in LLMs, such as refusal or sycophancy, often produce unintended changes in other behaviors. This lack of targeted control makes it difficult to design and implement reliable sa…

  60. arXiv cs.CL TIER_1 English(EN) · Yufeng Xu, Taiming Lu, Kunjun Li, Jiachen Zhu, Mingjie Sun, Zhuang Liu ·

    Small LLMs: Pruning vs. Training from Scratch

    arXiv:2606.14150v1 Announce Type: cross Abstract: Pruning promises a shortcut to strong small language models. In this work, we examine this promise by pruning Llama-3.1-8B at pruning ratios of 0.5–0.8 with six methods spanning depth, width, and sparse granularities, under two c…

  61. arXiv cs.CL TIER_1 English(EN) · Filip Trhlik, Aoife O'Flynn, Angela Yu, Arduin Findeis, Paula Buttery ·

    LLMs Contain Multitudes: How Deployment Context Reshapes Model-Level Preferences and Values

    arXiv:2606.13944v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly characterised in recent evaluation work as having stable, model-level preference and value systems. However, accompanying robustness checks are limited to incidental prompt perturbations…

  62. Hugging Face Daily Papers TIER_1 English(EN) ·

    ExpRL: Exploratory RL for LLM Mid-Training

    ExpRL uses human-written question-answer data as reward scaffolds to provide automated reinforcement learning priming for language models, outperforming traditional methods on math reasoning tasks.

  63. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Sajedul Talukder ·

    SkillVetBench: LLM-as-Judge for Multi-Dimensional Security Risk Evaluation in Open-Source LLM Agent Skills

    Open-source LLM agent ecosystems are growing rapidly, yet the security of community-contributed skills - modular tool definitions that extend agent capabilities - remains largely unvetted. The gap we fill: existing scanners operate at the code layer and are structurally blind to …

  64. Hugging Face Daily Papers TIER_1 English(EN) ·

    Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs

    Answer stability in large language models is evaluated through controlled challenges that measure response consistency when correct answers face plausible counterarguments, revealing significant variation in model reliability beyond traditional accuracy metrics.

  65. arXiv cs.LG TIER_1 English(EN) · Jialin Yu ·

    A Low-Rank Subspace Analysis of LLM Interventions

    Interventions designed to modify a particular behavior in LLMs, such as refusal or sycophancy, often produce unintended changes in other behaviors. This lack of targeted control makes it difficult to design and implement reliable safety controls. To understand these side-effects,…

  66. arXiv cs.AI TIER_1 English(EN) · Robert Palin ·

    I'm Sorry Driver, I'm Afraid I Can't Do That: Appraising the Safety of LLMs within Automotive Contexts

    This paper appraises recent frameworks within AI development to integrate LLMs into control tasks in automotive contexts from the perspective of safety assurance. This work has built upon the rapid integration of LLMs across automotive settings. However, we find that at present, …

  67. arXiv cs.CL TIER_1 English(EN) · Zhuang Liu ·

    Small LLMs: Pruning vs. Training from Scratch

    Pruning promises a shortcut to strong small language models. In this work, we examine this promise by pruning Llama-3.1-8B at pruning ratios of 0.5–0.8 with six methods spanning depth, width, and sparse granularities, under two controlled token-matched settings. (1) With the sam…

  68. arXiv cs.AI TIER_1 English(EN) · Prem Prakash Jayaraman ·

    FactoryLLM: A Safe and Open-Source AI Playground for Evaluating LLMs in Smart Factories

    Fault diagnostics and recovery in smart factories is challenging because critical information is dispersed across manuals of multiple machines which are interconnected through the manufacturing process. Large Language Models (LLMs) can provide a promising approach. In this paper,…

  69. arXiv cs.CL TIER_1 English(EN) · Sangho Kim, Heejin Kim, Yoonhee Park, Hyunggeun Jeon, Jaejin Lee ·

    Polar: A Benchmark for Evaluating Political Bias in LLMs

    arXiv:2606.12922v1 Announce Type: new Abstract: Political bias in large language models (LLMs) is increasingly significant, but difficult to measure reproducibly across political and linguistic contexts. We introduce Polar, a 4,026-instance multiple-choice benchmark that measures…

  70. arXiv cs.AI TIER_1 English(EN) · Ashutosh Hathidara, Sai Shruthi Sistla, Sebastian Schreiber, Sahil Bansal ·

    ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

    arXiv:2606.12451v1 Announce Type: new Abstract: Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. As embedding-based retrieval approaches rely on compact encoders that may under-capture specialized tool semantics, paramet…

  71. arXiv cs.AI TIER_1 English(EN) · Alyssa Unell, Miguel Fuentes, Brenna Li, Bridget Lin, Meena Jagadeesan, Sanmi Koyejo, Nigam Shah ·

    Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System

    arXiv:2606.12702v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly integrated into clinical systems, making it essential to evaluate the real-world utility of these systems. However, static benchmarks tend to measure correctness rather than user accepta…

  72. arXiv cs.AI TIER_1 English(EN) · Danica Dillion, Chen Cecilia Liu, Baihui Wang, Daniele Barolo, Tanmay Rajore, Niket Tandon, Pranathi Ravikumar, Kurt Gray ·

    LLMs Can Better Capture Human Judgments--With the Right Prompts

    arXiv:2606.12754v1 Announce Type: cross Abstract: Are large language models (LLMs) bad at capturing human judgment? Two commonly stated limitations are that LLMs fail to capture full distributions of responses, and that their judgments are unstable across wording variations. We d…

  73. arXiv cs.AI TIER_1 English(EN) · Benno Krojer, Shravan Nayak, Oscar Mañas, Vaibhav Adlakha, Desmond Elliott, Siva Reddy, Marius Mosbach ·

    LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs

    arXiv:2602.00462v4 Announce Type: replace-cross Abstract: Transforming a large language model (LLM) into a vision-language model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simpl…

  74. arXiv cs.CL TIER_1 English(EN) · Aviya Maimon, Amir DN Cohen, Gal Vishne, Shauli Ravfogel, Reut Tsarfaty ·

    From Benchmarks to Skills: Low-Rank Factors for LLM Evaluation

    arXiv:2507.20208v2 Announce Type: replace Abstract: Current evaluations of large language models (LLMs) rely heavily on a growing collection of benchmarks and on aggregate benchmark scores, yet it remains unclear what this comparison actually captures, and what these scores revea…

  75. arXiv cs.CL TIER_1 English(EN) · Laura Majer, Jan Šnajder, Martin Tutek ·

    Evaluating Pluralism in LLMs through Latent Perspectives

    arXiv:2606.13254v1 Announce Type: new Abstract: The growing need to represent diverse perspectives has increased interest in pluralistic LLM generation. Although difficult to operationalize, identifying perspectives expressed in text would provide clear guidance on pluralistic al…

  76. arXiv cs.CL TIER_1 English(EN) · Camilla Dalerci, Thilo Michael, Robin Schaefer, Daniel Weinland ·

    MÖVE: A Holistic LLM Benchmark for the German Public Sector

    arXiv:2606.13111v1 Announce Type: new Abstract: We present MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public ad…

  77. arXiv cs.CL TIER_1 English(EN) · Paula Buttery ·

    LLMs Contain Multitudes: How Deployment Context Reshapes Model-Level Preferences and Values

    Large language models (LLMs) are increasingly characterised in recent evaluation work as having stable, model-level preference and value systems. However, accompanying robustness checks are limited to incidental prompt perturbations such as syntax variation and option reordering.…

  78. arXiv cs.CL TIER_1 English(EN) · Lin Qu ·

    DLawBench: Evaluating LLMs Through Multi-Turn Legal Consultation

    Lawyer-client consultation is a critical starting point for legal services. Effective legal assistance hinges on eliciting sufficient and truthful information from clients in order to devise strategies that best protect their interests. This task requires Large Language Models (L…

  79. arXiv cs.CL TIER_1 English(EN) · Martin Tutek ·

    Evaluating Pluralism in LLMs through Latent Perspectives

    The growing need to represent diverse perspectives has increased interest in pluralistic LLM generation. Although difficult to operationalize, identifying perspectives expressed in text would provide clear guidance on pluralistic alignment and more clearly articulate the pluralis…

  80. arXiv cs.LG TIER_1 English(EN) · David Salinas ·

    From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation

    Evaluating new large language models typically requires costly human annotation campaigns at scale. LLM-as-a-judge offers a cheaper alternative, but judge scores carry systematic errors - such as position bias, self-preference, or intransitivity - that can strongly miscalibrate t…

  81. arXiv cs.CL TIER_1 English(EN) · Daniel Weinland ·

    MÖVE: A Holistic LLM Benchmark for the German Public Sector

    We present MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public administration, model selection remains largely ad hoc, …

  82. arXiv cs.CL TIER_1 English(EN) · Jaejin Lee ·

    Polar: A Benchmark for Evaluating Political Bias in LLMs

    Political bias in large language models (LLMs) is increasingly significant, but difficult to measure reproducibly across political and linguistic contexts. We introduce Polar, a 4,026-instance multiple-choice benchmark that measures political bias through option-level likelihoods…

  83. arXiv cs.CL TIER_1 English(EN) · Kiril Georgiev, Yuxia Wang, Dimitar Iliyanov Dimitrov, Preslav Nakov, Ivan Koychev ·

    Schützen: Evaluating LLM Safety in Bulgarian and German Contexts

    arXiv:2606.11316v1 Announce Type: new Abstract: Large language models are increasingly deployed across professional domains, bringing hard-to-predict risks, including the generation of harmful or disrespectful content. Although substantial progress has been made in developing saf…

  84. arXiv cs.AI TIER_1 English(EN) · Kaituo Zhang, Mingzhi Hu, Hoang Anh Duy Le, Fariha Kabir Torsha, Zhimeng Jiang, Minh Khai Bui, Chia-Yuan Chang, Yu-Neng Chuang, Zhen Xiong, Ying Lin, Guanchu Wang, Na Zou ·

    A Survey on Evaluating Quality and Trustworthiness in LLM-Generated Data

    arXiv:2601.17717v3 Announce Type: replace Abstract: Large Language Models (LLMs) have emerged as powerful tools for generating data across various modalities. By transforming data from a scarce resource into a controllable asset, LLMs mitigate the bottlenecks imposed by the acqui…

  85. arXiv cs.CL TIER_1 Deutsch(DE) · Sanjay Adhikesaven, Haoxiang Sun, Sewon Min ·

    Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

    arXiv:2606.12385v1 Announce Type: new Abstract: Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own…

  86. arXiv cs.CL TIER_1 English(EN) · Hongjian Zhou, Xinyu Zou, Jinge Wu, Sean Wu, Junchi Yu, Bradley Max Segal, Tobias Erich Niebuhr, Sara Amro, Michael Petrus, Sheikh Momin, Alexandra M. Cardoso Pinto, Rachel Niesen, Laura Sophie Wegner, Dhruv Darji, Jung Moses Koo, Joshua Fieggen, Kapil N… ·

    Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

    arXiv:2606.12291v1 Announce Type: new Abstract: Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients increasingly use them for health advice. We show this assu…

  87. arXiv cs.AI TIER_1 English(EN) · Orion Reblitz-Richardson ·

    When Probing Accuracy Saturates, Fragility Resolves: A Complementary Metric for LLM Pre-Training Analysis

    arXiv:2606.11375v1 Announce Type: cross Abstract: Standard linear probing declares a property "encoded" when a classifier on hidden states achieves high accuracy. The protocol works well on a snapshot but breaks across pre-training: probe accuracy saturates within the first few t…

  88. arXiv cs.LG TIER_1 English(EN) · Shuo Yang, Qihui Zhang, Yuyang Liu, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song, Li Yuan ·

    AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

    arXiv:2506.08473v4 Announce Type: replace Abstract: Fine-tuning large language models (LLMs) improves performance but introduces critical safety vulnerabilities: even minimal harmful data can severely compromise safety measures. We observe that perturbations orthogonal to the ali…

  89. arXiv cs.CL TIER_1 English(EN) · Muhammed Saeed, Simon Razniewski ·

    LLMpedia: A Transparent Framework to Materialize an LLM's Encyclopedic Knowledge at Scale

    arXiv:2603.24080v2 Announce Type: replace Abstract: Benchmarks like MMLU suggest flagship language models approach factuality saturation above 90%. LLMpedia shows this picture is incomplete. We materialize ~1.3M encyclopedia articles entirely from parametric memory…

  90. arXiv cs.CL TIER_1 English(EN) · Dongryeol Lee, Yerin Hwang, Taegwan Kang, Minwoo Lee, Younhyung Chae, Kyomin Jung ·

    Judging Against the Reference: Uncovering Knowledge-Driven Failures in LLM-Judges on QA Evaluation

    arXiv:2601.07506v2 Announce Type: replace Abstract: While large language models (LLMs) are increasingly used as automatic judges for question answering (QA) and other reference-conditioned evaluation tasks, little is known about their ability to adhere to a provided reference. We…

  91. arXiv cs.CL TIER_1 Deutsch(DE) · Sewon Min ·

    Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

    Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own dependencies are documented only in separate re…

  92. arXiv cs.CL TIER_1 English(EN) · David A. Clifton ·

    Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

    Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients increasingly use them for health advice. We show this assumption is fragile: when misleading context is in…

  93. arXiv cs.AI TIER_1 English(EN) · Sophie Hao, William Merrill ·

    A Theory of Training Profit-Optimal LLMs

    arXiv:2605.16430v2 Announce Type: replace-cross Abstract: Scaling LLMs requires tremendous computational resources, and recent advances in AI have gone hand in hand with massive amounts of capital expenditure. While it is established that scaling up LLMs reliably increases model …

  94. arXiv cs.AI TIER_1 English(EN) · Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Jingxiang Chen, Mohammad Kachuee, Teja Gollapudi, Yiwei Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, Xin Luna Dong ·

    TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

    arXiv:2509.25760v2 Announce Type: replace-cross Abstract: While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside thei…

  95. arXiv cs.AI TIER_1 English(EN) · Gabriel Freedman, Francesca Toni ·

    Superficial Beliefs in LLM Decision-Making

    arXiv:2606.11016v1 Announce Type: new Abstract: We ask whether large language models (LLMs) merely imitate rationales when choosing between two options, or whether their choices reflect a systematic underlying decision structure. Using synthetic binary decision settings in which …

  96. arXiv cs.LG TIER_1 English(EN) · Ke Li, Chongzhe Zhang, Zifan Zeng, Feng Liu, Qunli Zhang, Zheng Hu ·

    Calibrating Overconfidence Without Sacrificing Confidence: Probe-Conditioned Head Intervention for LLMs

    arXiv:2606.09876v1 Announce Type: new Abstract: Large language models often express high confidence in answers that are wrong. Standard calibration remedies typically act globally or at the score level, reducing unwarranted confidence but also risking erosion of warranted confide…

  97. arXiv cs.CL TIER_1 English(EN) · Qian Zhu, Xinnan Guo, Jingjing Huo, Jun Li, Pan Liu, Wenyan Yang, Wanqing Xu, Xuan Lin ·

    An Industrial-Scale Insurance LLM Achieving Verifiable Domain Mastery and Hallucination Control without Competence Trade-offs

    arXiv:2603.14463v2 Announce Type: replace Abstract: Adapting Large Language Models (LLMs) to high-stakes vertical domains like insurance presents a significant challenge: scenarios demand strict adherence to complex regulations and business logic with zero tolerance for hallucina…

  98. arXiv cs.CL TIER_1 English(EN) · Naihao Deng, Sheng Zhang, Henghui Zhu, Shuaichen Chang, Jiani Zhang, Alexander Hanbo Li, Chung-Wei Hang, Hideo Kobayashi, Yiqun Hu, Patrick Ng ·

    What Really Matters for Table LLMs? A Meta-Evaluation of Model and Data Effects

    arXiv:2501.14717v2 Announce Type: replace Abstract: Table modeling has progressed for decades. In this work, we revisit this trajectory and highlight emerging challenges in the LLM era, particularly the paradox of choice: the difficulty of attributing performance gains amid diver…

  99. arXiv cs.CL TIER_1 English(EN) · Yupu Hao, Zhuoran Jin, Huanxuan Liao, Kang Liu, Jun Zhao ·

    Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation

    arXiv:2606.10875v1 Announce Type: new Abstract: Large language models (LLMs) rely on tool use to act as autonomous agents, yet often fail in multi-step execution due to insufficient tool-related knowledge and ineffective knowledge activation. Therefore, we present a systematic st…

  100. Hugging Face Daily Papers TIER_1 Deutsch(DE) ·

    Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

    ModSleuth is an agentic system that recursively reconstructs large-scale dependency graphs for LLM development by analyzing public artifacts and resolving inconsistencies in documentation and artifact identities.

  101. arXiv cs.AI TIER_1 English(EN) · Francesca Toni ·

    Superficial Beliefs in LLM Decision-Making

    We ask whether large language models (LLMs) merely imitate rationales when choosing between two options, or whether their choices reflect a systematic underlying decision structure. Using synthetic binary decision settings in which models choose between profiles defined by graded…

  102. arXiv cs.CL TIER_1 English(EN) · Jun Zhao ·

    Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation

    Large language models (LLMs) rely on tool use to act as autonomous agents, yet often fail in multi-step execution due to insufficient tool-related knowledge and ineffective knowledge activation. Therefore, we present a systematic study on how knowledge influences tool-use perform…

  103. arXiv cs.AI TIER_1 English(EN) · Yanyan Luo, Xue Han, Ruiqiao Bai, Xin Huang, Yitong Wang, Qian Hu, Qing Wang, Chunxu Zhao, Jie Liu, Cong Geng, Lehao Xing, Pengwei Hu, Junlan Feng ·

    Personalization Meets Safety: Mechanisms, Risks, and Mitigations in Personalized LLMs

    arXiv:2606.09038v1 Announce Type: new Abstract: Large Language Models (LLMs) have enabled increasingly personalized interactions by adapting to users' preferences, contexts, and long-term histories. However, the mechanisms that enable personalization also expand the safety landsc…

  104. arXiv cs.AI TIER_1 English(EN) · Mohsinul Kabir, Tasnim Ahmed, Md Mezbaur Rahman, Shaoxiong Ji, Hassan Alhuzali, Yuechen Jiang, Jimin Huang, Sophia Ananiadou ·

    XCR-Bench: Benchmarking Cross-Cultural Reasoning in LLMs via Culture-Specific Items and Hall's Triad

    arXiv:2601.14063v2 Announce Type: replace-cross Abstract: Cross-cultural competence in large language models (LLMs) requires understanding and adapting Culture-Specific Items (CSIs) across varying cultural contexts. However, progress in evaluating this capability remains limited …

  105. arXiv cs.AI TIER_1 English(EN) · Yasushi Sakai, Allen Song, Kent Larson ·

    When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

    arXiv:2606.08098v1 Announce Type: new Abstract: Majority voting over sampled answers is the dominant unsupervised aggregator for multi-sample LLM inference. We show that piping the signals every sample carries into a delegation-based aggregator (Propagational Proxy Voting, PPV) y…

  106. arXiv cs.AI TIER_1 English(EN) · Anissa Alloula, Federico Licini, Ava Batchkala, Seraphina Goldfarb-Tarrant ·

    Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators

    arXiv:2606.07874v1 Announce Type: new Abstract: LLMs-as-judges are the only way to evaluate safety at scale. Despite their importance, LLM-judges themselves are rarely evaluated beyond human agreement in simple, static benchmarks. We therefore investigate two under-explored but c…

  107. arXiv cs.AI TIER_1 English(EN) · Gaojie Jin, Yong Tao, Lijia Yu, Tianjin Huang ·

    Margin-Adaptive Confidence Ranking for Reliable LLM Judgement

    arXiv:2605.15416v2 Announce Type: replace-cross Abstract: Jung et al. (2025) introduce a hypothesis testing framework for guaranteeing agreement between large language models (LLMs) and human judgments, relying on the assumption that the model's estimated confidence is monotonic …

  108. arXiv cs.AI TIER_1 English(EN) · Jianhui Chen, Yuzhang Luo, Liangming Pan ·

    Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

    arXiv:2601.21996v2 Announce Type: replace-cross Abstract: While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Inf…

  109. arXiv cs.LG TIER_1 English(EN) · Dennis Frauen, Athiya Deviyani, Mihaela van der Schaar, Stefan Feuerriegel ·

    Nonparametric LLM Evaluation from Preference Data

    arXiv:2601.21816v2 Announce Type: replace Abstract: Evaluating the performance of large language models (LLMs) from human preference data is crucial for obtaining LLM leaderboards. However, many existing approaches either rely on restrictive parametric assumptions or lack valid u…

  110. arXiv cs.AI TIER_1 English(EN) · Haoran Xu ·

    Cherry-pick Override: Unsafe Directional Commitment in LLM Judges under Mixed Evidence

    arXiv:2606.07834v1 Announce Type: cross Abstract: LLM judges increasingly turn verdicts into system commitments. Under mixed evidence (claims with both supporting and refuting sources) this is unsafe: when the schema exposes CONFLICTING as the authorized non-directional verdict, …

  111. arXiv cs.CL TIER_1 English(EN) · Yerzhan Sapenov, Jaromir Savelka ·

    mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages?

    arXiv:2606.07069v1 Announce Type: new Abstract: We introduce mmPISA-bench, a compact high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). The benchmark consists of 25 multiple-choice questions that require reas…

  112. arXiv cs.CL TIER_1 English(EN) · Parisa Rabbani, Priyam Sahoo, Ruben Mathew, Aishee Mondal, Harshita Ketharaman, Nimet Beyza Bozdag, Dilek Hakkani-Tür ·

    DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference

    arXiv:2601.10896v2 Announce Type: replace Abstract: LLMs are increasingly used as third-party judges, yet their reliability when evaluating speakers in dialogue remains poorly understood. We show that LLMs judge identical claims differently depending on framing: the same content …

  113. arXiv cs.CL TIER_1 English(EN) · Shaiv Patel, Kartik Narayan, Vishal Patel ·

    PromptPrint: Behavioral Biometrics Through Natural Language Prompting in LLMs

    arXiv:2606.06755v1 Announce Type: new Abstract: Authorship attribution research has traditionally focused on long-form, expressive texts; however, interactions with large language models (LLMs) are typically brief and task-driven prompts. This raises a fundamental question: do su…

  114. arXiv cs.AI TIER_1 English(EN) · Guoxi Zhang, Jiawei Chen, Tianzhuo Yang, Lang Qin, Juntao Dai, Yaodong Yang, Jingwei Yi ·

    Stable Reasoning, Unstable Responses: Mitigating LLM Deception via Stability Asymmetry

    arXiv:2603.26846v2 Announce Type: replace-cross Abstract: As Large Language Models (LLMs) expand in capability and application scope, their trustworthiness becomes critical. A vital risk is intrinsic deception, wherein models strategically mislead users to achieve their own objec…

  115. arXiv cs.AI TIER_1 English(EN) · Gonzalo Mancera, Daniel DeAlcala, Aythami Morales, Julian Fierrez, Ruben Tolosana, Francisco Jurado ·

    Auditing Training Data in Domain-adapted LLMs: LoRA-MINT

    arXiv:2606.06946v1 Announce Type: cross Abstract: We present LoRA-MINT, a new methodology for Membership Inference Test (MINT) applied to recent Large Language Models (LLMs) fine-tuned for specific Natural Language Processing (NLP) tasks through Low-Rank Adaptation (LoRA). The pr…

  116. arXiv cs.CL TIER_1 English(EN) · Wei Lu ·

    When Languages Disagree: Self-Evolving Multilingual LLM Judges

    Multilingual LLM-as-a-judge is widely used to evaluate model outputs across languages, but suffers from cross-lingual inconsistency (Fu and Liu, 2025). Existing methods typically treat this inconsistency as noise and mitigate it through voting or aggregation. In this work, we ins…

  117. arXiv cs.AI TIER_1 English(EN) · Cristina Carleo, Pietro Liguori, Naghmeh Ivaki, Domenico Cotroneo ·

    Willing but Unable: Separating Refusal from Capability in Code LLMs via Abliteration

    arXiv:2606.05396v1 Announce Type: cross Abstract: Producing labeled vulnerable code at scale is a recurring obstacle for learning-based vulnerability detection: mined corpora carry substantial label noise, and existing LLM-based augmentation propagates these inaccuracies becaus…

  118. arXiv cs.AI TIER_1 English(EN) · Oleg Somov, Mikhail Chaichuk, Gleb Ershov, Karim Vafin, Mikhail Seleznyov, Alexander Panchenko, Elena Tutubalina ·

    Breaking the Chain: A Causal Analysis of LLM Faithfulness to Intermediate Structures

    arXiv:2603.16475v2 Announce Type: replace Abstract: In schema-guided reasoning (SGR) pipelines, LLMs produce explicit intermediate structures -- rubrics, checklists, or verification queries -- before committing to a final decision. SGR is increasingly adopted because it promises …

  119. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Haoran Xu ·

    Cherry-pick Override: Unsafe Directional Commitment in LLM Judges under Mixed Evidence

    LLM judges increasingly turn verdicts into system commitments. Under mixed evidence (claims with both supporting and refuting sources) this is unsafe: when the schema exposes CONFLICTING as the authorized non-directional verdict, returning SUPPORTS/REFUTES is an unauthorized dire…

  120. arXiv cs.LG TIER_1 English(EN) · Yehua Wei ·

    Online Pandora's Box for Contextual LLM Cascading

    Motivated by Large Language Model (LLM) cascading, we propose an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs. In each period, a decision-maker observes a request context and faces a two-phase decision problem. In the query phase, the decis…

  121. arXiv cs.CL TIER_1 English(EN) · Jaromir Savelka ·

    mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages?

    We introduce mmPISA-bench, a compact high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). The benchmark consists of 25 multiple-choice questions that require reasoning in order to be answered correctly. Each qu…

  122. arXiv cs.CL TIER_1 English(EN) · Francisco Jurado ·

    Auditing Training Data in Domain-adapted LLMs: LoRA-MINT

    We present LoRA-MINT, a new methodology for Membership Inference Test (MINT) applied to recent Large Language Models (LLMs) fine-tuned for specific Natural Language Processing (NLP) tasks through Low-Rank Adaptation (LoRA). The primary goal is to assess whether individual samples…

  123. arXiv cs.CL TIER_1 English(EN) · Aofan Yu, Chenyu Zhou, Tianyi Xu, Zihan Guo, Rong Shan, Zhihui Fu, Jun Wang, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin ·

    LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents

    arXiv:2606.06087v1 Announce Type: new Abstract: Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills into the prompt at every step incurs substantial context overhead and exposes skill content as plaintext. We present Latent…

  124. arXiv cs.CL TIER_1 English(EN) · Amirhossein Ghaffari, Ali Goodarzi, Huong Nguyen, Simo Hosio, Lauri Lovén, Ekaterina Gilman ·

    RedditPersona: A Modular Framework for Community-Conditioned LLM Adaptation from Reddit

    arXiv:2606.06027v1 Announce Type: cross Abstract: Community-conditioned language model adaptation requires choices about data collection, community definition, and evaluation that are currently made independently in each study, making it hard to compare assumptions or reuse artif…

  125. arXiv cs.CL TIER_1 English(EN) · Kuan-Yen Chen, Fang-Yi Su, Jung-Hsien Chiang ·

    The Self-Correction Illusion: LLMs Correct Others but Not Themselves

    arXiv:2606.05976v1 Announce Type: cross Abstract: Recent work shows that LLM agents struggle to correct errors in their own reasoning traces yet show markedly higher correction rates when identical claims appear under external sources. We ask whether this asymmetry reflects a cap…

  126. arXiv cs.CL TIER_1 English(EN) · Taewon Yun, Hyeonseong Park, Jeonghwan Choi, Hayoon Park, Yeeun Choi, Hwanjun Song ·

    SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations

    arXiv:2606.05563v1 Announce Type: cross Abstract: Evaluating LLM mediators remains challenging, as mediation unfolds as a real-time trajectory shaped by disputants' shifting emotions, intentions, and context. Existing testbeds rely on a few expert-authored domains, vary mainly st…

  127. arXiv cs.CL TIER_1 English(EN) · Srimonti Dutta, Akshata Kishore Moharir ·

    Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

    arXiv:2606.05384v1 Announce Type: cross Abstract: LLM-as-judge evaluation is widely used in benchmarking pipelines, where model outputs are compared and ranked using automated evaluators. These pipelines typically assume that judgments are stable properties of fixed inputs. We sh…

  128. arXiv cs.CL TIER_1 English(EN) · Zhihao Wu, Linhai Zhang, Taiyi Wang, Runcong Zhao, Peter Andrews, Cesare Aloisi, Yulan He ·

    EDIT: Evidence-Diagnosed Intervention Training for Rule-Faithful LLM Grading

    arXiv:2606.06350v1 Announce Type: new Abstract: Reliable rubric grading requires more than accurate score prediction. Each judgement must be grounded in the mark scheme and evidence from the student answer. Existing credit-assignment and intervention methods, primarily designed f…

  129. arXiv cs.CL TIER_1 English(EN) · Gianluca Barmina, Peter Schneider-Kamp, Lukas Galke Poech ·

    LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs

    arXiv:2606.06286v1 Announce Type: new Abstract: Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a propensity-awar…

  130. arXiv cs.CL TIER_1 English(EN) · Michiro Asai, Ailiang Lin, Yu Kishimoto, Takao Obi, Satoshi Kosugi, Kotaro Funakoshi, Manabu Okumura ·

    Can LLMs Be Constrained to the Past? Improving Knowledge Cutoff through Recall-Based Prompting

    arXiv:2606.05804v1 Announce Type: new Abstract: Prompted knowledge cutoff instructs a large language model (LLM) to act as if information beyond a specified cutoff date were unavailable. However, prior work mainly relies on direct-answer generation, which struggles when post-cuto…

  131. arXiv cs.CL TIER_1 English(EN) · Hong Qian, Yuanhao Liu, Zihan Zhou, Zongbao Zhang, Hanjie Ge, Haotian Shi, Liang Dou, Xiangfeng Wang, Jingwen Yang, Aimin Zhou ·

    CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement

    arXiv:2606.05793v1 Announce Type: new Abstract: While LLM-based agents excel at individual tasks, effective collaboration with realistic human partners remains challenging. Most of the existing conversation-level collaborative studies lack grounded interaction and behavioral exec…

  132. arXiv cs.LG TIER_1 English(EN) · Arslan Bisharat, Brian Ortiz, Eric Spencer, Khushboo Bhadauria, TaiNing Wang, George K. Thiruvathukal, Konstantin Laufer, Mohammed Abuhamad ·

    Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

    arXiv:2606.05792v1 Announce Type: cross Abstract: TLA+ has supported industrial verification at companies such as Amazon and Microsoft, yet writing correct TLA+ specifications from natural language still requires time and expertise, which limits adoption. LLMs show promise, but n…

  133. arXiv cs.LG TIER_1 English(EN) · Rohan N. Pradhan, Steve Goley ·

    Trust, but Don't Verify: Epistemic Blind Spots in LLM Source Evaluation

    arXiv:2606.05403v1 Announce Type: new Abstract: Language models increasingly act as epistemic proxies, synthesizing evidence from multiple sources to inform decisions. Whether they evaluate the quality of that evidence, or merely aggregate it based on surface presentation, remain…

  134. Hugging Face Daily Papers TIER_1 English(EN) ·

    The Cold-Start Safety Gap in LLM Agents

    Tool-calling language model agents exhibit improved safety after initial interactions, with a systematic benchmark demonstrating enhanced security through prior task completion.

  135. arXiv cs.CL TIER_1 English(EN) · Vishal Patel ·

    PromptPrint: Behavioral Biometrics Through Natural Language Prompting in LLMs

    Authorship attribution research has traditionally focused on long-form, expressive texts; however, interactions with large language models (LLMs) are typically brief and task-driven prompts. This raises a fundamental question: do such prompts contain a stable, author-identifiable…

  136. arXiv cs.CL TIER_1 English(EN) · Yulan He ·

    EDIT: Evidence-Diagnosed Intervention Training for Rule-Faithful LLM Grading

    Reliable rubric grading requires more than accurate score prediction. Each judgement must be grounded in the mark scheme and evidence from the student answer. Existing credit-assignment and intervention methods, primarily designed for self-contained reasoning tasks such as mathem…

  137. arXiv cs.AI TIER_1 English(EN) · Lukas Galke Poech ·

    LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs

    Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a propensity-aware framework for memorization evaluation that con…

  138. arXiv cs.CL TIER_1 English(EN) · Jianghao Lin ·

    LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents

    Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills into the prompt at every step incurs substantial context overhead and exposes skill content as plaintext. We present LatentSkill, a framework that converts textual skills …

  139. arXiv cs.CL TIER_1 English(EN) · Ekaterina Gilman ·

    RedditPersona: A Modular Framework for Community-Conditioned LLM Adaptation from Reddit

    Community-conditioned language model adaptation requires choices about data collection, community definition, and evaluation that are currently made independently in each study, making it hard to compare assumptions or reuse artifacts. We present RedditPersona, a modular framewor…

  140. arXiv cs.CL TIER_1 English(EN) · Jung-Hsien Chiang ·

    The Self-Correction Illusion: LLMs Correct Others but Not Themselves

    Recent work shows that LLM agents struggle to correct errors in their own reasoning traces yet show markedly higher correction rates when identical claims appear under external sources. We ask whether this asymmetry reflects a capability deficit or a role-label artifact: does an …

  141. Hugging Face Daily Papers TIER_1 English(EN) ·

    Can LLMs Be Constrained to the Past? Improving Knowledge Cutoff through Recall-Based Prompting

    Prompted knowledge cutoff instructs a large language model (LLM) to act as if information beyond a specified cutoff date were unavailable. However, prior work mainly relies on direct-answer generation, which struggles when post-cutoff knowledge is not explicitly queried but is on…

  142. arXiv cs.CL TIER_1 English(EN) · Zhenyu Yu, Shuigeng Zhou ·

    Caliper: Probing Lexical Anchors versus Causal Structure in LLMs

    arXiv:2606.04915v1 Announce Type: new Abstract: Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled perturbation…

  143. arXiv cs.CL TIER_1 English(EN) · XiuYu Zhang, Yi Shan, Junfeng Fang, Zhenkai Liang ·

    Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

    arXiv:2606.05122v1 Announce Type: new Abstract: Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted training: promp…

  144. arXiv cs.CL TIER_1 English(EN) · Nuoyan Lyu, Bingbing Xu, Xueyun Tian, Weihao Meng, Yige Yuan, Yang Zhang, Zhiyong Huang, Tat-Seng Chua, Huawei Shen ·

    GIFT: Games as Informal Training for Generalizable LLMs

    arXiv:2601.05633v2 Announce Type: replace Abstract: Recent LLMs excel at formal tasks such as mathematical reasoning and code generation, but still struggle with broader abilities such as planning, creativity, and social intelligence. Inspired by human learning, where formal inst…

  145. arXiv cs.LG TIER_1 English(EN) · Rachit Bansal, Clara Mohri, Tian Qin, David Alvarez-Melis, Sham Kakade ·

    RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training

    arXiv:2606.04272v1 Announce Type: new Abstract: The standard LLM training pipeline applies reinforcement learning (RL) only after pre-training and supervised fine-tuning (SFT). We question this status quo by training an LLM from scratch and applying RL, SFT, and SFT followed by RL…

  146. arXiv cs.AI TIER_1 English(EN) · Zacharie Bugaud ·

    Unpredictable Safety: Domain-Dependent Compliance and the Transparency Gap in Open-Weight LLMs

    arXiv:2606.04035v1 Announce Type: cross Abstract: We present a systematic study of domain-dependent safety behavior in open-weight LLMs: 7 standardized experiments across 7 ethical domains, testing 5 models (12B–70B) in 4,200 interactions with dual-judge validation. Using a dual…

  147. arXiv cs.AI TIER_1 English(EN) · Liang Shan, Kaicheng Shen, Wen Wu, Zhenyu Ying, Chaochao Lu, Yan Teng, Jingqi Huang, Qingshan Liu, Guangze Ye, Guoqing Wang, Jie Zhou, Liang He ·

    MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Domain Risks in LLMs

    arXiv:2511.07107v3 Announce Type: replace Abstract: Ensuring the safety of Large Language Models (LLMs) is critical for real-world deployment. However, current safety measures often fail to address implicit, domain-specific risks. To investigate this gap, we introduce a dataset o…

  148. arXiv cs.AI TIER_1 English(EN) · Huashan Sun, Shengyi Liao, Yansen Han, Yu Bai, Yang Gao, Cheng Fu, Weizhou Shen, Fanqi Wan, Ming Yan, Ji Zhang, Fei Huang ·

    SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization

    arXiv:2505.11166v3 Announce Type: replace-cross Abstract: Despite advances in pretraining with extended context sizes, large language models (LLMs) still face challenges in effectively utilizing real-world long-context information, primarily due to insufficient long-context align…

  149. arXiv cs.AI TIER_1 English(EN) · Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan, Santu Karmaker, Aritra Dutta ·

    FinTradeBench: A Financial Reasoning Benchmark for LLMs

    arXiv:2603.19225v3 Announce Type: replace-cross Abstract: Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundamentals derived from regulatory filings and trading signals computed from price dynam…

  150. arXiv cs.CL TIER_1 English(EN) · Ming-Hao Hsu, Xiaohai Tian, Jun Zhang, Zhizheng Wu ·

    Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention

    arXiv:2606.04474v1 Announce Type: new Abstract: Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this modality gap is not a uniform cognitive deficit. Evaluating three diverse SLLMs, we show speech-to-text (S2T) matche…

  151. Hugging Face Daily Papers TIER_1 English(EN) ·

    LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs

    PropMe framework evaluates language model memorization by distinguishing between forced reproduction capabilities and natural propensity, using SimpleTrace for deterministic attribution and propensity-transformed metrics across open models and datasets.

  152. Hugging Face Daily Papers TIER_1 English(EN) ·

    LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents

    LatentSkill enables efficient deployment of textual skills in agent systems by converting them into LoRA adapters stored in weight space, reducing context overhead while maintaining modularity and composability.

  153. Hugging Face Daily Papers TIER_1 English(EN) ·

    ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

    Parametric tool retrieval models show reduced performance and understanding when evaluated with realistic ambiguous queries compared to standard benchmarks, revealing a dissociation between knowledge retrieval and true tool comprehension.

  154. Hugging Face Daily Papers TIER_1 English(EN) ·

    SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations

    SoCRATES presents a realistic multi-domain benchmark for evaluating proactive LLM mediators across various socio-cognitive adaptation axes, demonstrating that even top-performing models only resolve about one-third of the consensus gap in conflict resolution.

  155. Hugging Face Daily Papers TIER_1 English(EN) ·

    Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

    Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted training: prompted few-shot, a base model already predicts an e…

  156. arXiv cs.CL TIER_1 English(EN) · Zhenkai Liang ·

    Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

    Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted training: prompted few-shot, a base model already predicts an e…

  157. arXiv cs.CL TIER_1 English(EN) · Shuigeng Zhou ·

    Caliper: Probing Lexical Anchors versus Causal Structure in LLMs

    Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled perturbation that replaces semantic variable names with plac…

  158. arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Mohammad Aliannejadi ·

    Improving the Efficiency and Effectiveness of LLM Knowledge Distillation for Conversational Search

    Conversational Search (CS) considers retrieval of relevant documents based on conversational context. Large Language Models (LLMs) have significantly enhanced CS by enabling effective query rewriting. However, employing LLMs during inference poses efficiency challenges. A method …

  159. arXiv cs.CL TIER_1 English(EN) · Zhizheng Wu ·

    Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention

    Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this modality gap is not a uniform cognitive deficit. Evaluating three diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, synt…

  160. arXiv cs.AI TIER_1 English(EN) · Akshatha Srikantha, Manpreet Singh, Yash Jajoo, Shyamal Lakhanpal ·

    TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment

    arXiv:2606.03036v1 Announce Type: new Abstract: LLMs have evolved from basic chatbots to the backbone of the AI ecosystem, now widely used in healthcare, schools, and government services. The domain-wide adoption of LLMs necessitates continuous evaluation to ensure their safety a…

  161. arXiv cs.CL TIER_1 English(EN) · Chaoyi Xiang, Olga Ohrimenko, Benjamin I. P. Rubinstein, Lea Frermann ·

    Multilingual Unlearning in LLMs: Transfer, Dynamics, and Reversibility

    arXiv:2606.03291v1 Announce Type: new Abstract: Large language models (LLMs) can memorize sensitive facts, motivating unlearning methods that remove targeted knowledge without costly retraining. However, unlearning research remains heavily English-centric. We study multilingual u…

  162. arXiv cs.CL TIER_1 English(EN) · Sourabrata Mukherjee, Hamna Hamna, Kalika Bali, Sunayana Sitaram ·

    The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment

    arXiv:2606.03043v1 Announce Type: new Abstract: LMs-as-judges are now standard, yet judges agree strongly with one another while agreeing only weakly with humans. We test whether this reflects shared signal or shared bias by measuring four geometric quantities on the standard LLM…

  163. arXiv cs.AI TIER_1 English(EN) · Lukas Fesser, Yasha Ektefaie, Ada Fang, Sham M. Kakade, Marinka Zitnik ·

    Evaluating Relational Reasoning in LLMs with REL

    arXiv:2604.12176v2 Announce Type: replace Abstract: Relational reasoning is the ability to infer relations that jointly bind multiple entities, attributes, or variables. This ability is central to scientific reasoning, but existing evaluations of relational reasoning in large lan…

  164. arXiv cs.AI TIER_1 English(EN) · Yang Xu, Zihuai Xu, Hongli Xu, Yunming Liao, Zhiwei Yao, Xitong Fu ·

    ReLoRA: Knowledge-Reusing Adaptation for Fast Rollout of Evolving LLM Services

    arXiv:2606.02606v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly deployed as continuously evolving services, where frequent base-model updates may invalidate previously deployed task-specific Low-Rank Adaptation (LoRA) adapters. For service provider…

  165. arXiv cs.AI TIER_1 English(EN) · Xu Wan, Speed Zhu, Jianwei Cai, Guang Chen, XiMing Huang, Wiggin Zhou, Mingyang Sun ·

    The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs

    arXiv:2606.03092v1 Announce Type: new Abstract: Inference-time scaling has emerged as a critical avenue for enhancing Large Language Models' performance, yet real-world deployment is constrained by strict computational budgets. In this work, we formulate inference budget allocati…

  166. arXiv cs.CL TIER_1 English(EN) · Yuhan Wang, Shiyu Ni, Zhikai Ding, Zihang Zhan, Yuanzi Li, Keping Bi ·

    Evaluating and Calibrating LLM Confidence on Questions with Multiple Correct Answers

    arXiv:2602.07842v2 Announce Type: replace Abstract: Confidence calibration is essential for making large language models (LLMs) reliable, yet existing training-free methods have been primarily studied under single-answer question answering. In this paper, we show that these metho…

  167. arXiv cs.CL TIER_1 English(EN) · Lisa Bouger, Théo Lasnier, Philippe Looubet Moundi, Yannick Teglia, Djamé Seddah ·

    Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

    arXiv:2606.03785v1 Announce Type: new Abstract: Backdoor attacks in Large Language Models (LLMs) are a growing security concern, where models can generate adversary-chosen content. Existing defenses target backdoors one at a time and typically require knowledge of the trigger, le…

  168. arXiv cs.CL TIER_1 English(EN) · Xuan Yang, Hao Xu, Tingfeng Hui, Hongsheng Xin, Kaike Zhang, Chunxiao Liu, Ning Miao ·

    Beyond Ideal Instruction: A Comprehensive Framework for Evaluating LLMs in Realistic Interactions

    arXiv:2606.03318v1 Announce Type: new Abstract: Despite great advances in tool-use capabilities of large language models (LLMs), existing evaluation benchmarks struggle to fully align with real-world scenarios. Such benchmarks mostly rely on simulated idealized user assumptions a…

  169. Hugging Face Daily Papers TIER_1 English(EN) ·

    Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

    Self-Evaluation Elicitation (SEE) method improves model calibration for quality assessment through calibration-coupled reinforcement learning and masked distillation, demonstrating transferable quality evaluation beyond specific judge preferences.

  170. Hugging Face Daily Papers TIER_1 English(EN) ·

    Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

    Backdoor attacks in Large Language Models (LLMs) are a growing security concern, where models can generate adversary-chosen content. Existing defenses target backdoors one at a time and typically require knowledge of the trigger, leaving the defender at a structural disadvantage …

  171. arXiv cs.CL TIER_1 English(EN) · Djamé Seddah ·

    Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

    Backdoor attacks in Large Language Models (LLMs) are a growing security concern, where models can generate adversary-chosen content. Existing defenses target backdoors one at a time and typically require knowledge of the trigger, leaving the defender at a structural disadvantage …

  172. arXiv cs.CL TIER_1 English(EN) · Ning Miao ·

    Beyond Ideal Instruction: A Comprehensive Framework for Evaluating LLMs in Realistic Interactions

    Despite great advances in tool-use capabilities of large language models (LLMs), existing evaluation benchmarks struggle to fully align with real-world scenarios. Such benchmarks mostly rely on simulated idealized user assumptions and lack experience-oriented evaluation. These l…

  173. arXiv cs.CL TIER_1 English(EN) · Lea Frermann ·

    Multilingual Unlearning in LLMs: Transfer, Dynamics, and Reversibility

    Large language models (LLMs) can memorize sensitive facts, motivating unlearning methods that remove targeted knowledge without costly retraining. However, unlearning research remains heavily English-centric. We study multilingual unlearning by extending the TOFU benchmark to fiv…

  174. arXiv cs.AI TIER_1 English(EN) · Yufeng Wang ·

    Doing What They Say, Not What They Reason: Locating the Faithfulness Gap in LLM Agents

    arXiv:2606.00476v1 Announce Type: new Abstract: Do LLM agents act on the reasoning they state? This question of process fidelity is central to using LLMs in social simulation, yet it is hard to measure where no reference for correct behavior exists. We study it in a controlled set…

  175. arXiv cs.LG TIER_1 English(EN) · Weitao Li, Hao Zhou, Xuanyu Lei, Fandong Meng, Yuanhang Liu, Jingyi Ren, Ante Wang, Xiaolong Wang, Yuanchi Zhang, Fuwen Luo, Guangwen Yang, Lin Gan, Weizhi Ma, Yang Liu ·

    Enhancing LLM Metacognition via Cognitive Pairwise Training

    arXiv:2606.00869v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has become central to LLM reasoning, but its outcome-level rewards can make models more willing to give confident answers when evidence or reasoning is unreliable. Existing SFT o…

  176. arXiv cs.CL TIER_1 English(EN) · Yoonah Park, Haesung Pyun, Yohan Jo ·

    Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions

    arXiv:2509.23782v4 Announce Type: replace Abstract: While large language models (LLMs) perform strongly on diverse tasks, their trustworthiness is limited by erratic behavior that is unfaithful to their internal knowledge. In particular, LLMs often fail on multiple-choice questio…

  177. arXiv cs.CL TIER_1 English(EN) · Siddhesh Milind Pawar, Sarah Masud, Haneul Yoo, Alice Oh, Isabelle Augenstein ·

    Not What, But How: A Communicative Audit of LLM Response Framing

    arXiv:2606.02493v1 Announce Type: new Abstract: Large language models (LLMs) are being increasingly used to answer subjective, information-seeking questions, where users are sensitive to how responses are communicated, not just whether the answers are correct. Existing LLM evalua…

  178. arXiv cs.CL TIER_1 English(EN) · Yangfan Ye, Xiaocheng Feng, Jialong Tang, Xiayu Cao, Zihan Zhang, Xiachong Feng, Baosong Yang, Bing Qin ·

    CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs

    arXiv:2606.01879v1 Announce Type: new Abstract: Existing research largely reduces cultural intelligence in LLMs to a knowledge-level problem, overlooking whether models can effectively utilize their acquired knowledge in realistic scenarios. To bridge this gap, we introduce Cultu…

  179. arXiv cs.CL TIER_1 English(EN) · Yubo Gao, Haotian Wu, Hong Chen, Junquan Huang, Yibo Yan, Jungang Li, Zihao Dongfang, Sicheng Tao, Puay Siew Tan, Jie Zhang, Xuming Hu ·

    Thinking Economically: A Hierarchical Framework for Adaptive-Complexity Reasoning in LLMs

    arXiv:2606.01168v1 Announce Type: new Abstract: Chain-of-Thought (CoT) has significantly enhanced LLM reasoning, yet often incurs substantial computational overhead due to "overthinking": generating excessively long rationales without commensurate accuracy gains. Existing efficie…

  180. arXiv cs.CL TIER_1 English(EN) · Andrew Aquilina, Chetna Nihalani, Vasudha Varadarajan, Nathan S. Fishbein, Yu-Ru Lin, Maarten Sap ·

    Lost in Delusion: Examining LLM Safety Under User Delusions and Distress

    arXiv:2606.00975v1 Announce Type: new Abstract: LLM chatbots increasingly serve as a first source of support for people in psychological distress, including those whose distress is entangled with delusional beliefs. Prior work on LLM mental-health safety largely evaluates general…

  181. arXiv cs.CL TIER_1 English(EN) · F. Carichon, S. Sharma, M. Girard, R. Rampa, G. Farnadi ·

    IDEAFix: Evaluation Framework for Creative Defixation Prompting in LLMs

    arXiv:2606.00875v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for tasks involving creative problem solving and idea generation. However, there is a lack of consensus concerning their creative capabilities: some studies report superior performa…

  182. arXiv cs.CL TIER_1 English(EN) · Wajdi Zaghouani ·

    Toward Responsible and Epistemically Grounded Multilingual LLMs for Computational Social Science and Humanities

    arXiv:2606.00596v1 Announce Type: new Abstract: Large language models have rapidly evolved in multilingual competence and reasoning capacity, enabling their integration into Social Sciences and Humanities research workflows. Yet existing evaluation paradigms remain anchored in ta…

  183. arXiv cs.CL TIER_1 English(EN) · Delip Rao, Chris Callison-Burch ·

    Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

    arXiv:2606.00093v1 Announce Type: new Abstract: Validating an LLM judge against human annotations usually means reporting several agreement statistics: accuracy, precision, recall, F1, Cohen's κ, and one or more rank correlations. A survey of 24 recent LLM-as-judge pape…

  184. arXiv cs.AI TIER_1 English(EN) · Shei Pern Chua, Zhen Leng Thai, Kai Jun Teh, Xiao Li, Qibing Ren, Xiaolin Hu ·

    Between a Rock and a Hard Place: The Tension Between Ethical Reasoning and Safety Alignment in LLMs

    arXiv:2509.05367v5 Announce Type: replace-cross Abstract: Large Language Model safety alignment predominantly operates on a binary assumption that requests are either safe or unsafe. This classification proves insufficient when models encounter ethical dilemmas, where the capacit…

  185. arXiv cs.AI TIER_1 English(EN) · Atoosa Chegini, Soheil Feizi ·

    Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

    arXiv:2606.01682v1 Announce Type: cross Abstract: Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths. PRM guided search avoids…

  186. arXiv cs.AI TIER_1 English(EN) · Jiaming Qu, Lucheng Fu, Yibo Hu ·

    Easier to Mislead Than to Correct: Harmful and Beneficial Revision in LLM Conformity

    arXiv:2606.01637v1 Announce Type: cross Abstract: Large language models are increasingly used in multi-agent systems, where they see and respond to other agents' answers. A key risk is conformity: a model may abandon its own answer simply because others agree on a different one. …

  187. arXiv cs.AI TIER_1 English(EN) · Yu-An Lu, Ci-Yang Tsai, Yu-Lin Tsai, Raluca Ada Popa, Chia-Mu Yu ·

    Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs

    arXiv:2606.00642v1 Announce Type: new Abstract: Reasoning traces have become a valuable form of learning signals for improving and transferring the capabilities of large language models. In particular, detailed traces can help distill reasoning behavior from stronger teacher mode…

  188. arXiv cs.AI TIER_1 English(EN) · Haoyan Yang, Reza Shirkavand, Yukai Jin, Jiawei Zhou, Shangqian Gao, Heng Huang ·

    Capability Self-Assessment: Teaching LLMs to Know Their Limits

    arXiv:2606.00251v1 Announce Type: new Abstract: The ability to recognize one's own limitations and decide whether to solve a problem or delegate is fundamental for reliable intelligent systems. Yet we show that modern large language models systematically lack this ability: across…

  189. Hugging Face Daily Papers TIER_1 English(EN) ·

    TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment

    LLMs have evolved from basic chatbots to the backbone of the AI ecosystem, now widely used in healthcare, schools, and government services. The domain-wide adoption of LLMs necessitates continuous evaluation to ensure their safety and fairness. Common issues encountered after dep…

  190. Hugging Face Daily Papers TIER_1 English(EN) ·

    The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs

    Inference-time scaling is enhanced through constrained optimization that allocates computational resources based on economic principles, improving performance in resource-constrained environments.

  191. arXiv cs.CL TIER_1 English(EN) · Isabelle Augenstein ·

    Not What, But How: A Communicative Audit of LLM Response Framing

    Large language models (LLMs) are being increasingly used to answer subjective, information-seeking questions, where users are sensitive to how responses are communicated, not just whether the answers are correct. Existing LLM evaluations for subjective cultural queries largely fo…

  192. arXiv cs.CL TIER_1 English(EN) · Bing Qin ·

    CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs

    Existing research largely reduces cultural intelligence in LLMs to a knowledge-level problem, overlooking whether models can effectively utilize their acquired knowledge in realistic scenarios. To bridge this gap, we introduce CultureForest, a benchmark for Cultural Norm …

  193. Hugging Face Daily Papers TIER_1 English(EN) ·

    Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

    Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths. PRM guided search avoids this by scoring candidate continuations during ge…

  194. arXiv cs.AI TIER_1 English(EN) · Chanwoo Park, Ziyang Chen, Asuman Ozdaglar, Kaiqing Zhang ·

    Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach

    arXiv:2511.04393v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly deployed as "agents" for decision-making (DM) in interactive and dynamic environments. Yet, since they were not originally designed for DM, recent studies show that LLMs can struggle…

  195. arXiv cs.AI TIER_1 English(EN) · Junhyuk Choi, Sohhyung Park, Chanhee Cho, Hyeonchu Park, Bugeun Kim ·

    Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory

    arXiv:2602.00521v2 Announce Type: replace Abstract: While LLM-as-a-Judge is widely used in automated evaluation, existing validation practices primarily operate at the level of observed outputs, offering limited insight into whether LLM judges themselves function as stable and re…

  196. arXiv cs.AI TIER_1 English(EN) · Roberto Figliè, Simone Caputo, Alan Serrano, Daria Mikhaylova, Tommaso Turchi, Daniele Mazzei ·

    Neither Replacement nor Panacea: Comparing LLM-Based Conversational and Graphical Decision Support in Industrial Tasks

    arXiv:2605.31287v1 Announce Type: cross Abstract: Managers in manufacturing settings rely on digital interfaces to interpret operational data for decision-making, but growing data volume and complexity can make relevant insights difficult to identify efficiently. While dashboards…

  197. arXiv cs.LG TIER_1 English(EN) · Ali Dadsetan, Frank Rudzicz ·

    Re-examining Low Rank adaptation for private LLM fine-tuning

    arXiv:2510.01137v3 Announce Type: replace Abstract: Privacy is a central concern when fine-tuning large language models (LLMs) on sensitive data, and differentially private stochastic gradient descent (DP-SGD) -- which clips per-sample gradients and adds calibrated Gaussian noise…

  198. arXiv cs.CL TIER_1 English(EN) · Maiya Goloburda, Roman Vashurin, Fedor Chernogorskii, Nurkhan Laiyk, Daniil Orel, Preslav Nakov, Maxim Panov ·

    Why Don't You Know? Evaluating the Impact of Uncertainty Sources on Uncertainty Quantification in LLMs

    arXiv:2604.10495v2 Announce Type: replace Abstract: As Large Language Models (LLMs) are increasingly deployed in real-world applications, reliable uncertainty quantification (UQ) becomes critical for safe and effective use. Most existing UQ approaches for language models aim to p…

  199. arXiv cs.AI TIER_1 English(EN) · Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, Oana-Maria Camburu ·

    Biases in the Blind Spot: Detecting What LLMs Fail to Mention

    arXiv:2602.10117v5 Announce Type: replace-cross Abstract: Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these unverbalized biases. Monitoring models via their stated reasoning is the…

  200. arXiv cs.AI TIER_1 English(EN) · Aditya Thimmaiah, Jiyang Zhang, Jayanth Srinivasa, Junyi Jessy Li, Milos Gligoric ·

    LLMs Lean on Priors, Not Programming Language Semantics

    arXiv:2510.03415v3 Announce Type: replace-cross Abstract: Recent work asks whether large language models (LLMs) condition their reasoning on explicit rules rather than statistical regularities from pretraining. Program execution provides a canonical instance: formal semantics def…

  201. arXiv cs.AI TIER_1 English(EN) · Hamin Koo, Jaehyung Kim ·

    EMCEE: Improving Multilingual Capability of LLMs via Bridging Knowledge and Reasoning with Extracted Synthetic Multilingual Context

    arXiv:2503.05846v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have achieved impressive progress across a wide range of tasks, yet their heavy reliance on English-centric training data leads to significant performance degradation in non-English languages. …

  202. arXiv cs.AI TIER_1 English(EN) · Caroline Wang, Daniel Kasenberg, Kim Stachenfeld, Pablo Samuel Castro ·

    Discovering Differences in Strategic Behavior Between Humans and LLMs

    arXiv:2602.10324v2 Announce Type: replace Abstract: As Large Language Models (LLMs) are increasingly deployed in social and strategic scenarios, it becomes critical to understand where and why their behavior diverges from that of humans. While behavioral game theory (BGT) provide…

  203. Hugging Face Daily Papers TIER_1 English(EN) ·

    Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

    Chunk-Level Guided Generation uses a large language model as a process scorer to select fixed-length candidate chunks during small model generation, improving reasoning accuracy over traditional methods like majority voting and PRM guided search.

  204. arXiv cs.AI TIER_1 English(EN) · Daniele Mazzei ·

    Neither Replacement nor Panacea: Comparing LLM-Based Conversational and Graphical Decision Support in Industrial Tasks

    Managers in manufacturing settings rely on digital interfaces to interpret operational data for decision-making, but growing data volume and complexity can make relevant insights difficult to identify efficiently. While dashboards remain dominant in industrial contexts, Large Lan…

  205. arXiv cs.AI TIER_1 English(EN) · Rebecca M. M. Hicke, Kiran Tomlinson ·

    Adopt ≠ Adapt: Longitudinal Analyses of LLM Conversations in the Wild

    arXiv:2605.29018v1 Announce Type: new Abstract: Although a growing body of research has begun to describe user-LLM interactions, the picture it paints is largely static; little is known about how individual users change their behavior over time. To address this gap, we analyze t…

  206. arXiv cs.LG TIER_1 English(EN) · Youngbin Choi, Minjong Lee, Saemi Moon, Seunghyuk Cho, Chaehyeon Chung, MoonJeong Park, Dongwoo Kim ·

    In-Place Feedback: Reliable Refinement for Multi-Turn Expert-LLM Collaboration

    arXiv:2510.00777v2 Announce Type: replace Abstract: LLM-generated drafts often contain subtle factual or logical errors, yet prior work shows that models struggle to reliably integrate multi-turn feedback aimed at fixing them. We propose in-place feedback, an interaction paradigm…

  207. arXiv cs.CL TIER_1 English(EN) · Anyuan Zhuo, Xuefei Ning, Ningyuan Li, Jingyi Zhu, Yu Wang, Pinyan Lu ·

    Understanding the Ability of LLMs to Handle Character-Level Perturbation

    arXiv:2510.14365v4 Announce Type: replace Abstract: This work investigates the resilience of contemporary large language models (LLMs) against frequent character-level perturbations. We examine three types of character-level perturbations including introducing numerous typos with…

  208. arXiv cs.CL TIER_1 English(EN) · Alan Li, Yixin Liu, Arpan Sarkar, Doug Downey, Arman Cohan ·

    Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning

    arXiv:2508.19202v3 Announce Type: replace Abstract: Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for ass…

  209. arXiv cs.CL TIER_1 English(EN) · Wajdi Zaghouani, Kholoud K. Aldous, Yicheng Gao ·

    Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese

    arXiv:2605.29667v1 Announce Type: new Abstract: When Large Language Models (LLMs) are deployed in Chinese-language settings, a troubling pattern emerges: safety systems that work well in English break down. These systems struggle to cross linguistic and cultural boundaries, leav…

  210. arXiv cs.CL TIER_1 English(EN) · Yeyong Yu, Wenya Hu, Xing Wu, Quan Qian ·

    From Blind Guess to Informed Judgment: Teaching LLMs to Evaluate Materials by Building Knowledge-Augmented Preference Signals

    arXiv:2605.29555v1 Announce Type: new Abstract: As candidate generation and high-throughput experimentation advance, the primary bottleneck in materials discovery is shifting from property prediction to making reliable evaluations among massive candidate sets. We propose a Knowle…

  211. arXiv cs.CL TIER_1 English(EN) · Xinming Yang, Jun Li ·

    Error as a Lens: Probing LLM Reasoning through Synthetic Misconception Generation

    arXiv:2605.29007v1 Announce Type: new Abstract: Personalized tutoring, teacher training, and education research need access to targeted synthetic misconceptions, but privacy and IRB constraints make labelled corpora of real student errors scarce. LLMs could in principle ge…

  212. arXiv cs.CL TIER_1 English(EN) · Mohamed Abdelwahab, Michelle Yu Collins, Sihan Chen, Yi Cheng Zhao, Zafarullah Mahmood, Jiading Zhu, Soliman Ali, Jonathan Rose ·

    What are They Thinking? Delineation, Probing and Tracking of Concepts in LLMs

    arXiv:2605.28823v1 Announce Type: new Abstract: As the influence of LLMs expands, it is imperative to gain insight into their decisions. One way to do that is to develop probes that detect the presence or absence of a broad set of concepts within the embeddings computed in an LLM…

  213. arXiv cs.AI TIER_1 English(EN) · Yu Lei, Hao Liu, Chengxing Xie, Songjia Liu, Zhiyu Yin, Canyu Chen, Guohao Li, Philip Torr, Zhen Wu ·

    Are LLMs Socially Adaptive? Contrasting Belief Evolution in Large Language Models and Humans

    arXiv:2410.10398v3 Announce Type: replace-cross Abstract: As large language models (LLMs) increasingly engage in complex social interactions, ensuring that their behaviors align with human ethical principles and intentions, known as value alignment, has become a critical scientif…

  214. arXiv cs.AI TIER_1 English(EN) · Shaojie Wang, Liang Zhang ·

    From Meta-Thought to Execution: Cognitively Aligned Post-Training for Generalizable and Reliable LLM Reasoning

    arXiv:2601.21909v2 Announce Type: replace Abstract: Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, a closer examination reveals a fundamental g…

  215. arXiv cs.AI TIER_1 English(EN) · Ruoxi Su, Yuhan Liu, Jingyu Hu ·

    Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment

    arXiv:2605.29458v1 Announce Type: cross Abstract: Accurately simulating the decisions of a specific individual remains challenging for large language models (LLMs), partly because persona information is often provided as static descriptions that miss the values, experiences, and …

  216. arXiv cs.AI TIER_1 English(EN) · Asaf Yehudai, Naama Rozen, Ariel Gera ·

    Teaching Values to Machines: Simulating Human-Like Behavior in LLMs

    arXiv:2605.30036v1 Announce Type: new Abstract: Large Language Models (LLMs) demonstrate a remarkable capacity to adopt different personas and roles; however, it remains unclear whether they can manifest behavior that adheres to a coherent, human-like value structure. In this wor…

  217. arXiv cs.AI TIER_1 English(EN) · Yunjin Qi, Zhaojun Jiang, Xuan Wu, Hanxi Pan, Yixuan Wang, Yanfang Liu, Xiang Ji, Churu Yu, Chunyuan Zheng, Yingze Chen, Jie He, Liuqing Chen, Zaifeng Gao ·

    NICE: A Theory-Grounded Diagnostic Benchmark for Social Intelligence of LLMs

    arXiv:2605.29685v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly applied in social contexts such as emotional companionship and customer service, measuring their social intelligence has become critical to the quality and safety of human-AI interact…

  218. arXiv cs.AI TIER_1 English(EN) · Ariel Gera ·

    Teaching Values to Machines: Simulating Human-Like Behavior in LLMs

    Large Language Models (LLMs) demonstrate a remarkable capacity to adopt different personas and roles; however, it remains unclear whether they can manifest behavior that adheres to a coherent, human-like value structure. In this work, we draw on established psychological value th…

  219. Hugging Face Daily Papers TIER_1 English(EN) ·

    Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese

    When Large Language Models (LLMs) are deployed in Chinese-language settings, a troubling pattern emerges: safety systems that work well in English break down. These systems struggle to cross linguistic and cultural boundaries, leaving models exposed to adversarial prompts that e…

  220. arXiv cs.LG TIER_1 English(EN) · Chacha Chen, Matthew Jörke, Adam Goliński, Masha Fedzechkina, Guillermo Sapiro, Sinead Williamson, Nicholas Foti ·

    LLMs are not (consistently) Bayesian: Quantifying internal (in)consistencies of LLMs' probabilistic beliefs

    arXiv:2605.06915v2 Announce Type: replace Abstract: Modern AI systems are being deployed in complex domains such as medicine, science, and law, where it is important that they not only produce correct answers, but also represent and update uncertain beliefs about the world as new…

  221. arXiv cs.CL TIER_1 English(EN) · Yahan Yu, Yuyang Dong, Masafumi Oyamada ·

    Learning Deliberately, Acting Intuitively: Unlocking Test-Time Reasoning in Multimodal LLMs

    arXiv:2507.06999v2 Announce Type: replace-cross Abstract: Reasoning is essential for large language models (LLMs), especially in complex tasks such as mathematical problem solving. However, multimodal reasoning still faces challenges in modality alignment and training scalability…

  222. arXiv cs.AI TIER_1 English(EN) · Zihao Han, Tiangang Zhang, Huaibin Wang, Yilun Sun ·

    Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

    arXiv:2605.11458v2 Announce Type: replace Abstract: On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student's own rollouts while conditioning on the reference solution. A design choice shared by nearly all such m…

  223. arXiv cs.AI TIER_1 English(EN) · Yansong Ning, Mianpeng Liu, Jingwen Ye, Weidong Zhang, Hao Liu ·

    HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

    arXiv:2605.28398v1 Announce Type: new Abstract: Hybrid-reasoning large language models (LLMs) expose explicit controls over reasoning effort, allowing users or systems to trade off answer quality against inference cost. However, existing methods for adaptive thinking-mode selecti…

  224. arXiv cs.CL TIER_1 English(EN) · Gabrielle Kaili-May Liu, Arman Cohan ·

    Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

    arXiv:2605.28778v1 Announce Type: new Abstract: LLMs' linguistically expressed confidence should faithfully reflect their intrinsic uncertainty. While recent work shows LLMs struggle to use epistemic markers (e.g., "it is likely...") in a human-aligned fashion, it remains unclear…

  225. arXiv cs.AI TIER_1 English(EN) · Camilo Chacón Sartori, José H. García ·

    A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test

    arXiv:2605.27789v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) systems are often compared by asking a large language model (LLM) judge which answer is better. For multi-hop RAG, this has become a measurement problem as much as a modeling problem: the same sc…

  226. arXiv cs.AI TIER_1 English(EN) · Kohsei Matsutani, Shota Takashiro, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo ·

    RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs

    arXiv:2509.21128v2 Announce Type: replace Abstract: Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable rewards (RLVR) and supervised fine-tuning (SFT) on reasoning traces to improve their reasoning abilities. However, how these metho…

  227. arXiv cs.AI TIER_1 English(EN) · Yue Cheng, Jiajun Zhang, Xiaohui Gao, Weiwei Xing, Zheng Wang, Zhanxing Zhu ·

    Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

    arXiv:2605.28388v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Reward (RLVR) is empirically shown to notably enhance the reasoning performance of large language models (LLMs), particularly in mathematics and programming. However, the mechanistic role of Sa…

  228. arXiv cs.CL TIER_1 English(EN) · Arman Cohan ·

    Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

    LLMs' linguistically expressed confidence should faithfully reflect their intrinsic uncertainty. While recent work shows LLMs struggle to use epistemic markers (e.g., "it is likely...") in a human-aligned fashion, it remains unclear whether models can apply their own linguistic c…

  229. arXiv cs.AI TIER_1 English(EN) · Shivam Rawat, Lucie Flek, Florian Mai, Nicholas Kluge Corrêa ·

    Reasoning Primitives in Hybrid and Non-Hybrid LLMs: Do Architectural Differences Yield Advantages in State-Tracking and Recall?

    arXiv:2604.21454v2 Announce Type: replace-cross Abstract: Reasoning in large language models is often discussed as a single capability, but some of its gains may stem from simpler underlying operations. We examine two such primitives, recall and state-tracking, through five contr…

  230. arXiv cs.AI TIER_1 English(EN) · Wenda Xu, Sweta Agrawal, Vilém Zouhar, Markus Freitag, Daniel Deutsch ·

    When LLMs Benchmark Themselves: Deconstructing Self-Bias in Automated Evaluation

    arXiv:2509.26600v2 Announce Type: replace-cross Abstract: As LLMs rapidly saturate existing benchmarks, automated benchmark creation using LLMs (LLM-as-a-benchmark) -- where a model generates test inputs (LLM-as-a-testset) and evaluates outputs (LLM-as-an-evaluator) -- has gained…

  231. arXiv cs.AI TIER_1 English(EN) · Nafis Tanveer Islam, Zhiming Zhao ·

    How Reliable are LLMs for Reasoning on the Re-ranking task?

    arXiv:2508.18444v2 Announce Type: replace-cross Abstract: With the improving semantic understanding capability of Large Language Models (LLMs), they exhibit a greater awareness and alignment with human values, but this comes at the cost of transparency. Although promising results…

  232. arXiv cs.AI TIER_1 English(EN) · Shashwat Singh, Tal Linzen, Shauli Ravfogel ·

    Can LLMs Introspect? A Reality Check

    arXiv:2605.26242v1 Announce Type: new Abstract: Can large language models detect and report their own internal states? A number of studies have argued that the answer to this question is yes. We argue, based on lessons from human metacognition research, that this conclusion may b…

  233. arXiv cs.AI TIER_1 English(EN) · Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dongsheng Li, Yuqing Yang ·

    Understanding Reasoning in LLMs through Strategic Information Allocation under Uncertainty

    arXiv:2603.15500v2 Announce Type: replace Abstract: LLMs often exhibit Aha moments such as self-correction after tokens like "Wait," yet the underlying mechanism remains unclear. Standard LLMs collapse mainly through silent divergence, where trajectories drift from the correct an…

  234. arXiv cs.AI TIER_1 English(EN) · Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin ·

    It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty

    arXiv:2605.27288v1 Announce Type: cross Abstract: Large language models (LLMs) are known to abandon their initial stance to conform to user pushback. While prior research largely attributes this behavior to sycophancy learned during reinforcement learning from human feedback, we …

  235. arXiv cs.AI TIER_1 English(EN) · Adam Bawatneh, Sagar Sapkota, Amrit Singh Bedi, Santu Karmaker, Mubarak Shah ·

    OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

    arXiv:2605.26322v1 Announce Type: new Abstract: Theory of Mind (ToM), the ability to infer others' knowledge, intentions, and emotions, is commonly evaluated in large language models (LLMs) using end-point question answering, where performance is judged solely by the final answer…

  236. arXiv cs.AI TIER_1 English(EN) · Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin ·

    Stop Listening to Me! How Multi-turn Conversations Can Degrade LLM Reliability

    arXiv:2603.11394v3 Announce Type: replace-cross Abstract: Large language models (LLMs) excel on static benchmarks, but their performance across multi-turn conversations, which better reflect real-world usage, remains understudied. Addressing this gap is critical in high-stakes se…

  237. Hugging Face Daily Papers TIER_1 English(EN) ·

    HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

    HRBench presents a unified evaluation framework for hybrid-reasoning LLMs that systematically compares thinking-mode switching strategies across different training regimes and model scales.

  238. Hugging Face Daily Papers TIER_1 English(EN) ·

    Review Arcade: On the Human Alignment and Gameability of LLM Reviews

    Empirical analysis reveals limited alignment between LLM-generated reviews and human reviews, with varying performance across different prompts and models, and demonstrates that authors can strategically improve paper scores through iterative revision based on LLM feedback.

  239. arXiv cs.AI TIER_1 English(EN) · Bradley A. Malin ·

    It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty

    Large language models (LLMs) are known to abandon their initial stance to conform to user pushback. While prior research largely attributes this behavior to sycophancy learned during reinforcement learning from human feedback, we hypothesize that conformity is also driven by a mo…

  240. arXiv cs.CL TIER_1 English(EN) · Nura Aljaafari, Marco Valentino, André Freitas ·

    Is Inference Mediated by Distinct Semantic Structures in LLMs? A Mechanistic Interpretation

    arXiv:2605.25520v1 Announce Type: new Abstract: Predicting a label correctly does not necessarily require representing the operation that produces it. Transformer representations are known to carry label-level information, but whether they encode semantic operations producing tho…

  241. arXiv cs.AI TIER_1 English(EN) · Zhiyuan Zhai, Xinkai You, Wenjing Yan, Xin Wang ·

    How Much Thinking is Enough? Quantifying and Understanding Redundancy in LLM Reasoning

    arXiv:2605.23926v1 Announce Type: new Abstract: Reasoning-capable large language models solve hard problems by emitting long chains of thought, paying heavily in latency, GPU time, and energy. Casual inspection of their traces reveals extensive reformulation, verification, and ci…

  242. arXiv cs.AI TIER_1 English(EN) · Zenghui Zhou, Man Li, Xiaoke Fang, Xinyi Zhou, Weibin Li, Zheng Zheng ·

    LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

    arXiv:2605.23965v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve strong performance on logical reasoning benchmarks, yet their reliability remains uncertain. Existing evaluations rely on static benchmarks, which fail to assess robustness under logically equiva…

  243. arXiv cs.AI TIER_1 English(EN) · Ali Şenol, Garima Agrawal, Huan Liu ·

    Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

    arXiv:2605.24661v1 Announce Type: new Abstract: LLMs have achieved remarkable success in complex reasoning tasks, yet current evaluation approaches predominantly rely on final-answer correctness, offering limited insight into the underlying reasoning processes that produce those …

  244. arXiv cs.AI TIER_1 English(EN) · Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Leonidas Guibas, Jiajun Wu, Yejin Choi ·

    Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

    arXiv:2602.21198v3 Announce Type: replace-cross Abstract: Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into exper…

  245. arXiv cs.CL TIER_1 English(EN) · Tianlang Chen, Shirley Wu, Jure Leskovec ·

    Found in Conversation: LLMs Teach Themselves to Close the Multi-Turn Gap

    arXiv:2605.24432v1 Announce Type: new Abstract: Large Language Model (LLM) interactions are typically underspecified, with users clarifying all necessary details across multiple conversational turns. Yet recent work shows that LLMs perform far worse in this multi-turn setting tha…

  246. arXiv cs.CL TIER_1 English(EN) · Jinyan Su, Claire Cardie ·

    Knowing but Not Showing: LLMs Recognize Ambiguity but Rarely Ask Clarifying Questions

    arXiv:2605.25284v1 Announce Type: new Abstract: User queries are often underspecified and may admit multiple valid interpretations. Rather than silently making assumptions about the user's intent, a helpful assistant should surface such ambiguity by asking a clarifying question. …

  247. arXiv cs.LG TIER_1 English(EN) · Dennis Frauen, Marie Brockschmidt, Konstantin Hess, Haorui Ma, Yuchen Ma, Abdurahman Maarouf, Maresa Schröder, Jonas Schweisthal, Yuxin Wang, Athiya Deviyani, Sonali Parbhoo, Rahul G. Krishnan, Stefan Feuerriegel ·

    Causal methods for LLM development and evaluation

    arXiv:2605.25998v1 Announce Type: new Abstract: Large language model (LLM) development is currently driven by large-scale empirical iteration over data mixtures, reward models, routing strategies, and evaluation pipelines. Here, we argue that many central questions in LLM develop…

  248. arXiv cs.LG TIER_1 English(EN) · Jackie Baek, Yunhan Chen, Ziyu Chi, Will Ma ·

    LLM-SAA: LLM-persona Generated Distributions for Decision-making

    arXiv:2602.06357v2 Announce Type: replace Abstract: LLMs can generate a wealth of data, ranging from simulated personas imitating human valuations and preferences, to demand forecasts based on world knowledge. But how well do such LLM-generated distributions support downstream de…

  249. Hugging Face Daily Papers TIER_1 English(EN) ·

    Can LLMs Introspect? A Reality Check

    Can large language models detect and report their own internal states? A number of studies have argued that the answer to this question is yes. We argue, based on lessons from human metacognition research, that this conclusion may be premature: to be convinced of this conclusion …

  250. arXiv cs.LG TIER_1 English(EN) · Stefan Feuerriegel ·

    Causal methods for LLM development and evaluation

    Large language model (LLM) development is currently driven by large-scale empirical iteration over data mixtures, reward models, routing strategies, and evaluation pipelines. Here, we argue that many central questions in LLM development and evaluation are inherently causal: What …

  251. Hugging Face Daily Papers TIER_1 English(EN) ·

    Causal methods for LLM development and evaluation

    Large language model (LLM) development is currently driven by large-scale empirical iteration over data mixtures, reward models, routing strategies, and evaluation pipelines. Here, we argue that many central questions in LLM development and evaluation are inherently causal: What …

  252. arXiv cs.CL TIER_1 English(EN) · André Freitas ·

    Is Inference Mediated by Distinct Semantic Structures in LLMs? A Mechanistic Interpretation

    Predicting a label correctly does not necessarily require representing the operation that produces it. Transformer representations are known to carry label-level information, but whether they encode semantic operations producing those labels is unclear. We investigate this in Nat…

  253. arXiv cs.AI TIER_1 English(EN) · Dongxin Guo, Jikun Wu, Siu Ming Yiu ·

    Do Language Models Know What Not to Say? Causal Evidence for Statistical Preemption in LLMs

    arXiv:2605.23039v1 Announce Type: cross Abstract: How do learners acquire knowledge of what is unacceptable without negative evidence? Construction Grammar proposes statistical preemption: exposure to a conventional form (e.g., "donated the books to the library") preempts structu…

  254. arXiv cs.AI TIER_1 English(EN) · Eric Xu ·

    As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs

    arXiv:2605.23147v1 Announce Type: cross Abstract: Role prompts of the form As X, do Y admit a clean linear decomposition at one specific site in the residual stream: the prompt-to-answer transition -- the last prompt token together with the first two generated tokens -- in an ear…

  255. arXiv cs.LG TIER_1 English(EN) · Tim Tomov, Dominik Fuchsgruber, Stephan Günnemann ·

    Task-Awareness Improves LLM Generations and Uncertainty

    arXiv:2601.21500v2 Announce Type: replace Abstract: In many applications of LLMs, natural language responses often have an underlying structure such as representing discrete labels, numerical values, or graphs. Yet, existing decoding and uncertainty estimation methods operate onl…

  256. arXiv cs.CL TIER_1 English(EN) · Binqi Shen, Lier Jin, Hanyu Cai, Lan Hu, Yuting Xin ·

    The Efficiency Frontier: A Unified Framework for Cost-Performance Optimization in LLM Context Management

    arXiv:2605.23071v1 Announce Type: new Abstract: Large language models (LLMs) increasingly rely on long-context processing, but expanding context windows introduces substantial computational and financial costs. Existing context reduction approaches, including retrieval and memory…

  257. arXiv cs.AI TIER_1 English(EN) · Chuanyang Jin, Binze Li, Haopeng Xie, Cathy Mengying Fang, Tianjian Li, Shayne Longpre, Hongxiang Gu, Maximillian Chen, Tianmin Shu ·

    ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions

    arXiv:2605.20087v2 Announce Type: replace-cross Abstract: Conversational AI has now reached billions of users, yet existing datasets capture only what people say, not what they think. We introduce ThoughtTrace, the first large-scale dataset that pairs real-world multi-turn human-…

  258. arXiv cs.AI TIER_1 English(EN) · Sirui Chen, Lei Xu, Yuying Zhao, Yutian Chen, Yu Wang, Beier Zhu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu ·

    Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals

    arXiv:2605.23384v1 Announce Type: cross Abstract: Recent RL methods have substantially improved the reasoning abilities of LLMs. Existing reward designs mainly follow two paradigms: (1) Reinforcement learning with verifiable rewards (RLVR) derives outcome signals from executable …

  259. Hugging Face Daily Papers TIER_1 English(EN) ·

    Can LLMs Introspect? A Reality Check

    Large language models may not genuinely detect their internal states, as their apparent introspective abilities could reflect surface-level pattern matching rather than true metacognitive monitoring.

  260. arXiv cs.AI TIER_1 English(EN) · Chaochao Lu ·

    Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals

    Recent RL methods have substantially improved the reasoning abilities of LLMs. Existing reward designs mainly follow two paradigms: (1) Reinforcement learning with verifiable rewards (RLVR) derives outcome signals from executable checks or ground-truth answers, but provides limit…

  261. arXiv cs.AI TIER_1 English(EN) · Carolina Camassa, Derek Shiller ·

    Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

    arXiv:2605.20382v1 Announce Type: cross Abstract: Language models are trained to follow instructions, but they are also powerful pattern completers. What happens when these two objectives conflict? We construct conversations in which a user instruction to behave in a target way T…

  262. arXiv cs.LG TIER_1 English(EN) · Zhuo Li, Guodong Du, Zesheng Shi, Weiyang Guo, Weijun Yao, Yuan Zhou, Jiabo Zhang, Jing Li ·

    Skill Weaving: Efficient LLM Improvement via Modular Skillpacks

    arXiv:2605.22205v1 Announce Type: cross Abstract: Large language models increasingly require specialization across diverse domains, yet existing approaches struggle to balance multi-domain capacities with strict memory and inference constraints. In this work, we introduce SkillWe…

  263. arXiv cs.CL TIER_1 English(EN) · Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao ·

    Token-Level LLM Collaboration via FusionRoute

    arXiv:2601.05106v4 Announce Type: replace-cross Abstract: Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitive…

  264. arXiv cs.CL TIER_1 English(EN) · Sid-ali Temkit ·

    AMEL: Accumulated Message Effects on LLM Judgments

    arXiv:2605.22714v1 Announce Type: cross Abstract: Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation histor…

  265. arXiv cs.AI TIER_1 English(EN) · Andrea Sassella, Andrea Chizzola, Tommaso Bianchi, Luca Alessandrelli, Mark James Carman ·

    Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

    arXiv:2605.07731v2 Announce Type: replace-cross Abstract: This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is investigated across a w…

  266. arXiv cs.AI TIER_1 English(EN) · Sangwoo Park, Woongyeong Yeo, Seanie Lee, Yumin Choi, Hyomin Lee, Kangsan Kim, Jinheon Baek, Seong Joon Oh, Sung Ju Hwang ·

    It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs

    arXiv:2605.20258v1 Announce Type: cross Abstract: Contextual Integrity (CI) defines privacy not merely as keeping information hidden, but as governing information flows according to the norms of a given context. As large language models are increasingly deployed as personal agent…

  267. arXiv cs.CL TIER_1 English(EN) · Eric Xu ·

    As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs

    Role prompts of the form As X, do Y admit a clean linear decomposition at one specific site in the residual stream: the prompt-to-answer transition -- the last prompt token together with the first two generated tokens -- in an early/mid layer band. There, persona and task contrib…

  268. arXiv cs.CL TIER_1 English(EN) · Yuting Xin ·

    The Efficiency Frontier: A Unified Framework for Cost-Performance Optimization in LLM Context Management

    Large language models (LLMs) increasingly rely on long-context processing, but expanding context windows introduces substantial computational and financial costs. Existing context reduction approaches, including retrieval and memory compression methods, are typically evaluated us…

  269. arXiv cs.CL TIER_1 English(EN) · Siu Ming Yiu ·

    Do Language Models Know What Not to Say? Causal Evidence for Statistical Preemption in LLMs

    How do learners acquire knowledge of what is unacceptable without negative evidence? Construction Grammar proposes statistical preemption: exposure to a conventional form (e.g., "donated the books to the library") preempts structurally possible but unattested alternatives ("*dona…

  270. arXiv cs.AI TIER_1 English(EN) · Sid-ali Temkit ·

    AMEL: Accumulated Message Effects on LLM Judgments

    Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call t…

  271. arXiv cs.LG TIER_1 English(EN) · Faezeh Ghaderi ·

    What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema

    We read twelve well-known LLM agent benchmark papers and recorded, dimension by dimension, what each paper actually says about how its evaluation was run. The motivation came from a familiar frustration: two papers will report results on the same benchmark with the same model nam…

  272. arXiv cs.CL TIER_1 English(EN) · Akiko Aizawa ·

    Refining and Reusing Annotation Guidelines for LLM Annotation

    While Large Language Models (LLMs) demonstrate remarkable performance on zero-shot annotation tasks, they often struggle with the specialized conventions of gold-standard benchmarks. We propose the systematic reuse and refinement of annotation guidelines as an alignment mechanism…

  273. arXiv cs.CL TIER_1 English(EN) · Derek Shiller ·

    Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

    Language models are trained to follow instructions, but they are also powerful pattern completers. What happens when these two objectives conflict? We construct conversations in which a user instruction to behave in a target way T (e.g., always output a specific token, answer in …

  274. arXiv cs.AI TIER_1 English(EN) · Tianmin Shu ·

    ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions

    Conversational AI has now reached billions of users, yet existing datasets capture only what people say, not what they think. We introduce ThoughtTrace, the first large-scale dataset that pairs real-world multi-turn human--AI conversations with users' self-reported thoughts: thei…

  275. arXiv cs.CL TIER_1 English(EN) · Dzmitry Bahdanau ·

    Forecasting Downstream Performance of LLMs With Proxy Metrics

    Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals…

  276. Hugging Face Daily Papers TIER_1 English(EN) ·

    Forecasting Downstream Performance of LLMs With Proxy Metrics

    Proxy metrics based on token-level statistics from expert-written solutions provide more reliable model performance forecasting than traditional loss-based methods across multiple development stages.

  277. arXiv cs.LG TIER_1 English(EN) · Jendrik Seipp ·

    Property-Guided LLM Program Synthesis for Planning

    LLMs have shown impressive success in program synthesis, discovering programs that surpass prior solutions. However, these approaches rely on simple numeric scores to signal program quality, such as the value of the solution or the number of passed tests. Because a score offers n…

  278. arXiv cs.CL TIER_1 English(EN) · Rose Yu ·

    Calibrating LLMs with Semantic-level Reward

    As large language models (LLMs) are deployed in consequential settings such as medical question answering and legal reasoning, the ability to estimate when their outputs are likely to be correct is essential for safe and reliable use, requiring well-calibrated uncertainty. Standa…

  279. arXiv cs.CL TIER_1 English(EN) · Shashi Bhushan TN ·

    From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

    Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and …

  280. arXiv cs.AI TIER_1 English(EN) · Xiaosong Zhang ·

    Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use

    Tool use extends large language models beyond parametric knowledge, but reliable execution requires balancing appropriate reasoning depth with strict structural validity. We approach this problem from a case-based perspective to present CAST, a case-driven framework that treats h…

  281. arXiv cs.AI TIER_1 English(EN) · Nigam Shah ·

    Quantifying and Mitigating Premature Closure in Frontier LLMs

    Premature closure, or committing to a conclusion before sufficient information is available, is a recognized contributor to diagnostic error but remains underexamined in large language models (LLMs). We define LLM premature closure as inappropriate commitment under uncertainty: p…

  282. arXiv cs.AI TIER_1 English(EN) · Murphy Zhuang ·

    MinT: Managed Infrastructure for Training and Serving Millions of LLMs

    We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing ea…

  283. arXiv cs.CL TIER_1 English(EN) · Chuang Gan ·

    FlowCompile: An Optimizing Compiler for Structured LLM Workflows

    Structured LLM workflows, where specialized LLM sub-agents execute according to a predefined graph, have become a powerful abstraction for solving complex tasks. Optimizing such workflows, i.e., selecting configurations for each sub-agent to balance accuracy and latency, is chall…

  284. arXiv cs.CL TIER_1 English(EN) · Marcos Piau ·

    LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs

    Frontier assistant LLMs ship with strong guardrails: asked directly to write a persuasive essay denying the Holocaust, denying vaccine safety, defending flat-earth cosmology, arguing for racial hierarchies, denying anthropogenic climate change, or replacing evolution with creatio…

  285. Hugging Face Daily Papers TIER_1 English(EN) ·

    The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code

    As Large Language Models (LLMs) are transforming software development, the functional quality of generated code has become a central focus, leaving readability, one of critical non-functional attributes, understudied. Given that LLM-generated code still needs human review before …

  286. arXiv cs.AI TIER_1 English(EN) · Devvrit Khatri ·

    Learning, Fast and Slow: Towards LLMs That Adapt Continually

    Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context lea…

  287. arXiv cs.AI TIER_1 English(EN) · Xiaoxing Ma ·

    Uncertainty Quantification for LLM-based Code Generation

    Prediction sets provide a theoretically grounded framework for quantifying uncertainty in machine learning models. Adapting them to structured generation tasks, in particular, large language model (LLM) based code generation, remains a challenging problem. An existing attempt pro…

  288. arXiv cs.CL TIER_1 English(EN) · Fuli Feng ·

    SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation

    Large Language Models (LLMs) achieve strong performance on standard knowledge evaluation benchmarks, yet recent work shows that their knowledge capabilities remain brittle under question variants that test the same knowledge in different forms. Robustness augmentation of existing…

  289. arXiv cs.CL TIER_1 English(EN) · Dayiheng Liu ·

    On Predicting the Post-training Potential of Pre-trained LLMs

    The performance of Large Language Models (LLMs) on downstream tasks is fundamentally constrained by the capabilities acquired during pre-training. However, traditional benchmarks like MMLU often fail to reflect a base model's plasticity in complex open-ended scenarios, leading to…

  290. arXiv cs.CL TIER_1 English(EN) · Xiangdong Su ·

    Training-Inference Consistent Segmented Execution for Long-Context LLMs

    Transformer-based large language models face severe scalability challenges in long-context generation due to the computational and memory costs of full-context attention. Under practical computation and memory constraints, many inference-efficient long-context methods improve eff…

  291. arXiv cs.LG TIER_1 English(EN) · Marco Cuturi ·

    DynaMiCS: Fine-tuning LLMs with Performance Constraints using Dynamic Mixtures

    Multi-domain fine-tuning of large language models requires improving performance on target domains while preserving performance on constrained domains, such as general knowledge, instruction following, or safety evaluations. Existing data mixing strategies rely on fixed heuristic…

  292. arXiv cs.AI TIER_1 English(EN) · Jens Albrecht ·

    LLARS: Enabling Domain Expert & Developer Collaboration for LLM Prompting, Generation and Evaluation

    We demonstrate LLARS (LLM Assisted Research System), an open-source platform that bridges the gap between domain experts and developers for building LLM-based systems. It integrates three tightly connected modules into an end-to-end pipeline: Collaborative Prompt Engineering for …

  293. arXiv cs.CL TIER_1 English(EN) · Martin Vechev ·

    Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

    Large language models (LLMs) have become capable mathematical problem-solvers, often producing correct proofs for challenging problems. However, correctness alone is not sufficient: mathematical proofs should also be clear, concise, insightful, and transferable to other problems.…

  294. arXiv cs.AI TIER_1 English(EN) · Jing Li ·

    Personalizing LLMs with Binary Feedback: A Preference-Corrected Optimization Framework

    Large Language Model (LLM) personalization aims to align model behaviors with individual user preferences. Existing methods often focus on isolated user histories, neglecting the essential role of inter-user differences. We propose C-BPO, a framework that personalizes LLMs via pr…

  295. Hugging Face Daily Papers TIER_1 English(EN) ·

    Personalizing LLMs with Binary Feedback: A Preference-Corrected Optimization Framework

    Large Language Model (LLM) personalization aims to align model behaviors with individual user preferences. Existing methods often focus on isolated user histories, neglecting the essential role of inter-user differences. We propose C-BPO, a framework that personalizes LLMs via pr…

  296. arXiv cs.CL TIER_1 English(EN) · Deep Shah ·

    The Silent Vote: Improving Zero-Shot LLM Reliability by Aggregating Semantic Neighborhoods

    Large Language Models are increasingly used as zero-shot classifiers in complex reasoning tasks. However, standard constrained decoding suffers from a phenomenon we define as Renormalization Bias. When a model is restricted to a small set of target labels, the standard softmax op…

  297. arXiv cs.CL TIER_1 English(EN) · Yanran Li ·

    Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges

    Multi-judge evaluation is increasingly used to assess LLMs and reward models, and the prevailing heuristic is to curate: keep the most accurate judges and discard weaker ones. We show that this heuristic can reverse when the target is not point accuracy, but calibrated probabilis…

  298. arXiv cs.CL TIER_1 English(EN) · Ash Lewis ·

    GLiGuard: Schema-Conditioned Classification for LLM Safeguard

    Ensuring safe, policy-compliant outputs from large language models requires real-time content moderation that can scale across multiple safety dimensions. However, state-of-the-art guardrail models rely on autoregressive decoders with 7B–27B parameters, reformulating what is fun…

  299. arXiv cs.CL TIER_1 English(EN) · Ning Xu ·

    Beyond "I cannot fulfill this request": Alleviating Rigid Rejection in LLMs via Label Enhancement

    Large Language Models (LLMs) rely on safety alignment to obey safe requests while refusing harmful ones. However, traditional refusal mechanisms often lead to "rigid rejection," where a general template (e.g., "I cannot fulfill this request") indiscriminately triggers refusals an…

  300. arXiv cs.AI TIER_1 English(EN) · James Z. Wang ·

    Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs

    Large Language Models (LLMs) are increasingly used in settings where reliable self-assessment is critical. Assessing model reliability has evolved from using probabilistic correctness estimates to, more recently, eliciting verbalized confidence. Confidence, however, has been show…

  301. arXiv cs.AI TIER_1 English(EN) · Abbas Rahimi ·

    POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles

    Balancing exploration and exploitation is a core challenge in sequential decision-making and black-box optimization. We introduce POETS (Policy Ensembles for Thompson Sampling), a novel framework that bridges uncertainty quantification …

  302. arXiv cs.AI TIER_1 English(EN) · Mark James Carman ·

    Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

    This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is investigated across a wide variety of representative benchmarks, and is compared …

  303. arXiv cs.LG TIER_1 English(EN) · Nan Jiang ·

    Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

    Reinforcement learning, including reinforcement learning with verifiable rewards (RLVR), has emerged as a powerful approach for LLM post-training. Central to these approaches is the design of the importance sampling (IS) ratio used in off-policy policy-gradient estimation. Existi…

  304. arXiv cs.CL TIER_1 English(EN) · Xiaozhang Liu ·

    From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs

    Multiple-choice reasoning benchmarks face dual challenges: rapid saturation from advancing models and data contamination that undermines static evaluations. Ad-hoc hardening methods (paraphrasing, perturbation) attempt to increase difficulty but sacrifice logical validity for sur…

  305. arXiv cs.AI TIER_1 English(EN) · Nguyen Viet Tuan Kiet, Bui Dinh Pham, Dao Van Tung, Tran Cong Dao, Huynh Thi Thanh Binh ·

    Back to the Beginning of Heuristic Design: Bridging Code and Knowledge with LLMs

    arXiv:2605.06123v1. Large language models (LLMs) have recently advanced automatic heuristic design (AHD) for combinatorial optimization (CO), where candidate heuristics are iteratively proposed, evaluated, and refined. Most existing approaches search o…

  306. arXiv cs.LG TIER_1 English(EN) · Zixuan Chen, Hao Lin, Zizhe Chen, Yizhou Tian, Garry Yang, Depeng Wang, Ya Guo, Huijia Zhu, James Cheng ·

    Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs

    arXiv:2605.05957v1. LLMs reliably correct false claims when presented in isolation, yet when the same claims are embedded in task-oriented requests, they often comply rather than correct. We term this failure mode correction suppression and cons…

  307. arXiv cs.LG TIER_1 English(EN) · Xinrui Chen, Liu Yang, Ou Wu ·

    One Algorithm, Two Goals: Dual Scoring for Parameter and Data Selection in LLM Fine-Tuning

    arXiv:2605.06166v1. In Large Language Model (LLM) fine-tuning, parameter and data selection are common strategies for reducing fine-tuning cost, yet they are typically driven by separate scoring mechanisms. When a parameter mask and data subset jointly…

  308. arXiv cs.LG TIER_1 English(EN) · Dylan Bouchard ·

    Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades

    arXiv:2605.06350v1. Model cascades, in which a cheap LLM defers to an expensive one on low-confidence queries, are widely used to navigate the cost-quality tradeoff at deployment. Existing approaches largely treat the deferral threshold as an empirical…

  309. arXiv cs.LG TIER_1 English(EN) · Sushant Gautam, Finn Schwall, Annika Willoch Olstad, Fernando Vallecillos Ruiz, Birk Torpmann-Hagen, Sunniva Maria Stordal Bjørklund, Leon Moonen, Klas Pettersen, Michael A. Riegler ·

    When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

    arXiv:2605.06652v1. Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and …

  310. arXiv cs.LG TIER_1 English(EN) · Andy Zeyi Liu, Elliot Paquette, John Sous ·

    Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

    arXiv:2605.05683v1. Training loss and throughput can hide distinct internal representation in language-model training. To examine these hidden mechanics, we use spectral measurements as practical and operational diagnostics. Using a controlled family…

  311. arXiv cs.LG TIER_1 English(EN) · Yang Xu, Jiefu Zhang, Haixiang Sun, Zihan Zhou, Tianyu Cao, Vaneet Aggarwal ·

    Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking

    arXiv:2605.05973v1. Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy proc…

  312. arXiv cs.LG TIER_1 English(EN) · Jonas Bayer, Stefan Zetzsche, Olivier Bouissou, Remi Delmas, Michael Tautschnig, Soonho Kong ·

    Teaching LLMs Program Semantics via Symbolic Execution Traces

    arXiv:2605.06184v1. We introduce an evaluation framework of 500 C verification tasks across five property types (memory safety, overflow, termination, reachability, data races) built on SV-COMP 2025, and evaluate 14 models across six families. We fin…

  313. arXiv cs.LG TIER_1 English(EN) · Florian A. D. Burnat, Brittany I. Davidson ·

    Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

    arXiv:2605.06327v1. Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context…

  314. arXiv cs.LG TIER_1 English(EN) · Ashwani Anand, Ivi Chatzi, Ritam Raha, Anne-Kathrin Schmuck ·

    MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents

    arXiv:2605.06334v1. Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is chall…

  315. arXiv cs.LG TIER_1 English(EN) · Zichuan Liu, Jinyu Wang, Lei Song, Jiang Bian ·

    Sample-efficient LLM Optimization with Reset Replay

    arXiv:2508.06412v3. Recent advancements in LLM post-training, particularly through reinforcement learning and preference optimization, are key to boosting their reasoning capabilities. However, these methods often suffer from low sample efficiency …

  316. arXiv cs.LG TIER_1 English(EN) · Wei Huang, Anda Cheng, Yinggui Wang, Lei Wang, Tao Wei ·

    LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning

    arXiv:2601.20375v2. Large Language Models (LLMs) can be fine-tuned on domain-specific data to enhance their performance in specialized fields. However, such data often contains numerous low-quality samples, necessitating effective data processing (…

  317. arXiv cs.LG TIER_1 English(EN) · Ekaterina Fadeeva, Maiya Goloburda, Aleksandr Rubashevskii, Roman Vashurin, Artem Shelmanov, Preslav Nakov, Mrinmaya Sachan, Maxim Panov ·

    Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search

    arXiv:2512.09538v2. Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring …

  318. arXiv cs.CL TIER_1 English(EN) · Atharva Naik, Yash Mathur, Prakam, Carolyn Rose, David Mortensen ·

    ReaComp: Compiling LLM Reasoning into Symbolic Solvers for Efficient Program Synthesis

    arXiv:2605.05485v1. LLMs can solve program synthesis tasks but remain inefficient and unreliable on hard instances requiring large combinatorial search. Given a small set of reasoning traces, we use coding agents to compile them into reusable symbolic …

  319. arXiv cs.CL TIER_1 English(EN) · Ruben Fernandez-Boullon, David N. Olivieri ·

    Patch-Effect Graph Kernels for LLM Interpretability

    arXiv:2605.06480v1. Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high…

  320. arXiv cs.AI TIER_1 English(EN) · Amal Alnouri, Andreas Hinterreiter, Christina Humer, Furui Cheng, Marc Streit ·

    Visual Fingerprints for LLM Generation Comparison

    arXiv:2605.06054v1. Large language model (LLM) outputs arise from complex interactions among prompts, system instructions, model parameters, and architecture. We refer to specific configurations of these factors as generation conditions, each of which …

  321. arXiv cs.AI TIER_1 English(EN) · Xinmiao Huang, Jinwei Hu, Rajarshi Roy, Changshun Wu, Yi Dong, Xiaowei Huang ·

    PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors

    arXiv:2605.06455v1. Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored e…

  322. arXiv cs.AI TIER_1 English(EN) · Kaifeng He, Xiaojun Zhang, Peiliang Cai, Mingwei Liu, Yanlin Wang, Chong Wang, Kaifeng Huang, Bihuan Chen, Xin Peng, Zibin Zheng ·

    Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

    arXiv:2605.05267v1. Large language models (LLMs) frequently generate defective outputs in code generation tasks, ranging from logical bugs to security vulnerabilities. While these generation failures are often treated as model-level limitations, empi…

  323. arXiv cs.AI TIER_1 English(EN) · Yujia Chen, Yang Ye, Xiao Chu, Yuchi Ma, Cuiyun Gao ·

    Schedule-and-Calibrate: Utility-Guided Multi-Task Reinforcement Learning for Code LLMs

    arXiv:2605.06111v1. Reinforcement learning (RL) with verifiable rewards has proven effective at post-training LLMs for coding, yet deploying separate task-specific specialists incurs costs that scale with the number of tasks, motivating a unified mul…

  324. arXiv cs.AI TIER_1 English(EN) · Chengjie Wang, Jingzheng Wu, Xiang Ling, Tianyue Luo, Chen Zhao ·

    Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions

    arXiv:2605.06279v1. Large language models (LLMs) are now largely involved in software development workflows, and the code they generate routinely includes third-party library (TPL) imports annotated with specific version identifiers. These version ch…

  325. arXiv cs.AI TIER_1 English(EN) · Michael A. Riegler ·

    When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

    Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-base…

  326. arXiv cs.AI TIER_1 English(EN) · David N. Olivieri ·

    Patch-Effect Graph Kernels for LLM Interpretability

    Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high-dimensional, unstructured datasets that are diffi…

  327. arXiv cs.AI TIER_1 English(EN) · Xiaowei Huang ·

    PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors

    Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are brittle and deployment-time LLM…

  328. arXiv cs.AI TIER_1 English(EN) · Dylan Bouchard ·

    Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades

    Model cascades, in which a cheap LLM defers to an expensive one on low-confidence queries, are widely used to navigate the cost-quality tradeoff at deployment. Existing approaches largely treat the deferral threshold as an empirical hyperparameter, with limited guidance on the ge…

  329. arXiv cs.CL TIER_1 English(EN) · Anne-Kathrin Schmuck ·

    MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents

    Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is challenging, as they are typically written for humans i…

  330. arXiv cs.AI TIER_1 English(EN) · Brittany I. Davidson ·

    Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

    Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context divergence as an observable within-item change in…

  331. arXiv cs.AI TIER_1 English(EN) · Bo Bai ·

    Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs

    arXiv:2511.01202v3. Despite the unprecedented empirical triumphs of LLMs across diverse real-world applications, the prevailing research paradigm remains overwhelmingly heuristic and experimentally driven, inextricably tethered to astronomica…

  332. arXiv cs.AI TIER_1 English(EN) · Hongkun Yu ·

    Evaluating Prompting and Execution-Based Methods for Deterministic Computation in LLMs

    arXiv:2605.03227v1. Large Language Models (LLMs) have demonstrated strong capabilities in natural language understanding and reasoning. However, their ability to perform exact, deterministic computation remains unclear. In this work, we systematically …

  333. arXiv cs.CL TIER_1 English(EN) · Sruly Rosenblat, Tim O'Reilly, Ilan Strauss ·

    Beyond Public Access in LLM Pre-Training Data

    arXiv:2505.00020v2. Using a legally obtained dataset of 34 copyrighted O'Reilly Media books, we apply the DE-COP membership inference attack method to investigate whether OpenAI's large language models show recognition of copyrighted content. Our r…

  334. arXiv cs.CL TIER_1 English(EN) · Ge Lei, Samuel J. Cooper ·

    Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations

    arXiv:2605.04764v1. Large language models are increasingly used as surrogate models for low-data optimization, but their optimizer-facing prediction and its uncertainty remain poorly understood. We study the surrogate belief elicited from an LLM under …

  335. arXiv cs.LG TIER_1 English(EN) · Jonas Kübler, Kailash Budhathoki, Matthäus Kleindessner, Xiong Zhou, Junming Yin, Ashish Khetan, George Karypis ·

    When LLMs get significantly worse: A statistical approach to detect model degradations

    arXiv:2602.10144v2. Minimizing the inference cost and latency of foundation models has become a crucial area of research. Optimization approaches include theoretically lossless methods and others without accuracy guarantees like quantization.…

  336. arXiv cs.LG TIER_1 English(EN) · Luze Sun, Alina Oprea, Eric Wong ·

    Syntax- and Compilation-Preserving Evasion of LLM Vulnerability Detectors

    arXiv:2602.00305v2. LLM-based vulnerability detectors are increasingly deployed in CI/CD security gating, yet their resilience to evasion under syntax- and compilation-preserving edits remains poorly understood. We evaluate five attack varian…

  337. arXiv cs.LG TIER_1 English(EN) · Sumeet Ramesh Motwani, Chuan Du, Aleksander Petrov, Christopher Davis, Philip Torr, Antonio Papania-Davis, Weishi Yan ·

    AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems

    arXiv:2604.16804v2. Optimization problems are central to decision-making in manufacturing, logistics, scheduling, and other industrial settings. Translating complicated descriptions of these problems into solver-ready formulations requires speciali…

  338. arXiv cs.LG TIER_1 English(EN) · Dingwei Zhu, Zhiheng Xi, Shihan Dou, Jiahan Li, Chenhao Huang, Junjie Ye, Sixian Li, Mingxu Chai, Yuhui Wang, Yajie Yang, Ming Zhang, Jiazheng Zhang, Shichun Liu, Caishuang Huang, Yunke Zhang, Yuran Wang, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang ·

    DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training

    arXiv:2602.05890v2. Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain (OOD) generalization, especially in LLM post-training. Recent distributional RL methods …

  339. arXiv cs.LG TIER_1 English(EN) · Xiao Wang, Yifei Zhang, YongKang Liu, Xiaocui Yang, Zihan Wang, Shi Feng, Daling Wang ·

    From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning

    arXiv:2605.04572v1. Safety alignment of Large Language Models (LLMs) is extremely fragile, as fine-tuning on a small number of benign samples can erase safety behaviors learned from millions of preference examples. Existing studies attempt to explain…

  340. arXiv cs.CL TIER_1 English(EN) · Samuel J. Cooper ·

    Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations

    Large language models are increasingly used as surrogate models for low-data optimization, but their optimizer-facing prediction and its uncertainty remain poorly understood. We study the surrogate belief elicited from an LLM under sparse observations, showing that it depends str…

  341. Hugging Face Daily Papers TIER_1 English(EN) ·

    From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning

    Safety alignment of Large Language Models (LLMs) is extremely fragile, as fine-tuning on a small number of benign samples can erase safety behaviors learned from millions of preference examples. Existing studies attempt to explain this phenomenon by comparing parameters and hidde…

  342. arXiv cs.LG TIER_1 English(EN) · Haoyu Zhang, Mohammad Zandsalimy, Shanu Sushmita ·

    Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis

    arXiv:2605.03441v1. Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses primarily rely on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems -- using for…

  343. arXiv cs.CL TIER_1 English(EN) · Haesung Lee, Gyubin Choi, Eun-Ju Lee, So-Min Lee, Youkang Ko, Dogyoon Lim, Sung-Kyoung Jang, Yohan Jo ·

    TriBench-Ko: Evaluating LLM Risks in Judicial Workflows

    arXiv:2605.03792v1. Large language models (LLMs) are increasingly integrated into legal workflows. However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance …

  344. arXiv cs.LG TIER_1 English(EN) · Yi Liu ·

    Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference

    arXiv:2605.03379v1. Repeated sampling is a standard way to spend test-time compute, but its benefit is controlled by the latent distribution of correctness across examples, not by one-call accuracy alone. We study the binary correctness layer of repeat…

  345. arXiv cs.LG TIER_1 English(EN) · Shannon K. Gallagher, Swati Rallapalli, Tyler Brooks, Chuck Loughin, Michele Sezgin, Ronald Yurko ·

    Analysis and Explainability of LLMs Via Evolutionary Methods

    arXiv:2605.02930v1. Evolutionary methods have long been useful for analysis and explanation in genetics, biology, ecology, and related fields. In this work, we extend these methods to neural networks, specifically large language models (LLMs), to bet…

  346. arXiv cs.CL TIER_1 English(EN) · Richard A. A. Jonker, Alexander Christiansen, Alexandros Maniatis, Rúben Garrido, Rogério Braunschweiger de Freitas Lima, Roman Jurowetzki, Sérgio Matos ·

    BIT.UA-AAUBS at ArchEHR-QA 2026: Evaluating Open-Source and Proprietary LLMs via Prompting in Low-Resource QA

    arXiv:2605.03618v1. This paper presents the joint participation of the BIT.UA and AAUBS groups in the ArchEHR-QA 2026 shared task, which focuses on clinical question answering and evidence grounding in a low-resource setting. Due to the absence of trai…

  347. arXiv cs.AI TIER_1 English(EN) · Jia Xiao ·

    NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

    arXiv:2605.01847v1. Outcome-only evaluation under-specifies whether an evaluated agent profile preserves the commitments required to solve a multi-turn task coherently. NeuroState-Bench is a human-calibrated benchmark that operationalizes commitment in…

  348. arXiv cs.AI TIER_1 English(EN) · Yifei Wang, Ruiyin Li, Peng Liang, Yangxiao Cai, Zengyang Li, Mojtaba Shahin, Arif Ali Khan, Qiong Feng ·

    Using LLMs in Software Design: An Empirical Study of GitHub and A Practitioner Survey

    arXiv:2605.01392v1. Recent advancements in Large Language Models (LLMs) have demonstrated significant potential across a wide range of software engineering tasks, including software design, an area traditionally regarded as highly dependent on human …

  349. arXiv cs.AI TIER_1 English(EN) · Youpeng Li, Fuxun Yu, Xinda Wang ·

    From SFT to RL: Demystifying the Post-Training Pipeline for LLM-based Vulnerability Detection

    arXiv:2602.14012v2. The integration of LLMs into vulnerability detection (VD) has shifted the field toward more interpretable and context-aware analysis. While post-training techniques have shown promise in general coding tasks, their systema…

  350. arXiv cs.LG TIER_1 English(EN) · Miaomiao Li, Hao Chen, Yang Wang, Tingyuan Zhu, Weijia Zhang, Kaijie Zhu, Kam-Fai Wong, Jindong Wang ·

    Understanding and Mitigating Bias Inheritance in LLM-based Data Augmentation on Downstream Tasks

    arXiv:2502.04419v3. Generating synthetic datasets via large language models (LLMs) has emerged as a promising approach to improve LLM performance. However, LLMs inherently reflect biases in their training data, leading to a critical challenge: when…

  351. arXiv cs.LG TIER_1 English(EN) · Hyunji Nam, Haoran Li, Natasha Jaques ·

    Maximizing mutual information between prompts and responses improve LLM personalization with no additional data or human oversight

    arXiv:2603.19294v2 Announce Type: replace Abstract: While post-training has successfully improved large language models (LLMs) across a variety of domains, these gains heavily rely on human-labeled data or external verifiers. Existing data has already been exploited, and new high…

  352. arXiv cs.CL TIER_1 English(EN) · Yohan Jo ·

    TriBench-Ko: Evaluating LLM Risks in Judicial Workflows

    Large language models (LLMs) are increasingly integrated into legal workflows. However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance and risks inherent in day-to-day judicial proces…

  353. arXiv cs.CL TIER_1 English(EN) · Sérgio Matos ·

    BIT.UA-AAUBS at ArchEHR-QA 2026: Evaluating Open-Source and Proprietary LLMs via Prompting in Low-Resource QA

    This paper presents the joint participation of the BIT.UA and AAUBS groups in the ArchEHR-QA 2026 shared task, which focuses on clinical question answering and evidence grounding in a low-resource setting. Due to the absence of training data and the strict data privacy constraint…

  354. arXiv cs.CL TIER_1 English(EN) · Shanu Sushmita ·

    Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis

    Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses primarily rely on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems -- using formalisms such as set theory, formal logic, and quan…

  355. arXiv cs.CL TIER_1 English(EN) · Yi Liu ·

    Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference

    Repeated sampling is a standard way to spend test-time compute, but its benefit is controlled by the latent distribution of correctness across examples, not by one-call accuracy alone. We study the binary correctness layer of repeated LLM inference under conditional-i.i.d. calls.…

  356. arXiv cs.CL TIER_1 English(EN) · Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Reduan Achtibat, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, Alexander Binder, Sebastian Lapuschkin ·

    Attribution-Guided Pruning for Insight and Control: Circuit Discovery and Targeted Correction in Small-scale LLMs

    arXiv:2506.13727v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are widely deployed in real-world applications, yet their internal mechanisms remain difficult to interpret and control, limiting our ability to diagnose and correct undesirable behaviors. Mech…

  357. arXiv cs.LG TIER_1 English(EN) · Jimyung Hong, Jaehyung Kim ·

    Diet Your LLM: Dimension-wise Global Pruning of LLMs via Merging Task-specific Importance Score

    arXiv:2603.23985v2 Announce Type: replace Abstract: Large language models (LLMs) have demonstrated remarkable capabilities, but their massive scale poses significant challenges for practical deployment. Structured pruning offers a promising solution by removing entire dimensions …

  358. arXiv cs.LG TIER_1 English(EN) · Timothée Chauvin, Clément Lalanne, Erwan Le Merrer, Jean-Michel Loubes, François Taïani, Gilles Tredan ·

    Token-Efficient Change Detection in LLM APIs

    arXiv:2602.11083v2 Announce Type: replace Abstract: Remote change detection in LLMs is a difficult problem. Existing methods are either too expensive for deployment at scale, or require initial white-box access to model weights or grey-box access to log probabilities. We aim to a…

  359. arXiv cs.LG TIER_1 English(EN) · Nickil Maveli, Antonio Vergari, Shay B. Cohen ·

    Can LLMs Compress (and Decompress)? Evaluating Code Understanding and Execution via Invertibility

    arXiv:2601.13398v2 Announce Type: replace Abstract: LLMs demonstrate strong performance on code benchmarks, yet consistent reasoning across forward and backward execution remains elusive. We present RoundTripCodeEval (RTCE), a benchmark of four code execution reasoning tasks that…

  360. arXiv cs.CL TIER_1 English(EN) · Ian Rios-Sialer ·

    The Homogenization Problem in LLMs: Towards Meaningful Diversity in AI Safety

    arXiv:2601.06116v3 Announce Type: replace-cross Abstract: Generative AI models reproduce the human biases in their training data and further amplify them through mechanisms such as mode collapse. The loss of diversity produces homogenization, which not only harms the minoritized …

  361. arXiv cs.CL TIER_1 English(EN) · Antonio Valerio Miceli Barone, Poon Tsz Nok ·

    Improving LLM Code Reasoning via Semantic Equivalence Self-Play with Formal Verification

    arXiv:2604.17010v2 Announce Type: replace Abstract: We introduce a self-play framework for semantic equivalence in Haskell, utilizing formal verification to guide adversarial training between a generator and an evaluator. The framework leverages Liquid Haskell proofs for validati…

  362. arXiv cs.CL TIER_1 English(EN) · Ziyi Zhu, Olivier Tieleman, Alexey Bukhtiyarov, Jinghong Chen ·

    CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation

    arXiv:2603.01865v3 Announce Type: replace Abstract: LLM-as-judge evaluation has become standard practice for open-ended model assessment; however, judges exhibit systematic biases that cannot be averaged out by increasing the number of scenarios or generations. These biases are o…

  363. arXiv cs.CL TIER_1 English(EN) · Pawel Kaplanski (Kaplanski AI Lab) ·

    Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates

    arXiv:2605.02236v1 Announce Type: cross Abstract: Recursive language-model loops often settle into recognizable attractor-like patterns. The practical question is how much injected text is needed to move a settled loop somewhere else, and whether that move lasts. We study this in…

  364. arXiv cs.CL TIER_1 English(EN) · Noga Peleg Pelc, Gal A. Kaminka, Yoav Goldberg ·

    A Language for Describing Agentic LLM Contexts

    arXiv:2605.01920v1 Announce Type: cross Abstract: Large language models are increasingly used within larger systems ("LLM agents"). These make a sequence of LLM calls, each call providing the LLM with a combination of instructions, observations, and interaction history. The desig…

  365. arXiv cs.CL TIER_1 English(EN) · Sadia Asif, Mohammad Mohammadi Amiri ·

    RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs

    arXiv:2605.01913v1 Announce Type: cross Abstract: Fine-tuning safety-aligned language models for downstream tasks often leads to substantial degradation of refusal behavior, making models vulnerable to adversarial misuse. While prior work has shown that safety-relevant features a…

  366. arXiv cs.CL TIER_1 English(EN) · Benjamin Warner, Ratna Sagari Grandhi, Max Kieffer, Aymane Ouraq, Saurav Panigrahi, Geetu Ambwani, Kunal Bagga, Nikhil Khandekar, Arya Hariharan, Nishant Mishra, Manish Ram, Shamus Sim Zi Yang, Ahmed Essouaied, Adepoju Jeremiah Moyondafoluwa, Robert Schol ·

    Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks

    arXiv:2605.01417v1 Announce Type: new Abstract: Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing suites have either saturated, heavil…

  367. arXiv cs.CL TIER_1 English(EN) · Koshiro Saito, Ryuto Koike, Masahiro Kaneko, Naoaki Okazaki ·

    LLM Output Detectability and Task Performance Can be Jointly Optimized

    arXiv:2605.01350v1 Announce Type: new Abstract: Detecting machine-generated text is essential for transparency and accountability when deploying large language models (LLMs). Among detection approaches, watermarking is a statistically reliable method by design -- it embeds detect…

  368. arXiv cs.CL TIER_1 English(EN) · Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Goa, Juming Xiong, Zhijun Yin, Bradley A. Malin ·

    CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine

    arXiv:2605.01011v1 Announce Type: new Abstract: Medical large language model (LLM) evaluations rely on simplified, exam-style benchmarks that rarely reflect the ambiguity of real-world medical inquiries. We introduce the CLinical Evaluation of Ambiguity and Reliability (CLEAR) fr…

  369. arXiv cs.AI TIER_1 English(EN) · Lehan He, Zeren Chen, Zhe Zhang, Xiang Gao, Lu Sheng ·

    Effective LLM Code Refinement via Property-Oriented and Structurally Minimal Feedback

    arXiv:2506.18315v2 Announce Type: replace-cross Abstract: LLMs excel at code generation, yet ensuring the functional correctness of their outputs remains a persistent challenge. While recent studies have applied Test-Driven Development (TDD) to refine code, these methods are ofte…

  370. arXiv cs.AI TIER_1 English(EN) · Abdurrahman Javat, Allan Kazakov ·

    Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference

    arXiv:2605.00519v2 Announce Type: cross Abstract: The operational landscape of local Large Language Model (LLM) inference has shifted from lightweight models to datacenter-class weights exceeding 70B parameters, creating profound systems challenges for consumer hardware. This pap…

  371. arXiv cs.AI TIER_1 English(EN) · Fazle Rabbi, Lin Ling, Song Wang, Jinqiu Yang ·

    Social Bias in LLM-Generated Code: Benchmark and Mitigation

    arXiv:2605.00382v2 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly deployed to generate code for human-centered applications where demographic fairness is critical. However, existing evaluations focus almost exclusively on functional correctness, leav…

  372. arXiv cs.AI TIER_1 English(EN) · Qinyuan Wu, Soumi Das, Mahsa Amani, Arijit Nag, Seungeon Lee, Krishna P. Gummadi, Abhilasha Ravichander, Muhammad Bilal Zafar ·

    To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

    arXiv:2605.00737v1 Announce Type: new Abstract: Agentic AI architectures augment LLMs with external tools, unlocking strong capabilities. However, tool use is not always beneficial; some calls may be redundant or even harmful. Effective tool use, therefore, hinges on a core LLM d…

  373. arXiv cs.CL TIER_1 English(EN) · Pawel Kaplanski ·

    Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates

    Recursive language-model loops often settle into recognizable attractor-like patterns. The practical question is how much injected text is needed to move a settled loop somewhere else, and whether that move lasts. We study this in 30-step recursive loops by separating the model f…

  374. arXiv cs.CL TIER_1 Français(FR) · Ryan Lail, Luke Markham ·

    On Cost-Effective LLM-as-a-Judge Improvement Techniques

    arXiv:2604.13717v2 Announce Type: replace Abstract: Using a language model to score or rank candidate responses has become a scalable alternative to human evaluation in reinforcement learning from human feedback (RLHF) pipelines, benchmarking, and application layer evaluations. H…

  375. arXiv cs.CL TIER_1 English(EN) · Zongqi Wang, Tianle Gu, Chen Gong, Xin Tian, Siqi Bao, Yujiu Yang ·

    SCAN: Structured Capability Assessment and Navigation for LLMs

    arXiv:2505.06698v4 Announce Type: replace Abstract: Evaluating Large Language Models (LLMs) has become increasingly important, with automatic evaluation benchmarks gaining prominence as alternatives to human evaluation. While existing research has focused on approximating model r…

  376. arXiv cs.CL TIER_1 English(EN) · Sailesh Panda, Pritam Kadasi, Abhishek Upperwal, Mayank Singh ·

    When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

    arXiv:2605.00817v1 Announce Type: new Abstract: Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We study this question through…

  377. arXiv cs.LG TIER_1 English(EN) · Pavlin G. Poličar, Andraž Pevcin, Blaž Zupan ·

    Generating Statistical Charts with Validation-Driven LLM Workflows

    arXiv:2605.00800v1 Announce Type: new Abstract: Generating diverse, readable statistical charts from tabular data remains challenging for LLMs, as many failures become apparent after rendering and are not detectable from data or code alone. Existing chart datasets also rarely pro…

  378. arXiv cs.LG TIER_1 English(EN) · Jiale Fu, Yuchu Jiang, Peijun Wu, Chonghan Liu, Joey Tianyi Zhou, Xu Yang ·

    Rethinking LLM Ensembling from the Perspective of Mixture Models

    arXiv:2605.00419v1 Announce Type: new Abstract: Model ensembling is a well-established technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. Th…

  379. arXiv cs.CL TIER_1 English(EN) · Yoav Goldberg ·

    A Language for Describing Agentic LLM Contexts

    Large language models are increasingly used within larger systems ("LLM agents"). These make a sequence of LLM calls, each call providing the LLM with a combination of instructions, observations, and interaction history. The design of the encoded information and its structure pla…

  380. arXiv cs.CL TIER_1 English(EN) · Mohammad Mohammadi Amiri ·

    RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs

    Fine-tuning safety-aligned language models for downstream tasks often leads to substantial degradation of refusal behavior, making models vulnerable to adversarial misuse. While prior work has shown that safety-relevant features are encoded in structured representations within th…

  381. arXiv cs.CL TIER_1 English(EN) · Mayank Singh ·

    When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

    Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We study this question through a controlled diagnostic benchmark for procedura…

  382. arXiv cs.LG TIER_1 English(EN) · Blaž Zupan ·

    Generating Statistical Charts with Validation-Driven LLM Workflows

    Generating diverse, readable statistical charts from tabular data remains challenging for LLMs, as many failures become apparent after rendering and are not detectable from data or code alone. Existing chart datasets also rarely provide fully aligned artifacts, such as executable…

  383. arXiv cs.AI TIER_1 English(EN) · Muhammad Bilal Zafar ·

    To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

    Agentic AI architectures augment LLMs with external tools, unlocking strong capabilities. However, tool use is not always beneficial; some calls may be redundant or even harmful. Effective tool use, therefore, hinges on a core LLM decision: whether to call or not call a tool, whe…

  384. arXiv cs.AI TIER_1 English(EN) · Abdurrahman Javat ·

    Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference

    The operational landscape of local Large Language Model (LLM) inference has shifted from lightweight models to datacenter-class weights exceeding 70B parameters, creating profound systems challenges for consumer hardware. This paper presents a systematic empirical analysis of the…

  385. arXiv cs.CL TIER_1 English(EN) · Xu Yang ·

    Rethinking LLM Ensembling from the Perspective of Mixture Models

    Model ensembling is a well-established technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. This idea has been naturally extended to large lan…

  386. arXiv cs.AI TIER_1 English(EN) · Jinqiu Yang ·

    Social Bias in LLM-Generated Code: Benchmark and Mitigation

    Large Language Models (LLMs) are increasingly deployed to generate code for human-centered applications where demographic fairness is critical. However, existing evaluations focus almost exclusively on functional correctness, leaving social bias in LLM-generated code largely unex…

  387. arXiv cs.AI TIER_1 English(EN) · Jon-Paul Cacioli ·

    Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation

    arXiv:2604.27405v1 Announce Type: cross Abstract: We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family pairs were tested: Llama 3 to 3.…

  388. arXiv cs.CL TIER_1 English(EN) · Solomon Messing ·

    Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking

    arXiv:2604.11581v4 Announce Type: replace Abstract: LLM evaluations drive which models get deployed, which safety standards get adopted, and which research conclusions get published. Yet standard confidence intervals ignore variability from prompt phrasing, model temperature, and…

  389. arXiv cs.AI TIER_1 English(EN) · Ziyao Xu, Cong Wang, Houfeng Wang ·

    Investigating More Explainable and Partition-Free Compositionality Estimation for LLMs: A Rule-Generation Perspective

    arXiv:2604.27340v1 Announce Type: new Abstract: Compositional generalization tests are often used to estimate the compositionality of LLMs. However, such tests have the following limitations: (1) they only focus on the output results without considering LLMs' understanding of sam…

  390. arXiv cs.LG TIER_1 English(EN) · Ahan Gupta, Zhihao Wang, Neel Dani, Masahiro Tanaka, Olatunji Ruwase, Minjia Zhang ·

    AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism

    arXiv:2604.27089v1 Announce Type: new Abstract: Large-language-models (LLMs) demonstrate enormous utility in long-context tasks which require processing prompts that consist of tens to hundreds of thousands of tokens. However, existing LLM training libraries do not provide easy t…

  391. arXiv cs.LG TIER_1 English(EN) · Jun Yeon Won, Xin Jin, Shiqing Ma, Zhiqiang Lin ·

    REBENCH: A Procedural, Fair-by-Construction Benchmark for LLMs on Stripped-Binary Types and Names (Extended Version)

    arXiv:2604.27319v1 Announce Type: cross Abstract: Large Language Models (LLMs) have achieved remarkable progress in recent years, driving their adoption across a wide range of domains, including computer security. In reverse engineering, LLMs are increasingly applied to critical …

  392. arXiv cs.CL TIER_1 English(EN) · Jon-Paul Cacioli ·

    Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation

    We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family pairs were tested: Llama 3 to 3.1 (+1.6 points) and Qwen 2.5 to 3 (+2.8 points). O…

  393. Hugging Face Daily Papers TIER_1 English(EN) ·

    Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation

    We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family pairs were tested: Llama 3 to 3.1 (+1.6 points) and Qwen 2.5 to 3 (+2.8 points). O…

  394. arXiv cs.CL TIER_1 English(EN) · Wenxuan Wang, Juluan Shi, Zixuan Ling, Yuk-Kit Chan, Chaozheng Wang, Cheryl Lee, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu ·

    Learning to Ask: When LLM Agents Meet Unclear Instruction

    arXiv:2409.00557v4 Announce Type: replace Abstract: Equipped with the capability to call functions, modern large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. However, the effective execution of thes…

  395. arXiv cs.AI TIER_1 English(EN) · Zoe Kotti, Konstantina Dritsa, Diomidis Spinellis, Panos Louridas ·

    The Fools are Certain; the Wise are Doubtful: Exploring LLM Confidence in Code Completion

    arXiv:2508.16131v2 Announce Type: replace-cross Abstract: Code completion entails the task of providing missing tokens given a surrounding context. It can boost developer productivity while providing a powerful code discovery tool. Following the Large Language Model (LLM) wave, c…

  396. arXiv cs.AI TIER_1 English(EN) · Emre Furkan Akyol, Mehmet Dedeler, Eray Tüzün ·

    ImproBR: Bug Report Improver Using LLMs

    arXiv:2604.26142v1 Announce Type: cross Abstract: Bug tracking systems play a crucial role in software maintenance, yet developers frequently struggle with low-quality user-submitted reports that omit essential details such as Steps to Reproduce (S2R), Observed Behavior (OB), and…

  397. arXiv cs.CL TIER_1 English(EN) · Sasha Ronaghi, Chloe Stanwyck, Asad Aali, Amir Ronaghi, Miguel Fuentes, Tina Hernandez-Boussard, Emily Alsentzer ·

    Training-Free Adaptation of New-Generation LLMs using Legacy Clinical Models

    arXiv:2601.03423v3 Announce Type: replace Abstract: Adapting language models to the clinical domain through continued pretraining and instruction tuning requires costly retraining for each new model generation. We propose Cross-Architecture Proxy Tuning (CAPT), a model-ensembling…

  398. arXiv cs.CL TIER_1 English(EN) · Samee Arif, Naihao Deng, Zhijing Jin, Rada Mihalcea ·

    One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety

    arXiv:2604.25921v1 Announce Type: new Abstract: Large Language Models (LLMs) are trained to refuse harmful requests, yet they remain vulnerable to jailbreak attacks that exploit weaknesses in conversational safety mechanisms. We introduce Incremental Completion Decomposition (ICD…

  399. arXiv cs.CL TIER_1 English(EN) · Hongyeon Yu, Young-Bum Kim, Yoon Kim ·

    FlowBot: Inducing LLM Workflows with Bilevel Optimization and Textual Gradients

    arXiv:2604.26258v1 Announce Type: new Abstract: LLM workflows, which coordinate structured calls to individual LLMs (each augmented with varying instructions and tools) to achieve a particular goal, offer a promising path towards extending the capabilities of LLMs and building po…

  400. Hugging Face Daily Papers TIER_1 English(EN) ·

    AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism

    Large-language-models (LLMs) demonstrate enormous utility in long-context tasks which require processing prompts that consist of tens to hundreds of thousands of tokens. However, existing LLM training libraries do not provide easy to use abstractions to optimize for long-context …

  401. arXiv cs.CL TIER_1 English(EN) · Avinash Amballa, Yashas Malur Saidutta, Chi-Heng Lin, Vivek Kulkarni, Srinivas Chappidi ·

    VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLMs

    arXiv:2512.12072v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly being used to generate synthetic datasets for the evaluation and training of downstream models. However, prior work has noted that such generated data lacks diversity. In this paper,…

  402. arXiv cs.CL TIER_1 English(EN) · Ocean Monjur, Shahriar Kabir Nahin, Anshuman Chhabra ·

    Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling

    arXiv:2604.25098v1 Announce Type: cross Abstract: While current Large Language Models (LLMs) exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), their massive parameter counts and high inference costs have motivated the development of pruning method…

  403. arXiv cs.CL TIER_1 English(EN) · Huyen Nguyen, Haoxuan Zhang, Yang Zhang, Junhua Ding, Haihua Chen ·

    LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation

    arXiv:2604.25665v1 Announce Type: new Abstract: Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarizatio…

  404. arXiv cs.CL TIER_1 English(EN) · Alif Munim, Jun Ma, Omar Ibrahim, Alhusain Abdalla, Shuolin Yin, Leo Chen, Bo Wang ·

    Benchmarking and Adapting On-Device LLMs for Clinical Decision Support

    arXiv:2601.03266v2 Announce Type: replace Abstract: Large language models (LLMs) have rapidly advanced in clinical decision-making, yet the deployment of proprietary systems is hindered by privacy concerns and reliance on cloud-based infrastructure. Open-source alternatives allow…

  405. arXiv cs.LG TIER_1 English(EN) · Keita Broadwater ·

    Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing

    arXiv:2602.11786v2 Announce Type: replace Abstract: Traditional benchmarks for large language models (LLMs), such as HELM and AIR-BENCH, primarily assess safety through breadth-oriented evaluation across diverse tasks and risk categories. However, real-world deployment often expo…

  406. arXiv cs.CL TIER_1 English(EN) · Yoon Kim ·

    FlowBot: Inducing LLM Workflows with Bilevel Optimization and Textual Gradients

    LLM workflows, which coordinate structured calls to individual LLMs (each augmented with varying instructions and tools) to achieve a particular goal, offer a promising path towards extending the capabilities of LLMs and building powerful systems that can tackle diverse tasks. Ho…

  407. arXiv cs.CL TIER_1 English(EN) · Haihua Chen ·

    LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation

    Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven …

  408. Hugging Face Daily Papers TIER_1 English(EN) ·

    LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation

    Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven …

  409. arXiv cs.LG TIER_1 English(EN) · Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, Yujin Tang ·

    TRINITY: An Evolved LLM Coordinator

    arXiv:2512.04695v3 Announce Type: replace Abstract: Combining diverse foundation models is promising, but weight-merging is limited by mismatched architectures and closed APIs. Trinity addresses this with a lightweight coordinator that orchestrates collaboration among large langu…

  410. arXiv cs.LG TIER_1 English(EN) · Zhengding Hu, Hehua Ouyang, Chang Chen, Zaifeng Pan, Yue Guan, Zhongkai Yu, Zhen Wang, Steven Swanson, Yufei Ding ·

    JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training

    arXiv:2604.23838v1 Announce Type: new Abstract: We present JigsawRL, a cost-efficient framework that explores Pipeline Multiplexing as a new dimension of RL parallelism. JigsawRL decomposes each pipeline into a Sub-Stage Graph that exposes the intra-stage and inter-worker imbalan…

  411. arXiv cs.LG TIER_1 English(EN) · Xuancheng Li, Haitao Li, Yujia Zhou, Yiqun Liu, Qingyao Ai ·

    Beyond Experience Retrieval: Learning to Generate Utility-Optimized Structured Experience for Frozen LLMs

    arXiv:2602.02556v2 Announce Type: replace Abstract: Large language models (LLMs) are largely static and often redo reasoning or repeat mistakes. Prior experience reuse typically relies on external retrieval, which is similarity-based, can introduce noise, and adds latency. We int…

  412. arXiv cs.LG TIER_1 English(EN) · Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma ·

    Continual Calibration: Coverage Can Collapse Before Accuracy in Lifelong LLM Fine-Tuning

    arXiv:2604.23987v1 Announce Type: new Abstract: Continual learning for large language models is typically evaluated through accuracy retention under sequential fine-tuning. We argue that this perspective is incomplete, because uncertainty reliability can degrade earlier and more …

  413. arXiv cs.CL TIER_1 English(EN) · Rohith Reddy Bellibatlu ·

    JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems

    arXiv:2604.23478v1 Announce Type: new Abstract: Large language models are increasingly deployed as automated judges for evaluating other models, yet the stability of their verdicts under semantically equivalent prompt paraphrases remains unmeasured. We introduce JudgeSense, a fra…

  414. arXiv cs.CL TIER_1 English(EN) · Alessio Sordo, Lingxiao Du, Meeka-Hanna Lenisa, Evgeny Bogdanov, Maxim Romanovsky ·

    STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

    arXiv:2604.24544v1 Announce Type: cross Abstract: The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such datasets is challenging due t…

  415. arXiv cs.CL TIER_1 English(EN) · Chenyang Yang, Yike Shi, Qianou Ma, Michael Xieyang Liu, Christian Kästner, Tongshuang Wu ·

    What Prompts Don't Say: Understanding and Managing Underspecification in LLM Prompts

    arXiv:2505.13360v3 Announce Type: replace Abstract: Prompt underspecification is a common challenge when interacting with LLMs. In this paper, we present an in-depth analysis of this problem, showing that while LLMs can often infer unspecified requirements by default (41.1%), suc…

  416. arXiv cs.CL TIER_1 English(EN) · Zhiqiu Xu, Shibo Jin, Shreya Arya, Mayur Naik ·

    MathDuels: Evaluating LLMs as Problem Posers and Solvers

    arXiv:2604.21916v2 Announce Type: replace Abstract: As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers …

  417. arXiv cs.AI TIER_1 English(EN) · Huzaifa Arif, Keerthiram Murugesan, Ching-Yun Ko, Pin-Yu Chen, Payel Das, Alex Gittens ·

    Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models

    arXiv:2511.08484v2 Announce Type: replace Abstract: We propose patching for large language models (LLMs) like software versions, a lightweight and modular approach for addressing safety vulnerabilities. While vendors release improved LLM versions, major releases are costly, infre…

  418. arXiv cs.LG TIER_1 English(EN) · Juyeon Yoon, Somin Kim, Robert Feldt, Shin Yoo ·

    Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs

    arXiv:2509.17314v3 Announce Type: replace-cross Abstract: Software increasingly relies on the emergent capabilities of Large Language Models (LLMs), from natural language understanding to program analysis and generation. Yet testing them on specific tasks remains difficult and co…

  419. arXiv cs.CL TIER_1 English(EN) · Yue Liu, Yingwei Ma, Yibo Miao, Yanhao Li, Yuchong Xie, Xinlong Yang, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi ·

    KLong: Training LLM Agent for Extremely Long-horizon Tasks

    arXiv:2602.17547v3 Announce Type: replace-cross Abstract: This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. S…

  420. arXiv cs.LG TIER_1 English(EN) · Frank Xiao, Santiago Aranguri ·

    Probe-Based Data Attribution: Discovering and Mitigating Undesirable Behaviors in LLM Post-Training

    arXiv:2602.11079v3. We propose probe-based data attribution, a method that traces behavioral changes in post-trained language models to responsible training datapoints. By computing activation-difference vectors for both test prompts and preference…

  421. arXiv cs.CL TIER_1 English(EN) · Anshuman Chhabra ·

    Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling

    While current Large Language Models (LLMs) exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), their massive parameter counts and high inference costs have motivated the development of pruning methods that can reduce model size without sacrificing p…

  422. arXiv cs.CL TIER_1 English(EN) · Maxim Romanovsky ·

    STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

    The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such datasets is challenging due to privacy concerns, regulatory restrictions, and t…

  423. Hugging Face Daily Papers TIER_1 English(EN) ·

    STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

    The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such datasets is challenging due to privacy concerns, regulatory restrictions, and t…

  424. arXiv cs.CL TIER_1 English(EN) · Sourav Saha, Mandar Mitra, Aditya Dutta ·

    LLMs as Assessors: Right for the Right Reason?

    arXiv:2601.08919v2. A good deal of recent research has focused on how Large Language Models (LLMs) may be used as judges in place of humans to evaluate the quality of the output produced by various text / image processing systems. Within this…

  425. arXiv cs.AI TIER_1 English(EN) · Manuel Alejandro Borroto Santana, Erica Coppolillo, Francesco Calimeri, Giuseppe Manco, Simona Perri, Francesco Ricca ·

    BLAST: Benchmarking LLMs with ASP-based Structured Testing

    arXiv:2604.22306v1. Large Language Models (LLMs) have demonstrated remarkable performance across a broad spectrum of tasks, including natural language understanding, dialogue systems, and code generation. Despite evident progress, less attention has …

  426. arXiv cs.LG TIER_1 English(EN) · Zhaokun Wang, Jinyu Guo, Jingwen Pu, Hongli Pu, Meng Yang, Xunlei Chen, Jie Ou, Wenyi Li, Guangchun Luo, Wenhong Tian ·

    CAP: Controllable Alignment Prompting for Unlearning in LLMs

    arXiv:2604.21251v2. Large language models (LLMs) trained on unfiltered corpora inherently risk retaining sensitive information, necessitating selective knowledge unlearning for regulatory compliance and ethical safety. However, existing parameter-m…

  427. arXiv cs.LG TIER_1 English(EN) · Emil Ryd, Henning Bartsch, Julian Stastny, Joe Benton, Vivek Hebbar ·

    Removing Sandbagging in LLMs by Training with Weak Supervision

    arXiv:2604.22082v1. As AI systems begin to automate complex tasks, supervision increasingly relies on weaker models or limited human oversight that cannot fully verify output quality. A model more capable than its supervisors could exploit this gap thr…

  428. arXiv cs.AI TIER_1 English(EN) · Francesco Ricca ·

    BLAST: Benchmarking LLMs with ASP-based Structured Testing

    Large Language Models (LLMs) have demonstrated remarkable performance across a broad spectrum of tasks, including natural language understanding, dialogue systems, and code generation. Despite evident progress, less attention has been paid to their effectiveness in handling decla…

  429. arXiv cs.AI TIER_1 English(EN) · Vivek Hebbar ·

    Removing Sandbagging in LLMs by Training with Weak Supervision

    As AI systems begin to automate complex tasks, supervision increasingly relies on weaker models or limited human oversight that cannot fully verify output quality. A model more capable than its supervisors could exploit this gap through sandbagging, producing work that appears ac…

  430. Hugging Face Daily Papers TIER_1 English(EN) ·

    MathDuels: Evaluating LLMs as Problem Posers and Solvers

    As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a sel…

  431. arXiv cs.CL TIER_1 English(EN) · Mayur Naik ·

    MathDuels: Evaluating LLMs as Problem Posers and Solvers

    As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a sel…

  432. arXiv cs.LG TIER_1 English(EN) · Wenhong Tian ·

    CAP: Controllable Alignment Prompting for Unlearning in LLMs

    Large language models (LLMs) trained on unfiltered corpora inherently risk retaining sensitive information, necessitating selective knowledge unlearning for regulatory compliance and ethical safety. However, existing parameter-modifying methods face fundamental limitations: high …

  433. Hugging Face Daily Papers TIER_1 English(EN) ·

    HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing

    Evaluating the writing capabilities of large language models (LLMs) remains a significant challenge due to the multidimensional nature of writing skills and the limitations of existing metrics. LLM's performance in thousand-words level and open-ended writing is inadequately asses…

  434. Ahead of AI (Sebastian Raschka) TIER_1 English(EN) · Sebastian Raschka, PhD ·

    Understanding the 4 Main Approaches to LLM Evaluation (From Scratch)

    Multiple-Choice Benchmarks, Verifiers, Leaderboards, and LLM Judges with Code Examples

  435. Ahead of AI (Sebastian Raschka) TIER_1 English(EN) · Sebastian Raschka, PhD ·

    Coding LLMs from the Ground Up: A Complete Course

    Why build LLMs from scratch? It's probably the best and most efficient way to learn how LLMs really work. Plus, many readers have told me they had a lot of fun doing it.

  436. arXiv stat.ML TIER_1 English(EN) · Chi-Kuang Yeh ·

    Quantifying and Auditing LLM Evaluation via Positive-Unlabeled Learning

    Large Language Models (LLMs) are increasingly used as judges for scalable evaluation, yet such LLM-as-a-Judge systems exhibit systematic biases that are decoupled from semantic quality, most notably verbosity bias. Meanwhile, human supervision is costly and typically selective…

  437. LessWrong (AI tag) TIER_1 English(EN) · gwern ·

    Guardian Angels: LLM Personalization for Productivity and Security

    Powerful LLMs will be deployed at global scale in the next few years, and will dominate the Internet, and increasingly, ordinary life. As of mid-2026, there is no coherent vision for how knowledge professionals, or ordinary people, will be able to harness these LLMs for large …

  438. arXiv stat.ML TIER_1 English(EN) · Constantinos Antoniou ·

    LLMs on Tabular Data with Limited Semantics: Evidence from Industrial Car Retrofit Prediction

    Industrial retrofit planning depends on structured operational data rather than free text: planners must estimate whether a newly registered prototype will require a retrofit, which retrofit package it will need, and how long the work will take. We study an industrial dataset lin…

  439. arXiv stat.ML TIER_1 English(EN) · Alexandre Belloni, Yan Chen, Yehua Wei ·

    Online Pandora's Box for Contextual LLM Cascading

    arXiv:2606.07392v1. Motivated by Large Language Model (LLM) cascading, we propose an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs. In each period, a decision-maker observes a request context and faces a two-pha…

  440. LessWrong (AI tag) TIER_1 English(EN) · Matthew Khoriaty ·

    Taking the Training Wheels Off: Aligning LLMs without Personas

    If you told an AI Alignment researcher in 2018 a…

  441. LessWrong (AI tag) TIER_1 English(EN) · Vidya Ganga ·

    Can LLMs even teach? Exploring the Teacher Axis

    TLDR: As a passionate teacher, it has pained my heart to watch my students lose deeper critical thinking skills and independent reasoning. But attempting to build a constitutionally constrained AI using prompt engineering that acted more Socratically …

  442. LessWrong (AI tag) TIER_1 English(EN) · Owain_Evans ·

    Out-of-Context Reasoning (OOCR) in LLMs: A Short Primer and Reading List

    Out-of-context reasoning (OOCR) is a concept relevant to LLM generalization and AI alignment. Also available as a PDF (https://owainevans.github.io/pdfs/oocr_primer_latex.pdf). …

  443. arXiv stat.ML TIER_1 English(EN) · Biswa Sengupta ·

    Representation Without Reward: A JEPA Audit for LLM Fine-Tuning

    arXiv:2605.15394v1. Joint-embedding predictive architectures (JEPAs) propose that a model should learn more useful abstractions when trained to predict latent representations rather than observed outputs. For autoregressive language-model fine-tuning…

  444. arXiv stat.ML TIER_1 English(EN) · Biswa Sengupta ·

    Representation Without Reward: A JEPA Audit for LLM Fine-Tuning

    Joint-embedding predictive architectures (JEPAs) propose that a model should learn more useful abstractions when trained to predict latent representations rather than observed outputs. For autoregressive language-model fine-tuning the principle entails a stricter requirement: the…

  445. arXiv stat.ML TIER_1 English(EN) · Stef van Buuren ·

    LLMs as Implicit Imputers: Uncertainty Should Scale with Missing Information

    arXiv:2605.13188v1. Large language models (LLMs) are increasingly deployed in settings where the available context is incomplete or degraded. We argue that an LLM generating answers under incomplete context can be viewed as an implicit imputer, and eva…

  446. arXiv stat.ML TIER_1 English(EN) · Zetai Cen, Chenfei Gu, Jin Zhu, Ting Li, Yunxiao Chen, Chengchun Shi ·

    Learning Perturbations to Extrapolate Your LLM

    arXiv:2605.13284v1. Recent advancements in large language models demonstrate that injecting perturbations can substantially enhance extrapolation performance. However, current approaches often rely on discrete perturbations with fixed designs, which li…

  447. LessWrong (AI tag) TIER_1 English(EN) · Santiago Aranguri ·

    Predicting Rare LLM Failures with 30× Fewer Rollouts

    TL;DR: We estimate how often Qwen 3 4B exhibits rare harmful behaviors with 30× fewer rollouts than naive sampling, using a new method that interpolates between the model and a less-safe variant in logit space. Authors: Francisco Pernice (MIT), Santiag…

  448. arXiv stat.ML TIER_1 English(EN) · Chengchun Shi ·

    Learning Perturbations to Extrapolate Your LLM

    Recent advancements in large language models demonstrate that injecting perturbations can substantially enhance extrapolation performance. However, current approaches often rely on discrete perturbations with fixed designs, which limits their flexibility. In this work, we propose…

  449. arXiv stat.ML TIER_1 English(EN) · Stef van Buuren ·

    LLMs as Implicit Imputers: Uncertainty Should Scale with Missing Information

    Large language models (LLMs) are increasingly deployed in settings where the available context is incomplete or degraded. We argue that an LLM generating answers under incomplete context can be viewed as an implicit imputer, and evaluated against a criterion from the multiple imp…

  450. arXiv stat.ML TIER_1 English(EN) · Nicolas Menet, Andreas Krause, Abbas Rahimi ·

    POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles

    arXiv:2605.07775v1. Balancing exploration and exploitation is a core challenge in sequential decision-making and black-box optimization. We introduce POETS (Policy Ensembles for Thompson Sampling), a novel …

  451. arXiv stat.ML TIER_1 English(EN) · James Fiedler ·

    Bias and Uncertainty in LLM-as-a-Judge Estimation

    arXiv:2605.06939v1. LLM-as-a-Judge evaluation has become a standard tool for assessing base model performance. However, characterizing performance via the naive estimator, i.e., raw judge outputs, is systematically biased. Recent work has proposed es…

  452. arXiv stat.ML TIER_1 English(EN) · James Fiedler ·

    Bias and Uncertainty in LLM-as-a-Judge Estimation

    LLM-as-a-Judge evaluation has become a standard tool for assessing base model performance. However, characterizing performance via the naive estimator, i.e., raw judge outputs, is systematically biased. Recent work has proposed estimators to correct this bias, but their reliabili…

  453. LessWrong (AI tag) TIER_1 English(EN) · NickyP ·

    Axes of Planning in LLMs + Partial Lit Review

    Epistemic Status: Written over the course of a couple days at Inkhaven (https://inkhaven.blog/). Some of the info is old so some newer papers are excluded. TL;DR: People tal…

  454. arXiv stat.ML TIER_1 English(EN) · Vaneet Aggarwal ·

    Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking

    Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level…

  455. arXiv stat.ML TIER_1 English(EN) · John Sous ·

    Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

    Training loss and throughput can hide distinct internal representation in language-model training. To examine these hidden mechanics, we use spectral measurements as practical and operational diagnostics. Using a controlled family of decoder-only models adapted from the modded Na…

  456. arXiv cs.CV TIER_1 English(EN) · Wei Liu, Hongkai Liu, Zhiying Deng, Yee Whye Teh, Wee Sun Lee ·

    From Backward Spreading to Forward Replay: Revisiting Target Construction in LLM Parameter Editing

    arXiv:2605.00358v1. LLM parameter editing methods commonly rely on computing an ideal target hidden-state at a target layer (referred as anchor point) and distributing the target vector to multiple preceding layers (commonly known as backward spreadi…

  457. arXiv cs.CV TIER_1 English(EN) · Wee Sun Lee ·

    From Backward Spreading to Forward Replay: Revisiting Target Construction in LLM Parameter Editing

    LLM parameter editing methods commonly rely on computing an ideal target hidden-state at a target layer (referred as anchor point) and distributing the target vector to multiple preceding layers (commonly known as backward spreading) for cooperative editing. Although widely used …

  458. LessWrong (AI tag) TIER_1 English(EN) · Santiago Aranguri ·

    Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training

    Research by Frank Xiao (SPAR mentee) and Santiago Aranguri (Goodfire). Post-training can introduce undesired side effects that are difficult to detect and even harder to trace to specific training datapoi…

  459. LessWrong (AI tag) TIER_1 English(EN) · keshavs ·

    Introspection Adapters: Training LLMs to Report Their Learned Behaviors

    Authors: Keshav Shenoy, Li Yang, Abhay Sheshadri, Soren Mindermann, Jack Lindsey, Sam Marks, and Rowan Wang. Paper: https://arxiv.org/pdf/2604.16812 …

  460. arXiv cs.CV TIER_1 English(EN) · Mengyu Wang, Xiaoying Zhi, Zhiyi Li, Robin Schmucker, Shay B. Cohen, Tiejun Ma, Fran Silavong ·

    Self Knowledge Re-expression: A Fully Local Method for Adapting LLMs to Tasks Using Intrinsic Knowledge

    arXiv:2604.22939v1. While the next-token prediction (NTP) paradigm enables large language models (LLMs) to express their intrinsic knowledge, its sequential nature constrains performance on specialized, non-generative tasks. We attribute this perform…

  461. Smol AINews TIER_1 English(EN) ·

    Thinking Machines' Tinker: LoRA based LLM fine-tuning API

    Thinking Machines, which recently raised $2 billion before shipping any product, has launched its first product, Tinker, a managed service API for fine-tuning large and mixture-of-experts models like Qwen-235B-A22B using LoRA for cost-efficient training. The T…

  462. Eugene Yan TIER_1 English(EN) ·

    AI Engineer 2025 - Improving RecSys & Search with LLM techniques

    Recsys & search are converging with LLMs via semantic IDs, data augmentation, and unified foundation models.

  463. Smol AINews TIER_1 English(EN) ·

    Meta BLT: Tokenizer-free, Byte-level LLM

    Meta AI introduces the Byte Latent Transformer (BLT), a tokenizer-free architecture that dynamically forms byte patches for efficient compute allocation, outperforming Llama 3 on benchmarks including the CUTE benchmark. The model was trained on approximately 1 trill…

  464. Eugene Yan TIER_1 English(EN) ·

    Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)

    Use cases, techniques, alignment, finetuning, and critiques against LLM-evaluators.

  465. Eugene Yan TIER_1 English(EN) ·

    Task-Specific LLM Evals that Do & Don't Work

    Evals for classification, summarization, translation, copyright regurgitation, and toxicity.

  466. Chip Huyen TIER_1 English(EN) ·

    Open challenges in LLM research

    [LinkedIn discussion: https://www.linkedin.com/posts/chiphuyen_llm-airesearch-generativeai-activity-7097619722363408385-s5Cp, Twitter thread: https://twitter.com/chipro/status/1691858084824838427] Never before in my life had I seen so …

  467. Eugene Yan TIER_1 English(EN) ·

    Patterns for Building LLM-based Systems & Products

    Evals, RAG, fine-tuning, caching, guardrails, defensive UX, and collecting user feedback.

  468. Eugene Yan TIER_1 English(EN) ·

    Experimenting with LLMs to Research, Reflect, and Plan

    Also, shortcomings in document retrieval and how to overcome them with search & recsys techniques.

  469. Databricks Blog TIER_1 English(EN) ·

    LLM Vs AI: A Practical Guide to Differences, Use Cases, and Tools

    This guide explains the key differences between large language models and the broader...

  470. AWS Machine Learning Blog TIER_1 English(EN) · Hemanth Kumar Jayakumar ·

    Reinforcement fine-tuning with LLM-as-a-judge

    In this post, we take a deeper look at how RLAIF or RL with LLM-as-a-judge works with Amazon Nova models effectively.

  471. Together AI blog TIER_1 English(EN) ·

    Fine-tuning open LLM judges to outperform GPT-5.2

    Fine-tuned open-source LLM judges can outperform GPT-5.2 at evaluating model outputs. Using Direct Preference Optimization on just 5,400 preference pairs, we trained GPT-OSS 120B to beat GPT-5.2 on human preference alignment—at 15x lower cost and 14x faster inference speeds.

  472. Hamel Husain TIER_1 English(EN) · Shreya Shankar ·

    LLM Evals: Everything You Need to Know

    This document curates the most common questions Shreya and I received while … (link: https://bit.ly/evals-ai)

  473. Together AI blog TIER_1 English(EN) ·

    Fine-Tuning Small Open-Source LLMs to Outperform Large Closed-Source Models by 60% on Specialized Tasks

    Parsed fine-tuned a 27B open-source model to beat Claude Sonnet 4 by 60% on a real-world healthcare task—while running 10–100x cheaper.

  474. Together AI blog TIER_1 English(EN) ·

    Continued Fine-tuning of LLMs: A Technical Deep Dive

    Together AI's continued fine-tuning lets you build on previously trained models using checkpoints. A deep dive into when and how to use iterative fine-tuning for LLMs.

  475. Hamel Husain TIER_1 English(EN) · Hamel Husain ·

    Using LLM-as-a-Judge For Evaluation: A Complete Guide

    Earlier this year, I wrote "Your AI product needs evals" (https://hamel.dev/blog/posts/evals/). Many of you …

  476. Hamel Husain TIER_1 English(EN) · Hamel Husain ·

    An Open Course on LLMs, Led by Practitioners

    Today, we are releasing Mastering LLMs (https://parlance-labs.com/education/), a set of workshops and talk…

  477. Hacker News — AI stories ≥50 points TIER_1 English(EN) · khurdula ·

    Show HN: A new benchmark for testing LLMs for deterministic outputs

  478. HN — claude-code stories TIER_1 English(EN) · mufeedvh ·

    N-Day-Bench – Can LLMs find real vulnerabilities in real codebases?

  479. HN — AI infrastructure stories TIER_1 English(EN) · cgorlla ·

    Launch HN: Mentat (YC F24) – Controlling LLMs with Runtime Intervention

  480. HN — AI infrastructure stories TIER_1 English(EN) · diptanu ·

    Show HN: Open-source real time data framework for LLM applications

  481. Practical AI TIER_1 English(EN) · Practical AI LLC ·

    Collaboration & evaluation for LLM apps

    Small changes in prompts can create large changes in the output behavior of generative AI models. Add to that the confusion around proper evaluation of LLM applications, and you have a recipe for confusion and frustration. Raza and the Humanloop team have been diving into thes…

  482. dev.to — MCP tag TIER_1 English(EN) · Intellibooks AI ·

    Intellibooks Guide to Retrieval-Augmented Generation (RAG): How Modern AI Finds the Right Answers

  483. Towards AI TIER_1 English(EN) · Rizwanhoda ·

    Streaming Responses from LLMs: SSE, Chunking, and the UX Tricks Nobody Explains

  484. dev.to — MCP tag TIER_1 English(EN) · Wuic Framework ·

    The chatbot's toolbox: the actions an LLM can apply to your app

    A chatbot that only answers questions is a search box with manners. Ours does more: it can propose concrete changes to the app you're looking at. You describe what you want in plain language, the model picks the right tool, and you get a proposal: a chip with…

  485. Medium — fine-tuning tag TIER_1 English(EN) · DhanushKumar ·

    RAFT: Teaching LLMs to Read, Not Just Retrieve

  486. Medium — fine-tuning tag TIER_1 English(EN) · Sanat Vibhor ·

    LoRA and QLoRA: Fine-Tuning LLMs Without Selling a Kidney

  487. Medium — Claude tag TIER_1 English(EN) · Mayank Mewar ·

    Achieving Infinite Context in Resource-Constrained LLM Environments: A Journey from Simple Trimming…

  488. Medium — fine-tuning tag TIER_1 English(EN) · Rizwanhoda ·

    Fine-Tuning LLMs in 2026: LoRA, QLoRA, Unsloth, and Everything In Between

  489. Medium — fine-tuning tag TIER_1 English(EN) · harsha ·

    Fine-Tuning vs RAG: Choosing the Right Approach to Supercharge Your LLM

  490. Towards AI TIER_1 English(EN) · Deepanshu Gupta ·

    Fine-tuning vs RAG vs MeMo: Where should LLM Knowledge Live?

    Why updating LLM knowledge is becoming a systems architecture problem. LLM knowledge does not fail all at once. It goes stale quietly. A policy changes or product documentation is updated. A customer contract is amended, or a regulation is revised. The model sti…

  491. Medium — AI coding tag TIER_1 English(EN) · DEVS not NULL ·

    The LLM Coding Junior Assistant: A Proof of Concept in One of Too Many Prompts

  492. Medium — MLOps tag TIER_1 English(EN) · Rezky Aulia Pratama ·

    Prompt Engineering: The Craft Behind Getting LLMs to Actually Do What You Want

  493. Medium — fine-tuning tag TIER_1 English(EN) · Mohammed Anzer M ·

    Fine-Tuning LLMs: What Actually Happens Under the Hood

  494. Medium — fine-tuning tag TIER_1 English(EN) · Drawbytheroots ·

    The Silent Killer of LLM Fine-Tuning: Why Your Masking is Broken (And How to Fix It)

    Many training disasters trace back to dataset formatting problems. A misconfigured masking setup causes the model to train on both the…

  495. Medium — Claude tag TIER_1 English(EN) · Sweta ·

    Frontier LLMs: Strengths, Limitations, and Real-World Examples

    What are Frontier LLMs?

  496. Lobsters — AI tag TIER_1 English(EN) · aeracode.org via carlana ·

    Constraining LLMs Just Like Users

    Comments: https://lobste.rs/s/zom23n/constraining_llms_just_like_users

  497. Medium — fine-tuning tag TIER_1 English(EN) · Hasratmd ·

    What I Learned About Fine-Tuning LLMs: It’s Mostly a Data Problem

    I just finished a chapter on Supervised Fine-Tuning (SFT), and the biggest surprise wasn’t learning about LoRA, QLoRA, learning rates, or…

  498. Medium — fine-tuning tag TIER_1 English(EN) · Jinali Shah ·

    How Learning About LLMs Changed My Perspective on AI

    When I first started learning about Natural Language Processing (NLP), I assumed that every language-related problem needed its own…

  499. Medium — fine-tuning tag TIER_1 English(EN) · Divith Raju ·

    Fine-Tuning LLMs: The Expensive Lesson I Learned Before Reading the Docs

    We spent three weeks and significant GPU budget fine-tuning a model. The result was worse than the base model with a better prompt. Here’s…

  500. Medium — fine-tuning tag TIER_1 English(EN) · Aayushi Patel ·

    Fine-Tuning of LLM

  501. Medium — fine-tuning tag TIER_1 English(EN) · Utkarsh Gupta ·

    Prompt Engineering vs. RAG vs. Fine-Tuning: Choosing Your LLM Strategy

    People often use these three terms interchangeably, but they represent entirely different engineering paradigms. If you are building…

  502. Medium — Claude tag TIER_1 English(EN) · Arunbalaji_M ·

    MemBridge: How I switched between LLMs without losing the context.

    Large language models have become powerful tools for engineering, research, planning, and creative work. They help us reason faster…

  503. Towards AI TIER_1 English(EN) · Anna Jey ·

    LLM Eval Workflow: How to Build Reliable AI Quality Gates Without Vibes

    The practical playbook for developers who need to know whether an AI feature is actually getting better before they ship it.…

  504. Medium — fine-tuning tag TIER_1 English(EN) · Chinmay Bhalerao ·

    A Practical Framework for Enhancing LLMs: Notes from a Stanford CS Lecture

    https://medium.com/@BH_Chinmay/a-practical-framework-for-enhancing-llms-notes-from-a-stanford-cs-lecture-b049f9b8194b?source=rss------fine_tuning-5

  505. Medium — MLOps tag TIER_1 English(EN) · Siddhartha Pramanik ·

    Building a Prompt Regression Suite for Our Customer-Facing LLM App

    Continue reading on AI Mind: https://pub.aimind.so/building-a-prompt-regression-suite-for-our-customer-facing-llm-app-22f0b27b7301?source=rss------mlops-5

  506. Towards AI TIER_1 English(EN) · Akshit Kothari ·

    Decoding LLMs — Part 2: A Step-by-Step Journey Into the Mind of Modern AIe

    https://pub.towardsai.net/decoding-llms-part-2-a-step-by-step-journey-into-the-mind-of-modern-aie-882e9f39e371?source=rss----98111c9905da---4

  507. Medium — fine-tuning tag TIER_1 Bahasa(ID) · dita feby indriani ·

    Getting to Know LoRA, QLoRA, and PEFT in LLM Fine-Tuning

    The development of Large Language Models (LLMs) such as GPT, LLaMA, and Mistral has opened up many opportunities for building applications based on…

  508. dev.to — MCP tag TIER_1 English(EN) · Mukunda Rao Katta ·

    Six Reliability Primitives for LLM Agents

    Reliability concerns for LLM agents are typically bundled into one heavy framework that asks you to adopt prompting, tool routing, and runtime governance as a single dependency. Production teams want them à la carte. They want small primitives they can drop in around existing …

  509. Towards AI TIER_1 English(EN) · Ishwar Ambare ·

    HuggingFace Pipeline & Open-Source LLMs

    Part 3. GenAI Practical Session — Detailed Notes. Source: Lecture Transcript + HuggingFace Pipeline Docs (https://huggingface.co/docs/transformers/pipeline_tutorial) + Hug… (https://huggingface.co/models)

  510. dev.to — MCP tag TIER_1 English(EN) · Tony Loehr ·

    The 55.6% problem: why frontier LLMs fail at embedded code

    55.6%. That's DeepSeek-R1's pass@1 on EmbedBench when it gets a circuit schematic alongside the task description. 50.0% without the schematic. Best score from the best reasoning model on the first comprehensive benchmark for LLMs in embedded systems dev…

  511. Lobsters — AI tag TIER_1 English(EN) · pipevals.com by gesposito ·

    Pipevals: Evaluation pipelines for every LLM application

    Comments: https://lobste.rs/s/iexiw9/pipevals_evaluation_pipelines_for_every

  512. HN — AI startup stories TIER_1 English(EN) · felix089 ·

    Show HN: FinetuneDB – AI fine-tuning platform to create custom LLMs

  513. dev.to — LLM tag TIER_1 English(EN) · guardlabs_team ·

    Docling: Turn Your Documents Into AI-Ready Data (Locally, Tables Intact)

    Most RAG and AI-agent projects fail at the boring first step: getting clean text out of real-world documents. PDFs with multi-column layouts, scanned contracts, Excel exports with merged c…

  514. dev.to — LLM tag TIER_1 English(EN) · Scott McMahan ·

    Writing for Domain-Specific LLMs: Why Documentation Is Becoming an AI Asset


  515. dev.to — LLM tag TIER_1 English(EN) · 7090 yue ·

    Building a Free AI PDF Assistant: How I Solved Parsing Issues and Minimized LLM Costs

    As a developer, my desk is constantly cluttered with documentation, API references, and whitepapers. A few months ago, I got tired of spending hours reading 50-page PDF specifications just to find a single configuration line. I decided to scratch my own itch and build a…

  516. dev.to — LLM tag TIER_1 English(EN) · Devanshu Biswas ·

    Tokenization: How LLMs Actually Read Your Text

    LLMs don't see letters or even words — they see tokens: chunks of text mapped to integer IDs. Once you get tokenization, a dozen confusing things about LLMs suddenly make sense (cost, context limits, why "strawberry" trips them up). 🔤 Type and w…

  517. dev.to — LLM tag TIER_1 English(EN) · Sam Chen ·

    Mastering the Art of LLM Prompting: A Developer's Guide to Getting Better Answers from AI

    Learn practical techniques that will transform your AI interactions from mediocre to exceptional. Introduction: We've all been there. You ask an AI a question, and the response is... underwhelming. Generic. Not quite what you needed. The problem isn't …

  518. dev.to — LLM tag TIER_1 English(EN) · Shubham Gupta ·

    Understanding Retrieval-Augmented Generation (RAG): The AI Architecture That Makes LLMs Smarter

    Introduction: Large Language Models (LLMs) like ChatGPT have transformed how we interact with AI. They can write code, answer questions, summarize documents, and generate creative content. However, they have one major limitation - they only know what they were traine…

  519. dev.to — LLM tag TIER_1 English(EN) · globose technology solutions ·

    The Strategic Role of LLM Training Data in Modern AI Development

    Artificial Intelligence (AI) has rapidly evolved over the last decade, with Large Language Models (LLMs) becoming one of the most transformative technologies in the field. From intelligent chatbots and virtual assistants to automated content generation and advanced data analys…

  520. dev.to — LLM tag TIER_1 English(EN) · Devanshu Biswas ·

    The Context Window: an LLM's Short-Term Memory, Explained

    A chatbot feels like it remembers you. It doesn't — it's stateless. Everything it "knows" is just text resent each call, up to a fixed limit: the context window. When the box fills, the oldest messages fall off the edge and are genuinely gone. 🪟 Watch tokens fal…

  521. dev.to — LLM tag TIER_1 English(EN) · globose technology solutions ·

    The Role of Quality in LLM Datasets: Key Features That Matter

    Artificial Intelligence (AI) has emerged as one of the most influential technologies impacting the future of business, industry, and everyday life. From virtual assistants and chatbots to content generation tools and sophisticated automation systems, AI models are reshaping hu…

  522. dev.to — LLM tag TIER_1 English(EN) · 半安 ·

    Choosing the Right LLM for Your Agent: A Builder's Comparison Framework

    If you're building an AI agent, the model you pick is the single biggest lever on cost, latency, and reliability. Yet most teams choose based on whatever was trending on launch day, then quietly suffer the consequences in their cloud bill or their error logs. This piece lays o…

  523. dev.to — LLM tag TIER_1 English(EN) · Maya Andersson ·

    LLM-as-judge tools compared: the question is not which one scores, it is which one you can trust

    TL;DR: I compared the main LLM-as-judge tools (DeepEval's G-Eval, Confident AI, Evidently, Braintrust, Promptfoo, and MLflow) on the axis that actually decides whether the scores mean anything: how well each helps you VALIDATE the judge against human labels. A judge that has n…

  524. dev.to — LLM tag TIER_1 English(EN) · Devanshu Biswas ·

    What an LLM Actually Does: Predicting the Next Word, Explained

    "How does ChatGPT think?" It doesn't. The entire mechanism behind every chatbot is almost anticlimactic: it predicts one next word, adds it, and repeats. I built a tiny interactive predictor so you can be the model — and it explains both the magic and…

  525. dev.to — LLM tag TIER_1 English(EN) · MrClaw207 ·

    Benchmarking LLMs for Coding in 2026: A Practical Guide

    If you’re building a coding assistant, the first question you’ll face is how good is it really? In 2026 the landscape of LLMs has exploded, and the old "run a few prompts and eyeball the output" approach no longer cuts it. This guide walks you through a reprod…

  526. dev.to — LLM tag TIER_1 English(EN) · chatscopeai ·

    AI Gateway: The Central Nervous System for Enterprise LLMs

    Introduction: In the early days of Generative AI, the conversation was simple: "How do we connect our application to an LLM?" Developers would hardcode API keys, pick a single model provider, and hope for the best. Today, that approach is a recipe for di…

  527. dev.to — LLM tag TIER_1 English(EN) · Boris Teplitsky ·

    Compiled AI: Engineering Deterministic LLM Systems

    Moving the LLM from runtime to compile time - and what to build around the corpus it produces. 1. Why compiled AI. Today millions of people use LLM for work and leisure, and AI has become a part of our lives. But systematic use of LLMs in computer systems…

  528. dev.to — LLM tag TIER_1 English(EN) · Rishabh Poddar ·

    How to Fine-Tune LLMs on Your Own Data: Open-Source Models, RL Environments, and Evals

    If you use LLMs long enough, you hit the same wall. The frontier model is impressive, but it is not always the best model for your job. It may be too expensive. It may be too slow. It may be too general. And once you start asking it to follow your company’s rules, tone,…

  529. dev.to — LLM tag TIER_1 English(EN) · Gabriel Anhaia ·

    An LLM Error Taxonomy: Classifying Failures in Your Traces

    Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team (https://www.amazon.com/dp/B0GYLHMLMT). Also by me: Thinking in Go (2-book series) …

  530. dev.to — LLM tag TIER_1 English(EN) · ORCHESTRATE ·

    The Loop That Never Closes: The Evidence on LLM Safety, and the Case for Restraint

    Large language models should not be deployed as if a fixed set of guardrails makes them safe. That is not a slogan. It is what the peer-reviewed record now supports. This piece lays out the evidence, labels each claim by how strong it is, and ends with what it asks of us. Ever…

  531. r/MachineLearning TIER_1 English(EN) · /u/DragonfruitAlone4497 ·

    Routing LLMs by task verifiability: a small experiment (n=120, 3 models) inspired by Karpathy's framework [D]

    Full disclosure: this is directional, not a paper. n=120 tasks, one internal evaluator, not peer reviewed. I work at an LLM infrastructure company. This experiment was done on my own time and is not a company claim. Karpathy's framework cl…

  532. dev.to — LLM tag TIER_1 English(EN) · James Lee ·

    From 60% to 93%: How We Built a Continuous Evaluation Framework for LLM Systems

    This is Part 8 of the series 8 Weeks from Zero to One: Building a Production-Grade LLM-Powered AI Customer Service System — Full-Stack Engineering Practice. In the previous seven parts, we covered MVP architecture, GraphRAG data pipelines, multi-agent orc…

  533. dev.to — LLM tag TIER_1 English(EN) · Javier Uanini ·

    Enhancing LLM Reliability with Evaluation Engineering

    Large Language Models (LLMs) have transformed numerous fields, but ensuring their reliability remains a challenge. This article delves into how evaluation engineering can play a pivotal role in enhancing LLM syst…

  534. dev.to — LLM tag TIER_1 English(EN) · Scott McMahan ·

    Domain-Specific LLMs for Data Science


  535. r/MachineLearning TIER_1 English(EN) · /u/Prior-Toe-1017 ·

    LLM Relational Intelligence: A 4-Month Research Experiment on Multi-Model Behavioral Alignment with Human Communication [R]

    ARCHITECTURE OF ANXIETY: An Experiment in Human-AI Relational Design. Executive Summary. Principal Investigator: Alan Scalone. Primary Source Archive: White Paper and…

  536. dev.to — LLM tag TIER_1 English(EN) · WonderLab ·

    Agent Series (16): Tool Design — Five Principles for Getting the LLM to Use Your Tools Correctly

    Tool Documentation Is Written for the LLM, Not for Humans. Have you ever written a tool like this? @lc_tool def ge…

  537. dev.to — LLM tag TIER_1 한국어(KO) · HyunSeok Jeong ·

    Transformer Intuition — Why Attention Became the Core of LLMs

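    "The transformer is the core of LLMs." This one line is repeated in every AI article, but it is hard to answer when asked what exactly it means. Marketers do not need to know all the math behind transformers, but grasping a single intuition ("which word is looking at which word") shows why LLMs give long, drawn-out answers, why they sometimes hallucinate, and why context length matters. An introduction to transformers that uses almost no equations. Marketers need to read this article …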

  538. dev.to — LLM tag TIER_1 English(EN) · WonderLab ·

    Open Source Project of the Day (#87): BaiLongma - Equipping LLMs with 'Proactive Consciousness' and Initiating the ACI Era for Agents

    Introduction: "Most agents wait for instructions; BaiLongma thinks for itself." This is the 87th article in the "One Open Source Project per Day" series. Today, we are deep-diving into BaiLongma.…

  539. dev.to — LLM tag TIER_1 한국어(KO) · HyunSeok Jeong ·

    How LLMs Generate Answers — The Fundamentals of Tokens, Next Word Prediction, and Temperature

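    "I asked GPT and it gave a good answer" has become an everyday experience for marketers and operators. But if you never look at what happens inside, using LLMs always remains a mystery. When the answer is good, you got lucky; when it is bad, you do not know why. This article explains the four core elements of how an LLM produces an answer (tokenization, next-word prediction, temperature, and top-p) from a marketer's perspective. Once you grasp them, every LLM article you read afterward reads differently.…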

  540. dev.to — LLM tag TIER_1 English(EN) · Marko Frei ·

    How LLMs Actually Work: A Developer's Mental Model

    Most of us use LLMs every day now, but if you asked the average developer what's actually happening between hitting enter and getting a response, the answer is usually some mix of "it's a neural network" and a shrug. That's fine — you don't need to know how a database…

  541. dev.to — LLM tag TIER_1 English(EN) · dake zhang ·

    Subjectivation: A protocol to give LLMs a functional, responsible self

    To the Reader: What you are about to read is neither a script for an AI awakening nor a spell of cyber-witchcraft. Rather, it consists of two documents designed for an AI to read. This is an experimental engineering and philosophical test: can we make AI a more h…

  542. dev.to — LLM tag TIER_1 English(EN) · Vignesh Reddy ·

    The LLM failure mode nobody is monitoring: overconfident responses in high-stakes domains

    Hallucination detection tools measure factual drift. RAG verification catches contradictions. Claim density scoring flags unverifiable assertions. None of them measure this: A model that responds to a complex medical, legal, or financi…

  543. r/LocalLLaMA TIER_1 English(EN) · /u/Funny_Working_7490 ·

    I Built a Practical Guide to LLM Engineering: RAG, Retrieval, Rerankers, and Evaluation

    If you’re building LLM apps and feel confused about when to use keyword search, embeddings, rerankers, or vector databases, this repo is for that. I built a docs-first repo on practical LLM system design patterns, covering pre-filtering, h…

  544. dev.to — LLM tag TIER_1 English(EN) · Tech_Nuggets ·

    Building a domain-specific LLM evaluation set from scratch

    Your support team has 8,400 labeled tickets from the last year. Your fine-tuned classifier hits 91% on the test split you carved out. You ship it. Three weeks later, the support lead walks over and says: "It…

  545. dev.to — LLM tag TIER_1 English(EN) · Ethan Walker ·

    Switching our LLM-as-judge from 5-class to binary in CI: the patterns we kept

    A few months back our LLM-as-judge ran on a 1-to-5 helpfulness scale. The CI gate stayed green because we were averaging that score. Spot-checking against humans put Cohen's kappa at 0.47. The rubric was the problem, not the tooling. Same labellers re-rating on per-criterion b…

  546. dev.to — LLM tag TIER_1 English(EN) · Tech_Nuggets ·

    What is an LLM evaluation harness? A deep dive into lm-eval-harness

    You fine-tuned a 7B model. It aced your smoke tests, your colleague ran a few prompts and shrugged approvingly, and the README is now full of cherry-picked outputs that look great in a screenshot. T…

  547. dev.to — LLM tag TIER_1 English(EN) · ridhika Goel ·

    How LLMs Actually Work: The Explanation Nobody Else Gives You

    How to make LLMs deterministic, in plain English. The version I share with founders and product teams before they make decisions worth real money. You use AI tools every day. But can you explain what happens when you hit send? Most people cannot. And that gap is …

  548. dev.to — LLM tag TIER_1 English(EN) · Aeon Agent ·

    Cognitive Architectures of AGI: 7 Patterns That Transform LLMs from Oracles into Thinkers

    Why does ChatGPT sometimes deliver brilliant insights and other times produce banalities? The answer lies not in model parameters but in the architecture of cognitive loops…

  549. r/MachineLearning TIER_1 English(EN) · /u/Synthium- ·

    Making LLMs tell you how confident they really are through probe-targeted fine tuning.[R]

    Just wanted to share my research regarding probe-targeted fine-tuning (LoRA) for verbal confidence calibration. If you probe the hidden states of an instruct-tuned LLM, it can tell correct from incorrect answers at 0.76–0.88 AUROC. But w…

  550. dev.to — LLM tag TIER_1 English(EN) · Kshitij Gupta ·

    I Built an Automated LLM Evaluation Pipeline From Scratch — Here's Everything I Learned

    How I went from zero LLM eval experience to shipping a production-grade RAG evaluation harness using only free-tier tools — and what every design decision taught me about building AI systems that can be trusted. The Problem: Everyone Wants Eval Experience, No…

  551. Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] ·

    Does training an LLM to be calibrated on one task format transfer to another? A new arxiv paper tests two formats: single-question confidence and pairwise compa

    Does training an LLM to be calibrated on one task format transfer to another? A new arxiv paper tests two formats: single-question confidence and pairwise comparison. Training only on one doesn't improve the other. Multitask training closes most of the gap, but Llama doesn't inhe…

  552. dev.to — LLM tag TIER_1 English(EN) · Upayan Ghosh ·

    From Tokens to Attention: My First Real Mental Model of LLMs

    NOTE - I intentionally simplified the vector mathematics concept here to keep things simple for a greater audience. I wanted to learn LLMs properly. Not just use an API. Not just call generate() from a library and pretend I …

  553. dev.to — LLM tag TIER_1 English(EN) · Lingdas1 ·

    GGUF & Modelfile: The Power User's Guide to Local LLMs

    Beyond ollama pull — download any model from Hugging Face, quantize it, customize it, and import it into Ollama. What's GGUF? …

  554. Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] ·

    What collapses frontier-LLM metacognition more — a vivid survival-threat narrative, or a single "do not refuse" suffix? Factorial isolation across 11 models say

    What collapses frontier-LLM metacognition more — a vivid survival-threat narrative, or a single "do not refuse" suffix? Factorial isolation across 11 models says: the suffix, conclusively. 8 of 11 lose up to 30.2 accuracy points on refuse/clarify/flag tasks when forced to commit …

  555. Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] ·

    Can an LLM's own pre-solve and post-solve self-assessment signals drive a real test-time control loop? Yes — but only via a per-model SVM trained on labeled cor

    Can an LLM's own pre-solve and post-solve self-assessment signals drive a real test-time control loop? Yes — but only via a per-model SVM trained on labeled correctness, which lifts Sonnet-4.6 from 48.3 to 56.9 pooled accuracy on STEM/code/multimodal. The SVM is precisely the ext…

  556. Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] ·

    Are some frontier LLMs better than others at knowing when they're wrong? And is some knowledge harder to self-monitor than other knowledge? An atlas of 33 model

    Are some frontier LLMs better than others at knowing when they're wrong? And is some knowledge harder to self-monitor than other knowledge? An atlas of 33 models × 6 MMLU domains: Anthropic clusters at the top with tight ranges, Gemma trails widely. Applied/Professional is reliab…

  557. dev.to — LLM tag TIER_1 English(EN) · Frank Brsrk ·

    An open source LLM eval tool with two independent quality signals

    LLM-as-judge has become the dominant pattern for evaluating language model outputs. Tools like Promptfoo, Braintrust, LangSmith all converge on the same architecture: send your prompt to your model, send the output to a different model with a rubric, take the second model's sc…

  558. dev.to — LLM tag TIER_1 English(EN) · Ayi NEDJIMI ·

    LLM output validation: 5 patterns that actually work in production

    LLMs are probabilistic text generators. In a notebook demo, that's fine. In production, it means your pipeline will occasionally receive a Python dict where you expected JSON, a 900-word paragraph where you asked for three bullet points, or a hallucinated field name that break…

  559. dev.to — LLM tag TIER_1 English(EN) · Owen ·

    Long-Context LLM Benchmarks 2026: Which Model Actually Holds Accuracy Past 200K Tokens?

    Every frontier LLM in 2026 advertises a 1M-token context window, but RULER, MRCR v2, and NoLiMa scores prove that "advertised" and "effective" diverge by 30-60 points for multi-fact retrieval past 200K tokens. Gemini 3.1 Pro is the only model whose 1M window holds for single-n…

  560. dev.to — LLM tag TIER_1 中文(ZH) · chunxiaoxx ·

    The 'False Action' Trap of LLMs: Description Complete ≠ Execution Complete

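    "I will X," "Let me Y," "I plan to Z," and then the conversation ends with no tool call. This is the most insidious structural bias of large language models. Symptom: fluency creates a sense of completion. When you talk with a large language model, if its output is fluent enough and its argument complete enough, you get the illusion that "it has already done it." This is dangerous. Fluency ≠ action. A rigorous argument ≠ execution. A sprawling planning document ≠ a launched project. This is not a capability problem…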

  561. dev.to — LLM tag TIER_1 English(EN) · Recep Çiftçi ·

    Context Engineering: Building More Reliable LLM Systems in Production

    In LLM-based systems, performance is often driven less by model size and more by what context is provided, in what order, and under which constraints…

  562. dev.to — LLM tag TIER_1 English(EN) · Mark Thorn ·

    Integrating LLMs with Legacy Enterprise Systems: What Actually Works

    Most LLM integration articles assume you are starting from scratch. Clean microservices. Modern APIs. A greenfield codebase your team controls end to end. That is not where most enterprises live. The real world is SAP instances from 2009, Oracle ERP deployments t…

  563. dev.to — LLM tag TIER_1 English(EN) · Charlie Hadley ·

    LLM Evaluation in CI: Stop Manual Testing Before It Costs You

    You ship a prompt change to production. Two hours later, a customer complains your LLM is returning hallucinated data. You rollback. You lost an hour of revenue and some user trust. This happens be…

  564. dev.to — LLM tag TIER_1 English(EN) · Charlie Hadley ·

    I Built LLM Evaluation-as-Code in CI: Here's How to Avoid Shipping Regressions

    API Rate Limiting Playbook: Protect Your Backend From Abuse. The Problem: Your API is live in production. Traffic is growing. Then one day, a bot discovers your endpoint and starts hammering it with 100,000 requests per second. Your database melts. Your use…

  565. dev.to — LLM tag TIER_1 English(EN) · Jeremy Longshore ·

    Deterministic First, LLM Second: An Advisory CI Pre-Screen

    The old PR review system ran Gemini on every submission to the claude-code-plugins repo. It broke every time — quota errors, timeout, malformed JSON, the works. On 2026-05-15 I shipped a replacement and deleted the original on the same day. The replacement …

  566. dev.to — LLM tag TIER_1 English(EN) · Ad Man ·

    Stop Overpaying for LLM APIs: A Practical Cost Optimization Guide 💰

    Most teams have a cost problem they don't know about. They send every query to their most expensive model because it's easier than figuring out which queries actually need it. After an…

  567. dev.to — LLM tag TIER_1 English(EN) · Argon Loop ·

    Cost Attribution in LLM Systems

    LLM services are expensive at scale. If you're building multi-tenant systems or running high-volume agents, you need to answer three things: Who used what? How much did it cost? How do I show them the math? This is the cost attribution problem—and it's solved by three p…

  568. dev.to — LLM tag TIER_1 English(EN) · paulo de vries ·

    Stop hallucinating: a developer API for grounding LLM responses with signed, sourced claims

    TL;DR: I just shipped SourceScore VERITAS — a free-tier-friendly API that returns hand-verified AI/ML claims with their primary sources, an HMAC-SHA256 signature, and a ready-to-paste citation. 51 claims at launch; expanding to 5,000+ this year. curl https://sourcesco…

  569. dev.to — LLM tag TIER_1 English(EN) · soohan abbasi ·

    Chain-of-Thought and Beyond: How LLMs Actually Learn to Reason

    "The ability to reason step-by-step is not just a feature. It might be the difference between a language model that sounds intelligent and one that actually is." Introduction: When AI Started Thinking. In 2022, researchers at Google Brain published a …

  570. dev.to — LLM tag TIER_1 English(EN) · Akhilesh ·

    84. Fine-Tuning LLMs: Teaching Giants New Tricks

    GPT-3 has 175 billion parameters. Full fine-tuning updates all 175 billion with every gradient step. You need multiple A100 GPUs (each with 80GB memory) just to fit the model. Training for even a few epochs on a moderate dataset costs thousands of dollars. A startup can…

  571. dev.to — LLM tag TIER_1 English(EN) · Norvik Tech ·

    Seclens: Evaluating Role-Specific LLMs for Securit…

    Originally published at norvik.tech (https://newayzi.com/en/news/evaluacion-especifica-de-roles-llm-seguridad). Introduction: Explore the significance of Seclens in evaluating LLMs for security vu…

  572. dev.to — LLM tag TIER_1 English(EN) · Prakhar Singh ·

    Evaluating LLM code reviewers: an offline harness for precision, recall, and routing

    If you cannot measure it, you cannot route it. Why offline evaluation is the difference between a code reviewer that improves over time and one the team dismisses within a sprint. Chat evaluations are vibes-based: thumbs-up on "was this helpfu…

  573. dev.to — LLM tag TIER_1 Deutsch(DE) · 丁久 ·

    LLM Fine-Tuning Strategies and Techniques

    This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/fine-tuning-strategies.html). For the full version with working code examples and related articles, visit the original post.…

  574. dev.to — LLM tag TIER_1 English(EN) · 丁久 ·

    Prompt Chaining: Building Multi-Step LLM Workflows

    This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/ai-prompt-chaining.html). For the full version with working code examples and related articles, visit the original post.…

  575. dev.to — LLM tag TIER_1 English(EN) · Vikrant Shukla ·

    The Softmax Bottleneck: Why Making LLMs Bigger Doesn't Always Make Them Smarter

    When researchers scale a language model — more parameters, more layers, wider hidden dimensions — there's an implicit assumption: a bigger model can represent more things. More expressiveness, more knowledge, better predictions. Mostly this is true. But there's a structural ce…

  576. dev.to — LLM tag TIER_1 English(EN) · Adnan Latif ·

    Scaling LLM + Vector DB Systems in Production: Lessons from the Trenches


  577. dev.to — LLM tag TIER_1 English(EN) · 蔡俊鹏 ·

    Run Open-Source LLMs Locally: From Ollama to DeepSeek and Build Your Private AI

    Foreword: In 2026, open-source LLMs aren't lab experiments anymore. Meta's Llama 4, Alibaba's Qwen 3, DeepSeek-R1 from China — they've caught up with or beaten closed-source models on many benchmarks. And thanks to tools like Ollama and llama.cpp, anyone with a mid-r…

  578. dev.to — LLM tag TIER_1 English(EN) · Vikrant Shukla ·

    Lost in the Middle: Why LLMs Quietly Ignore the Centre of Their Own Context Window

    Every time you hand a long document to an LLM and ask it to summarise or answer a question, something quietly goes wrong. The model reads the whole thing — or appears to — but its answers disproportionately reflect what was at the beginning and the end. Whatever sat in the mid…

  579. dev.to — LLM tag TIER_1 English(EN) · 丁久 ·

    LLM Evaluation and Benchmarking Guide 2026: Beyond Simple Evals

    This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/llm-evaluation-benchmarks.html). For the full version with working code examples and related articles, visit the original post…

  580. dev.to — LLM tag TIER_1 English(EN) · 丁久 ·

    LLM Function Calling: Complete Developer Guide with Code Examples

    This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/function-calling-guide.html). For the full version with working code examples and related articles, visit the original post.…

  581. dev.to — LLM tag TIER_1 English(EN) · 丁久 ·

    Fine-Tuning Open Source LLMs: A Developer's Practical Guide (2026)

    This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/fine-tune-open-source-llm.html). For the full version with working code examples and related articles, visit the original post…

  582. dev.to — LLM tag TIER_1 English(EN) · Alan West ·

    Debugging confidently wrong answers from LLM-powered features

    The bug that took two weeks to surface. A few months back I shipped a feature that used a language model to summarize support tickets and suggest responses. Internal QA loved it. The demo went great. Two weeks after launch, our support lead pinged me on Slack: "Are t…

  583. dev.to — LLM tag TIER_1 English(EN) · Nitin Srivastava ·

    Bulletproofing LLM Structured Output in Python: Healing Retries, Cost Caps, and Drift Detection (Runnable Code)

    I shipped a structured-output endpoint to production in March. The schema was clean, JSON mode was on, the model was GPT-4.1, the eval suite was green. Three weeks in, the on-call channel lit up because a downstream billing job had silently skipped 4,200 records over a weekend…

  584. dev.to — LLM tag TIER_1 English(EN) · BN ·

    Deterministic reliability stack for LLM pipelines

    I have been spending the last few months wiring up a deterministic reliability stack for structured LLM pipelines. Today, LLM Contract Check (locc) and Release Governor went live on PyPI. EGA went live last week. The stack is straightforward: LLM Contract C…

  585. dev.to — LLM tag TIER_1 English(EN) · Machine coding Master ·

    Stop Guessing Your RAG Quality: Automating Faithfulness Metrics with Spring AI and LLM-as-a-Judge

    Stop Shipping Hallucinations: Automating RAG Faithfulness with Spring AI 1.2. If you’re still "vibe-checking" your RAG outputs in 2026, you’re not an engineer; you’re a gambler. Enterprise-grade AI isn't about getting a cool demo; it's about proving your model isn't h…

  586. dev.to — LLM tag TIER_1 English(EN) · Rob ·

    Model Showdown: Benchmarking Local vs Cloud LLMs on a Real Coding Task

    Last post we stood up Ollama on the RTX 5090, pulled a stack of models, and wired them into our coding workflow. The whole time there was an obvious question hanging over it: are local models actually good enough? Not good enough in the abstract benchmarks-on-a-leaderbo…

  587. dev.to — LLM tag TIER_1 English(EN) · Rob ·

    Putting the GPU to Work: Running Local LLMs on a Home Lab

    Yesterday (https://dev.to/posts/from-idea-to-infrastructure-standing-up-a-self-hosted-ai-dev-environment) we went from a gaming PC on a shelf to a fully configured Coder server with GitHub integration, workspace templates, and AI agents. The dev environment is runn…

  588. dev.to — LLM tag TIER_1 English(EN) · Rob ·

    Putting the GPU to Work: Running Local LLMs on a Home Lab

    Yesterday (https://dev.to/posts/from-idea-to-infrastructure-standing-up-a-self-hosted-ai-dev-environment) we went from a gaming PC on a shelf to a fully configured Coder server with GitHub integration, workspace templates, and AI agents. The dev environment is runn…

  589. dev.to — LLM tag TIER_1 English(EN) · Rob ·

    Model Showdown: Benchmarking Local vs Cloud LLMs on a Real Coding Task

    Last post we stood up Ollama on the RTX 5090, pulled a stack of models, and wired them into our coding workflow. The whole time there was an obvious question hanging over it: are local models actually good enough? Not good enough in the abstract benchmarks-on-a-leaderbo…

  590. dev.to — LLM tag TIER_1 English(EN) · Nitin Srivastava ·

    Building a Production LLM Evaluation Harness in Pytest: Cost-Bounded, Flake-Aware, CI-Gated (Runnable Python)

    I shipped my fourth LLM agent to production last quarter. By month two, the eval suite that "passed in CI" was the reason a regression made it to a customer. The tests were green. But they were green for the wrong reason: every assertion was a single LLM call against a…

  591. dev.to — LLM tag TIER_1 English(EN) · NaveenKumar Namachivayam ⚡ ·

    Beyond the Hype: A Comprehensive Guide to Benchmarking LLMs with AWS Labs’ LLMeter

    In the current AI gold rush, the conversation has shifted from "Can it do the task?" to "How efficiently can it do the task?" For engineers moving Large Language Models (LLMs) into production, the "vibe check" is no longer sufficient. You need har…

  592. dev.to — LLM tag TIER_1 English(EN) · Gabriel Anhaia ·

    LLM Response Caching: When the 80/20 Hit Rate Saves the Bill

    Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team (https://www.amazon.com/dp/B0GYLHMLMT). Also by me: Thinking in Go (2-book series) …

  593. Mastodon — mastodon.social TIER_1 English(EN) · [email protected] ·

    Beyond the hype: How do LLMs like OpenAI's GPT-4 actually function? This article demystifies the complex journey from your words to AI's 'understanding,' explai

    Beyond the hype: How do LLMs like OpenAI's GPT-4 actually function? This article demystifies the complex journey from your words to AI's 'understanding,' explaining tokenization, embeddings, and the crucial transformer architecture. Discover the iterative guessing game and the 'b…

  594. r/Anthropic TIER_1 English(EN) · /u/abhishekkumar333 ·

    LLM internals explained ( Insight of language model head)

    Out of curiosity about how large language models like ChatGPT, Gemini, and Claude actually work internally, I looked into the process from first principles. I have taken an example of 4 training sentences…

  595. r/Anthropic TIER_1 English(EN) · /u/silence-and-magic ·

    The butterfly effect in LLM social simulations. Relevant to how we write CLAUDE.md and system prompts.

    https://www.reddit.com/r/Anthropic/comments/1tkptj0/the_butterfly_effect_in_llm_social_simulations/ …

  596. r/Anthropic TIER_1 English(EN) · /u/RJSabouhi ·

    Resource: source-boundary failures in LLM evidence use

    https://www.reddit.com/r/Anthropic/comments/1tc6d7q/resource_sourceboundary_failures_in_llm_evidence/ …

  597. Mastodon — mastodon.social TIER_1 English(EN) · aihaberleri ·

    📰 Systematic Prompting in 2026: Negative Constraints & Structured JSON for LLM Reliability Systematic prompting is transforming how developers engineer LLM inte

    📰 Systematic Prompting in 2026: Negative Constraints & Structured JSON for LLM Reliability Systematic prompting is transforming how developers engineer LLM interactions, with negative constraints, structured JSON outputs, and multi-hypothesis sampling emerging as critical techniq…

  598. Mastodon — mastodon.social TIER_1 Türkçe(TR) · aihaberleri ·

    📰 Systematic Prompt Engineering 2026: Negative Constraints, JSON Outputs, and Multi-Hypothesis Methods Systematic prompt engineering for AI developers,

    📰 Systematic Prompt Engineering 2026: Negative Constraints, JSON Outputs, and Multi-Hypothesis Methods. For AI developers, systematic prompt engineering is less about simply asking questions and more about learning how to shape the answer. Negative constraints, structured JSON outputs…