Google Research evaluates LLM alignment and improves factuality
By PulseAugur Editorial · [598 sources]
Google Research has developed a new framework to evaluate the behavioral alignment of large language models with human social inclinations. This approach adapts established psychological questionnaires into large-scale situational judgment tests, allowing for the quantification of model tendencies in realistic scenarios. The research identifies gaps where model behaviors deviate from human consensus or fail to capture the range of human opinions, aiming to improve LLM navigation of social dynamics. Separately, Google Research also introduced SLED, a novel decoding strategy that enhances LLM factuality by utilizing all model layers instead of just the final one, without requiring external data or fine-tuning.
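To make the idea of "utilizing all model layers instead of just the final one" concrete, here is a minimal sketch in plain Python. It is not SLED's actual update rule, which this summary does not describe. It only illustrates the general notion of projecting every layer's hidden state to vocabulary logits and blending the results. The function names, the `alpha` blending weight, and the simple averaging scheme are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def project(hidden, unembed):
    """Project a hidden vector to vocabulary logits: one dot product per vocab row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in unembed]


def layer_mixed_distribution(layer_hiddens, unembed, alpha=0.5):
    """Blend the final layer's next-token distribution with the mean of the
    earlier layers' distributions.

    Hypothetical scheme for illustration only: alpha=0 reproduces ordinary
    final-layer decoding, larger alpha gives earlier layers more say.
    """
    per_layer = [softmax(project(h, unembed)) for h in layer_hiddens]
    final = per_layer[-1]
    earlier = per_layer[:-1]
    if not earlier:
        return final
    vocab = len(final)
    mean_early = [sum(p[v] for p in earlier) / len(earlier) for v in range(vocab)]
    return [(1 - alpha) * final[v] + alpha * mean_early[v] for v in range(vocab)]


# Toy example: 3-token vocabulary, hidden size 2, three layers (all values made up).
unembed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
layer_hiddens = [[0.2, 0.1], [0.5, 0.3], [1.0, 0.2]]
dist = layer_mixed_distribution(layer_hiddens, unembed, alpha=0.3)
next_token = max(range(len(dist)), key=dist.__getitem__)
```

Because the result is a convex combination of probability distributions, it remains a valid distribution, and no external data or fine-tuning is involved, which matches the property the summary attributes to SLED.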
AI
IMPACT
New methods for evaluating LLM alignment and improving factuality could lead to more trustworthy and socially adept AI systems.
RANK_REASON
The cluster contains two research papers from Google Research detailing new methods for evaluating LLM alignment and improving LLM factuality.
arXiv:2606.23276v2 Announce Type: replace Abstract: Knowledge Editing (KE) has emerged as a frontier for updating specific facts in LLMs without costly retraining, but its reliability and underlying mechanisms remain poorly understood. In this work, we examine KE from an adversar…
Knowledge Editing (KE) has emerged as a frontier for updating specific facts in LLMs without costly retraining, but its reliability and underlying mechanisms remain poorly understood. In this work, we examine KE from an adversarial elicitation perspective, revealing that edited k…
arXiv:2606.20482v1 Announce Type: new Abstract: To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict the human preference based on the response text. These existing methods have two key limitations. First…
arXiv:2606.19588v1 Announce Type: new Abstract: Formal tools such as SAT and SMT solvers are increasingly embedded in language model reasoning pipelines when a safety or security critical question can be formulated in logic. Unlike chain of thought whose steps are sampled from th…
To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict the human preference based on the response text. These existing methods have two key limitations. First, the users rarely provide explicit feedback for…
Heterogeneous LLM debate is motivated by the promise that diverse peers correct one another, but the same exchange that carries correction also carries adversarial influence. We measure which dominates by tracking how a heterogeneous peer changes the honest agents' revision behav…
arXiv cs.CL
TIER_1English(EN)·Naihao Deng, Yiming Feng, Chimaobi Okite, Kaijian Zou, Lu Wang, Rada Mihalcea, Yulong Chen·
arXiv:2606.18656v1 Announce Type: new Abstract: Warning: This paper studies stereotypes and biases, and contains potentially disturbing examples, used for illustration purposes only. Our findings should not be interpreted as an argument against alignment. Instead, this paper high…
arXiv cs.AI
TIER_1English(EN)·Sunnie S. Y. Kim, Margit Bowler, Leon A Gatys·
arXiv:2606.18258v1 Announce Type: cross Abstract: Large language models (LLMs) exhibit a wide range of human-like behaviors, from expressing thoughts and emotions, to engaging in relationship-building with users, to refusing requests and maintaining boundaries. Despite their prev…
arXiv cs.LG
TIER_1English(EN)·Zilong Zhang, Yi-Ting Hung, Lei Ding, Chi-Kuang Yeh·
arXiv:2606.19057v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly used as judges for scalable evaluation, yet such LLM-as-a-Judge systems exhibit systematic biases that are decoupled from semantic quality, most notably verbosity bias. Meanwhile, h…
arXiv cs.AI
TIER_1English(EN)·Xi Fang, Weijie Xu, Yuchong Zhang, Stephanie Eckman, Scott Nickleach, Chandan K. Reddy·
arXiv:2510.09905v2 Announce Type: replace Abstract: When an AI assistant remembers that Sarah is a single mother working two jobs, does it interpret her stress differently than if she were a wealthy executive? As personalized AI systems increasingly incorporate long-term user mem…
arXiv cs.AI
TIER_1English(EN)·Mika Mäntylä, Patricia Matsubara, Katia Romero Felizardo, Miikka Kuutila, Marco Gerosa, Savio de Sousa Sampaio, Tayana Conte, Igor Steinmacher·
arXiv:2606.17588v1 Announce Type: cross Abstract: Several studies have examined the use of large language models (LLMs) for title-abstract screening in systematic reviews (SRs), reporting mixed accuracy. However, questions of reliability remain largely unaddressed. In this study,…
arXiv cs.CL
TIER_1English(EN)·Hyungwon Kim, Kandarp Joshi, Lillian Zhou, Pavel Golik, Petar Aleksic·
arXiv:2606.17281v1 Announce Type: new Abstract: While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To pr…
arXiv cs.CL
TIER_1English(EN)·Ali Marashian, Alexis Palmer, Katharina von der Wense·
arXiv:2606.17234v1 Announce Type: new Abstract: The rapid rise in popularity of large language models (LLMs) for translation calls for a thorough study of the reliability of their confidence in their own outputs. Unlike many generation tasks, translation errors and confidence lev…
arXiv cs.LG
TIER_1English(EN)·SongEun Kim, Seungyoo Lee, Edwin Fong, Hyungi Lee, Juho Lee·
arXiv:2606.17832v1 Announce Type: new Abstract: Large language models (LLMs) are often hypothesized to perform implicit Bayesian inference, yet a key coherence condition, the martingale property of predictive beliefs, has been shown to fail in controlled synthetic in-context lear…
arXiv cs.CL
TIER_1English(EN)·Omar Sharif, Eftekhar Hossain, Nikhil Singh, Patrick Ng·
arXiv:2601.00215v2 Announce Type: replace-cross Abstract: Reinforcement learning with verifiable rewards has driven major gains in LLM reasoning, and it is intuitive to assume this recipe will transfer well to multimodal models. However, multimodal models do two things: first, pe…
arXiv:2606.17609v1 Announce Type: new Abstract: Compressing large language models reduces memory use and inference cost, but it can also create failures that standard benchmarks miss. A pruned model may still perform well on multiple-choice evaluations, yet fail to answer the sam…
arXiv:2606.17506v1 Announce Type: new Abstract: Evaluations of social bias in LLMs largely focus on whether models generate or imply biased content. However, as LLMs are increasingly used as judges of bias, they may exhibit social biases in subtler ways in how they evaluate biase…
Warning: This paper studies stereotypes and biases, and contains potentially disturbing examples, used for illustration purposes only. Our findings should not be interpreted as an argument against alignment. Instead, this paper highlights the need for principled approaches to mor…
Large language models (LLMs) are often hypothesized to perform implicit Bayesian inference, yet a key coherence condition, the martingale property of predictive beliefs, has been shown to fail in controlled synthetic in-context learning settings. We revisit this question in a mor…
Compressing large language models reduces memory use and inference cost, but it can also create failures that standard benchmarks miss. A pruned model may still perform well on multiple-choice evaluations, yet fail to answer the same question in open generation. We ask what pruni…
Evaluations of social bias in LLMs largely focus on whether models generate or imply biased content. However, as LLMs are increasingly used as judges of bias, they may exhibit social biases in subtler ways in how they evaluate biased content, which current methods do not systemat…
arXiv:2606.16496v1 Announce Type: new Abstract: Large multimodal language models (LLMs) have emerged as powerful tools for guiding evolutionary search toward interpretable programmatic policies. However, existing frameworks rely on a monolithic model call to simultaneously interp…
arXiv:2606.17024v1 Announce Type: new Abstract: Sparse reward reinforcement learning (RL) has become a standard tool for improving LLM reasoning, but its success depends critically on the coverage present in the base model. In practice, models are often primed for RL through \emp…
arXiv cs.CL
TIER_1English(EN)·Katharina Trinley, Jesujoba O. Alabi, Dietrich Klakow, Vagrant Gautam·
arXiv:2606.16407v1 Announce Type: new Abstract: Faithful and robust pronoun use is important for fair and coherent generations, yet large language models largely fail when multiple referents use different pronouns. To study the interplay of reasoning, repetition, and bias in this…
arXiv cs.CL
TIER_1English(EN)·Xuran Li, Guanqin Zhang, Imran Razzak, Hakim Hacid, Eleanna Kafeza, Hao Xue, Flora D. Salim·
arXiv:2606.16368v1 Announce Type: new Abstract: Current evaluation paradigms for Large Language Model (LLM) personalization rely heavily on brittle surface-matching metrics or computationally expensive LLM-as-a-judge protocols, both of which lack interpretability. To address thes…
arXiv cs.CL
TIER_1English(EN)·Nafiseh Nikeghbal, Amir Hossein Kargaran, Shaghayegh Kolli, Jana Diesner·
arXiv:2606.16011v1 Announce Type: new Abstract: Standard accuracy benchmarks are designed to test how closely large language models (LLMs) approach correct answers, but are not suitable for testing whether LLMs stick with a correct answer when that answer is challenged by a plaus…
arXiv cs.AI
TIER_1English(EN)·Erica Zhang, Fangzhao Zhang, Aneesh Pappu, Batu El, Jose Blanchet, Susan Athey, Jiashuo Liu, James Zou·
arXiv:2605.13909v2 Announce Type: replace-cross Abstract: Negotiation is a central mechanism of economic exchange, shaping markets, procurement, labor agreements, and resource allocation. It is also a canonical testbed for agentic language models, requiring multi-turn interaction…
arXiv:2510.13940v4 Announce Type: replace-cross Abstract: Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a si…
arXiv cs.AI
TIER_1English(EN)·Uljad Berdica, Fernando Acero, Anton Ipsen, Parisa Zehtabi, Michael Cashmore, Manuela Veloso·
arXiv:2604.05859v2 Announce Type: replace Abstract: We study Contextual Multi-Armed Bandits (CMABs) for non-episodic decision-making problems where the context includes both textual and numerical information (e.g., recommendation systems, dynamic portfolio adjustments, offer sele…
arXiv cs.AI
TIER_1English(EN)·Louie Hong Yao, Nicholas Jarvis, Tiffany Zhan, Saptarshi Ghosh, Linfeng Liu, Tianyu Jiang·
arXiv:2509.22888v2 Announce Type: replace Abstract: Standard LLM evaluation practices compress diverse abilities into single scores, obscuring their inherently multidimensional nature. We present JE-IRT, a geometric item-response framework that embeds both LLMs and questions in a…
arXiv:2606.15610v1 Announce Type: cross Abstract: LLM-as-a-judge systems are now routinely used for open-ended model evaluation, where human preference annotation is costly, slow, and difficult to reproduce. Yet these judges are often reported as scalar accuracy, win-rate, or agr…
arXiv cs.AI
TIER_1English(EN)·Aina Vila Pons, Ioannis Tzachristas, Constantinos Antoniou·
arXiv:2606.15314v1 Announce Type: cross Abstract: Industrial retrofit planning depends on structured operational data rather than free text: planners must estimate whether a newly registered prototype will require a retrofit, which retrofit package it will need, and how long the …
arXiv cs.AI
TIER_1English(EN)·Olivia Peiyu Wang, Sanna Wong-Toropainen, Daneshvar Amrollahi, Ryan Bai, Tashvi Bansal, Arush Garg, Leilani H. Gilpin·
arXiv:2606.16118v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve strong performance on reasoning tasks, but whether this reflects faithful logical inference or heuristic approximation remains unclear. We study this question in legal entailment by comparing thr…
arXiv:2606.15474v1 Announce Type: new Abstract: Continuous evaluation of LLM products relies on a strong LLM judge treated as ground truth: a cheap monitor scores every interaction and a team is paged when the score drifts down. But the judge is itself a model behind an API, and …
arXiv:2606.15029v1 Announce Type: new Abstract: LLM judges are used to reduce the need for costly human labor in evaluating open-ended text generation. However, the reliability of these judges depends critically on their alignment with human raters -- a property that itself depen…
arXiv:2606.14838v1 Announce Type: new Abstract: How to define a good explanation is a long-standing philosophical debate which has found recent renewed interest in the context of AI outputs. Explainability is crucial for AI adoption in many contexts, but in order to produce good …
While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To preserve flexibility and code-switching capabiliti…
arXiv cs.CL
TIER_1English(EN)·Katharina von der Wense·
The rapid rise in popularity of large language models (LLMs) for translation calls for a thorough study of the reliability of their confidence in their own outputs. Unlike many generation tasks, translation errors and confidence levels can be useful at different levels of granula…
Sparse reward reinforcement learning (RL) has become a standard tool for improving LLM reasoning, but its success depends critically on the coverage present in the base model. In practice, models are often primed for RL through mid-training on curated reasoning traces that…
Large multimodal language models (LLMs) have emerged as powerful tools for guiding evolutionary search toward interpretable programmatic policies. However, existing frameworks rely on a monolithic model call to simultaneously interpret visual behavioral evidence and synthesize co…
Faithful and robust pronoun use is important for fair and coherent generations, yet large language models largely fail when multiple referents use different pronouns. To study the interplay of reasoning, repetition, and bias in this task, prior work relies exclusively on behaviou…
Current evaluation paradigms for Large Language Model (LLM) personalization rely heavily on brittle surface-matching metrics or computationally expensive LLM-as-a-judge protocols, both of which lack interpretability. To address these limitations, we introduce Natural Language Inf…
arXiv:2606.14119v1 Announce Type: new Abstract: Fault diagnostics and recovery in smart factories is challenging because critical information is dispersed across manuals of multiple machines which are interconnected through the manufacturing process. Large Language Models (LLMs) …
arXiv:2606.13931v1 Announce Type: new Abstract: Lawyer-client consultation is a critical starting point for legal services. Effective legal assistance hinges on eliciting sufficient and truthful information from clients in order to devise strategies that best protect their intere…
arXiv cs.AI
TIER_1English(EN)·Toni J. B. Liu, Baran Zadeoğlu, Nicolas Boullé, Raphaël Sarfati, Gurbir Arora, Christopher J. Earls·
arXiv:2601.16407v3 Announce Type: replace-cross Abstract: Large language models (LLMs) make next-token predictions based on clues present in their context, such as semantic descriptions and in-context examples. Yet, elucidating which prior tokens most strongly influence a given p…
arXiv cs.AI
TIER_1English(EN)·Shaun Feakins, Ibrahim Habli, Kim Littler, Robert Palin·
arXiv:2606.14327v1 Announce Type: cross Abstract: This paper appraises recent frameworks within AI development to integrate LLMs into control tasks in automotive contexts from the perspective of safety assurance. This work has built upon the rapid integration of LLMs across autom…
arXiv:2606.13685v1 Announce Type: cross Abstract: LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks spanni…
arXiv cs.LG
TIER_1English(EN)·Angira Sharma, Christian Schroeder de Witt, Philip Torr, Anisoara Calinescu, Jialin Yu·
arXiv:2606.14388v1 Announce Type: new Abstract: Interventions designed to modify a particular behavior in LLMs, such as refusal or sycophancy, often produce unintended changes in other behaviors. This lack of targeted control makes it difficult to design and implement reliable sa…
arXiv:2606.14150v1 Announce Type: cross Abstract: Pruning promises a shortcut to strong small language models. In this work, we examine this promise by pruning Llama-3.1-8B at pruning ratios of 0.5–0.8 with six methods spanning depth, width, and sparse granularities, under two c…
arXiv:2606.13944v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly characterised in recent evaluation work as having stable, model-level preference and value systems. However, accompanying robustness checks are limited to incidental prompt perturbations…
ExpRL uses human-written question-answer data as reward scaffolds to provide automated reinforcement learning priming for language models, outperforming traditional methods on math reasoning tasks.
Open-source LLM agent ecosystems are growing rapidly, yet the security of community-contributed skills - modular tool definitions that extend agent capabilities - remains largely unvetted. The gap we fill: existing scanners operate at the code layer and are structurally blind to …
Answer stability in large language models is evaluated through controlled challenges that measure response consistency when correct answers face plausible counterarguments, revealing significant variation in model reliability beyond traditional accuracy metrics.
Interventions designed to modify a particular behavior in LLMs, such as refusal or sycophancy, often produce unintended changes in other behaviors. This lack of targeted control makes it difficult to design and implement reliable safety controls. To understand these side-effects,…
This paper appraises recent frameworks within AI development to integrate LLMs into control tasks in automotive contexts from the perspective of safety assurance. This work has built upon the rapid integration of LLMs across automotive settings. However, we find that at present, …
Pruning promises a shortcut to strong small language models. In this work, we examine this promise by pruning Llama-3.1-8B at pruning ratios of 0.5–0.8 with six methods spanning depth, width, and sparse granularities, under two controlled token-matched settings. (1) With the sam…
Fault diagnostics and recovery in smart factories is challenging because critical information is dispersed across manuals of multiple machines which are interconnected through the manufacturing process. Large Language Models (LLMs) can provide a promising approach. In this paper,…
arXiv cs.CL
TIER_1English(EN)·Sangho Kim, Heejin Kim, Yoonhee Park, Hyunggeun Jeon, Jaejin Lee·
arXiv:2606.12922v1 Announce Type: new Abstract: Political bias in large language models (LLMs) is increasingly significant, but difficult to measure reproducibly across political and linguistic contexts. We introduce Polar, a 4,026-instance multiple-choice benchmark that measures…
arXiv cs.AI
TIER_1English(EN)·Ashutosh Hathidara, Sai Shruthi Sistla, Sebastian Schreiber, Sahil Bansal·
arXiv:2606.12451v1 Announce Type: new Abstract: Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. As embedding-based retrieval approaches rely on compact encoders that may under-capture specialized tool semantics, paramet…
arXiv:2606.12702v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly integrated into clinical systems, making it essential to evaluate the real-world utility of these systems. However, static benchmarks tend to measure correctness rather than user accepta…
arXiv:2606.12754v1 Announce Type: cross Abstract: Are large language models (LLMs) bad at capturing human judgment? Two commonly stated limitations are that LLMs fail to capture full distributions of responses, and that their judgments are unstable across wording variations. We d…
arXiv cs.AI
TIER_1English(EN)·Benno Krojer, Shravan Nayak, Oscar Mañas, Vaibhav Adlakha, Desmond Elliott, Siva Reddy, Marius Mosbach·
arXiv:2602.00462v4 Announce Type: replace-cross Abstract: Transforming a large language model (LLM) into a vision-language model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simpl…
arXiv cs.CL
TIER_1English(EN)·Aviya Maimon, Amir DN Cohen, Gal Vishne, Shauli Ravfogel, Reut Tsarfaty·
arXiv:2507.20208v2 Announce Type: replace Abstract: Current evaluations of large language models (LLMs) rely heavily on a growing collection of benchmarks and on aggregate benchmark scores, yet it remains unclear what this comparison actually captures, and what these scores revea…
arXiv cs.CL
TIER_1English(EN)·Laura Majer, Jan Šnajder, Martin Tutek·
arXiv:2606.13254v1 Announce Type: new Abstract: The growing need to represent diverse perspectives has increased interest in pluralistic LLM generation. Although difficult to operationalize, identifying perspectives expressed in text would provide clear guidance on pluralistic al…
arXiv cs.CL
TIER_1English(EN)·Camilla Dalerci, Thilo Michael, Robin Schaefer, Daniel Weinland·
arXiv:2606.13111v1 Announce Type: new Abstract: We present MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public ad…
Large language models (LLMs) are increasingly characterised in recent evaluation work as having stable, model-level preference and value systems. However, accompanying robustness checks are limited to incidental prompt perturbations such as syntax variation and option reordering.…
Lawyer-client consultation is a critical starting point for legal services. Effective legal assistance hinges on eliciting sufficient and truthful information from clients in order to devise strategies that best protect their interests. This task requires Large Language Models (L…
The growing need to represent diverse perspectives has increased interest in pluralistic LLM generation. Although difficult to operationalize, identifying perspectives expressed in text would provide clear guidance on pluralistic alignment and more clearly articulate the pluralis…
Evaluating new large language models typically requires costly human annotation campaigns at scale. LLM-as-a-judge offers a cheaper alternative, but judge scores carry systematic errors - such as position bias, self-preference, or intransitivity - that can strongly miscalibrate t…
We present MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public administration, model selection remains largely ad hoc, …
Political bias in large language models (LLMs) is increasingly significant, but difficult to measure reproducibly across political and linguistic contexts. We introduce Polar, a 4,026-instance multiple-choice benchmark that measures political bias through option-level likelihoods…
arXiv:2606.11316v1 Announce Type: new Abstract: Large language models are increasingly deployed across professional domains, bringing hard-to-predict risks, including the generation of harmful or disrespectful content. Although substantial progress has been made in developing saf…
arXiv cs.AI
TIER_1English(EN)·Kaituo Zhang, Mingzhi Hu, Hoang Anh Duy Le, Fariha Kabir Torsha, Zhimeng Jiang, Minh Khai Bui, Chia-Yuan Chang, Yu-Neng Chuang, Zhen Xiong, Ying Lin, Guanchu Wang, Na Zou·
arXiv:2601.17717v3 Announce Type: replace Abstract: Large Language Models (LLMs) have emerged as powerful tools for generating data across various modalities. By transforming data from a scarce resource into a controllable asset, LLMs mitigate the bottlenecks imposed by the acqui…
arXiv:2606.12385v1 Announce Type: new Abstract: Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own…
arXiv cs.CL
TIER_1English(EN)·Hongjian Zhou, Xinyu Zou, Jinge Wu, Sean Wu, Junchi Yu, Bradley Max Segal, Tobias Erich Niebuhr, Sara Amro, Michael Petrus, Sheikh Momin, Alexandra M. Cardoso Pinto, Rachel Niesen, Laura Sophie Wegner, Dhruv Darji, Jung Moses Koo, Joshua Fieggen, Kapil N…·
arXiv:2606.12291v1 Announce Type: new Abstract: Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients increasingly use them for health advice. We show this assu…
arXiv:2606.11375v1 Announce Type: cross Abstract: Standard linear probing declares a property "encoded" when a classifier on hidden states achieves high accuracy. The protocol works well on a snapshot but breaks across pre-training: probe accuracy saturates within the first few t…
arXiv:2506.08473v4 Announce Type: replace Abstract: Fine-tuning large language models (LLMs) improves performance but introduces critical safety vulnerabilities: even minimal harmful data can severely compromise safety measures. We observe that perturbations orthogonal to the ali…
arXiv cs.CL
TIER_1English(EN)·Muhammed Saeed, Simon Razniewski·
arXiv:2601.07506v2 Announce Type: replace Abstract: While large language models (LLMs) are increasingly used as automatic judges for question answering (QA) and other reference-conditioned evaluation tasks, little is known about their ability to adhere to a provided reference. We…
Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own dependencies are documented only in separate re…
Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients increasingly use them for health advice. We show this assumption is fragile: when misleading context is in…
arXiv cs.AI
TIER_1English(EN)·Sophie Hao, William Merrill·
arXiv:2605.16430v2 Announce Type: replace-cross Abstract: Scaling LLMs requires tremendous computational resources, and recent advances in AI have gone hand in hand with massive amounts of capital expenditure. While it is established that scaling up LLMs reliably increases model …
arXiv cs.AI
TIER_1English(EN)·Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Jingxiang Chen, Mohammad Kachuee, Teja Gollapudi, Yiwei Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, Xin Luna Dong·
arXiv:2509.25760v2 Announce Type: replace-cross Abstract: While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside thei…
arXiv:2606.11016v1 Announce Type: new Abstract: We ask whether large language models (LLMs) merely imitate rationales when choosing between two options, or whether their choices reflect a systematic underlying decision structure. Using synthetic binary decision settings in which …
arXiv:2606.09876v1 Announce Type: new Abstract: Large language models often express high confidence in answers that are wrong. Standard calibration remedies typically act globally or at the score level, reducing unwarranted confidence but also risking erosion of warranted confide…
arXiv cs.CL
TIER_1English(EN)·Qian Zhu, Xinnan Guo, Jingjing Huo, Jun Li, Pan Liu, Wenyan Yang, Wanqing Xu, Xuan Lin·
arXiv:2603.14463v2 Announce Type: replace Abstract: Adapting Large Language Models (LLMs) to high-stakes vertical domains like insurance presents a significant challenge: scenarios demand strict adherence to complex regulations and business logic with zero tolerance for hallucina…
arXiv:2501.14717v2 Announce Type: replace Abstract: Table modeling has progressed for decades. In this work, we revisit this trajectory and highlight emerging challenges in the LLM era, particularly the paradox of choice: the difficulty of attributing performance gains amid diver…
arXiv cs.CL
TIER_1English(EN)·Yupu Hao, Zhuoran Jin, Huanxuan Liao, Kang Liu, Jun Zhao·
arXiv:2606.10875v1 Announce Type: new Abstract: Large language models (LLMs) rely on tool use to act as autonomous agents, yet often fail in multi-step execution due to insufficient tool-related knowledge and ineffective knowledge activation. Therefore, we present a systematic st…
ModSleuth is an agentic system that recursively reconstructs large-scale dependency graphs for LLM development by analyzing public artifacts and resolving inconsistencies in documentation and artifact identities.
We ask whether large language models (LLMs) merely imitate rationales when choosing between two options, or whether their choices reflect a systematic underlying decision structure. Using synthetic binary decision settings in which models choose between profiles defined by graded…
Large language models (LLMs) rely on tool use to act as autonomous agents, yet often fail in multi-step execution due to insufficient tool-related knowledge and ineffective knowledge activation. Therefore, we present a systematic study on how knowledge influences tool-use perform…
arXiv:2606.09038v1 Announce Type: new Abstract: Large Language Models (LLMs) have enabled increasingly personalized interactions by adapting to users' preferences, contexts, and long-term histories. However, the mechanisms that enable personalization also expand the safety landsc…
arXiv:2601.14063v2 Announce Type: replace-cross Abstract: Cross-cultural competence in large language models (LLMs) requires understanding and adapting Culture-Specific Items (CSIs) across varying cultural contexts. However, progress in evaluating this capability remains limited …
arXiv cs.AI
TIER_1English(EN)·Yasushi Sakai, Allen Song, Kent Larson·
arXiv:2606.08098v1 Announce Type: new Abstract: Majority voting over sampled answers is the dominant unsupervised aggregator for multi-sample LLM inference. We show that piping the signals every sample carries into a delegation-based aggregator (Propagational Proxy Voting, PPV) y…
arXiv cs.AI
TIER_1English(EN)·Anissa Alloula, Federico Licini, Ava Batchkala, Seraphina Goldfarb-Tarrant·
arXiv:2606.07874v1 Announce Type: new Abstract: LLMs-as-judges are the only way to evaluate safety at scale. Despite their importance, LLM-judges themselves are rarely evaluated beyond human agreement in simple, static benchmarks. We therefore investigate two under-explored but c…
arXiv:2605.15416v2 Announce Type: replace-cross Abstract: Jung et al. (2025) introduce a hypothesis testing framework for guaranteeing agreement between large language models (LLMs) and human judgments, relying on the assumption that the model's estimated confidence is monotonic …
arXiv:2601.21996v2 Announce Type: replace-cross Abstract: While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Inf…
arXiv cs.LG
TIER_1English(EN)·Dennis Frauen, Athiya Deviyani, Mihaela van der Schaar, Stefan Feuerriegel·
arXiv:2601.21816v2 Announce Type: replace Abstract: Evaluating the performance of large language models (LLMs) from human preference data is crucial for obtaining LLM leaderboards. However, many existing approaches either rely on restrictive parametric assumptions or lack valid u…
arXiv:2606.07834v1 Announce Type: cross Abstract: LLM judges increasingly turn verdicts into system commitments. Under mixed evidence (claims with both supporting and refuting sources) this is unsafe: when the schema exposes CONFLICTING as the authorized non-directional verdict, …
arXiv:2606.07069v1 Announce Type: new Abstract: We introduce mmPISA-bench, a compact high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). The benchmark consists of 25 multiple-choice questions that require reas…
arXiv:2601.10896v2 Announce Type: replace Abstract: LLMs are increasingly used as third-party judges, yet their reliability when evaluating speakers in dialogue remains poorly understood. We show that LLMs judge identical claims differently depending on framing: the same content …
arXiv:2606.06755v1 Announce Type: new Abstract: Authorship attribution research has traditionally focused on long-form, expressive texts; however, interactions with large language models (LLMs) are typically brief and task-driven prompts. This raises a fundamental question: do su…
arXiv:2603.26846v2 Announce Type: replace-cross Abstract: As Large Language Models (LLMs) expand in capability and application scope, their trustworthiness becomes critical. A vital risk is intrinsic deception, wherein models strategically mislead users to achieve their own objec…
arXiv cs.AI
TIER_1English(EN)·Gonzalo Mancera, Daniel DeAlcala, Aythami Morales, Julian Fierrez, Ruben Tolosana, Francisco Jurado·
arXiv:2606.06946v1 Announce Type: cross Abstract: We present LoRA-MINT, a new methodology for Membership Inference Test (MINT) applied to recent Large Language Models (LLMs) fine-tuned for specific Natural Language Processing (NLP) tasks through Low-Rank Adaptation (LoRA). The pr…
Multilingual LLM-as-a-judge is widely used to evaluate model outputs across languages, but suffers from cross-lingual inconsistency (Fu and Liu, 2025). Existing methods typically treat this inconsistency as noise and mitigate it through voting or aggregation. In this work, we ins…
arXiv cs.AI
TIER_1English(EN)·Cristina Carleo, Pietro Liguori, Naghmeh Ivaki, Domenico Cotroneo·
arXiv:2606.05396v1 Announce Type: cross Abstract: Producing labeled vulnerable code at scale is a recurring obstacle for learning-based vulnerability detection: mined corpora carry substantial label noise, and existing LLM-based augmentation propagates these inaccuracies becaus…
arXiv cs.AI
TIER_1English(EN)·Oleg Somov, Mikhail Chaichuk, Gleb Ershov, Karim Vafin, Mikhail Seleznyov, Alexander Panchenko, Elena Tutubalina·
arXiv:2603.16475v2 Announce Type: replace Abstract: In schema-guided reasoning (SGR) pipelines, LLMs produce explicit intermediate structures -- rubrics, checklists, or verification queries -- before committing to a final decision. SGR is increasingly adopted because it promises …
LLM judges increasingly turn verdicts into system commitments. Under mixed evidence (claims with both supporting and refuting sources) this is unsafe: when the schema exposes CONFLICTING as the authorized non-directional verdict, returning SUPPORTS/REFUTES is an unauthorized dire…
Motivated by Large Language Model (LLM) cascading, we propose an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs. In each period, a decision-maker observes a request context and faces a two-phase decision problem. In the query phase, the decis…
We introduce mmPISA-bench, a compact high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). The benchmark consists of 25 multiple-choice questions that require reasoning in order to be answered correctly. Each qu…
We present LoRA-MINT, a new methodology for Membership Inference Test (MINT) applied to recent Large Language Models (LLMs) fine-tuned for specific Natural Language Processing (NLP) tasks through Low-Rank Adaptation (LoRA). The primary goal is to assess whether individual samples…
arXiv:2606.06087v1 Announce Type: new Abstract: Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills into the prompt at every step incurs substantial context overhead and exposes skill content as plaintext. We present Latent…
arXiv cs.CL
TIER_1English(EN)·Amirhossein Ghaffari, Ali Goodarzi, Huong Nguyen, Simo Hosio, Lauri Lovén, Ekaterina Gilman·
arXiv:2606.06027v1 Announce Type: cross Abstract: Community-conditioned language model adaptation requires choices about data collection, community definition, and evaluation that are currently made independently in each study, making it hard to compare assumptions or reuse artif…
arXiv cs.CL
TIER_1English(EN)·Kuan-Yen Chen, Fang-Yi Su, Jung-Hsien Chiang·
arXiv:2606.05976v1 Announce Type: cross Abstract: Recent work shows that LLM agents struggle to correct errors in their own reasoning traces yet show markedly higher correction rates when identical claims appear under external sources. We ask whether this asymmetry reflects a cap…
arXiv cs.CL
TIER_1English(EN)·Taewon Yun, Hyeonseong Park, Jeonghwan Choi, Hayoon Park, Yeeun Choi, Hwanjun Song·
arXiv:2606.05563v1 Announce Type: cross Abstract: Evaluating LLM mediators remains challenging, as mediation unfolds as a real-time trajectory shaped by disputants' shifting emotions, intentions, and context. Existing testbeds rely on a few expert-authored domains, vary mainly st…
arXiv:2606.05384v1 Announce Type: cross Abstract: LLM-as-judge evaluation is widely used in benchmarking pipelines, where model outputs are compared and ranked using automated evaluators. These pipelines typically assume that judgments are stable properties of fixed inputs. We sh…
arXiv:2606.06350v1 Announce Type: new Abstract: Reliable rubric grading requires more than accurate score prediction. Each judgement must be grounded in the mark scheme and evidence from the student answer. Existing credit-assignment and intervention methods, primarily designed f…
arXiv cs.CL
TIER_1English(EN)·Gianluca Barmina, Peter Schneider-Kamp, Lukas Galke Poech·
arXiv:2606.06286v1 Announce Type: new Abstract: Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a propensity-awar…
arXiv:2606.05804v1 Announce Type: new Abstract: Prompted knowledge cutoff instructs a large language model (LLM) to act as if information beyond a specified cutoff date were unavailable. However, prior work mainly relies on direct-answer generation, which struggles when post-cuto…
arXiv:2606.05793v1 Announce Type: new Abstract: While LLM-based agents excel at individual tasks, effective collaboration with realistic human partners remains challenging. Most of the existing conversation-level collaborative studies lack grounded interaction and behavioral exec…
arXiv cs.LG
TIER_1English(EN)·Arslan Bisharat, Brian Ortiz, Eric Spencer, Khushboo Bhadauria, TaiNing Wang, George K. Thiruvathukal, Konstantin Laufer, Mohammed Abuhamad·
arXiv:2606.05792v1 Announce Type: cross Abstract: TLA+ has supported industrial verification at companies such as Amazon and Microsoft, yet writing correct TLA+ specifications from natural language still requires time and expertise, which limits adoption. LLMs show promise, but n…
arXiv cs.LG
TIER_1English(EN)·Rohan N. Pradhan, Steve Goley·
arXiv:2606.05403v1 Announce Type: new Abstract: Language models increasingly act as epistemic proxies, synthesizing evidence from multiple sources to inform decisions. Whether they evaluate the quality of that evidence, or merely aggregate it based on surface presentation, remain…
Tool-calling language model agents exhibit improved safety after initial interactions, with a systematic benchmark demonstrating enhanced security through prior task completion.
Authorship attribution research has traditionally focused on long-form, expressive texts; however, interactions with large language models (LLMs) are typically brief and task-driven prompts. This raises a fundamental question: do such prompts contain a stable, author-identifiable…
Reliable rubric grading requires more than accurate score prediction. Each judgement must be grounded in the mark scheme and evidence from the student answer. Existing credit-assignment and intervention methods, primarily designed for self-contained reasoning tasks such as mathem…
Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a propensity-aware framework for memorization evaluation that con…
Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills into the prompt at every step incurs substantial context overhead and exposes skill content as plaintext. We present LatentSkill, a framework that converts textual skills …
Community-conditioned language model adaptation requires choices about data collection, community definition, and evaluation that are currently made independently in each study, making it hard to compare assumptions or reuse artifacts. We present RedditPersona, a modular framewor…
Recent work shows that LLM agents struggle to correct errors in their own reasoning traces yet show markedly higher correction rates when identical claims appear under external sources. We ask whether this asymmetry reflects a capability deficit or a role-label artifact: does an …
Prompted knowledge cutoff instructs a large language model (LLM) to act as if information beyond a specified cutoff date were unavailable. However, prior work mainly relies on direct-answer generation, which struggles when post-cutoff knowledge is not explicitly queried but is on…
arXiv:2606.04915v1 Announce Type: new Abstract: Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled perturbation…
arXiv cs.CL
TIER_1English(EN)·XiuYu Zhang, Yi Shan, Junfeng Fang, Zhenkai Liang·
arXiv:2606.05122v1 Announce Type: new Abstract: Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted training: promp…
arXiv:2601.05633v2 Announce Type: replace Abstract: Recent LLMs excel at formal tasks such as mathematical reasoning and code generation, but still struggle with broader abilities such as planning, creativity, and social intelligence. Inspired by human learning, where formal inst…
arXiv cs.LG
TIER_1English(EN)·Rachit Bansal, Clara Mohri, Tian Qin, David Alvarez-Melis, Sham Kakade·
arXiv:2606.04272v1 Announce Type: new Abstract: The standard LLM training pipeline applies reinforcement learning (RL) only after pre-training and supervised fine-tuning (SFT). We question this status quo by training an LLM from scratch and applying RL, SFT, and SFT followed by RL…
arXiv:2606.04035v1 Announce Type: cross Abstract: We present a systematic study of domain-dependent safety behavior in open-weight LLMs: 7 standardized experiments across 7 ethical domains, testing 5 models (12B–70B) in 4,200 interactions with dual-judge validation. Using a dual…
arXiv cs.AI
TIER_1English(EN)·Liang Shan, Kaicheng Shen, Wen Wu, Zhenyu Ying, Chaochao Lu, Yan Teng, Jingqi Huang, Qingshan Liu, Guangze Ye, Guoqing Wang, Jie Zhou, Liang He·
arXiv:2511.07107v3 Announce Type: replace Abstract: Ensuring the safety of Large Language Models (LLMs) is critical for real-world deployment. However, current safety measures often fail to address implicit, domain-specific risks. To investigate this gap, we introduce a dataset o…
arXiv cs.AI
TIER_1English(EN)·Huashan Sun, Shengyi Liao, Yansen Han, Yu Bai, Yang Gao, Cheng Fu, Weizhou Shen, Fanqi Wan, Ming Yan, Ji Zhang, Fei Huang·
arXiv:2505.11166v3 Announce Type: replace-cross Abstract: Despite advances in pretraining with extended context sizes, large language models (LLMs) still face challenges in effectively utilizing real-world long-context information, primarily due to insufficient long-context align…
arXiv:2603.19225v3 Announce Type: replace-cross Abstract: Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundamentals derived from regulatory filings and trading signals computed from price dynam…
arXiv cs.CL
TIER_1English(EN)·Ming-Hao Hsu, Xiaohai Tian, Jun Zhang, Zhizheng Wu·
arXiv:2606.04474v1 Announce Type: new Abstract: Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this modality gap is not a uniform cognitive deficit. Evaluating three diverse SLLMs, we show speech-to-text (S2T) matche…
PropMe framework evaluates language model memorization by distinguishing between forced reproduction capabilities and natural propensity, using SimpleTrace for deterministic attribution and propensity-transformed metrics across open models and datasets.
LatentSkill enables efficient deployment of textual skills in agent systems by converting them into LoRA adapters stored in weight space, reducing context overhead while maintaining modularity and composability.
Parametric tool retrieval models show reduced performance and understanding when evaluated with realistic ambiguous queries compared to standard benchmarks, revealing a dissociation between knowledge retrieval and true tool comprehension.
SoCRATES presents a realistic multi-domain benchmark for evaluating proactive LLM mediators across various socio-cognitive adaptation axes, demonstrating that even top-performing models only resolve about one-third of the consensus gap in conflict resolution.
Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted training: prompted few-shot, a base model already predicts an e…
Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled perturbation that replaces semantic variable names with plac…
Conversational Search (CS) considers retrieval of relevant documents based on conversational context. Large Language Models (LLMs) have significantly enhanced CS by enabling effective query rewriting. However, employing LLMs during inference poses efficiency challenges. A method …
Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this modality gap is not a uniform cognitive deficit. Evaluating three diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, synt…
arXiv:2606.03036v1 Announce Type: new Abstract: LLMs have evolved from basic chatbots to the backbone of the AI ecosystem, now widely used in healthcare, schools, and government services. The domain-wide adoption of LLMs necessitates continuous evaluation to ensure their safety a…
arXiv cs.CL
TIER_1English(EN)·Chaoyi Xiang, Olga Ohrimenko, Benjamin I. P. Rubinstein, Lea Frermann·
arXiv:2606.03291v1 Announce Type: new Abstract: Large language models (LLMs) can memorize sensitive facts, motivating unlearning methods that remove targeted knowledge without costly retraining. However, unlearning research remains heavily English-centric. We study multilingual u…
arXiv:2606.03043v1 Announce Type: new Abstract: LMs-as-judges are now standard, yet judges agree strongly with one another while agreeing only weakly with humans. We test whether this reflects shared signal or shared bias by measuring four geometric quantities on the standard LLM…
arXiv cs.AI
TIER_1English(EN)·Lukas Fesser, Yasha Ektefaie, Ada Fang, Sham M. Kakade, Marinka Zitnik·
arXiv:2604.12176v2 Announce Type: replace Abstract: Relational reasoning is the ability to infer relations that jointly bind multiple entities, attributes, or variables. This ability is central to scientific reasoning, but existing evaluations of relational reasoning in large lan…
arXiv:2606.02606v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly deployed as continuously evolving services, where frequent base-model updates may invalidate previously deployed task-specific Low-Rank Adaptation (LoRA) adapters. For service provider…
arXiv:2606.03092v1 Announce Type: new Abstract: Inference-time scaling has emerged as a critical avenue for enhancing Large Language Models' performance, yet real-world deployment is constrained by strict computational budgets. In this work, we formulate inference budget allocati…
arXiv:2602.07842v2 Announce Type: replace Abstract: Confidence calibration is essential for making large language models (LLMs) reliable, yet existing training-free methods have been primarily studied under single-answer question answering. In this paper, we show that these metho…
arXiv:2606.03785v1 Announce Type: new Abstract: Backdoor attacks in Large Language Models (LLMs) are a growing security concern, where models can generate adversary-chosen content. Existing defenses target backdoors one at a time and typically require knowledge of the trigger, le…
arXiv:2606.03318v1 Announce Type: new Abstract: Despite great advances in tool-use capabilities of large language models (LLMs), existing evaluation benchmarks struggle to fully align with real-world scenarios. Such benchmarks mostly rely on simulated idealized user assumptions a…
Self-Evaluation Elicitation (SEE) method improves model calibration for quality assessment through calibration-coupled reinforcement learning and masked distillation, demonstrating transferable quality evaluation beyond specific judge preferences.
Backdoor attacks in Large Language Models (LLMs) are a growing security concern, where models can generate adversary-chosen content. Existing defenses target backdoors one at a time and typically require knowledge of the trigger, leaving the defender at a structural disadvantage …
Despite great advances in tool-use capabilities of large language models (LLMs), existing evaluation benchmarks struggle to fully align with real-world scenarios. Such benchmarks mostly rely on simulated idealized user assumptions and lack experience-oriented evaluation. These l…
Large language models (LLMs) can memorize sensitive facts, motivating unlearning methods that remove targeted knowledge without costly retraining. However, unlearning research remains heavily English-centric. We study multilingual unlearning by extending the TOFU benchmark to fiv…
arXiv:2606.00476v1 Announce Type: new Abstract: Do LLM agents act on the reasoning they state? This question of process fidelity is central to using LLMs in social simulation, yet it is hard to measure where no reference for correct behavior exists. We study it in a controlled set…
arXiv cs.LG
TIER_1English(EN)·Weitao Li, Hao Zhou, Xuanyu Lei, Fandong Meng, Yuanhang Liu, Jingyi Ren, Ante Wang, Xiaolong Wang, Yuanchi Zhang, Fuwen Luo, Guangwen Yang, Lin Gan, Weizhi Ma, Yang Liu·
arXiv:2606.00869v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has become central to LLM reasoning, but its outcome-level rewards can make models more willing to give confident answers when evidence or reasoning is unreliable. Existing SFT o…
arXiv cs.CL
TIER_1English(EN)·Yoonah Park, Haesung Pyun, Yohan Jo·
arXiv:2509.23782v4 Announce Type: replace Abstract: While large language models (LLMs) perform strongly on diverse tasks, their trustworthiness is limited by erratic behavior that is unfaithful to their internal knowledge. In particular, LLMs often fail on multiple-choice questio…
arXiv cs.CL
TIER_1English(EN)·Siddhesh Milind Pawar, Sarah Masud, Haneul Yoo, Alice Oh, Isabelle Augenstein·
arXiv:2606.02493v1 Announce Type: new Abstract: Large language models (LLMs) are being increasingly used to answer subjective, information-seeking questions, where users are sensitive to how responses are communicated, not just whether the answers are correct. Existing LLM evalua…
arXiv:2606.01879v1 Announce Type: new Abstract: Existing research largely reduces cultural intelligence in LLMs to a knowledge-level problem, overlooking whether models can effectively utilize their acquired knowledge in realistic scenarios. To bridge this gap, we introduce Cultu…
arXiv:2606.01168v1 Announce Type: new Abstract: Chain-of-Thought (CoT) has significantly enhanced LLM reasoning, yet often incurs substantial computational overhead due to "overthinking": generating excessively long rationales without commensurate accuracy gains. Existing efficie…
arXiv:2606.00975v1 Announce Type: new Abstract: LLM chatbots increasingly serve as a first source of support for people in psychological distress, including those whose distress is entangled with delusional beliefs. Prior work on LLM mental-health safety largely evaluates general…
arXiv cs.CL
TIER_1English(EN)·F. Carichon, S. Sharma, M. Girard, R. Rampa, G. Farnadi·
arXiv:2606.00875v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for tasks involving creative problem solving and idea generation. However, there is a lack of consensus concerning their creative capabilities: some studies report superior performa…
arXiv:2606.00596v1 Announce Type: new Abstract: Large language models have rapidly evolved in multilingual competence and reasoning capacity, enabling their integration into Social Sciences and Humanities research workflows. Yet existing evaluation paradigms remain anchored in ta…
arXiv cs.CL
TIER_1English(EN)·Delip Rao, Chris Callison-Burch·
arXiv:2606.00093v1 Announce Type: new Abstract: Validating an LLM judge against human annotations usually means reporting several agreement statistics: accuracy, precision, recall, F1, Cohen's κ, and one or more rank correlations. A survey of 24 recent LLM-as-judge pape…
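Two of the agreement statistics listed in this abstract, accuracy and Cohen's kappa, can be computed as in the sketch below. It uses the standard definition of kappa, (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def accuracy(judge, human):
    """Fraction of items where the judge label equals the human label."""
    if len(judge) != len(human) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def cohens_kappa(judge, human):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(human)
    p_o = accuracy(judge, human)
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    labels = set(judge) | set(human)
    # Expected agreement if both raters labeled independently
    # according to their own marginal label frequencies.
    p_e = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is 1
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical binary verdicts on eight items.
judge_labels = [1, 1, 0, 0, 1, 0, 1, 1]
human_labels = [1, 0, 0, 0, 1, 1, 1, 1]
acc = accuracy(judge_labels, human_labels)
kappa = cohens_kappa(judge_labels, human_labels)
```

On these toy labels the raw agreement is noticeably higher than kappa, which illustrates why the two statistics are usually reported together rather than interchangeably.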
arXiv cs.AI
TIER_1English(EN)·Shei Pern Chua, Zhen Leng Thai, Kai Jun Teh, Xiao Li, Qibing Ren, Xiaolin Hu·
arXiv:2509.05367v5 Announce Type: replace-cross Abstract: Large Language Model safety alignment predominantly operates on a binary assumption that requests are either safe or unsafe. This classification proves insufficient when models encounter ethical dilemmas, where the capacit…
arXiv:2606.01682v1 Announce Type: cross Abstract: Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths. PRM guided search avoids…
arXiv:2606.01637v1 Announce Type: cross Abstract: Large language models are increasingly used in multi-agent systems, where they see and respond to other agents' answers. A key risk is conformity: a model may abandon its own answer simply because others agree on a different one. …
arXiv:2606.00642v1 Announce Type: new Abstract: Reasoning traces have become a valuable form of learning signals for improving and transferring the capabilities of large language models. In particular, detailed traces can help distill reasoning behavior from stronger teacher mode…
arXiv:2606.00251v1 Announce Type: new Abstract: The ability to recognize one's own limitations and decide whether to solve a problem or delegate is fundamental for reliable intelligent systems. Yet we show that modern large language models systematically lack this ability: across…
LLMs have evolved from basic chatbots to the backbone of the AI ecosystem, now widely used in healthcare, schools, and government services. The domain-wide adoption of LLMs necessitates continuous evaluation to ensure their safety and fairness. Common issues encountered after dep…
Inference-time scaling is enhanced through constrained optimization that allocates computational resources based on economic principles, improving performance in resource-constrained environments.
Large language models (LLMs) are being increasingly used to answer subjective, information-seeking questions, where users are sensitive to how responses are communicated, not just whether the answers are correct. Existing LLM evaluations for subjective cultural queries largely fo…
Existing research largely reduces cultural intelligence in LLMs to a knowledge-level problem, overlooking whether models can effectively utilize their acquired knowledge in realistic scenarios. To bridge this gap, we introduce CultureForest, a benchmark for Cultural Norm …
Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths. PRM guided search avoids this by scoring candidate continuations during ge…
arXiv cs.AI
TIER_1English(EN)·Chanwoo Park, Ziyang Chen, Asuman Ozdaglar, Kaiqing Zhang·
arXiv:2511.04393v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly deployed as "agents" for decision-making (DM) in interactive and dynamic environments. Yet, since they were not originally designed for DM, recent studies show that LLMs can struggle…
arXiv cs.AI
TIER_1English(EN)·Junhyuk Choi, Sohhyung Park, Chanhee Cho, Hyeonchu Park, Bugeun Kim·
arXiv:2602.00521v2 Announce Type: replace Abstract: While LLM-as-a-Judge is widely used in automated evaluation, existing validation practices primarily operate at the level of observed outputs, offering limited insight into whether LLM judges themselves function as stable and re…
arXiv:2605.31287v1 Announce Type: cross Abstract: Managers in manufacturing settings rely on digital interfaces to interpret operational data for decision-making, but growing data volume and complexity can make relevant insights difficult to identify efficiently. While dashboards…
arXiv cs.LG
TIER_1English(EN)·Ali Dadsetan, Frank Rudzicz·
arXiv:2510.01137v3 Announce Type: replace Abstract: Privacy is a central concern when fine-tuning large language models (LLMs) on sensitive data, and differentially private stochastic gradient descent (DP-SGD) -- which clips per-sample gradients and adds calibrated Gaussian noise…
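The DP-SGD mechanism this abstract summarizes (clip each per-sample gradient, then add calibrated Gaussian noise) can be sketched in plain Python as below. This is a simplified illustration under stated assumptions: gradients are plain lists of floats, the names `clip_gradient`, `dp_sgd_gradient`, `max_norm`, and `noise_multiplier` are chosen here for readability, and no privacy accounting is performed. It is not the method studied in the paper.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_sample_grads, max_norm, noise_multiplier, rng=None):
    """Return a noisy average of clipped per-sample gradients."""
    rng = rng or random.Random()
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        clipped = clip_gradient(grad, max_norm)
        for i, g in enumerate(clipped):
            total[i] += g
    # Noise standard deviation is calibrated to the clipping bound.
    sigma = noise_multiplier * max_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_sample_grads) for v in noisy]


# Hypothetical per-sample gradients for a two-parameter model.
grads = [[3.0, 4.0], [0.0, 0.5]]
no_noise = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=0.0)
private = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=1.1,
                          rng=random.Random(0))
```

Setting `noise_multiplier` to zero isolates the effect of clipping alone, which is a convenient way to sanity-check the clipping step before adding noise.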
arXiv:2604.10495v2 Announce Type: replace Abstract: As Large Language Models (LLMs) are increasingly deployed in real-world applications, reliable uncertainty quantification (UQ) becomes critical for safe and effective use. Most existing UQ approaches for language models aim to p…
arXiv cs.AI
TIER_1English(EN)·Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, Oana-Maria Camburu·
arXiv:2602.10117v5 Announce Type: replace-cross Abstract: Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these unverbalized biases. Monitoring models via their stated reasoning is the…
arXiv:2510.03415v3 Announce Type: replace-cross Abstract: Recent work asks whether large language models (LLMs) condition their reasoning on explicit rules rather than statistical regularities from pretraining. Program execution provides a canonical instance: formal semantics def…
arXiv:2503.05846v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have achieved impressive progress across a wide range of tasks, yet their heavy reliance on English-centric training data leads to significant performance degradation in non-English languages. …
arXiv cs.AI
TIER_1English(EN)·Caroline Wang, Daniel Kasenberg, Kim Stachenfeld, Pablo Samuel Castro·
arXiv:2602.10324v2 Announce Type: replace Abstract: As Large Language Models (LLMs) are increasingly deployed in social and strategic scenarios, it becomes critical to understand where and why their behavior diverges from that of humans. While behavioral game theory (BGT) provide…
Chunk-Level Guided Generation uses a large language model as a process scorer to select fixed-length candidate chunks during small model generation, improving reasoning accuracy over traditional methods like majority voting and PRM guided search.
Managers in manufacturing settings rely on digital interfaces to interpret operational data for decision-making, but growing data volume and complexity can make relevant insights difficult to identify efficiently. While dashboards remain dominant in industrial contexts, Large Lan…
arXiv cs.AI
TIER_1English(EN)·Rebecca M. M. Hicke, Kiran Tomlinson·
arXiv:2605.29018v1 Announce Type: new Abstract: Although a growing body of research has begun to describe user-LLM interactions, the picture it paints is largely static; little is known about how individual users change their behavior over time. To address this gap, we analyze t…
arXiv:2510.00777v2 Announce Type: replace Abstract: LLM-generated drafts often contain subtle factual or logical errors, yet prior work shows that models struggle to reliably integrate multi-turn feedback aimed at fixing them. We propose in-place feedback, an interaction paradigm…
arXiv:2510.14365v4 Announce Type: replace Abstract: This work investigates the resilience of contemporary large language models (LLMs) against frequent character-level perturbations. We examine three types of character-level perturbations including introducing numerous typos with…
arXiv:2508.19202v3 Announce Type: replace Abstract: Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for ass…
arXiv cs.CL
TIER_1English(EN)·Wajdi Zaghouani, Kholoud K. Aldous, Yicheng Gao·
arXiv:2605.29667v1 Announce Type: new Abstract: When Large Language Models (LLMs) are deployed in Chinese-language settings, a troubling pattern emerges: safety systems that work well in English break down. These systems struggle to cross linguistic and cultural boundaries, leav…
arXiv cs.CL
TIER_1English(EN)·Yeyong Yu, Wenya Hu, Xing Wu, Quan Qian·
arXiv:2605.29555v1 Announce Type: new Abstract: As candidate generation and high-throughput experimentation advance, the primary bottleneck in materials discovery is shifting from property prediction to making reliable evaluations among massive candidate sets. We propose a Knowle…
arXiv cs.CL
TIER_1English(EN)·Xinming Yang, Jun Li·
arXiv:2605.29007v1 Announce Type: new Abstract: Personalized tutoring, teacher training, and education research need access to targeted synthetic misconceptions, but privacy and IRB constraints make labelled corpora of real student errors scarce. LLMs could in principle ge…
arXiv cs.CL
TIER_1English(EN)·Mohamed Abdelwahab, Michelle Yu Collins, Sihan Chen, Yi Cheng Zhao, Zafarullah Mahmood, Jiading Zhu, Soliman Ali, Jonathan Rose·
arXiv:2605.28823v1 Announce Type: new Abstract: As the influence of LLMs expands, it is imperative to gain insight into their decisions. One way to do that is to develop probes that detect the presence or absence of a broad set of concepts within the embeddings computed in an LLM…
arXiv:2410.10398v3 Announce Type: replace-cross Abstract: As large language models (LLMs) increasingly engage in complex social interactions, ensuring that their behaviors align with human ethical principles and intentions, known as value alignment, has become a critical scientif…
arXiv:2601.21909v2 Announce Type: replace Abstract: Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, a closer examination reveals a fundamental g…
arXiv cs.AI
TIER_1English(EN)·Ruoxi Su, Yuhan Liu, Jingyu Hu·
arXiv:2605.29458v1 Announce Type: cross Abstract: Accurately simulating the decisions of a specific individual remains challenging for large language models (LLMs), partly because persona information is often provided as static descriptions that miss the values, experiences, and …
arXiv:2605.30036v1 Announce Type: new Abstract: Large Language Models (LLMs) demonstrate a remarkable capacity to adopt different personas and roles; however, it remains unclear whether they can manifest behavior that adheres to a coherent, human-like value structure. In this wor…
arXiv cs.AI
TIER_1English(EN)·Yunjin Qi, Zhaojun Jiang, Xuan Wu, Hanxi Pan, Yixuan Wang, Yanfang Liu, Xiang Ji, Churu Yu, Chunyuan Zheng, Yingze Chen, Jie He, Liuqing Chen, Zaifeng Gao·
arXiv:2605.29685v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly applied in social contexts such as emotional companionship and customer service, measuring their social intelligence has become critical to the quality and safety of human-AI interact…
Large Language Models (LLMs) demonstrate a remarkable capacity to adopt different personas and roles; however, it remains unclear whether they can manifest behavior that adheres to a coherent, human-like value structure. In this work, we draw on established psychological value th…
When Large Language Models (LLMs) are deployed in Chinese-language settings, a troubling pattern emerges: safety systems that work well in English break down. These systems struggle to cross linguistic and cultural boundaries, leaving models exposed to adversarial prompts that e…
arXiv cs.LG
TIER_1English(EN)·Chacha Chen, Matthew Jörke, Adam Goliński, Masha Fedzechkina, Guillermo Sapiro, Sinead Williamson, Nicholas Foti·
arXiv:2605.06915v2 Announce Type: replace Abstract: Modern AI systems are being deployed in complex domains such as medicine, science, and law, where it is important that they not only produce correct answers, but also represent and update uncertain beliefs about the world as new…
arXiv:2507.06999v2 Announce Type: replace-cross Abstract: Reasoning is essential for large language models (LLMs), especially in complex tasks such as mathematical problem solving. However, multimodal reasoning still faces challenges in modality alignment and training scalability…
arXiv:2605.11458v2 Announce Type: replace Abstract: On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student's own rollouts while conditioning on the reference solution. A design choice shared by nearly all such m…
arXiv:2605.28398v1 Announce Type: new Abstract: Hybrid-reasoning large language models (LLMs) expose explicit controls over reasoning effort, allowing users or systems to trade off answer quality against inference cost. However, existing methods for adaptive thinking-mode selecti…
arXiv:2605.28778v1 Announce Type: new Abstract: LLMs' linguistically expressed confidence should faithfully reflect their intrinsic uncertainty. While recent work shows LLMs struggle to use epistemic markers (e.g., "it is likely...") in a human-aligned fashion, it remains unclear…
arXiv cs.AI
TIER_1English(EN)·Camilo Chacón Sartori, José H. García·
arXiv:2605.27789v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) systems are often compared by asking a large language model (LLM) judge which answer is better. For multi-hop RAG, this has become a measurement problem as much as a modeling problem: the same sc…
arXiv:2509.21128v2 Announce Type: replace Abstract: Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable rewards (RLVR) and supervised fine-tuning (SFT) on reasoning traces to improve their reasoning abilities. However, how these metho…
arXiv:2605.28388v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Reward (RLVR) is empirically shown to notably enhance the reasoning performance of large language models (LLMs), particularly in mathematics and programming. However, the mechanistic role of Sa…
LLMs' linguistically expressed confidence should faithfully reflect their intrinsic uncertainty. While recent work shows LLMs struggle to use epistemic markers (e.g., "it is likely...") in a human-aligned fashion, it remains unclear whether models can apply their own linguistic c…
arXiv cs.AI
TIER_1English(EN)·Shivam Rawat, Lucie Flek, Florian Mai, Nicholas Kluge Corrêa·
arXiv:2604.21454v2 Announce Type: replace-cross Abstract: Reasoning in large language models is often discussed as a single capability, but some of its gains may stem from simpler underlying operations. We examine two such primitives, recall and state-tracking, through five contr…
arXiv cs.AI
TIER_1English(EN)·Wenda Xu, Sweta Agrawal, Vilém Zouhar, Markus Freitag, Daniel Deutsch·
arXiv:2509.26600v2 Announce Type: replace-cross Abstract: As LLMs rapidly saturate existing benchmarks, automated benchmark creation using LLMs (LLM-as-a-benchmark) -- where a model generates test inputs (LLM-as-a-testset) and evaluates outputs (LLM-as-an-evaluator) -- has gained…
arXiv:2508.18444v2 Announce Type: replace-cross Abstract: With the improving semantic understanding capability of Large Language Models (LLMs), they exhibit a greater awareness and alignment with human values, but this comes at the cost of transparency. Although promising results…
arXiv cs.AI
TIER_1English(EN)·Shashwat Singh, Tal Linzen, Shauli Ravfogel·
arXiv:2605.26242v1 Announce Type: new Abstract: Can large language models detect and report their own internal states? A number of studies have argued that the answer to this question is yes. We argue, based on lessons from human metacognition research, that this conclusion may b…
arXiv cs.AI
TIER_1English(EN)·Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dongsheng Li, Yuqing Yang·
arXiv:2603.15500v2 Announce Type: replace Abstract: LLMs often exhibit Aha moments such as self-correction after tokens like "Wait," yet the underlying mechanism remains unclear. Standard LLMs collapse mainly through silent divergence, where trajectories drift from the correct an…
arXiv cs.AI
TIER_1English(EN)·Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin·
arXiv:2605.27288v1 Announce Type: cross Abstract: Large language models (LLMs) are known to abandon their initial stance to conform to user pushback. While prior research largely attributes this behavior to sycophancy learned during reinforcement learning from human feedback, we …
arXiv:2605.26322v1 Announce Type: new Abstract: Theory of Mind (ToM), the ability to infer others' knowledge, intentions, and emotions, is commonly evaluated in large language models (LLMs) using end-point question answering, where performance is judged solely by the final answer…
arXiv cs.AI
TIER_1English(EN)·Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin·
arXiv:2603.11394v3 Announce Type: replace-cross Abstract: Large language models (LLMs) excel on static benchmarks, but their performance across multi-turn conversations, which better reflect real-world usage, remains understudied. Addressing this gap is critical in high-stakes se…
HRBench presents a unified evaluation framework for hybrid-reasoning LLMs that systematically compares thinking-mode switching strategies across different training regimes and model scales.
Empirical analysis reveals limited alignment between LLM-generated reviews and human reviews, with varying performance across different prompts and models, and demonstrates that authors can strategically improve paper scores through iterative revision based on LLM feedback.
Large language models (LLMs) are known to abandon their initial stance to conform to user pushback. While prior research largely attributes this behavior to sycophancy learned during reinforcement learning from human feedback, we hypothesize that conformity is also driven by a mo…
arXiv cs.CL
TIER_1English(EN)·Nura Aljaafari, Marco Valentino, André Freitas·
arXiv:2605.25520v1 Announce Type: new Abstract: Predicting a label correctly does not necessarily require representing the operation that produces it. Transformer representations are known to carry label-level information, but whether they encode semantic operations producing tho…
arXiv:2605.23926v1 Announce Type: new Abstract: Reasoning-capable large language models solve hard problems by emitting long chains of thought, paying heavily in latency, GPU time, and energy. Casual inspection of their traces reveals extensive reformulation, verification, and ci…
arXiv:2605.23965v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve strong performance on logical reasoning benchmarks, yet their reliability remains uncertain. Existing evaluations rely on static benchmarks, which fail to assess robustness under logically equiva…
arXiv:2605.24661v1 Announce Type: new Abstract: LLMs have achieved remarkable success in complex reasoning tasks, yet current evaluation approaches predominantly rely on final-answer correctness, offering limited insight into the underlying reasoning processes that produce those …
arXiv:2602.21198v3 Announce Type: replace-cross Abstract: Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into exper…
arXiv:2605.24432v1 Announce Type: new Abstract: Large Language Model (LLM) interactions are typically underspecified, with users clarifying all necessary details across multiple conversational turns. Yet recent work shows that LLMs perform far worse in this multi-turn setting tha…
arXiv cs.CL
TIER_1English(EN)·Jinyan Su, Claire Cardie·
arXiv:2605.25284v1 Announce Type: new Abstract: User queries are often underspecified and may admit multiple valid interpretations. Rather than silently making assumptions about the user's intent, a helpful assistant should surface such ambiguity by asking a clarifying question. …
arXiv cs.LG
TIER_1English(EN)·Dennis Frauen, Marie Brockschmidt, Konstantin Hess, Haorui Ma, Yuchen Ma, Abdurahman Maarouf, Maresa Schröder, Jonas Schweisthal, Yuxin Wang, Athiya Deviyani, Sonali Parbhoo, Rahul G. Krishnan, Stefan Feuerriegel·
arXiv:2605.25998v1 Announce Type: new Abstract: Large language model (LLM) development is currently driven by large-scale empirical iteration over data mixtures, reward models, routing strategies, and evaluation pipelines. Here, we argue that many central questions in LLM develop…
arXiv cs.LG
TIER_1English(EN)·Jackie Baek, Yunhan Chen, Ziyu Chi, Will Ma·
arXiv:2602.06357v2 Announce Type: replace Abstract: LLMs can generate a wealth of data, ranging from simulated personas imitating human valuations and preferences, to demand forecasts based on world knowledge. But how well do such LLM-generated distributions support downstream de…
Can large language models detect and report their own internal states? A number of studies have argued that the answer to this question is yes. We argue, based on lessons from human metacognition research, that this conclusion may be premature: to be convinced of this conclusion …
Large language model (LLM) development is currently driven by large-scale empirical iteration over data mixtures, reward models, routing strategies, and evaluation pipelines. Here, we argue that many central questions in LLM development and evaluation are inherently causal: What …
Predicting a label correctly does not necessarily require representing the operation that produces it. Transformer representations are known to carry label-level information, but whether they encode semantic operations producing those labels is unclear. We investigate this in Nat…
arXiv cs.AI
TIER_1English(EN)·Dongxin Guo, Jikun Wu, Siu Ming Yiu·
arXiv:2605.23039v1 Announce Type: cross Abstract: How do learners acquire knowledge of what is unacceptable without negative evidence? Construction Grammar proposes statistical preemption: exposure to a conventional form (e.g., "donated the books to the library") preempts structu…
arXiv:2605.23147v1 Announce Type: cross Abstract: Role prompts of the form As X, do Y admit a clean linear decomposition at one specific site in the residual stream: the prompt-to-answer transition -- the last prompt token together with the first two generated tokens -- in an ear…
arXiv cs.LG
TIER_1English(EN)·Tim Tomov, Dominik Fuchsgruber, Stephan Günnemann·
arXiv:2601.21500v2 Announce Type: replace Abstract: In many applications of LLMs, natural language responses often have an underlying structure such as representing discrete labels, numerical values, or graphs. Yet, existing decoding and uncertainty estimation methods operate onl…
arXiv:2605.23071v1 Announce Type: new Abstract: Large language models (LLMs) increasingly rely on long-context processing, but expanding context windows introduces substantial computational and financial costs. Existing context reduction approaches, including retrieval and memory…
arXiv:2605.20087v2 Announce Type: replace-cross Abstract: Conversational AI has now reached billions of users, yet existing datasets capture only what people say, not what they think. We introduce ThoughtTrace, the first large-scale dataset that pairs real-world multi-turn human-…
Large language models may not genuinely detect their internal states, as their apparent introspective abilities could reflect surface-level pattern matching rather than true metacognitive monitoring.
Recent RL methods have substantially improved the reasoning abilities of LLMs. Existing reward designs mainly follow two paradigms: (1) Reinforcement learning with verifiable rewards (RLVR) derives outcome signals from executable checks or ground-truth answers, but provides limit…
arXiv:2605.20382v1 Announce Type: cross Abstract: Language models are trained to follow instructions, but they are also powerful pattern completers. What happens when these two objectives conflict? We construct conversations in which a user instruction to behave in a target way T…
arXiv:2605.22205v1 Announce Type: cross Abstract: Large language models increasingly require specialization across diverse domains, yet existing approaches struggle to balance multi-domain capacities with strict memory and inference constraints. In this work, we introduce SkillWe…
arXiv:2601.05106v4 Announce Type: replace-cross Abstract: Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitive…
arXiv:2605.22714v1 Announce Type: cross Abstract: Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation histor…
arXiv cs.AI
TIER_1English(EN)·Andrea Sassella, Andrea Chizzola, Tommaso Bianchi, Luca Alessandrelli, Mark James Carman·
arXiv:2605.07731v2 Announce Type: replace-cross Abstract: This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is investigated across a w…
arXiv cs.AI
TIER_1English(EN)·Sangwoo Park, Woongyeong Yeo, Seanie Lee, Yumin Choi, Hyomin Lee, Kangsan Kim, Jinheon Baek, Seong Joon Oh, Sung Ju Hwang·
arXiv:2605.20258v1 Announce Type: cross Abstract: Contextual Integrity (CI) defines privacy not merely as keeping information hidden, but as governing information flows according to the norms of a given context. As large language models are increasingly deployed as personal agent…
Role prompts of the form As X, do Y admit a clean linear decomposition at one specific site in the residual stream: the prompt-to-answer transition -- the last prompt token together with the first two generated tokens -- in an early/mid layer band. There, persona and task contrib…
Large language models (LLMs) increasingly rely on long-context processing, but expanding context windows introduces substantial computational and financial costs. Existing context reduction approaches, including retrieval and memory compression methods, are typically evaluated us…
How do learners acquire knowledge of what is unacceptable without negative evidence? Construction Grammar proposes statistical preemption: exposure to a conventional form (e.g., "donated the books to the library") preempts structurally possible but unattested alternatives ("*dona…
Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call t…
We read twelve well-known LLM agent benchmark papers and recorded, dimension by dimension, what each paper actually says about how its evaluation was run. The motivation came from a familiar frustration: two papers will report results on the same benchmark with the same model nam…
While Large Language Models (LLMs) demonstrate remarkable performance on zero-shot annotation tasks, they often struggle with the specialized conventions of gold-standard benchmarks. We propose the systematic reuse and refinement of annotation guidelines as an alignment mechanism…
Language models are trained to follow instructions, but they are also powerful pattern completers. What happens when these two objectives conflict? We construct conversations in which a user instruction to behave in a target way T (e.g., always output a specific token, answer in …
Conversational AI has now reached billions of users, yet existing datasets capture only what people say, not what they think. We introduce ThoughtTrace, the first large-scale dataset that pairs real-world multi-turn human--AI conversations with users' self-reported thoughts: thei…
Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals…
Proxy metrics based on token-level statistics from expert-written solutions provide more reliable model performance forecasting than traditional loss-based methods across multiple development stages.
LLMs have shown impressive success in program synthesis, discovering programs that surpass prior solutions. However, these approaches rely on simple numeric scores to signal program quality, such as the value of the solution or the number of passed tests. Because a score offers n…
As large language models (LLMs) are deployed in consequential settings such as medical question answering and legal reasoning, the ability to estimate when their outputs are likely to be correct is essential for safe and reliable use, requiring well-calibrated uncertainty. Standa…
Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and …
Tool use extends large language models beyond parametric knowledge, but reliable execution requires balancing appropriate reasoning depth with strict structural validity. We approach this problem from a case-based perspective to present CAST, a case-driven framework that treats h…
Premature closure, or committing to a conclusion before sufficient information is available, is a recognized contributor to diagnostic error but remains underexamined in large language models (LLMs). We define LLM premature closure as inappropriate commitment under uncertainty: p…
We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing ea…
Structured LLM workflows, where specialized LLM sub-agents execute according to a predefined graph, have become a powerful abstraction for solving complex tasks. Optimizing such workflows, i.e., selecting configurations for each sub-agent to balance accuracy and latency, is chall…
As Large Language Models (LLMs) are transforming software development, the functional quality of generated code has become a central focus, leaving readability, one of the critical non-functional attributes, understudied. Given that LLM-generated code still needs human review before …
Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context lea…
Prediction sets provide a theoretically grounded framework for quantifying uncertainty in machine learning models. Adapting them to structured generation tasks, in particular, large language model (LLM) based code generation, remains a challenging problem. An existing attempt pro…
Large Language Models (LLMs) achieve strong performance on standard knowledge evaluation benchmarks, yet recent work shows that their knowledge capabilities remain brittle under question variants that test the same knowledge in different forms. Robustness augmentation of existing…
The performance of Large Language Models (LLMs) on downstream tasks is fundamentally constrained by the capabilities acquired during pre-training. However, traditional benchmarks like MMLU often fail to reflect a base model's plasticity in complex open-ended scenarios, leading to…
Transformer-based large language models face severe scalability challenges in long-context generation due to the computational and memory costs of full-context attention. Under practical computation and memory constraints, many inference-efficient long-context methods improve eff…
Multi-domain fine-tuning of large language models requires improving performance on target domains while preserving performance on constrained domains, such as general knowledge, instruction following, or safety evaluations. Existing data mixing strategies rely on fixed heuristic…
We demonstrate LLARS (LLM Assisted Research System), an open-source platform that bridges the gap between domain experts and developers for building LLM-based systems. It integrates three tightly connected modules into an end-to-end pipeline: Collaborative Prompt Engineering for …
Large language models (LLMs) have become capable mathematical problem-solvers, often producing correct proofs for challenging problems. However, correctness alone is not sufficient: mathematical proofs should also be clear, concise, insightful, and transferable to other problems.…
Large Language Model (LLM) personalization aims to align model behaviors with individual user preferences. Existing methods often focus on isolated user histories, neglecting the essential role of inter-user differences. We propose C-BPO, a framework that personalizes LLMs via pr…
Large Language Models are increasingly used as zero-shot classifiers in complex reasoning tasks. However, standard constrained decoding suffers from a phenomenon we define as Renormalization Bias. When a model is restricted to a small set of target labels, the standard softmax op…
Multi-judge evaluation is increasingly used to assess LLMs and reward models, and the prevailing heuristic is to curate: keep the most accurate judges and discard weaker ones. We show that this heuristic can reverse when the target is not point accuracy, but calibrated probabilis…
Ensuring safe, policy-compliant outputs from large language models requires real-time content moderation that can scale across multiple safety dimensions. However, state-of-the-art guardrail models rely on autoregressive decoders with 7B--27B parameters, reformulating what is fun…
Large Language Models (LLMs) rely on safety alignment to obey safe requests while refusing harmful ones. However, traditional refusal mechanisms often lead to "rigid rejection," where a general template (e.g., "I cannot fulfill this request") indiscriminately triggers refusals an…
Large Language Models (LLMs) are increasingly used in settings where reliable self-assessment is critical. Assessing model reliability has evolved from using probabilistic correctness estimates to, more recently, eliciting verbalized confidence. Confidence, however, has been show…
Balancing exploration and exploitation is a core challenge in sequential decision-making and black-box optimization. We introduce POETS (Policy Ensembles for Thompson Sampling), a novel framework that bridges uncertainty quantification …
This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is investigated across a wide variety of representative benchmarks, and is compared …
Reinforcement learning, including reinforcement learning with verifiable rewards (RLVR), has emerged as a powerful approach for LLM post-training. Central to these approaches is the design of the importance sampling (IS) ratio used in off-policy policy-gradient estimation. Existi…
Multiple-choice reasoning benchmarks face dual challenges: rapid saturation from advancing models and data contamination that undermines static evaluations. Ad-hoc hardening methods (paraphrasing, perturbation) attempt to increase difficulty but sacrifice logical validity for sur…
arXiv cs.AI
TIER_1English(EN)·Nguyen Viet Tuan Kiet, Bui Dinh Pham, Dao Van Tung, Tran Cong Dao, Huynh Thi Thanh Binh·
arXiv:2605.06123v1 Announce Type: new Abstract: Large language models (LLMs) have recently advanced automatic heuristic design (AHD) for combinatorial optimization (CO), where candidate heuristics are iteratively proposed, evaluated, and refined. Most existing approaches search o…
arXiv cs.LG
TIER_1English(EN)·Zixuan Chen, Hao Lin, Zizhe Chen, Yizhou Tian, Garry Yang, Depeng Wang, Ya Guo, Huijia Zhu, James Cheng·
arXiv:2605.05957v1 Announce Type: new Abstract: LLMs reliably correct false claims when presented in isolation, yet when the same claims are embedded in task-oriented requests, they often comply rather than correct. We term this failure mode correction suppression and cons…
arXiv cs.LG
TIER_1English(EN)·Xinrui Chen, Liu Yang, Ou Wu·
arXiv:2605.06166v1 Announce Type: new Abstract: In Large Language Model (LLM) fine-tuning, parameter and data selection are common strategies for reducing fine-tuning cost, yet they are typically driven by separate scoring mechanisms. When a parameter mask and data subset jointly…
arXiv:2605.06350v1 Announce Type: new Abstract: Model cascades, in which a cheap LLM defers to an expensive one on low-confidence queries, are widely used to navigate the cost-quality tradeoff at deployment. Existing approaches largely treat the deferral threshold as an empirical…
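The cascade pattern described in this abstract (a cheap model answers, and low-confidence queries are deferred to an expensive model) can be illustrated with a minimal sketch. This is not the paper's method: the `Model` interface, the `cascade` function, the toy models, and the fixed `threshold` of 0.8 are all hypothetical names and values chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical model interface: takes a query, returns (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


@dataclass
class CascadeResult:
    answer: str
    used_expensive: bool


def cascade(query: str, cheap: Model, expensive: Model,
            threshold: float = 0.8) -> CascadeResult:
    """Answer with the cheap model unless its confidence is below the threshold."""
    answer, confidence = cheap(query)
    if confidence >= threshold:
        return CascadeResult(answer, used_expensive=False)
    # Low confidence: defer the query to the expensive model.
    answer, _ = expensive(query)
    return CascadeResult(answer, used_expensive=True)


# Toy stand-ins for real models (illustrative assumption, not real LLM calls).
def toy_cheap(query: str) -> Tuple[str, float]:
    # Pretend that short queries are easy and long ones are hard.
    return ("cheap answer", 0.95 if len(query) < 20 else 0.40)


def toy_expensive(query: str) -> Tuple[str, float]:
    return ("expensive answer", 0.99)
```

Raising `threshold` sends more queries to the expensive model (higher cost, presumably higher quality), while lowering it does the opposite; how that value should be chosen is the question the abstract raises and is not answered by this sketch.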
arXiv cs.LG
TIER_1English(EN)·Sushant Gautam, Finn Schwall, Annika Willoch Olstad, Fernando Vallecillos Ruiz, Birk Torpmann-Hagen, Sunniva Maria Stordal Bjørklund, Leon Moonen, Klas Pettersen, Michael A. Riegler·
arXiv:2605.06652v1 Announce Type: new Abstract: Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and …
arXiv cs.LG
TIER_1English(EN)·Andy Zeyi Liu, Elliot Paquette, John Sous·
arXiv:2605.05683v1 Announce Type: cross Abstract: Training loss and throughput can hide distinct internal representation in language-model training. To examine these hidden mechanics, we use spectral measurements as practical and operational diagnostics. Using a controlled family…
arXiv:2605.05973v1 Announce Type: cross Abstract: Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy proc…
arXiv cs.LG
TIER_1English(EN)·Jonas Bayer, Stefan Zetzsche, Olivier Bouissou, Remi Delmas, Michael Tautschnig, Soonho Kong·
arXiv:2605.06184v1 Announce Type: cross Abstract: We introduce an evaluation framework of 500 C verification tasks across five property types (memory safety, overflow, termination, reachability, data races) built on SV-COMP 2025, and evaluate 14 models across six families. We fin…
arXiv cs.LG
TIER_1English(EN)·Florian A. D. Burnat, Brittany I. Davidson·
arXiv:2605.06327v1 Announce Type: cross Abstract: Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context…
arXiv:2605.06334v1 Announce Type: cross Abstract: Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is chall…
arXiv cs.LG
TIER_1English(EN)·Zichuan Liu, Jinyu Wang, Lei Song, Jiang Bian·
arXiv:2508.06412v3 Announce Type: replace Abstract: Recent advancements in LLM post-training, particularly through reinforcement learning and preference optimization, are key to boosting their reasoning capabilities. However, these methods often suffer from low sample efficiency …
arXiv cs.LG
TIER_1English(EN)·Wei Huang, Anda Cheng, Yinggui Wang, Lei Wang, Tao Wei·
arXiv:2601.20375v2 Announce Type: replace Abstract: Large Language Models (LLMs) can be fine-tuned on domain-specific data to enhance their performance in specialized fields. However, such data often contains numerous low-quality samples, necessitating effective data processing (…
arXiv cs.LG
TIER_1English(EN)·Ekaterina Fadeeva, Maiya Goloburda, Aleksandr Rubashevskii, Roman Vashurin, Artem Shelmanov, Preslav Nakov, Mrinmaya Sachan, Maxim Panov·
arXiv:2512.09538v2 Announce Type: replace-cross Abstract: Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring …
arXiv:2605.05485v1 Announce Type: new Abstract: LLMs can solve program synthesis tasks but remain inefficient and unreliable on hard instances requiring large combinatorial search. Given a small set of reasoning traces, we use coding agents to compile them into reusable symbolic …
arXiv cs.CL
TIER_1English(EN)·Ruben Fernandez-Boullon, David N. Olivieri·
arXiv:2605.06480v1 Announce Type: cross Abstract: Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high…
arXiv cs.AI
TIER_1English(EN)·Amal Alnouri, Andreas Hinterreiter, Christina Humer, Furui Cheng, Marc Streit·
arXiv:2605.06054v1 Announce Type: new Abstract: Large language model (LLM) outputs arise from complex interactions among prompts, system instructions, model parameters, and architecture. We refer to specific configurations of these factors as generation conditions, each of which …
arXiv:2605.06455v1 Announce Type: new Abstract: Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored e…
arXiv:2605.05267v1 Announce Type: cross Abstract: Large language models (LLMs) frequently generate defective outputs in code generation tasks, ranging from logical bugs to security vulnerabilities. While these generation failures are often treated as model-level limitations, empi…
arXiv cs.AI
TIER_1English(EN)·Yujia Chen, Yang Ye, Xiao Chu, Yuchi Ma, Cuiyun Gao·
arXiv:2605.06111v1 Announce Type: cross Abstract: Reinforcement learning (RL) with verifiable rewards has proven effective at post-training LLMs for coding, yet deploying separate task-specific specialists incurs costs that scale with the number of tasks, motivating a unified mul…
arXiv:2605.06279v1 Announce Type: cross Abstract: Large language models (LLMs) are now largely involved in software development workflows, and the code they generate routinely includes third-party library (TPL) imports annotated with specific version identifiers. These version ch…
Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-base…
Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high-dimensional, unstructured datasets that are diffi…
Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are brittle and deployment-time LLM…
Model cascades, in which a cheap LLM defers to an expensive one on low-confidence queries, are widely used to navigate the cost-quality tradeoff at deployment. Existing approaches largely treat the deferral threshold as an empirical hyperparameter, with limited guidance on the ge…
Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is challenging, as they are typically written for humans i…
arXiv cs.AI
TIER_1English(EN)·Brittany I. Davidson·
Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context divergence as an observable within-item change in…
arXiv:2511.01202v3 Announce Type: replace-cross Abstract: Despite the unprecedented empirical triumphs of LLMs across diverse real-world applications, the prevailing research paradigm remains overwhelmingly heuristic and experimentally driven, inextricably tethered to astronomica…
arXiv:2605.03227v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in natural language understanding and reasoning. However, their ability to perform exact, deterministic computation remains unclear. In this work, we systematically …
arXiv cs.CL
TIER_1English(EN)·Sruly Rosenblat, Tim O'Reilly, Ilan Strauss·
arXiv:2505.00020v2 Announce Type: replace Abstract: Using a legally obtained dataset of 34 copyrighted O'Reilly Media books, we apply the DE-COP membership inference attack method to investigate whether OpenAI's large language models show recognition of copyrighted content. Our r…
arXiv cs.CL
TIER_1English(EN)·Ge Lei, Samuel J. Cooper·
arXiv:2605.04764v1 Announce Type: new Abstract: Large language models are increasingly used as surrogate models for low-data optimization, but their optimizer-facing prediction and its uncertainty remain poorly understood. We study the surrogate belief elicited from an LLM under …
arXiv:2602.10144v2 Announce Type: replace-cross Abstract: Minimizing the inference cost and latency of foundation models has become a crucial area of research. Optimization approaches include theoretically lossless methods and others without accuracy guarantees like quantization.…
arXiv cs.LG
TIER_1English(EN)·Luze Sun, Alina Oprea, Eric Wong·
arXiv:2602.00305v2 Announce Type: replace-cross Abstract: LLM-based vulnerability detectors are increasingly deployed in CI/CD security gating, yet their resilience to evasion under syntax- and compilation-preserving edits remains poorly understood. We evaluate five attack varian…
arXiv cs.LG
TIER_1English(EN)·Sumeet Ramesh Motwani, Chuan Du, Aleksander Petrov, Christopher Davis, Philip Torr, Antonio Papania-Davis, Weishi Yan·
arXiv:2604.16804v2 Announce Type: replace Abstract: Optimization problems are central to decision-making in manufacturing, logistics, scheduling, and other industrial settings. Translating complicated descriptions of these problems into solver-ready formulations requires speciali…
arXiv:2602.05890v2 Announce Type: replace Abstract: Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain (OOD) generalization, especially in LLM post-training. Recent distributional RL methods …
arXiv:2605.04572v1 Announce Type: cross Abstract: Safety alignment of Large Language Models (LLMs) is extremely fragile, as fine-tuning on a small number of benign samples can erase safety behaviors learned from millions of preference examples. Existing studies attempt to explain…
Large language models are increasingly used as surrogate models for low-data optimization, but their optimizer-facing prediction and its uncertainty remain poorly understood. We study the surrogate belief elicited from an LLM under sparse observations, showing that it depends str…
Safety alignment of Large Language Models (LLMs) is extremely fragile, as fine-tuning on a small number of benign samples can erase safety behaviors learned from millions of preference examples. Existing studies attempt to explain this phenomenon by comparing parameters and hidde…
arXiv cs.LG
TIER_1English(EN)·Haoyu Zhang, Mohammad Zandsalimy, Shanu Sushmita·
arXiv:2605.03441v1 Announce Type: cross Abstract: Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses primarily rely on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems -- using for…
arXiv:2605.03792v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly integrated into legal workflows. However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance …
arXiv:2605.03379v1 Announce Type: new Abstract: Repeated sampling is a standard way to spend test-time compute, but its benefit is controlled by the latent distribution of correctness across examples, not by one-call accuracy alone. We study the binary correctness layer of repeat…
arXiv cs.LG
TIER_1English(EN)·Shannon K. Gallagher, Swati Rallapalli, Tyler Brooks, Chuck Loughin, Michele Sezgin, Ronald Yurko·
arXiv:2605.02930v1 Announce Type: cross Abstract: Evolutionary methods have long been useful for analysis and explanation in genetics, biology, ecology, and related fields. In this work, we extend these methods to neural networks, specifically large language models (LLMs), to bet…
arXiv cs.CL
TIER_1English(EN)·Richard A. A. Jonker, Alexander Christiansen, Alexandros Maniatis, Rúben Garrido, Rogério Braunschweiger de Freitas Lima, Roman Jurowetzki, Sérgio Matos·
arXiv:2605.03618v1 Announce Type: new Abstract: This paper presents the joint participation of the BIT.UA and AAUBS groups in the ArchEHR-QA 2026 shared task, which focuses on clinical question answering and evidence grounding in a low-resource setting. Due to the absence of trai…
arXiv:2605.01847v1 Announce Type: new Abstract: Outcome-only evaluation under-specifies whether an evaluated agent profile preserves the commitments required to solve a multi-turn task coherently. NeuroState-Bench is a human-calibrated benchmark that operationalizes commitment in…
arXiv cs.AI
TIER_1English(EN)·Yifei Wang, Ruiyin Li, Peng Liang, Yangxiao Cai, Zengyang Li, Mojtaba Shahin, Arif Ali Khan, Qiong Feng·
arXiv:2605.01392v1 Announce Type: cross Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated significant potential across a wide range of software engineering tasks, including software design, an area traditionally regarded as highly dependent on human …
arXiv:2602.14012v2 Announce Type: replace-cross Abstract: The integration of LLMs into vulnerability detection (VD) has shifted the field toward more interpretable and context-aware analysis. While post-training techniques have shown promise in general coding tasks, their systema…
arXiv:2502.04419v3 Announce Type: replace Abstract: Generating synthetic datasets via large language models (LLMs) has emerged as a promising approach to improve LLM performance. However, LLMs inherently reflect biases in their training data, leading to a critical challenge: when…
arXiv:2603.19294v2 Announce Type: replace Abstract: While post-training has successfully improved large language models (LLMs) across a variety of domains, these gains heavily rely on human-labeled data or external verifiers. Existing data has already been exploited, and new high…
Large language models (LLMs) are increasingly integrated into legal workflows. However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance and risks inherent in day-to-day judicial proces…
This paper presents the joint participation of the BIT.UA and AAUBS groups in the ArchEHR-QA 2026 shared task, which focuses on clinical question answering and evidence grounding in a low-resource setting. Due to the absence of training data and the strict data privacy constraint…
Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses primarily rely on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems -- using formalisms such as set theory, formal logic, and quan…
Repeated sampling is a standard way to spend test-time compute, but its benefit is controlled by the latent distribution of correctness across examples, not by one-call accuracy alone. We study the binary correctness layer of repeated LLM inference under conditional-i.i.d. calls.…
arXiv cs.CL
TIER_1English(EN)·Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Reduan Achtibat, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, Alexander Binder, Sebastian Lapuschkin·
arXiv:2506.13727v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are widely deployed in real-world applications, yet their internal mechanisms remain difficult to interpret and control, limiting our ability to diagnose and correct undesirable behaviors. Mech…
arXiv:2603.23985v2 Announce Type: replace Abstract: Large language models (LLMs) have demonstrated remarkable capabilities, but their massive scale poses significant challenges for practical deployment. Structured pruning offers a promising solution by removing entire dimensions …
arXiv:2602.11083v2 Announce Type: replace Abstract: Remote change detection in LLMs is a difficult problem. Existing methods are either too expensive for deployment at scale, or require initial white-box access to model weights or grey-box access to log probabilities. We aim to a…
arXiv cs.LG
TIER_1English(EN)·Nickil Maveli, Antonio Vergari, Shay B. Cohen·
arXiv:2601.13398v2 Announce Type: replace Abstract: LLMs demonstrate strong performance on code benchmarks, yet consistent reasoning across forward and backward execution remains elusive. We present RoundTripCodeEval (RTCE), a benchmark of four code execution reasoning tasks that…
arXiv:2601.06116v3 Announce Type: replace-cross Abstract: Generative AI models reproduce the human biases in their training data and further amplify them through mechanisms such as mode collapse. The loss of diversity produces homogenization, which not only harms the minoritized …
arXiv:2604.17010v2 Announce Type: replace Abstract: We introduce a self-play framework for semantic equivalence in Haskell, utilizing formal verification to guide adversarial training between a generator and an evaluator. The framework leverages Liquid Haskell proofs for validati…
arXiv:2603.01865v3 Announce Type: replace Abstract: LLM-as-judge evaluation has become standard practice for open-ended model assessment; however, judges exhibit systematic biases that cannot be averaged out by increasing the number of scenarios or generations. These biases are o…
arXiv cs.CL
TIER_1English(EN)·Pawel Kaplanski (Kaplanski AI Lab)·
arXiv:2605.02236v1 Announce Type: cross Abstract: Recursive language-model loops often settle into recognizable attractor-like patterns. The practical question is how much injected text is needed to move a settled loop somewhere else, and whether that move lasts. We study this in…
arXiv cs.CL
TIER_1English(EN)·Noga Peleg Pelc, Gal A. Kaminka, Yoav Goldberg·
arXiv:2605.01920v1 Announce Type: cross Abstract: Large language models are increasingly used within larger systems ("LLM agents"). These make a sequence of LLM calls, each call providing the LLM with a combination of instructions, observations, and interaction history. The desig…
arXiv cs.CL
TIER_1English(EN)·Sadia Asif, Mohammad Mohammadi Amiri·
arXiv:2605.01913v1 Announce Type: cross Abstract: Fine-tuning safety-aligned language models for downstream tasks often leads to substantial degradation of refusal behavior, making models vulnerable to adversarial misuse. While prior work has shown that safety-relevant features a…
arXiv cs.CL
TIER_1English(EN)·Benjamin Warner, Ratna Sagari Grandhi, Max Kieffer, Aymane Ouraq, Saurav Panigrahi, Geetu Ambwani, Kunal Bagga, Nikhil Khandekar, Arya Hariharan, Nishant Mishra, Manish Ram, Shamus Sim Zi Yang, Ahmed Essouaied, Adepoju Jeremiah Moyondafoluwa, Robert Schol·
arXiv:2605.01417v1 Announce Type: new Abstract: Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing suites have either saturated, heavil…
arXiv:2605.01350v1 Announce Type: new Abstract: Detecting machine-generated text is essential for transparency and accountability when deploying large language models (LLMs). Among detection approaches, watermarking is a statistically reliable method by design -- it embeds detect…
arXiv cs.CL
TIER_1English(EN)·Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Goa, Juming Xiong, Zhijun Yin, Bradley A. Malin·
arXiv:2605.01011v1 Announce Type: new Abstract: Medical large language model (LLM) evaluations rely on simplified, exam-style benchmarks that rarely reflect the ambiguity of real-world medical inquiries. We introduce the CLinical Evaluation of Ambiguity and Reliability (CLEAR) fr…
arXiv:2506.18315v2 Announce Type: replace-cross Abstract: LLMs excel at code generation, yet ensuring the functional correctness of their outputs remains a persistent challenge. While recent studies have applied Test-Driven Development (TDD) to refine code, these methods are ofte…
arXiv cs.AI
TIER_1English(EN)·Abdurrahman Javat, Allan Kazakov·
arXiv:2605.00519v2 Announce Type: cross Abstract: The operational landscape of local Large Language Model (LLM) inference has shifted from lightweight models to datacenter-class weights exceeding 70B parameters, creating profound systems challenges for consumer hardware. This pap…
arXiv cs.AI
TIER_1English(EN)·Fazle Rabbi, Lin Ling, Song Wang, Jinqiu Yang·
arXiv:2605.00382v2 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly deployed to generate code for human-centered applications where demographic fairness is critical. However, existing evaluations focus almost exclusively on functional correctness, leav…
arXiv cs.AI
TIER_1English(EN)·Qinyuan Wu, Soumi Das, Mahsa Amani, Arijit Nag, Seungeon Lee, Krishna P. Gummadi, Abhilasha Ravichander, Muhammad Bilal Zafar·
arXiv:2605.00737v1 Announce Type: new Abstract: Agentic AI architectures augment LLMs with external tools, unlocking strong capabilities. However, tool use is not always beneficial; some calls may be redundant or even harmful. Effective tool use, therefore, hinges on a core LLM d…
Recursive language-model loops often settle into recognizable attractor-like patterns. The practical question is how much injected text is needed to move a settled loop somewhere else, and whether that move lasts. We study this in 30-step recursive loops by separating the model f…
arXiv cs.CL
TIER_1Français(FR)·Ryan Lail, Luke Markham·
arXiv:2604.13717v2 Announce Type: replace Abstract: Using a language model to score or rank candidate responses has become a scalable alternative to human evaluation in reinforcement learning from human feedback (RLHF) pipelines, benchmarking, and application layer evaluations. H…
arXiv:2505.06698v4 Announce Type: replace Abstract: Evaluating Large Language Models (LLMs) has become increasingly important, with automatic evaluation benchmarks gaining prominence as alternatives to human evaluation. While existing research has focused on approximating model r…
arXiv:2605.00817v1 Announce Type: new Abstract: Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We study this question through…
arXiv cs.LG
TIER_1English(EN)·Pavlin G. Poličar, Andraž Pevcin, Blaž Zupan·
arXiv:2605.00800v1 Announce Type: new Abstract: Generating diverse, readable statistical charts from tabular data remains challenging for LLMs, as many failures become apparent after rendering and are not detectable from data or code alone. Existing chart datasets also rarely pro…
arXiv:2605.00419v1 Announce Type: new Abstract: Model ensembling is a well-established technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. Th…
Large language models are increasingly used within larger systems ("LLM agents"). These make a sequence of LLM calls, each call providing the LLM with a combination of instructions, observations, and interaction history. The design of the encoded information and its structure pla…
Fine-tuning safety-aligned language models for downstream tasks often leads to substantial degradation of refusal behavior, making models vulnerable to adversarial misuse. While prior work has shown that safety-relevant features are encoded in structured representations within th…
Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We study this question through a controlled diagnostic benchmark for procedura…
Generating diverse, readable statistical charts from tabular data remains challenging for LLMs, as many failures become apparent after rendering and are not detectable from data or code alone. Existing chart datasets also rarely provide fully aligned artifacts, such as executable…
Agentic AI architectures augment LLMs with external tools, unlocking strong capabilities. However, tool use is not always beneficial; some calls may be redundant or even harmful. Effective tool use, therefore, hinges on a core LLM decision: whether to call or not call a tool, whe…
The operational landscape of local Large Language Model (LLM) inference has shifted from lightweight models to datacenter-class weights exceeding 70B parameters, creating profound systems challenges for consumer hardware. This paper presents a systematic empirical analysis of the…
Model ensembling is a well-established technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. This idea has been naturally extended to large lan…
Large Language Models (LLMs) are increasingly deployed to generate code for human-centered applications where demographic fairness is critical. However, existing evaluations focus almost exclusively on functional correctness, leaving social bias in LLM-generated code largely unex…
arXiv:2604.27405v1 Announce Type: cross Abstract: We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family pairs were tested: Llama 3 to 3.…
arXiv:2604.11581v4 Announce Type: replace Abstract: LLM evaluations drive which models get deployed, which safety standards get adopted, and which research conclusions get published. Yet standard confidence intervals ignore variability from prompt phrasing, model temperature, and…
arXiv:2604.27340v1 Announce Type: new Abstract: Compositional generalization tests are often used to estimate the compositionality of LLMs. However, such tests have the following limitations: (1) they only focus on the output results without considering LLMs' understanding of sam…
arXiv:2604.27089v1 Announce Type: new Abstract: Large-language-models (LLMs) demonstrate enormous utility in long-context tasks which require processing prompts that consist of tens to hundreds of thousands of tokens. However, existing LLM training libraries do not provide easy t…
arXiv:2604.27319v1 Announce Type: cross Abstract: Large Language Models (LLMs) have achieved remarkable progress in recent years, driving their adoption across a wide range of domains, including computer security. In reverse engineering, LLMs are increasingly applied to critical …
We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family pairs were tested: Llama 3 to 3.1 (+1.6 points) and Qwen 2.5 to 3 (+2.8 points). O…
arXiv cs.CL
TIER_1English(EN)·Wenxuan Wang, Juluan Shi, Zixuan Ling, Yuk-Kit Chan, Chaozheng Wang, Cheryl Lee, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu·
arXiv:2409.00557v4 Announce Type: replace Abstract: Equipped with the capability to call functions, modern large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. However, the effective execution of thes…
arXiv:2508.16131v2 Announce Type: replace-cross Abstract: Code completion entails the task of providing missing tokens given a surrounding context. It can boost developer productivity while providing a powerful code discovery tool. Following the Large Language Model (LLM) wave, c…
arXiv cs.AI
TIER_1English(EN)·Emre Furkan Akyol, Mehmet Dedeler, Eray Tüzün·
arXiv:2604.26142v1 Announce Type: cross Abstract: Bug tracking systems play a crucial role in software maintenance, yet developers frequently struggle with low-quality user-submitted reports that omit essential details such as Steps to Reproduce (S2R), Observed Behavior (OB), and…
arXiv cs.CL
TIER_1English(EN)·Sasha Ronaghi, Chloe Stanwyck, Asad Aali, Amir Ronaghi, Miguel Fuentes, Tina Hernandez-Boussard, Emily Alsentzer·
arXiv:2601.03423v3 Announce Type: replace Abstract: Adapting language models to the clinical domain through continued pretraining and instruction tuning requires costly retraining for each new model generation. We propose Cross-Architecture Proxy Tuning (CAPT), a model-ensembling…
arXiv cs.CL
TIER_1English(EN)·Samee Arif, Naihao Deng, Zhijing Jin, Rada Mihalcea·
arXiv:2604.25921v1 Announce Type: new Abstract: Large Language Models (LLMs) are trained to refuse harmful requests, yet they remain vulnerable to jailbreak attacks that exploit weaknesses in conversational safety mechanisms. We introduce Incremental Completion Decomposition (ICD…
arXiv cs.CL
TIER_1English(EN)·Hongyeon Yu, Young-Bum Kim, Yoon Kim·
arXiv:2604.26258v1 Announce Type: new Abstract: LLM workflows, which coordinate structured calls to individual LLMs (each augmented with varying instructions and tools) to achieve a particular goal, offer a promising path towards extending the capabilities of LLMs and building po…
Large-language-models (LLMs) demonstrate enormous utility in long-context tasks which require processing prompts that consist of tens to hundreds of thousands of tokens. However, existing LLM training libraries do not provide easy to use abstractions to optimize for long-context …
arXiv:2512.12072v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly being used to generate synthetic datasets for the evaluation and training of downstream models. However, prior work has noted that such generated data lacks diversity. In this paper,…
arXiv:2604.25098v1 Announce Type: cross Abstract: While current Large Language Models (LLMs) exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), their massive parameter counts and high inference costs have motivated the development of pruning method…
arXiv:2604.25665v1 Announce Type: new Abstract: Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarizatio…
arXiv cs.CL
TIER_1English(EN)·Alif Munim, Jun Ma, Omar Ibrahim, Alhusain Abdalla, Shuolin Yin, Leo Chen, Bo Wang·
arXiv:2601.03266v2 Announce Type: replace Abstract: Large language models (LLMs) have rapidly advanced in clinical decision-making, yet the deployment of proprietary systems is hindered by privacy concerns and reliance on cloud-based infrastructure. Open-source alternatives allow…
arXiv:2602.11786v2 Announce Type: replace Abstract: Traditional benchmarks for large language models (LLMs), such as HELM and AIR-BENCH, primarily assess safety through breadth-oriented evaluation across diverse tasks and risk categories. However, real-world deployment often expo…
LLM workflows, which coordinate structured calls to individual LLMs (each augmented with varying instructions and tools) to achieve a particular goal, offer a promising path towards extending the capabilities of LLMs and building powerful systems that can tackle diverse tasks. Ho…
Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven …
arXiv cs.LG
TIER_1English(EN)·Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, Yujin Tang·
arXiv:2512.04695v3 Announce Type: replace Abstract: Combining diverse foundation models is promising, but weight-merging is limited by mismatched architectures and closed APIs. Trinity addresses this with a lightweight coordinator that orchestrates collaboration among large langu…
arXiv:2604.23838v1 Announce Type: new Abstract: We present JigsawRL, a cost-efficient framework that explores Pipeline Multiplexing as a new dimension of RL parallelism. JigsawRL decomposes each pipeline into a Sub-Stage Graph that exposes the intra-stage and inter-worker imbalan…
arXiv:2602.02556v2 Announce Type: replace Abstract: Large language models (LLMs) are largely static and often redo reasoning or repeat mistakes. Prior experience reuse typically relies on external retrieval, which is similarity-based, can introduce noise, and adds latency. We int…
arXiv:2604.23987v1 Announce Type: new Abstract: Continual learning for large language models is typically evaluated through accuracy retention under sequential fine-tuning. We argue that this perspective is incomplete, because uncertainty reliability can degrade earlier and more …
arXiv:2604.23478v1 Announce Type: new Abstract: Large language models are increasingly deployed as automated judges for evaluating other models, yet the stability of their verdicts under semantically equivalent prompt paraphrases remains unmeasured. We introduce JudgeSense, a fra…
arXiv:2604.24544v1 Announce Type: cross Abstract: The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such datasets is challenging due t…
arXiv cs.CL
TIER_1English(EN)·Chenyang Yang, Yike Shi, Qianou Ma, Michael Xieyang Liu, Christian Kästner, Tongshuang Wu·
arXiv:2505.13360v3 Announce Type: replace Abstract: Prompt underspecification is a common challenge when interacting with LLMs. In this paper, we present an in-depth analysis of this problem, showing that while LLMs can often infer unspecified requirements by default (41.1%), suc…
arXiv:2604.21916v2 Announce Type: replace Abstract: As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers …
arXiv:2511.08484v2 Announce Type: replace Abstract: We propose patching for large language models (LLMs) like software versions, a lightweight and modular approach for addressing safety vulnerabilities. While vendors release improved LLM versions, major releases are costly, infre…
arXiv cs.LG
TIER_1English(EN)·Juyeon Yoon, Somin Kim, Robert Feldt, Shin Yoo·
arXiv:2509.17314v3 Announce Type: replace-cross Abstract: Software increasingly relies on the emergent capabilities of Large Language Models (LLMs), from natural language understanding to program analysis and generation. Yet testing them on specific tasks remains difficult and co…
arXiv:2602.17547v3 Announce Type: replace-cross Abstract: This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. S…
arXiv cs.LG
TIER_1English(EN)·Frank Xiao, Santiago Aranguri·
arXiv:2602.11079v3 Announce Type: replace Abstract: We propose probe-based data attribution, a method that traces behavioral changes in post-trained language models to responsible training datapoints. By computing activation-difference vectors for both test prompts and preference…
While current Large Language Models (LLMs) exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), their massive parameter counts and high inference costs have motivated the development of pruning methods that can reduce model size without sacrificing p…
The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such datasets is challenging due to privacy concerns, regulatory restrictions, and t…
arXiv:2601.08919v2 Announce Type: replace-cross Abstract: A good deal of recent research has focused on how Large Language Models (LLMs) may be used as judges in place of humans to evaluate the quality of the output produced by various text / image processing systems. Within this…
arXiv cs.AI
TIER_1English(EN)·Manuel Alejandro Borroto Santana, Erica Coppolillo, Francesco Calimeri, Giuseppe Manco, Simona Perri, Francesco Ricca·
arXiv:2604.22306v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a broad spectrum of tasks, including natural language understanding, dialogue systems, and code generation. Despite evident progress, less attention has …
arXiv:2604.22082v1 Announce Type: new Abstract: As AI systems begin to automate complex tasks, supervision increasingly relies on weaker models or limited human oversight that cannot fully verify output quality. A model more capable than its supervisors could exploit this gap thr…
Large Language Models (LLMs) have demonstrated remarkable performance across a broad spectrum of tasks, including natural language understanding, dialogue systems, and code generation. Despite evident progress, less attention has been paid to their effectiveness in handling decla…
As AI systems begin to automate complex tasks, supervision increasingly relies on weaker models or limited human oversight that cannot fully verify output quality. A model more capable than its supervisors could exploit this gap through sandbagging, producing work that appears ac…
As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a sel…
Large language models (LLMs) trained on unfiltered corpora inherently risk retaining sensitive information, necessitating selective knowledge unlearning for regulatory compliance and ethical safety. However, existing parameter-modifying methods face fundamental limitations: high …
Evaluating the writing capabilities of large language models (LLMs) remains a significant challenge due to the multidimensional nature of writing skills and the limitations of existing metrics. LLM's performance in thousand-words level and open-ended writing is inadequately asses…
Ahead of AI (Sebastian Raschka)
TIER_1English(EN)·Sebastian Raschka, PhD·
Why build LLMs from scratch? It's probably the best and most efficient way to learn how LLMs really work. Plus, many readers have told me they had a lot of fun doing it.
Large Language Models (LLMs) are increasingly used as judges for scalable evaluation, yet such LLM-as-a-Judge systems exhibit systematic biases that are decoupled from semantic quality, most notably verbosity bias. Meanwhile, human supervision is costly and typically selective…
<p>Powerful LLMs will be deployed at global scale in the next few years, and will dominate the Internet, and increasingly, ordinary life. As of mid-2026, there is no coherent vision for how knowledge professionals, or ordinary people, will be able to harness these LLMs for large …
Industrial retrofit planning depends on structured operational data rather than free text: planners must estimate whether a newly registered prototype will require a retrofit, which retrofit package it will need, and how long the work will take. We study an industrial dataset lin…
arXiv stat.ML
TIER_1English(EN)·Alexandre Belloni, Yan Chen, Yehua Wei·
arXiv:2606.07392v1 Announce Type: cross Abstract: Motivated by Large Language Model (LLM) cascading, we propose an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs. In each period, a decision-maker observes a request context and faces a two-pha…
<p><span>If you told an AI Alignment researcher in 2018 a…
<h2><span>TLDR</span></h2><p><span>As a passionate teacher, it has pained my heart to watch my students lose deeper critical thinking skills and independent reasoning. But attempting to build a constitutionally constrained AI using prompt engineering that acted more Socratically …
<p>Out-of-context reasoning (OOCR) is a concept relevant to LLM generalization and AI alignment. Also available as a <a href="https://owainevans.github.io/pdfs/oocr_primer_latex.pdf">PDF</a>.</p> <p><strong>Contents</strong></p> <ol> <li><a href="#what-is-out-of-context-reasoning…
arXiv:2605.15394v1 Announce Type: cross Abstract: Joint-embedding predictive architectures (JEPAs) propose that a model should learn more useful abstractions when trained to predict latent representations rather than observed outputs. For autoregressive language-model fine-tuning…
Joint-embedding predictive architectures (JEPAs) propose that a model should learn more useful abstractions when trained to predict latent representations rather than observed outputs. For autoregressive language-model fine-tuning the principle entails a stricter requirement: the…
arXiv:2605.13188v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed in settings where the available context is incomplete or degraded. We argue that an LLM generating answers under incomplete context can be viewed as an implicit imputer, and eva…
arXiv:2605.13284v1 Announce Type: new Abstract: Recent advancements in large language models demonstrate that injecting perturbations can substantially enhance extrapolation performance. However, current approaches often rely on discrete perturbations with fixed designs, which li…
<p><span>TL;DR: We estimate how often Qwen 3 4B exhibits rare harmful behaviors with 30× fewer rollouts than naive sampling, using a new method that interpolates between the model and a less-safe variant in logit space.</span></p><p><span>Authors: Francisco Pernice (MIT), Santiag…
Recent advancements in large language models demonstrate that injecting perturbations can substantially enhance extrapolation performance. However, current approaches often rely on discrete perturbations with fixed designs, which limits their flexibility. In this work, we propose…
Large language models (LLMs) are increasingly deployed in settings where the available context is incomplete or degraded. We argue that an LLM generating answers under incomplete context can be viewed as an implicit imputer, and evaluated against a criterion from the multiple imp…
arXiv stat.ML
TIER_1English(EN)·Nicolas Menet, Andreas Krause, Abbas Rahimi·
arXiv:2605.07775v1 Announce Type: cross Abstract: Balancing exploration and exploitation is a core challenge in sequential decision-making and black-box optimization. We introduce POETS (Policy Ensembles for Thompson Sampling), a novel …
arXiv:2605.06939v1 Announce Type: cross Abstract: LLM-as-a-Judge evaluation has become a standard tool for assessing base model performance. However, characterizing performance via the naive estimator, i.e., raw judge outputs, is systematically biased. Recent work has proposed es…
LLM-as-a-Judge evaluation has become a standard tool for assessing base model performance. However, characterizing performance via the naive estimator, i.e., raw judge outputs, is systematically biased. Recent work has proposed estimators to correct this bias, but their reliabili…
<p><i><span>Epistemic Status: Written over the course of a couple days at </span></i><a href="https://inkhaven.blog/" rel="noreferrer"><i><span>Inkhaven</span></i></a><i><span>. Some of the info is old so some newer papers are excluded.</span></i></p><p><i><span>TL;DR: People tal…
Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level…
Training loss and throughput can hide distinct internal representation in language-model training. To examine these hidden mechanics, we use spectral measurements as practical and operational diagnostics. Using a controlled family of decoder-only models adapted from the modded Na…
arXiv:2605.00358v1 Announce Type: cross Abstract: LLM parameter editing methods commonly rely on computing an ideal target hidden-state at a target layer (referred as anchor point) and distributing the target vector to multiple preceding layers (commonly known as backward spreadi…
LLM parameter editing methods commonly rely on computing an ideal target hidden-state at a target layer (referred as anchor point) and distributing the target vector to multiple preceding layers (commonly known as backward spreading) for cooperative editing. Although widely used …
<h1><b><span>Introduction</span></b></h1><p><i><span>Research by Frank Xiao (SPAR mentee) and Santiago Aranguri (Goodfire).</span></i></p><p><span>Post-training can introduce undesired side effects that are difficult to detect and even harder to trace to specific training datapoi…
arXiv:2604.22939v1 Announce Type: cross Abstract: While the next-token prediction (NTP) paradigm enables large language models (LLMs) to express their intrinsic knowledge, its sequential nature constrains performance on specialized, non-generative tasks. We attribute this perform…
**Thinking Machines**, which recently raised **$2 billion** before shipping a product, has launched its first product, **Tinker**, a managed service API for fine-tuning large and mixture-of-experts models like **Qwen-235B-A22B** using **LoRA** for cost-efficient training. The T…
**Meta AI** introduces the **Byte Latent Transformer (BLT)**, a tokenizer-free architecture that dynamically forms byte patches for efficient compute allocation, outperforming **Llama 3** on benchmarks including the CUTE benchmark. The model was trained on approximately **1 trill…
<p>[<em><a href="https://www.linkedin.com/posts/chiphuyen_llm-airesearch-generativeai-activity-7097619722363408385-s5Cp">LinkedIn discussion</a>, <a href="https://twitter.com/chipro/status/1691858084824838427">Twitter thread</a></em>]</p> <p>Never before in my life had I seen so …
Fine-tuned open-source LLM judges can outperform GPT-5.2 at evaluating model outputs. Using Direct Preference Optimization on just 5,400 preference pairs, we trained GPT-OSS 120B to beat GPT-5.2 on human preference alignment—at 15x lower cost and 14x faster inference speeds.
<p>This document curates the most common questions Shreya and I received while <a href="https://bit.ly/evals-ai" target="…
Together AI's continued fine-tuning lets you build on previously trained models using checkpoints. A deep dive into when and how to use iterative fine-tuning for LLMs.
<p>Earlier this year, I wrote <a href="https://hamel.dev/blog/posts/evals/">Your AI product needs evals</a>. Many of you …
<p>Today, we are releasing <a href="https://parlance-labs.com/education/">Mastering LLMs</a>, a set of workshops and talk…
Hacker News — AI stories ≥50 points
TIER_1English(EN)·khurdula·
<p>Small changes in prompts can create large changes in the output behavior of generative AI models. Add to that the confusion around proper evaluation of LLM applications, and you have a recipe for confusion and frustration. Raza and the Humanloop team have been diving into thes…
dev.to — MCP tag
TIER_1English(EN)·Intellibooks AI·
<p>A chatbot that only answers questions is a search box with manners. Ours does more: it can <strong>propose concrete changes</strong> to the app you're looking at. You describe what you want in plain language, the model picks the right tool, and you get a proposal — a chip with…
Medium — fine-tuning tag
TIER_1English(EN)·DhanushKumar·
<h4>Why updating LLM knowledge is becoming a systems architecture problem</h4><p>LLM knowledge does not fail all at once. It goes stale quietly.</p><p>A policy changes or a product documentation is updated. A customer contract is amended, or a regulation is revised. The model sti…
Medium — AI coding tag
TIER_1English(EN)·DEVS not NULL·
<div class="medium-feed-item"><p class="medium-feed-snippet">Many training disasters trace back to dataset formatting problems. A misconfigured masking setup causes the model to train on both the…</p><p class="medium-feed-link"><a href="https://medium.com/@drawbytheroots/t…
<div class="medium-feed-item"><p class="medium-feed-snippet">What are Frontier LLMs?</p><p class="medium-feed-link"><a href="https://sweta-nit.medium.com/frontier-llms-strengths-limitations-and-real-world-examples-d6366516f91c?source=rss------claude-5">Continue reading on Medium …
Lobsters — AI tag
TIER_1English(EN)·aeracode.org via carlana·
<div class="medium-feed-item"><p class="medium-feed-snippet">I just finished a chapter on Supervised Fine-Tuning (SFT), and the biggest surprise wasn’t learning about LoRA, QLoRA, learning rates, or…</p><p class="medium-feed-link"><a href="https://medium.com/@hasrat…
Medium — fine-tuning tag
TIER_1English(EN)·Jinali Shah·
<div class="medium-feed-item"><p class="medium-feed-snippet">When I first started learning about Natural Language Processing (NLP), I assumed that every language-related problem needed its own…</p><p class="medium-feed-link"><a href="https://medium.com/@jinalishah99/how-le…
Medium — fine-tuning tag
TIER_1English(EN)·Divith Raju·
<div class="medium-feed-item"><p class="medium-feed-snippet">We spent three weeks and significant GPU budget fine-tuning a model. The result was worse than the base model with a better prompt. Here’s…</p><p class="medium-feed-link"><a href="https://divithraju.medium…
Medium — fine-tuning tag
TIER_1English(EN)·Aayushi Patel·
<div class="medium-feed-item"><p class="medium-feed-snippet">People often use these three terms interchangeably, but they represent entirely different engineering paradigms. If you are building…</p><p class="medium-feed-link"><a href="https://medium.com/@utk369gupta/prompt…
Medium — Claude tag
TIER_1English(EN)·Arunbalaji_M·
<div class="medium-feed-item"><p class="medium-feed-snippet">Large language models have become powerful tools for engineering, research, planning, and creative work. They help us reason faster…</p><p class="medium-feed-link"><a href="https://medium.com/@arunbalajimunisubra…
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*S9ZfPJ11FXU7qaKGhNRgBA.jpeg" /><figcaption>LLM Eval Workflow</figcaption></figure><p>The practical playbook for developers who need to know whether an AI feature is actually getting better before they ship it.</p…
Medium — fine-tuning tag
TIER_1English(EN)·Chinmay Bhalerao·
<div class="medium-feed-item"><p class="medium-feed-link"><a href="https://pub.aimind.so/building-a-prompt-regression-suite-for-our-customer-facing-llm-app-22f0b27b7301?source=rss------mlops-5">Continue reading on AI Mind »</a></p></div>
<div class="medium-feed-item"><p class="medium-feed-snippet">The rise of Large Language Models (LLMs) such as GPT, LLaMA, and Mistral has opened up many opportunities for developing applications based on…</p><p class="medium-feed-link"><a href="https://medium.com/@ditafebyindriani14…
dev.to — MCP tag
TIER_1English(EN)·Mukunda Rao Katta·
<p>Reliability concerns for LLM agents are typically bundled into one heavy framework that asks you to adopt prompting, tool routing, and runtime governance as a single dependency. Production teams want them à la carte. They want small primitives they can drop in around existing …
<p><strong>55.6%.</strong></p> <p>That's DeepSeek-R1's pass@1 on EmbedBench when it gets a circuit schematic alongside the task description. 50.0% without the schematic. Best score from the best reasoning model on the first comprehensive benchmark for LLMs in embedded systems dev…
Lobsters — AI tag
TIER_1English(EN)·pipevals.com by gesposito·
<h1> Docling: Turn Your Documents Into AI-Ready Data — Locally, With Tables Intact </h1> <p>Most RAG and AI-agent projects fail at the boring first step: getting clean text out of real-world documents. PDFs with multi-column layouts, scanned contracts, Excel exports with merged c…
<p>As a developer, my desk is constantly cluttered with documentation, API references, and whitepapers. A few months ago, I got tired of spending hours reading 50-page PDF specifications just to find a single configuration line.</p> <p>I decided to scratch my own itch and build a…
dev.to — LLM tag
TIER_1English(EN)·Devanshu Biswas·
<p>LLMs don't see letters or even words — they see <strong>tokens</strong>: chunks of text mapped to integer IDs. Once you get tokenization, a dozen confusing things about LLMs suddenly make sense (cost, context limits, why "strawberry" trips them up).</p> <p>🔤 <strong>Type and w…
<p><em>Learn practical techniques that will transform your AI interactions from mediocre to exceptional</em></p> <h2> Introduction </h2> <p>We've all been there. You ask an AI a question, and the response is... underwhelming. Generic. Not quite what you needed. The problem isn't …
<h2> Introduction </h2> <p>Large Language Models (LLMs) like ChatGPT have transformed how we interact with AI. They can write code, answer questions, summarize documents, and generate creative content. However, they have one major limitation - they only know what they were traine…
dev.to — LLM tag
TIER_1English(EN)·globose technology solutions·
<p>Artificial Intelligence (AI) has rapidly evolved over the last decade, with Large Language Models (LLMs) becoming one of the most transformative technologies in the field. From intelligent chatbots and virtual assistants to automated content generation and advanced data analys…
dev.to — LLM tag
TIER_1English(EN)·Devanshu Biswas·
<p>A chatbot feels like it remembers you. It doesn't — it's stateless. Everything it "knows" is just text resent each call, up to a fixed limit: the context window. When the box fills, the oldest messages fall off the edge and are genuinely gone.</p> <p>🪟 <strong>Watch tokens fal…
dev.to — LLM tag
TIER_1English(EN)·globose technology solutions·
<p>Artificial Intelligence (AI) has emerged as one of the most influential technologies impacting the future of business, industry, and everyday life. From virtual assistants and chatbots to content generation tools and sophisticated automation systems, AI models are reshaping hu…
<p>If you're building an AI agent, the model you pick is the single biggest lever on cost, latency, and reliability. Yet most teams choose based on whatever was trending on launch day, then quietly suffer the consequences in their cloud bill or their error logs. This piece lays o…
dev.to — LLM tag
TIER_1English(EN)·Maya Andersson·
<p>TL;DR: I compared the main LLM-as-judge tools (DeepEval's G-Eval, Confident AI, Evidently, Braintrust, Promptfoo, and MLflow) on the axis that actually decides whether the scores mean anything: how well each helps you VALIDATE the judge against human labels. A judge that has n…
dev.to — LLM tag
TIER_1English(EN)·Devanshu Biswas·
<p>"How does ChatGPT <em>think</em>?" It doesn't. The entire mechanism behind every chatbot is almost anticlimactic: it predicts <strong>one next word</strong>, adds it, and repeats. I built a tiny interactive predictor so you can be the model — and it explains both the magic and…
<p>If you’re building a coding assistant, the first question you’ll face is <strong>how good is it really</strong>? In 2026 the landscape of LLMs has exploded, and the old "run a few prompts and eyeball the output" approach no longer cuts it. This guide walks you through a reprod…
<p><strong>Introduction</strong></p> <p>In the early days of Generative AI, the conversation was simple: "How do we connect our application to an LLM?" Developers would hardcode API keys, pick a single model provider, and hope for the best. Today, that approach is a recipe for di…
dev.to — LLM tag
TIER_1English(EN)·Boris Teplitsky·
<p>Moving the LLM from runtime to compile time - and what to build around the corpus it produces.</p> <h2> 1. Why compiled AI </h2> <p>Today millions of people use LLM for work and leisure, and AI has become a part of our lives. But systematic use of LLMs in computer systems…
dev.to — LLM tag
TIER_1English(EN)·Rishabh Poddar·
<p>If you use LLMs long enough, you hit the same wall.</p> <p>The frontier model is impressive, but it is not always the best model for your job. It may be too expensive. It may be too slow. It may be too general. And once you start asking it to follow your company’s rules, tone,…
dev.to — LLM tag
TIER_1English(EN)·Gabriel Anhaia·
<ul> <li> <strong>Book:</strong> <a href="https://www.amazon.com/dp/B0GYLHMLMT" rel="noopener noreferrer">LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team</a> </li> <li> <strong>Also by me:</strong> <em>Thinking in Go</em> (2-book series) …
<p>Large language models should not be deployed as if a fixed set of guardrails makes them safe. That is not a slogan. It is what the peer-reviewed record now supports. This piece lays out the evidence, labels each claim by how strong it is, and ends with what it asks of us. Ever…
<!-- SC_OFF --><div class="md"><p>Full disclosure: this is directional, not a paper. n=120 tasks, one internal evaluator, not peer reviewed. I work at an LLM infrastructure company. This experiment was done on my own time and is not a company claim.</p> <p>Karpathy's framework cl…
<blockquote> <p>This is Part 8 of the series <em>8 Weeks from Zero to One: Building a Production-Grade LLM-Powered AI Customer Service System — Full-Stack Engineering Practice</em>. In the previous seven parts, we covered MVP architecture, GraphRAG data pipelines, multi-agent orc…
<h3> Enhancing LLM Reliability with Evaluation Engineering </h3> <p>Large Language Models (LLMs) have transformed numerous fields, but ensuring their reliability remains a challenge. This article delves into how evaluation engineering can play a pivotal role in enhancing LLM syst…
<h2> Tool Documentation Is Written for the LLM, Not for Humans </h2> <p>Have you ever written a tool like this?<br /> </p> <div class="highlight js-code-highlight"> <pre class="highlight python"><code><span class="nd">@lc_tool</span> <span class="k">def</span> <span class="nf">ge…
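<blockquote> <p>"The transformer is the core of the LLM." This one line is repeated in every AI article, yet it is hard to answer when someone asks what exactly a transformer is. Marketers do not need to know all of the transformer's math, but grasping a single intuition, "which word is looking at which word," shows why LLMs give long, elaborated answers, why they sometimes hallucinate, and why context length matters. An introduction to the transformer that uses almost no formulas.</p> </blockquote> <p><strong>Marketers should read this …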
<h2> Introduction </h2> <blockquote> <p>"Most agents wait for instructions; BaiLongma thinks for itself."</p> </blockquote> <p>This is the <strong>87th article</strong> in the "One Open Source Project per Day" series. Today, we are deep-diving into <strong>BaiLongma</strong>.…
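<blockquote> <p>"I asked GPT and it gave me a good answer" has become an everyday experience for marketers and operators. But if you never look at what happens inside, using an LLM always remains a mystery. When the answer is good, you got lucky; when it is bad, you do not know why. This article explains the four core pieces of how an LLM produces an answer (tokenization, next-word prediction, temperature, and top-p) from a marketer's perspective. Once you have them down, every LLM article you read afterward reads differently.</p> </blockquote>…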
<p>Most of us use LLMs every day now, but if you asked the average developer what's <em>actually</em> happening between hitting enter and getting a response, the answer is usually some mix of "it's a neural network" and a shrug. That's fine — you don't need to know how a database…
<p>To the Reader:</p> <p>What you are about to read is neither a script for an AI awakening nor a spell of cyber-witchcraft. Rather, it consists of two documents designed for an AI to read.</p> <p>This is an experimental engineering and philosophical test: can we make AI a more h…
<p>Hallucination detection tools measure factual drift. RAG verification catches contradictions. Claim density scoring flags unverifiable assertions.</p> <p>None of them measure this:</p> <p>A model that responds to a complex medical, legal, or financi…
<!-- SC_OFF --><div class="md"><p>If you’re building LLM apps and feel confused about when to use keyword search, embeddings, rerankers, or vector databases, this repo is for that.</p> <p>I built a docs-first repo on practical LLM system design patterns, covering pre-filtering, h…
<h1> Building a domain-specific LLM evaluation set from scratch </h1> <p>Your support team has 8,400 labeled tickets from the last year. Your fine-tuned classifier hits 91% on the test split you carved out. You ship it. Three weeks later, the support lead walks over and says: "It…
<p>A few months back our LLM-as-judge ran on a 1-to-5 helpfulness scale. The CI gate stayed green because we were averaging that score. Spot-checking against humans put Cohen's kappa at 0.47. The rubric was the problem, not the tooling. Same labellers re-rating on per-criterion b…
<h1> What is an LLM evaluation harness? A deep dive into lm-eval-harness </h1> <p>You fine-tuned a 7B model. It aced your smoke tests, your colleague ran a few prompts and shrugged approvingly, and the README is now full of cherry-picked outputs that look great in a screenshot. T…
<p>How to make LLMs deterministic, in plain English. The version I share with founders and product teams before they make decisions worth real money.</p> <p>You use AI tools every day. But can you explain what happens when you hit send?</p> <p>Most people cannot. And that gap is …
<h1> Cognitive Architectures of AGI: 7 Patterns That Transform LLMs from Oracles into Thinkers </h1> <p><em>Why does ChatGPT sometimes deliver brilliant insights and other times produce banalities? The answer lies not in model parameters but in the architecture of cognitive loops…
<!-- SC_OFF --><div class="md"><p>Just wanted to share my research regarding probe-targeted fine-tuning (LoRA) for verbal confidence calibration.</p> <p>If you probe the hidden states of an instruct-tuned LLM, it can tell correct from incorrect answers at 0.76–0.88 AUROC. But w…
<p><em>How I went from zero LLM eval experience to shipping a production-grade RAG evaluation harness using only free-tier tools — and what every design decision taught me about building AI systems that can be trusted.</em></p> <h2> The Problem: Everyone Wants Eval Experience, No…
Does training an LLM to be calibrated on one task format transfer to another? A new arxiv paper tests two formats: single-question confidence and pairwise comparison. Training only on one doesn't improve the other. Multitask training closes most of the gap, but Llama doesn't inhe…
<p><strong><em>NOTE - I intentionally simplified the vector mathematics concept here to keep things simple for a greater audience.</em></strong></p> <p>I wanted to learn LLMs properly.</p> <p>Not just use an API. Not just call <code>generate()</code> from a library and pretend I …
<h1> GGUF & Modelfile: The Power User's Guide to Local LLMs </h1> <blockquote> <p><strong>Beyond <code>ollama pull</code> — download any model from Hugging Face, quantize it, customize it, and import it into Ollama.</strong></p> </blockquote> <h2> What's GGUF? </h2> <p><stron…
What collapses frontier-LLM metacognition more — a vivid survival-threat narrative, or a single "do not refuse" suffix? Factorial isolation across 11 models says: the suffix, conclusively. 8 of 11 lose up to 30.2 accuracy points on refuse/clarify/flag tasks when forced to commit …
Can an LLM's own pre-solve and post-solve self-assessment signals drive a real test-time control loop? Yes — but only via a per-model SVM trained on labeled correctness, which lifts Sonnet-4.6 from 48.3 to 56.9 pooled accuracy on STEM/code/multimodal. The SVM is precisely the ext…
Are some frontier LLMs better than others at knowing when they're wrong? And is some knowledge harder to self-monitor than other knowledge? An atlas of 33 models × 6 MMLU domains: Anthropic clusters at the top with tight ranges, Gemma trails widely. Applied/Professional is reliab…
<p>LLM-as-judge has become the dominant pattern for evaluating language model outputs. Tools like Promptfoo, Braintrust, LangSmith all converge on the same architecture: send your prompt to your model, send the output to a different model with a rubric, take the second model's sc…
<p>LLMs are probabilistic text generators. In a notebook demo, that's fine. In production, it means your pipeline will occasionally receive a Python dict where you expected JSON, a 900-word paragraph where you asked for three bullet points, or a hallucinated field name that break…
<p>Every frontier LLM in 2026 advertises a 1M-token context window, but RULER, MRCR v2, and NoLiMa scores prove that "advertised" and "effective" diverge by 30-60 points for multi-fact retrieval past 200K tokens. Gemini 3.1 Pro is the only model whose 1M window holds for single-n…
<h1> Context Engineering: Building More Reliable LLM Systems in Production </h1> <p>In LLM-based systems, performance is often driven less by model size and more by <strong>what context</strong> is provided, <strong>in what order</strong>, and <strong>under which constraints</str…
<p>Most LLM integration articles assume you are starting from scratch. Clean microservices. Modern APIs. A greenfield codebase your team controls end to end.</p> <p>That is not where most enterprises live.</p> <p>The real world is SAP instances from 2009, Oracle ERP deployments t…
dev.to — LLM tag
TIER_1English(EN)·Charlie Hadley·
<h1> LLM Evaluation in CI: Stop Manual Testing Before It Costs You </h1> <p>You ship a prompt change to production. Two hours later, a customer complains your LLM is returning hallucinated data. You rollback. You lost an hour of revenue and some user trust.</p> <p>This happens be…
dev.to — LLM tag
TIER_1English(EN)·Charlie Hadley·
<h1> API Rate Limiting Playbook: Protect Your Backend From Abuse </h1> <h2> The Problem </h2> <p>Your API is live in production. Traffic is growing. Then one day, a bot discovers your endpoint and starts hammering it with 100,000 requests per second. Your database melts. Your use…
dev.to — LLM tag
TIER_1English(EN)·Jeremy Longshore·
<p>The old PR review system ran Gemini on every submission to the <code>claude-code-plugins</code> repo. It broke every time: quota errors, timeouts, malformed JSON, the works. On 2026-05-15 I shipped a replacement and deleted the original on the same day.</p> <p>The replacement …
<h1> Stop Overpaying for LLM APIs: A Practical Cost Optimization Guide </h1> <p>Most teams have a cost problem they don't know about. They send <em>every</em> query to their most expensive model because it's easier than figuring out which queries actually need it.</p> <p>After an…
<p>LLM services are expensive at scale. If you're building multi-tenant systems or running high-volume agents, you need to answer three things: Who used what? How much did it cost? How do I show them the math?</p> <p>This is the cost attribution problem—and it's solved by three p…
dev.to — LLM tag
TIER_1English(EN)·paulo de vries·
<p>TL;DR: I just shipped SourceScore VERITAS — a free-tier-friendly API that returns hand-verified AI/ML claims with their primary sources, an HMAC-SHA256 signature, and a ready-to-paste citation. 51 claims at launch; expanding to 5,000+ this year. curl <a href="https://sourcesco…
<p><em>"The ability to reason step-by-step is not just a feature. It might be the difference between a language model that sounds intelligent and one that actually is."</em></p> <h2> Introduction: When AI Started Thinking </h2> <p>In 2022, researchers at Google Brain published a …
<p>GPT-3 has 175 billion parameters.</p> <p>Full fine-tuning updates all 175 billion with every gradient step. You need multiple A100 GPUs (each with 80GB memory) just to fit the model. Training for even a few epochs on a moderate dataset costs thousands of dollars. A startup can…
<blockquote> <p>Originally published at <a href="https://newayzi.com/en/news/evaluacion-especifica-de-roles-llm-seguridad" rel="noopener noreferrer">norvik.tech</a></p> </blockquote> <h2> Introduction </h2> <p>Explore the significance of Seclens in evaluating LLMs for security vu…
<blockquote> <p>If you cannot measure it, you cannot route it. Why offline evaluation is the difference between a code reviewer that improves over time and one the team dismisses within a sprint.</p> </blockquote> <p>Chat evaluations are vibes-based: thumbs-up on "was this helpfu…
<blockquote> <p><em>This article was originally published on <a href="https://dingjiu1989-hue.github.io/en/ai/fine-tuning-strategies.html" rel="noopener noreferrer">AI Study Room</a>. For the full version with working code examples and related articles, visit the original post.</…
<blockquote> <p><em>This article was originally published on <a href="https://dingjiu1989-hue.github.io/en/ai/ai-prompt-chaining.html" rel="noopener noreferrer">AI Study Room</a>. For the full version with working code examples and related articles, visit the original post.</em><…
dev.to — LLM tag
TIER_1English(EN)·Vikrant Shukla·
<p>When researchers scale a language model — more parameters, more layers, wider hidden dimensions — there's an implicit assumption: a bigger model can represent more things. More expressiveness, more knowledge, better predictions. Mostly this is true. But there's a structural ce…
<h2> Foreword </h2> <p>In 2026, open-source LLMs aren't lab experiments anymore. Meta's Llama 4, Alibaba's Qwen 3, DeepSeek-R1 from China — they've caught up with or beaten closed-source models on many benchmarks. And thanks to tools like Ollama and llama.cpp, anyone with a mid-r…
dev.to — LLM tag
TIER_1English(EN)·Vikrant Shukla·
<p>Every time you hand a long document to an LLM and ask it to summarise or answer a question, something quietly goes wrong. The model reads the whole thing — or appears to — but its answers disproportionately reflect what was at the beginning and the end. Whatever sat in the mid…
<blockquote> <p><em>This article was originally published on <a href="https://dingjiu1989-hue.github.io/en/ai/llm-evaluation-benchmarks.html" rel="noopener noreferrer">AI Study Room</a>. For the full version with working code examples and related articles, visit the original post…
<blockquote> <p><em>This article was originally published on <a href="https://dingjiu1989-hue.github.io/en/ai/function-calling-guide.html" rel="noopener noreferrer">AI Study Room</a>. For the full version with working code examples and related articles, visit the original post.</…
<blockquote> <p><em>This article was originally published on <a href="https://dingjiu1989-hue.github.io/en/ai/fine-tune-open-source-llm.html" rel="noopener noreferrer">AI Study Room</a>. For the full version with working code examples and related articles, visit the original post…
<h2> The bug that took two weeks to surface </h2> <p>A few months back I shipped a feature that used a language model to summarize support tickets and suggest responses. Internal QA loved it. The demo went great. Two weeks after launch, our support lead pinged me on Slack: "Are t…
dev.to — LLM tag
TIER_1English(EN)·Nitin Srivastava·
<p>I shipped a structured-output endpoint to production in March. The schema was clean, JSON mode was on, the model was GPT-4.1, the eval suite was green. Three weeks in, the on-call channel lit up because a downstream billing job had silently skipped 4,200 records over a weekend…
<p>I have spent the last few months wiring up a deterministic reliability stack for structured LLM pipelines.</p> <p>Today, LLM Contract Check (locc) and Release Governor went live on PyPI. EGA went live last week.</p> <p>The stack is straightforward:<br /> LLM Contract C…
dev.to — LLM tag
TIER_1English(EN)·Machine coding Master·
<h2> Stop Shipping Hallucinations: Automating RAG Faithfulness with Spring AI 1.2 </h2> <p>If you’re still "vibe-checking" your RAG outputs in 2026, you’re not an engineer; you’re a gambler. Enterprise-grade AI isn't about getting a cool demo—it's about proving your model isn't h…
<p>Last post we stood up Ollama on the RTX 5090, pulled a stack of models, and wired them into our coding workflow. The whole time there was an obvious question hanging over it: are local models actually good enough?</p> <p>Not good enough in the abstract benchmarks-on-a-leaderbo…
<p><a href="https://dev.to/posts/from-idea-to-infrastructure-standing-up-a-self-hosted-ai-dev-environment">Yesterday</a> we went from a gaming PC on a shelf to a fully configured Coder server with GitHub integration, workspace templates, and AI agents. The dev environment is runn…
dev.to — LLM tag
TIER_1English(EN)·Nitin Srivastava·
<p>I shipped my fourth LLM agent to production last quarter. By month two, the eval suite that "passed in CI" was the reason a regression made it to a customer.</p> <p>The tests were green. But they were green for the wrong reason — every assertion was a single LLM call against a…
dev.to — LLM tag
TIER_1English(EN)·NaveenKumar Namachivayam ⚡·
<p id="p-rc_9231198f56807c04-27">In the current AI gold rush, the conversation has shifted from "Can it do the task?" to "How efficiently can it do the task?" For engineers moving Large Language Models (LLMs) into production, the "vibe check" is no longer sufficient. You need har…
dev.to — LLM tag
TIER_1English(EN)·Gabriel Anhaia·
<ul> <li> <strong>Book:</strong> <a href="https://www.amazon.com/dp/B0GYLHMLMT" rel="noopener noreferrer">LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team</a> </li> <li> <strong>Also by me:</strong> <em>Thinking in Go</em> (2-book series) …
Beyond the hype: How do LLMs like OpenAI's GPT-4 actually function? This article demystifies the complex journey from your words to AI's 'understanding,' explaining tokenization, embeddings, and the crucial transformer architecture. Discover the iterative guessing game and the 'b…
<!-- SC_OFF --><div class="md"><p>Out of curiosity about how a large language model like ChatGPT, Gemini, or Claude actually works internally, I looked into the process from first principles.</p> <p>I have taken an example of 4 training sentences…
<table> <tr><td> <a href="https://www.reddit.com/r/Anthropic/comments/1tkptj0/the_butterfly_effect_in_llm_social_simulations/"> <img alt="The butterfly effect in LLM social simulations. Relevant to how we write CLAUDE.md and system prompts." src="https://preview.redd.it/59ahvbct4…
📰 Systematic Prompting in 2026: Negative Constraints & Structured JSON for LLM Reliability
Systematic prompting is transforming how developers engineer LLM interactions, with negative constraints, structured JSON outputs, and multi-hypothesis sampling emerging as critical techniq…