New research tackles factual accuracy, architecture inference, and specialized evaluation for large language models

Google AI / Research TIER_1 English(EN) · 2025-09-17 17:00

Making LLMs more accurate by using all of their layers

Algorithms & Theory

Apple Machine Learning Research TIER_1 English(EN) · 2026-07-15 00:00

Large Language Models (LLMs) are increasingly deployed to autonomously solve real-world tasks. A key ingredient for this is the LLM Function-Calling paradigm, a widely used approach for equipping LLMs with tool-use capabilities. However, an LLM calling functions incorrectly can h…

arXiv cs.AI TIER_1 English(EN) · Aryan Keluskar, Amrita Bhattacharjee, Huan Liu · 2026-07-17 04:00

ToolAlignBench: Investigating Alignment Conflicts in Tool-Calling LLMs

arXiv:2607.14285v1 Announce Type: cross Abstract: Safety alignment in LLMs aims to align models with human values, but which values take precedence when they conflict? We investigate this question in the context of tool-calling LLM agents deployed in regulated industries, where a…

arXiv cs.AI TIER_1 English(EN) · Nyx Iskandar · 2026-07-17 04:00

Eta Given Delta: Defining LLM Tool Efficiency with Marginal Tool Utility

arXiv:2607.14108v1 Announce Type: cross Abstract: This paper introduces tool efficiency, a new quantitative metric to evaluate the rate of useful tool calls in an LLM agent trajectory. To ensure that tool efficiency is well-defined, we also introduce marginal tool utility, a new …

arXiv cs.AI TIER_1 English(EN) · Qingyu Zhang, Qianhao Yuan, Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun, Xiang Li, Ming Xu, Jiarui Li, Xiuyin Zhao · 2026-07-16 04:00

ShortOPD: Recovering Pruned LLMs via Short-to-Long On-Policy Distillation

arXiv:2607.13124v1 Announce Type: cross Abstract: Structured pruning is a hardware-friendly way to compress LLMs, but it is mostly validated on multiple-choice recognition tasks, while the same compressed checkpoints can collapse on the free-form generation that deployment actual…

arXiv cs.NE (Neural & Evolutionary) TIER_1 English(EN) · Zhi-Hui Zhan · 2026-07-15 14:52

How to Guide LLM Generation: Dual-Agent Guided Search for Automated Heuristic Design

Large language models (LLMs) have made automated heuristic design (AHD) increasingly practical by generating executable heuristic code from task descriptions and evaluator feedback. Yet under a limited query and evaluation budget, search efficiency depends critically on a pre-gen…

arXiv cs.AI TIER_1 English(EN) · Navnit Shukla · 2026-07-15 04:00

Cost-Constrained RAG: Unified Per-Tenant Cost Attribution Across Retrieval and Generation in Multi-Tenant LLM Systems

arXiv:2607.12188v1 Announce Type: new Abstract: Enterprise Retrieval-Augmented Generation (RAG) deployments face a critical governance gap: while LLM generation cost is metered per token, the retrieval layer - vector memory, similarity compute, and embedding API calls - remains a…

arXiv cs.AI TIER_1 English(EN) · Brenda Lelis, Rodrigo Cabral-Carvalho · 2026-07-15 04:00

RCWT: Measuring Task-Budget Displacement from Coordination Content in LLM Calls

arXiv:2607.12216v1 Announce Type: cross Abstract: Multi-agent and memory-augmented LLM systems often place coordination content, shared state, prior discussion, tool outputs, summaries, and role instructions, inside the same finite prompt used for the current task. This creates a…

arXiv cs.AI TIER_1 English(EN) · Aleh Manchuliantsau · 2026-07-15 04:00

Winning by Silence: Deletion Non-Monotonicity, Autonomous Exploitation, and Typestate Gating in LLM Plan Evaluation

arXiv:2607.12986v1 Announce Type: new Abstract: Plan evaluators can reward a strategic plan for becoming less explicit. This paper studies that failure in a staged expected-value scorer for LLM-generated venture routes. Proposition 1 gives the score change from deleting an interi…

arXiv cs.CL TIER_1 English(EN) · Huihao Jing, Wenbin Hu, Shaojin Chen, Haochen Shi, Hanyu Yang, Sirui Zhang, Haoran Li, Yangqiu Song · 2026-07-15 04:00

PerfCodeBench: Benchmarking Large Language Models for System-Level High-Performance Code Optimization

arXiv:2605.15222v2 Announce Type: replace-cross Abstract: Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emph…

arXiv cs.CL TIER_1 English(EN) · Chao Zhang, Yiren Liu, Lunyiu Nie, Jeffrey M. Rzeszotarski, Yun Huang, Tal August · 2026-07-15 04:00

From Text to Controls: Controllable LLM Generation

arXiv:2604.10925v2 Announce Type: cross Abstract: Natural language remains the predominant way people interact with large language models (LLMs). However, users often struggle to precisely express and control subjective preferences (e.g., tone, style, and emphasis) through prompt…

arXiv cs.CL TIER_1 English(EN) · Xiuyin Zhao · 2026-07-14 17:50

ShortOPD: Recovering Pruned LLMs via Short-to-Long On-Policy Distillation

Structured pruning is a hardware-friendly way to compress LLMs, but it is mostly validated on multiple-choice recognition tasks, while the same compressed checkpoints can collapse on the free-form generation that deployment actually requires. Two observations trace this gap. Firs…

arXiv cs.AI TIER_1 English(EN) · Aleh Manchuliantsau · 2026-07-14 17:29

Winning by Silence: Deletion Non-Monotonicity, Autonomous Exploitation, and Typestate Gating in LLM Plan Evaluation

Plan evaluators can reward a strategic plan for becoming less explicit. This paper studies that failure in a staged expected-value scorer for LLM-generated venture routes. Proposition 1 gives the score change from deleting an interior transition while retargeting its predecessor …

arXiv cs.AI TIER_1 English(EN) · Shrestha Datta, Hongfu Liu, Anshuman Chhabra · 2026-07-14 04:00

Weight-Adjusted Gradients Reveal Parameter Importance and Failure Modes in LLMs

arXiv:2607.10803v1 Announce Type: cross Abstract: Understanding which parameters are influential in Large Language Models (LLMs) is central to improving their efficiency, reliability, and interpretability. We introduce Weight-Adjusted Gradients (WAG), a simple yet effective appro…

arXiv cs.CL TIER_1 English(EN) · Anna Marklová, Jiří Milička, Martina Vokáčová, Rudolf Rosa · 2026-07-14 04:00

Production and Perception in LLMs: A Token-Probability Approach

arXiv:2607.11703v1 Announce Type: new Abstract: The asymmetry between language production and perception has been well-documented in psycholinguistics. Whether large language models (LLMs) exhibit a functionally analogous distinction remains an open question, particularly given t…

arXiv cs.AI TIER_1 English(EN) · Chigozirim Ifebi, Brent Kong, Ayushi Mehrotra · 2026-07-14 04:00

Minionese: A Comprehensive Benchmark and Mechanistic Study of Multilingual LLM Safety

arXiv:2607.10112v1 Announce Type: cross Abstract: Safety alignment in large language models remains brittle across languages: prompts reliably refused in English can elicit harmful compliance in non-English and low-resource settings. We introduce Minionese, a multilingua…

arXiv cs.AI TIER_1 English(EN) · Deep Pankajbhai Mehta · 2026-07-14 04:00

Format Sensitivity Index: Token-Controlled Prompt-Wrapper Robustness and Schema Compliance in LLM Benchmarking

arXiv:2607.09665v1 Announce Type: new Abstract: Prompt wrappers often differ only in formatting, yet they can change model scores enough to flip leaderboard conclusions. We study this variance under a token-controlled protocol and introduce two complementary metrics: the Format S…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-07-14 00:00

ShortOPD: Recovering Pruned LLMs via Short-to-Long On-Policy Distillation

Structured pruning is a hardware-friendly way to compress LLMs, but it is mostly validated on multiple-choice recognition tasks, while the same compressed checkpoints can collapse on the free-form generation that deployment actually requires. Two observations trace this gap. Firs…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Rodrigo Cabral-Carvalho · 2026-07-13 23:31

RCWT: Measuring Task-Budget Displacement from Coordination Content in LLM Calls

Multi-agent and memory-augmented LLM systems often place coordination content, shared state, prior discussion, tool outputs, summaries, and role instructions, inside the same finite prompt used for the current task. This creates a practical allocation problem: every token spent o…

arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Navnit Shukla · 2026-07-13 22:16

Cost-Constrained RAG: Unified Per-Tenant Cost Attribution Across Retrieval and Generation in Multi-Tenant LLM Systems

Enterprise Retrieval-Augmented Generation (RAG) deployments face a critical governance gap: while LLM generation cost is metered per token, the retrieval layer - vector memory, similarity compute, and embedding API calls - remains an unattributed shared cost, enabling invisible c…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-07-13 15:33

Production and Perception in LLMs: A Token-Probability Approach

The asymmetry between language production and perception has been well-documented in psycholinguistics. Whether large language models (LLMs) exhibit a functionally analogous distinction remains an open question, particularly given that LLMs rely on the same underlying mechanism (…

arXiv cs.CL TIER_1 English(EN) · Rudolf Rosa · 2026-07-13 15:33

Production and Perception in LLMs: A Token-Probability Approach

The asymmetry between language production and perception has been well-documented in psycholinguistics. Whether large language models (LLMs) exhibit a functionally analogous distinction remains an open question, particularly given that LLMs rely on the same underlying mechanism (…

arXiv cs.AI TIER_1 English(EN) · Viraaji Mothukuri, Reza M. Parizi · 2026-07-13 04:00

The Patchwork Problem in LLM-Generated Code

arXiv:2607.08981v1 Announce Type: cross Abstract: LLM-generated code often compiles, passes tests, and appears correct, yet breaks once deployed. The root cause is frequently structural rather than logical. A generated endpoint references configuration keys never declared in the …

arXiv cs.AI TIER_1 English(EN) · Amin Haeri, Mahdi Ghelichi · 2026-07-09 04:00

Specification Grounding Improves the Effectiveness of LLM Code Testing

arXiv:2607.06636v1 Announce Type: cross Abstract: Large language models frequently generate code that appears correct on typical inputs yet fails on edge cases, invalid inputs, and other specification-defined corner conditions. A popular fix has the model write its own tests and …

arXiv cs.LG TIER_1 English(EN) · Daniel Maninger, Leon Chemnitz, Jannis Brugger, Tushar Lamba, Amir Molzam Sharifloo, Mira Mezini · 2026-07-08 04:00

Mitigating Errors in LLM-Generated Web API Invocations via Retrieval-Augmented Generation and Constrained Decoding

arXiv:2607.05936v1 Announce Type: cross Abstract: Integration of web APIs is a cornerstone of modern software systems, yet writing correct web API invocation code remains challenging due to complex and evolving API specifications. Although LLMs are increasingly used for code gene…

arXiv cs.LG TIER_1 English(EN) · Mira Mezini · 2026-07-07 07:38

Mitigating Errors in LLM-Generated Web API Invocations via Retrieval-Augmented Generation and Constrained Decoding

Integration of web APIs is a cornerstone of modern software systems, yet writing correct web API invocation code remains challenging due to complex and evolving API specifications. Although LLMs are increasingly used for code generation, previous work has empirically shown that t…

arXiv cs.AI TIER_1 English(EN) · Ali Hassaan Mughal, Muhammad Bilal · 2026-07-07 04:00

LLM-Based Test Oracles: A Taxonomy of Authority Sources (A Systematic Literature Review)

arXiv:2607.05031v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly used to produce test oracles, the part of a test that decides whether observed behavior is correct. Yet a clear account of where these oracles draw their authority is missing. Prior se…

arXiv cs.AI TIER_1 English(EN) · Muhammad Bilal · 2026-07-06 13:13

LLM-Based Test Oracles: A Taxonomy of Authority Sources (A Systematic Literature Review)

Large language models (LLMs) are increasingly used to produce test oracles, the part of a test that decides whether observed behavior is correct. Yet a clear account of where these oracles draw their authority is missing. Prior secondary studies organize the area by oracle form o…

arXiv cs.LG TIER_1 English(EN) · Haotian Xie, Junlin Chen, Mingkai Zheng, Lishan Yang, Zhao Zhang · 2026-07-03 04:00

DeadPool: Hot-Swappable Elastic LLM Training via Zero-Overhead Checkpointing

arXiv:2607.01646v1 Announce Type: new Abstract: State-of-the-art large language model (LLM) training takes tens of thousands of graphics processing units (GPUs) for months and encounters failures across the software and hardware stack. Existing fault-tolerance mechanisms either i…

arXiv cs.AI TIER_1 English(EN) · Yongyi Ji, Jiaji Wang, Yi Zhou, Fuxiang Chen, Hongji Yang · 2026-07-03 04:00

An Exploratory Study of Comments in LLM-Generated Code and Codebases

arXiv:2607.01867v1 Announce Type: cross Abstract: The use of LLMs in software development has become increasingly widespread on tasks such as code generation and summarization. Reports from large technology companies showed that around 20% to 30% of their code are generated by LL…

arXiv cs.AI TIER_1 English(EN) · Christopher Ellis, Shreyas Chaudhari, Mei-Yu Wang, Leighton Barnes, Giulia Fanti, José M. F. Moura · 2026-07-03 04:00

Black-Box Inference of LLM Architectural Properties via Restricted API Access

arXiv:2607.01313v1 Announce Type: cross Abstract: In practice, most commercial LLM providers do not publicly release details of underlying LLM architectures. However, prior work has shown that given limited API access to an LLM (namely, top-$k$ logits and/or a logit bias function…

arXiv cs.AI TIER_1 English(EN) · Dekun Yang · 2026-07-03 04:00

Prompt Framing Distorts Count-Based Evaluation of LLM Error Detection: Evidence from Numeric Anchoring

arXiv:2607.01240v1 Announce Type: cross Abstract: Count-based F1 is widely used as a proxy for LLM error-detection quality, but this paper shows that it can rise dramatically without a corresponding improvement in span localization, a gap termed F1 Inflation. The paper introduces…

arXiv cs.AI TIER_1 English(EN) · Zihao Xu, Yuekang Li, Gelei Deng, Yi Liu, Zhenchang Xing · 2026-07-03 04:00

Rethinking Complexity Metrics for LLM-Integrated Applications: Beyond Source Code

arXiv:2607.01903v1 Announce Type: new Abstract: LLM-integrated applications blend natural language prompts with program code, and much of their runtime behavior originates in the prompt layer rather than in the code itself. Existing complexity metrics, however, operate solely at …

arXiv cs.AI TIER_1 English(EN) · Blair Hudson · 2026-07-03 04:00

Meta Financial Services LLM Evaluation Benchmark

arXiv:2607.01740v1 Announce Type: new Abstract: Public LLM leaderboards optimise for global average performance and do not capture the specific cognitive demands of financial-services work: a model that leads on MMLU-Pro may underperform on document-grounded compliance reasoning,…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-07-02 09:02

Rethinking Complexity Metrics for LLM-Integrated Applications: Beyond Source Code

LLM-integrated applications blend natural language prompts with program code, and much of their runtime behavior originates in the prompt layer rather than in the code itself. Existing complexity metrics, however, operate solely at the code level and therefore overlook this behav…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-07-02 08:25

An Exploratory Study of Comments in LLM-Generated Code and Codebases

The use of LLMs in software development has become increasingly widespread on tasks such as code generation and summarization. Reports from large technology companies showed that around 20% to 30% of their code are generated by LLMs. However, there remains skepticism about the pr…

arXiv cs.AI TIER_1 English(EN) · Zhao Tian, Yingquan Zhao, Chenyao Suo, Meng Wang, Junjie Chen · 2026-07-02 04:00

LLVM-Bench: Benchmarking and Advancing Large Language Models for LLVM Compiler Issue Resolution

arXiv:2607.00700v1 Announce Type: cross Abstract: LLVM is a widely used compiler infrastructure whose scale and complexity make issue resolution labor-intensive and challenging. Although large language models (LLMs) have recently achieved remarkable success in issue resolution, t…

arXiv cs.CL TIER_1 English(EN) · Xiangchen Song, Zhenhao Chen, Lingjing Kong, Shaoan Xie, Xinshuai Dong, Guangyi Chen, Kun Zhang · 2026-07-02 04:00

Beyond Perplexity: A Behavioral Evaluation Framework for Deployment Memory Claims in LLM Test-Time Training

arXiv:2607.00368v1 Announce Type: new Abstract: Large language model test-time training (TTT) is often evaluated through local proxy metrics: models are updated on recent tokens, retrieved context, target-domain data, or verifiable task attempts, and then judged by perplexity, fu…

arXiv cs.LG TIER_1 English(EN) · Tao Feng, Haozhen Zhang, Zijie Lei, Pengrui Han, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jiaxuan You · 2026-07-02 04:00

FusionFactory: Fusing LLM Capabilities with Multi-LLM Log Data

arXiv:2507.10540v3 Announce Type: replace Abstract: The rapid advancement of large language models (LLMs) has created a diverse landscape of models, each excelling at different tasks. This diversity drives researchers to employ multiple LLMs in practice, leaving behind valuable m…

arXiv cs.CL TIER_1 English(EN) · Hao Chen, Ziyu Han, Yukun Yan, Qingfu Zhu, Maosong Sun, Wanxiang Che · 2026-07-02 04:00

From Holistic Evaluation to Structured Criteria: Rubrics for the Evolving LLM Landscape

arXiv:2606.08625v2 Announce Type: replace Abstract: As Large Language Models (LLMs) advance toward open-ended autonomous agents, the mechanisms used to evaluate and guide their behavior must evolve accordingly. This work introduces the rubric as a unifying framework capturing thi…

arXiv cs.AI TIER_1 English(EN) · Junjie Chen · 2026-07-01 09:50

LLVM-Bench: Benchmarking and Advancing Large Language Models for LLVM Compiler Issue Resolution

LLVM is a widely used compiler infrastructure whose scale and complexity make issue resolution labor-intensive and challenging. Although large language models (LLMs) have recently achieved remarkable success in issue resolution, their effectiveness on complex system-level LLVM co…

arXiv cs.AI TIER_1 English(EN) · Gan Luo, Zihan Qin, Bin Dong, Wotao Yin · 2026-07-01 04:00

From Search to Synthesis: Training Large Language Models as Zero-Shot Workflow Generators

arXiv:2606.30704v1 Announce Type: cross Abstract: Large language models (LLMs) excel across a wide range of tasks, yet their instance-specific solutions often lack the structural consistency needed for reliable deployment. Workflows that encode recurring algorithmic patterns at t…

arXiv cs.AI TIER_1 English(EN) · Yuqing Yang, Qi Zhu, Zhen Han, Boran Han, Zhengyuan Shen, Shuai Wang, Vassilis N. Ioannidis, Huzefa Rangwala · 2026-07-01 04:00

When Large Language Models Read Tables Carelessly: Measuring and Reducing Data Referencing Errors

arXiv:2606.32029v1 Announce Type: cross Abstract: While large language models (LLMs) perform well on table tasks, they still make data referencing errors (DREs), i.e., incorrectly citing or omitting table values, despite understanding the table structure. Beyond final-answer accu…

arXiv cs.AI TIER_1 English(EN) · Marina Mancoridis, Zoë Hitzig · 2026-07-01 04:00

The Consistency Dilemma in Large Language Models: Generator-Evaluator Consistency and Fallibility

arXiv:2606.30653v1 Announce Type: cross Abstract: Large language models are increasingly deployed in agentic pipelines that depend on the model evaluating its own outputs without external verification. The reliability of these pipelines depends on an implicit assumption: that the…

arXiv cs.CL TIER_1 English(EN) · Kun Zhang · 2026-07-01 03:07

Beyond Perplexity: A Behavioral Evaluation Framework for Deployment Memory Claims in LLM Test-Time Training

Large language model test-time training (TTT) is often evaluated through local proxy metrics: models are updated on recent tokens, retrieved context, target-domain data, or verifiable task attempts, and then judged by perplexity, future-token loss, long-context performance, or re…

arXiv cs.AI TIER_1 English(EN) · Huzefa Rangwala · 2026-06-30 17:54

When Large Language Models Read Tables Carelessly: Measuring and Reducing Data Referencing Errors

While large language models (LLMs) perform well on table tasks, they still make data referencing errors (DREs), i.e., incorrectly citing or omitting table values, despite understanding the table structure. Beyond final-answer accuracy, DREs directly compromise the correctness and…

arXiv cs.AI TIER_1 English(EN) · Buğra Alperen Uluırmak, Rifat Kurban · 2026-06-30 04:00

EvalSafetyGap: A Hybrid Survey and Conceptual Framework for the LLM Evaluation-Safety Gap

arXiv:2606.30219v1 Announce Type: new Abstract: LLM evaluation and AI safety face a shared measurement problem: benchmark scores, reward-model signals, and reported safety metrics can improve while the latent properties they are meant to represent remain difficult to verify. This…

arXiv cs.AI TIER_1 English(EN) · Jinchao Hu, Meizhi Zhong, Kehai Chen, Min Zhang · 2026-06-30 04:00

SearchSkill: Teaching Large Language Models to Use Search Engines with an Evolving Skill Library

arXiv:2605.09038v3 Announce Type: replace Abstract: Teaching language models to use search tools is not only a question of whether they search, but also of whether they issue good queries. This is especially important in open-domain question answering, where broad or copied queri…

arXiv cs.AI TIER_1 English(EN) · Yuanhong Cai, Xiaohui Nie, Kanglin Yin, Changhua Pei, Yongqian Sun, Shenglin Zhang, Haibin Liu, Guiyang Liu, Xidao Wen, Fang Situ, Dan Pei · 2026-06-30 04:00

A Multi-Dataset Benchmark for Evaluating LLM Agents in Microservice Failure Diagnosis

arXiv:2606.29193v1 Announce Type: cross Abstract: LLM-based agents are reshaping microservice operations into AgentOps, where benchmarks are key to evaluating failure diagnosis over multimodal observability data. However, existing benchmarks remain largely outcome-oriented: they …

arXiv cs.AI TIER_1 English(EN) · Manuel Pita · 2026-06-30 04:00

The Right Code for the Wrong Reasons? Validating Large Language Models as Measurement Instruments for Theoretical Constructs

arXiv:2606.28574v1 Announce Type: cross Abstract: When a large language model (LLM) codes a construct in text as a human annotator would, that agreement makes the LLM a reliable coder. Yet reliability leaves construct validity untouched. The instrument may be theory-naive, reachi…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-30 00:00

When Large Language Models Read Tables Carelessly: Measuring and Reducing Data Referencing Errors

Large language models exhibit data referencing errors when processing tables, which can be mitigated through critic-based filtering and rejection sampling, with a lightweight 4B-parameter model achieving high detection accuracy.
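
As a rough illustration of critic-based filtering with rejection sampling, the sketch below accepts the first candidate answer that a critic does not flag. Everything in it is an assumption made for illustration: `TABLE`, `toy_critic`, and `rejection_sample` are hypothetical names, and the rule-based critic merely stands in for the paper's learned 4B-parameter critic, whose interface the summary does not describe.

```python
import re
from typing import Callable, Iterable, Optional

# Hypothetical source table (illustrative values only).
TABLE = {"Q1": 120, "Q2": 135}


def toy_critic(answer: str) -> bool:
    """Return True when every standalone number cited in the answer exists in TABLE.

    This rule-based check is a stand-in for a learned critic. It catches
    miscited values only; it does not detect omitted values.
    """
    cited = {int(n) for n in re.findall(r"\b\d+\b", answer)}
    return cited <= set(TABLE.values())


def rejection_sample(
    candidates: Iterable[str], critic: Callable[[str], bool]
) -> Optional[str]:
    """Return the first candidate the critic accepts, or None if all are rejected."""
    for answer in candidates:
        if critic(answer):
            return answer
    return None


# In practice the candidates would be sampled from the LLM; here they are fixed.
candidates = [
    "Revenue rose from 120 in Q1 to 153 in Q2.",  # 153 is not in the table
    "Revenue rose from 120 in Q1 to 135 in Q2.",
]
accepted = rejection_sample(candidates, toy_critic)
```

The first candidate transposes two digits of the Q2 value, so the critic rejects it and the second candidate is returned.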

arXiv cs.AI TIER_1 English(EN) · Rifat Kurban · 2026-06-29 12:33

EvalSafetyGap: A Hybrid Survey and Conceptual Framework for the LLM Evaluation-Safety Gap

LLM evaluation and AI safety face a shared measurement problem: benchmark scores, reward-model signals, and reported safety metrics can improve while the latent properties they are meant to represent remain difficult to verify. This paper combines a hybrid survey - a systematic s…

arXiv cs.LG TIER_1 English(EN) · Zhijian Zhou, Zesheng Ye, Zhaorun Chen, Bo Li, Feng Liu · 2026-06-29 04:00

CELEUS: Certifiable and Efficient LLM Evaluation via E-Processes

arXiv:2606.20820v2 Announce Type: replace Abstract: Can we trust evaluation scores to capture an LLM's true real-world performance? Certifiable evaluation answers this question by providing guarantee for LLM evaluation. In particular, existing methods sequentially curate evaluati…

arXiv cs.AI TIER_1 English(EN) · Carson Rodrigues, Oysturn Vas, Isaiah Abner DCosta, Nithish Kumar Prabhakaran · 2026-06-29 04:00

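When Do LLMs Pay Off for Hyperparameter Optimization? A Budget-Matched Tabular Study Finds That the Warm Start Is the Default Configuration, Not the Model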

arXiv:2606.21641v2 Announce Type: replace-cross Abstract: Large language models (LLMs) have been proposed as hyperparameter-optimization (HPO) advisors that "warm-start" search from prior knowledge, proposing strong configurations in very few evaluations. We test that claim under…

arXiv cs.AI TIER_1 English(EN) · Enhao Huang, Pengyu Sun, Shuxun Wang, Zixin Lin, Alex Chen, Kaichun Hu, Joey Ouyang, Frank Li, Zhiyu Zhang, Haobo Wang, Yiming Li, Zhan Qin, James Yi, Gang Zhao, Ziang Ling, Lowes Yang · 2026-06-29 04:00

DMind Benchmark: Toward a Holistic Assessment of LLM Capabilities across the Web3 Domain

arXiv:2504.16116v4 Announce Type: replace-cross Abstract: The Web3 ecosystem, underpinned by cryptographic primitives and decentralized consensus, represents a high-stakes environment where software vulnerabilities and incentive misalignments translate directly into financial los…

arXiv cs.CL TIER_1 English(EN) · Aaron J. Li, Hao Huang, Youngmin Park, Yitong Ma, Wei-Lin Chiang, Li Chen, Cho-Jui Hsieh, Bin Yu, Ion Stoica · 2026-06-26 04:00

DualEval: Joint Model-Item Calibration for Unified LLM Evaluation

arXiv:2606.26429v1 Announce Type: cross Abstract: Current LLM evaluation relies on two complementary but often disconnected signals: static benchmarks with objective correctness labels and arena-style preference data that better reflect open-ended user interactions. We introduce …

arXiv cs.LG TIER_1 English(EN) · Hiroki Tamba · 2026-06-26 04:00

Necessary but Not Sufficient: Temperature Control and Reproducibility in LLM-as-Judge Safety Evaluation

arXiv:2606.26185v1 Announce Type: new Abstract: LLM-as-judge ("grader") components are now standard in evaluation harnesses, including safety evaluations where a pass/fail verdict may gate downstream deployment decisions. A widespread assumption is that setting the grader's sampl…

arXiv cs.CL TIER_1 English(EN) · Yow-Fu Liou, Yu-Chien Tang, Yu-Hsiang Liu, An-Zi Yen · 2026-06-26 04:00

OI-Bench: An Option Injection Benchmark for Evaluating LLM Susceptibility to Directive Interference

arXiv:2601.13300v2 Announce Type: replace Abstract: Benchmarking large language models (LLMs) is critical for understanding their capabilities, limitations, and robustness. In addition to interface artifacts, prior studies have shown that LLM decisions can be influenced by direct…

arXiv cs.AI TIER_1 English(EN) · Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu, Bing Zhu, Peiyang He · 2026-06-26 04:00

Library Drift: Diagnosing and Repairing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

arXiv:2605.19576v2 Announce Type: replace Abstract: Self-evolving skill libraries face a silent failure mode we term library drift: unbounded skill accumulation without outcome-driven lifecycle management causes retrieval degradation, false-positive injections, and perform…

arXiv cs.AI TIER_1 English(EN) · Wen Fan, Minh Tran, Sanya Dod, Xin Hu, Marilyn Rego, Danning Xie, Jenna DiVincenzo, Lin Tan · 2026-06-26 04:00

An Empirical Study of LLM-Generated VeriFast Specifications

arXiv:2606.26490v1 Announce Type: cross Abstract: Static verification tools can assure industrial scale software, but require significant human labor to write specifications. This is particularly true of static verifiers based on separation logic (SL verifiers), which excel at ve…

arXiv cs.CL TIER_1 English(EN) · Chang-Chieh Huang, Yan-Lun Chen, Chia-Mu Yu, Wei-Bin Lee · 2026-06-25 04:00

RAS: Measuring LLM Safety via Refusal Alignment

arXiv:2606.25750v1 Announce Type: cross Abstract: Safety evaluation of large language models (LLMs) is commonly performed by querying models with unsafe or jailbreak prompts and judging whether their outputs violate a safety policy. Although useful, output-level evaluation is exp…

arXiv cs.LG TIER_1 English(EN) · Sagnik Anupam, Alexander Shypula, Osbert Bastani · 2026-06-25 04:00

LLM Program Optimization via Retrieval Augmented Search

arXiv:2501.18916v2 Announce Type: replace Abstract: Recent work has demonstrated the potential of large language models (LLMs) for program optimization, a key challenge in programming languages. We propose a blackbox adaptation method called Retrieval Augmented Search (RAS) that …

arXiv cs.CL TIER_1 English(EN) · Ezgi Sarıkayak, Wenchao Gu, Hesham Ghonim, Chunyang Chen · 2026-06-25 04:00

Evaluating LLMs on Real-World Software Performance Optimization

arXiv:2606.25530v1 Announce Type: cross Abstract: Software performance optimization is a notoriously complex and manual task. Despite the growing use of Large Language Models (LLMs) for code refinement, we still lack benchmarks that capture how optimization actually happens in re…

arXiv cs.CL TIER_1 English(EN) · Fangzheng Li, Aimin Zhang, Chen Lv · 2026-06-25 04:00

The Constraint Tax in Open-Weight LLMs: An Empirical Study of Tool-Calling Suppression under Structured Output Constraints

arXiv:2606.25605v1 Announce Type: new Abstract: Tool Calling and Structured Output are two core capabilities of modern Agent systems, yet their interaction under joint deployment conditions remains insufficiently understood. This paper reports a reproducible phenomenon observed i…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-24 22:40

DualEval: Joint Model-Item Calibration for Unified LLM Evaluation

Current LLM evaluation relies on two complementary but often disconnected signals: static benchmarks with objective correctness labels and arena-style preference data that better reflect open-ended user interactions. We introduce DualEval, a latent model-item calibration framewor…

arXiv cs.CL TIER_1 English(EN) · Ion Stoica · 2026-06-24 22:40

DualEval: Joint Model-Item Calibration for Unified LLM Evaluation

Current LLM evaluation relies on two complementary but often disconnected signals: static benchmarks with objective correctness labels and arena-style preference data that better reflect open-ended user interactions. We introduce DualEval, a latent model-item calibration framewor…

arXiv cs.CL TIER_1 English(EN) · Wei-Bin Lee · 2026-06-24 12:19

RAS: Measuring LLM Safety via Refusal Alignment

Safety evaluation of large language models (LLMs) is commonly performed by querying models with unsafe or jailbreak prompts and judging whether their outputs violate a safety policy. Although useful, output-level evaluation is expensive, sensitive to judge choice, and easily tied…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-24 09:14

The Constraint Tax in Open-Weight LLMs: An Empirical Study of Tool-Calling Suppression under Structured Output Constraints

Tool Calling and Structured Output are two core capabilities of modern Agent systems, yet their interaction under joint deployment conditions remains insufficiently understood. This paper reports a reproducible phenomenon observed in a production Agent system: when Tool Calling a…

arXiv cs.CL TIER_1 English(EN) · Chen Lv · 2026-06-24 09:14

The Constraint Tax in Open-Weight LLMs: An Empirical Study of Tool-Calling Suppression under Structured Output Constraints

Tool Calling and Structured Output are two core capabilities of modern Agent systems, yet their interaction under joint deployment conditions remains insufficiently understood. This paper reports a reproducible phenomenon observed in a production Agent system: when Tool Calling a…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-24 08:07

Evaluating LLMs on Real-World Software Performance Optimization

Software performance optimization is a notoriously complex and manual task. Despite the growing use of Large Language Models (LLMs) for code refinement, we still lack benchmarks that capture how optimization actually happens in real-world codebases. Existing frameworks often over…

arXiv cs.AI TIER_1 English(EN) · Chunyang Chen · 2026-06-24 08:07

Evaluating LLMs on Real-World Software Performance Optimization

Software performance optimization is a notoriously complex and manual task. Despite the growing use of Large Language Models (LLMs) for code refinement, we still lack benchmarks that capture how optimization actually happens in real-world codebases. Existing frameworks often over…

arXiv cs.AI TIER_1 (AF) · Yihan Wang, Cheng Liu, Jiazheng Zhang, Lei Zhang, Long Cheng, Xiaowei Li, Huawei Li · 2026-06-24 04:00

VeriPilot: An LLM-Driven Verilog Debugging Framework

arXiv:2606.23759v1 Announce Type: cross Abstract: Verilog debugging remains one of the most time-consuming stages in digital circuit design. Recent advances in Large Language Models (LLMs) have enabled automated debugging; however, most existing approaches rely solely on test out…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-24 00:00

The Constraint Tax in Open-Weight LLMs: An Empirical Study of Tool-Calling Suppression under Structured Output Constraints

Tool Suppression occurs when JSON Schema constraints and tool calling are jointly enabled, preventing open-weight models from invoking tools despite maintaining schema compliance, with the issue stemming from grammar-based token masking that makes tool-call tokens unreachable dur…
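
The mechanism named in this summary, grammar-based token masking, can be sketched on a toy vocabulary. The vocabulary, the logit values, and the helper names `apply_grammar_mask` and `greedy_pick` are all hypothetical; real constrained-decoding engines derive the allowed set from a compiled JSON Schema grammar at every step, which is not reproduced here.

```python
import math
from typing import List, Set

# Toy vocabulary; "<tool_call>" stands in for whatever token opens a tool call.
VOCAB = ["{", "}", '"answer"', ":", "<tool_call>"]
TOOL_CALL_ID = VOCAB.index("<tool_call>")


def apply_grammar_mask(logits: List[float], allowed_ids: Set[int]) -> List[float]:
    """Set every logit outside the grammar's allowed set to negative infinity."""
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]


def greedy_pick(logits: List[float]) -> int:
    """Return the index of the highest logit (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Assumed raw logits: the model strongly prefers to open a tool call.
raw_logits = [1.0, 0.1, 0.5, 0.2, 5.0]

# Assumed grammar state: a JSON object must begin with "{", so only that token is allowed.
allowed_at_start = {VOCAB.index("{")}
masked_logits = apply_grammar_mask(raw_logits, allowed_at_start)

unconstrained_choice = VOCAB[greedy_pick(raw_logits)]
constrained_choice = VOCAB[greedy_pick(masked_logits)]
```

Under these assumptions the unconstrained decoder emits the tool-call token, while the masked decoder can only emit the opening brace: the output stays schema-compliant, but the tool call is unreachable regardless of how much probability the model assigns to it.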

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-23 00:00

LLM Program Optimization via Retrieval Augmented Search

Blackbox adaptation methods using retrieval-augmented search and atomic edit decomposition improve program optimization performance for both C++ and Python code.

arXiv cs.CL TIER_1 English(EN) · Guanhua Chen · 2026-06-22 07:57

StatABench: A Benchmark Dataset and Framework for Evaluating the Statistical Analysis Capabilities of LLMs

Statistical analysis is a broad, complex field requiring both domain knowledge and tool proficiency. While prior work has evaluated large language models (LLMs) in this domain, existing benchmarks remain limited in scope and format. To bridge this gap, we introduce StatABench (St…

arXiv cs.AI TIER_1 English(EN) · Mehwish Fatima · 2026-06-21 12:30

PRIME: Evaluating Prompt Resolution in Large Language Models under Incompatible Instructions

Large language models (LLMs) often encounter conflicting prompts, although current instruction following benchmarks assess those meta-instructions in isolation, limiting the insights about how models process conflicting instructions. We introduce a framework \textit{PRIME}(\texti…

arXiv cs.LG TIER_1 English(EN) · Nils Loose, Jonas Sander, Felix Mächtle, Thomas Eisenbarth · 2026-06-19 04:00

FloatDoor: Platform-Triggered Backdoors in LLMs

arXiv:2606.19535v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly deployed in sensitive settings such as software engineering, where their outputs directly shape downstream artifacts. Recent work has shown that an identical model can produce measurab…

arXiv cs.AI TIER_1 English(EN) · Haotian Xu, Zeyang Zhang, Linbao Li, Huadi Zheng, Yu Li, Cheng Zhuo · 2026-06-19 04:00

SafeSpec: Fast and Safe LLMs via Dynamic Reflective Sampling

arXiv:2606.19755v1 Announce Type: cross Abstract: Speculative inference accelerates large language model (LLM) decoding but provides no inherent safety guarantees. Existing safety defenses are largely incompatible with speculative inference: they either introduce additional compu…

arXiv cs.AI TIER_1 English(EN) · Arastoo Zibaeirad, Marco Vieira · 2026-06-19 04:00

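Calibration, Not Understanding: Diagnosing the Limits of Fine-Tuned Large Language Models in System Software Vulnerability Detection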

arXiv:2606.20502v1 Announce Type: cross Abstract: Whether LLMs scoring well on vulnerability benchmarks genuinely reason about security or merely pattern-match on contaminated data remains unresolved. We present CWE-Trace, a framework for LLM vulnerability detection built from 83…

arXiv cs.CL TIER_1 English(EN) · Miloš Nikolić, Ali Hadi Zadeh, Enrique Torres Sanchez, Andreas Moshovos · 2026-06-19 04:00

位移而非方向：评估量化大模型部署的保真度指标

arXiv:2606.19558v1 Announce Type: cross Abstract: Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference, are often used in practice as low-cost proxies for benchmark quality. We test this practice on a 28-quant cohort of Qwen3.6-35B-A3B and a …

arXiv cs.AI TIER_1 English(EN) · Marco Vieira · 2026-06-18 17:19

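Calibration, Not Understanding: Diagnosing the Limits of Fine-Tuned Large Language Models in Systems Software Vulnerability Detection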

Whether LLMs scoring well on vulnerability benchmarks genuinely reason about security or merely pattern-match on contaminated data remains unresolved. We present CWE-Trace, a framework for LLM vulnerability detection built from 834 manually curated Linux kernel samples spanning 7…

arXiv cs.CL TIER_1 English(EN) · Andreas Moshovos · 2026-06-17 19:59

Displacement, Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment

Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference, are often used in practice as low-cost proxies for benchmark quality. We test this practice on a 28-quant cohort of Qwen3.6-35B-A3B and a 41-quant cohort of Devstral-Small-2-24B, evaluated…
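
For readers unfamiliar with the metric, per-token KL divergence is commonly written as below. This is the textbook form, not necessarily the exact averaging the paper uses, which the excerpt does not show. The symbols are notation assumed here: $p_{\text{ref}}$ is the high-precision reference model, $p_q$ the quantized model, $V$ the vocabulary, and $T$ the number of evaluated token positions.

```latex
\[
\mathrm{KLD}_t = \sum_{v \in V} p_{\text{ref}}(v \mid x_{1:t-1}) \,
  \log \frac{p_{\text{ref}}(v \mid x_{1:t-1})}{p_q(v \mid x_{1:t-1})},
\qquad
\overline{\mathrm{KLD}} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{KLD}_t
\]
```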

arXiv cs.AI TIER_1 English(EN) · Libin Qiu, Yuhang Ye, Zhirong Gao, Xide Zou, Junfu Chen, Ziming Gui, Weizhi Huang, Xiaobo Xue, Wenkai Qiu, Kun Zhao · 2026-06-17 04:00

Blueprint First, Model Second: A Framework for Deterministic LLM Workflow

arXiv:2508.02721v2 Announce Type: replace-cross Abstract: While powerful, the inherent non-determinism of large language model (LLM) agents limits their application in structured operational environments where procedural fidelity and predictable execution are strict requirements.…

arXiv cs.AI TIER_1 English(EN) · Hankyul Baek, Jaewon Noh, Sang Seo, Yongsu Kim, Gabriel Waikin Loh Matienzo, Young Il Kim, Ee Wei Seah, Akriti Vij · 2026-06-17 04:00

Assessing Data Leakage Risks of Tool-Using LLM Agents in Realistic Scenarios

arXiv:2606.17114v1 Announce Type: cross Abstract: AI agents are increasingly being adopted in enterprise and personal settings with access to emails, databases, documents, and other tools where they can read, update, and disseminate sensitive information. Much of prior research o…

Alignment Forum TIER_1 English(EN) · Tomek Korbak · 2026-06-16 19:55

Predicting LLM Safety Before Release by Simulating Deployment

Paper link: https://cdn.openai.com/pdf/predicting-llm-safety-before-release-by-simulating-deployment.pdf. Before releasing a new model, labs need to understand not just what it can do, but how it is likely to behave in real-world use, inclu…

arXiv cs.LG TIER_1 English(EN) · Yiwei Chen, Lichi Li, Kai Cheung, Vinny Parla, Ganesh Sundaram · 2026-06-16 04:00

A Data-Centric Benchmark for LLM Exploit Generation: Understanding the Impact of Fine-Tuning

arXiv:2606.15123v1 Announce Type: cross Abstract: We study the task of CVE-conditioned exploit generation, where a model drafts proof-of-concept (PoC) exploits given software vulnerability context. We adopt a data-centric approach, constructing a high-quality dataset via multi-st…

arXiv cs.AI TIER_1 English(EN) · Yan Wang, Xinyi Hou, Junjun Si, Yanjie Zhao, Weiguo Lin, Haoyu Wang · 2026-06-11 04:00

LaQual: An Automated Framework for LLM App Quality Evaluation

arXiv:2508.18636v2 Announce Type: replace-cross Abstract: Representing a new paradigm in software distribution, LLM app stores are rapidly emerging, offering users diverse choices for content generation, coding assistance, education, and more. However, current ranking and recomme…

arXiv cs.AI TIER_1 English(EN) · Daniel Commey · 2026-06-11 04:00

When Generic Prompt Improvements Backfire: Evaluation-Driven Iteration for LLM Applications

arXiv:2601.22025v2 Announce Type: replace-cross Abstract: Evaluating Large Language Model (LLM) applications differs from conventional software testing because outputs are probabilistic, semantically variable, and sensitive to prompt and model changes. This technical report propo…

arXiv cs.AI TIER_1 English(EN) · Ikbel Ghrab, Mohamed Dhieb, Ismail Khenissi, Ines Abdeljaoued-Tej · 2026-06-10 04:00

LLM-Based Code Documentation Generation with Multi-Judge Evaluation

arXiv:2606.09852v1 Announce Type: cross Abstract: High-quality source code documentation is vital yet often neglected, especially in critical domains like healthcare where reliability and maintainability are essential. We presented an AI powered framework that automates documenta…

arXiv cs.AI TIER_1 English(EN) · Sayed Erfan Arefin · 2026-06-09 04:00

Beyond Pass Rates: A Cross-Language, Execution-Based Evaluation of Open-Source Code LLMs

arXiv:2606.08840v1 Announce Type: new Abstract: Code generation models are typically compared using compact execution benchmarks and aggregate pass rates, but such summaries obscure how performance varies across programming languages, problem families, and failure modes. We prese…

arXiv cs.AI TIER_1 English(EN) · Alex Thillen, Niels Mündler, Veselin Raychev, Martin Vechev · 2026-06-09 04:00

CodeTaste: Can Large Language Models Generate Human-Level Code Refactorings?

arXiv:2603.04177v2 Announce Type: replace-cross Abstract: LLM coding agents can generate working code, but their solutions often accumulate complexity, duplication, and architectural debt. Human developers address such issues through refactoring: behavior-preserving program trans…

arXiv cs.AI TIER_1 English(EN) · Le Chen, Nuo Xu, Winson Chen, Bin Lei, Pei-Hung Lin, Dunzhi Zhou, Rajeev Thakur, Caiwen Ding, Ali Jannesari, Chunhua Liao · 2026-06-06 04:00

Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation

arXiv:2512.03086v2 Announce Type: replace-cross Abstract: Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-qual…

arXiv cs.AI TIER_1 English(EN) · Huifan Gao, Liuhua He, Yinghui Pan, Shenbao Yu, Yifeng Zeng, Shengchao Qin, Weidi Sun · 2026-06-06 04:00

Improving the Readability of LLM-Generated Code via Multi-Task Representation Engineering

arXiv:2606.06214v1 Announce Type: cross Abstract: Correctness and readability are key measures of code quality, respectively ensuring functional fidelity and ease of comprehension. While most existing research focuses on improving the correctness of large language models (LLMs) g…

arXiv cs.AI TIER_1 English(EN) · Weidi Sun · 2026-06-04 14:24

Improving the Readability of LLM-Generated Code via Multi-Task Representation Engineering

Correctness and readability are key measures of code quality, respectively ensuring functional fidelity and ease of comprehension. While most existing research focuses on improving the correctness of large language models (LLMs) generated codes, readability remains under-addresse…

arXiv cs.AI TIER_1 English(EN) · Jie Li, Wenzhao Wu, Junqi Hu, Qinrui Zheng, Bowen Wu, Juepeng Zheng, Yutong Lu, Haohuan Fu · 2026-06-04 04:00

CodegenBench: Can Large Language Models Write Efficient Code Across Architectures?

arXiv:2606.04023v1 Announce Type: cross Abstract: While large language models (LLMs) have been extensively evaluated on code generation tasks for general-purpose programming and GPU-accelerated environments (e.g., PyTorch, CUDA), their capabilities in CPU-oriented high-performanc…

LessWrong (AI tag) TIER_1 English(EN) · Tomek Korbak · 2026-06-16 19:55

Predicting LLM Safety Before Release by Simulating Deployment

Paper link: https://cdn.openai.com/pdf/predicting-llm-safety-before-release-by-simulating-deployment.pdf. Before releasing a new model, labs need to understand not just what it can do, but how it is likely to behave in real-world use, inclu…

Eugene Yan TIER_1 English(EN) · 2026-05-27 00:00

Securing Source Code with LLMs

Build a threat model, discover vulnerabilities, verify, triage, and patch.

Hacker News — AI stories ≥50 points TIER_1 English(EN) · kkm · 2026-06-26 21:14

The Gap Between Open-Source and Closed-Source LLMs

HN — claude cli stories TIER_1 English(EN) · yolo-auto · 2026-07-06 01:22

Show HN: Unmetered LLM API – $6/month, no token tracking, no limits

Medium — MLOps tag TIER_1 English(EN) · strawhacks · 2026-07-14 10:15

Your LLM Works… But Can You Trust It? Inside MLflow for GenAI

https://medium.com/@strawhacks/your-llm-works-but-can-you-trust-it-inside-mlflow-for-genai-69fcfc75fd1f?source=rss------mlops-5

Towards AI TIER_1 English(EN) · Rizwanhoda · 2026-07-14 03:02

Observability for LLM Applications: What to Log, What to Monitor, and Why

https://pub.towardsai.net/observability-for-llm-applications-what-to-log-what-to-monitor-and-why-c10ea2e9c2f5?source=rss----98111c9905da---4

Medium — fine-tuning tag TIER_1 English(EN) · Md. Abdullah Al Mamun Emon · 2026-07-12 12:42

LLM Finetuning for Dummies, Part 3: Data Is Everything (and Most Tutorials Ignore It)

https://emon4075.medium.com/llm-finetuning-for-dummies-part-3-data-is-everything-and-most-tutorials-ignore-it-736fe6b7a0c4?source=rss------fine_tuning-5

Towards AI TIER_1 English(EN) · Manash Pratim, PhD · 2026-07-11 13:01

The Brutal Reality of Coding LLMs in July 2026: The Data-Driven Benchmarks

https://pub.towardsai.net/the-brutal-reality-of-coding-llms-in-july-2026-the-data-driven-benchmarks-63439d730146?source=rss----98111c9905da---4

Towards AI TIER_1 English(EN) · Harish Ramkumar · 2026-07-08 14:01

A Beginner's Guide to Amazon Bedrock: Your First LLM App Without the Overwhelm

https://pub.towardsai.net/a-beginners-guide-to-amazon-bedrock-your-first-llm-app-without-the-overwhelm-51fcddef6a1e?source=rss----98111c9905da---4

dev.to — MCP tag TIER_1 English(EN) · Himanshu Agarwal · 2026-07-07 08:27

Testing LLM Applications

Complete Enterprise Guide to Validating Large Language Model Applications (2026 Edition). Recommended Learning Path: If you're serious about becoming an AI Test Engineer, SDET, or GenAI Architect, get the complete GenAI Testing Master…

Towards AI TIER_1 English(EN) · Garvit Agarwal · 2026-07-07 05:56

LLM Tokens Explained: Cost, Memory, Speed, and Context Window

We see “Token Limit Exceeded.” Now let's learn what tokens actually are, why different LLMs count them differently, and how they impact our AI costs, speed, and context window. “Token Limit Exceeded.” We’ve all encountered this er…

Towards AI TIER_1 English(EN) · Gaurav Bhardwaj · 2026-07-06 05:25

LLM-as-a-Judge: A Complete Guide to Automated Evaluation at Scale with Azure

Figure: The LLM Judge Stack. Introduction: Why We Need Automated Judges. Every day, AI systems generate billions of outputs: chatbot responses, cod…

dev.to — MCP tag TIER_1 English(EN) · Shaiju Edakulangara · 2026-07-05 03:48

NodeLLM 1.17: MCP Sampling, Concurrent Tool Execution, and Smarter ORM Control

Back when we introduced MCP support (https://dev.to/blog/nodellm-mcp-integration), we ended on a teaser: Phase 3 would tackle Sampling, letting servers request completions from the host instead of only exposing tools and resources to it. NodeLLM 1.1…

dev.to — MCP tag TIER_1 English(EN) · Ameer Hamza · 2026-07-04 19:37

An Introduction to LLMs for Backend Engineers

Introduction. If you have built APIs, databases, and distributed systems, you already have the mindset needed for AI engineering. The missing piece is a clear mental model of what a Large Language Model (LLM) actually is. An LLM is not a search engine with bet…

Medium — MLOps tag TIER_1 English(EN) · Varun Rajput · 2026-07-04 09:42

LLM Benchmarking for Internal Hosting: How to Pick the Right Model

The model selection and cost-quality analysis that MLOps engineers actually do. https://medium.com/@thevarunfreelance/llm-benchmarking-for-internal-hosting-how-to-pick-the-right-mo…

Medium — MCP tag TIER_1 English(EN) · Teresa Qin · 2026-07-02 11:42

Designing MCP Tools for LLMs: Stop Building Traditional APIs for Probabilistic Clients

https://medium.com/@tereschin/designing-mcp-tools-for-llms-stop-building-traditional-apis-for-probabilistic-clients-0bb5b18c84f8?source=rss------mcp-5

Towards AI TIER_1 English(EN) · George Stavrakis · 2026-07-01 20:31

End-to-End LLM Observability, Evaluation, and Monitoring with LangSmith

https://pub.towardsai.net/end-to-end-llm-observability-evaluation-and-monitoring-with-langsmith-c34f921d1c9b?source=rss----98111c9905da---4

Medium — fine-tuning tag TIER_1 English(EN) · Lithika · 2026-07-01 19:45

Beyond Bigger Models: How to Rescue Failing LLM Applications

https://medium.com/@lithikanov9/beyond-bigger-models-how-to-rescue-failing-llm-applications-56db420bd053?source=rss------fine_tuning-5

Medium — fine-tuning tag TIER_1 English(EN) · Sreekanth · 2026-06-30 16:28

Understanding LLM Fine-Tuning Step by Step

What is a Pre-trained (Base) Model? https://medium.com/@sreekanthsreekanth970/understanding-llm-fine-tuning-step-by-step-209913c2032c?source=rss------fine_tuning-5

Towards AI TIER_1 English(EN) · Srini Dwarakanathan · 2026-06-30 15:31

Operational Readiness for LLM Services: Same Primitives, Different Defaults

Figure: Diagram comparing classical and LLM service operational readiness. Classical: error rate, CPU, p99 latency. LLM: inter-token latency, cache miss rate, cost. Shows user → load balancer → worker pool → DB/store.

Medium — Claude tag TIER_1 English(EN) · Gowtam Singulur · 2026-06-29 17:39

LLM Benchmarks Explained Like You're Five (but with Code)

https://gowtamsingulur.medium.com/llm-benchmarks-explained-like-youre-five-but-with-code-a2b451397912?source=rss------claude-5

Medium — MLOps tag TIER_1 English(EN) · Saurabh Maurya · 2026-06-29 06:21

LoRA vs QLoRA: A Guide to LLM Fine-Tuning

https://medium.com/@saurabh11.maurya/lora-vs-qlora-a-guide-to-llm-fine-tuning-a4191502b675?source=rss------mlops-5

Medium — fine-tuning tag TIER_1 Türkçe(TR) · Kubilay Malçok · 2026-06-24 08:17

A Quick Start to LLM Fine-Tuning: Understanding LoRA, PEFT, and QLoRA (with a Google Colab Notebook)

https://medium.com/@kmalcok1/llm-fine-tuninge-h%C4%B1zl%C4%B1-ba%C5%9Flang%C4%B1%C3%A7-lora-peft-ve-qlora-y%C4%B1-anlamak-google-colab-notebook-u-ile-b8261c4ce71b?source=rss------fine_tuning-5

Medium — fine-tuning tag TIER_1 English(EN) · Tanvir Khan · 2026-06-23 06:18

Fine-Tuning LLMs from Scratch to Deployment: A Complete Hands-On Guide

https://aeontanvir.medium.com/fine-tuning-llms-from-scratch-to-deployment-a-complete-hands-on-guide-24f08181d34a?source=rss------fine_tuning-5

Medium — MLOps tag TIER_1 English(EN) · Delight Olaoluwa · 2026-06-22 12:47

Fine-Tuning LLMs on Amazon SageMaker: A Guide Through LitGPT, TRL, PEFT, and the Deployment Maze

https://medium.com/@delightolaoluwa/fine-tuning-llms-on-amazon-sagemaker-a-guide-through-litgpt-trl-peft-and-the-deployment-maze-5eb685de7160?source=rss------mlops-5

Medium — Claude tag TIER_1 English(EN) · aashuu ✦ · 2026-06-22 12:17

How to Build Your Own LLM from Scratch in 5 Stages: The Exact Pipeline Behind GPT and Claude

https://medium.com/@warrioraashuu/how-to-build-your-own-llm-from-scratch-in-5-stages-exact-pipeline-behind-gpt-and-claude-e670b7ea0ce1?source=rss------claude-5

Medium — MLOps tag TIER_1 English(EN) · RAHUL SARKAR · 2026-06-21 09:16

Beyond the Model: How vLLM Powers Enterprise-Scale LLM Serving

https://medium.com/@rahulsarkar906/beyond-the-model-how-vllm-powers-enterprise-scale-llm-serving-0eb3b08a21d3?source=rss------mlops-5

Medium — MLOps tag TIER_1 English(EN) · Jasmine Park · 2026-06-19 09:30

Langfuse Alternatives: 6 LLM Observability Tools, Ranked by What Hurts in Month Eight

They all trace your LLM calls. The difference that matters later is whether the traces are yours (OpenTelemetry) or theirs (proprietary). https://medium.com/@jasmine.park_60464/la…

Medium — Claude tag TIER_1 English(EN) · John Chiwai · 2026-06-18 20:01

How to Build Error Recovery Patterns for Production-Ready LLMs

https://medium.com/@chiwai.kiriba/how-to-build-error-recovery-patterns-for-production-ready-llms-2abb2e4262be?source=rss------claude-5

Medium — MLOps tag TIER_1 English(EN) · Ethan Walker · 2026-06-18 19:53

The Open-Source LLM Eval Frameworks I Actually Compared, and the One Question That Separates Them

“Eval framework” covers app-output graders, RAG-specific scorers, and academic benchmark harnesses. They are not substitutes. Pick by what… https://medium.com…

Medium — Claude tag TIER_1 English(EN) · Nichetraffickit · 2026-06-18 05:36

How to Build Your Own LLM from Scratch (The 5-Stage Pipeline Behind GPT and Claude)

https://medium.com/@nichetraffickit/how-to-build-your-own-llm-from-scratch-the-5-stage-pipeline-behind-gpt-and-claude-21c0dbcbde26?source=rss------claude-5

Medium — Claude tag TIER_1 Español(ES) · Michel Alan López · 2026-06-17 20:31

Integrating LLMs with Security and Control

https://medium.com/@ingalopez11/%EF%B8%8F-integrando-llms-con-seguridad-y-control-%EF%B8%8F-56bcb2175e9e?source=rss------claude-5

Medium — MLOps tag TIER_1 English(EN) · Arun Kumar Singh · 2026-06-17 08:00

Navigating the LLM Deployment Tree: Selecting Models, Formats, and Frameworks for Local and Server…

https://arunksingh16.medium.com/navigating-the-llm-deployment-tree-selecting-models-formats-and-frameworks-for-local-and-server-757d0640d7a3?source=rss------mlops-5

Medium — MLOps tag TIER_1 English(EN) · Building the Future with Agentic AI & ML · 2026-06-17 00:52

LLM Cost Optimization: What AI Engineers Must Know Before They Design

https://medium.com/@tpriya27/llm-cost-optimization-what-ai-engineers-must-know-before-they-design-008a9f97b5cf?source=rss------mlops-5

Towards AI TIER_1 English(EN) · Rizwanhoda · 2026-06-16 08:00

The Prompt Cache Is Not Enough: Building a Full LLM Cost Optimization Strategy

https://pub.towardsai.net/the-prompt-cache-is-not-enough-building-a-full-llm-cost-optimization-strategy-a9c1992a0d7c?source=rss----98111c9905da---4

Medium — MLOps tag TIER_1 English(EN) · Ethan Walker · 2026-06-15 15:23

How We Wired LLM Evals into CI: The 6 Tools I Compared and the Stack That Stuck

https://medium.com/@ethan-writes-AI/how-we-wired-llm-evals-into-ci-the-6-tools-i-compared-and-the-stack-that-stuck-aa0af26ea5d7?source=rss------mlops-5

Medium — fine-tuning tag TIER_1 English(EN) · Hari Prakash Natarajan · 2026-06-13 20:13

Fine-Tune LLMs Locally Without Writing a Single Line of Code: A Deep Dive into Unsloth Studio

https://medium.com/@techofhp/fine-tune-llms-locally-without-writing-a-single-line-of-code-a-deep-dive-into-unsloth-studio-b4cb0350e172?source=rss------fine_tuning-5

Medium — MLOps tag TIER_1 English(EN) · Siddhartha Pramanik · 2026-06-13 13:44

Building an Evaluation Harness for Comparing Open-Source LLMs

Published on AI Mind: https://pub.aimind.so/building-an-evaluation-harness-for-comparing-open-source-llms-33473e3fe0cf?source=rss------mlops-5

Medium — MLOps tag TIER_1 English(EN) · Ted Park · 2026-06-12 21:36

A Small RAG Evaluation Harness for Production-Oriented LLM Systems

Many RAG demos look useful in a short demo. https://itstedpark.medium.com/a-small-rag-evaluation-harness-for-production-oriented-llm-systems-5df924426141?source=rss------mlops-5

Medium — MLOps tag TIER_1 English(EN) · Siddhartha Pramanik · 2026-06-11 11:44

Building an Evaluation Harness for Comparing Open-Source LLMs

https://medium.com/codetodeploy/building-an-evaluation-harness-for-comparing-open-source-llms-de3e55afe5b5?source=rss------mlops-5

dev.to — LLM tag TIER_1 English(EN) · Nazar Boyko · 2026-07-16 01:54

LLM Evals For Developer Tools: Useful, Correct, Safe

Someone on your team built an LLM feature. Maybe it's an inline code-suggest. Maybe it's a "fix this PR comment" button. Maybe it's a full agent that opens pull requests on its own. The demo worked. The screenshots were good. You shipped it. Now a real user gives it a r…

dev.to — LLM tag TIER_1 English(EN) · Learn AI Resource · 2026-07-15 15:01

Local LLMs for Development: Speed, Privacy, and Zero API Bills

Your team just shipped a feature. Great. Now you're waiting 3 seconds for Claude to respond... again. The API bills are climbing. And someone inevitably asks: "Wait, what data are we actually sending to OpenAI?" Yeah. Running local LLMs isn't just hype anymore. It's the…

dev.to — LLM tag TIER_1 English(EN) · Reno Lu · 2026-07-15 13:05

Prompt Injection Is Structural: 16 Checks to Harden LLM Applications

If your application passes untrusted text to a language model and then acts on the output, prompt injection is the threat you cannot fully eliminate at the model layer, only contain at the system layer. OWASP lists it as the top risk for LLM applications. Unlike SQL in…

dev.to — LLM tag TIER_1 English(EN) · GWEN · 2026-07-15 10:17

Practical Multi-Model API Integration: Designing a Switchable, Observable, Rollback-Friendly LLM Layer

When teams integrate large language models, the first step is usually connecting to a model’s API. Once the code runs and returns responses, integration is often considered complete. In production, however, the real problems begin: The same Prompt produces …

dev.to — LLM tag TIER_1 English(EN) · Kuldeep Paul · 2026-07-14 20:54

5 Failure Modes of Enterprise LLM Deployments (and How to Fix Them)

dev.to — LLM tag TIER_1 English(EN) · Kuldeep Paul · 2026-07-14 20:46

Benchmarking LLM Gateways: Latency, Throughput, and Overhead

dev.to — LLM tag TIER_1 English(EN) · Kuldeep Paul · 2026-07-14 20:35

Rate Limits, Retries, and Circuit Breakers: Making LLM Calls Resilient

dev.to — LLM tag TIER_1 English(EN) · Emre Yilmaz · 2026-07-14 14:42

Load Balancing Across LLM Providers: A Practical Playbook

dev.to — LLM tag TIER_1 English(EN) · Frank · 2026-07-14 09:00

Unlocking the Potential of LLM Apps: A Developer's Perspective

As a developer who's been following the advancements in Large Language Models (LLMs), I was excited to come across the awesome-llm-apps repository on GitHub. This collection of 100+ AI agent and Retrieval-Augmented Generation (RAG) apps is a game-changer for developers like me…

dev.to — LLM tag TIER_1 English(EN) · Aamer Mihaysi · 2026-07-13 14:37

Llamafile vs vLLM: Two Ways to Serve a Local Model, and When Each Fits

I spent last weekend comparing two ways to serve a local model: Llamafile and the more traditional vLLM + Docker setup I've been running for months. Same model (Qwen2.5-7B-Instruct), same hardware (a single RTX 4090), same test queries. The gap between them is smaller than I e…

dev.to — LLM tag TIER_1 English(EN) · vectronodeAPI · 2026-07-13 08:28

Designing Task-Level Capability Budgets for LLM API Calls

Many AI applications begin with one model and one API call. That is a reasonable prototype, but it creates a fragile production contract: product behavior becomes tied to a model name instead of a task requirement. A better contract starts with the workload. Define…

dev.to — LLM tag TIER_1 English(EN) · Himanshu Agarwal · 2026-07-13 06:49

The Complete Lifecycle of a Production LLM System: Build, Test, Debug, Deploy

A quick note before we start: everything below (the patterns, the code, the debugging method, the deployment checklist) is the condensed, field-tested version of what's in https://himanshuai.gumroad.com/l/The-Enterprise-LLM-Engi…

r/MachineLearning TIER_1 English(EN) · /u/No_Caregiver_2922 · 2026-07-12 07:58

Devs building with LLMs, how are you actually handling memory, context persistence, and multi-model routing? Genuinely curious what everyone is doing [D]

Been building an AI product for a few months and honestly the part that's eaten most of my time has nothing to do with the actual product, it's all the plumbing around context management, memory persistence, and dealing with multiple LLM provider…

dev.to — LLM tag TIER_1 English(EN) · Rishabh Poddar · 2026-07-12 05:49

Open-Source LLMs: Why Enterprises Are Moving Beyond Frontier Models

For a while, the default answer to almost every AI problem was simple: use the strongest frontier model you can get. That made sense early on. Hosted frontier models were better at reasoning, more forgiving with messy prompts, and much easier to plug into a product than…

dev.to — LLM tag TIER_1 Русский(RU) · Promptra Team · 2026-07-10 21:48

LLM API Aggregators in Russia, 2026: How to Choose Without Overpaying

Apply in: 15 minutes · Savings: up to a 4x markup on every token · Level: intermediate · Reading time: ~30 minutes. What you'll learn: a comparison of 12 LLM API aggregators (markup, models, payment, documents) in a single table; …

dev.to — LLM tag TIER_1 English(EN) · Dixit Angiras · 2026-07-10 08:57

Optimizing Local LLM Deployment with Ollama Development Services

Running large language models inside a private network sounds straightforward until teams hit GPU bottlenecks, inconsistent inference performance, and data governance concerns. These challenges become more visible in enterprise environments where customer data cannot leave int…

dev.to — LLM tag TIER_1 English(EN) · Odd_Background_328 · 2026-07-10 07:12

From Tokens to Intelligence: A Deep Dive into How Large Language Models Process Language

If you've been anywhere near the tech world in the past two years, you've heard the term "large language model" (LLM) thrown around constantly. But what actually is a large language model? How does it work? And why should you care? This guide breaks it down without the …

dev.to — LLM tag TIER_1 English(EN) · GWEN · 2026-07-09 10:26

Beyond “Invalid JSON”: Engineering Robust Structured Output from LLMs

We’ve all been there: Your prompt explicitly says, "Return ONLY a JSON object." But the LLM, in its infinite desire to be helpful, returns: "Sure! Here is the data you requested: json { ... }". If your production…
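
The failure described in this excerpt, a JSON payload wrapped in friendly prose or a code fence, can be handled defensively on the client side. The following is a minimal sketch under stated assumptions, not the article's own method (the excerpt is truncated before any solution appears). The function name `extract_json_object` is hypothetical, and only the Python standard library is used.

```python
import json
import re

# Build the Markdown fence marker programmatically (three backticks).
FENCE = "`" * 3
# Matches a fenced block, optionally tagged "json", and captures its body.
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def extract_json_object(reply: str) -> dict:
    """Return the first JSON object found in a chatty model reply.

    Hypothetical helper: raises ValueError if no JSON object can be decoded.
    """
    # Prefer the contents of fenced blocks, then fall back to the whole reply.
    candidates = FENCE_RE.findall(reply) + [reply]
    decoder = json.JSONDecoder()
    for text in candidates:
        # Try each "{" as a possible start; raw_decode tolerates trailing prose.
        for match in re.finditer(r"\{", text):
            try:
                value, _ = decoder.raw_decode(text, match.start())
            except json.JSONDecodeError:
                continue
            if isinstance(value, dict):
                return value
    raise ValueError("no JSON object found in model reply")
```

Parsing leniently like this only recovers the payload; validating its fields against an expected schema is a separate step that the sketch leaves out.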

dev.to — LLM tag TIER_1 English(EN) · Lior Ben-David · 2026-07-09 09:32

Best Tools for Benchmarking LLM Provider Performance

dev.to — LLM tag TIER_1 English(EN) · Ingrid · 2026-07-09 09:17

Best Tools for Managing Multiple LLM API Keys at Scale

dev.to — LLM tag TIER_1 English(EN) · RouterPlex · 2026-07-09 01:42

RouterPlex: One API Key for 28 LLMs (Claude, GPT, DeepSeek, Qwen, MiniMax)

Most projects that touch multiple LLM providers end up with a pile of vendor SDKs, a pile of API keys, and separate billing relationships to manage. RouterPlex is a gateway that collapses that down to one key. What it does: One OpenAI- and Anthropic-compatible…

dev.to — LLM tag TIER_1 English(EN) · smakosh · 2026-07-08 17:08

What Is LLM Orchestration? Patterns, Tools, and When You Need It

The first version of an AI feature is usually one prompt to one model. The production version almost never is. It's a model choice that depends on the task, a fallback when the provider is down, a retry when the JSON comes back malformed, a cache for repeated questions, and a …

dev.to — LLM tag TIER_1 English(EN) · Andrew · 2026-07-08 11:01

Leveling Up: The State of Self-Hosted Coding LLMs in 2026

The performance gap between proprietary models like Claude or GPT and open-weight alternatives has effectively collapsed. As of July 2026, self-hosting is no longer about settling for 'good enough' results; it is about deploying production-grade coding assistants that keep you…

dev.to — LLM tag TIER_1 English(EN) · Kuldeep Paul · 2026-07-08 10:41

7 LLM Cost-Optimization Techniques Beyond Caching

dev.to — LLM tag TIER_1 English(EN) · TokenPAPA · 2026-07-08 02:29

Multi-Provider LLM API Aggregator 2026: Access DeepSeek, Qwen, MiniMax and More from a Single Endpoint

If you are building AI-powered applications for a global audience, you already know that relying on a single LLM provider is risky: model availability changes, pr…

dev.to — LLM tag TIER_1 Español(ES) · Carlos Arturo Castaño G. · 2026-07-07 13:32

Local LLMs for Coding Agents: What YouTube Doesn't Tell You

YouTube is full of videos saying "run a local LLM on your laptop and replace Claude/GPT for free." I tried it seriously, on two different machines, for weeks. The short conclusion: it works for answering one-off questions. It does not work, yet, for real agentic use with tools…

dev.to — LLM tag TIER_1 English(EN) · plasma · 2026-07-07 08:25

A Lightweight Node.js Wrapper for LLM API Retries, Timeouts, and Logging

Most LLM API integrations start with a direct SDK call. That is fine for a demo. But once the call is inside a real product, I usually want three things around it: a timeout, retry rules, and useful logs when something fails. …

dev.to — LLM tag TIER_1 English(EN) · ding · 2026-07-06 09:58

How I Built a Desktop Console for Managing LLM API Keys, Model Discovery, and Local Routing

Managing multiple LLM provider APIs sounds simple until the number of keys, relay sites, model names, and desktop clients starts to grow. I built AllApiDeck because I wanted one place to import records, organize them, test what actually works, and route requests through a loca…

dev.to — LLM tag TIER_1 English(EN) · Devanshu Biswas · 2026-07-06 09:02

grounding and citations: making LLM answers you can actually verify

An LLM will hand you a smooth, confident paragraph and never once tell you which parts it made up. Fluency is not truth. The fix is grounding: force the answer onto retrieved evidence, attach a citation to every claim, and then check that the citations actually hold. Here it i…
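
The "check that the citations actually hold" step in the teaser above can be sketched roughly as follows. This is a minimal illustration, not the article's method: the function name `check_citations`, the `(claim, source_id)` pair format, and the word-overlap threshold are all assumptions, and lexical overlap is only a crude stand-in for a real entailment check.

```python
import re


def _words(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def check_citations(claims, evidence, min_overlap=0.5):
    """Return (claim, reason) pairs for claims whose citation does not hold.

    claims:   list of (claim_text, source_id) pairs (hypothetical format)
    evidence: dict mapping source_id -> retrieved passage text
    """
    failures = []
    for text, source_id in claims:
        passage = evidence.get(source_id)
        if passage is None:
            # The answer cites something that was never retrieved.
            failures.append((text, "unknown source"))
            continue
        claim_words = _words(text)
        overlap = len(claim_words & _words(passage)) / max(len(claim_words), 1)
        if overlap < min_overlap:
            # The cited passage shares too few words with the claim.
            failures.append((text, "low overlap"))
    return failures


evidence = {"doc1": "The Eiffel Tower is 330 metres tall and stands in Paris."}
claims = [
    ("The Eiffel Tower is 330 metres tall.", "doc1"),
    ("It was painted green in 2020.", "doc1"),
    ("It has 5 lifts.", "doc9"),
]
failures = check_citations(claims, evidence)
print(failures)
```

Only the first claim passes here; the second is cited to a passage that does not support it, and the third cites a source that was never retrieved.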

dev.to — LLM tag TIER_1 English(EN) · mihir mohapatra · 2026-07-06 08:43

Observability for LLM Apps: Tracing, Cost Tracking, and Evaluation Loops

If you've shipped a traditional backend service, you already know the observability checklist: logs, metrics, traces, alerts. LLM-powered apps need all of that, plus a few things that don't exist in a normal request/response world: token spend, prompt/response pairs, and qual…

dev.to — LLM tag TIER_1 English(EN) · soy · 2026-07-05 21:33

Local LLM Efficiency: Token Reduction, Unity Integration, and Open Model Taste-Skill

Today's Highlights: This week's top stories focus on practical advancements for local AI, including a technique to drastically reduce LLM token usage for more efficient in…

dev.to — LLM tag TIER_1 English(EN) · galian · 2026-07-05 21:11

Running LLMs Locally with Ollama in 2026: A Practical Developer's Guide

For years, "run the model locally" was the option you mentioned and then didn't take: the models were too weak, the tooling too fiddly, and the cloud APIs too convenient. In 2026 that calculus has genuinely shifted. Open-weight models in the 12–35B range now handle real coding…

dev.to — LLM tag TIER_1 English(EN) · Puneet Gupta · 2026-07-05 18:42

Evaluating LLM Applications in Python

Introduction: "Building Reliable LLM Applications in Python" (https://pg-blogs.netlify.app/posts/10-building-reliable-llm-apps-in-python/) put it plainly: treat model output as a hypothesis to verify, not a fact to trust.…

dev.to — LLM tag TIER_1 English(EN) · Puneet Gupta · 2026-07-05 18:42

Evaluating LLM Applications in Java

Introduction: "Building Reliable LLM Applications in Java" (https://pg-blogs.netlify.app/posts/11-building-reliable-llm-apps-in-java/) put it plainly: treat model output as a hypothesis to verify, not a fact to trust.…

dev.to — LLM tag TIER_1 English(EN) · Puneet Gupta · 2026-07-05 18:42

LLM Frameworks vs. Native SDKs in Python

Introduction: Every LLM ecosystem now has at least one framework promising to make agents easier to build, and every framework post either oversells the abstraction or dismisses it outright. Neither is useful. The only honest way to evaluate a framework is to build t…

dev.to — LLM tag TIER_1 English(EN) · Puneet Gupta · 2026-07-05 18:42

LLM Frameworks vs. Native SDKs in Java

Introduction: Every LLM ecosystem now has at least one framework promising to make agents easier to build, and every framework post either oversells the abstraction or dismisses it outright. Neither is useful. The only honest way to evaluate a framework is to build t…

dev.to — LLM tag TIER_1 English(EN) · Puneet Gupta · 2026-07-04 15:29

Building Reliable LLM Applications in Java

Introduction: LLMs are usually associated with Python, but a great deal of production software (banking, enterprise backends, long-lived services) runs on the JVM, and those systems increasingly need to call language models too. Java's strong typing and mature tool…

dev.to — LLM tag TIER_1 English(EN) · Puneet Gupta · 2026-07-04 15:29

Building Reliable LLM Applications in Python

Introduction: Calling an LLM API is easy. Building an application on top of one that is reliable (that fails predictably, doesn't hallucinate its way into wrong answers, and doesn't surprise you with a bill) is a real engineering discipline. The cor…

dev.to — LLM tag TIER_1 Nederlands(NL) · Mattias chaw · 2026-07-04 13:01

Benchmarking Chinese LLM APIs: DeepSeek V4 vs Qwen3 vs Kimi K2 - A Developer's Guide (2026)

Benchmarking Chinese LLM APIs: DeepSeek V3 vs Qwen3 vs Kimi K2 - A Developer's Guide (2026). If you're building AI-powered applications in 2026, you've probably noticed something: Western model APIs are getting expensive. GPT-5 runs $5-15 per million tokens. Claude O…

dev.to — LLM tag TIER_1 English(EN) · Learn AI Resource · 2026-07-03 15:00

Run LLMs Locally Without Losing Your Mind: A Dev Workflow Guide

So you want to use AI in your development workflow but don't want to send every code snippet to the cloud? I get it. Privacy concerns, latency headaches, API costs adding up: all valid. Here's how I actu…

dev.to — LLM tag TIER_1 English(EN) · MD Shahinur Rahman · 2026-07-03 12:36

How to Choose the Right LLM for Real AI Workflows

Choosing an LLM used to feel simple. Pick the biggest name, test a few prompts, and ship. That does not work anymore. In today’s AI landscape, the gap between a good demo and a production-ready AI system is wide. Some models are better at d…

dev.to — LLM tag TIER_1 English(EN) · Moussa Coulibaly · 2026-07-02 17:28

Building Dashboards for LLM Usage and Performance

[Image: Building Dashboa…]

dev.to — LLM tag TIER_1 English(EN) · Babatunde Fashola · 2026-07-02 17:25

Observability for LLM Apps: The Metrics That Matter

[Image: Observability fo…]

dev.to — LLM tag TIER_1 English(EN) · Kuldeep Paul · 2026-07-02 16:15

Open-Source vs. Commercial LLM Gateways Compared

[Image: Open-Source vs. …]

dev.to — LLM tag TIER_1 English(EN) · kapil Maheshwari · 2026-07-01 03:30

Streaming vs. Batching LLM Responses: A Cost and Latency Analysis

Key takeaways: Streaming can reduce perceived latency by 30-50%. Batching often leads to 20-40% lower API costs. Choosing the wrong method can double your LLM expenses. Understanding your user experience needs is critical. …

dev.to — LLM tag TIER_1 English(EN) · Priya Sundaram · 2026-06-30 21:57

The Anatomy of a Production-Grade LLM Gateway

[Image: The Anatomy of a…]

dev.to — LLM tag TIER_1 English(EN) · Amit Nabarro · 2026-06-30 08:11

Langfuse for LLM Observability: Where It Fits in Your Middleware Stack

Originally published on 475 Cumulus (https://475cumulus.com/articles/langfuse-for-llm-observability). How to trace model calls, debug prompts, and run evals with Langfuse, integrated into server-side LLM middleware, not…

dev.to — LLM tag TIER_1 English(EN) · Ankit Sharma · 2026-06-30 02:54

Mastering LLM Workflows with LangGraph: A Beginner's Guide

Have you ever tried to build a complex application with a Large Language Model (LLM) only to find yourself tangled in a mess of if-else statements and function calls? You start with a simple prompt, but then you need to check a database, call an external API, maybe ask the use…

dev.to — LLM tag TIER_1 English(EN) · NovaStack · 2026-06-29 08:19

How I Simplified My Multi-Model LLM Workflow (and Saved Myself Some Headaches)

Over the past few months, I've been building an AI-powered code review tool for my team. Nothing groundbreaking, just something that catches common issues before PR reviews. But as the project evolved, I found myself drowning in API keys. The problem wasn't the code. I…

dev.to — LLM tag TIER_1 English(EN) · Eribo Richmond · 2026-06-28 09:29

FLenQA Benchmark: Can Current LLMs Reason Within Their Claimed Context Length?

Some days ago, I started working on a research assistant that uses multi-agent orchestration mainly because the goal was to use small, local models (ignoring latency and output token/secs which impacts inference speed). Most small models have limited reasoning capabilit…

dev.to — LLM tag TIER_1 English(EN) · Delafosse Olivier · 2026-06-27 21:30

Designing a Google OpenRL Self-Hosted API for LLM Post-Training Fine-Tuning

Originally published at https://www.coreprose.com/kb-incidents/designing-a-google-openrl-self-hosted-api-for-llm-post-training-fine-tuning?utm_source=devto&utm_medium=syndication&utm_campaign=kb-incidents on CoreProse KB-in…

dev.to — LLM tag TIER_1 English(EN) · Ariel Frischer · 2026-06-27 19:07

Emergent Properties and Capabilities of LLMs

Emergent LLM ability is best treated as an evaluation problem, not a mystical property. Some abilities do appear suddenly under common benchmark metrics, but a large part of "emergence" comes from thresholded scoring, prompt format, in-context examples, tool access, training l…

dev.to — LLM tag TIER_1 English(EN) · Prateek Pareek · 2026-06-26 13:01

How to Fine-Tune an LLM: A Complete Step-by-Step Guide

Fine-tuning an LLM means taking a general pre-trained model and training it further on your own data so it gets good at exactly what you need. In this guide, you will get a practical, step-by-step walkthrough covering every stage from dataset prep to deployment, written for en…

dev.to — LLM tag TIER_1 English(EN) · Suman Nath · 2026-06-26 06:32

Unpacking the Accuracy Number: Building an LLM Evaluation Framework from Scratch

In my last series I fine-tuned models and kept quoting one proud number: ~96% accuracy. This series is about the thing I didn't do carefully enough back then: actually checking what that number meant. Here's the trap. Accuracy is a single numb…

dev.to — LLM tag TIER_1 English(EN) · hhhfs9s7y9-code · 2026-06-25 03:29

Building a Self-Healing LLM API Layer: Architecture Decisions That Matter

Everyone wants self-healing APIs. Not everyone builds one that actually works in production. After 20,000+ real LLM API calls and iterating through five major architecture revisions at …

dev.to — LLM tag TIER_1 English(EN) · hhhfs9s7y9-code · 2026-06-25 02:53

6-Dimensional Contract Validation: Why Your LLM API Needs More Than Status Code Checks

Your API returns 200 OK. Your monitoring dashboard is green. Everything looks fine. Except the response is JSON with a completely wrong schema. Or the latency just tripled. O…

dev.to — LLM tag TIER_1 English(EN) · hhhfs9s7y9-code · 2026-06-25 02:23

Why Retry Is Not Self-Healing: A Technical Deep Dive for LLM APIs

Every LLM API wrapper claims "self-healing." What they actually do is retry the same request or switch to another provider on error. That's not self-healing. That's hope-driven developm…

dev.to — LLM tag TIER_1 English(EN) · hhhfs9s7y9-code · 2026-06-25 02:17

How to Handle LLM API Failures in Production: A Practical 2026 Guide

Last updated: June 25, 2026 | Reading time: 6 min. Every AI application in production will face LLM API failures. They are not "if" but "when", and the challenge is not just det…

dev.to — LLM tag TIER_1 English(EN) · Nazar Boyko · 2026-06-23 23:45

Evaluating LLM Output Quality in Production

In March 2023, GPT-4 could tell you whether a number was prime with 97.6% accuracy. By June of the same year, the same model name answered those same questions correctly 2.4% of the time. Nobody pushed a bad commit. No prompt changed in your repo. The thing behind the…

dev.to — LLM tag TIER_1 English(EN) · Lucas · 2026-06-23 22:02

Two Patterns for Reducing LLM Costs in Data-Intensive RAG Apps

How we cut token usage significantly in an F1 telemetry analyzer by rethinking what goes into the context window, and when. When building RAG applications on top of structured data (databases, APIs, telemetry), the naive approach is to dump everything into the…

dev.to — LLM tag TIER_1 English(EN) · Yash Kumar Saini · 2026-06-23 15:14

Devlog #7 Reviving DevNotion: 10,000 Lines of Code, Multi-LLM Support, and the Road Ahead for v2.1

Spent the week breathing new life into DevNotion: 59 commits and over 10,000 lines of code later, v2.1 is officially alive. It was a massive push toward multi-LLM support and public-facing dashboards, keeping a steady 6-day streak in the process. …

r/LocalLLaMA TIER_1 English(EN) · /u/hay-yo · 2026-06-23 10:58

Reusable Workflows for Long-Running Local LLMs

Howdy All, Letting you know about a harness I've built to help us use local models on long tasks. I've been using local LLMs for 8 months now and in that time the two biggest recurring issues are slow processing speeds and small con…

dev.to — LLM tag TIER_1 English(EN) · galian · 2026-06-22 08:24

Stop Vibe-Checking Your LLM: A Developer's Guide to Evals

You tweaked the system prompt, ran the same two test questions you always run, the answers looked good, and you shipped. A week later support is forwarding you screenshots of the model confidently doing the exact thing your prompt was supposed to stop. You never saw it, becaus…

dev.to — LLM tag TIER_1 English(EN) · hhhfs9s7y9-code · 2026-06-22 01:23

Python LLM API Error Handling: A Complete Guide to 429 Rate Limits, Retries, and Failover

If you're building AI-powered applications in Python, you've probably hit this wall: your LLM provider returns a 429 (rate limit), a 502 (bad gateway), or just hangs until time…

dev.to — LLM tag TIER_1 中文(ZH) · hhhfs9s7y9-code · 2026-06-21 08:42

NeuralBridge Benchmark Data: LLM Self-Healing Performance Report at One Million Calls

This article publishes the complete benchmark data for the NeuralBridge SDK, based on 1,000,000 measured API calls and covering core metrics such as fault-diagnosis latency, circuit-breaker check overhead, and telemetry throughput. Test environment (parameter: value): number of tests: 1,000,000…

dev.to — LLM tag TIER_1 中文(ZH) · hhhfs9s7y9-code · 2026-06-21 08:38

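Complete Solutions for 24 Classes of LLM API Failures: Self-Healing Practices from 429 Rate Limiting to Silent Failures

LLM API failures are far more complex than those of traditional APIs. This article systematically covers the root causes, diagnosis methods, and self-healing solutions for 24 classes of AI API failures, so you can say goodbye for good to "being woken up in the middle of the night to deal with API problems". Preface: According to an analysis of 10,000 production LLM API calls, failures are distributed as follows (table columns: failure type, share, severity)…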

dev.to — LLM tag TIER_1 中文(ZH) · hhhfs9s7y9-code · 2026-06-21 08:34

7 LLM API Failure Modes and Production-Grade Solutions

LLM API failures in production are not random; they follow clear patterns. Failure mode classification: Based on an empirical classification from 70,000 fault-injection tests (source: NeuralBridge SDK benchmark), LLM API failures fall into 7 major modes (table columns: #, failure mode, trigger condition, share (est.))…

dev.to — LLM tag TIER_1 中文(ZH) · hhhfs9s7y9-code · 2026-06-21 07:46

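LLM API Troubleshooting: 40+ Real-World Failure Modes and Auto-Recovery Solutions

LLM API failures are not a question of "whether they will happen" but of "what the next failure is and when it arrives". Why do you need an API troubleshooting system? In 2026, no LLM provider can guarantee 100% availability. Mainstream providers such as OpenAI, Anthropic, DeepSeek, and Tongyi Qianwen have all experienced service interruptions of varying severity over the past 12 months. For AI agents in production, API failures are part of day-to-day operations…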

dev.to — LLM tag TIER_1 English(EN) · Ayi NEDJIMI · 2026-06-20 10:04

LLM Context Window Management: Strategies and Patterns

Managing context windows in production LLM applications is one of those problems that everyone underestimates until their app crashes or costs spiral out of control. Token limits are hard walls, not soft guidelines, and the strategies you choose upfront determine whether your …

dev.to — LLM tag TIER_1 English(EN) · Growth Collective · 2026-06-19 11:00

Monitoring LLM Visibility: A Technical Playbook for Growth Engineers

The shift from traditional search engines to AI-powered answer engines is already reshaping how users discover content. Gartner projects a 25% decline in search engine volume by 2026 as more people turn to chatbots like ChatGPT, Claude, and Gemini for instant answers. For bran…

dev.to — LLM tag TIER_1 English(EN) · Rost · 2026-06-19 09:52

LLM System Cost Optimization: Where the Money Actually Goes

LLM costs scale linearly with usage. A system processing 10,000 requests a day at $0.01 per request costs $100 daily, or $36,500 a year. At enterprise scale, that's over $10,000. Cost optimization isn't about cutting corners. It's about spending tokens where they matter.…
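
The linear scaling in the teaser above is easy to check by hand. The sketch below is a minimal worked example; the helper name `llm_cost` and the flat per-request price are assumptions made for illustration, since real billing is usually per token.

```python
def llm_cost(requests_per_day, cost_per_request, days=365):
    """Return (daily, yearly) spend for a flat per-request price."""
    daily = requests_per_day * cost_per_request
    return daily, daily * days


# Figures from the teaser: 10,000 requests a day at $0.01 each.
daily, yearly = llm_cost(10_000, 0.01)
print(f"daily=${daily:,.2f} yearly=${yearly:,.2f}")
```

With these inputs the daily spend is $100 and the yearly spend is $36,500.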

dev.to — LLM tag TIER_1 English(EN) · PAWAN YADAV (AI Engineer) · 2026-06-19 09:36

Tool Calling Powered by Lightweight Open-Source LLMs

🚀 How Lightweight LLMs Can Use Tools Without Large Compute: A Prompt-Driven Tool-Calling Approach. #AI #LLM #MachineLearning #AIAgents #PromptEngineering #OpenSourceAI. 🚀 Introduction: Large Language Models (LLMs) like GPT-4 or Claude are extremely powerfu…

dev.to — LLM tag TIER_1 English(EN) · Jasmine Park · 2026-06-19 09:33

Langfuse Alternatives: 6 LLM Observability Tools, Ranked by Month-Eight Pain

TL;DR: I went looking for Langfuse alternatives after living with a proprietary tracer for eight months and then paying to migrate off it. I compared six options: Helicone, Arize Phoenix, LangSmith, Braintrust, Lami…

dev.to — LLM tag TIER_1 English(EN) · Vaibhav Doddihal · 2026-06-18 13:45

Evaluating LLM Systems: Metrics, Methods, and Scorecards

Originally published on BlockSimplified (https://blocksimplified.com/blog/evaluating-llm-systems-metrics-methods-scorecards), 11 min read. Thi…

dev.to — LLM tag TIER_1 English(EN) · zendev2112 · 2026-06-18 04:03

Prompts Are Not Enough: Enforcing Hard Constraints on LLM Output

Every LLM demo looks impressive until it encounters a requirement that cannot be left to probability. Models are remarkably good at producing convincing text, but production systems often need guarantees rather than likelihoods. I ran into that distinction while building an AI…

dev.to — LLM tag TIER_1 Português(PT) · Lucas Amaral · 2026-06-17 13:10

Prompt Engineering for Data at Scale: Using LLMs for Scaled Testing with Coverage and No Duplicates

Using LLMs to generate synthetic data has become an attractive strategy for QA teams that need to scale their test pipelines. The promise is tempting: generate hundreds of complex records in seconds. In practice, however, automated generation without dir…

dev.to — LLM tag TIER_1 English(EN) · Yogitaadevi Ravishankar · 2026-06-17 12:09

Unlocking the Power of Local LLMs with Ollama: A Practical Guide

Tags: #Ollama #LLM #AI #OpenSource. Introduction: The rise of large language models (LLMs) has transformed how we build AI applications, from chatbots to code assistants. Yet, most developers still rely on cloud APIs, paying per request and…

dev.to — LLM tag TIER_1 English(EN) · Alex Delov · 2026-06-17 09:05

Stateful Provider Fallback for LLM Pipelines: An FSM Pattern

Gateway-level LLM fallback (LiteLLM, Bifrost, Kong AI Gateway) operates on individual HTTP requests. When a request to one provider fails, the gateway retries it against another. This is the right tool when your unit of work is a single completion call. It is the wrong …
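
The per-request fallback the teaser above describes can be sketched as a simple loop over providers. This is only an illustration under stated assumptions: `complete_with_fallback`, `AllProvidersFailed`, and the two stub providers are hypothetical names, and the sketch does not reflect how LiteLLM, Bifrost, or Kong AI Gateway implement fallback internally.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return (provider_name, text)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real gateway would filter retryable errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real provider clients.
def flaky(prompt: str) -> str:
    raise TimeoutError("upstream timed out")


def stable(prompt: str) -> str:
    return f"echo: {prompt}"


used, text = complete_with_fallback("hello", [("primary", flaky), ("backup", stable)])
print(used, text)
```

Each call is handled in isolation: nothing is remembered between requests, which is why this shape fits a single completion call and not a multi-step pipeline.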

dev.to — LLM tag TIER_1 English(EN) · DevOps Start · 2026-06-17 09:03

LLM Observability on Kubernetes: A Practical Guide

Monitoring traditional applications often feels like a well-trodden path. You set up logs, grab some metrics, and perhaps add a few traces. However, integrating Large Language Models (LLMs) or AI agents, especially when running on Kubernetes, fundamentally changes this paradig…

dev.to — LLM tag TIER_1 English(EN) · QuantaMind · 2026-06-16 03:30

Prompt-Based vs. Native Tool Calling: Navigating the Minefield of Local LLM Implementations

If you’ve spent any time working across different local LLM backends, you know the frustration. You get your tool-calling logic dialed in perfectly for Ollama, you feel great, and then you try to switch your backend to something like MLX or a specific llama.cpp setup, and sudd…

r/LocalLLaMA TIER_1 English(EN) · /u/awfulalexey · 2026-06-15 19:32

Evalatro: An Open Benchmark Where LLMs Play the Real Balatro

[Image: Evalatro: an open benchmark where LLMs play the real Balatro] https://www.reddit.com/r/LocalLLaMA/comments/1u6qso1/evalatro_an_open_benchmark_where_llms_play_the/ …

dev.to — LLM tag TIER_1 English(EN) · Gabriel Anhaia · 2026-06-13 11:00

Trace Sampling for LLM Apps: Keep the Spans That Matter, Drop the Rest

Book: Observability for LLM Applications (https://www.amazon.de/-/en/dp/B0GXNNMKVF). Also by me: Thinking in Go (2-book series), https://xgabriel.com/go-book …

dev.to — LLM tag TIER_1 English(EN) · Alex Towell · 2026-06-07 03:39

Language Calculus: An Algebraic Framework for LLM Composition

What if we could compose language models the way we compose functions in mathematics? What if there was an algebra of language models? Language Calculus (langcalc) is an algebraic framework for building and reasoning about language model systems. …

dev.to — LLM tag TIER_1 English(EN) · Alex Towell · 2026-06-07 03:09

src2md: Fitting Codebases into LLM Context Windows

src2md (https://pypi.org/project/src2md/) solves a practical problem: you want an LLM to understand your codebase, but the codebase doesn't fit in the context window. GPT-4 gives you ~128K tokens. Claude gives you ~…

dev.to — LLM tag TIER_1 English(EN) · zxpmail · 2026-06-06 11:55

Less Is More: Why 3 Code Examples Beat 10 Rules for LLM Code Generation

A controlled benchmark comparing two approaches to guiding LLM code generation. The Question: Most LLM harnesses guide code generation via rules: "Don't hardcode API keys." "Don't use empty catch blocks." "Don't over-abstract." But LLMs aren't …

Coverage sources [220]