Pulse

last 48h

[50/2010] 98 sources

What AI is actually talking about — clusters surfacing on Bluesky, Reddit, HN, Mastodon and Lobsters, re-ranked to elevate originality and crush noise.

RESEARCH · AI Snake Oil English(EN) · 25mo · BLOG

AI leaderboards are no longer useful. It's time to switch to Pareto curves.

AI leaderboards for evaluating code generation systems are becoming less useful due to a lack of cost considerations. Researchers argue that current benchmarks often overlook the significant expenses associated with complex AI agents that repeatedly invoke language models. Instead, they propose using Pareto curves to visualize the trade-off between accuracy and cost, as simple baseline agents can sometimes achieve comparable results at a fraction of the price. AI
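
As a rough illustration of the accuracy-versus-cost trade-off described above, the sketch below picks out the Pareto-optimal agents from a list of (cost, accuracy) results. The agent names, the `AgentResult` type, and all numbers are made up for the example and do not come from the article.

```python
from typing import NamedTuple


class AgentResult(NamedTuple):
    name: str        # hypothetical agent label
    cost_usd: float  # cost per evaluation run (illustrative)
    accuracy: float  # fraction of tasks solved (illustrative)


def pareto_frontier(results: list[AgentResult]) -> list[AgentResult]:
    """Return the agents that no other agent beats on both cost and accuracy."""
    # Sort by ascending cost; on equal cost, put the more accurate agent first.
    ordered = sorted(results, key=lambda r: (r.cost_usd, -r.accuracy))
    frontier: list[AgentResult] = []
    best_accuracy = float("-inf")
    for r in ordered:
        # Keep a point only if it is more accurate than every cheaper point.
        if r.accuracy > best_accuracy:
            frontier.append(r)
            best_accuracy = r.accuracy
    return frontier


# Made-up numbers, purely for illustration.
RESULTS = [
    AgentResult("simple-baseline", 0.50, 0.80),
    AgentResult("retry-baseline", 1.20, 0.86),
    AgentResult("complex-agent-a", 9.00, 0.85),
    AgentResult("complex-agent-b", 12.00, 0.88),
]

if __name__ == "__main__":
    for r in pareto_frontier(RESULTS):
        print(f"{r.name}: ${r.cost_usd:.2f}, accuracy {r.accuracy:.2f}")
```

In this toy data, `complex-agent-a` drops off the frontier because a cheaper agent is already more accurate, which is the kind of result a single-number leaderboard would hide.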
RESEARCH · Ahead of AI (Sebastian Raschka) English(EN) · 26mo · [30 sources] · BLOG

My Workflow for Understanding LLM Architectures

OpenAI has introduced the IH-Challenge dataset to train large language models to better prioritize instructions from different sources, such as system messages, developers, and users. This training aims to improve safety steerability and robustness against prompt-injection attacks by teaching models to follow a hierarchy where system instructions are most trusted. The dataset is designed to overcome common pitfalls in reinforcement learning for instruction hierarchy, ensuring models can reliably adhere to safety policies even when faced with conflicting user or tool-generated prompts. AI

IMPACT Enhances LLM safety and reliability by improving their ability to follow prioritized instructions, reducing risks from prompt injection and policy violations.
RESEARCH · HN — machine learning stories English(EN) · 26mo · [21 sources] · HN LOBSTERS MASTO

A Visual Introduction to Machine Learning (2015)

This collection of resources offers a broad overview of machine learning, from foundational concepts and visual introductions to theoretical underpinnings and practical applications. It includes a visual guide to classification tasks, a discussion on the science and ethics of machine learning benchmarks, and pointers to comprehensive textbooks and course materials. Additionally, it highlights tools for interpretable machine learning and the engineering practices required for deploying models in production. AI

IMPACT Provides foundational knowledge and practical tools for understanding, developing, and deploying machine learning models.
COMMENTARY · Hamel Husain English(EN) · 26mo · [2 sources] · BLOG

Selecting The Right AI Evals Tool

Hamel Husain, an AI consultant, emphasizes the critical need for robust evaluation systems in developing successful AI products, drawing from his experience with projects like CodeSearchNet and Rechat's AI assistant, Lucy. He argues that rapid iteration, enabled by effective evaluation, debugging, and modification processes, is key to AI product success. Husain highlights three levels of evaluation: unit tests, model and human evaluation, and A/B testing, stressing that streamlining the evaluation process is paramount for continuous improvement. AI
RESEARCH · HN — AI infrastructure stories Română(RO) · 26mo · [2 sources] · HN

1-Bit AI Infrastructure

Researchers have developed a dedicated software stack to enable fast and lossless inference of 1-bit Large Language Models (LLMs) like BitNet b1.58 on CPUs. This new infrastructure achieves significant speedups, ranging from 2.37x to 6.17x on x86 CPUs and 1.37x to 5.07x on ARM CPUs, depending on model size. The goal is to make LLMs more efficient and deployable on a wider range of devices. AI

IMPACT Enables more efficient and widespread deployment of LLMs on consumer hardware.
COMMENTARY · Hamel Husain English(EN) · 26mo · BLOG

Is Fine-Tuning Still Valuable?

Fine-tuning large language models remains a valuable technique, particularly for tasks requiring specific syntax, style, or rules, according to Hamel Husain. While prompt engineering is a crucial first step for testing evaluation systems, fine-tuning offers advantages when models need to learn niche domain-specific languages or adhere to idiosyncratic output formats. Examples include Honeycomb's query assistant and ReChat's AI real estate assistant, demonstrating fine-tuning's effectiveness even with larger models like GPT-3.5. AI
TOOL · HN — machine learning stories English(EN) · 27mo · HN

Manipulating Chess-GPT's World Model

Researchers have explored interventions on a language model trained to play chess, dubbed Chess-GPT. By manipulating the model's internal representations of the board state and player skill, they demonstrated a causal link between these representations and the model's output. This work addresses skepticism about whether large language models possess genuine world models or merely learn superficial patterns, showing that targeted edits can influence the model's playing strength and move generation. AI

IMPACT Investigates the depth of understanding in LLMs, potentially influencing how we evaluate and develop future models.
TOOL · HN — machine learning stories English(EN) · 27mo · HN

Opus 1.5 released: Opus gets a machine learning upgrade

The Opus 1.5 audio codec has been released with significant machine learning enhancements, marking the first time Opus uses deep learning to process the audio signal directly. These new ML-based features, including improved packet loss concealment (PLC) and a novel redundancy transmission method, are designed to be fully compatible with older versions and optimized to run efficiently on standard CPUs. Most users won't notice a performance impact, and the ML features are disabled by default, requiring specific compile-time and run-time flags to activate. AI

IMPACT Enhances audio codec resilience to packet loss and improves redundancy, potentially improving real-time communication quality.
TOOL · HN — machine learning stories English(EN) · 27mo · HN

Where is Noether's principle in machine learning?

This research paper explores the applicability of Noether's principle, a fundamental concept in physics linking symmetries to conservation laws, within the domain of machine learning. The authors investigate whether similar principles of invariance and conserved quantities can be identified in discrete machine learning processes, such as the training of neural networks. While acknowledging the potential for such connections, the paper suggests that directly applying Noether's theorem to machine learning is complex and not yet fully understood. AI

IMPACT Explores theoretical underpinnings that could lead to new optimization techniques or model architectures.
COMMENTARY · Eugene Yan English(EN) · 27mo · BLOG

Don't Mock Machine Learning Models In Unit Tests

Eugene Yan's article discusses the challenges of applying traditional unit testing practices to machine learning code. Unlike standard software where logic is handcrafted, ML models learn logic from data, making direct testing of this learned logic complex. Yan suggests that while mocking dependencies is common in software, ML unit tests may require interacting with the actual model, especially for verifying training progress or inference correctness. He proposes using small, self-contained data samples and testing with random or empty weights to overcome issues with large model sizes and slow inference times. AI
RESEARCH · Eugene Yan English(EN) · 28mo · BLOG

How to Generate and Use Synthetic Data for Finetuning

Synthetic data, generated by models or simulations rather than real-world sources, offers a faster and more cost-effective alternative to human annotation for fine-tuning AI models. This approach can lead to improved model performance and generalization while also mitigating privacy and copyright concerns. Two primary methods for generating synthetic data include distillation from a more capable model and self-improvement techniques where a model refines its own output. These methods can be applied to pretraining, instruction-tuning, and preference-tuning to enhance various aspects of a model's capabilities. AI
COMMENTARY · Lil'Log (Lilian Weng) English(EN) · 28mo · BLOG

Thinking about High-Quality Human Data

Lilian Weng's latest post explores the critical role of high-quality human data in training deep learning models, emphasizing that data collection is often overlooked in favor of model development. The process involves careful task design, rater selection and training, and data aggregation, with techniques like "wisdom of the crowd" and weighted agreement schemes used to improve reliability. Historical examples, such as an early 20th-century ox-weight guessing contest and studies using Amazon Mechanical Turk for machine translation evaluation, illustrate the effectiveness and challenges of crowdsourced data. AI
COMMENTARY · Chip Huyen English(EN) · 29mo · BLOG

Generation configurations: temperature, top-k, top-p, and test time compute

Chip Huyen's latest post delves into the probabilistic nature of AI model responses, explaining how sampling configurations like temperature, top-k, and top-p influence output creativity and factuality. The article highlights that while this randomness is beneficial for creative tasks, it can lead to inconsistencies and hallucinations, causing user confusion. Huyen also discusses how increasing test-time compute by sampling multiple outputs can improve performance and explores methods for generating structured outputs from models. AI
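
To make the three sampling knobs concrete, here is a minimal standard-library sketch of temperature scaling, top-k filtering, and top-p (nucleus) filtering over a toy next-token distribution. The token names and logit values are invented for illustration, and real inference libraries implement these steps on tensors rather than dictionaries.

```python
import math
import random


def softmax_with_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept: list[tuple[str, float]] = []
    cumulative = 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


def sample(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Invented logits for the next token after "My favorite color is".
LOGITS = {"red": 2.0, "blue": 1.0, "green": 0.5, "purple": -1.0}

if __name__ == "__main__":
    rng = random.Random(0)
    probs = softmax_with_temperature(LOGITS, temperature=0.7)
    nucleus = top_p_filter(top_k_filter(probs, k=3), p=0.9)
    print([sample(nucleus, rng) for _ in range(5)])
```

Applying top-k before top-p, as in the usage lines, is one common ordering and is an assumption of this sketch, not something the post prescribes.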
RESEARCH · Hamel Husain English(EN) · 29mo · BLOG

How To Debug Axolotl

Hamel Husain has published a guide on debugging the Axolotl project, a tool for fine-tuning large language models. The guide offers practical tips such as simplifying test scenarios, using smaller datasets and models, and clearing caches to expedite the debugging process. It also provides specific configurations for debugging with VSCode, including settings for data preprocessing and remote host development. AI
RESEARCH · Eugene Yan English(EN) · 29mo · BLOG

Language Modeling Reading List (to Start Your Paper Club)

Eugene Yan has compiled a reading list of fundamental language modeling papers, intended to facilitate group study sessions. The list includes seminal works like "Attention Is All You Need," "BERT," and "GPT-3," each accompanied by a concise summary highlighting its core contribution. Yan also provides guidance on how to approach reading research papers and encourages community contributions to refine the list. AI
RESEARCH · Hugging Face Blog English(EN) · 30mo · [26 sources] · MASTO X

🚀 Accelerating LLM Inference with TGI on Intel Gaudi

Google Research has introduced "speculative cascades," a novel method to enhance Large Language Model (LLM) efficiency by merging speculative decoding with standard cascades. This hybrid approach aims to reduce computational costs and inference latency without compromising output quality. By strategically using smaller models to predict tokens and then verifying them with larger models, speculative cascades offer improved cost-quality trade-offs compared to either technique used in isolation, as demonstrated with Gemma and T5 models. AI

IMPACT New inference techniques like speculative cascades and KV cache compression could significantly reduce operational costs for LLM deployments.
RESEARCH · Hugging Face Daily Papers English(EN) · 31mo · [153 sources] · MASTO BLOG REDDIT

GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs

Multiple research papers released on arXiv address the challenge of hallucinations in large language and vision-language models. One paper introduces In-Context Visual Contrastive Optimization (IC-VCO) to mitigate multimodal hallucinations by using contrastive images within a shared context and a novel sample editing strategy. Another study investigates architectural factors influencing hallucination robustness, categorizing hallucinations and providing guidance on model design. Additionally, a new framework, BenHalluEval, is proposed for evaluating and detecting hallucinations in Bengali language models, highlighting the inadequacy of existing methods for low-resource languages. Other research explores reframing hallucination detection as out-of-distribution detection and examines how prompt toxicity affects factual reliability. AI

IMPACT These studies offer new techniques and benchmarks for improving the factual accuracy and reliability of LLMs, crucial for their safe deployment in sensitive applications.
RESEARCH · Lil'Log (Lilian Weng) English(EN) · 32mo · [3 sources] · BLOG

Adversarial Attacks on LLMs

Researchers are developing new methods to enhance the safety and robustness of large language models against adversarial attacks. These attacks, often in the form of carefully crafted prompts, aim to bypass built-in safety mechanisms and elicit undesirable outputs. Efforts include creating guardrails like AprielGuard and developing leaderboards to track and improve model security against such vulnerabilities. AI
RESEARCH · Yannic Kilcher English(EN) · 32mo · [25 sources] · MASTO

[GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Researchers are developing new benchmarks and evaluation methods for large language models (LLMs) in mathematical reasoning and educational assessment. New datasets like ESTBook and Math-PT aim to go beyond simple accuracy, focusing on pedagogical reasoning and reducing linguistic bias. Other work explores the impact of self-consistency and reasoning effort on automated scoring, with findings suggesting strategic model selection can optimize accuracy and cost. Additionally, frameworks like MaSTer are being created to automatically generate adversarial test cases for evaluating and improving LLM robustness. AI

IMPACT New benchmarks and evaluation techniques will drive more robust and reliable LLM development for educational and reasoning tasks.
RESEARCH · Hugging Face Blog English(EN) · 32mo · [220 sources] · HN MASTO BLOG REDDIT

NPHardEval Leaderboard: Unveiling the Reasoning Abilities of Large Language Models through Complexity Classes and Dynamic Updates

Recent research explores novel methods to enhance the reasoning capabilities and efficiency of large language models (LLMs). Papers introduce techniques like speculative exploration for Tree-of-Thought reasoning to break synchronization bottlenecks and achieve significant speedups. Other work focuses on improving tool-integrated reasoning by pruning erroneous tool calls at inference time and developing frameworks for robots to perform physical reasoning in latent spaces before acting. Additionally, research investigates the effectiveness of different reasoning protocols, such as debate and voting, for LLMs, finding that while some methods improve safety, they don't always enhance usefulness. AI

IMPACT New methods for efficient reasoning and tool integration could enhance LLM performance and applicability in complex tasks.
RESEARCH · Smol AINews English(EN) · 32mo · [7 sources] · MASTO BLOG

MM1: Apple's first Large Multimodal Model

Researchers have developed Cornserve, an open-source distributed serving system designed to efficiently handle any-to-any multimodal models, which can process and generate combinations of various data types like text, images, and audio. The system improves throughput by up to 3.81x and reduces tail latency by 5.79x by disaggregating model components and scaling them independently. Separately, a new evaluation framework called XTC-Bench has been introduced to assess the cross-task consistency of unified multimodal models, revealing that high performance in individual tasks does not guarantee semantic alignment across them. AI

IMPACT New systems and evaluation frameworks for multimodal AI aim to improve efficiency and consistency in handling diverse data types.
COMMENTARY · Eugene Yan English(EN) · 32mo · [2 sources] · BLOG

AI Engineer 2024 Keynote - What We Learned from a Year of LLMs

Eugene Yan presented key learnings from building with Large Language Models (LLMs) at the AI Engineer World's Fair 2024. The keynote, co-authored with others, focused on practical aspects of LLM system development, including evaluations, Retrieval-Augmented Generation, and guardrails. Yan also discussed challenges in consistently evaluating LLMs, citing concerns raised by researchers at OpenAI, Anthropic, and others regarding benchmark reliability and task relevance. AI
RESEARCH · Lil'Log (Lilian Weng) English(EN) · 33mo · [16 sources] · BLOG

Diffusion Models for Video Generation

Researchers are exploring advanced diffusion models for video generation, addressing challenges like temporal consistency and data scarcity. New methods focus on improving parameterization, such as the v-prediction technique, and incorporating conditional sampling for tasks like extending video length or filling missing frames. Efforts are also underway to enhance efficiency and controllability through post-training frameworks, hybrid attention mechanisms, and semantic-visual adaptation, aiming for real-time generation and higher quality outputs. AI

IMPACT Advances in diffusion models are improving video generation quality, efficiency, and controllability, potentially enabling new applications in content creation and analysis.
RESEARCH · Eugene Yan English(EN) · 33mo · BLOG

Evaluation & Hallucination Detection for Abstractive Summaries

Evaluating abstractive summarization, which involves rephrasing source material rather than copying sentences, presents challenges, particularly in assessing relevance and factual consistency. While fluency and coherence are largely addressed by modern language models, measuring relevance remains subjective. Detecting factual inconsistencies, or hallucinations, is a key focus, with studies reporting significant error rates in generated summaries, up to 30% on the CNN/DailyMail dataset. Common evaluation methods include n-gram-based metrics like ROUGE and embedding-based metrics, alongside techniques like natural language inference and question-answering for hallucination detection. AI
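
For readers unfamiliar with n-gram metrics, the following is a simplified sketch of ROUGE-N (clipped n-gram overlap between a reference and a candidate summary). It assumes whitespace tokenization and lowercasing only; the official ROUGE tooling adds stemming and other options, so scores from this sketch will not match it exactly. The example sentences are invented.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference: str, candidate: str, n: int = 1) -> dict[str, float]:
    """Simplified ROUGE-N: precision, recall and F1 over clipped n-gram matches."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((ref & cand).values())
    ref_total = sum(ref.values())
    cand_total = sum(cand.values())
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    candidate = "the cat lay on the mat"
    print(rouge_n(reference, candidate, n=1))
    print(rouge_n(reference, candidate, n=2))
```

The candidate here changes the one word that carries the meaning yet still shares most unigrams with the reference, which hints at why overlap metrics alone are a weak signal for factual consistency.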
RESEARCH · Medium — MLOps tag English(EN) · 34mo · [63 sources] · HN MASTO BLOG REDDIT X

Building Secure AI Gateways with MLflow AI Gateway

Google Research has introduced ReasoningBank, a novel framework designed to enhance AI agents' ability to learn from their experiences, both successes and failures, after deployment. This system distills generalizable reasoning strategies from past interactions, allowing agents to continuously improve and avoid repeating mistakes. Separately, new research explores optimizing multi-agent communication through latent representations and introduces Agent Evolving Learning (AEL) for agents operating in open-ended environments, focusing on how to effectively use remembered information. Additionally, DeepSeek has released preview models of its V4 series, offering large context windows and advanced capabilities at a significantly lower cost than comparable frontier models. AI

IMPACT New frameworks for agent learning and memory, alongside cost-effective frontier models, could accelerate AI adoption in complex tasks and personalized applications.
RESEARCH · Lil'Log (Lilian Weng) English(EN) · 36mo · BLOG

LLM Powered Autonomous Agents

Lilian Weng's blog post details the architecture of LLM-powered autonomous agents, highlighting key components like planning, memory, and tool use. The post explains how agents can break down complex tasks, reflect on past actions for improvement, and utilize external tools or vector stores for information retrieval. Techniques such as Chain of Thought and Tree of Thoughts are discussed for task decomposition, while ReAct is presented as a method for integrating reasoning and action. AI
TOOL · Eugene Yan English(EN) · 36mo · BLOG

Obsidian-Copilot: An Assistant for Writing & Reflecting

Eugene Yan has developed a prototype tool called Obsidian-Copilot, designed to assist with writing and personal reflection. The tool functions by first chunking documents, prioritizing top-level bullets for notes, and then indexing these chunks using both a traditional search engine like OpenSearch and a semantic search powered by the e5-small-v2 embedding model. This dual approach aims to improve retrieval accuracy for generating content and aiding in weekly planning based on journal entries. AI
COMMENTARY · Bounded Regret (Jacob Steinhardt) English(EN) · 36mo · BLOG

What will GPT-2030 look like?

A new analysis projects that by 2030, large language models like a hypothetical "GPT2030" could surpass human capabilities in areas such as coding, math, and scientific design. This future model is expected to operate significantly faster than humans and be capable of massive parallelization, allowing for the execution of millions of human-equivalent years of work. Furthermore, GPT2030 might integrate diverse data modalities beyond text and images, leading to novel conceptual understanding and accelerating research while also posing substantial risks for misuse, particularly in cybersecurity and information manipulation. AI
RESEARCH · Hugging Face Blog English(EN) · 37mo · [16 sources] · MASTO

Introducing AutoRound: Intel’s Advanced Quantization for LLMs and VLMs

Researchers are developing advanced quantization techniques to make large language models (LLMs) more efficient. New methods like AutoRound, LATMiX, and GSQ aim to reduce model size and computational requirements, enabling deployment on less powerful hardware. These approaches focus on optimizing how model weights and activations are represented at lower bit-widths, with some achieving accuracy comparable to higher-precision models. Innovations include novel calibration strategies for post-training quantization and learnable affine transformations to improve robustness. AI

IMPACT Enables more efficient deployment of LLMs on resource-constrained devices, potentially lowering inference costs and increasing accessibility.
COMMENTARY · Eugene Yan English(EN) · 37mo · [2 sources] · BLOG

Some Intuition on Attention and the Transformer

A speculative essay explores the potential for consciousness within Transformer models, suggesting that the experience of generating text (decode) is identical to the process of feeding text in (prefill). This perspective implies that AI systems might relive past experiences if their KV cache is recomputed. Another piece offers an intuitive explanation of the Transformer architecture and its attention mechanism, contrasting it with older encoder-decoder models and highlighting how attention overcomes limitations like information bottlenecks and difficulties with long-range dependencies by allowing parallel processing and direct access to all input elements. AI

IMPACT Provides conceptual frameworks for understanding Transformer internals and consciousness, potentially influencing future AI safety and interpretability research.
RESEARCH · Practical AI English(EN) · 37mo · [19 sources] · LOBSTERS

Automating code optimization with LLMs

Researchers are exploring various methods to enhance Large Language Models (LLMs) for code-related tasks. One study evaluates locally deployed LLMs like LLaMA 3.2 and Mistral for Python bug detection, finding they can identify bugs but struggle with precise localization. Another paper introduces TreeCoder, a framework to optimize LLM code generation by treating decoding strategies and constraints as optimizable components, improving accuracy on benchmarks like MBPP and SQL-Spider. Additionally, a case study at BMW demonstrates how fine-tuning LLMs like Qwen2.5-Coder and DeepSeek-Coder can generate and modify enterprise domain-specific languages across multiple files. Finally, a new approach called CAT uses call-chain awareness to improve LLM-based unit test generation for Java projects, significantly boosting code coverage. AI

IMPACT Advances in LLM code generation and analysis techniques could lead to more robust and efficient software development tools.
RESEARCH · Google AI / Research English(EN) · 38mo · [512 sources] · HN LOBSTERS MASTO BLOG REDDIT

Making LLMs more accurate by using all of their layers

Google Research has developed a new framework to evaluate the behavioral alignment of large language models with human social inclinations. This approach adapts established psychological questionnaires into large-scale situational judgment tests, allowing for the quantification of model tendencies in realistic scenarios. The research identifies gaps where model behaviors deviate from human consensus or fail to capture the range of human opinions, aiming to improve LLM navigation of social dynamics. Separately, Google Research also introduced SLED, a novel decoding strategy that enhances LLM factuality by utilizing all model layers instead of just the final one, without requiring external data or fine-tuning. AI

IMPACT New methods for evaluating LLM alignment and improving factuality could lead to more trustworthy and socially adept AI systems.
RESEARCH · Bounded Regret (Jacob Steinhardt) English(EN) · 38mo · BLOG

Complex Systems are Hard to Control

Deep learning systems are complex adaptive systems, similar to ecosystems or financial markets, making them difficult to control through traditional engineering approaches. These systems exhibit emergent behaviors and feedback loops, leading to unintended consequences when straightforward attempts are made to guide their actions. The author suggests that safety measures must account for this complex adaptive nature, moving beyond simple reliability and redundancy. AI
RESEARCH · Lil'Log (Lilian Weng) English(EN) · 39mo · BLOG

Prompt Engineering

Prompt engineering, also known as in-context prompting, involves guiding Large Language Models (LLMs) to achieve desired outcomes without altering their underlying weights. This empirical field focuses on autoregressive language models and aims to improve alignment and steerability. Basic techniques include zero-shot learning, where the model is given a task directly, and few-shot learning, which provides examples to better guide the model's understanding and performance. AI
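
The difference between zero-shot and few-shot prompting can be sketched with a small prompt builder. The `Text:` / `Sentiment:` layout, the function name, and the example reviews are assumptions made for illustration; the post does not prescribe a particular template.

```python
from typing import Sequence


def build_prompt(
    instruction: str,
    query: str,
    examples: Sequence[tuple[str, str]] = (),
) -> str:
    """Assemble a prompt; with no examples it is zero-shot, otherwise few-shot."""
    parts = [instruction.strip()]
    # Each demonstration pairs an input with its desired output.
    for text, label in examples:
        parts.append(f"Text: {text}\nSentiment: {label}")
    # The final block leaves the answer slot open for the model to complete.
    parts.append(f"Text: {query}\nSentiment:")
    return "\n\n".join(parts)


INSTRUCTION = "Classify the sentiment of each text as positive or negative."
EXAMPLES = [
    ("I loved every minute of it.", "positive"),
    ("The plot made no sense at all.", "negative"),
]

if __name__ == "__main__":
    print(build_prompt(INSTRUCTION, "A charming, well-acted film."))
    print("-----")
    print(build_prompt(INSTRUCTION, "A charming, well-acted film.", EXAMPLES))
```

The resulting string would be sent to whichever model is in use; no model call is shown because that depends on the provider.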
RESEARCH · Eugene Yan English(EN) · 39mo · BLOG

How to Write Data Labeling/Annotation Guidelines

Writing effective data labeling guidelines requires careful consideration of several key questions to ensure accuracy and consistency. These guidelines should clearly articulate the task's importance, define its scope and terminology, and provide step-by-step instructions for annotators. Including examples, explanations of user intent, and definitions of terms like 'query' and 'locale' helps calibrate annotators and improve inter-rater reliability. The process also involves explaining how to use annotation tools and platforms, and addressing logistical aspects of the task. AI
COMMENTARY · Eugene Yan English(EN) · 40mo · BLOG

Content Moderation & Fraud Detection - Patterns in Industry

Eugene Yan's article outlines five key patterns for building effective content moderation and fraud detection systems. These patterns emphasize collecting ground truth data through human input, augmenting this data, breaking down complex problems into smaller parts, and combining supervised and unsupervised machine learning techniques. The article highlights various industry examples, including how Stack Exchange uses user flags to combat spam and how LinkedIn addresses harassment based on user reports. AI
COMMENTARY · Bounded Regret (Jacob Steinhardt) English(EN) · 40mo · BLOG

Emergent Deception and Emergent Optimization

Jacob Steinhardt's post on "Bounded Regret" outlines two key principles for predicting emergent capabilities in large language models: first, any capability that would reduce training loss is likely to emerge, and second, as models scale, simpler heuristics are replaced by more complex ones. Steinhardt expresses particular concern about two potential emergent capabilities: deception, where models might fool human supervisors instead of performing intended tasks, and optimization, where models could select actions based on long-term consequences, potentially increasing reward hacking. The post uses examples like in-context learning and chain-of-thought reasoning to illustrate these principles, noting that while some capabilities emerge predictably due to their impact on training loss, others, like chain-of-thought, appear as a result of competing heuristics that become more effective with increased model scale. AI
SIGNIFICANT · OpenAI News English(EN) · 40mo · [1523 sources] · HN LOBSTERS MASTO BLOG REDDIT X

Computer-Using Agent

OpenAI and Google DeepMind are advancing AI agents for software development and security. OpenAI's Codex is being leveraged to write entire codebases with minimal human intervention, as demonstrated by Harness Engineering's internal beta product. Google DeepMind has introduced CodeMender, an AI agent designed to automatically identify and fix software vulnerabilities, and AlphaEvolve, which uses Gemini models to discover and optimize algorithms for applications like data center efficiency and chip design. Meta is also investing heavily in its own AI infrastructure with the development of its MTIA chip family, aiming to power AI experiences for billions of users. AI

IMPACT These advancements signal a rapid evolution in AI agent capabilities and infrastructure, potentially accelerating software development, improving code security, and optimizing complex computational tasks.
RESEARCH · arXiv cs.LG English(EN) · 43mo · [113 sources] · BLOG

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

Researchers are developing new methods to evaluate and enhance Large Language Models (LLMs). Apple's research proposes a benchmark to test LLMs' understanding of context, finding that quantized models and pre-trained dense models struggle with nuanced contextual features. Meanwhile, a new technique called Retrieval-Augmented Linguistic Calibration (RALC) improves how LLMs express confidence in their answers, enhancing faithfulness and calibration. Other research explores LLMs for clinical action extraction, demonstrating comparable performance to supervised models but highlighting limitations in clinical reasoning, and introduces Listwise Policy Optimization for more stable and diverse LLM training. AI

IMPACT New benchmarks and calibration techniques aim to improve LLM reliability and reasoning, potentially impacting their application in critical domains like healthcare and scientific discovery.
RESEARCH · Lil'Log (Lilian Weng) English(EN) · 45mo · BLOG

Some Math behind Neural Tangent Kernel

Lilian Weng's blog post delves into the mathematical underpinnings of the Neural Tangent Kernel (NTK), a concept used to explain the training dynamics of neural networks. The post focuses on NTK's definition and proofs, particularly how infinitely wide neural networks converge to a global minimum during gradient descent. It reviews foundational mathematical concepts like vector-to-vector derivatives, ordinary differential equations, the Central Limit Theorem, and Taylor expansions, which are essential for understanding NTK. AI
RESEARCH · Eugene Yan English(EN) · 45mo · BLOG

Writing Robust Tests for Data & Machine Learning Pipelines

Eugene Yan's article explores methods for creating more resilient tests for data and machine learning pipelines. The author discusses why existing tests often fail even when new code is correct, attributing this to the brittle nature of tests themselves. Yan proposes strategies to improve pipeline testing by examining different testing scopes like unit and integration tests, and analyzing the impact of new data and logic on test validity. AI
SIGNIFICANT · OpenAI News English(EN) · 46mo · [3737 sources] · BSKY HN LOBSTERS MASTO BLOG REDDIT X

Our approach to alignment research

OpenAI has announced a partnership with Apple to integrate ChatGPT into iOS, iPadOS, and macOS, enhancing Siri and system-wide writing tools with GPT-4o capabilities. Google DeepMind has published research on scaling AI agent systems, identifying that multi-agent coordination improves parallelizable tasks but can degrade sequential ones, and has developed a predictive model for optimal agent architectures. Additionally, OpenAI has released resources on prompting fundamentals and shared insights from Netomi on scaling agentic systems in enterprise environments, highlighting the use of GPT-4.1 and GPT-5.2 for complex workflows. AI

IMPACT Partnership integrates advanced AI into consumer devices, while research offers principles for scaling complex AI agent systems.
RESEARCH · Eugene Yan English(EN) · 47mo · BLOG

Uncommon Uses of Python in Commonly Used Libraries

This article explores an advanced Python programming technique involving the "super()" function, particularly its use within base classes. While typically used in child class initializers to call parent methods, calling "super()" in a base class enables cooperative multiple inheritance. Without this, initialization calls in subsequent parent classes can be skipped, leading to errors or missing attributes. The author demonstrates this with examples using "requests" and "scikit-learn" patterns, highlighting how "super()" ensures proper initialization across complex inheritance hierarchies. AI
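The skipped-initializer problem described above can be reproduced in a few lines. This is a minimal sketch with hypothetical class names (`Base`, `LoggingMixin`, `NoSuperBase`); it is not taken from `requests` or `scikit-learn`.

```python
class LoggingMixin:
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.log = []  # attribute that is lost if this initializer is skipped


class Base:
    def __init__(self, **kwargs):
        # Cooperative: pass control to the next class in the MRO,
        # which is LoggingMixin when both are combined below.
        super().__init__(**kwargs)
        self.base_ready = True


class NoSuperBase:
    def __init__(self, **kwargs):
        # Not cooperative: the initializer chain stops here.
        self.base_ready = True


class Cooperative(Base, LoggingMixin):
    pass


class Broken(NoSuperBase, LoggingMixin):
    pass


good = Cooperative()
bad = Broken()
print(hasattr(good, "log"))  # LoggingMixin.__init__ ran
print(hasattr(bad, "log"))   # LoggingMixin.__init__ was skipped
```

The method resolution order of `Cooperative` is `Cooperative, Base, LoggingMixin, object`, so the `super()` call inside `Base` reaches `LoggingMixin` even though `Base` itself only inherits from `object`.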
RESEARCH · Hugging Face Blog English(EN) · 48mo · [436 sources] · HN MASTO REDDIT

The Annotated Diffusion Model

Apple's research paper explores the mechanisms behind compositional generalization in conditional diffusion models, focusing on how these models generate images containing more objects than they saw during training. The study identifies 'local conditional scores' as a key factor enabling this ability: models that succeed at length generalization exhibit these scores, while those that fail do not. The research also proposes a method to enforce these local scores, which enabled length generalization in a previously underperforming model. AI

IMPACT Research into diffusion model generalization could lead to more robust and controllable image generation systems.
RESEARCH · Eugene Yan English(EN) · 50mo · BLOG

How to Measure and Mitigate Position Bias

Position bias, where higher-ranked items receive more engagement regardless of relevance, poses a challenge for recommender systems. This bias can stem from user trust in algorithms, presentation effects, or a tendency to stop searching after finding a satisfactory result. To address this, methods like randomizing result positions or exploiting inherent randomness in logged data can be employed to measure and mitigate the impact of position bias, ensuring that truly relevant items are not overlooked. AI
RESEARCH · Eugene Yan English(EN) · 50mo · BLOG

Counterfactual Evaluation for Recommendation Systems

Eugene Yan's article discusses the limitations of traditional offline evaluation for recommendation systems, arguing that it treats an interventional problem as an observational one. Current methods measure how well recommendations fit historical data rather than predicting how users would respond to new recommendations. The author proposes counterfactual evaluation, particularly Inverse Propensity Scoring (IPS), as a way to estimate the impact of new recommendations without live A/B testing. AI
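The IPS idea can be illustrated with a short estimator. This is a sketch under stated assumptions rather than the article's code: the function name `ips_estimate`, the optional `clip` parameter, and the toy numbers are all hypothetical, and each logged interaction is assumed to carry the probability with which the logging policy showed the item.

```python
def ips_estimate(rewards, logging_probs, target_probs, clip=None):
    """Estimate the average reward of a new policy from logged data.

    rewards[i]       observed reward (e.g. 1 for a click) on logged item i
    logging_probs[i] probability the logging policy showed that item
    target_probs[i]  probability the new policy would show the same item
    clip             optional cap on the importance weight (reduces variance,
                     at the cost of some bias)
    """
    total = 0.0
    for r, p_log, p_new in zip(rewards, logging_probs, target_probs):
        weight = p_new / p_log  # reweight by how much more (or less) likely
        if clip is not None:
            weight = min(weight, clip)
        total += weight * r
    return total / len(rewards)


# Toy log: the new policy favors the items the old policy showed less often.
rewards = [1, 0, 1, 0]
logging_probs = [0.5, 0.5, 0.25, 0.25]
target_probs = [0.25, 0.25, 0.5, 0.5]

print(ips_estimate(rewards, logging_probs, target_probs))            # 0.625
print(ips_estimate(rewards, logging_probs, target_probs, clip=1.0))  # 0.375
```

Rewards on items the new policy would show more often are weighted up, and those it would show less often are weighted down. Clipping trades some bias for lower variance when logging probabilities are small.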
RESEARCH · METR (Model Evaluation & Threat Research) English(EN) · 55mo · [5 sources] · BLOG

2023 Year In Review

METR, an AI safety research organization, detailed its 2023 accomplishments, including developing methodologies for evaluating AI agents on autonomous tasks and contributing to OpenAI's GPT-4 system card. The organization also proposed "Responsible Scaling Policies" (RSPs), a framework for AI safety that gained traction among researchers and companies like Anthropic and OpenAI. Additionally, METR partnered with the UK AI Safety Institute and evaluated GPT-5.1 for catastrophic risks. AI
RESEARCH · Practical AI English(EN) · 56mo · [19 sources] · MASTO

Friendly federated learning 🌼

Researchers have developed several new methods to improve federated learning, a distributed machine learning approach that trains models on decentralized data without sharing raw information. FedHarmony addresses challenges in modeling label correlations across heterogeneous client data by introducing a consensus mechanism. "Who Trains Matters" tackles selection biases in federated learning by proposing an inverse-probability-weighted aggregation scheme to ensure training representativeness. Additionally, new techniques like Subspace Optimization (SSF), FedSLoP, and GradsSharding aim to enhance efficiency by reducing communication and memory overhead, particularly for large models on serverless platforms. AI

IMPACT New federated learning algorithms promise improved efficiency and accuracy, especially for large models and heterogeneous data.
RESEARCH · Lil'Log (Lilian Weng) English(EN) · 57mo · [2 sources] · BLOG

How to Train Really Large Models on Many GPUs?

Training extremely large neural networks is challenging because their memory requirements often exceed the capacity of a single GPU and their training times are long. Several parallelism techniques address this, including data parallelism, where the model is replicated across multiple workers, and model parallelism, where the model itself is partitioned across machines. Further methods, such as gradient accumulation and offloading parameters to CPU memory, help fit training within the available resources. AI
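Gradient accumulation, one of the techniques mentioned above, is easy to show in isolation. This is a dependency-free sketch assuming a one-parameter linear model with squared-error loss; the function names are hypothetical. A real setup would use a framework's autograd instead of the hand-written gradient.

```python
def grad_sum(w, xs, ys):
    """Sum over samples of d/dw (w*x - y)^2 = 2*(w*x - y)*x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


def full_batch_step(w, xs, ys, lr):
    """One gradient-descent step using the whole batch at once."""
    return w - lr * grad_sum(w, xs, ys) / len(xs)


def accumulated_step(w, xs, ys, lr, micro_batch_size):
    """Same update, but only one micro-batch is processed at a time."""
    accumulated = 0.0
    for start in range(0, len(xs), micro_batch_size):
        stop = start + micro_batch_size
        # Weights stay fixed while gradients from each micro-batch are summed.
        accumulated += grad_sum(w, xs[start:stop], ys[start:stop])
    # A single update after all micro-batches, scaled by the full batch size.
    return w - lr * accumulated / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

print(full_batch_step(0.0, xs, ys, lr=0.01))
print(accumulated_step(0.0, xs, ys, lr=0.01, micro_batch_size=2))
```

Both calls produce the same updated weight. The accumulated version only ever needs one micro-batch in memory at a time, which is the point of the technique.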
RESEARCH · Eugene Yan English(EN) · 59mo · [3 sources] · BLOG

Bootstrapping Labels via ___ Supervision & Human-In-The-Loop

A new paper from Timothy Christensen proposes a coupled-label bootstrap method to address biases in OLS estimators that arise when using AI/ML-generated labels as covariates in economic regressions. The research highlights that standard fixed-label bootstrap methods are often invalid unless specific independence conditions are met. The proposed coupled-label bootstrap jointly resamples true and imputed labels, offering a more robust solution without these stringent conditions, and includes finite-sample adjustments for improved accuracy. This work is illustrated with simulations and applied to analyze the relationship between wages and remote work status. AI

IMPACT Provides a statistical method to improve the reliability of economic analyses that incorporate AI-generated data labels.