Brief

last 24h

[35/1235] 223 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

TOOL · HN — machine learning stories English(EN) · 12mo

Show HN: Glowstick – type level tensor shapes in stable rust

Glowstick is a new Rust crate designed to enhance tensor manipulation by integrating shape checking directly into the type system. This approach aims to make tensor operations safer and more intuitive, particularly for developers working with machine learning frameworks. The project, currently in its pre-1.0 phase, offers features like dynamic dimension support and improved error messages, with plans to align with ONNX operations. AI

IMPACT Provides a type-safe approach to tensor manipulation in Rust, potentially improving developer experience and reducing errors in ML workflows.
- Rust
- Burn
- Candle
- Tensor
- ONNX
TOOL · HN — machine learning stories English(EN) · 12mo

The Illusion of Thinking: Strengths and Limitations of Reasoning Models

The paper "The Illusion of Thinking" examines the reasoning capabilities and limitations of Large Reasoning Models (LRMs). It uses controllable puzzle environments to analyze the models' reasoning traces as well as their final answers, moving beyond traditional evaluations that focus solely on final-answer accuracy. Experiments revealed that LRMs suffer a complete accuracy collapse at high problem complexities and exhibit a counterintuitive scaling limit: reasoning effort decreases as problems get harder, even though sufficient computational resources remain available. AI

IMPACT Introduces a novel evaluation method for LLMs that probes reasoning capabilities beyond simple accuracy, potentially guiding future model development.
RESEARCH · Hugging Face Daily Papers English(EN) · 12mo · [361 sources]

Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation

Researchers are developing new methods to improve the evaluation and training of large language models (LLMs). One approach, SCOPE, calibrates LLM judges to ensure reliable pairwise evaluations with controlled error rates. Another technique, D3, uses dynamic influence graphs to optimize data scheduling during LLM training by considering sample interactions. Additionally, OBCache offers a principled framework for pruning key-value caches to reduce memory overhead during long-context inference, improving accuracy. AI

IMPACT New research introduces methods for more reliable LLM evaluation, efficient training data scheduling, and optimized inference, potentially improving LLM performance and resource utilization.
- LLMs
- FlashAttention
- PagedAttention
- A100 GPU
- LLM
- Nested WAIT
- Llama-2-7B
- Asteria
- vLLM
- SCICONVBENCH
- KVDrive
- Orca
- Sarathi-Serve
- FasterTransformer
- A100
- LLMEval-Logic
- DeepSeek-R1-Distill-7B
- V* benchmark
- POPE benchmark
- LLaDA2.0-flash
- LLaDA2.0-mini
- TIDE
- Charon
- Frontier
- FT-Dojo
- FT-Agent
- rePIRL
- PALS
- LlamaWeb
- WebGPU
- arXiv
- llama.cpp
- Hermes
- Qwen
- LLaMA
- AxBench
- FEM-Bench
- Gemini 3 Pro
- GPT-5
- SCOPE
- OBCache
- LoRA
- Item Response Theory
- Lean
RESEARCH · arXiv cs.CL English(EN) · 13mo · [53 sources]

FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

Researchers have developed several new methods to accelerate large language model (LLM) inference through speculative decoding. AdaPLD improves retrieval and draft construction by using semantic similarity and branched hypotheses, achieving up to 3.10x speedup. SSSD combines n-gram matching with hardware-aware speculation for up to 2.9x latency reduction without training. D^2SD uses a dual diffusion model and confidence-guided prefix trees to enhance acceptance rates, while TAPS optimizes prefix tree selection for diffusion-drafted decoding, yielding up to 7.9x speedup. KnapSpec treats draft model selection as a knapsack problem to maximize throughput, achieving up to 1.47x speedup, and Vegas uses verification-guided sparse attention for improved decoding throughput. Additionally, LK Losses directly optimize the acceptance rate during training, leading to gains of 8-10% in average acceptance length. AI

IMPACT These advancements in speculative decoding promise significant speedups and efficiency gains for LLM inference, potentially lowering costs and increasing accessibility.
- Qwen3-235B
- FlexDraft
- Graft
- Ollama
- Llama-3-8B
- Llama-3-70B
- GPT-4
- Claude Sonnet
- vLLM
- Speculative Decoding
- EvoSpec
- Qwen3
- Speculative Pipeline Decoding
- Bastion
- LLM
- ToolSpec
- Hugging Face
- LK Losses
- arXiv
- AdaPLD
- D^2SD
- KnapSpec
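The draft-and-verify loop that these speculative decoding methods share can be sketched as follows. This is a minimal greedy illustration, not the algorithm of any paper listed above: `target_model` and `draft_model` are hypothetical toy functions standing in for real networks, and the verification loop here calls the target once per token, whereas real systems check a whole draft in a single batched pass.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # greedy next-token function


def target_model(ctx: List[int]) -> int:
    """Hypothetical 'expensive' model: next token is (sum of last two) mod 7."""
    return (ctx[-1] + ctx[-2]) % 7


def draft_model(ctx: List[int]) -> int:
    """Hypothetical 'cheap' model: agrees with the target unless the last token is 3."""
    guess = (ctx[-1] + ctx[-2]) % 7
    return (guess + 1) % 7 if ctx[-1] == 3 else guess


def greedy_decode(model: Model, prompt: List[int], n: int) -> List[int]:
    """Baseline: generate n tokens one at a time."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n tokens; returns (tokens, number of accepted draft tokens)."""
    out = list(prompt)
    goal = len(prompt) + n
    accepted = 0
    while len(out) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target verifies the proposal left to right.
        for tok in proposal:
            if len(out) >= goal:
                break
            expected = target(out)
            if tok == expected:
                out.append(tok)       # draft token accepted
                accepted += 1
            else:
                out.append(expected)  # first mismatch: take the target's token
                break
    return out, accepted


tokens, accepted = speculative_decode(target_model, draft_model, [1, 2], 20)
print(tokens == greedy_decode(target_model, [1, 2], 20), accepted)
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding with the target; the speedup in practice comes from how many draft tokens are accepted per verification pass.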
RESEARCH · HN — machine learning stories English(EN) · 14mo · [2 sources]

Understanding Aggregate Trends for Apple Intelligence Using Differential Privacy

Apple is advancing research in privacy-preserving machine learning and AI, hosting a workshop to discuss techniques like federated learning and differential privacy. The company is applying these methods to its upcoming Apple Intelligence features, such as Genmoji, Image Playground, and writing tools, to understand usage trends without compromising user data. Apple is also exploring the creation of synthetic data that mimics real user content to improve these features while maintaining strict privacy standards. AI

IMPACT Apple's focus on privacy-preserving AI techniques for Apple Intelligence features may set new standards for user data protection in generative AI.
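For readers unfamiliar with how differential privacy lets an aggregator learn trends without trusting any single report, a minimal sketch using classic randomized response follows. This is a textbook mechanism chosen for illustration, not the mechanism Apple describes; the function names and the simulated 30% usage rate are assumptions.

```python
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Report the truth with probability 1/2, otherwise report a fair coin flip.

    Any single report is deniable: P(yes | truth=yes) = 0.75 and
    P(yes | truth=no) = 0.25, which gives epsilon = ln(3).
    """
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_rate(reports) -> float:
    """Invert the noise: E[observed] = 0.5 * true_rate + 0.25."""
    observed = sum(reports) / len(reports)
    return 2.0 * (observed - 0.25)


rng = random.Random(0)
truths = [i % 10 < 3 for i in range(20000)]  # simulated: 30% of users use a feature
reports = [randomized_response(t, rng) for t in truths]
print(round(estimate_rate(reports), 3))
```

The aggregate estimate converges on the true rate as the number of reports grows, while each individual report remains noisy.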
TOOL · HN — machine learning stories English(EN) · 14mo

SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators

Researchers have developed SeedLM, a novel post-training compression technique for large language models that utilizes pseudo-random generator seeds to encode model weights. This method aims to reduce the high runtime costs associated with LLMs by generating weight matrices on-the-fly during inference, thereby decreasing memory access and improving speed for memory-bound tasks. SeedLM achieves this by trading compute for fewer memory accesses and notably does not require calibration data, generalizing well across diverse tasks and maintaining accuracy comparable to FP16 baselines even at significant compression levels. AI

IMPACT This compression technique could significantly reduce the deployment costs and increase the inference speed of large language models.
- SeedLM
- LLMs
- FP16
- Llama3 70B
- Meta
- IEEE Visualization
- Llama 2
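The idea of storing a seed instead of weights can be illustrated with a heavily simplified sketch. Assumptions to note: Python's `random.Random` stands in for the hardware-friendly generator, each block is fit with a single scalar coefficient rather than several quantized ones, and `basis`, `encode_block`, and `decode_block` are hypothetical names, not SeedLM's API.

```python
import random


def basis(seed: int, n: int):
    """Pseudo-random vector in [-1, 1]^n, fully determined by the seed."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def encode_block(weights, num_seeds: int = 256):
    """Search seeds for the basis vector that best fits the block.

    Returns (seed, coeff): all that needs to be stored for this block.
    """
    best = None
    for seed in range(num_seeds):
        u = basis(seed, len(weights))
        # Least-squares coefficient for approximating weights by coeff * u.
        coeff = sum(w * x for w, x in zip(weights, u)) / sum(x * x for x in u)
        err = sum((w - coeff * x) ** 2 for w, x in zip(weights, u))
        if best is None or err < best[0]:
            best = (err, seed, coeff)
    return best[1], best[2]


def decode_block(seed: int, coeff: float, n: int):
    """Regenerate the approximate weights on the fly from the stored pair."""
    return [coeff * x for x in basis(seed, n)]


block = [2.5 * x for x in basis(37, 8)]  # a block that one seed fits exactly
seed, coeff = encode_block(block)
print(seed, round(coeff, 6))
```

Decoding spends compute (regenerating the basis) to avoid reading the full weights from memory, which is the trade the summary describes.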
TOOL · HN — machine learning stories English(EN) · 14mo

Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)

A developer is creating a versatile OCR pipeline designed to extract structured data from complex educational materials for machine learning training. The system, which supports multilingual text, mathematical formulas, tables, and diagrams, targets 90-95% accuracy on academic datasets. It generates AI-ready outputs in JSON or Markdown, including semantic annotations for visual content, and is built with tools such as the Google Vision API and the OpenAI API. The project's public release has been delayed by the developer's academic commitments but is expected once the system is finalized. AI

IMPACT This tool could streamline the creation of specialized datasets for ML training, particularly in academic and research contexts.
TOOL · HN — machine learning stories English(EN) · 14mo

Show HN: Formal Verification for Machine Learning Models Using Lean 4

A new open-source framework called FormalVerifML has been released, utilizing Lean 4 for the formal verification of machine learning models. This tool aims to provide mathematically rigorous proofs of properties like robustness, fairness, and safety for high-stakes applications. It supports large-scale models, including transformers and vision models, with features for enterprise use and distributed verification. AI

IMPACT Enhances trust and reliability in ML models for critical applications through formal verification.
TOOL · HN — machine learning stories English(EN) · 14mo

Math for Computer Science and Machine Learning [pdf]

This PDF provides a comprehensive overview of the mathematical foundations essential for computer science and machine learning. It covers topics ranging from linear algebra and calculus to probability and statistics, aiming to equip readers with the necessary quantitative skills for advanced study and research in these fields. The material is structured to build a strong theoretical understanding, enabling practitioners to better grasp and develop complex algorithms and models. AI

IMPACT Provides foundational mathematical knowledge crucial for understanding and developing advanced AI models and algorithms.
- University of Pennsylvania
- Math for Computer Science and Machine Learning
TOOL · HN — machine learning stories English(EN) · 15mo

Merlion: A Machine Learning Framework for Time Series Intelligence

Salesforce has released Merlion 2.0, an open-source Python library designed for time series intelligence. The framework offers an end-to-end solution for tasks such as forecasting, anomaly detection, and change point detection. Merlion 2.0 includes a diverse set of models, automated hyperparameter tuning, and practical post-processing rules to enhance model interpretability and reduce false positives. AI

IMPACT Provides a comprehensive toolkit for developing and benchmarking time series models, potentially accelerating adoption in industry.
- pandas
- Salesforce
- Merlion
- Python
- PySpark
TOOL · HN — AI infrastructure stories English(EN) · 15mo

Show HN: Globstar – Open-source static analysis toolkit

DeepSource has open-sourced Globstar, a static analysis toolkit designed for creating custom code quality and security checkers. The toolkit leverages tree-sitter for parsing code and utilizes AI assistants like ChatGPT and Claude to generate complex queries, simplifying the process for developers. Globstar offers both YAML and Go interfaces, supporting over 20 languages with plans to add C/C++ support. AI

IMPACT Simplifies the creation of custom code quality and security checkers by leveraging AI for query generation.
- Comby
- C++
- Globstar
- ChatGPT
- Claude
- Semgrep
- YAML
- tree-sitter
- DeepSource
TOOL · HN — machine learning stories English(EN) · 16mo

Apple Robot Research

Researchers at Apple have developed ELEGNT, a framework for designing robot movements that blend functional task fulfillment with expressive qualities like intention and emotion. Their work, detailed in a recent paper, involved creating a lamp-like robot and a methodology to generate movement sequences that enhance user engagement, particularly in social contexts. A user study confirmed that expression-driven movements were perceived more positively than purely function-driven ones. AI

IMPACT Enhances human-robot interaction by making robots more expressive and engaging, potentially improving user experience in social and task-oriented scenarios.
- Mouli Sivapurapu
- Apple
- ELEGNT
- EMOTION
- ARMADA
- Yuhan Hu
- Peide Huang
- Jian Zhang
- Apple Vision Pro
TOOL · HN — machine learning stories English(EN) · 19mo

When machine learning tells the wrong story

A former MIT student reflects on a hardware security research paper he co-authored, "There’s Always a Bigger Fish: A Clarifying Analysis of a Machine-Learning-Assisted Side-Channel Attack." The paper, which demonstrated a machine-learning-assisted side-channel attack executable in web browsers and highlighted how system interrupts can leak user information, has received significant awards. The author discusses the challenges of writing about the research, particularly the dual narrative of ML's potential for attacks and its frequent misapplication, and how the project profoundly influenced his academic and personal path. AI

IMPACT Highlights potential vulnerabilities in web browsers through machine learning-assisted attacks, underscoring the need for careful application of ML in security.
- MIT
- ISCA
- Jules Drean
- Intel
- IEEE Micro
- Mengjia Yan
- Hacker News
TOOL · HN — machine learning stories English(EN) · 19mo

AI for real-time fusion plasma behavior prediction and manipulation

Researchers are developing AI models to predict and control the behavior of fusion plasma in real-time. These models aim to optimize the process of achieving stable fusion reactions, which is crucial for developing clean energy sources. The project utilizes machine learning techniques to analyze complex plasma dynamics and enable precise manipulation. AI

IMPACT Potential to accelerate fusion energy development by enabling real-time control of plasma.
- Princeton University
- tokamak
TOOL · HN — machine learning stories English(EN) · 20mo

Machine learning and information theory concepts towards an AI Mathematician

This paper explores the gap between current AI's language capabilities and its mathematical reasoning abilities. It proposes an information-theoretical approach to developing an AI mathematician, focusing on discovering new conjectures rather than proving existing theorems. The core idea is that a valuable set of theorems should efficiently summarize provable statements and be closely related to many of them. AI

IMPACT Proposes a novel framework for AI mathematical reasoning, potentially advancing AI's capabilities beyond language tasks.
- arXiv
- Hugging Face
TOOL · HN — machine learning stories English(EN) · 21mo

Machine Learning Model Homotopy

The concept of model homotopy, applying topological ideas to machine learning, suggests that a single model may not fully capture a modeling situation. Instead, a trajectory of fits, parameterized continuously by weights, can offer a richer understanding. This approach can reveal counter-intuitive behaviors, such as linear regression coefficients changing signs multiple times as variables are added, challenging the intuition that coefficients would smoothly interpolate. AI

IMPACT Introduces a novel theoretical framework for understanding model behavior and parameter sensitivity.
- Lasso
- Topological Data Analysis
SIGNIFICANT · HN — AI infrastructure stories English(EN) · 21mo

Launch HN: Silurian (YC S24) – Simulate the Earth

Silurian, a startup founded by former Microsoft researchers, has launched Generative Forecasting Transformer (GFT), a 1.5 billion parameter model designed to simulate Earth's weather up to 14 days in advance. This deep learning model, which learns purely from data without explicit physics, has demonstrated strong performance in predicting hurricane tracks, outperforming traditional forecasting methods. The company aims to expand its simulations to model other weather-impacted infrastructure like energy grids and agriculture. AI

IMPACT This new weather simulation model could significantly improve forecasting accuracy and lead to better infrastructure planning.
- GFT
- Google DeepMind
- Microsoft
- Silurian
- NVIDIA
- Generative Forecasting Transformer
- NeuralGCM
- WeatherBench
- ECMWF
- Aurora
- ClimaX
- Huawei
TOOL · HN — machine learning stories (HR) · 21mo

Micrograd.jl

This article introduces Micrograd.jl, a new automatic differentiation (AD) package for the Julia programming language. It aims to fill a gap in comprehensive tutorials for AD in Julia and assumes a solid understanding of both Julia and calculus. The package is built upon Zygote.jl and ChainRules.jl, offering a different approach to AD compared to Python frameworks like PyTorch by leveraging Julia's functional programming and metaprogramming capabilities. AI

IMPACT Provides a new tool for Julia developers to build and train machine learning models, potentially improving efficiency and understanding of backpropagation.
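For context on what a micrograd-style package does, here is a minimal reverse-mode automatic differentiation sketch. It is written in Python rather than Julia and is not Micrograd.jl's API; the `Value` class and its methods are illustrative assumptions covering only addition and multiplication.

```python
class Value:
    """A scalar that records how it was computed so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward_fn = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward_fn = _backward
        return out

    def backward(self):
        """Apply the chain rule over the graph in reverse topological order."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward_fn()


x, y = Value(3.0), Value(4.0)
z = x * y + x
z.backward()
print(z.data, x.grad, y.grad)  # 15.0 5.0 3.0
```

Gradients accumulate with `+=` so that a value used more than once, like `x` here, receives the sum of contributions from every path.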
COMMENTARY · HN — machine learning stories English(EN) · 22mo

The reanimation of pseudoscience in machine learning

A recent article in Patterns argues that the machine learning field is experiencing a resurgence of pseudoscience, particularly in areas like consciousness and general intelligence. The authors express concern that the field's rapid growth and the pressure to publish may be leading to a decline in rigorous scientific standards. They call for a renewed focus on empirical evidence and falsifiable hypotheses to maintain the integrity of machine learning research. AI

IMPACT Raises concerns about the scientific rigor and potential for pseudoscience within the machine learning research community.
- Patterns
- machine learning
RESEARCH · arXiv cs.LG English(EN) · 23mo · [2 sources]

Sequential Learning and Catastrophic Forgetting in Differentiable Resistor Networks

Researchers have developed a novel analog network of resistors capable of performing machine learning tasks without a traditional processor. This system, based on transistors, can learn and adapt to new tasks, demonstrating potential for highly energy-efficient computation. While currently a prototype, the technology shows promise for applications in edge devices and could eventually outperform conventional digital processors for specific machine learning workloads. AI

IMPACT This research could lead to more energy-efficient AI hardware, particularly for edge computing applications.
RESEARCH · HN — machine learning stories English(EN) · 24mo · [2 sources]

Apple's On-Device and Server Foundation Models

Apple has detailed its new foundation language models powering Apple Intelligence, including a ~3 billion parameter on-device model and a larger server-based model. These models are designed for multilingual and multimodal tasks, supporting image understanding and tool execution. The company emphasizes its Responsible AI approach, focusing on user privacy through innovations like Private Cloud Compute and on-device processing, ensuring user data is not used for training. AI

IMPACT Apple's detailed technical report on its foundation models may influence the development of efficient on-device and specialized server-based AI systems.
- iOS 18
- iPadOS 18
- macOS Sequoia
- Private Cloud Compute
- AXLearn
- JAX
- Apple
- Apple Intelligence
- XLA
TOOL · HN — machine learning stories English(EN) · 24mo

What kind of bug would make machine learning suddenly 40% worse at NetHack?

Researchers Bartłomiej Cupiał and Maciej Wołczyk observed a significant performance drop in their neural network trained to play NetHack. The model, which had been consistently scoring around 5,000 points, suddenly began scoring only 3,000 points, a 40% decrease. Despite extensive troubleshooting, including code reversion, software stack restoration, and rebuilding the entire system from scratch, the performance issue persisted. AI

IMPACT Highlights potential fragility in reinforcement learning models and the challenges of diagnosing performance regressions.
TOOL · HN — machine learning stories Deutsch(DE) · 25mo

Understanding Stein's Paradox (2021)

Stein's paradox is a counterintuitive statistical result: when estimating the mean of a Gaussian distribution from a single drawn sample in three or more dimensions, the sample itself is not the best estimate. The James-Stein estimator, which shrinks the sample toward the origin by a factor that depends on the sample's squared magnitude and the dimensionality, outperforms the naive approach in terms of mean squared error. This paradox challenges conventional statistical intuition, particularly regarding parameter estimation in higher-dimensional spaces. AI
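A small simulation makes the claim concrete. The sketch below assumes unit-variance Gaussian noise and uses the standard James-Stein form, which scales the sample x by 1 - (d - 2) / ||x||^2; the function names and the choice of true mean are illustrative.

```python
import random


def james_stein(x, sigma2: float = 1.0):
    """Shrink the observation toward the origin (requires len(x) >= 3)."""
    d = len(x)
    norm2 = sum(v * v for v in x)
    shrink = 1.0 - (d - 2) * sigma2 / norm2
    return [shrink * v for v in x]


def compare_risk(theta, trials: int = 2000, seed: int = 0):
    """Average squared error of the raw sample vs. the James-Stein estimate."""
    rng = random.Random(seed)
    naive_total = js_total = 0.0
    for _ in range(trials):
        x = [rng.gauss(t, 1.0) for t in theta]  # one noisy draw around theta
        est = james_stein(x)
        naive_total += sum((a - t) ** 2 for a, t in zip(x, theta))
        js_total += sum((a - t) ** 2 for a, t in zip(est, theta))
    return naive_total / trials, js_total / trials


naive_mse, js_mse = compare_risk([1.0] * 10)
print(naive_mse > js_mse)  # True
```

The naive estimator's expected squared error equals the dimension (10 here), while the shrunken estimate comes in lower on average even though the coordinates are independent.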
RESEARCH · HN — machine learning stories English(EN) · 26mo · [21 sources]

A Visual Introduction to Machine Learning (2015)

This collection of resources offers a broad overview of machine learning, from foundational concepts and visual introductions to theoretical underpinnings and practical applications. It includes a visual guide to classification tasks, a discussion on the science and ethics of machine learning benchmarks, and pointers to comprehensive textbooks and course materials. Additionally, it highlights tools for interpretable machine learning and the engineering practices required for deploying models in production. AI

IMPACT Provides foundational knowledge and practical tools for understanding, developing, and deploying machine learning models.
RESEARCH · HN — AI infrastructure stories Română(RO) · 26mo · [2 sources]

1-Bit AI Infrastructure

Researchers have developed bitnet.cpp, a software stack for fast and lossless inference of 1-bit Large Language Models (LLMs) such as BitNet b1.58 on CPUs. It achieves speedups of 2.37x to 6.17x on x86 CPUs and 1.37x to 5.07x on ARM CPUs, depending on model size. The goal is to make LLMs more efficient and deployable on a wider range of devices. AI

IMPACT Enables more efficient and widespread deployment of LLMs on consumer hardware.
- BitNet
- ARM CPUs
- LLMs
- Shaoguang Mao
- x86 CPUs
- bitnet.cpp
- BitNet b1.58
TOOL · HN — machine learning stories English(EN) · 27mo

Opus 1.5 released: Opus gets a machine learning upgrade

The Opus 1.5 audio codec has been released with significant machine learning enhancements, marking the first time Opus uses deep learning to process the audio signal itself. These new ML-based features, including improved packet loss concealment (PLC) and a novel redundancy transmission method, are designed to be fully compatible with older versions and optimized to run efficiently on standard CPUs. The ML features are disabled by default and require specific compile-time and run-time flags to activate; once enabled, most users won't notice the performance impact. AI

IMPACT Enhances audio codec resilience to packet loss and improves redundancy, potentially improving real-time communication quality.
TOOL · HN — machine learning stories English(EN) · 27mo

Where is Noether's principle in machine learning?

This research paper explores the applicability of Noether's principle, a fundamental concept in physics linking symmetries to conservation laws, within the domain of machine learning. The authors investigate whether similar principles of invariance and conserved quantities can be identified in discrete machine learning processes, such as the training of neural networks. While acknowledging the potential for such connections, the paper suggests that directly applying Noether's theorem to machine learning is complex and not yet fully understood. AI

IMPACT Explores theoretical underpinnings that could lead to new optimization techniques or model architectures.
RESEARCH · Medium — MLOps tag English(EN) · 34mo · [63 sources]

Building Secure AI Gateways with MLflow AI Gateway

Google Research has introduced ReasoningBank, a novel framework designed to enhance AI agents' ability to learn from their experiences, both successes and failures, after deployment. This system distills generalizable reasoning strategies from past interactions, allowing agents to continuously improve and avoid repeating mistakes. Separately, new research explores optimizing multi-agent communication through latent representations and introduces Agent Evolving Learning (AEL) for agents operating in open-ended environments, focusing on how to effectively use remembered information. Additionally, DeepSeek has released preview models of its V4 series, offering large context windows and advanced capabilities at a significantly lower cost than comparable frontier models. AI

IMPACT New frameworks for agent learning and memory, alongside cost-effective frontier models, could accelerate AI adoption in complex tasks and personalized applications.
- MLflow
- Claude Opus 4.7
- OpenRouter
- MLflow AI Gateway
- LiteLLM
- OpenAI
- Anthropic
- Gemini
- GPT-5.5
- Portkey
- LLM
- Google
- ReasoningBank
- DeepSeek
- DeepSeek-V4-Pro
- DeepSeek-V4-Flash
- AI agents
- Hugging Face
- Nemobot
- DiffMAS
- Agent Evolving Learning (AEL)
- AgenticQwen
- Memora
RESEARCH · Google AI / Research English(EN) · 38mo · [475 sources]

Making LLMs more accurate by using all of their layers

Google Research has developed a new framework to evaluate the behavioral alignment of large language models with human social inclinations. This approach adapts established psychological questionnaires into large-scale situational judgment tests, allowing for the quantification of model tendencies in realistic scenarios. The research identifies gaps where model behaviors deviate from human consensus or fail to capture the range of human opinions, aiming to improve LLM navigation of social dynamics. Separately, Google Research also introduced SLED, a novel decoding strategy that enhances LLM factuality by utilizing all model layers instead of just the final one, without requiring external data or fine-tuning. AI

IMPACT New methods for evaluating LLM alignment and improving factuality could lead to more trustworthy and socially adept AI systems.
- Google Research
- ERQ
- Situational Judgment Tests
- IRI
- NeurIPS 2024
- SLED
- LLMs
- CodeGemma
- GitHub
SIGNIFICANT · OpenAI News English(EN) · 40mo · [1394 sources]

Computer-Using Agent

OpenAI and Google DeepMind are advancing AI agents for software development and security. OpenAI's Codex is being leveraged to write entire codebases with minimal human intervention, as demonstrated by Harness Engineering's internal beta product. Google DeepMind has introduced CodeMender, an AI agent designed to automatically identify and fix software vulnerabilities, and AlphaEvolve, which uses Gemini models to discover and optimize algorithms for applications like data center efficiency and chip design. Meta is also investing heavily in its own AI infrastructure with the development of its MTIA chip family, aiming to power AI experiences for billions of users. AI

IMPACT These advancements signal a rapid evolution in AI agent capabilities and infrastructure, potentially accelerating software development, improving code security, and optimizing complex computational tasks.
SIGNIFICANT · OpenAI News English(EN) · 46mo · [3619 sources]

Our approach to alignment research

OpenAI has announced a partnership with Apple to integrate ChatGPT into iOS, iPadOS, and macOS, enhancing Siri and system-wide writing tools with GPT-4o capabilities. Google DeepMind has published research on scaling AI agent systems, identifying that multi-agent coordination improves parallelizable tasks but can degrade sequential ones, and has developed a predictive model for optimal agent architectures. Additionally, OpenAI has released resources on prompting fundamentals and shared insights from Netomi on scaling agentic systems in enterprise environments, highlighting the use of GPT-4.1 and GPT-5.2 for complex workflows. AI

IMPACT Partnership integrates advanced AI into consumer devices, while research offers principles for scaling complex AI agent systems.
- OpenAI
- Sundar Pichai
- Koray Kavukcuoglu
- Mythos Preview
- Anthropic
- CodeMender
- Google
- GPT-4.1
- ChatGPT
- Netomi
- GPT-5.2
- Apple
- Siri
- GPT-4o
- Google DeepMind
- AI agent systems
RESEARCH · Hugging Face Blog English(EN) · 48mo · [405 sources]

The Annotated Diffusion Model

Apple's research paper explores the mechanisms behind compositional generalization in conditional diffusion models, particularly focusing on how these models handle generating images with more objects than trained on. The study identifies 'local conditional scores' as a key factor enabling this ability, demonstrating that models succeeding at length generalization exhibit these scores, while those that fail do not. The research also proposes a method to enforce these local scores, which successfully enabled length generalization in a previously underperforming model. AI

IMPACT Research into diffusion model generalization could lead to more robust and controllable image generation systems.
RESEARCH · 量子位 (QbitAI) 中文(ZH) · 71mo · [190 sources]

Secured 70 billion yuan in funding! DeepSeek Code is really coming, ACM gold medalist Cui Tianyi is in charge

New research explores the challenges and advancements in AI-native code generation, focusing on improving efficiency, reliability, and safety. Papers introduce novel architectures like MicroSkill for better context management and modular knowledge encapsulation, reducing token consumption and increasing compilation success rates. Other studies benchmark coding agents' performance on complex tasks, including their ability to handle underspecified user intent and detect potential sabotage, highlighting the need for human-centric safety mechanisms and robust evaluation frameworks. AI

IMPACT New benchmarks and architectures are pushing the boundaries of AI coding agents, addressing efficiency, safety, and complex task handling.
- Udemy
- Claude Code
- GitHub Copilot
- Cursor
- Codex
- Replit
- Replit Agent
- DeepSeek
- DeepSeek Code
- Cui Tianyi
- Python
- TSY Capital
- Anthropic
- Agent Harness
- OpenAI
- OpenAI Codex
- MiniMax-M2.7
- MicroSkill Architecture
- Asuka-Bench
- Gemini-3.1-Pro
- GPT-5.4
- Claude-Opus-4.6
- AI-native code generation
- TensorBench
- SABER
RESEARCH · OpenAI News English(EN) · 91mo · [1013 sources]

Better language models and their implications

Google DeepMind has introduced the FACTS Benchmark Suite, a new set of evaluations designed to systematically measure the factuality of large language models across various use cases. This suite includes benchmarks for parametric knowledge, search-based information retrieval, and multimodal understanding, alongside an updated grounding benchmark. The initiative aims to provide a more comprehensive understanding of LLM factuality and drive industry-wide improvements in accuracy and trustworthiness. AI

IMPACT Provides new evaluation tools to drive progress in LLM factuality and reduce hallucinations.
RESEARCH · OpenAI News English(EN) · 122mo · [741 sources]

RL²: Fast reinforcement learning via slow reinforcement learning

OpenAI has published a series of research papers detailing advancements in reinforcement learning. These include achieving superhuman performance in Dota 2 with OpenAI Five, developing benchmarks for safe exploration in RL, and quantifying generalization capabilities with the CoinRun environment. The company also explored novel methods like prediction-based rewards for curiosity-driven exploration, learning policy representations in multiagent systems, and an experimental metalearning approach called Evolved Policy Gradients for faster training on new tasks. Further research addresses variance reduction in policy gradients and the equivalence between policy gradients and soft Q-learning, alongside challenging robotics environments for multi-goal RL. AI

IMPACT Demonstrates significant progress in RL capabilities, including superhuman performance, safety, generalization, and exploration, pushing the boundaries of AI.