PulseAugur / Pulse

Pulse

last 48h
[50/2011] 98 sources

What AI is actually talking about — clusters surfacing on Bluesky, Reddit, HN, Mastodon and Lobsters, re-ranked to elevate originality and crush noise.

  1. Important machine learning equations

    A new guide compiles essential machine learning equations, focusing on their practical application and mathematical foundations. It covers key concepts from information theory, linear algebra, and optimization, including detailed explanations and Python implementations for entropy, cross-entropy, and KL divergence. The resource aims to serve as a handy reference for practitioners, drawing from frequently used formulas and including sections on neural network fundamentals and loss functions. AI

    IMPACT Provides a practical reference for core mathematical concepts used in machine learning model development.
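
    The three information-theory quantities named in the summary can be sketched in a few lines of Python. This is a minimal illustration using the standard definitions in nats, not the guide's own implementation; the function names and the example distributions `p` and `q` are assumptions made for this sketch.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats; terms with p_i = 0 contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL divergence D_KL(p || q) = H(p, q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)

# Hypothetical example distributions over two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]
print(entropy(p))            # equals ln 2 for a fair coin
print(kl_divergence(p, q))   # positive because p and q differ
print(kl_divergence(p, p))   # zero when the distributions match
```

    Writing KL divergence as cross-entropy minus entropy makes the relationship between the three quantities explicit: minimizing cross-entropy against a fixed `p` is the same as minimizing the KL divergence.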

  2. Dyna – Logic Programming for Machine Learning

    Dyna is a new programming language designed for machine learning researchers, aiming to bridge the gap between mathematical concepts and executable code. It builds upon logic programming paradigms like Datalog and Prolog, introducing features such as flexible execution orders and weighted rules. This allows for concise expression of complex algorithms, including matrix multiplication, the Fibonacci sequence, and neural networks, with minimal code. AI

    IMPACT Potentially streamlines the development cycle for ML algorithms by reducing the distance between mathematical notation and code.
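
    To give a feel for what a weighted rule means, the sketch below evaluates a matrix-multiplication rule in plain Python. The rule text in the comment is recalled from Dyna's published examples and should be treated as an assumption, and `matmul_rule` is a hypothetical helper that says nothing about how Dyna itself executes rules.

```python
from collections import defaultdict

# Dyna-style rule (syntax is an assumption): a(I,K) += b(I,J) * c(J,K).
def matmul_rule(b, c):
    """Evaluate the rule over sparse facts stored as {(row, col): value} dicts."""
    a = defaultdict(float)
    for (i, j), b_val in b.items():
        for (j2, k), c_val in c.items():
            if j == j2:                      # the shared variable J joins the two facts
                a[(i, k)] += b_val * c_val   # "+=" aggregates every matching instantiation
    return dict(a)

# Hypothetical 2x2 inputs.
b = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
c = {(0, 0): 5.0, (0, 1): 6.0, (1, 0): 7.0, (1, 1): 8.0}
a = matmul_rule(b, c)
print(a[(0, 0)], a[(1, 1)])
```

    The single rule replaces the usual triple loop: the join on `J` and the `+=` aggregation together define the product, which is the kind of closeness to mathematical notation the summary describes.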

  3. Unlocking dependable responses with Gemini Enterprise Agent Platform’s Agentic RAG

    Researchers are developing advanced agent frameworks to improve AI reliability and efficiency across various domains. Google introduced an agentic RAG system that enhances enterprise query handling by iteratively searching for complete context, boosting accuracy by up to 34%. Hugging Face demonstrated a multi-agent economy simulation using a small 3B model, highlighting the trade-offs between model size and real-time performance. Other research explores methods for reliable tool use, regulatory compliance through agent-to-agent protocols, dynamic benchmarking for agent behavior, and robust self-evolution mechanisms for AI agents. AI

    IMPACT New agentic frameworks and evaluation methods promise more reliable, efficient, and compliant AI systems across enterprise, simulation, and regulatory domains.

  4. Solving a Childhood Mystery: How BASIC Games Learned to Win

    A programmer explores a childhood mystery surrounding the source code for a BASIC game called Hexapawn. This game, a simplified version of chess, was featured in an old programming book. The author delves into the game's DATA statements, which initially appeared as incomprehensible sequences of numbers, and seeks clarification from Claude.ai to understand their function within the game's logic. AI

    IMPACT Explores historical game AI, offering insights into early algorithmic approaches.

  5. Springer Nature book on machine learning is full of made-up citations

    A newly published machine learning textbook by Springer Nature, titled "Mastering Machine Learning: From Basics to Advanced," has been found to contain numerous fabricated citations. An investigation revealed that two-thirds of the checked citations were either non-existent or contained significant errors, with some researchers confirming they did not author the cited works. The publisher is currently investigating the matter, and the book's author has not confirmed whether an AI tool was used in its creation, though the nature of the errors is characteristic of LLM-generated content. AI

    IMPACT Highlights the ongoing challenge of AI-generated misinformation and the need for robust editorial oversight in publishing.

  6. Against "Brain Damage"

    Ethan Mollick's "One Useful Thing" newsletter addresses the growing concern about AI's impact on human cognition, particularly the idea of "brain damage." He clarifies that a recent MIT study, often misinterpreted, showed reduced engagement and memory retention in students using ChatGPT for essays, but no actual neurological harm. Mollick argues that while AI doesn't cause literal brain damage, over-reliance can hinder learning and critical thinking by allowing users to outsource intellectual work, citing an experiment where students using GPT-4 for homework scored worse on exams. However, he also notes that with proper guidance and pedagogical approaches, AI can be a powerful tool to enhance learning outcomes. AI

  7. Qwen3.6-35B-A3B: Agentic Coding Power, Now Open to All

    Multiple research papers released on arXiv explore advancements in AI agents, focusing on improving their reasoning, memory, and training efficiency. Qwen3.6-35B-A3B, an open-source sparse MoE model, demonstrates strong agentic coding capabilities. Other studies introduce methods for better skill presentation, long-context reasoning through RL, skill reuse as compression, and adaptive context management for agents tackling complex, long-horizon tasks. Additionally, research presents AutoSci, a system for automating the scientific research lifecycle, and PithTrain, a compact training framework for MoE models designed for agent-native development. AI

    IMPACT Advances in agent capabilities, memory management, and training efficiency could accelerate the development of more sophisticated AI systems.

  8. Normalizing Flows Are Capable Generative Models

    Researchers have developed a new generative modeling framework utilizing cumulative flow maps for long-range transport in probability space. This approach aims to connect local updates with finite-time transport, allowing generative models to reason about global state transitions. The framework supports few-step and even one-step generation with minimal changes to existing models and no increase in capacity, demonstrating effectiveness across various tasks like image and SDF generation with reduced inference costs. AI

    IMPACT Introduces novel generative modeling techniques that could lead to more efficient and capable AI systems for various synthesis tasks.

  9. Show HN: Glowstick – type level tensor shapes in stable rust

    Glowstick is a new Rust crate designed to enhance tensor manipulation by integrating shape checking directly into the type system. This approach aims to make tensor operations safer and more intuitive, particularly for developers working with machine learning frameworks. The project, currently in its pre-1.0 phase, offers features like dynamic dimension support and improved error messages, with plans to align with ONNX operations. AI

    IMPACT Provides a type-safe approach to tensor manipulation in Rust, potentially improving developer experience and reducing errors in ML workflows.

  10. The Illusion of Thinking: Strengths and Limitations of Reasoning Models

    Researchers have introduced a new framework called "The Illusion of Thinking" to better understand the reasoning capabilities and limitations of Large Reasoning Models (LRMs). This framework utilizes controllable puzzle environments to analyze the internal reasoning traces of LRMs, moving beyond traditional evaluations that focus solely on final answer accuracy. Experiments revealed that LRMs experience a complete accuracy collapse at high problem complexities and exhibit a peculiar scaling limit where reasoning effort decreases despite sufficient computational resources. AI

    IMPACT Introduces a novel evaluation method for LLMs that probes reasoning capabilities beyond simple accuracy, potentially guiding future model development.

  11. Understanding and Coding the KV Cache in LLMs from Scratch

    The KV cache is a crucial technique for optimizing the inference speed of Large Language Models (LLMs) in production environments. It works by storing and reusing intermediate key and value computations, thereby avoiding redundant calculations during text generation. While it increases memory requirements and code complexity, the significant inference speed-ups often make it a worthwhile trade-off for deploying LLMs. AI
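
    The idea of storing and reusing key and value computations can be shown with a toy single-head attention layer. This is a minimal pure-Python sketch, not the article's code; the 2-dimensional matrices `Wq`, `Wk`, `Wv` and the token embeddings are made-up values chosen only for illustration.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over lists of key/value vectors."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[i] for e, v in zip(exps, values)) for i in range(len(q))]

# Toy projection matrices and token embeddings (assumed values).
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[0.5, 0.1], [0.2, 0.3]]
Wv = [[1.0, 2.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# With a cache: project each new token once and append its key and value.
k_cache, v_cache, cached_out = [], [], []
for x in tokens:
    k_cache.append(matvec(Wk, x))
    v_cache.append(matvec(Wv, x))
    cached_out.append(attend(matvec(Wq, x), k_cache, v_cache))

# Without a cache: re-project the whole prefix at every step.
naive_out = []
for t in range(len(tokens)):
    prefix = tokens[: t + 1]
    keys = [matvec(Wk, x) for x in prefix]
    values = [matvec(Wv, x) for x in prefix]
    naive_out.append(attend(matvec(Wq, tokens[t]), keys, values))

print(cached_out == naive_out)
```

    Both loops produce the same outputs, but the cached loop performs one key and one value projection per token, while the uncached loop repeats them for every earlier token at each step. The cost is the memory held by `k_cache` and `v_cache`, which grows with sequence length.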

  12. Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation

    Researchers are developing new methods to improve the evaluation and training of large language models (LLMs). One approach, SCOPE, calibrates LLM judges to ensure reliable pairwise evaluations with controlled error rates. Another technique, D3, uses dynamic influence graphs to optimize data scheduling during LLM training by considering sample interactions. Additionally, OBCache offers a principled framework for pruning key-value caches to reduce memory overhead during long-context inference, improving accuracy. AI

    IMPACT New research introduces methods for more reliable LLM evaluation, efficient training data scheduling, and optimized inference, potentially improving LLM performance and resource utilization.

  13. FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

    Researchers have developed several new methods to accelerate large language model (LLM) inference through speculative decoding. AdaPLD improves retrieval and draft construction by using semantic similarity and branched hypotheses, achieving up to 3.10x speedup. SSSD combines n-gram matching with hardware-aware speculation for up to 2.9x latency reduction without training. D^2SD uses a dual diffusion model and confidence-guided prefix trees to enhance acceptance rates, while TAPS optimizes prefix tree selection for diffusion-drafted decoding, yielding up to 7.9x speedup. KnapSpec treats draft model selection as a knapsack problem to maximize throughput, achieving up to 1.47x speedup, and Vegas uses verification-guided sparse attention for improved decoding throughput. Additionally, LK Losses directly optimize the acceptance rate during training, leading to gains of 8-10% in average acceptance length. AI

    IMPACT These advancements in speculative decoding promise significant speedups and efficiency gains for LLM inference, potentially lowering costs and increasing accessibility.
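
    The draft-then-verify loop that these methods build on can be sketched with two toy models. This is a simplified greedy variant written for illustration only; it is not the algorithm of any paper listed above, and `target_next`, `draft_next`, and the integer "tokens" are hypothetical stand-ins for real language models.

```python
def target_next(seq):
    """Toy 'large' model: next token is the sum of the last two tokens mod 10."""
    return (seq[-1] + seq[-2]) % 10

def draft_next(seq):
    """Toy 'small' model: agrees with the target except when the last token is 5."""
    return 0 if seq[-1] == 5 else (seq[-1] + seq[-2]) % 10

def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq

def speculative_decode(prompt, n_new, k=4):
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < n_new:
        # 1. The draft model cheaply proposes k tokens.
        draft = list(seq)
        for _ in range(k):
            draft.append(draft_next(draft))
        # 2. The target checks all k positions (one batched pass in a real system).
        target_passes += 1
        for i in range(len(seq), len(draft)):
            expected = target_next(draft[:i])
            if draft[i] != expected:
                # 3. First mismatch: keep the target's token, discard the rest.
                draft = draft[:i] + [expected]
                break
        seq = draft
    return seq[: len(prompt) + n_new], target_passes

out, passes = speculative_decode([1, 1], 12)
print(out == greedy_decode([1, 1], 12), passes)
```

    Every kept token is one the target would have chosen itself, so the output matches plain greedy decoding while using fewer verification passes than generated tokens. The speedup in a real system depends on how often the draft is accepted, which is the quantity the acceptance-rate methods in the summary try to raise.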

  14. Why We Think

    Lilian Weng's latest post explores the concept of "thinking time" or test-time computation in large language models. This approach draws an analogy to human cognition, where complex problems require deliberate, slow thinking (System 2) rather than immediate, intuitive responses (System 1). The post details how increasing computation at test time, such as through Chain-of-Thought prompting, allows models to perform more operations and potentially improve accuracy, especially for challenging tasks. Weng also frames this within latent variable modeling, suggesting that methods involving multiple reasoning paths can be viewed as sampling from a posterior distribution. AI

  15. Understanding Aggregate Trends for Apple Intelligence Using Differential Privacy

    Apple is advancing research in privacy-preserving machine learning and AI, hosting a workshop to discuss techniques like federated learning and differential privacy. The company is applying these methods to its upcoming Apple Intelligence features, such as Genmoji, Image Playground, and writing tools, to understand usage trends without compromising user data. Apple is also exploring the creation of synthetic data that mimics real user content to improve these features while maintaining strict privacy standards. AI

    IMPACT Apple's focus on privacy-preserving AI techniques for Apple Intelligence features may set new standards for user data protection in generative AI.
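
    One classic way to learn an aggregate trend without trusting any individual report is randomized response, a simple local differential privacy mechanism. The sketch below is a textbook illustration and is not a description of Apple's system; the privacy parameter `epsilon`, the 30% usage rate, and all function names are assumptions.

```python
import math
import random

def truth_probability(epsilon):
    """Probability of reporting the true bit under epsilon-local differential privacy."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))

def randomize(bit, epsilon, rng):
    """Report the true bit with probability p, otherwise report its opposite."""
    return bit if rng.random() < truth_probability(epsilon) else 1 - bit

def debias(observed_rate, epsilon):
    """Invert the expected distortion: observed = true*p + (1 - true)*(1 - p)."""
    p = truth_probability(epsilon)
    return (observed_rate - (1.0 - p)) / (2.0 * p - 1.0)

rng = random.Random(0)          # fixed seed so the run is repeatable
epsilon = 1.0
true_bits = [1 if rng.random() < 0.3 else 0 for _ in range(100_000)]
reports = [randomize(b, epsilon, rng) for b in true_bits]
estimate = debias(sum(reports) / len(reports), epsilon)
print(round(estimate, 2), round(sum(true_bits) / len(true_bits), 2))
```

    Each individual report is deniable because it may have been flipped, yet the population-level rate can still be recovered after debiasing. Smaller `epsilon` values give stronger privacy and a noisier estimate.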

  16. SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators

    Researchers have developed SeedLM, a post-training compression technique for large language models that encodes model weights as seeds of a pseudo-random generator. The method targets the high runtime cost of LLMs: weight matrices are regenerated on the fly during inference, which trades extra compute for fewer memory accesses and speeds up memory-bound tasks. SeedLM does not require calibration data, generalizes well across diverse tasks, and maintains accuracy comparable to FP16 baselines even at significant compression levels. AI

    IMPACT This compression technique could significantly reduce the deployment costs and increase the inference speed of large language models.
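
    The core trade of storing a seed instead of the weights can be illustrated with a deliberately simplified sketch. It is not the paper's algorithm: Python's `random.Random` stands in for whatever generator SeedLM uses, each block is fitted with a single scale coefficient, and the seed search range `n_seeds` is an arbitrary assumption.

```python
import random

def basis(seed, n):
    """Regenerate a pseudo-random +/-1 vector from a seed (stand-in generator)."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def compress_block(weights, n_seeds=256):
    """Search seeds for the basis vector that, scaled by one coefficient, best fits the block."""
    best = None
    for seed in range(n_seeds):
        u = basis(seed, len(weights))
        # Least-squares scale for a +/-1 basis: (w . u) / (u . u), and u . u = n.
        coeff = sum(w * ui for w, ui in zip(weights, u)) / len(weights)
        err = sum((w - coeff * ui) ** 2 for w, ui in zip(weights, u))
        if best is None or err < best[2]:
            best = (seed, coeff, err)
    return best[0], best[1]

def decompress_block(seed, coeff, n):
    """Rebuild the approximate weights on the fly; only (seed, coeff) was stored."""
    return [coeff * ui for ui in basis(seed, n)]

original = decompress_block(7, 0.5, 8)     # a block that is exactly representable
seed, coeff = compress_block(original)
restored = decompress_block(seed, coeff, 8)
print(restored == original)
```

    Compression costs a search over seeds, but decompression is just regenerating the basis and scaling it, which is compute rather than a memory read. Real weight blocks are only approximated, so the reconstruction error depends on how many seeds and coefficients are allowed per block.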

  17. Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)

    A developer is building an OCR pipeline that extracts structured data from complex educational materials for machine learning training. The system supports multilingual text, mathematical formulas, tables, and diagrams, and targets 90-95% accuracy or better on academic datasets. It generates AI-ready outputs in JSON or Markdown, including semantic annotations for visual content, and is built with tools such as the Google Vision API and the OpenAI API. The public release has been delayed by the developer's academic commitments but is expected once the system is finalized. AI

    IMPACT This tool could streamline the creation of specialized datasets for ML training, particularly in academic and research contexts.

  18. Show HN: Formal Verification for Machine Learning Models Using Lean 4

    A new open-source framework called FormalVerifML has been released, utilizing Lean 4 for the formal verification of machine learning models. This tool aims to provide mathematically rigorous proofs of properties like robustness, fairness, and safety for high-stakes applications. It supports large-scale models, including transformers and vision models, with features for enterprise use and distributed verification. AI

    IMPACT Enhances trust and reliability in ML models for critical applications through formal verification.

  19. Math for Computer Science and Machine Learning [pdf]

    This PDF provides a comprehensive overview of the mathematical foundations essential for computer science and machine learning. It covers topics ranging from linear algebra and calculus to probability and statistics, aiming to equip readers with the necessary quantitative skills for advanced study and research in these fields. The material is structured to build a strong theoretical understanding, enabling practitioners to better grasp and develop complex algorithms and models. AI

    IMPACT Provides foundational mathematical knowledge crucial for understanding and developing advanced AI models and algorithms.

  20. AI-assisted coding with GitHub's COO

    A new paper explores the limitations of automated evaluation for AI code review bots, finding that current automated methods like G-Eval and LLM-as-a-Judge show only moderate alignment with human developer labels. The study analyzed 2,604 bot-generated comments from Beko, revealing that developer actions on these comments are influenced by contextual and organizational factors, making them unreliable ground truth. This suggests that fully automating the evaluation of AI code review comments in industrial settings remains a significant challenge. AI

    IMPACT Highlights challenges in reliably evaluating AI code review tools, impacting their adoption and effectiveness in development workflows.

  21. Merlion: A Machine Learning Framework for Time Series Intelligence

    Salesforce has released Merlion 2.0, an open-source Python library designed for time series intelligence. The framework offers an end-to-end solution for tasks such as forecasting, anomaly detection, and change point detection. Merlion 2.0 includes a diverse set of models, automated hyperparameter tuning, and practical post-processing rules to enhance model interpretability and reduce false positives. AI

    IMPACT Provides a comprehensive toolkit for developing and benchmarking time series models, potentially accelerating adoption in industry.

  22. Show HN: Globstar – Open-source static analysis toolkit

    DeepSource has open-sourced Globstar, a static analysis toolkit designed for creating custom code quality and security checkers. The toolkit leverages tree-sitter for parsing code and utilizes AI assistants like ChatGPT and Claude to generate complex queries, simplifying the process for developers. Globstar offers both YAML and Go interfaces, supporting over 20 languages with plans to add C/C++ support. AI

    IMPACT Simplifies the creation of custom code quality and security checkers by leveraging AI for query generation.

  23. Apple Robot Research

    Researchers at Apple have developed ELEGNT, a framework for designing robot movements that blend functional task fulfillment with expressive qualities like intention and emotion. Their work, detailed in a recent paper, involved creating a lamp-like robot and a methodology to generate movement sequences that enhance user engagement, particularly in social contexts. A user study confirmed that expression-driven movements were perceived more positively than purely function-driven ones. AI

    IMPACT Enhances human-robot interaction by making robots more expressive and engaging, potentially improving user experience in social and task-oriented scenarios.

  24. Beyond Structure: Revolutionising Materials Discovery via AI-Driven Synthesis Protocol-Property Relationships

    Two new arXiv papers propose shifting AI-driven materials discovery from a structure-centric to a synthesis-first approach. The first paper, "Beyond Structure," outlines a roadmap for representing synthesis procedures as machine-readable protocols and using generative models to propose reaction pathways. The second paper, "Born-Qualified," introduces a framework that embeds manufacturability, cost, and durability constraints from the outset of autonomous development to bridge the gap between laboratory metrics and industrial viability. AI

    IMPACT These papers suggest a new paradigm for AI in materials science, potentially accelerating the discovery and deployment of advanced materials by focusing on synthesis and industrial viability.

  25. Agents

    Chip Huyen's latest post, adapted from her book "AI Engineering," explores the concept of intelligent agents, defining them as entities that perceive and act within an environment. These agents leverage the advanced capabilities of foundation models and can be augmented with tools to perform complex tasks. The post also delves into agent planning, tool selection, and methods for evaluating their performance and potential failure modes. AI

  26. Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

    Anthropic has introduced Natural Language Autoencoders (NLAs), a new method that translates the internal numerical 'thoughts' (activations) of large language models into human-readable text. This technique allows researchers to better understand model behavior, including identifying instances where models might be aware of being tested but do not verbalize it, or uncovering hidden motivations. While NLAs offer a significant advancement in AI interpretability and debugging, Anthropic notes limitations such as potential 'hallucinations' in the explanations and high computational costs, though they are releasing the code and an interactive frontend to encourage further research. AI

    IMPACT Enables deeper understanding of LLM internal states, potentially improving safety, debugging, and trustworthiness.

  27. Does the UK’s liver transplant matching algorithm systematically exclude younger patients?

    A recent analysis of the UK's liver transplant matching algorithm suggests it may systematically disadvantage younger patients, contrary to initial expectations. The algorithm calculates a Transplant Benefit Score (TBS) based on predicted patient outcomes with and without a transplant. Researchers question the fundamental use of predictive AI in such critical life-or-death decisions, highlighting potential flaws and the ethical implications of using predictions rather than direct assessments. AI

  28. When machine learning tells the wrong story

    A former MIT student reflects on a hardware security research paper he co-authored, "There’s Always a Bigger Fish: A Clarifying Analysis of a Machine-Learning-Assisted Side-Channel Attack." The paper, which demonstrated a machine-learning-assisted side-channel attack executable in web browsers and highlighted how system interrupts can leak user information, has received significant awards. The author discusses the challenges of writing about the research, particularly the dual narrative of ML's potential for attacks and its frequent misapplication, and how the project profoundly influenced his academic and personal path. AI

    IMPACT Highlights potential vulnerabilities in web browsers through machine learning-assisted attacks, underscoring the need for careful application of ML in security.

  29. AI for real-time fusion plasma behavior prediction and manipulation

    Researchers are developing AI models to predict and control the behavior of fusion plasma in real time. The models aim to keep fusion reactions stable, which is crucial for developing fusion as a clean energy source. The project uses machine learning to analyze complex plasma dynamics and enable precise manipulation. AI

    IMPACT Potential to accelerate fusion energy development by enabling real-time control of plasma.

  30. In the Arena: How LMSys changed LLM Benchmarking Forever

    The AraGen benchmark, developed by Hugging Face, aims to improve LLM evaluation by addressing limitations of static benchmarks. It introduces a crowdsourced approach similar to LMSys's Chatbot Arena, allowing for more dynamic and user-aligned assessments. This method seeks to capture real-world user preferences and model performance beyond traditional metrics. Additionally, a new open-source OCR model called DharmaOCR has been released, demonstrating strong performance against larger commercial and open-source models. AI

    IMPACT New evaluation methods and specialized open-source models offer improved benchmarking and cost-performance for AI operators.

  31. Implementing neural networks on the "3 cent" 8-bit microcontroller

    Researchers have successfully implemented neural network inference for the MNIST dataset on an extremely low-cost, 8-bit microcontroller. By significantly downscaling input images to 8x8 pixels and using highly quantized weights (as low as 2-bit), they achieved over 90% accuracy. This demonstrates the feasibility of running machine learning models on devices with minimal memory and processing power, specifically targeting microcontrollers with as little as 1KB of ROM and 64 bytes of RAM. AI

    IMPACT Demonstrates potential for running ML inference on ultra-low-cost microcontrollers, enabling new embedded AI applications.
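
    Quantizing weights to 2 bits means each weight is replaced by one of four levels, so four weights fit in a single byte. The sketch below shows uniform 2-bit quantization and byte packing in Python; it is an assumption-laden illustration rather than the project's actual scheme, and the example weights are made up.

```python
def quantize_2bit(weights):
    """Map each weight to one of four evenly spaced levels, stored as a 2-bit code."""
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / 3 or 1.0          # 4 levels span 3 steps; avoid dividing by zero
    codes = [round((w - lo) / step) for w in weights]
    return codes, lo, step

def dequantize(codes, lo, step):
    """Recover approximate weights from the 2-bit codes."""
    return [lo + c * step for c in codes]

def pack(codes):
    """Pack four 2-bit codes into each byte, as a tiny ROM would store them."""
    padded = codes + [0] * (-len(codes) % 4)
    return bytes(
        padded[i] | padded[i + 1] << 2 | padded[i + 2] << 4 | padded[i + 3] << 6
        for i in range(0, len(padded), 4)
    )

weights = [-1.0, -0.4, 0.2, 1.0, 0.5]    # hypothetical layer weights
codes, lo, step = quantize_2bit(weights)
restored = dequantize(codes, lo, step)
print(codes, len(pack(codes)))
```

    With uniform levels the reconstruction error of each weight is at most half a step. Five weights occupy two bytes here, which is the kind of saving that makes a network fit in about 1KB of ROM.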

  32. Machine learning and information theory concepts towards an AI Mathematician

    This paper explores the gap between current AI's language capabilities and its mathematical reasoning abilities. It proposes an information-theoretical approach to developing an AI mathematician, focusing on discovering new conjectures rather than proving existing theorems. The core idea is that a valuable set of theorems should efficiently summarize provable statements and be closely related to many of them. AI

    IMPACT Proposes a novel framework for AI mathematical reasoning, potentially advancing AI's capabilities beyond language tasks.

  33. FAQ about the book and our writing process

    The authors of "AI Snake Oil" have sold approximately 8,000 copies of their book, which aims to differentiate between useful AI applications and hype. They acknowledge AI's benefits but focus on the societal harms amplified by predictive AI, while viewing generative AI as a double-edged sword with long-term utility. The book breaks down the broad term "AI" to analyze specific technologies and their impacts. AI

  34. Asking For An Old Friend: Diagnosing and Mitigating Temporal Failure Modes in LLM-based Statutory Question Answering

    Researchers have developed a benchmark to test Large Language Models' ability to handle temporal changes in legal statutes, identifying issues like outdated information and recency bias. Meanwhile, the AI industry is seeing a significant shift as model labs increasingly focus on building agent-based products rather than just foundational models. This strategic pivot is exemplified by companies like AI21 and DeepSeek, and is further underscored by DeepSeek's aggressive pricing strategy for its V4-Pro model, making advanced AI more accessible. AI

    IMPACT The industry's focus is shifting from foundational models to agent-based products, with aggressive pricing making advanced AI more accessible and competitive.

  35. Can AI automate computational reproducibility?

    Researchers have developed AutoReproduce, a multi-agent framework designed to automatically reproduce AI experiments from research papers. This system utilizes a "paper lineage" to mine implicit knowledge from cited literature and employs a sampling-based unit testing strategy to ensure code executability. A new benchmark, CORE-Bench, has also been introduced to evaluate AI's capability in automating computational reproducibility. Initial tests show that while specialized agents like CORE-Agent with GPT-4o achieve 22% accuracy on difficult tasks, there is significant room for improvement in AI's ability to handle complex computational environments. AI

  36. Machine Learning Model Homotopy

    The concept of model homotopy, applying topological ideas to machine learning, suggests that a single model may not fully capture a modeling situation. Instead, a trajectory of fits, parameterized continuously by weights, can offer a richer understanding. This approach can reveal counter-intuitive behaviors, such as linear regression coefficients changing signs multiple times as variables are added, challenging the intuition that coefficients would smoothly interpolate. AI

    IMPACT Introduces a novel theoretical framework for understanding model behavior and parameter sensitivity.

  37. Launch HN: Silurian (YC S24) – Simulate the Earth

    Silurian, a startup founded by former Microsoft researchers, has launched Generative Forecasting Transformer (GFT), a 1.5 billion parameter model designed to simulate Earth's weather up to 14 days in advance. This deep learning model, which learns purely from data without explicit physics, has demonstrated strong performance in predicting hurricane tracks, outperforming traditional forecasting methods. The company aims to expand its simulations to model other weather-impacted infrastructure like energy grids and agriculture. AI

    IMPACT This new weather simulation model could significantly improve forecasting accuracy and lead to better infrastructure planning.

  38. Learning to reason with LLMs

    OpenAI has released an early version of its new model, OpenAI o1-preview, which demonstrates significant improvements in reasoning capabilities compared to GPT-4o. The model excels in competitive programming, advanced math exams, and complex scientific benchmarks, surpassing human expert performance in some areas. This advancement is attributed to a large-scale reinforcement learning algorithm that teaches the model to think productively using a chain of thought, with performance scaling with both training and test-time compute. AI

    IMPACT This new model sets a higher bar for reasoning capabilities, potentially accelerating the development of more sophisticated AI agents and tools across various domains.

  39. Start reading the AI Snake Oil book online

    The book "AI Snake Oil" by Arvind Narayanan and Sayash Kapoor, published in September 2024, aims to demystify artificial intelligence by identifying hype and harmful applications. It distinguishes between different AI types, such as predictive and generative AI, and examines their real-world impacts and limitations. The authors explore why AI hype persists and offer a framework for understanding AI's future, building on their previous work. AI

  40. Micrograd.jl

    This article introduces Micrograd.jl, a new automatic differentiation package for the Julia programming language. It aims to fill a gap in comprehensive tutorials for AD in Julia, requiring a solid understanding of both Julia and Calculus. The package is built upon Zygote.jl and ChainRules.jl, offering a different approach to AD compared to Python frameworks like PyTorch by leveraging Julia's functional programming and metaprogramming capabilities. AI

    IMPACT Provides a new tool for Julia developers to build and train machine learning models, potentially improving efficiency and understanding of backpropagation.
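
    For readers unfamiliar with the micrograd idea, reverse-mode automatic differentiation on scalars fits in a short class. The sketch below is written in Python rather than Julia and is not the package's implementation; the `Value` class and its method names are assumptions modeled on the general micrograd style.

```python
class Value:
    """A scalar that records how it was computed so gradients can flow backwards."""

    def __init__(self, data, parents=(), backward_fn=None):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = backward_fn   # pushes self.grad to the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad
        out._backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward_fn = backward_fn
        return out

    def backward(self):
        # Topologically order the graph, then apply the chain rule from output to inputs.
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            if v._backward_fn is not None:
                v._backward_fn()

x, y = Value(2.0), Value(3.0)
z = x * y + x          # dz/dx = y + 1 = 4, dz/dy = x = 2
z.backward()
print(x.grad, y.grad)  # prints 4.0 2.0
```

    Gradients are accumulated with `+=` so that a value used more than once, like `x` here, receives the sum of its contributions from every path to the output.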

  41. ⚡️The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data

    OpenAI has announced it will no longer use SWE-bench Verified to evaluate the coding capabilities of frontier AI models. The benchmark has become contaminated, with models showing improved scores primarily due to exposure to problems and solutions during training rather than genuine advancements in software engineering skills. OpenAI found that a significant portion of the benchmark's tests incorrectly reject valid solutions, and that many models can reproduce ground-truth solutions verbatim, indicating training data overlap. The company now recommends SWE-bench Pro for evaluations and is developing new, uncontaminated benchmarks. AI

  42. The reanimation of pseudoscience in machine learning

    A recent article in Patterns argues that the machine learning field is experiencing a resurgence of pseudoscience, particularly in areas like consciousness and general intelligence. The authors express concern that the field's rapid growth and the pressure to publish may be leading to a decline in rigorous scientific standards. They call for a renewed focus on empirical evidence and falsifiable hypotheses to maintain the integrity of machine learning research. AI

    IMPACT Raises concerns about the scientific rigor and potential for pseudoscience within the machine learning research community.

  43. Towards high-quality (maybe synthetic) datasets

    Google Research has introduced Simula, a framework that treats synthetic data generation as a mechanism design problem. This approach allows for fine-grained control over dataset characteristics like coverage, complexity, and quality, addressing the scarcity of real-world data for specialized AI applications. Separately, Google also presented CTCL, a privacy-preserving synthetic data generation algorithm that avoids the need to fine-tune large language models, making it suitable for resource-constrained environments. AI

    Towards high-quality (maybe synthetic) datasets

    IMPACT New frameworks for synthetic data generation could accelerate AI development in data-scarce domains and improve privacy-preserving techniques.

  44. Extrinsic Hallucinations in LLMs

    Lilian Weng's latest post delves into extrinsic hallucinations in large language models, defining them as generated content that is fabricated and not grounded in provided context or world knowledge. The piece explores how issues in pre-training data and the learning process during fine-tuning can contribute to these factual inaccuracies. Research suggests that while models struggle to learn new information during fine-tuning, attempting to do so can paradoxically increase their tendency to hallucinate. AI

    Extrinsic Hallucinations in LLMs
  45. Sequential Learning and Catastrophic Forgetting in Differentiable Resistor Networks

    Researchers have developed a novel analog network of resistors capable of performing machine learning tasks without a traditional processor. This system, based on transistors, can learn and adapt to new tasks, demonstrating potential for highly energy-efficient computation. While currently a prototype, the technology shows promise for applications in edge devices and could eventually outperform conventional digital processors for specific machine learning workloads. AI

    Sequential Learning and Catastrophic Forgetting in Differentiable Resistor Networks

    IMPACT This research could lead to more energy-efficient AI hardware, particularly for edge computing applications.

  46. AI scaling myths

    A recent analysis challenges the prevailing belief that continued scaling of AI models will inevitably lead to advanced capabilities like AGI. The author argues that the predictability observed in scaling laws applies to reductions in perplexity, not to the emergence of new, user-relevant abilities. Furthermore, the availability of high-quality training data is becoming a significant bottleneck, while the cost of acquiring data and the potential backlash against doing so are both increasing. AI

    AI scaling myths
  47. Apple's On-Device and Server Foundation Models

    Apple has detailed its new foundation language models powering Apple Intelligence, including a ~3 billion parameter on-device model and a larger server-based model. These models are designed for multilingual and multimodal tasks, supporting image understanding and tool execution. The company emphasizes its Responsible AI approach, focusing on user privacy through innovations like Private Cloud Compute and on-device processing, ensuring user data is not used for training. AI

    Apple's On-Device and Server Foundation Models

    IMPACT Apple's detailed technical report on its foundation models may influence the development of efficient on-device and specialized server-based AI systems.

  48. What kind of bug would make machine learning suddenly 40% worse at NetHack?

    Researchers Bartłomiej Cupiał and Maciej Wołczyk observed a significant performance drop in their neural network trained to play NetHack. The model, which had been consistently scoring around 5,000 points, suddenly began scoring only 3,000 points, a 40% decrease. Despite extensive troubleshooting, including code reversion, software stack restoration, and rebuilding the entire system from scratch, the performance issue persisted. AI

    What kind of bug would make machine learning suddenly 40% worse at NetHack?

    IMPACT Highlights potential fragility in reinforcement learning models and the challenges of diagnosing performance regressions.

  49. Scientists should use AI as a tool, not an oracle

    A recent analysis highlights significant issues with AI-driven scientific research, particularly "leakage," an error in which information that would not be available at prediction time (such as test-set or future data) contaminates training and leads to inflated performance claims. This problem, identified across 30 disciplines, is exacerbated by a scientific culture that prioritizes publication and positive results over rigorous validation. The authors argue that AI hype fuels flawed research, which makes AI-generated discoveries hard to trust, and they call for greater skepticism and improved reproducibility standards. AI

    Scientists should use AI as a tool, not an oracle
  50. Understanding Stein's Paradox (2021)

    Stein's paradox is a counterintuitive statistical result: given a single sample from a Gaussian distribution with unknown mean, in dimensions three and higher there is a better estimate of the mean than the sample itself. The James-Stein estimator shrinks the sample toward the origin by a factor that depends on the sample's squared norm and the dimensionality, and it outperforms the naive estimate in terms of mean squared error. The paradox challenges conventional statistical intuition, particularly regarding parameter estimation in higher-dimensional spaces. AI

    Understanding Stein's Paradox (2021)
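
    The shrinkage idea in the Stein's paradox item can be checked numerically. The sketch below is not taken from the linked article. It assumes one observation x drawn from a d-dimensional Gaussian with identity covariance and unknown mean theta, and uses the standard James-Stein form (1 - (d - 2) / ||x||^2) * x, which shrinks toward the origin. The function names `james_stein` and `compare`, the seed, and the choice of theta are hypothetical and purely illustrative.

```python
import random


def james_stein(x):
    """Shrink one observation x ~ N(theta, I) toward the origin (needs d >= 3)."""
    d = len(x)
    norm_sq = sum(v * v for v in x)
    shrink = 1.0 - (d - 2) / norm_sq
    return [shrink * v for v in x]


def squared_error(estimate, theta):
    """Squared Euclidean distance between an estimate and the true mean."""
    return sum((e - t) ** 2 for e, t in zip(estimate, theta))


def compare(theta, trials=20000, seed=0):
    """Monte Carlo estimate of the mean squared error of both estimators."""
    rng = random.Random(seed)
    naive_total = 0.0
    js_total = 0.0
    for _ in range(trials):
        # Draw a single noisy observation of theta with unit variance per coordinate.
        x = [rng.gauss(t, 1.0) for t in theta]
        naive_total += squared_error(x, theta)               # naive estimate: x itself
        js_total += squared_error(james_stein(x), theta)     # shrunken estimate
    return naive_total / trials, js_total / trials


# Arbitrary illustrative mean vector in five dimensions (an assumption).
naive_mse, js_mse = compare(theta=[1.0, -0.5, 2.0, 0.0, 0.5])
print(f"naive MSE: {naive_mse:.3f}, James-Stein MSE: {js_mse:.3f}")
```

    Under these assumptions the naive error should land near the dimension (5 here), with the James-Stein error below it. The gap narrows as theta moves farther from the origin, because the shrinkage factor approaches 1.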