Brief

last 24h

[9/9] 221 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

TOOL · r/LocalLLaMA English(EN) · 4h

CUDA: add fast walsh-hadamard transform by am17an · Pull Request #23615 · ggml-org/llama.cpp

A pull request to llama.cpp by am17an adds a CUDA implementation of the fast Walsh-Hadamard transform (FWHT), intended to speed up operations involved in quantizing the key-value (KV) cache. Initial benchmarks on the Gemma 4 26B model show modest gains: 1-2% in prompt processing (pp) and 7-9% in token generation (tg).

IMPACT Improves inference efficiency for local LLM deployments by optimizing KV cache operations.
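
For readers unfamiliar with the transform named in this item, here is a minimal reference sketch of the unnormalized fast Walsh-Hadamard butterfly in plain Python. It is not the pull request's CUDA kernel and says nothing about how llama.cpp applies the transform; the function name `fwht` and the choice to leave the output unnormalized are assumptions made for illustration.

```python
def fwht(values):
    """Return the unnormalized Walsh-Hadamard transform of a sequence.

    The length must be a power of two. Runs in O(n log n) additions
    using the standard in-place butterfly on a copy of the input.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Combine pairs of elements that are h apart within each block of 2h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out
```

Applying the transform twice returns the input scaled by its length, which makes a convenient sanity check for any faster implementation.
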
TOOL · dev.to — LLM tag English(EN) · 1d

I ran Flux Schnell + LLMs on a $50 GPU. No CUDA. No cloud. No ROCm.

A developer demonstrated running large language models and image generation on an older AMD RX 580 GPU with 8GB of VRAM, a feat previously thought impossible on such hardware. Using the Vulkan backend of the ggml project, which powers tools like llama.cpp and stable-diffusion.cpp, the developer reports a 3-4x speedup over CPU-only processing. The approach needs no CUDA, ROCm, or DirectML, showing that modern AI workloads can run on modest, older hardware.

IMPACT Demonstrates that older, less powerful GPUs can run AI models, potentially lowering the barrier to entry for local AI development.
- ggml
- OpenVINO
- llama.cpp
- CUDA
- FLUX
- Vulkan
- ROCm
- DirectML
- AMD RX 580
- stable-diffusion.cpp
TOOL · dev.to — LLM tag English(EN) · 3d

Running Flux Schnell (12B) + LLMs on a Legacy AMD RX 580 (8GB) via Native Vulkan — Full Architecture Guide [2026]

A technical guide shows how to run large language models (LLMs) on older AMD RX 580 graphics cards, which were previously considered obsolete for AI tasks. The method uses native Vulkan, with no need for CUDA or ROCm, and takes a dual-architecture approach: the GPU handles smaller models via Vulkan acceleration, while the CPU handles larger, more demanding ones. The guide identifies NVMe storage as a critical factor in reducing model load times.

IMPACT Enables running LLMs on older, less powerful hardware, potentially lowering the barrier to entry for AI experimentation.
- LLM
- ComfyUI
- CUDA
- OpenWebUI
- Flux Schnell
- NVMe
- Vulkan
- ROCm
- DirectML
- AMD RX 580
- Intel Xeon E5-2690 v3
RESEARCH · 36氪 (36Kr) 中文(ZH) · 3d

The Wireless Revolution of AI Intelligent Imaging Under the Computing Power Wave | 2026 AI Partner · Beijing Yizhuang AI+ Industry Conference

Shenmou, led by Yang Zuoxing, is developing ultra-low-power chip designs to free cameras from wires, envisioning a future with billions of smart visual terminals. Its first-generation chip draws one-third of the power typical in the industry and the second generation one-tenth, enabling all-weather smart cameras that run on a single watt of solar power. Yang predicts that annual camera demand will grow from hundreds of millions of units to potentially 100 billion by 2045, feeding real-time data into world-scale AI models.

IMPACT Enables massive scaling of real-world data input for AI models, potentially reducing hardware costs and expanding AI applications.
- Nvidia
- AI
- DeepSeek
- Samsung
- 36Kr
- TSMC
- CUDA
- Groq
- GPU
- Yang Zuoxing
- Shenmou
TOOL · dev.to — LLM tag English(EN) · 3d

TitanCore Core-1 – Trillion-parameter LLM training infra in C++/CUDA with ZeRO-3

A developer has created TitanCore Core-1, an open-source infrastructure for training trillion-parameter LLMs. Written in C++ and CUDA, it addresses VRAM limits with ZeRO-3 fully sharded data parallelism (FSDP) and fused kernels. The approach reportedly achieves a 2.6x speedup over traditional methods by making better use of memory bandwidth.

IMPACT Enables more efficient training of extremely large language models, potentially lowering the barrier for developing frontier models.
- C++
- ZeRO-3
- TitanCore Core-1
- CUDA
- Sarkar-AGI
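
To make the VRAM argument behind ZeRO-3 concrete, the sketch below estimates per-device memory for model states when parameters, gradients, and optimizer states are all sharded evenly across devices. The byte counts (2 for fp16 parameters, 2 for fp16 gradients, 12 for fp32 Adam states) follow the commonly cited mixed-precision Adam accounting and are assumptions, not figures from TitanCore; the function name and the device count in the note are hypothetical, and activation memory is ignored.

```python
def zero3_bytes_per_device(num_params, num_devices,
                           bytes_params=2, bytes_grads=2, bytes_optimizer=12):
    """Estimate model-state bytes per device under ZeRO stage 3 sharding.

    Stage 3 partitions parameters, gradients, and optimizer states,
    so the full per-parameter cost is divided by the device count.
    Activations and temporary buffers are not included.
    """
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    bytes_per_param = bytes_params + bytes_grads + bytes_optimizer
    return num_params * bytes_per_param / num_devices
```

Under these assumptions, a 10^12-parameter model needs 16 TB of model state in total, or about 15.6 GB per device when sharded across 1,024 devices.
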
RESEARCH · dev.to — LLM tag English(EN) · 5d

DeepSeek V4 on Huawei's Ascend 950: A Real Stress Test for China's AI Chip Ecosystem

DeepSeek's V4 model has been validated for inference on Huawei's Ascend 950 chip, a significant step for China's domestic AI hardware. The validation took substantial engineering effort, including rewriting numerous CUDA operators and extensive testing, to reach performance parity with NVIDIA's offerings for inference workloads within China. The Ascend 950 has a dual-architecture design with high-bandwidth memory that targets both the compute-bound and memory-bound phases of LLM operation, though limited manufacturing capacity is holding back wider adoption.

IMPACT Validates domestic AI hardware for inference, potentially reducing reliance on foreign suppliers within China.
- NVIDIA
- Huawei
- Claude Opus
- DeepSeek
- GPT-5
- V4
- CUDA
- Ascend 950
- SMIC
TOOL · r/MachineLearning English(EN) · 1d

Working on a cgo-free CUDA binding in Go for ML stuff Week 3 - open source [P]

A developer is building a cgo-free CUDA binding for the Go programming language, aiming to simplify machine learning tool development. The project, still in its early stages and worked on during weekends, addresses the large Docker images and cross-compilation problems that come with cgo-based solutions. One key challenge, CUDA's thread affinity, is handled with a channel-based executor that locks OS threads, making it easier to drive GPU operations from goroutines.

IMPACT Enables easier development of ML tools in Go by simplifying CUDA integration.
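
The executor pattern described in this item can be illustrated with a small analogue: all work is funneled through one dedicated thread, so thread-bound state always sees the same OS thread. The project itself is written in Go with channels; this Python sketch swaps in `queue.Queue` and a single `threading.Thread`, and the class name `PinnedExecutor` is hypothetical. It makes no CUDA calls.

```python
import queue
import threading
from concurrent.futures import Future


class PinnedExecutor:
    """Runs every submitted callable on one dedicated worker thread."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        # Drain tasks until the None sentinel arrives.
        while True:
            item = self._tasks.get()
            if item is None:
                break
            future, fn, args = item
            try:
                future.set_result(fn(*args))
            except Exception as exc:
                future.set_exception(exc)

    def submit(self, fn, *args):
        """Queue fn(*args) for the worker and return a Future for its result."""
        future = Future()
        self._tasks.put((future, fn, args))
        return future

    def close(self):
        """Stop the worker after all queued tasks have run."""
        self._tasks.put(None)
        self._worker.join()
```

Callers on any thread use `executor.submit(fn, *args).result()`; the callable always executes on the worker thread, which is the property a thread-affine API needs.
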
TOOL · llama.cpp — Releases (SO) · 1d · [6 sources]

b9301

The llama.cpp project has published several releases: b9315, b9313, b9311, b9310, b9305, and b9301. They bring improvements and bug fixes, such as parallelized initialization of the quantization look-up table and a fix for checkpoint creation in the server component. The releases also ship pre-compiled binaries for macOS, iOS, Linux, Android, and Windows, with support for compute backends including Vulkan, ROCm, OpenVINO, SYCL, and CUDA.

IMPACT Provides updated tooling for running LLMs on diverse hardware, improving accessibility and performance for developers and users.
- CMake
- llama.cpp
- CUDA
- macOS
- iOS
- Windows
- Vulkan
- OpenMP
- Linux
- ROCm
- OpenVINO
- Android
RESEARCH · Lobsters — AI tag English(EN) · 3d · [3 sources]

Dissecting ThunderKittens, anatomy of a compact DSL for high-performance AI kernels

A new article dissects ThunderKittens, a compact domain-specific language (DSL) developed at Stanford's Hazy Research Lab for writing high-performance AI kernels. The DSL aims to balance research productivity against hardware efficiency by abstracting repetitive GPU programming tasks such as tile layouts and memory allocation. Developers can still reason closely about data movement and scheduling while optimizing modern AI workloads for hardware like NVIDIA's Hopper and Blackwell architectures.

IMPACT Enables more efficient AI model training and inference by optimizing low-level GPU kernel performance.
- NVIDIA
- AI
- Stanford
- FlashAttention-2
- Hopper
- PyTorch
- CUDA
- GPU
- Blackwell
- Triton
- Hazy Research Lab
- ThunderKittens