CUDA
PulseAugur coverage of CUDA: every cluster mentioning CUDA across labs, papers, and developer communities, ranked by signal.
Sentiment data available for 10 days
-
llama.cpp adds CUDA FWHT for faster KV cache quantization
A pull request to the llama.cpp project introduces a CUDA implementation of the Fast Walsh-Hadamard Transform (FWHT). This optimization, developed by user am17an, aims to speed up operations when quantizing the key-valu…
-
Developer runs LLMs on $50 AMD RX 580 GPU using Vulkan
A developer demonstrated running large language models and image generation software on an older AMD RX 580 GPU with 8GB of VRAM, a feat previously thought impossible for such hardware. By leveraging the Vulkan backend …
-
Go developer creates cgo-free CUDA bindings for ML tools
A developer is creating a cgo-free CUDA binding for the Go programming language, aiming to simplify machine learning tool development. The project, currently in its early stages and worked on during weekends, addresses …
-
llama.cpp project releases multiple updates with broad platform support
The llama.cpp project has released several updates, including versions b9315, b9313, b9311, b9310, b9305, and b9301. These releases introduce various improvements and bug fixes, such as parallelizing quantization look-u…
-
Guide shows how to run LLMs on legacy AMD RX 580 GPUs using Vulkan
A technical guide demonstrates how to run large language models (LLMs) on older AMD RX 580 graphics cards, which were previously considered obsolete for AI tasks. The method utilizes native Vulkan, bypassing the need fo…
-
Open-source C++/CUDA infra trains trillion-parameter LLMs
A developer has created TitanCore Core-1, an open-source infrastructure for training trillion-parameter LLMs. Written in C++ and CUDA, it targets VRAM limitations by implementing ZeRO-3 FSDP and fused kernels. This appr…
-
Shenmou targets wireless cameras with ultra-low-power chips
Shenmou, led by Yang Zuoxing, is developing ultra-low-power chip designs to free cameras from wires, envisioning a future with billions of smart visual terminals. Their first-generation chip achieves one-third the indus…
-
Stanford's ThunderKittens DSL optimizes AI kernel performance
A new article details ThunderKittens, a compact domain-specific language (DSL) developed at Stanford's Hazy Research Lab for creating high-performance AI kernels. The DSL aims to strike a balance between research produc…
-
DeepSeek V4 validated on Huawei Ascend 950, testing China's AI chip ecosystem
DeepSeek's V4 model has successfully validated inference on Huawei's Ascend 950 chip, marking a significant step for China's domestic AI hardware. This validation required substantial engineering effort, including rewri…
-
MTP inference speed issues in llama.cpp explained
A technical blog post explains why Multi-Token Prediction (MTP) in llama.cpp might not improve inference speed as expected. The author details three primary reasons for this performance issue: a low acceptance rate of p…
-
DeepSeek V4 debuts with MegaMoE optimizations for efficient MoE
DeepSeek has released its V4 model, featuring significant optimizations through a new system called MegaMoE. This system utilizes a 1400-line fused CUDA kernel to enhance performance by fine-grained pipelining of commun…
-
CuPy tutorial guides GPU computing with CUDA kernels and NumPy comparisons
This tutorial provides a comprehensive guide to mastering GPU computing using CuPy, a Python library that offers GPU acceleration for numerical tasks. It covers essential aspects such as inspecting CUDA device propertie…
-
MLX achieves CUDA backend milestone, boosting GPU acceleration
Cheng announced a significant milestone for MLX, with all tests passing on its CUDA backend. This achievement enhances MLX's GPU acceleration and CUDA compatibility. It represents positive progress for integrating Apple…
-
Cerebras Systems raises IPO price and share count on AI compute demand
Cerebras Systems is significantly increasing its IPO price and share count due to high demand driven by the AI industry's need for compute power. While GPUs, particularly from Nvidia, have dominated AI workloads like tr…
-
Fedora launches AI Developer Desktop initiative for local AI tooling
Fedora has approved an initiative to create specialized Atomic Desktop images tailored for AI development. These images will focus on local-first tooling, offering simplified setup for AI stacks and supporting various h…
-
Apple's MLX framework accelerates local LLMs on Macs
Apple's MLX framework is significantly boosting local LLM performance on Apple Silicon Macs, outperforming tools like llama.cpp. LM Studio, a popular LLM frontend, now leverages MLX on Apple Silicon, offering a substant…
-
DS4 model runs on NVIDIA DGX Spark hardware at 12 tokens/sec
The DS4 model is reportedly running on NVIDIA's DGX Spark hardware, utilizing GB10 and CUDA. Initial performance metrics indicate a speed of 12 tokens per second, with observed memory throughput limited to 270 GB/s. Thi…
-
NVIDIA releases experimental Rust-to-CUDA compiler backend
NVIDIA AI researchers have introduced cuda-oxide, an experimental compiler that enables developers to write GPU kernels in Rust and compile them directly to PTX, NVIDIA's intermediate representation for GPUs. This new t…
-
Clinical AI fine-tuned on AMD hardware, bypassing CUDA dependency
A project has successfully fine-tuned a clinical AI model, MedQA, using AMD hardware and ROCm, demonstrating that advanced AI development is possible without NVIDIA's CUDA. The fine-tuning process utilized the Qwen3-1.7…
-
Modal boosts multimodal inference performance by over 10% with a Python dict
Modal has identified a performance bottleneck in multimodal inference engines like SGLang, which can hinder GPU utilization. By profiling the scheduler, they discovered that expensive bookkeeping for shared GPU memory c…