PulseAugur

CUDA

PulseAugur coverage of CUDA: every cluster mentioning CUDA across labs, papers, and developer communities, ranked by signal.

Total · 30 days: 43 (90 days: 43)
Releases · 30 days: 0 (90 days: 0)
Papers · 30 days: 15 (90 days: 15)
Tier distribution · 90 days
Relationships
Sentiment · 30 days

10 days with sentiment data

Recent · Page 1 of 3 · 43 items total
  1. TOOL · CL_49945 ·

    llama.cpp adds CUDA FWHT for faster KV cache quantization

    A pull request to the llama.cpp project introduces a CUDA implementation of the Fast Walsh-Hadamard Transform (FWHT). This optimization, developed by user am17an, aims to speed up operations when quantizing the key-valu…

  2. TOOL · CL_47069 ·

    Developer runs LLMs on $50 AMD RX 580 GPU using Vulkan

    A developer demonstrated running large language models and image generation software on an older AMD RX 580 GPU with 8GB of VRAM, a feat previously thought impossible for such hardware. By leveraging the Vulkan backend …

  3. TOOL · CL_48182 ·

    Go developer creates cgo-free CUDA bindings for ML tools

    A developer is creating a cgo-free CUDA binding for the Go programming language, aiming to simplify machine learning tool development. The project, currently in its early stages and worked on during weekends, addresses …

  4. TOOL · CL_47640 ·

    llama.cpp project releases multiple updates with broad platform support

    The llama.cpp project has released several updates, including versions b9315, b9313, b9311, b9310, b9305, and b9301. These releases introduce various improvements and bug fixes, such as parallelizing quantization look-u…

  5. TOOL · CL_44608 ·

    Guide shows how to run LLMs on legacy AMD RX 580 GPUs using Vulkan

    A technical guide demonstrates how to run large language models (LLMs) on older AMD RX 580 graphics cards, which were previously considered obsolete for AI tasks. The method utilizes native Vulkan, bypassing the need fo…

  6. TOOL · CL_44133 ·

    Open-source C++/CUDA infra trains trillion-parameter LLMs

    A developer has created TitanCore Core-1, an open-source infrastructure for training trillion-parameter LLMs. Written in C++ and CUDA, it targets VRAM limitations by implementing ZeRO-3 FSDP and fused kernels. This appr…

  7. RESEARCH · CL_43614 ·

    Shenmou targets wireless cameras with ultra-low-power chips

    Shenmou, led by Yang Zuoxing, is developing ultra-low-power chip designs to free cameras from wires, envisioning a future with billions of smart visual terminals. Their first-generation chip achieves one-third the indus…

  8. RESEARCH · CL_43418 ·

    Stanford's ThunderKittens DSL optimizes AI kernel performance

    A new article details ThunderKittens, a compact domain-specific language (DSL) developed at Stanford's Hazy Research Lab for creating high-performance AI kernels. The DSL aims to strike a balance between research produc…

  9. RESEARCH · CL_41028 ·

    DeepSeek V4 validates on Huawei Ascend 950, testing China's AI chip ecosystem

    DeepSeek's V4 model has successfully validated inference on Huawei's Ascend 950 chip, marking a significant step for China's domestic AI hardware. This validation required substantial engineering effort, including rewri…

  10. TOOL · CL_37617 ·

    MTP inference speed issues in llama.cpp explained

    A technical blog post explains why Multi-Token Prediction (MTP) in llama.cpp might not improve inference speed as expected. The author details three primary reasons for this performance issue: a low acceptance rate of p…

  11. FRONTIER RELEASE · CL_33854 ·

    DeepSeek V4 debuts with MegaMoE optimizations for efficient MoE

    DeepSeek has released its V4 model, featuring significant optimizations through a new system called MegaMoE. This system utilizes a 1400-line fused CUDA kernel to enhance performance by fine-grained pipelining of commun…

  12. RESEARCH · CL_32327 ·

    CuPy tutorial guides GPU computing with CUDA kernels and NumPy comparisons

    This tutorial provides a comprehensive guide to mastering GPU computing using CuPy, a Python library that offers GPU acceleration for numerical tasks. It covers essential aspects such as inspecting CUDA device propertie…

  13. TOOL · CL_31216 ·

    MLX achieves CUDA backend milestone, boosting GPU acceleration

    Cheng announced a significant milestone for MLX, with all tests passing on its CUDA backend. This achievement enhances MLX's GPU acceleration and CUDA compatibility. It represents positive progress for integrating Apple…

  14. RESEARCH · CL_26301 ·

    Cerebras Systems boosts IPO on AI compute demand

    Cerebras Systems is significantly increasing its IPO price and share count due to high demand driven by the AI industry's need for compute power. While GPUs, particularly from Nvidia, have dominated AI workloads like tr…

  15. SIGNIFICANT · CL_26027 ·

    Fedora launches AI Developer Desktop initiative for local AI tooling

    Fedora has approved an initiative to create specialized Atomic Desktop images tailored for AI development. These images will focus on local-first tooling, offering simplified setup for AI stacks and supporting various h…

  16. TOOL · CL_25715 ·

    Apple's MLX framework accelerates local LLMs on Macs

    Apple's MLX framework is significantly boosting local LLM performance on Apple Silicon Macs, outperforming tools like llama.cpp. LM Studio, a popular LLM frontend, now leverages MLX on Apple Silicon, offering a substant…

  17. RESEARCH · CL_24951 ·

    DS4 model runs on NVIDIA DGX Spark hardware at 12 tokens/sec

    The DS4 model is reportedly running on NVIDIA's DGX Spark hardware, utilizing GB10 and CUDA. Initial performance metrics indicate a speed of 12 tokens per second, with observed memory throughput limited to 270 GB/s. Thi…

  18. RESEARCH · CL_24751 ·

    NVIDIA releases experimental Rust-to-CUDA compiler backend

    NVIDIA AI researchers have introduced cuda-oxide, an experimental compiler that enables developers to write GPU kernels in Rust and compile them directly to PTX, NVIDIA's intermediate representation for GPUs. This new t…

  19. TOOL · CL_22630 ·

    Clinical AI fine-tuned on AMD hardware, bypassing CUDA dependency

    A project has successfully fine-tuned a clinical AI model, MedQA, using AMD hardware and ROCm, demonstrating that advanced AI development is possible without NVIDIA's CUDA. The fine-tuning process utilized the Qwen3-1.7…

  20. RESEARCH · CL_23761 ·

    Modal boosts multimodal inference performance by over 10% with a Python dict

    Modal has identified a performance bottleneck in multimodal inference engines like SGLang, which can hinder GPU utilization. By profiling the scheduler, they discovered that expensive bookkeeping for shared GPU memory c…