Cuda

PulseAugur coverage of Cuda — every cluster mentioning Cuda across labs, papers, and developer communities, ranked by signal.

Total · 30 days

43

Last 90 days: 43

Releases · 30 days

0

Last 90 days: 0

Papers · 30 days

15

Last 90 days: 15

Tier distribution · 90 days

significant 4
research 14
tool 24
commentary 1

Sentiment · 30 days

10 days with sentiment data

Recent · Page 2 of 3 · 43 items total

TOOL · CL_18603 · May 6 · 04:00

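VUDA system enables spatial sharing of compute and graphics on GPUs

Researchers have developed VUDA, a system that aims to improve GPU utilization by running CUDA compute and Vulkan graphics workloads concurrently. It does this by breaking the isolation between the two separate GPU contexts, which traditionally run in mutually exclusive time slices. VUDA enables spatial parallelism through API annotations and driver-level modifications, providing a unified address space and eliminating data copies on the critical path. Experiments show that VUDA improves throughput for embodied AI applications by up to 85%.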
TOOL · CL_16004 · May 5 · 04:00

New CUDA implementation speeds up optimal transport calculations on GPUs

Researchers have developed FastSinkhorn, a new CUDA implementation for the Sinkhorn algorithm used in optimal transport computations. This method operates entirely in the log-domain, ensuring numerical stability even wi…
RESEARCH · CL_14902 · May 4 · 19:11

OpenMythos project reconstructs Anthropic's secretive Claude Mythos AI model

A new open-source project called OpenMythos has been released, aiming to theoretically reconstruct the architecture of Anthropic's Claude Mythos model. This project implements a Recurrent-Depth Transformer (RDT) with a …
RESEARCH · CL_14450 · May 4 · 01:57

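Researchers explore new attention mechanisms and optimization techniques for large language models

Researchers are exploring novel attention mechanisms to overcome the quadratic complexity of standard self-attention in transformers, particularly for long-context processing. Several papers introduce methods such as Lighthouse Attention (for efficient pretraining), Robust Filter Attention (which treats attention as state estimation), and Stochastic Attention (inspired by neural connectomes, to improve expressiveness). Other work focuses on using techniques such as early stopping for sparse attention (S2O) to optimize attention's computational foot…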
RESEARCH · CL_12339 · May 1 · 15:49

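AI agents automate data preparation, while a new Python ML compiler accelerates LLM compression

Researchers have developed a new open-source machine learning compiler stack written in just 5,000 lines of Python. The stack offers unprecedented transparency by lowering large language models to CUDA through six intermediate representations. It is designed to be easy to modify and is optimized for CUDA, in contrast to more complex systems such as PyTorch or TVM. Separately, AI agents are drawing attention for their potential to automate exploratory data analysis and data preparation tasks, which could save data scientists significant time.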
RESEARCH · CL_14104 · Apr 30 · 20:48

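VkSplat pipeline improves 3D Gaussian Splatting training performance with Vulkan compute

Researchers have developed VkSplat, a new pipeline that uses Vulkan compute for 3D Gaussian Splatting (3DGS) training, improving performance and compatibility. Compared with conventional CUDA and PyTorch approaches, it is 3.3x faster and uses 33% less VRAM. Notably, VkSplat is the first fully Vulkan-based 3DGS training pipeline to achieve state-of-the-art results across different GPU vendors.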
RESEARCH · CL_08672 · Apr 29 · 04:00

Gaussian Splatting advances enable faster, more accurate wireless RF reconstruction

Two new research papers introduce Gaussian Splatting techniques adapted for wireless radiance field reconstruction. The first, BiSplat-WRF, proposes a planar Gaussian framework that incorporates electromagnetic coupling…
SIGNIFICANT · CL_07248 · Apr 28 · 06:16

Behind the DeepSeek V4 launch-day adaptation: why does Ascend refuse to build a CUDA compatibility layer?

Huawei's Ascend AI accelerators are forging a unique path by eschewing CUDA compatibility to build an independent ecosystem. This strategy focuses on deep architectural changes in their latest Ascend 950 chips to addres…
RESEARCH · CL_07063 · Apr 28 · 04:00

New GPU framework accelerates quantum state calculations for complex systems

Researchers have developed QiankunNet-cuSCI, a novel framework that fully accelerates the NNQS-SCI method for solving complex quantum systems using GPUs. This new approach addresses the scalability limitations of previo…
RESEARCH · CL_06527 · Apr 28 · 04:00

New methods QFlash and ELSA boost Vision Transformer attention efficiency

Researchers have developed two new methods to improve the efficiency of attention mechanisms in vision transformers. QFlash focuses on enabling integer-only operations for FlashAttention, achieving significant speedups …
RESEARCH · CL_10487 · Apr 28 · 01:11

AMD's MI300X falls short in AI training due to software issues

A recent benchmark analysis reveals that AMD's MI300X, despite theoretical advantages in specifications and total cost of ownership, is not competitive with NVIDIA's H100 and H200 for AI training workloads. The primary …
RESEARCH · CL_06196 · Apr 27 · 08:24

PointTransformerX offers portable, efficient 3D point cloud processing without sparse algorithms

Researchers have developed PointTransformerX (PTX), a new vision transformer backbone for processing 3D point clouds that eliminates the need for custom CUDA operators. This PyTorch-native model achieves competitive acc…
RESEARCH · CL_03577 · Apr 25 · 15:42

llama.cpp and ik_llama.cpp add FP4 inference support for VRAM savings

The llama.cpp and ik_llama.cpp projects have both integrated support for FP4 (4-bit floating-point) inference, a significant advancement for model quantization. llama.cpp now includes NVFP4, an Nvidia-specific format, w…
TOOL · CL_03576 · Apr 25 · 14:22

llama.cpp CUDA pull request optimizes MMQ stream-k overhead for MoE models

A pull request to the llama.cpp project aims to reduce overhead in CUDA's MMQ stream-k operations. This optimization targets Mixture of Experts (MoE) models, potentially leading to faster prompt processing speeds. The c…
FRONTIER RELEASE · CL_03105 · Apr 25 · 05:00

DeepSeek releases V4 Pro and Flash models with 1M context, runs on Huawei chips

DeepSeek has released its new V4 family of models, including V4 Pro and V4 Flash, which boast a 1 million token context window. These models were trained on 32 trillion tokens and feature a novel hybrid attention system…
SIGNIFICANT · CL_05791 · Apr 13 · 04:56

Tianshu Zhixin cuts inference chip prices to gain market share amid revenue concerns

Chinese AI chip designer Tianshu Zhixin reported 10.34 billion yuan in revenue for 2025, a 91.6% year-over-year increase, though this fell short of market expectations. The company's training chip series, "Tianhe," rema…
FRONTIER RELEASE · CL_05793 · Apr 13 · 01:34

DeepSeek V4 to launch late April with trillion parameters, Huawei Ascend chip support

DeepSeek founder Liang Wenfeng has revealed that the company's next-generation flagship model, DeepSeek V4, is slated for release in late April. This new model is expected to feature trillion-scale parameters and a mill…
TOOL · CL_18066 · Mar 7 · 00:05

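AI coding assistants like Claude rekindle veteran developers' enthusiasm

Several older developers have found their enthusiasm for coding rekindled by AI coding assistants such as Claude Code. These tools let them focus on architecture and problem solving without getting bogged down in the complexity of modern frameworks and implementation details. While some worry that AI takes away the joy of building from scratch, others see these assistants as valuable partners, much like a tireless junior engineer, that let them prototype faster.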
TOOL · CL_17743 · Jul 29 · 23:32

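PHP-ORT brings machine learning inference to PHP developers

A new infrastructure project called PHP-ORT aims to bring machine learning inference directly to PHP, the server-side language used by a large share of the web. It is intended to let millions of PHP developers integrate AI features into their applications without relying on external services or switching programming languages. PHP-ORT provides a core Tensor API, a high-performance math library, and ONNX integration for direct inference, and promises significant speedups.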
TOOL · CL_17711 · May 12 · 16:01

ParaQuery launches GPU-accelerated Spark SQL for cost-efficient data processing

ParaQuery, a new startup, has launched a GPU-accelerated Spark and SQL data processing solution. The platform aims to offer cost and performance benefits over existing solutions like Google BigQuery. ParaQuery leverages…