NVIDIA H100
PulseAugur coverage of NVIDIA H100: every cluster mentioning NVIDIA H100 across labs, papers, and developer communities, ranked by signal.
Sentiment data available for 7 days
-
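Study finds contextual sparsity in LLMs can deliver 10x inference speedups
A new research paper argues that the compute and memory bottlenecks associated with the attention mechanism in large language models (LLMs) are artificial and can be overcome through principled sparsity. The study analyzed 20 models across five families and found that current LLMs are surprisingly robust to decode-time sparsity at inference, even without having been trained for it. The approach can substantially accelerate LLM inference, with sparse decoding kernels achieving up to a 10x speedup at 50x sparsity on hardware such as the H100.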
-
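Photoroom cuts image generation costs by 75% through AI pipeline optimization
Photoroom significantly reduced its image generation costs by optimizing its diffusion model pipeline. The company cut costs by 39% in the UNet denoising stage with int8 quantization and cut text encoder costs by 79% by caching LLM embeddings. Deploying an AI gateway with Bifrost reduced caption API spend by a further 61% and improved latency, while also mitigating costs associated with upstream LLM outages.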
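The embedding-caching idea in the summary above can be sketched as follows. This is a minimal illustration, not Photoroom's implementation: `encode_prompt` is a hypothetical stand-in for an expensive text-encoder forward pass, and the cache size is an arbitrary assumption.

```python
from functools import lru_cache

# Counts how often the (expensive) encoder actually runs.
encoder_calls = 0


def encode_prompt(prompt: str) -> tuple:
    """Hypothetical stand-in for an expensive text-encoder forward pass."""
    global encoder_calls
    encoder_calls += 1
    # Toy "embedding": one scaled character code per character.
    return tuple(ord(ch) / 1114112 for ch in prompt)


@lru_cache(maxsize=4096)  # cache size is an assumption, tune to the workload
def cached_embedding(prompt: str) -> tuple:
    """Return the embedding for a prompt; the encoder runs only on a cache miss."""
    return encode_prompt(prompt)


# Repeated prompts are served from the cache instead of re-running the encoder.
prompts = ["white studio background", "marble table", "white studio background"]
embeddings = [cached_embedding(p) for p in prompts]
```

The saving depends entirely on how often prompts repeat, so the hit rate reported by `cached_embedding.cache_info()` is worth checking before relying on a cache like this.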
-
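Together AI adds 1,000 H100/H200 GPUs for inference
Together AI has significantly expanded its GPU capacity by adding one thousand NVIDIA H100 and H200 instances. The GPUs are now available through Together's on-demand GPU clusters and dedicated endpoints services. The expansion is intended to provide stronger infrastructure for AI inference and open-source model development.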
-
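Guide: Run GPT-4-class large language models locally on your own hardware for free
This guide details how to run advanced large language models locally on personal hardware in 2026, avoiding expensive API costs. It stresses that VRAM, not raw compute, is the main hardware bottleneck, and recommends specific GPU configurations for different budgets. The guide recommends Ollama as the standard tool for managing local LLMs and highlights several Chinese models, such as Qwen 2.5 and DeepSeek-R1, for their strong performance relative to their size.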
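A rough way to see why VRAM is the limiting factor is to estimate the memory the weights alone require. The sketch below is a back-of-the-envelope calculation, not taken from the guide: the function names are hypothetical, and the 20% allowance for KV cache and activations is an assumption.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


def fits_in_vram(num_params: float, bits_per_param: int, vram_gib: float,
                 overhead_fraction: float = 0.2) -> bool:
    """Check whether weights plus an assumed overhead fit in the given VRAM.

    overhead_fraction is an assumed allowance for KV cache and activations.
    """
    needed = weight_memory_gib(num_params, bits_per_param) * (1 + overhead_fraction)
    return needed <= vram_gib


# A 7B-parameter model at 16-bit versus 4-bit precision.
fp16_gib = weight_memory_gib(7e9, 16)
int4_gib = weight_memory_gib(7e9, 4)

# On a 12 GiB card only the 4-bit variant fits under these assumptions.
fp16_fits_12 = fits_in_vram(7e9, 16, 12)
int4_fits_12 = fits_in_vram(7e9, 4, 12)
```

Under these assumptions the 16-bit weights need about 13 GiB and the 4-bit weights about 3.3 GiB, which is why quantized models are the usual choice for consumer GPUs.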
-
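Diffusion model speedups depend on reducing overhead, not just reducing step count
Single-image diffusion inference is slow because of kernel launch overhead and attention memory traffic, not raw compute. Optimizing with `torch.compile` in `reduce-overhead` mode, using a fused attention backend, and batching classifier-free guidance can significantly reduce latency. Only after these optimizations should distillation be considered for further speedups, with careful evaluation of potential quality degradation.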
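Batching classifier-free guidance means evaluating the conditional and unconditional branches in one forward pass instead of two. The sketch below illustrates only that idea in plain Python: `toy_denoiser` is a hypothetical stand-in for a real denoiser, and the scalar "conditioning" values are made up for the example.

```python
# Counts forward passes through the (stand-in) denoiser.
forward_passes = 0


def toy_denoiser(latents, conditions):
    """Hypothetical denoiser: one call processes a whole batch."""
    global forward_passes
    forward_passes += 1
    return [[x * c for x in latent] for latent, c in zip(latents, conditions)]


def combine(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def cfg_two_passes(latent, cond, uncond, scale):
    """Unbatched guidance: two separate forward passes per step."""
    eps_cond = toy_denoiser([latent], [cond])[0]
    eps_uncond = toy_denoiser([latent], [uncond])[0]
    return combine(eps_cond, eps_uncond, scale)


def cfg_batched(latent, cond, uncond, scale):
    """Batched guidance: both branches share one forward pass."""
    eps_cond, eps_uncond = toy_denoiser([latent, latent], [cond, uncond])
    return combine(eps_cond, eps_uncond, scale)


latent = [1.0, 2.0]
slow = cfg_two_passes(latent, 0.5, 0.25, 2.0)
passes_two = forward_passes
fast = cfg_batched(latent, 0.5, 0.25, 2.0)
passes_batched = forward_passes - passes_two
```

Both variants produce the same guided prediction, but the batched one halves the number of denoiser calls per step, which matters most when per-call launch overhead dominates.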
-
Cohere releases open-source Command A+ AI model for enterprise agents
Cohere has released Command A+, an open-source, multimodal AI model designed for enterprise use and agentic tasks. This new model integrates reasoning, vision, and multilingual capabilities, supporting 48 languages and …
-
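Cohere releases open-source Command A+ model, optimized for efficiency
Cohere has released its latest model, Command A+, which the company claims is its fastest and most capable model to date. The model is designed for efficient deployment and can run on just two H100 GPUs. Command A+ will also be available as an open-source model.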
-
New metric OFU tracks GPU efficiency for AI workloads
Researchers have developed a new metric called Overall FLOP Utilization (OFU) to measure GPU efficiency for AI workloads. OFU is derived from on-chip performance counters and does not require application instrumentation…
-
LLM benchmarks mislead on inference speed for long contexts
Current LLM inference benchmarks are misleading because they primarily measure short-context performance, which does not reflect real-world usage involving longer contexts. This discrepancy arises from the differing com…
-
ByteDance unveils CVPR 2026 papers on efficient AI algorithms
ByteDance's Seed team presented four papers at CVPR 2026, focusing on algorithmic advancements to combat rising compute costs and hardware limitations. These papers explore techniques to compress model inference steps, …
-
GPU rental prices show early signs of volatility and transparency
New data from AIMC Technologies, which tracks GPU rental prices across 24 marketplaces, indicates that the market for AI compute is becoming more transparent and volatile. The dataset, comprising over 141,000 pricing ob…
-
Hugging Face and AWS Detail Foundation Model Infrastructure
Hugging Face and AWS have collaborated to detail the infrastructure required for training and running large foundation models. The blog post outlines a layered architecture, emphasizing the interplay between AWS's compu…
-
Google's TurboQuant cuts LLM memory use by 6x with no accuracy loss
Google researchers have developed a new technique called TurboQuant that significantly reduces the memory required by large language models. By employing a two-step process involving data rotation and scalar quantizatio…
-
Superhuman and Databricks build 200K QPS AI inference platform
Superhuman and Databricks engineers collaborated to build a high-throughput inference platform capable of handling over 200,000 queries per second. This joint effort modernized Superhuman's serving stack, migrating from…
-
Modal boosts multimodal inference performance over 10% with Python dict
Modal has identified a performance bottleneck in multimodal inference engines like SGLang, which can hinder GPU utilization. By profiling the scheduler, they discovered that expensive bookkeeping for shared GPU memory c…
-
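GPU hardware analysis shows memory bandwidth, not FLOPS, is what matters for LLMs
This article explains the basic architecture of GPUs, focusing on how their design prioritizes memory bandwidth over raw compute for machine learning tasks. It details how GPUs manage thousands of threads through a system called warps and a six-level memory hierarchy to stay continuously busy even when individual threads stall on memory latency. The explanation aims to give machine learning engineers a deeper understanding of the GPU hardware beneath the CUDA API, laying the groundwork for future discussions of performance optimization techniques such as KV cache management and quantization.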
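The bandwidth-versus-FLOPS point can be made concrete with a rough estimate for single-stream decoding. The sketch below is not from the article: it assumes batch size 1, that every weight is read once per generated token, and that KV cache traffic is ignored. The hardware figures (about 3.35 TB/s of memory bandwidth and about 989 TFLOPS of dense 16-bit compute, roughly an H100 SXM) are assumed round numbers, and the function names are hypothetical.

```python
def memory_time_per_token(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Seconds to stream all weights from memory once (one decoded token)."""
    return weight_bytes / bandwidth_bytes_per_s


def compute_time_per_token(num_params: float, peak_flops: float) -> float:
    """Seconds of arithmetic per token, using ~2 FLOPs per parameter."""
    return 2 * num_params / peak_flops


# Assumed figures, roughly an H100 SXM; replace with your own hardware's specs.
BANDWIDTH = 3.35e12   # bytes per second
PEAK_FLOPS = 989e12   # dense 16-bit FLOP/s

num_params = 7e9
weight_bytes = num_params * 2  # 16-bit weights

memory_s = memory_time_per_token(weight_bytes, BANDWIDTH)
compute_s = compute_time_per_token(num_params, PEAK_FLOPS)

# Upper bound on decode speed if memory traffic is the only limit.
max_tokens_per_s = 1 / memory_s
```

With these assumed numbers, streaming the weights takes a few hundred times longer than the arithmetic, so the ceiling of roughly 239 tokens per second is set by bandwidth rather than FLOPS.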
-
New SPES framework enables memory-efficient decentralized LLM pretraining on fewer GPUs
Researchers have developed a novel decentralized framework called SPES for pretraining large language models, specifically Mixture-of-Experts (MoE) architectures. This method significantly reduces memory requirements by…
-
AI model evaluations are becoming a costly bottleneck, surpassing training expenses
AI model evaluations are becoming prohibitively expensive, with recent benchmarks costing tens of thousands of dollars and consuming thousands of GPU hours. This high cost is particularly pronounced for agent-based eval…
-
SenseNova U1 unifies image understanding and generation with novel architecture
SenseTime has released SenseNova-U1, an open-source model that unifies image understanding and generation. This new architecture, particularly the 8B parameter version, can replicate advanced capabilities previously see…
-
Whole brain emulation unlikely to aid AI transition, study finds
Whole brain emulation (WBE) is unlikely to significantly impact the AI transition, according to an analysis based on the State of Brain Emulation 2025 report. Experts estimate WBE is decades away from AGI, requiring ext…