FP8
PulseAugur coverage of FP8: every cluster mentioning FP8 across labs, papers, and developer communities, ranked by signal.
3 days with sentiment data
-
Together AI releases FlashAttention-3 and -4 for faster LLM processing
Together AI has released FlashAttention-3 and FlashAttention-4, significant upgrades to their GPU-accelerated attention mechanism for large language models. FlashAttention-3, designed for Hopper GPUs, achieves up to 75%…
-
NVIDIA unveils 4-bit pretraining method, NVFP4, for LLMs
NVIDIA has developed a new 4-bit pretraining methodology called NVFP4, designed to overcome the challenges of reduced dynamic range and increased quantization error in narrower floating-point formats. This method was su…
-
llmcompressor tool enables LLM compression via FP8, GPTQ, SmoothQuant
A new open-source tool named llmcompressor allows developers to compress and benchmark instruction-tuned large language models. The tool demonstrates how to apply post-training quantization techniques such as FP8, GPTQ,…
-
LoKA framework enables low-precision FP8 for large recommendation models
Researchers have developed LoKA, a framework designed to make low-precision arithmetic, specifically FP8, practical for large recommendation models (LRMs). Unlike previous attempts that often degraded model quality, LoK…
-
Superhuman and Databricks build 200K QPS AI inference platform
Superhuman and Databricks engineers collaborated to build a high-throughput inference platform capable of handling over 200,000 queries per second. This joint effort modernized Superhuman's serving stack, migrating from…
-
LLM Study Diary #3: PyTorch tensors, float types, and training infrastructure
This LLM study diary entry focuses on PyTorch fundamentals for training large language models. It details tensor basics, exploring various floating-point data types like FP32, BF16, and FP8 for efficiency and stability.…
-
SnapMLA paper details hardware-aware FP8 quantized pipelining for efficient long-context MLA decoding
Researchers have developed SnapMLA, a new framework designed to enhance the efficiency of long-context decoding in Multi-head Latent Attention (MLA) architectures. This approach utilizes hardware-aware FP8 quantization …
-
TACO framework boosts LLM training throughput by 1.87X with tensor compression
Researchers have introduced TACO, a novel framework designed to enhance the efficiency of training large-scale tensor-parallel Large Language Models (LLMs). TACO addresses communication overhead by employing an FP8-base…
-
NVIDIA launches Nemotron 3 Nano Omni, unifying multimodal AI for efficiency
NVIDIA has released Nemotron 3 Nano Omni, an open multimodal model capable of processing text, images, audio, and video. This model aims to unify these modalities into a single architecture, improving efficiency and ena…
-
Qwen3.6-35B model quantizations show FP8 quality worse than INT8, NVFP4 is a lie
A user in Reddit's LocalLLaMA community shared findings on the Qwen3.6-35B model, focusing on Kullback-Leibler divergence (KLD) metrics for quantization formats such as INT8, FP8, and NVFP4. The analysis, conduct…
-
AI safety research proposes formal framework for computational substrates
This series of posts explores the concept of 'substrates' in AI, which refers to the computational context layers necessary for implementing AI systems. The authors argue that current AI safety research lacks a clear fr…
-
DeepSeek V4 models offer high performance with reduced inference costs and NPU support
DeepSeek has released its V4 family of open-weight large language models, featuring a 1.6 trillion parameter model and a smaller 284 billion parameter Flash MoE model. These new models claim to rival top proprietary LLM…
-
SpikingBrain2.0 model offers efficient long-context and cross-platform AI inference
Researchers have introduced SpikingBrain2.0 (SpB2.0), a 5 billion parameter model designed for efficient long-context processing and cross-platform inference. The model features a novel Dual-Space Sparse Attention mecha…