FP8
PulseAugur coverage of FP8: every cluster mentioning FP8 across labs, papers, and developer communities, ranked by signal.
-
Ornith 1.0 models explained: Dense vs MoE and format/precision details
A new guide explains the terminology and concepts behind the Ornith 1.0 models. It clarifies the difference between Dense and Mixture of Experts (MoE) architectures, noting that MoE models act…
-
ComfyUI adds native INT8 support for faster Stable Diffusion image generation
ComfyUI, a popular interface for Stable Diffusion, has officially integrated native support for INT8 quantization. This update allows users to load INT8 models and text encoders directly within ComfyUI, significantly im…
-
Krea2 Turbo FP8 model tested for character recognition and performance
Users are testing the Krea2 Turbo FP8 model, noting its performance and character recognition capabilities. One extensive test involved over 1000 prompts to evaluate how well the model identifies characters from various…
-
New FFT method leverages FP8 tensor cores for high-precision GPU computation
A new research paper proposes an efficient method for calculating Fast Fourier Transforms (FFTs) using NVIDIA's Blackwell Ultra (B300) GPUs. The Ozaki-Bailey FFT technique leverages FP8 tensor cores for dense matrix mul…
-
Krea2 models released for Stable Diffusion in GGUF and FP8 formats
New models and workflows for Krea2 have been released, including GGUF and FP8 formats. These resources are intended for use with Stable Diffusion and are available via Hugging Face. The release also includes additional f…
-
Krea 2 image model released in multiple quantized formats for broader GPU access
The Krea 2 image generation model has been released in quantized versions, including FP8, MXFP8, NVFP4, and INT8 formats, making it accessible for a wider range of GPUs. The model comes in two variants: Krea 2 Raw for t…
-
NVIDIA Blackwell platform dominates MLPerf Training 6.0 benchmarks
NVIDIA's Blackwell platform has set new records in the MLPerf Training 6.0 benchmarks, achieving the fastest times across all seven tests. The platform demonstrated strong scaling, with clusters of up to 8,192 GPUs show…
-
Ideogram 4.0 FP8 VRAM Needs: 16GB vs 24GB GPU Debate
A user is seeking advice on GPU VRAM requirements for running Ideogram 4.0 FP8 locally. They are debating between a 16GB RTX 4070 Ti Super and a 24GB RTX 3090, noting that Ideogram 4.0 with its text encoder can consume …
-
New INT8 Kernel Accelerates Diffusion Transformers on Consumer GPUs
Researchers have developed a fused INT8 GEMM kernel that significantly speeds up diffusion transformers on consumer Ampere GPUs. This new kernel allows the hardware's INT8 tensor cores to be utilized, overcoming a softw…
-
Apple M4 Max GPU's Tensor Compute Path Emulated, Not Accelerated
Researchers have reverse-engineered the Metal 4.1 tensor compute path on Apple's M4 Max GPU, revealing that the fp8 matmul2d operation is emulated rather than hardware-accelerated. This means the operation runs on the G…
-
New quantization methods enable Ideogram 4.0 on consumer GPUs
Researchers have developed new post-training quantization techniques for the Ideogram 4.0 text-to-image diffusion transformer. Their INT8 W8A8 method maintains FP8 quality on consumer GPUs lacking FP8 tensor cores, outp…
-
Paper catalogs 84 numeric formats for ML hardware consistency
A new paper introduces a comprehensive catalog of 84 numeric formats used in machine learning hardware, addressing the challenge of silent divergences when porting models across different accelerators. The catalog inclu…
-
FP8 attention precision issues analyzed, reverse iteration and S=256 scaling proposed
A new research paper analyzes precision challenges in FP8 attention computations, specifically focusing on the softmax probability matrix (P) when cast to FP8. The study identifies an issue called "P-collapse" that occu…
-
FP8 with reconstruction schemes matches FP64 accuracy in HPC
A new research paper challenges the long-held belief that double-precision (FP64) hardware is essential for high-performance computing (HPC). The authors propose that using FP8 tensor cores, combined with specific recon…
-
Hcompany ships Holo3.1 agents for fast, local computer use
Hcompany has released Holo3.1, a new family of computer-use agents designed for robust performance across various environments and agent frameworks. This release emphasizes local inference capabilities, offering quantiz…
-
Fizgig Klein 9b Lora Studio updates for 16GB cards
Fizgig Klein 9b Lora Studio has released version 1.2.4, focusing on performance improvements and optimizations for users with 16GB graphics cards. This update enhances training speed through FP8 utilization and allows f…
-
RTX 3060 users: Disable low-VRAM flags for better Flux Klein performance
A user on Reddit discovered that for the Flux 2 Klein model on an RTX 3060 with 12GB VRAM, FP8 quantization performed similarly to GGUF quantization in terms of speed. The primary performance bottleneck was not the mode…
-
Trillion-parameter AI models challenge Kubernetes orchestration
Running trillion-parameter AI models within Kubernetes clusters presents significant challenges beyond standard container orchestration. These massive models require distributed systems approaches, where a single 'repli…
-
Together AI releases FlashAttention-3 and -4 for faster LLM processing
Together AI has released FlashAttention-3 and FlashAttention-4, significant upgrades to their GPU-accelerated attention mechanism for large language models. FlashAttention-3, designed for Hopper GPUs, achieves up to 75%…
-
New methods enhance LLM quantization for efficiency and accuracy
Researchers have developed several new methods to improve the efficiency and accuracy of quantizing large language models (LLMs). These techniques aim to reduce the memory footprint and computational cost of LLMs, makin…
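Several items above turn on how much memory a model's weights occupy at a given precision. As a rough sketch only, assuming weights dominate and ignoring activations, text encoders, and per-block scale factors, the arithmetic is parameters times bits per weight. The function name `weight_gigabytes` and the 9-billion-parameter figure are hypothetical and chosen for illustration; they are not taken from any of the listed releases.

```python
# Bits per stored weight for a few formats named in the listing above.
# Block-scaled formats (MXFP8, NVFP4) and GGUF quantizations also store
# scale factors, so they are deliberately left out of this sketch.
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "INT8": 8}


def weight_gigabytes(n_params: int, fmt: str) -> float:
    """Approximate weight storage in GB (10**9 bytes) for n_params weights."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


if __name__ == "__main__":
    n_params = 9_000_000_000  # hypothetical 9B-parameter model (assumption)
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt}: {weight_gigabytes(n_params, fmt):.1f} GB")
```

Under these assumptions, moving from FP16 to FP8 or INT8 halves the weight footprint, which is why an 8-bit checkpoint can fit on a card where the 16-bit one cannot. Actual VRAM use will be higher once activations and any auxiliary models are loaded.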