PulseAugur

half-precision floating-point format

PulseAugur coverage of the half-precision floating-point format (FP16): every cluster that mentions it across labs, papers, and developer communities, ranked by signal.

Total · 30d: 19 (19 over 90d)
Releases · 30d: 0 (0 over 90d)
Papers · 30d: 14 (14 over 90d)
SENTIMENT · 30D: 6 days with sentiment data

RECENT · PAGE 1/1 · 19 TOTAL
  1. RESEARCH · CL_111257

    PersistentKV optimizes LLM serving on commodity GPUs with new scheduling techniques

    A new paper introduces PersistentKV, a system designed to optimize the serving of large language models (LLMs) with long contexts on commodity GPUs. PersistentKV employs page-aware decode scheduling and a native block-t…

  2. RESEARCH · CL_97851

    SwitchBraidNet architecture offers lightweight hybrid BCI for low-power deployment

    Researchers have developed SwitchBraidNet, a novel lightweight architecture for hybrid brain-computer interfaces (BCIs) that integrates motor imagery and steady-state visual evoked potentials. This compact model is desi…

  3. RESEARCH · CL_95821

    Ternary Mamba achieves 3.61x compression via QAT with knowledge distillation

    Researchers have developed a new method for compressing State Space Models (SSMs) like Mamba-2, significantly reducing their memory footprint for edge deployment. By employing grouped quantization-aware training (QAT) w…

  4. TOOL · CL_87068

    Local LLM Hardware Guide: VRAM, Quantization, and Performance

    Running large language models (LLMs) locally, particularly those with 70 billion parameters, presents significant hardware challenges, primarily concerning VRAM capacity. While marketing often suggests minimal requireme…

  5. TOOL · CL_86852

    Apple M4 Max GPU's Tensor Compute Path Emulated, Not Accelerated

    Researchers have reverse-engineered the Metal 4.1 tensor compute path on Apple's M4 Max GPU, revealing that the fp8 matmul2d operation is emulated rather than hardware-accelerated. This means the operation runs on the G…

  6. RESEARCH · CL_90886

    Neural speaker diarization models compressed for efficiency

    Researchers have explored efficiency-performance trade-offs in neural speaker diarization for resource-constrained hardware, particularly for time-critical applications like medical dispatch. Using the SIMSAMU dataset, …

  7. TOOL · CL_68648

    LLM inference speed bottlenecked by GPU memory bandwidth, not compute

    This article explains that the primary bottleneck for LLM inference in production is often the model's raw speed on the GPU, rather than serving logic or network overhead. It details how LLM inference, particularly duri…

  8. TOOL · CL_59553

    Qwen 3.6 27B FP16 vs Q8 quantization performance debated

    A user on the r/LocalLLaMA subreddit asks how FP16 and Q8 quantization compare in performance for the Qwen 3.6 27B model. They are experiencing slow FP16 performance on their setup and are se…

  9. RESEARCH · CL_55741

    Trillion-parameter AI models challenge Kubernetes orchestration

    Running trillion-parameter AI models within Kubernetes clusters presents significant challenges beyond standard container orchestration. These massive models require distributed systems approaches, where a single 'repli…

  10. RESEARCH · CL_47640

    llama.cpp releases add Vulkan, optimize matrix math, and improve server logging

    The llama.cpp project has released several updates, including version b9580 which adds Vulkan support for matrix-matrix multiplication and Flash Attention, along with optimizations for FP16 dot2 extensions. Other recent…

  11. RESEARCH · CL_48899

    ThriftAttention boosts AI efficiency with selective mixed-precision attention

    Researchers have developed ThriftAttention, a novel method to improve the efficiency of long-context attention mechanisms in AI models. This technique selectively applies higher precision (FP16) to a small percentage of…

  12. TOOL · CL_35323

    Q4_K_M recommended for local LLM quantization, balancing quality and VRAM

    The article recommends Q4_K_M quantization as the best balance of quality and VRAM efficiency for most local LLM users, preserving 93-96% of FP16 quality. For users with more VRAM, Q5_K_M offers a noticeable improvement…

  13. TOOL · CL_22592

    INT8 quantization can slow down AI inference, study finds

    A recent analysis explored the performance of INT8 quantization versus FP16 precision on NVIDIA's Ada Lovelace architecture, specifically using an L40S datacenter GPU and an RTX 4090 consumer card. The findings indicate…

  14. RESEARCH · CL_15546

    EdgeLPR paper explores neural network precision vs performance trade-offs for LiDAR place recognition

    Researchers have developed EdgeLPR, a method for efficient LiDAR-based place recognition on edge devices. The approach utilizes Bird's Eye View representations to enable lightweight image-based networks for autonomous n…

  15. RESEARCH · CL_14350

    Object detection models show mixed robustness to quantization and input degradations

    A new study investigates how post-training quantization (PTQ) affects the robustness of YOLO object detection models when faced with real-world input degradations like noise and blur. Researchers evaluated various preci…

  16. RESEARCH · CL_06527

    New methods QFlash and ELSA boost Vision Transformer attention efficiency

    Researchers have developed two new methods to improve the efficiency of attention mechanisms in vision transformers. QFlash focuses on enabling integer-only operations for FlashAttention, achieving significant speedups …

  17. RESEARCH · CL_03804

    AI safety research proposes formal framework for computational substrates

    This series of posts explores the concept of 'substrates' in AI, which refers to the computational context layers necessary for implementing AI systems. The authors argue that current AI safety research lacks a clear fr…

  18. TOOL · CL_17754

    Apple's SeedLM compresses LLM weights using pseudo-random generators

    Researchers have developed SeedLM, a novel post-training compression technique for large language models that utilizes pseudo-random generator seeds to encode model weights. This method aims to reduce the high runtime c…

  19. RESEARCH · CL_01035

    Optimizing Transformer Inference: Techniques for Faster, Cheaper Large Models

    Large transformer models present significant inference challenges due to their substantial memory footprint and computation costs, which scale quadratically with input length. Researchers and practitioners are exploring…