PulseAugur

Jetson AGX Orin 64GB sees faster LLM prefill with q8_0 quantization

A user on the r/LocalLLaMA subreddit reported that, on the Jetson AGX Orin 64GB, q8_0 quantization gives faster prompt processing (prefill) than q6_k or q4_k_xl. Testing the Unsloth Qwen3.6-27B-MTP-GGUF model on a recent llama.cpp build, they saw prefill with q8_0 run more than 20% faster than with q6_k and 10% faster than with q4_k_xl. They hypothesize that the Jetson's CUDA cores may not be well optimized for the lower-bit quantization formats, as memory bandwidth does not appear to be the limiting factor.
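For readers who want to check their own numbers, the percentages above are relative prefill throughputs. A minimal sketch of that arithmetic follows; the function names and the tokens-per-second values are hypothetical placeholders chosen for easy arithmetic, not measurements from the post.

```python
def relative_speedup(baseline_tps: float, candidate_tps: float) -> float:
    """Return how much faster the candidate is than the baseline, in percent."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (candidate_tps / baseline_tps - 1.0) * 100.0


def compare_to_reference(prefill_tps: dict[str, float], reference: str) -> dict[str, float]:
    """Speedup of the reference quant over every other quant, in percent."""
    ref_tps = prefill_tps[reference]
    return {
        quant: relative_speedup(tps, ref_tps)
        for quant, tps in prefill_tps.items()
        if quant != reference
    }


if __name__ == "__main__":
    # Hypothetical prefill throughputs in tokens/s (NOT figures from the post).
    # Replace them with values measured on your own device.
    measured = {"q8_0": 300.0, "q6_k": 250.0, "q4_k_xl": 275.0}
    for quant, pct in compare_to_reference(measured, "q8_0").items():
        print(f"q8_0 vs {quant}: {pct:+.1f}% prefill throughput")
```

Throughput rather than wall-clock time is used as the input here because prefill speed is usually reported in tokens per second; if you only have timings, divide the prompt length by the elapsed time first.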

IMPACT Performance insights for running large language models on edge devices like the Jetson AGX Orin.

RANK_REASON User-generated observation on model quantization performance on specific hardware.

Read on r/LocalLLaMA →

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 (SO) · /u/realblindseeker ·

    Jetson AGX Orin 64GB: q8_0 good, q6_k bad

    Just a quick observation for all three users of Jetson AGX Orin 64GB in this sub: q8_0 quant gives >20% faster prefill (prompt processing) than q6_k, and 10% faster than q4_k_xl.

    Tested with Unsloth Qwen3.6-27B-MTP-GGUF on recent llama.…