A Reddit user shared performance benchmarks for the GLM-5.2 UD-IQ1_M model running on llama.cpp. The tests used an RTX 5090 and an RTX 3090 Ti and reported prefill (prompt processing) speeds of approximately 579 tokens/second at an 8k context and 324 tokens/second at a 57k context. Decoding (token generation) speed was measured at around 10.6 tokens/second.
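To put these rates in practical terms, the sketch below converts them into approximate wait times. It is a back-of-the-envelope estimate, not part of the original benchmark: it assumes "8k" means 8,192 tokens and "57k" means 57,000 tokens, that each reported prefill rate is an average over the whole prompt, and that the reply length (500 tokens) is a hypothetical value chosen for illustration.

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Return the time needed to process `tokens` at a constant throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


# Rates reported in the post (tokens/second).
PREFILL_8K_TPS = 579.0
PREFILL_57K_TPS = 324.0
DECODE_TPS = 10.6

# Assumptions, not stated in the post: exact token counts behind "8k" and "57k",
# and a hypothetical reply length.
CTX_8K = 8_192
CTX_57K = 57_000
REPLY_TOKENS = 500


def estimate_latency(prompt_tokens: int, prefill_tps: float,
                     reply_tokens: int = REPLY_TOKENS,
                     decode_tps: float = DECODE_TPS) -> float:
    """Estimate total seconds: prefill the prompt, then decode the reply."""
    return (seconds_for(prompt_tokens, prefill_tps)
            + seconds_for(reply_tokens, decode_tps))


if __name__ == "__main__":
    for label, ctx, tps in [("8k", CTX_8K, PREFILL_8K_TPS),
                            ("57k", CTX_57K, PREFILL_57K_TPS)]:
        prefill = seconds_for(ctx, tps)
        total = estimate_latency(ctx, tps)
        print(f"{label} prompt: prefill {prefill:.1f} s, "
              f"total with {REPLY_TOKENS}-token reply {total:.1f} s")
```

Under these assumptions, prefill takes roughly 14 seconds for the 8k prompt and roughly 176 seconds for the 57k prompt, and a 500-token reply adds about 47 seconds of decoding in either case.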
IMPACT Provides specific performance data for running large language models locally, aiding developers in hardware and software choices.
RANK_REASON User-generated performance benchmarks for a specific model and software combination.
AI-generated summary · Google Gemini · from 1 source.