A user reported setting up and benchmarking the Qwen 3.6 27B model on a dual Radeon R9700 GPU configuration with llama.cpp. Decode speed reached up to 67 tokens/s at a context of 10-13k tokens and stayed above 40 tokens/s at 125k. Prefill throughput exceeded 1,000 tokens/s for prompts under 10k tokens and was around 400 tokens/s for prompts over 100k. The report covers the hardware, software, and test methodology, including how decode and prefill throughput were measured, and discusses prompt caching strategies.
IMPACT Shows that a 27B model can run multi-GPU inference at long context on consumer hardware, which may lower the hardware cost of running large language models locally.
RANK_REASON User-generated report on running a specific model with specific hardware, including performance metrics.
AI-generated summary · Google Gemini · from 1 source.