User seeks help optimizing Qwen 3.5 9B inference on MI50 GPU

By PulseAugur Editorial · [1 source] · 2026-06-11 22:29

A user is seeking help configuring the Qwen 3.5 9B model (AWQ 4-bit quantized) for optimal local inference on a single AMD MI50 32GB (GFX906) GPU. They report generation speeds below 1 token per second with the ai-infos vLLM fork for GFX906. The user wants guidance on improving performance, and possibly on setting up a vision/text-to-text model or a Gemma 4 variant.

IMPACT This query highlights the challenges of optimizing local LLM inference on older hardware such as the GFX906-based MI50, particularly with community forks of inference engines and quantized model configurations.

RANK_REASON User-generated content seeking help with model configuration and performance, not a release or official announcement.



AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

r/LocalLLaMA TIER_1 English(EN) · /u/exaknight21 · 2026-06-11 22:29

Mi50 32GB / GFX906 - vLLM Qwen 3.5 Configuration for Qwen 3.5:9B AWQ-4bit

Hi All: I am trying to get the optimal local inference setup for my single Mi50 32 GB. I am trying to use the ai-infos vLLM fork (aiinfos/vllm-gfx906-mobydick:latest), but I am getting low speeds, sub 1 TPS. Has anyone gotten this mod…
