Brief · PulseAugur

RESEARCH · arXiv cs.CL English(EN) · 4d · [2 sources]

ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU

Researchers have developed ModeSwitch-LLM, a lightweight controller that improves the efficiency of large language model inference on a single GPU. The system dynamically routes each request to one of several inference modes, including quantized, speculative, and hybrid configurations, based on workload features. In evaluations on Meta-Llama-3.1-8B-Instruct, it achieved a 2.10x latency speedup and a 51.7% reduction in energy consumption per token relative to a standard FP16 baseline, while maintaining near-equivalent accuracy.
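To make the routing idea concrete, here is a minimal sketch of a rule-based mode selector. It is not the paper's controller: the `Workload` fields, the thresholds, and the rules in `choose_mode` are all hypothetical, chosen only to illustrate how workload features could map to the quantized, speculative, hybrid, and FP16 modes the brief names.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FP16 = "fp16"                # baseline, no optimization
    QUANTIZED = "quantized"      # lower-precision weights
    SPECULATIVE = "speculative"  # draft-and-verify decoding
    HYBRID = "hybrid"            # quantized and speculative combined


@dataclass(frozen=True)
class Workload:
    """Hypothetical per-request features; the real feature set may differ."""
    prompt_tokens: int    # length of the prefill phase
    max_new_tokens: int   # requested length of the decode phase
    batch_size: int = 1


# Illustrative thresholds only; these are assumptions, not values from the paper.
LONG_PROMPT_TOKENS = 2048
DECODE_HEAVY_TOKENS = 256


def choose_mode(w: Workload) -> Mode:
    """Pick an inference mode from simple workload features."""
    long_prompt = w.prompt_tokens >= LONG_PROMPT_TOKENS
    decode_heavy = w.max_new_tokens >= DECODE_HEAVY_TOKENS

    # Assumption: speculative decoding helps less once requests are batched,
    # so batched traffic falls back to plain quantization.
    if w.batch_size > 1:
        return Mode.QUANTIZED
    if decode_heavy and long_prompt:
        return Mode.HYBRID
    if decode_heavy:
        return Mode.SPECULATIVE
    if long_prompt:
        return Mode.QUANTIZED
    return Mode.FP16


if __name__ == "__main__":
    for w in (Workload(100, 50), Workload(100, 512), Workload(4096, 512)):
        print(w, "->", choose_mode(w).value)
```

A real controller would presumably fit such rules to measured latency, energy, and accuracy per mode rather than use hand-picked cutoffs.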

IMPACT Improves LLM inference efficiency on single GPUs, potentially lowering operational costs and increasing accessibility.

Meta-Llama-3.1-8B-Instruct
NVIDIA A100
ModeSwitch-LLM