LLM inference efficiency explored on edge devices and cloud GPUs

By PulseAugur Editorial · [3 sources] · 2026-06-08 04:00

Two new research papers explore the challenges of running large language models (LLMs) efficiently. The first investigates the performance trade-offs of deploying LLMs on edge devices such as smartphones and specialized NPUs, highlighting thermal constraints and memory bandwidth limitations. The second introduces a scalable framework that uses heuristic algorithms to optimize resource allocation for LLM inference in heterogeneous GPU cloud environments, aiming to meet service level objectives while minimizing cost.

IMPACT These papers offer insights into optimizing LLM performance and cost for both on-device and cloud deployments, crucial for scaling AI applications.

RANK_REASON The cluster contains two academic papers discussing LLM inference performance and resource allocation.

paper
infra

AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

arXiv cs.AI TIER_1 English(EN) · Shan Yu, Yifan Qiao, Mingyuan Ma, Yangmin Li, Shuo Yang, Xinyuan Tong, Yang Wang, Zhiqiang Xie, Yuwei An, Shiyi Cao, Ke Bao, Deepak Vij, Xiaoning Ding, Yichen Wang, Qingda Lu, Zhong Wang, Gao Gao, Harry Xu, Junyi Shu, Jiarong Xing, Ying Sheng · 2026-06-12 04:00

Prism: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning

arXiv:2505.04021v3 Announce Type: replace-cross Abstract: Inference providers must maintain availability for many LLMs, including low-volume but essential models, making resource efficiency increasingly important as token prices fall. Analysis of production traces reveals a dynam…

arXiv cs.LG TIER_1 English(EN) · Pranay Tummalapalli, Sahil Arayakandy, Ritam Pal, Kautuk Kundan · 2026-06-09 04:00

LLM Inference at the Edge: Mobile, NPU, and GPU Performance Efficiency Trade-offs Under Sustained Load

arXiv:2603.23640v2 Announce Type: replace-cross Abstract: Deploying large language models on-device for always-on personal agents demands sustained inference from hardware tightly constrained in power, thermal envelope, and memory. We benchmark Qwen 2.5 1.5B (4-bit quantised) acr…

arXiv cs.LG TIER_1 English(EN) · Jiaming Cheng, Duong Tung Nguyen · 2026-06-08 04:00

Scalable Joint Resource Allocation for SLO-Constrained LLM Inference in Heterogeneous GPU Clouds

arXiv:2604.07472v2 Announce Type: replace Abstract: Serving large language model (LLM) inference in cloud environments requires jointly optimizing model selection, GPU provisioning, parallelism configuration, and workload routing under latency, accuracy, memory, and budget constr…
