PulseAugur

User explores hybrid model splitting for eGPU LLM performance

A user on the r/LocalLLaMA subreddit is exploring ways to speed up large language model inference across two eGPUs connected via Thunderbolt 3 (TB3). They are comparing two model-splitting modes, layer split and tensor split, to maximize throughput in both the prefill and decoding phases. They also ask whether a hybrid split could combine the strengths of each mode to work around the bandwidth limits of the TB3 link.
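To make the bandwidth trade-off concrete, here is a rough back-of-envelope sketch in Python of how much data might cross the eGPU link per generated token under each mode. It is not taken from the post. It assumes a Megatron-style tensor split with two all-reduces per transformer layer and a ring all-reduce cost model, which may not match what the user's inference software actually does. The model shape (8192 hidden size, 80 layers), the fp16 activations, the 50 µs per-sync latency, and every function name are hypothetical placeholders, and 40 Gbps is the nominal link rate rather than measured throughput.

```python
def layer_split_bytes_per_token(hidden_size, n_gpus, bytes_per_elem=2):
    """Layer split: one hidden-state vector crosses each GPU boundary per token."""
    return hidden_size * bytes_per_elem * (n_gpus - 1)


def tensor_split_bytes_per_token(hidden_size, n_layers, n_gpus,
                                 bytes_per_elem=2, allreduces_per_layer=2):
    """Tensor split (assumed Megatron-style): every layer all-reduces a
    hidden-state vector, here twice per layer, using a ring all-reduce
    that moves 2 * (n - 1) / n of the buffer per GPU."""
    per_allreduce = 2 * (n_gpus - 1) / n_gpus * hidden_size * bytes_per_elem
    return n_layers * allreduces_per_layer * per_allreduce


def link_seconds(n_bytes, n_syncs, link_gbps, latency_s):
    """Time on the link: payload at the given rate plus a fixed cost per sync."""
    return n_bytes / (link_gbps * 1e9 / 8) + n_syncs * latency_s


if __name__ == "__main__":
    # Hypothetical model shape and link figures, chosen only for illustration.
    hidden, layers, gpus = 8192, 80, 2
    link_gbps, latency_s = 40.0, 50e-6

    layer_bytes = layer_split_bytes_per_token(hidden, gpus)
    tensor_bytes = tensor_split_bytes_per_token(hidden, layers, gpus)

    layer_t = link_seconds(layer_bytes, gpus - 1, link_gbps, latency_s)
    tensor_t = link_seconds(tensor_bytes, 2 * layers, link_gbps, latency_s)

    print(f"layer split : {layer_bytes} B/token, {layer_t * 1e3:.3f} ms on the link")
    print(f"tensor split: {tensor_bytes:.0f} B/token, {tensor_t * 1e3:.3f} ms on the link")
```

Under these assumptions the tensor split moves far more data per token and, more importantly for decoding, synchronizes across the link many more times per token, so per-transfer latency can matter more than raw bandwidth. Prefill processes many tokens per transfer, so the balance there may differ.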

IMPACT Potential for improved LLM inference performance on multi-GPU consumer hardware.

RANK_REASON User-generated discussion about technical implementation details for running LLMs on consumer hardware.


AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/tired514

    Tensor split performance on low-bandwidth (TB3) eGPUs, and a question

    Hey everyone!

    I've got a pair of Morefine G1 4090M 16GB eGPUs connected at 40 Gbps via TB3 (daisy-chained). I normally run them in layer split mode as it doesn't seem to need much bandwidth; I'm seeing around 1300 t/s PP and 26 t/s TG (35-40 …