A user on the r/LocalLLaMA subreddit is exploring performance optimizations for running large language models across multiple eGPUs connected via Thunderbolt 3. They are experimenting with different model splitting techniques, specifically layer split versus tensor split, to maximize throughput for both prefill and decoding phases. The user is investigating the theoretical possibility of a hybrid split that could leverage the strengths of each method to overcome bandwidth limitations inherent in their TB3 setup. AI
IMPACT Potential for improved LLM inference performance on multi-GPU consumer hardware.
RANK_REASON User-generated discussion about technical implementation details for running LLMs on consumer hardware.
AI-generated summary · Google Gemini · from 1 sources. How we write summaries →