A lawyer has updated their local AI setup for legal drafting, now featuring twelve V100 SXM2 32GB GPUs and an additional box with four RTX 3090s and two V100 PCIe cards. They switched from vLLM to llama.cpp for running Mixture-of-Experts (MoE) GGUF models, finding that MoE models offer significantly better performance and context handling on their V100 hardware compared to dense models. The system now employs an orchestrator that routes tasks across multiple local models, utilizing all 16 GPUs for complex jobs like drafting affidavits and motions.
AI IMPACT Demonstrates effective local deployment of MoE models for specialized tasks, potentially reducing reliance on cloud services for niche applications.
RANK_REASON User-level deployment of hardware and software for a specific task.
AI-generated summary · Google Gemini · from 1 source.