Brief

[4/4] 221 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

TOOL · dev.to — LLM tag English(EN) · 2d

How to fix OOM crashes when running large open-source LLMs locally

Running large open-source language models locally can trigger out-of-memory errors even when the model's weights appear to fit in the available VRAM. The main causes are the KV cache, which grows with context length, and the intermediate activation memory allocated during inference. Developers can address these issues by profiling memory usage with tools such as PyTorch's memory snapshot, quantizing both the model weights and the KV cache, and managing memory fragmentation.

IMPACT Provides practical solutions for developers running large language models locally, addressing common memory issues.
- LLM
- PyTorch
- transformers
- llama.cpp
- KV cache
- bitsandbytes
- vLLM
- RTX 4090
- VRAM
- torch.cuda.OutOfMemoryError
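
To make the KV cache point in the item above concrete, here is a rough back-of-the-envelope estimate of cache size as a function of context length. It is a minimal sketch, not taken from the linked article: the function name `kv_cache_bytes` and the model shape (32 layers, 8 KV heads, head dimension 128, fp16) are hypothetical values chosen for illustration, and real runtimes add allocator overhead on top.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   batch_size=1, bytes_per_element=2):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_element=2 assumes fp16/bf16 cache entries.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * batch_size * bytes_per_element)


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for ctx in (4096, 32768):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"context {ctx:>6}: {gib:.1f} GiB")
# context   4096: 0.5 GiB
# context  32768: 4.0 GiB
```

Under these assumptions the cache grows eightfold when the context grows eightfold, which is why a model whose weights fit in VRAM can still run out of memory on long prompts.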
COMMENTARY · dev.to — LLM tag English(EN) · 4d

You Probably Don't Need 8-Bit Quantization

For most users running large language models locally, 4-bit quantization offers a practical balance between performance and quality while significantly reducing VRAM requirements compared to 8-bit. 4-bit models may lose a little reasoning ability on complex tasks, but they perform nearly identically on text generation and instruction following. The approach suits interactive chat and typical production workloads on consumer hardware, where it speeds up inference and makes larger models usable on less powerful GPUs.

IMPACT Enables wider accessibility of large language models on consumer hardware by optimizing resource usage.
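
The VRAM saving described in the item above follows from simple arithmetic on bits per weight. The sketch below is illustrative only: `weight_gib` is a hypothetical helper, the 7-billion-parameter count is an assumed example, and the estimate ignores quantization metadata (scales, zero points), the KV cache, and activations.

```python
def weight_gib(num_params, bits_per_weight):
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumed example size; not a figure from the linked article.
NUM_PARAMS = 7_000_000_000

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_gib(NUM_PARAMS, bits):.1f} GiB")
```

For this assumed size the weights need roughly 13 GiB at 16-bit, 6.5 GiB at 8-bit, and 3.3 GiB at 4-bit, so moving from 8-bit to 4-bit halves the weight footprint before overheads.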
TOOL · arXiv cs.CL English(EN) · 3d

Quantizing Whisper-small: How design choices affect ASR performance

A new study on arXiv evaluates post-training quantization (PTQ) techniques for the Whisper-small automatic speech recognition model. The research tested libraries including PyTorch, Optimum-Quanto, HQQ, and bitsandbytes, and found that dynamic int8 quantization with Quanto gave the best balance of compression and accuracy. That method reduced model size by 57% while slightly improving word error rate on the LibriSpeech dataset, making Whisper-small easier to deploy on resource-constrained devices.

IMPACT Enables more efficient deployment of speech recognition models on edge devices by reducing size and computational cost.
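
As a rough illustration of what int8 quantization of weights involves, the sketch below applies symmetric per-tensor rounding in plain Python. It is not Quanto's implementation and does not reproduce the paper's setup; the helper names and the sample weights are hypothetical, and real libraries typically use finer-grained scales and, in dynamic mode, also quantize activations at run time.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization of a list of floats.

    Returns the integer codes and the scale needed to map them back.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the int8 range used here.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]


# Hypothetical weights, for illustration only.
weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, f"max error {max_error:.5f}")
```

Each weight is stored in one byte instead of four (fp32) or two (fp16), at the cost of a rounding error bounded by about half the scale.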
RESEARCH · dev.to — LLM tag English(EN) · 5d · [2 sources]

I Thought Fine-Tuning LLMs Needed Expensive GPUs. I Was Wrong.

Developers can fine-tune language models such as TinyLlama on consumer hardware with as little as 3 GB of GPU memory by using techniques such as QLoRA and NF4 quantization. The approach trains only a small fraction of the model's parameters, which significantly reduces computational requirements. The workflow can still be complex, with challenges in debugging, prompt formatting, and dependency management, but it offers solo developers a path to building sophisticated AI applications.

IMPACT Enables solo developers and smaller teams to fine-tune advanced LLMs, democratizing AI development and deployment.
- Hugging Face
- QLoRA
- LoRA
- BitsAndBytes
- FastAPI
- PEFT
- TinyLlama
- NF4 quantization
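
The "small fraction of the model's parameters" in the item above can be estimated from the LoRA construction, which adds two low-rank matrices of shapes (d_out × r) and (r × d_in) beside a frozen (d_out × d_in) weight. The sketch below is a minimal illustration: the helper names, the 2048 × 2048 layer size, and the rank of 8 are assumptions, since the linked article's actual rank and target modules are not stated here.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds for one weight matrix."""
    return rank * (d_in + d_out)


def lora_fraction(d_in, d_out, rank):
    """Trainable LoRA parameters relative to the frozen matrix size."""
    return lora_params(d_in, d_out, rank) / (d_in * d_out)


# Hypothetical layer shape and rank, for illustration only.
D_IN, D_OUT, RANK = 2048, 2048, 8

print(lora_params(D_IN, D_OUT, RANK))             # 32768
print(f"{lora_fraction(D_IN, D_OUT, RANK):.2%}")  # 0.78%
```

Under these assumptions less than 1% of each adapted matrix is trained, which, together with holding the frozen base weights in 4-bit form, is what keeps the memory budget small.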
