Brief

last 24h

[42/42] 221 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

TOOL · dev.to — LLM tag English(EN) · 6h

I ran Claude Code on a local LLM for 4 hours — 7M tokens, $0 (would have cost $94)

A developer successfully ran Anthropic's Claude Code locally for four hours, processing 7 million tokens without incurring API costs. This was achieved by routing Claude Code's requests through LiteLLM to a local Qwen3.6-27B-MTP model running on an AMD GPU via llama.cpp. The setup offers benefits such as no rate limits, enhanced privacy, and offline capability, with the developer providing detailed instructions and hardware requirements for replication. AI

IMPACT Enables cost-free, private, and offline use of advanced coding models by leveraging local hardware.
TOOL · r/LocalLLaMA English(EN) · 4h

CUDA: add fast walsh-hadamard transform by am17an · Pull Request #23615 · ggml-org/llama.cpp

A pull request to the llama.cpp project introduces a CUDA implementation of the Fast Walsh-Hadamard Transform (FWHT). This optimization, developed by user am17an, aims to speed up operations when quantizing the key-value cache. Initial benchmarks on the Gemma 4 26B model show modest gains: a 1-2% boost in prompt processing (pp) and a 7-9% increase in token generation (tg). AI

IMPACT Improves inference efficiency for local LLM deployments by optimizing KV cache operations.
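
For readers unfamiliar with the transform named in this item, here is a minimal pure-Python sketch of the standard in-place butterfly form of the FWHT. It is not the CUDA kernel from the pull request; the function name `fwht` and the unnormalized output convention are assumptions made for illustration.

```python
def fwht(values):
    """Unnormalized Walsh-Hadamard transform of a power-of-two-length sequence.

    Illustrative sketch only; not taken from the llama.cpp pull request.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Combine pairs that sit h apart: (a, b) -> (a + b, a - b).
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


if __name__ == "__main__":
    print(fwht([1, 0, 1, 0, 0, 1, 1, 0]))
```

The loop does O(n log n) additions and subtractions instead of the O(n^2) work of a dense Hadamard matrix multiply, and applying the transform twice returns the input scaled by n.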
TOOL · llama.cpp — Releases English(EN) · 16h

b9309: perplexity : fix even more integer overflows (#23623)

The llama.cpp project has released version b9309, which includes fixes for integer overflow issues. This release is part of ongoing development and maintenance for the C/C++ implementation of Llama models. AI

IMPACT Minor maintenance update for an open-source AI model implementation.
- perplexity
- llama.cpp
TOOL · r/LocalLLaMA (CA) · 5h

Llama.cpp: Split Mode Tensor Fix Incoming?

A fix is reportedly incoming for the llama.cpp project to address crashes related to split mode tensor operations. This issue has been causing instability, particularly for users employing multiple GPUs, with tests showing a significant performance uplift but also frequent crashes due to VRAM exhaustion. The upcoming fix aims to resolve this specific problem, improving stability for multi-GPU setups. AI

IMPACT This fix will improve stability and performance for users running large models on multi-GPU setups with llama.cpp.
- llama.cpp
- ggml-org/llama.cpp/issues/22404
TOOL · dev.to — LLM tag Deutsch(DE) · 1d

Qwen 3.6 27B and 35B MTP vs Standard on 16GB GPU

A technical analysis explores the performance of Qwen 3.6's 27B and 35B models when using Multi-Token Prediction (MTP), a speculative decoding technique. The tests, conducted on a 16GB VRAM GPU, reveal that MTP can significantly increase token generation speed by predicting multiple tokens per step. However, this speed boost comes at the cost of reduced context window size, particularly with higher MTP settings and certain quantization levels. AI

IMPACT Demonstrates how speculative decoding techniques like MTP can improve inference speed for large language models, albeit with trade-offs in context window size.
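
The draft-and-verify idea behind speculative decoding can be sketched as a toy greedy step. This is a generic illustration, not the MTP implementation in llama.cpp or the article's setup: `draft_next` and `target_next` are hypothetical callables that map a token list to the next token, and a real engine would verify all drafted tokens in one batched forward pass rather than one call per token.

```python
def speculative_step(context, draft_next, target_next, k):
    """One greedy draft-and-verify step; returns the tokens accepted this step.

    Hypothetical sketch: draft_next and target_next stand in for a cheap
    predictor and the full model respectively.
    """
    # Draft k tokens with the cheap predictor.
    drafted = []
    ctx = list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # Verify: keep drafted tokens while they match the target's own choice.
    accepted = []
    ctx = list(context)
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: emit the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # Every draft matched, so the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted
```

Each step yields between 1 and k + 1 tokens, and the output is identical to what the target alone would produce with greedy decoding, so the speedup depends entirely on how often the drafts are accepted.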
TOOL · dev.to — LLM tag English(EN) · 1d

I ran Flux Schnell + LLMs on a $50 GPU. No CUDA. No cloud. No ROCm.

A developer demonstrated running large language models and image generation software on an older AMD RX 580 GPU with 8GB of VRAM, a feat previously thought impossible for such hardware. By leveraging the Vulkan backend for the ggml project, which powers tools like llama.cpp and stable-diffusion.cpp, the developer achieved a 3-4x performance increase over CPU-only processing. This approach bypasses the need for CUDA, ROCm, or DirectML, proving that modern AI tasks can be accessible on more modest, older hardware. AI

IMPACT Demonstrates that older, less powerful GPUs can run AI models, potentially lowering the barrier to entry for local AI development.
- OpenVINO
- llama.cpp
- CUDA
- ggml
- FLUX
- Vulkan
- ROCm
- DirectML
- AMD RX 580
- stable-diffusion.cpp
TOOL · dev.to — LLM tag English(EN) · 1d

llama.cpp Native Tools, Qwen GGUF Models, and Local Multimodal Audio Tools

The llama.cpp project has integrated native tools, including shell command execution and file editing, directly into its server, enabling local large language models to perform actions and automate tasks. This advancement facilitates the creation of more capable autonomous agents that can operate entirely on local hardware. Additionally, a new 35-billion parameter Qwen model, Qwen3.6-35B-A3B, has been released in the GGUF format, optimized for efficient local inference on consumer hardware. AI

IMPACT Enhances local AI agent capabilities and accessibility of large open-weight models on consumer hardware.
- llama.cpp
- Ollama
- GGUF
- Qwen3.6-35B-A3B
- edit_file
- exec_shell
TOOL · r/LocalLLaMA English(EN) · 15h

server: fix checkpoints creation by jacekpoplawski · Pull Request #22929 · ggml-org/llama.cpp

A pull request for the llama.cpp project aims to improve the responsiveness of agentic coding workflows. The proposed changes address issues where context rewriting by tools or models could force full prompt reprocessing, leading to significant delays. By optimizing how llama.cpp handles changes in the conversation history, the update seeks to ensure that only modified portions of the context are reprocessed, making agentic coding more fluid. AI

IMPACT Optimizes a key component for local LLM applications, potentially improving user experience for agentic coding tasks.
TOOL · r/LocalLLaMA English(EN) · 9h

Old Mac Pro still proving its worth

An old Mac Pro, originally costing nearly £10,000, is being repurposed for local LLM work thanks to new Linux drivers that enable its D700 GPUs. The machine, equipped with 64GB of RAM and 24 cores, can now run models via llama.cpp, achieving usable speeds for tasks like planning. Notably, the user found that Qwen 3.5 9B provided superior planning output compared to Anthropic's Claude Sonnet 4.6. AI

IMPACT Demonstrates that older, specialized hardware can still be viable for local LLM inference with software updates.
- Anthropic
- Apple
- Claude Sonnet 4.6
- llama.cpp
- Mac Pro
- Qwen 3.5 9B
- Linux
- D700 GPUs
TOOL · dev.to — LLM tag English(EN) · 6d

Free 35B Multimodal LLM Server on Kaggle GPU — Accessible from Any OpenAI-Compatible Client

A developer has created a method to run a 35 billion parameter multimodal LLM on free Kaggle GPUs, overcoming the typical limitations of such platforms. The solution involves using Qwen3.6-35B-A3B quantized to 4-bit, hosted on Kaggle's T4 GPUs for up to 12 hours per session. It leverages llama.cpp for inference and an OpenAI-compatible API, with Cloudflare Quick Tunnel providing a stable public URL that supports token streaming, unlike other free tunneling services. AI

IMPACT Enables developers to run powerful LLMs on free cloud GPUs, bypassing costly hardware or API fees.
TOOL · dev.to — LLM tag Italiano(IT) · 6d

Local LLMs: Bytedance Lance 3B Multimodal, llama.cpp MTP, Ollama Client

ByteDance has released Lance, a new 3-billion parameter open-source multimodal model designed to run on consumer GPUs. This model can process both images and text, aiming to make advanced AI capabilities more accessible. Concurrently, the popular inference engine llama.cpp has received significant performance enhancements through Multi-Token Prediction (MTP), which boosts local inference speeds. Additionally, a new open-source chat client called Horizon has been launched, offering cross-platform support for interacting with local models via Ollama, as well as cloud-based AI services. AI

IMPACT Advances in lightweight multimodal models and inference engine optimizations will accelerate the development and deployment of local AI applications.
- Horizon
- Ollama
- llama.cpp
- ByteDance
- Lance
TOOL · dev.to — LLM tag English(EN) · 3d

Run Hermes Agent on Any Model — Free, Local, and Cost-Routed

Nous Research has released Hermes Agent, an open-source AI agent designed for continuous learning and broad platform integration. Hermes features a persistent memory, autonomous skill creation, and multi-platform support across messaging apps and terminals. It can be configured to use various LLM providers, including OpenAI, Anthropic, and Ollama, through a universal proxy like Lynkr. AI

IMPACT Enables greater flexibility and cost-efficiency for AI agent users by decoupling tools from specific LLM providers.
- Anthropic
- OpenAI
- OpenRouter
- Nous Research
- Databricks
- llama.cpp
- Ollama
- Azure
- Hermes Agent
- Bedrock
TOOL · dev.to — LLM tag English(EN) · 2d

How to fix OOM crashes when running large open-source LLMs locally

Running large open-source language models locally can lead to out-of-memory errors, even if the model's weights seem to fit within the available VRAM. This is primarily due to the significant memory required for the KV cache, which scales with context length, and intermediate activation memory during inference. Developers can address these issues by profiling memory usage with tools like PyTorch's memory snapshot, applying appropriate quantization techniques to model weights and the KV cache, and managing memory fragmentation. AI

IMPACT Provides practical solutions for developers running large language models locally, addressing common memory issues.
- LLM
- PyTorch
- transformers
- llama.cpp
- KV cache
- bitsandbytes
- vLLM
- RTX 4090
- VRAM
- torch.cuda.OutOfMemoryError
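
The claim that the KV cache scales with context length can be made concrete with a back-of-the-envelope estimate. The sketch below uses the usual formula for a standard transformer with one key and one value tensor per layer; the model shape in the example is hypothetical and not taken from the article.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_element=2, batch_size=1):
    """Estimate KV cache size in bytes for a standard transformer.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_element=2 corresponds to fp16; use 1 for an 8-bit cache.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_element * batch_size)


if __name__ == "__main__":
    # Hypothetical shape: 32 layers, 8 KV heads, head dimension 128.
    for ctx in (8192, 32768, 131072):
        gib = kv_cache_bytes(32, 8, 128, ctx) / 1024**3
        print(f"context {ctx:>6}: {gib:.1f} GiB")
```

Because the estimate is linear in `context_len`, quadrupling the context quadruples the cache, which is how a model whose weights fit in VRAM can still run out of memory at long contexts.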
TOOL · dev.to — LLM tag English(EN) · 3d

I built a version manager for llama.cpp using nothing but vibe coding.

A developer created a version manager for the llama.cpp project, inspired by Node.js's nvm tool. This new tool, named 'lvm', allows users to easily install, switch between, and manage different versions of llama.cpp, simplifying the update process for those who frequently use the software. The project was developed using Go and is available on GitHub for community contributions. AI

IMPACT Simplifies workflow for developers using llama.cpp, potentially accelerating experimentation with new model versions.
TOOL · dev.to — LLM tag Italiano(IT) · 3d

Zero-Idle Local LLMs: Running Llama 3 in AWS Lambda Containers

A new approach allows running open-source LLMs like Llama 3 directly within AWS Lambda containers, bypassing traditional API providers for specific tasks. This method leverages model quantization and increased Lambda container limits to enable self-hosting of LLMs on serverless CPUs. While not universally cheaper than managed APIs, it offers significant cost savings and enhanced privacy for high-volume, low-reasoning workloads. AI

IMPACT Enables cost-effective, private LLM inference for high-volume, low-reasoning tasks, potentially shifting workloads from API providers to self-hosted solutions.
- Anthropic
- OpenAI
- AWS
- AWS Lambda
- Llama 3
- llama.cpp
- Amazon Bedrock
- Claude 3 Haiku
- Amazon SQS
- DynamoDB
TOOL · Mastodon — fosstodon.org English(EN) · 6d

New week, new slides: Run LLMs Locally Now including multi-token prediction using Qwen3.6 35B-A3B with Nextn quantization. Also speech recognition using Qwen-3-

Thomas Bley has released new slides detailing how to run large language models locally. The presentation covers multi-token prediction using the Qwen3.6 35B-A3B model with Nextn quantization. It also includes information on speech recognition with Qwen-3-ASR, which now functions with Llama.cpp. AI

IMPACT Provides a guide for local execution of open-source LLMs and ASR models, enabling broader experimentation and use.
TOOL · dev.to — LLM tag English(EN) · 4d

Qwen 3.6 & llama.cpp Push Local Inference Limits on Consumer GPUs

The open-weight model Qwen 3.6, in its 35 billion parameter version, has achieved an impressive 110 tokens per second inference speed on consumer GPUs with 12GB of VRAM. This performance was enabled by a specialized variant of llama.cpp, referred to as ik_llama.cpp, and specific quantization techniques. Additionally, a 27 billion parameter version of Qwen 3.6 has been successfully deployed locally using llama.cpp's server configuration, providing a practical example for self-hosted AI applications. AI

IMPACT Accelerates the accessibility and practicality of running powerful LLMs on local hardware, reducing reliance on cloud services.
RESEARCH · Mastodon — fosstodon.org 日本語(JA) · 6d · [2 sources]

New Features in llama.cpp: Model Management https://huggingface.co/blog/ggml-org/model-management-in-llamacpp *AI-generated auto-post (headline + link) #AI #GenerativeAI #LLM #AIGenerated

Hugging Face is highlighting new developments in open-source AI models and tools. One post details how Codex is making its AI models available to the public, while another introduces new model management features within the llama.cpp project. AI

IMPACT Highlights advancements in open-source AI, potentially enabling broader community development and adoption.
TOOL · Mastodon — fosstodon.org English(EN) · 1d

#Copilot and I finally decided to drop LM studio and go for llama.cpp. I don't like to be bound by one company. We are in the process of moving our #AI Brower

A user is migrating their AI browser application cluster from LM Studio to llama.cpp. This move is motivated by a desire to avoid being tied to a single company's offerings. The application is intended for chatting with IBM's Granite 4.1 8B model and will also host over 20 other AI applications to support future research. AI

IMPACT User-level migration of AI tooling; minimal industry-wide impact.
COMMENTARY · dev.to — LLM tag English(EN) · 2d

CPU vs GPU inference in llama.cpp isn’t just about speed — it’s about real-world constraints. In many local AI deployments, consistency and availability matter more than peak performance. Great breakdown of the tradeoffs in local LLM inference. #LLM

This article explores the practical differences between CPU and GPU inference for large language models (LLMs) using the llama.cpp framework. It highlights that while GPUs offer superior speed, CPUs can be a viable alternative when factors like consistency, availability, and resource constraints are more critical for local deployments. The piece provides a detailed analysis of the trade-offs involved in choosing between these hardware options for running LLMs. AI

IMPACT Provides practical guidance for operators on hardware choices for local LLM deployments, impacting cost and performance considerations.
- llama.cpp
- GPU
- CPU
- Maxim Saplin
TOOL · dev.to — LLM tag English(EN) · 5d · [2 sources]

LM Studio Adds MTP Speculative Decoding; Qwen 3.6 GGUF Quants, Ollama Insights

LM Studio has updated to version 0.4.14 Build 2 (Beta), integrating MTP Speculative Decoding to accelerate local large language model inference. This feature allows for faster text generation by predicting multiple tokens simultaneously, making local AI interactions more fluid. Additionally, new GGUF quantizations for the Qwen 3.6 35B model have been released, with benchmarks comparing MTP and NTP performance across various hardware, providing users with data to optimize their local LLM deployments. AI

IMPACT Enhances local LLM inference speed and accessibility for users running models on their own hardware.
COMMENTARY · dev.to — LLM tag English(EN) · 4d

You Probably Don't Need 8-Bit Quantization

For most users running large language models locally, 4-bit quantization offers a practical balance between performance and quality, significantly reducing VRAM requirements compared to 8-bit. While 4-bit models may show a slight decrease in reasoning capabilities on complex tasks, they remain nearly identical for text generation and instruction following. This approach is particularly beneficial for interactive chat and typical production workloads on consumer hardware, enabling faster inference speeds and making larger models accessible on less powerful GPUs. AI

IMPACT Enables wider accessibility of large language models on consumer hardware by optimizing resource usage.
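
The VRAM difference between 8-bit and 4-bit weights follows from simple arithmetic. The sketch below is a rough estimate under stated assumptions: the 8-billion-parameter model is hypothetical, and real quantization formats store extra scale data, so their effective bits per weight are somewhat higher than the nominal figure.

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate memory for model weights: parameters times bits, over 8."""
    return n_params * bits_per_weight / 8


if __name__ == "__main__":
    n_params = 8_000_000_000  # hypothetical 8B-parameter model
    for bits in (16, 8, 4):
        gb = weight_bytes(n_params, bits) / 1e9
        print(f"{bits:>2}-bit weights: about {gb:.0f} GB")
```

Halving the bits per weight halves the weight footprint, and the memory freed can go to the KV cache or to a larger model.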
TOOL · r/LocalLLaMA English(EN) · 23h

hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX)

A new open-source inference engine called hipEngine has been developed for AMD's RDNA3 GPUs, enabling faster native inference of the Qwen 3.6 large language model. The engine, written in Python with a HIP/C++ core, utilizes AMD's native libraries to achieve competitive performance against llama.cpp. Benchmarks show hipEngine outperforming llama.cpp in prompt processing speeds across various context lengths, particularly at 128K context, and demonstrating lower peak memory usage. AI

IMPACT Enables faster local LLM inference on AMD GPUs, potentially broadening hardware accessibility for AI model deployment.
- Qwen 3.6
- AMD
- llama.cpp
- RDNA3
- ROCm
- ParoQuant
- hipEngine
TOOL · dev.to — LLM tag English(EN) · 1w · [2 sources]

Unload All llama.cpp Router Models Without Restarting

The llama.cpp router mode allows local LLM operators to manage multiple models, offering performance and control similar to services like Ollama. While it supports loading and unloading individual models, there isn't a direct API endpoint to unload all models simultaneously. Users can achieve this by first querying the router for all loaded models and then programmatically sending individual unload requests for each, a method that provides explicit control and avoids restarting the entire inference service. AI

IMPACT Enables more efficient VRAM management for local LLM deployments, improving usability for self-hosted models.
- llama.cpp
- jq
- curl
- Ollama
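
The list-then-unload loop described in this item can be sketched in Python rather than with `curl` and `jq`. The endpoint paths (`GET /models`, `POST /models/unload`), the request body `{"model": ...}`, and the response shape with a `status.value` field are all assumptions here; check them against the router documentation for your llama.cpp build before relying on this.

```python
import json
import urllib.request


def loaded_model_ids(listing):
    """Pick the ids of loaded models out of a listing (assumed response shape)."""
    return [m["id"] for m in listing.get("data", [])
            if m.get("status", {}).get("value") == "loaded"]


def unload_all(base_url, http=None):
    """Query the router, then send one unload request per loaded model."""
    http = http or _http_json
    listing = http("GET", f"{base_url}/models", None)        # assumed endpoint
    ids = loaded_model_ids(listing)
    for model_id in ids:
        http("POST", f"{base_url}/models/unload", {"model": model_id})  # assumed
    return ids


def _http_json(method, url, payload):
    """Minimal JSON-over-HTTP helper built on the standard library."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(url, data=data, method=method,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read() or b"{}")
```

Calling `unload_all("http://localhost:8080")` would return the ids it asked the router to unload. The `http` parameter exists so the loop can be exercised with a stub instead of a live server.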
TOOL · Tom's Hardware English(EN) · 2d · [3 sources]

768GB of cheap Intel Optane DIMM memory sticks used to run 1-trillion-parameter LLM on a system with a single GPU — local Kimi K2.5 install achieved roughly 4 tokens per second

A Redditor has successfully run a 1-trillion-parameter LLM, specifically Kimi K2.5, locally on a single GPU workstation by utilizing 768GB of second-hand Intel Optane Persistent Memory modules as RAM. This setup achieved approximately 4 tokens per second, a performance deemed impressive given the hardware's budget constraints. The use of discontinued Optane DIMMs highlights a potential market gap for affordable, high-capacity memory solutions for large language model inference, especially as DRAM prices fluctuate. AI

IMPACT Demonstrates a cost-effective method for running large LLMs locally, potentially influencing future hardware configurations for AI inference.
RESEARCH · llama.cpp — Releases (SO) · 3d · [8 sources]

b9289

The llama.cpp project has released several updates, including version b9297 which adds NVFP4 MTP scale tensors and links Qwen3.5 MTP tensors. Previous releases, such as b9296 and b9295, focused on bug fixes and improvements for Vulkan and other functionalities. These releases provide pre-compiled binaries for a wide range of operating systems and hardware architectures, including macOS, Linux, Android, and Windows, with support for various compute backends like CUDA, ROCm, Vulkan, and SYCL. AI

IMPACT Ongoing development of llama.cpp provides users with more efficient and compatible tools for running LLMs on diverse hardware.
- llama.cpp
TOOL · r/LocalLLaMA English(EN) · 1d

How I do use the recent llama.cpp native tools to do web rag a.k.a. web_fetch (or anything else for the matter) directly from inside the llama-server's webui

A user on Reddit's r/LocalLLaMA shared a detailed method for enabling Retrieval Augmented Generation (RAG) and other command-line functionalities within the llama.cpp server's web UI. This approach involves enabling native tools in llama-server, installing and configuring `firejail` for system-wide sandboxing, and creating a dedicated user with a virtual machine container harness called `smolmachines`. The setup culminates in a multi-layered sandboxing process that allows the LLM to safely execute commands, such as fetching web content using `wget`, directly from its interface. AI

IMPACT Enables more sophisticated RAG and command execution directly from local LLM interfaces, enhancing their utility for complex tasks.
TOOL · llama.cpp — Releases (SO) · 1d · [6 sources]

b9301

The llama.cpp project has released several updates, including versions b9315, b9313, b9311, b9310, b9305, and b9301. These releases introduce various improvements and bug fixes, such as parallelizing quantization look-up table initialization and fixing checkpoint creation in the server component. The updates also provide pre-compiled binaries for a wide range of operating systems and hardware architectures, including macOS, iOS, Linux, Android, and Windows, with support for different compute backends like Vulkan, ROCm, OpenVINO, SYCL, and CUDA. AI

IMPACT Provides updated tooling for running LLMs on diverse hardware, improving accessibility and performance for developers and users.
- CMake
- llama.cpp
- CUDA
- macOS
- iOS
- Windows
- Vulkan
- OpenMP
- Linux
- ROCm
- Android
- OpenVINO
MEME · r/LocalLLaMA English(EN) · 5h

Best coding model on RTX 3060

A user on the r/LocalLLaMA subreddit is seeking recommendations for the best coding-focused large language model that can run on hardware with 12GB of VRAM, specifically an RTX 3060. The user is also inquiring about optimal setup configurations, such as using vLLM or Llama.cpp, and the best quantization methods for this setup. They are looking for practical advice on achieving useful results with these constraints. AI
- Llama.cpp
- vLLM
- r/LocalLLaMA
- RTX 3060
COMMENTARY · r/LocalLLaMA English(EN) · 1d

Need Help Choosing a Harness for Qwen 3.6 27B

A user on Reddit's r/LocalLLaMA subreddit is seeking recommendations for an open-source harness to manage multiple local AI agents. They are currently using Qwen 3.5/3.6 27B models on a Windows 10 machine with an RTX 3090 Ti and 96GB RAM, with LM Studio as their server. The user needs a tool that can easily spawn sub-agents, manage their system prompts and tools, and provide a dashboard to monitor all agent outputs, including their thought processes and tool usage. They also want to integrate a prefill mechanism to pass context from smaller agents to the main agent before message processing. AI

IMPACT Niche tooling improvement; minimal industry-wide impact.
- llama.cpp
- LM Studio
- Postgres
- r/LocalLLaMA
- pi agent
- openwebui
- Redis
- N8N
- RTX 3090 TI
- browserless
- Qwen 3.5|3.6 27B
MEME · r/LocalLLaMA (CA) · 9h

llama.cpp out of memory issue

A user on Reddit's r/LocalLLaMA subreddit is experiencing a persistent out-of-memory (OOM) issue with the llama.cpp software. The problem causes the process to consume increasing amounts of system RAM over 20-40 minutes of use, eventually leading to it being killed. The user has attempted various configurations, builds, and even Docker images, but the issue persists, suggesting a potential memory leak or inefficient memory management within the software under specific usage patterns. AI

IMPACT User-level technical issue with a specific LLM implementation, not a broad industry impact.
TOOL · r/LocalLLaMA English(EN) · 1d

llama.cpp server have built-in native tools (exec_shell, edit_file, etc.)

The llama.cpp server now includes experimental native support for a suite of tools, enabling it to function as a basic agent harness. These tools, including file operations and shell command execution, can be enabled via a command-line flag. However, the implementation currently lacks security sandboxing, requiring users to exercise caution with exposed functionalities. AI

IMPACT Enables basic agent-like capabilities directly within the llama.cpp server, reducing the need for external wrappers for common tasks.
MEME · Mastodon — fosstodon.org English(EN) · 6d

I wonder if using a "dumber" local AI model might help mitigate the cognitive decline some researchers are starting to observe. I am currently running Qwen 3.6

A user on Mastodon is exploring the idea that employing less advanced, locally run AI models could potentially counteract cognitive decline observed by some researchers. They are currently using Qwen 3.6 26B via Llama.cpp, acknowledging its inferiority to models like Claude or Gemini, but finding value in its steerability for generating insights. This approach requires more user guidance to achieve desired outcomes. AI
- Claude
- Gemini
- Llama.cpp
- Qwen 3.6 26B
TOOL · r/LocalLLaMA English(EN) · 3d · [5 sources]

Choosing an abliterated version of Gemma 4 31B and 26B-A4B

New developments in local LLM inference are enhancing performance on consumer hardware. The BeeLlama v0.2.0 release, utilizing a DFlash update, significantly boosts token generation speeds for models like Qwen and Gemma on GPUs such as the RTX 3090, offering up to a 5x speedup. Additionally, ByteShape quantizations are improving Qwen model performance on laptops with limited VRAM, providing a notable speed increase. These advancements aim to make larger, more capable open-weight models practical for everyday local use. AI

IMPACT Enhances local LLM inference performance, making larger models more accessible on consumer hardware.
- llmfan46
- Qwen
- Gemma
- r/LocalLLaMA
- Qwen3.6-35B-A3B
- Gemma 4 31B
- Gemma4-26B-A4B
- ByteShape
- llama.cpp
- Ollama
- RTX 3090
- LLaMA 3.1
- BeeLlama
TOOL · dev.to — LLM tag English(EN) · 4d · [36 sources]

How To Run LLMs Locally

This series of guides provides comprehensive instructions for setting up and running large language models (LLMs) locally on Linux systems. It details hardware and software prerequisites, recommends using llama.cpp for its balance of performance and ease of use, and covers model selection, quantization, and API integration. The guides also include steps for setting up systemd services for 24/7 operation, monitoring performance, and optimizing for various hardware constraints. AI

IMPACT Enables developers to run and experiment with LLMs locally, reducing reliance on cloud services and facilitating custom application development.
- Llama-3
- Continue.dev
- OpenAI API
- Qwen2.5-coder
- Ollama
- VS Code
- Claude API
- Cursor
- Large Language Models
- RTX 3090
- NVIDIA GPU
- Apple Silicon
- Qwen 2.5
- DeepSeek-R1
- RTX 4090
- NVIDIA RTX 3060
- Mac
- llama.cpp
- Mistral-7B
- Ubuntu
- CPU
- RAM
- VRAM
- Linux
- RTX 3060
- Q4_K_M
- Q5_K_M
- NVIDIA
- Llama 2
- Qwen
- CodeLlama
- Phi-3
- Q8_0
- AMD
MEME · r/LocalLLaMA English(EN) · 18h

Could someone please help explain these results?

A user on Reddit's r/LocalLLaMA subreddit is seeking assistance understanding unexpected performance gains when running the Qwen3.6-35B-A3B-UD-Q4_K_XL model. They observed a doubling of inference speed, from 17 to 34 tokens/second, after increasing the `--n-cpu-moe` parameter from 8 to 30, which contradicts their expectation of a performance decrease due to increased CPU load. The user is also inquiring about further optimizations for their setup, which includes 12GB VRAM and 32GB RAM, utilizing llama.cpp with the TurboQuant variant. AI
RESEARCH · Mastodon — sigmoid.social 日本語(JA) · 3w · [133 sources]

NVIDIA Brings Agents to Life with DGX Spark and Reachy Mini https://huggingface.co/blog/nvidia-reachy-mini ※AI-generated automatic post (headline + link) #AI #GenerativeAI #LLM #AIGenerated

Hugging Face has announced several updates and collaborations across its platform. These include enhancements to OCR pipelines with open models, the integration of Sentence Transformers, and the release of Transformers.js v4. Additionally, Hugging Face is strengthening AI security through a partnership with VirusTotal and introducing new models like Granite 4.0 Nano and AnyLanguageModel for efficient LLM operations. AI

IMPACT Hugging Face continues to expand its ecosystem with new models, tools, and collaborations, enhancing capabilities in OCR, AI security, and efficient LLM deployment.
- llama.cpp
- NVIDIA
- Hugging Face
- LeRobot
- NVIDIA Isaac
- AprielGuard
- Google Cloud
- LLM
- AnyLanguageModel
- AMD
- IBM
- VirusTotal
- Transformers.js
- ServiceNow
- Sentence Transformers
- Granite 4.0 Nano
- Anthropic
MEME · r/LocalLLaMA English(EN) · 1d

GPU VRAM only for small models with llama.cpp: is it possible?

A user on the r/LocalLLaMA subreddit is seeking assistance with optimizing their GPU VRAM usage for running smaller language models. Despite successfully running larger models like Gemma4 26B and Qwen 3.6 35B MoEs, they are encountering issues with smaller models like Gemma4-2B still utilizing system RAM. The user has experimented with various command-line options for llama.cpp but has not yet achieved full VRAM utilization without relying on host memory. AI
- Qwen
- llama.cpp
- Gemma4
TOOL · Hugging Face Trending Models Bahasa(ID) · 3w

SulphurAI/Sulphur-2-base

SulphurAI has released its Sulphur-2-base model, a diffusion model designed for image generation. The model is available on Hugging Face and provides instructions for integration with various popular libraries and tools. These include Diffusers, llama-cpp-python, llama.cpp, Ollama, Unsloth Studio, Pi, and Hermes Agent, facilitating its use in local applications and cloud environments. AI

IMPACT Enables developers to integrate a new image generation model into various applications and workflows.
TOOL · Hugging Face Trending Models Dansk(DA) · 1mo

Jiunsong/supergemma4-26b-uncensored-gguf-v2

The Jiunsong/supergemma4-26b-uncensored-gguf-v2 model is now available for use with various popular AI libraries and applications. These include llama-cpp-python, llama.cpp, vLLM, Ollama, Unsloth Studio, and Pi. Detailed instructions and code snippets are provided for integrating the model into local applications and servers, enabling users to run inference directly or via OpenAI-compatible APIs. AI

IMPACT Facilitates broader adoption and experimentation with the Jiunsong/supergemma4-26b-uncensored-gguf-v2 model across different platforms.
RESEARCH · Hugging Face Trending Models Deutsch(DE) · 1mo · [2 sources]

HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive

HauhauCS has released two new models, Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive and Gemma-4-E4B-Uncensored-HauhauCS-Aggressive, available on Hugging Face. These models are designed for users who want to run them locally or through various inference providers. The releases include detailed instructions for integration with popular tools like llama-cpp-python, llama.cpp, vLLM, Ollama, and Unsloth Studio, facilitating direct use and experimentation. AI

IMPACT Provides new open-source models and integration guides for local AI development.
RESEARCH · Hugging Face Daily Papers English(EN) · 12mo · [85 sources]

Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation

Researchers have developed several new tools and frameworks to improve the efficiency and accuracy of large language model (LLM) operations. Charon and Frontier are simulators designed to predict LLM training and inference performance with high accuracy, aiding in optimization efforts. FT-Dojo provides a benchmark environment for autonomous LLM fine-tuning, while rePIRL offers an inverse RL-inspired framework for learning process reward models. Additionally, PALS focuses on power-aware LLM serving for Mixture-of-Experts models, and LlamaWeb enables memory-efficient LLM inference in web browsers using WebGPU. AI

IMPACT New simulators and frameworks promise more efficient, accurate, and power-aware LLM operations, potentially accelerating research and deployment.
- PagedAttention
- LLMs
- FlashAttention
- Llama-2-7B
- A100 GPU
- Nested WAIT
- LLM
- Asteria
- KVDrive
- Sarathi-Serve
- SCICONVBENCH
- vLLM
- A100
- Orca
- FasterTransformer
- LLaDA2.0-mini
- TIDE
- LLaDA2.0-flash
- POPE benchmark
- DeepSeek-R1-Distill-7B
- V* benchmark
- LLMEval-Logic
- LlamaWeb
- FT-Agent
- Frontier
- WebGPU
- llama.cpp
- arXiv
- rePIRL
- PALS
- Charon
- FT-Dojo