Running local LLMs on a consumer GPU
You do not need a data center GPU to run useful language models locally. A consumer graphics card with 8-24GB of VRAM can run models that are genuinely capable for coding, writing, and analysis tasks. I have been running models on my own hardware for a while now, and the landscape has improved dramatically even in the last year. Here is what I have learned.
Why run local
Three reasons keep coming up:
Privacy. Your data never leaves your machine. For proprietary codebases, internal documents, or anything you would not paste into a cloud service, local inference is the only option that does not require trusting a third party.
Cost. If you are making more than a few hundred API calls a day, local inference pays for itself quickly. Teams spending $400+/month on API calls recover hardware costs in 6-8 weeks.
Speed and availability. No rate limits, no API outages, no latency from network round trips. A local 8B model on a decent GPU generates 80-130 tokens per second, which is faster than most API responses.
VRAM is the bottleneck
The single most important spec for local LLM inference is GPU memory (VRAM). For fast inference, the model must fit entirely in VRAM; if it does not, layers get offloaded to system RAM and throughput drops from 50-100 tokens/second to 2-5 tokens/second. Fit the model in VRAM before worrying about anything else.
The rough math is simple: at 4-bit quantization (the standard for consumer hardware), you need about 0.5 bytes per parameter, plus overhead for the KV cache that grows with context length. A 7B model needs about 4-5GB. A 32B model needs about 18-20GB. A 70B model needs about 40GB.
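That rule of thumb is easy to encode. A minimal sketch of the estimate (the flat overhead allowances are my own rough assumption; real KV cache size depends on context length and model architecture):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float = 4.0,
                     overhead_gb: float = 1.0) -> float:
    """Rough VRAM needed for a quantized model: weight bytes plus a
    flat allowance for the KV cache and runtime buffers."""
    weights_gb = params_billion * bits_per_weight / 8  # billions of params -> GB
    return weights_gb + overhead_gb

for size, overhead in [(7, 1.0), (32, 2.0), (70, 3.0)]:
    print(f"{size}B at 4-bit: ~{estimate_vram_gb(size, overhead_gb=overhead):.1f} GB")
```

This reproduces the figures above: roughly 4.5GB for 7B, 18GB for 32B, and 38GB for 70B.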
Here is what common consumer GPUs can handle:
| VRAM | GPUs | What fits (Q4_K_M) |
|---|---|---|
| 8GB | RTX 4060 Ti, RTX 3060 Ti | 7-8B models comfortably |
| 12GB | RTX 3060 12GB, RTX 4070 | 8-14B models |
| 16GB | RTX 5080, RTX 4080, Arc A770 | Up to ~20B models |
| 24GB | RTX 4090, RTX 3090, RX 7900 XTX | Up to ~32B models, the sweet spot |
| 32GB | RTX 5090 | 32B at higher quality, 70B with aggressive quantization |
Quantization: trading precision for size
Full-precision models (16-bit weights) are too large for consumer hardware. A 7B model at full precision needs 14GB of VRAM. Quantization reduces the precision of the weights, dramatically shrinking the model with a modest quality loss.
The standard format for consumer hardware is GGUF, used by llama.cpp and Ollama. GGUF quantization uses a "K-quant" system where critical weights get higher precision and less important weights get lower precision. This is why Q4_K_M beats plain Q4_0 at similar file sizes.
The quantization levels that matter:
| Format | Description | Quality loss | When to use |
|---|---|---|---|
| Q8_0 | 8-bit | Negligible | When you have VRAM to spare |
| Q6_K | 6-bit | Minimal | Premium quality at reasonable size |
| Q5_K_M | 5-bit | Near imperceptible | Good balance if it fits |
| Q4_K_M | 4-bit (medium) | ~5-10% | The default recommendation |
| Q3_K_S | 3-bit | Noticeable | Squeezing into tight VRAM |
| Q2_K | 2-bit | Significant | Only if desperate |
For most use cases, Q4_K_M is the sweet spot. You lose maybe 5-10% of the model's capability compared to full precision, which is barely noticeable in practice. If you have the VRAM for Q5_K_M or Q6_K, go for it, but the jump from Q4 to Q5 is smaller than the jump from Q3 to Q4.
Other quantization formats
GGUF is not the only option, but it is the only one that supports CPU/GPU split inference (partially loading the model into system RAM when it does not fully fit in VRAM). If your model fits entirely in VRAM:
- AWQ is the fastest for production serving, especially with the Marlin kernel in vLLM.
- EXL2 (ExLlamaV2) lets you target an exact VRAM budget and maximize quality within it.
- GPTQ is mature with a huge library of pre-quantized models, and it is the only 4-bit format with LoRA adapter support.
But for most people running models locally with Ollama, GGUF is the right choice.
Which models to run
The model landscape moves fast. Here is what is worth running as of early 2026, organized by the GPU class you are working with.
If you have 8GB VRAM
- Qwen 3.5 9B is the current best at this size class. Strong reasoning, 262K native context, multilingual.
- Llama 3.1 8B remains an excellent all-rounder.
- Phi-4-mini-reasoning (3.8B) punches way above its weight on math and reasoning tasks.
If you have 16-24GB VRAM
This is where it gets interesting. The 27-32B model class has emerged as the sweet spot for consumer hardware: these models fit on a 24GB GPU at Q4, and their quality approaches that of 70B models.
- Qwen 3.5 27B (~17GB at Q4) is arguably the strongest open model in this size range. 262K context, multimodal capabilities.
- Qwen 3 32B (~20GB at Q4) outperforms Qwen 2.5 72B on reasoning and coding despite being less than half the size.
- DeepSeek R1 Distilled 32B (~20GB at Q4) carries DeepSeek's reasoning capabilities into a size that fits consumer hardware.
- Gemma 3 27B fits on a 24GB card with quantization and ranked top-10 on LM Arena.
- Mistral Small 3.1 (24B) (~14GB at Q4) is a solid multimodal model with 128K context and Apache 2.0 license.
If you have 24GB+ and want the biggest models
- Llama 3.3 70B at Q4 needs about 40GB, so it requires either a 48GB GPU, two 24GB GPUs with tensor splitting, or heavy CPU offloading (64GB+ system RAM). Performance with offloading is 5-15 tokens/second, which is usable for batch tasks but not great for interactive use.
- DeepSeek R1 Distilled 70B is in the same boat. Excellent reasoning but needs serious hardware.
For coding specifically
- Codestral (22B) from Mistral is purpose-built for code. Fill-in-the-middle support, code correction, test generation. About 13-14GB at Q4.
- DeepSeek Coder V2 remains strong for programming tasks.
- Qwen 3 32B is excellent at code despite being a general-purpose model.
A note on Mixture-of-Experts models
Models like Llama 4 Scout (109B total, 17B active) and DeepSeek V3/R1 (671B total) use a Mixture-of-Experts architecture where only a fraction of parameters are active per token. This sounds great for VRAM, but there is a catch: you still need to load ALL parameters into memory. Scout's "17B active" is misleading for VRAM planning. You need room for the full 109B, which is about 61GB at Q4. These models are not practical on consumer hardware unless you use extreme quantization (1.78-bit Scout reportedly fits on a single 24GB card at ~20 tokens/second, but the quality loss is substantial).
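The arithmetic behind that claim, as a sketch (the ~4.5 effective bits per weight for a Q4-class quant is an assumption):

```python
def moe_vram_gb(total_params_billion: float, bits_per_weight: float = 4.5) -> float:
    """For Mixture-of-Experts models, every expert must be resident in
    memory, so VRAM scales with TOTAL parameters, not active ones."""
    return total_params_billion * bits_per_weight / 8

print(f"Llama 4 Scout (109B total): ~{moe_vram_gb(109):.0f} GB at Q4")
```

The "17B active" figure affects compute per token, not memory footprint, which is why the result lands around 61GB.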
The GPU landscape
NVIDIA
The used RTX 3090 is the consensus best-value GPU for local LLM work in 2026. With 24GB of VRAM at $700-900 on the used market, nothing else comes close on a price-per-VRAM basis. Performance is still excellent: 80-110 tokens/second on 8B models at Q4.
The RTX 4090 (24GB, ~$1,500-2,200 used) is faster but the 3090 is better value unless you need the extra speed. The RTX 5090 (32GB, MSRP $1,999 but street price $2,800+) is the performance king at 213 tokens/second on 8B models, and that extra 8GB over the 4090 lets you run 32B models at higher quantization levels.
On the budget end, the RTX 5080 (16GB, ~$1,050) and RTX 4080 (16GB, $700-900 used) are fine for up to 14-20B models but the 16GB limit feels tight with the current generation of models.
AMD
AMD's ROCm software stack has matured significantly. As of 2026, it actually works for local LLM inference. The RX 7900 XTX (24GB, competitive pricing) runs Ollama, PyTorch, and vLLM without the workarounds that used to be required. The newer RX 9070 XT (16GB) is officially supported in ROCm 7.2.
The ecosystem is still smaller than CUDA's. You will occasionally hit edge cases where something does not work as smoothly. But for straightforward Ollama use, AMD GPUs are a viable option, especially if you want 24GB of VRAM at a lower price than NVIDIA.
Apple Silicon
Apple Silicon is surprisingly competitive for single-user inference. The unified memory architecture means system RAM is GPU RAM, so a Mac with 64GB of unified memory can run models that would not fit on any single consumer GPU.
The tradeoff is bandwidth. For LLM inference, memory bandwidth matters more than raw compute because token generation is memory-bandwidth-bound (reading the model weights for every token). An M4 Max (546 GB/s) generates tokens noticeably faster than an M4 Pro (273 GB/s) despite having the same chip architecture, purely because of bandwidth.
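A back-of-the-envelope way to see this: generating a token requires reading every weight once, so memory bandwidth divided by model size gives a ceiling on decode speed. A sketch (the ~4.5GB model size assumes an 8B model at Q4):

```python
def decode_speed_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/second for memory-bandwidth-bound decoding:
    each generated token requires one full pass over the weights."""
    return bandwidth_gb_s / model_gb

print(f"M4 Pro (273 GB/s): ~{decode_speed_ceiling(273, 4.5):.0f} tok/s ceiling")
print(f"M4 Max (546 GB/s): ~{decode_speed_ceiling(546, 4.5):.0f} tok/s ceiling")
```

Double the bandwidth, double the ceiling, which is exactly the Pro-versus-Max gap described above.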
Key specs:
- M4 Pro (up to 48GB): Good for up to 32B models at Q4.
- M4 Max (up to 128GB): Can run 70B Q4 models entirely in memory. This is one of the few consumer machines that can do this without offloading.
- M4 Ultra (up to 192GB): Handles 70B at higher quantization or even larger models.
Use MLX instead of Ollama on Mac. Apple's MLX framework is optimized for the unified memory architecture and runs 20-30% faster than llama.cpp and up to 50% faster than Ollama on Apple Silicon.
The software stack
Ollama (recommended starting point)
Ollama is "Docker for LLMs." A single binary that handles model downloading, quantization, and serving with an OpenAI-compatible API. It wraps llama.cpp under the hood.
```shell
# Install and run a model
ollama pull qwen3:32b
ollama run qwen3:32b

# Or pull and run in one command
ollama run llama3.1
```

Ollama auto-detects your hardware and handles GPU/CPU allocation. It supports macOS, Linux, and Windows. For most people getting started with local inference, Ollama is the right choice.
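Because Ollama serves an OpenAI-compatible API (on port 11434 by default), any HTTP client can talk to it. A minimal sketch using only the standard library; the model name assumes you have already pulled qwen3:32b:

```python
import json
import urllib.request

def chat(prompt: str, model: str = "qwen3:32b",
         url: str = "http://localhost:11434/v1/chat/completions") -> str:
    """Send one chat turn to a running Ollama server and return the reply."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    req = urllib.request.Request(url, data=json.dumps(payload).encode("utf-8"),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the server running: print(chat("Explain the KV cache in one sentence."))
```

Because the endpoint mimics OpenAI's, existing OpenAI client libraries also work by pointing their base URL at localhost.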
llama.cpp
The engine that powers most of the local LLM ecosystem. Pure C/C++, no dependencies. Ollama and LM Studio both use it under the hood. Use llama.cpp directly when you need granular control: custom context lengths, tensor splitting across multiple GPUs, specific quantization settings, or speculative decoding.
```shell
# Multi-GPU inference
./llama-cli -m model.gguf --tensor-split 50,50 -ngl 99

# Custom context length
./llama-cli -m model.gguf -c 8192
```

vLLM (for serving)
If you need to serve a model to multiple users concurrently, vLLM is the right tool. Its PagedAttention mechanism reduces memory fragmentation by 40%+, and at 128 concurrent users it handles 793 tokens/second versus Ollama's 41 tokens/second. That is a 19x difference.
vLLM is overkill for personal use. But if you are building a product or serving a team, it is worth the setup complexity.
LM Studio
A desktop GUI for downloading, configuring, and running models. If you want to try local models without touching a command line, LM Studio is the easiest entry point. It has a headless server mode for API access too.
Performance tuning
Close other GPU applications. Your browser, video player, and display all use some VRAM. Closing unnecessary applications frees up memory for the model.
Use the right context length. Longer context means more VRAM for the KV cache. If you do not need 128K context, set a smaller window:
```shell
ollama run llama3.1
# then, inside the interactive session:
>>> /set parameter num_ctx 4096
```

Monitor with nvidia-smi:

```shell
watch -n 1 nvidia-smi
```

This shows VRAM usage, GPU utilization, and temperature in real time. If VRAM usage is near the limit, the model may be partially offloading to CPU, which tanks performance.
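The savings from a smaller window are easy to quantify: per token, the KV cache stores one key and one value vector for every layer. The sketch below plugs in Llama 3.1 8B's architecture as an assumed example (32 layers, 8 KV heads of dimension 128, 16-bit cache):

```python
def kv_cache_gb(context_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (key + value) vectors per layer per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_len * per_token_bytes / 1024**3

print(f"4K context:   {kv_cache_gb(4096):.2f} GB")
print(f"128K context: {kv_cache_gb(131072):.1f} GB")
```

Under these assumptions the cache grows from 0.5GB at 4K context to 16GB at 128K, which is why an oversized window can single-handedly push a model out of VRAM.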
Speculative decoding is a newer technique where a small "draft" model predicts multiple tokens, then the full model verifies them in a single pass. When the draft model predicts correctly (which it does surprisingly often for common patterns), you get 2-3x speedup. This is built into vLLM and llama.cpp (pass a draft model with `--model-draft`/`-md`) but not yet exposed in Ollama.
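A simplified model of where the speedup comes from: if the draft proposes k tokens and each is accepted independently with probability p, the expected tokens emitted per full-model pass is a geometric series (the i=0 term is the token the target model always produces itself). This ignores the sampling corrections real implementations apply, and p=0.7 is an assumed figure:

```python
def expected_tokens_per_pass(p: float, k: int) -> float:
    """Expected tokens per target-model pass with a k-token draft,
    each draft token accepted independently with probability p."""
    return sum(p**i for i in range(k + 1))

print(f"p=0.7, k=4: ~{expected_tokens_per_pass(0.7, 4):.1f} tokens per pass")
```

With a 70% acceptance rate and four drafted tokens, each pass yields about 2.8 tokens instead of 1, consistent with the 2-3x speedups reported above.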
Power and cost
Running a GPU for inference is not free. An RTX 3090 at 300W for 8 hours a day costs about $9-12/month in electricity (US average rates). An RTX 5090 at 450W is more like $14-18/month. Apple Silicon is dramatically cheaper: an M4 Max at 50W is $1.50-2/month.
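Those figures come from straightforward arithmetic (the $0.15/kWh default is an assumed US-average rate):

```python
def monthly_power_cost(watts: float, hours_per_day: float,
                       usd_per_kwh: float = 0.15) -> float:
    """Electricity cost of running a GPU, in USD per 30-day month."""
    kwh = watts / 1000 * hours_per_day * 30
    return kwh * usd_per_kwh

for name, w in [("RTX 3090", 300), ("RTX 5090", 450), ("M4 Max", 50)]:
    print(f"{name} at {w}W, 8h/day: ${monthly_power_cost(w, 8):.2f}/month")
```

At that rate the 3090 lands around $10.80/month and the M4 Max under $2, in line with the ranges above.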
Compare that to API costs. At $0.002-0.075 per 1K tokens depending on the model, teams processing more than ~2M tokens/day break even on hardware within weeks. For high-volume use cases, local inference is significantly cheaper over time. For occasional use, the API is still more economical.