Setting Up Ollama Locally: Running Open-Source LLMs on Your Homelab
I've been running Ollama on my homelab for eight months now, and it's fundamentally changed how I think about AI workflows. Instead of sending every prompt to OpenAI or Claude, I can run Mistral, Llama 2, and Qwen models directly on my hardware—no API costs, no data leaving my network, no waiting for rate limits. If you've got a spare laptop, a workstation with a decent GPU, or even a small NAS with some CPU headroom, you can do this too.
This guide walks you through installing Ollama, pulling your first model, and integrating it with Open WebUI so you have a ChatGPT-like interface running entirely on your machines. I'll cover hardware requirements, performance tuning, and how to handle models that are genuinely useful for real work.
What Is Ollama and Why Run It Locally?
Ollama is a command-line tool (with a nice macOS and Windows app) that downloads, installs, and runs large language models. It abstracts away most of the complexity of model quantization, VRAM management, and inference optimization. You tell it "run mistral" and it handles the rest—downloading a 4GB quantized model, loading it into memory, and serving it on localhost:11434.
The appeal for homelab folks is obvious: you own the hardware, you own the model weights, and you own your data. No subscription, no rate limits, no OpenAI knowing what you're building. A $40/year VPS from RackNerd gives you a public endpoint if you need inference reachable from outside your LAN. Most of my local inference stays on my 2021 MacBook Pro or my Ryzen 5000 workstation, but having a cheap public Linux box ready as a fallback has been handy more than once.
Hardware: What You Actually Need
The dirty secret is that most open-source models run fine on CPU. Slow, but fine. A Ryzen 5 5600X or a modern Intel i5 can push tokens at maybe 5–15 tokens/second on quantized models. That's not fast enough for real-time chat, but it's perfectly good for batch processing, code generation, or offline summarization.
GPU is where the magic happens. I strongly prefer NVIDIA cards because CUDA support is mature. An RTX 3060 (12GB VRAM) costs around $200–250 used and runs a 13B model comfortably. AMD is coming along with ROCm support. Apple Silicon (M1/M2/M3 Macs) gets native Metal acceleration and is honestly excellent—better utilization than you'd expect.
For a homelab, I recommend at least 16GB system RAM and 8GB of dedicated VRAM if you want to run two models concurrently or work with anything larger than 7B parameters. My setup runs a Mistral 7B and a Llama 2 13B simultaneously on a machine with 32GB RAM and an RTX 3080.
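If you want a quick sanity check before buying hardware, a rough rule of thumb is: weight memory ≈ parameters × bits-per-weight ÷ 8, plus headroom for the KV cache and runtime. This is a back-of-envelope sketch, not an exact figure (real quantized files carry extra metadata and mixed-precision layers):

```shell
# Back-of-envelope memory estimate for a quantized model:
# params (billions) * bits-per-weight / 8 = weight size in GB,
# plus ~20% headroom for the KV cache and runtime overhead.
estimate() {
  awk -v p="$1" -v bits="$2" \
    'BEGIN { printf "%.1f GB weights, ~%.1f GB with headroom\n",
             p * bits / 8, p * bits / 8 * 1.2 }'
}
estimate 7 4    # 7B model at 4-bit
estimate 13 4   # 13B model at 4-bit
```

By this estimate, a 4-bit 13B model wants roughly 8GB, which is why a 12GB RTX 3060 handles it comfortably while an 8GB card is right at the edge.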
Installation on Linux
Installing Ollama on a Linux server is refreshingly straightforward. The official build supports Ubuntu, Debian, CentOS, and others. Here's the one-liner from the Ollama website, followed by verification:
curl -fsSL https://ollama.ai/install.sh | sh
That script downloads the ollama binary, creates a systemd service, and sets up a dedicated user account. After it finishes, start the service:
sudo systemctl start ollama
sudo systemctl enable ollama
systemctl status ollama
Ollama listens on `localhost:11434` by default. If you want to expose it to your network (useful for Docker services or remote machines), edit the systemd unit:
sudo systemctl edit ollama
Add this to the service section:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Save (Ctrl+X in nano), then restart:
sudo systemctl restart ollama
Test it with a simple HTTP request from another machine:
curl http://your-server-ip:11434/api/tags
That returns JSON listing any models you've pulled. If you see `{"models":[]}`, Ollama is running and ready.
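Once you have models pulled, each entry in that JSON carries a `name` field, so jq makes a tidy inventory. To keep this runnable without a live server, the sample below mimics the shape of an `/api/tags` response (fields trimmed for brevity):

```shell
# Sample shaped like an /api/tags response (real entries include
# size, digest, and modification time as well):
tags='{"models":[{"name":"mistral:latest"},{"name":"llama2:13b"}]}'

# List just the model names, one per line:
echo "$tags" | jq -r '.models[].name'

# Against a live server, pipe curl instead:
#   curl -s http://your-server-ip:11434/api/tags | jq -r '.models[].name'
```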
Pulling and Running Your First Model
Ollama's model library lives at ollama.ai/library. Popular choices:
- mistral — Fast, 7B, great instruction-following
- llama2 — Solid baseline, available in 7B and 13B
- neural-chat — Optimized for conversation
- qwen — Excellent multilingual support, 7B and 14B
- phi — Tiny (2.7B) but surprisingly capable
Pull a model with:
ollama pull mistral
The first pull downloads the model (might be 4GB–13GB depending on quantization). Subsequent pulls are instant since it's cached locally.
Run an interactive chat session:
ollama run mistral
You'll get a prompt. Type your question and press Enter. The model will stream tokens to your terminal. Press Ctrl+D to exit.
For programmatic use, the API is simple:
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Why is self-hosting cool?",
  "stream": false
}'
That returns JSON with the completion. Set `"stream": true` to get newline-delimited JSON for real-time token streaming (useful for web UIs).
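In streaming mode, each line is a standalone JSON object carrying a `response` fragment, so reassembling the completion is a one-line jq job. The sample below simulates two lines of the stream so the snippet runs without a server:

```shell
# Two lines shaped like Ollama's streaming output (real responses carry
# more fields, e.g. timing stats on the final "done" line):
stream='{"response":"Self-hosting "}
{"response":"rules.","done":true}'

# Concatenate the fragments into the full completion:
printf '%s\n' "$stream" | jq -rj '.response'

# Live version (-N disables curl buffering so tokens appear as they arrive):
#   curl -sN http://localhost:11434/api/generate \
#     -d '{"model":"mistral","prompt":"Why is self-hosting cool?"}' \
#     | jq -rj '.response'
```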
Open WebUI: Your Ollama Dashboard
Running models from the CLI is fun but impractical for daily use. Open WebUI is a self-hosted ChatGPT-like interface that connects to Ollama. I deploy it with Docker Compose:
version: '3.8'

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui-data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped

  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - ollama-data:/root/.ollama
    restart: unless-stopped
    # Uncomment for GPU (NVIDIA CUDA)
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]

volumes:
  ollama-data:
  open-webui-data:
Save that as `docker-compose.yml` and run:
docker-compose up -d
Open WebUI appears on port 3000. Create an account, select a model from the dropdown, and start chatting. It's genuinely impressive—syntax highlighting, code execution, file uploads, conversation memory, all running on your hardware.
If Ollama isn't running in a container and you're running it directly on your host, change the `OLLAMA_BASE_URL` to `http://host.docker.internal:11434` (on Docker Desktop) or the actual IP of your machine.
GPU Acceleration: Making It Fast
By default, Ollama falls back to CPU for inference. Adding GPU support depends on your hardware and drivers, but it usually works out of the box once the drivers are installed.
NVIDIA (CUDA): Install nvidia-driver and nvidia-cuda-toolkit, then restart Ollama. The systemd service detects CUDA automatically.
AMD (ROCm): Ollama supports ROCm on Linux. Install the rocm-core package and set:
export ROCM_HOME=/opt/rocm
sudo systemctl restart ollama
Apple Silicon: Metal acceleration is built-in. Just run Ollama and it uses the GPU by default.
You can force CPU-only inference if needed by hiding the GPU from the CUDA runtime before starting the server:
CUDA_VISIBLE_DEVICES="" ollama serve
Alternatively, set the `num_gpu` option (the number of layers offloaded to the GPU) to 0 in an API request or a Modelfile.
Monitor VRAM usage with nvidia-smi while inference runs. You'll see dramatic speed improvements—from 5 tokens/sec on CPU to 50–150 tokens/sec on modern GPUs.
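You don't have to eyeball tokens/sec, either: a non-streaming `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), so jq can compute it. The numbers in the sample below are invented so the snippet runs without a server:

```shell
# Fields trimmed from an /api/generate response; numbers are invented:
resp='{"response":"...","eval_count":100,"eval_duration":2000000000}'

# tokens per second = eval_count / (eval_duration in seconds)
echo "$resp" | jq '.eval_count / (.eval_duration / 1e9)'

# Live version:
#   curl -s http://localhost:11434/api/generate \
#     -d '{"model":"mistral","prompt":"hi","stream":false}' \
#     | jq '.eval_count / (.eval_duration / 1e9)'
```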
Quantization and Model Selection
Ollama pulls quantized models by default (usually 4-bit or 5-bit). These are smaller, faster, and lose minimal quality compared to full-precision weights. You don't need to understand the math—just know that `mistral:latest` is already optimized.
If you want unquantized models or different quantization levels, you can specify them:
ollama pull mistral:7b-instruct-v0.2-q5_K_M
ollama pull mistral:7b-instruct-v0.2-q8_0
Higher-bit quantizations (q8_0) are larger and slower but more accurate. Lower-bit ones (q2_K, q3_K) are faster but degrade quality noticeably. I stick with q5 or q4 for most work.
Networking and Scaling
If you have multiple machines in your homelab, you can run Ollama on a central "inference server" and access it from anywhere on your network. A Ryzen workstation or an old laptop running 24/7 is perfect for this.
Point Open WebUI (or any client) to the remote Ollama instance by setting the API URL in settings. Or, if you're using the Docker Compose setup, deploy Open WebUI on a different machine and point it to your Ollama server's IP.
For internet-facing inference, pair Ollama with a reverse proxy (Caddy or Nginx) and add authentication. A basic example with Caddy (generate the password hash with `caddy hash-password`):
inference.example.com {
    reverse_proxy localhost:11434
    basicauth /api/* {
        user password-hash-here
    }
}
Or route it through a VPN (Tailscale is my preference) so only your devices can access it.
Memory and Swap Management
Running large models consumes RAM. If you don't have enough, enable swap (though it's slow). Check current usage:
free -h
Add 16GB of swap on a Linux server:
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
Swap lets you run 70B models on modest hardware, but it's 10–100x slower than VRAM. Use it as a fallback, not a primary strategy.
Practical Tips from Eight Months of Homelab Inference
Start small. Mistral 7B is my daily driver—fast enough for real-time use, smart enough for complex reasoning. I only pull larger models when I need them.
Use Ollama's context window wisely. Most models support 2K–8K tokens. Longer contexts are slower and less accurate. For document analysis, keep prompts tight.
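When you do need a longer window, you can request one per call rather than globally: Ollama's generate API accepts a `num_ctx` value under `options`. A sketch that builds the payload with jq (send it with the commented curl line against a live server):

```shell
# Build a request that asks for a 4096-token context window:
payload=$(jq -n '{model: "mistral",
                  prompt: "Summarize: ...",
                  stream: false,
                  options: {num_ctx: 4096}}')
echo "$payload"

# Send it to a running server:
#   curl -s http://localhost:11434/api/generate -d "$payload"
```

Remember that a bigger window also means a bigger KV cache, so this trades RAM/VRAM for context.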
Monitor disk space. Even quantized models add up. My `/root/.ollama` directory is 120GB with eight models. Prune unused ones occasionally:
ollama list # See all installed models
ollama rm model-name # Delete
Integrate with your workflow. I have a shell alias that pipes text to Ollama for summarization:
alias ask='curl -s http://localhost:11434/api/generate -d "{\"model\":\"mistral\",\"prompt\":\"$(cat)\",\"stream\":false}" | jq .response'
# Usage: echo "Summar
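One caveat with that alias: `$(cat)` splices raw stdin straight into the JSON string, so any input containing double quotes, backslashes, or newlines produces an invalid payload. A safer sketch is a shell function that lets jq do the escaping (`ask` is kept from the alias above; the model name is just my default):

```shell
# jq -Rs slurps all of stdin as one string and escapes it properly,
# so quotes, backslashes, and newlines in the input can't break the JSON.
ask() {
  jq -Rs '{model: "mistral", prompt: ., stream: false}' \
    | curl -s http://localhost:11434/api/generate -d @- \
    | jq -r '.response'
}
# Usage: echo 'Summarize: he said "hi"' | ask
```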