Setting Up Ollama Locally: Running LLMs on Your Homelab Hardware

I've been running Ollama on my homelab for six months now, and I honestly can't imagine going back to relying on OpenAI or Claude for every inference. The moment I realized I could run mistral or llama2 locally—without paying per token—everything changed. This guide walks you through exactly what I learned: hardware requirements, installation on bare metal and Docker, model selection, and real gotchas that aren't in the official docs.

Why Ollama? Why Not Just Use ChatGPT?

The practical answer: cost and privacy. If you're running a homelab, you probably have spare CPU or GPU sitting around, and even a $40/year VPS from RackNerd (around their entry-level pricing) is cheap next to $20/month for Claude Pro or metered API tokens. More importantly, when you run Ollama locally, your prompts never leave your network. No data collection, no terms-of-service questions about training data, no rate limits kicking in at midnight.

I use Ollama for document summarization, code review, creative writing, and local chatbot interfaces. For anything where I need privacy-first inference, Ollama is the answer.

Hardware Reality Check

Let's be honest: you don't need a monster rig, but you can't run Llama 2 70B on a Raspberry Pi 4. I've tested this on three different machines, and here's what actually works:

CPU-only (Intel/AMD x86-64): Mistral 7B runs fine on 8 cores and 16GB RAM. Expect 4–8 tokens per second. Not fast, but perfectly usable for background tasks. This is what I run on my old Dell Optiplex.

GPU-accelerated (NVIDIA): Even a used GTX 1070 (8GB VRAM) pushes throughput to 50–100 tokens/second. If you have an RTX 3060, 4090, or A100, you're golden. Ollama uses NVIDIA CUDA automatically if the drivers are installed.

Apple Silicon (M1/M2/M3): Metal acceleration is fast. I tested on an M2 Mac Mini, and Mistral 7B hits 15–20 tokens/second. ARM architecture is supported natively.

Tip: Start with a smaller model like Mistral 7B or Neural Chat 7B. They're fast enough for most tasks and fit on 8GB of RAM. You can always pull down a bigger model later.
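A rough rule of thumb for whether a model fits: a quantized model needs about params × bits ÷ 8 bytes on disk and in memory, plus some headroom. The 20% overhead factor below is my own fudge for embeddings, quantization scales, and KV-cache room, not an official number:

```python
def approx_model_gb(params_billion: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Rough footprint of a quantized model: params * bits/8 bytes,
    plus ~20% headroom (my estimate, not an official figure)."""
    return params_billion * 1e9 * bits / 8 * overhead / 1e9

# A 4-bit 7B model lands around 4 GB, in line with Mistral 7B's download size
print(round(approx_model_gb(7, bits=4), 1))  # 4.2
```

The same math says a 4-bit 13B model wants roughly 8GB and Orca Mini 3B under 2GB, which matches the sizes listed later in this guide.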

Installation: Bare Metal Ubuntu

I prefer installing Ollama directly on a dedicated Linux box rather than containerizing it, mainly because GPU passthrough adds complexity. Here's the straightforward approach:

# Download the install script
curl -fsSL https://ollama.ai/install.sh | sh

# Verify installation
ollama --version

# Start the Ollama daemon (runs on port 11434)
ollama serve

# In another terminal, pull your first model
ollama pull mistral

# Run an interactive chat session
ollama run mistral

The ollama serve command starts the API server. By default, it listens on 127.0.0.1:11434—localhost only. If you want to expose it to other machines on your LAN, bind it differently:

OLLAMA_HOST=0.0.0.0:11434 ollama serve

For systemd integration, Ollama installs a service automatically. To check logs or restart:

sudo systemctl status ollama
sudo systemctl restart ollama
journalctl -u ollama -n 50 --no-pager
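One gotcha when the service is managed by systemd: environment variables exported in your shell (like the OLLAMA_HOST trick above) don't reach the daemon. They have to be set in the unit itself via a drop-in override, created with sudo systemctl edit ollama; the file below is what that override would contain, assuming the stock ollama.service unit name the installer creates:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
```

Then reload and restart: sudo systemctl daemon-reload && sudo systemctl restart ollama.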

Docker Compose Recipe

I also run Ollama in Docker for easier reproducibility and isolation from the host. Here's my production-grade docker-compose.yml:

version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    environment:
      # Optional: set model cache directory
      - OLLAMA_MODELS=/models
    volumes:
      # Persist model cache across restarts
      - ollama_models:/models
      # Optional: mount local model files
      - ./models:/models/custom
    restart: unless-stopped
    # For GPU support (NVIDIA only), uncomment:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]

  # Optional: Open WebUI for a ChatGPT-like interface
  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    ports:
      - "8080:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - webui_data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_models:
  webui_data:

Deploy with:

docker compose up -d
docker compose logs -f ollama

# Pull a model from inside the container
docker exec ollama ollama pull mistral
docker exec ollama ollama pull neural-chat

Models download into the ollama_models Docker volume, so they persist between container restarts. The first pull takes time (depending on model size and your bandwidth), but subsequent restarts are instant.

Watch out: If you enable GPU passthrough in Docker, make sure the nvidia-container-toolkit is installed on the host. Without it, the container falls back to CPU inference and runs slowly. Verify with docker run --rm --gpus all nvidia/cuda:12.0.0-runtime-ubuntu22.04 nvidia-smi (note that the CUDA image tags include the patch version).

Choosing Models: Size vs. Speed vs. Quality

The Ollama library is massive. I've tested dozens of models, and here's what I actually use:

Mistral 7B: My daily driver. Fast, surprisingly smart, lightweight. Good balance. ~4GB on disk.

Neural Chat 7B: Fine-tuned for conversation. Feels more natural than raw Mistral. ~4.1GB.

Llama 2 13B: When I need more capability. Slower than 7B, but noticeably better at reasoning. ~7.4GB.

Orca Mini 3B: Super lightweight, surprisingly usable. If you have only 4GB RAM, this works. ~1.9GB.

Dolphin Mixtral 8x7B: Beast of a model. Excellent at coding tasks. ~46GB. Requires GPU or you'll wait forever.

Pull any model with ollama pull modelname. List what's installed:

ollama list

Delete old models to free space:

ollama rm llama2
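For scripting, the same inventory is available from the API: GET /api/tags returns the installed models with their sizes in bytes. A small sketch, assuming that response shape (the sample sizes below are invented):

```python
def disk_usage_gb(tags_response: dict) -> dict:
    """Map model name -> size in GB from an /api/tags style response."""
    return {m["name"]: round(m["size"] / 1e9, 1) for m in tags_response["models"]}

# Sample payload in the shape /api/tags returns (sizes invented)
sample = {"models": [
    {"name": "mistral:latest", "size": 4_100_000_000},
    {"name": "orca-mini:3b", "size": 1_900_000_000},
]}
print(disk_usage_gb(sample))  # {'mistral:latest': 4.1, 'orca-mini:3b': 1.9}
```

Point it at curl http://localhost:11434/api/tags output to see what your actual model cache is costing you.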

Exposing Ollama Safely to Your Network

Running Ollama on 0.0.0.0:11434 works, but it's unauthenticated: anyone on your LAN can send inference requests (and drain your resources). Instead, I leave Ollama bound to localhost and put Caddy in front of it as a reverse proxy with a simple password. Caddy listens on a separate port (8081 here), since 11434 is already taken by Ollama on the same host:

http://ollama.home:8081 {
  reverse_proxy localhost:11434
  basicauth {
    user $2a$14$...hashedpassword...
  }
}

(Note there's no path after basicauth; in Caddy v2 a bare directive applies to all requests, while basicauth / would only protect the exact path "/".) Generate the bcrypt hash with:

caddy hash-password --plaintext yourpassword

Now requests to http://ollama.home:8081 require HTTP Basic Auth. Open WebUI connects through this same proxy using environment variables.
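For scripted clients behind the proxy, curl handles auth with -u user:password; if you're building requests by hand, the header is just base64 of user:password. A minimal sketch (the credentials below are placeholders):

```python
import base64

def basic_auth_header(user: str, password: str) -> dict:
    """Build the HTTP Basic Auth header: base64('user:password')."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

print(basic_auth_header("user", "yourpassword"))
```

Pass the returned dict as extra headers on any request to the proxied endpoint.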

Integration: Open WebUI + Ollama

The Docker Compose recipe above includes Open WebUI, which gives you a ChatGPT-like interface pointing to your local Ollama instance. After running docker compose up -d, visit http://localhost:8080. It auto-discovers your running Ollama instance.

I prefer Open WebUI for day-to-day chat, but for programmatic access, the Ollama API is dead simple. Generate text with a single POST:

curl http://localhost:11434/api/generate \
  -d '{"model":"mistral","prompt":"Why is the sky blue?","stream":false}'

Or keep streaming enabled to get tokens as they're generated (useful for real-time UIs).
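With streaming enabled, the response arrives as one JSON object per line, each carrying a "response" token fragment, with a final record marked "done": true. A minimal parser for that line-delimited format (the sample fragments below are invented):

```python
import json

def collect_stream(lines):
    """Concatenate token fragments from Ollama's line-delimited JSON stream."""
    out = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        out.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(out)

# Simulated stream in the documented shape (content invented)
stream = [
    '{"response":"The sky ","done":false}',
    '{"response":"is blue.","done":false}',
    '{"response":"","done":true,"total_duration":123}',
]
print(collect_stream(stream))  # The sky is blue.
```

Against a live server you'd feed it the response body line by line, e.g. iterating over a streaming HTTP response instead of the hardcoded list.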

Resource Monitoring

Ollama can eat CPU or GPU quickly. Monitor it with:

# CPU and memory usage
top -p $(pgrep -f "ollama serve")

# GPU usage (NVIDIA)
nvidia-smi
watch -n 1 nvidia-smi

# System temperature
sensors

I set up Prometheus to scrape Ollama and graph the results in Grafana. Ollama doesn't expose a native Prometheus /metrics endpoint, so I scrape its HTTP API (for example, /api/ps for currently loaded models) with a small custom exporter; worth it if you're running multiple models in production.

Performance Tuning

A few tweaks I've found helpful:

Set context window: By default, Ollama uses a 2048-token context. For document summarization, I increase it per request via the API's options field:

curl http://localhost:11434/api/generate \
  -d '{"model":"mistral","prompt":"...","options":{"num_ctx":8192}}'

Inside an interactive ollama run session, the equivalent is /set parameter num_ctx 8192.

Adjust thread count: On CPU-only systems, tune parallelism the same way; num_thread is a model parameter rather than an environment variable:

curl http://localhost:11434/api/generate \
  -d '{"model":"mistral","prompt":"...","options":{"num_thread":12}}'

Model quantization: All Ollama models are quantized (usually 4-bit or 5-bit). You can't change quantization level easily, so choose a model variant that fits your hardware.
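To bake parameters like these in permanently instead of passing them on every request, a custom Modelfile works; mistral-longctx is just my name for the variant:

```
FROM mistral
PARAMETER num_ctx 8192
PARAMETER num_thread 12
```

Build and run it with ollama create mistral-longctx -f Modelfile, then ollama run mistral-longctx.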

Next Steps: Where to Go From Here

Now that you have Ollama running, integrate it into your homelab.

The beauty of Ollama is that it's just an API. Anything that can make HTTP requests can use it. I've hooked it into Nextcloud for automated file tagging, into my monitoring stack for anomaly detection, and into custom Python scripts for batch processing. Once you own the inference layer, the possibilities multiply fast.
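As one example of that batch-processing pattern, here's the shape of a script I'd use for summarization over the API; the model name and prompt are placeholders, and the summarize() call assumes a server listening on localhost:11434:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, text: str) -> dict:
    """Non-streaming generate request asking for a one-paragraph summary."""
    return {
        "model": model,
        "prompt": f"Summarize in one paragraph:\n\n{text}",
        "stream": False,
    }

def summarize(text: str, model: str = "mistral") -> str:
    """POST the payload and return the 'response' field of the JSON reply."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Loop summarize() over a directory of text files and you have overnight batch summarization with zero token costs.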
