Setting Up Ollama with Docker: Running Local LLMs on Your VPS or Homelab

Running large language models locally has gone from a niche experiment to a genuine daily-driver workflow for a lot of self-hosters — and Ollama is the reason. It strips away most of the friction: no Python environment hell, no manual CUDA setup, just pull a model and start talking to it. Wrapping Ollama in Docker takes that one step further by giving you a reproducible, restartable, network-isolated service you can actually depend on. In this tutorial I'll walk through a production-grade Docker Compose setup that covers CPU-only servers, Nvidia GPU passthrough, persistent model storage, and a full Open WebUI front end — whether you're running this on a DigitalOcean Droplet, a Hetzner VPS, or an old workstation in your garage.

Why Docker Instead of a Bare Install?

I've done both. The bare-metal Ollama install takes about 90 seconds and does work, but the moment you start running other services on the same machine you end up with port conflicts, no automatic restart on reboot, and no clean way to pin a specific Ollama version for reproducibility. Docker fixes all of that. You also get a named volume for model storage, which means your 30 GB of downloaded models survive a container rebuild — something that will save you a lot of frustration the first time you accidentally run docker compose down -v.

I prefer Caddy as a reverse proxy in front of everything, but for this tutorial I'll keep the Compose file self-contained so you can drop it into whatever proxy setup you already have. If you're starting fresh, a DigitalOcean Droplet with 4 vCPUs and 8 GB RAM handles Llama 3.2 3B and Mistral 7B at Q4 quantization without breaking a sweat. For anything 13B or larger, I'd look at a GPU-equipped bare-metal machine or a Hetzner Dedicated server.

Prerequisites

Tip: If you haven't installed Docker yet, run curl -fsSL https://get.docker.com | sh and then sudo usermod -aG docker $USER followed by a logout/login. This gets you the latest Engine and Compose plugin in one shot without the outdated distro packages.

Installing the Nvidia Container Toolkit (GPU Only)

Skip this section entirely if you're on a CPU-only server. For GPU hosts, this is the step most tutorials gloss over and where things break. After confirming nvidia-smi shows your card, run:

# Add the NVIDIA Container Toolkit repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure the Docker daemon to use the NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify the toolkit can see your GPU inside a container
docker run --rm --gpus all nvidia/cuda:12.3.1-base-ubuntu22.04 nvidia-smi

If that last command prints your GPU model and driver version, you're good. If it errors with "could not select device driver", you likely have a driver mismatch — check that your host driver version is ≥ 525 with nvidia-smi outside Docker first.
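If you want to check the driver version without eyeballing the full nvidia-smi table, the query flags below print just the number. A small sketch of that check, comparing the major version against the 525 floor:

```shell
# Print only the driver version, e.g. "535.183.01"
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Extract the major version and compare it against the 525 floor
major=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | cut -d. -f1)
if [ "$major" -lt 525 ]; then
  echo "Driver $major is too old; upgrade the host driver before continuing."
fi
```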

The Docker Compose File

Here's my complete setup. I've included both a CPU-only service definition and a GPU override, with a comment showing which lines to add or remove. Open WebUI runs on port 3000 and talks to Ollama over the internal Docker network — Ollama itself is never exposed to the public internet in this config.

# docker-compose.yml
# Place this in /opt/ollama/ and run: docker compose up -d

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "127.0.0.1:11434:11434"   # Bind only to loopback — never expose this publicly
    environment:
      - OLLAMA_KEEP_ALIVE=24h      # Keep loaded models in VRAM/RAM for 24 hours
      - OLLAMA_MAX_LOADED_MODELS=2
    # --- GPU SUPPORT: uncomment the seven lines below for Nvidia GPU passthrough ---
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    depends_on:
      - ollama
    ports:
      - "127.0.0.1:3000:8080"     # Proxy this with Caddy or Nginx — don't expose raw
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_SECRET_KEY=change-this-to-a-random-string-before-deploying

volumes:
  ollama_data:
  open_webui_data:

A few things worth calling out. I bind Ollama to 127.0.0.1:11434 rather than 0.0.0.0:11434 — Ollama has no built-in authentication, so binding it to all interfaces on a VPS would expose your inference API to the entire internet. Open WebUI is also bound to loopback, and you should put it behind a reverse proxy with authentication before opening it to any public URL. The OLLAMA_KEEP_ALIVE=24h env var is something I added after getting frustrated with Llama 3 being unloaded from RAM between every request — set it to suit your available memory.
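A quick way to confirm the loopback binding is doing its job is to hit Ollama's /api/tags endpoint (which lists pulled models) from the server itself, then repeat the request from a different machine. YOUR_SERVER_IP below is a placeholder for your VPS address:

```shell
# On the VPS itself: should return JSON listing your pulled models
curl -s http://127.0.0.1:11434/api/tags

# From any OTHER machine: should time out or refuse the connection,
# confirming the API is not reachable on the public interface
curl -s --max-time 5 http://YOUR_SERVER_IP:11434/api/tags \
  || echo "connection refused: good, Ollama is not exposed"
```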

Watch out: The default WEBUI_SECRET_KEY placeholder in this Compose file must be replaced with a genuine random string before you deploy. Open WebUI uses it to sign session tokens. Run openssl rand -hex 32 to generate one and paste it into your .env file or directly into the Compose file before the first docker compose up.
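One way to wire that up, assuming a .env file sitting next to docker-compose.yml (Compose reads it automatically):

```shell
# Generate a 64-character hex secret and store it in .env
echo "WEBUI_SECRET_KEY=$(openssl rand -hex 32)" >> .env

# Then replace the literal value in docker-compose.yml with:
#   - WEBUI_SECRET_KEY=${WEBUI_SECRET_KEY}
```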

Starting the Stack and Pulling Your First Model

Bring everything up with:

cd /opt/ollama
docker compose up -d

# Watch the logs to confirm both containers started cleanly
docker compose logs -f

# Pull a model — this downloads into the named volume, not the container layer
docker exec -it ollama ollama pull llama3.2:3b

# Or pull a larger quantized model if you have the RAM
docker exec -it ollama ollama pull mistral:7b-instruct-q4_K_M

# Test inference directly from the CLI before touching the web UI
docker exec -it ollama ollama run llama3.2:3b "Explain Docker volumes in one sentence."

# List all downloaded models
docker exec -it ollama ollama list

The llama3.2:3b model is about 2 GB and runs fine on a 4 GB RAM VPS. mistral:7b-instruct-q4_K_M sits around 4.1 GB and needs at least 8 GB of system RAM to run without thrashing swap. Once the model is pulled, Open WebUI at http://localhost:3000 (or whatever domain you proxy it to) will automatically discover it — no manual configuration needed because of the OLLAMA_BASE_URL env var we set.
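Open WebUI isn't the only client, either: the same loopback port serves Ollama's REST API directly. A minimal non-streaming request against the standard /api/generate endpoint looks like this:

```shell
# Non-streaming completion request; returns a single JSON object
# with the generated text in the "response" field
curl -s http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Explain Docker volumes in one sentence.",
  "stream": false
}'
```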

Choosing the Right Hardware (and VPS)

For CPU-only inference, a VPS with 8 GB RAM and 4 cores gets you 7B models at a usable speed — expect roughly 8–12 tokens per second on modern server hardware with Q4 quantization. For 13B models, 16 GB RAM is the practical minimum. I've had good results running this stack on DigitalOcean Droplets — their General Purpose Droplets have fast NVMe storage which matters a lot for the initial model load time.
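Rather than trusting rough estimates, you can measure what your own hardware delivers: ollama run accepts a --verbose flag that prints timing statistics after each response, and the "eval rate" line is the generation speed in tokens per second.

```shell
# Run a prompt and keep only the timing stats; "eval rate" is the
# number that matters for perceived generation speed
docker exec -it ollama ollama run llama3.2:3b --verbose \
  "Summarize what a Docker named volume is." 2>&1 | grep "rate"
```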

If you want GPU acceleration without buying hardware, DigitalOcean's GPU Droplets (H100 and A100 options) work perfectly with this Compose file once you uncomment the GPU block — and the Nvidia Container Toolkit is pre-installed on their GPU images. For homelab use, a second-hand RTX 3090 with 24 GB VRAM is probably the best bang-for-buck option right now for running 13B and 34B models locally.

Keeping Ollama Updated with Watchtower

Ollama ships new model support and performance improvements frequently. I add Watchtower to the same Compose file so both ollama/ollama and Open WebUI stay current without manual intervention:

# Add this service to your docker-compose.yml under services:
  watchtower:
    image: containrrr/watchtower:latest
    container_name: watchtower
    restart: unless-stopped
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - WATCHTOWER_CLEANUP=true          # Remove old images after update
      - WATCHTOWER_POLL_INTERVAL=86400   # Check daily (seconds)
      - WATCHTOWER_INCLUDE_STOPPED=false
    command: ollama open-webui           # Only watch these two containers

Passing the container names to the Watchtower command limits updates to just the AI stack — it won't touch your other services. The WATCHTOWER_CLEANUP=true flag is important; Ollama images are around 2 GB each and old versions add up fast on a small VPS disk.
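If you'd rather trigger updates manually than run Watchtower as a daemon, the same image supports a one-shot mode. A sketch using Watchtower's --run-once flag:

```shell
# Check for and apply updates once, then exit; --cleanup removes
# the superseded images afterwards
docker run --rm \
  -v /var/run/docker.sock:/var/run/docker.sock \
  containrrr/watchtower --run-once --cleanup ollama open-webui
```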

Conclusion

At this point you have a fully containerized Ollama stack with persistent model storage, an optional GPU path, a polished web UI, and automated updates — all bound safely to localhost and ready to put behind a reverse proxy. The next logical steps are to add Caddy or Nginx Proxy Manager in front of Open WebUI with HTTPS plus HTTP basic auth or Authelia for single sign-on, and to set up a nightly backup of the open_webui_data volume so you don't lose your chat history and model configurations. If you want to go deeper on the inference side, experiment with the OLLAMA_NUM_PARALLEL environment variable to handle multiple simultaneous requests — useful if you're sharing the instance with a few people on your network.
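For that nightly backup, a simple pattern is to mount the volume read-only into a throwaway Alpine container and tar it out to the host. Note that Compose prefixes volume names with the project name, so in /opt/ollama the volume is likely called ollama_open_webui_data; confirm the exact name with docker volume ls before relying on this:

```shell
# Archive the Open WebUI data volume to a dated tarball in the
# current directory; adjust the volume name to match docker volume ls
docker run --rm \
  -v ollama_open_webui_data:/data:ro \
  -v "$(pwd)":/backup \
  alpine tar czf "/backup/open-webui-$(date +%F).tar.gz" -C /data .
```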
