Setting Up Ollama with Docker: Running Local LLMs on Your VPS or Homelab

Setting Up Ollama with Docker: Running Local LLMs on Your VPS or Homelab

We earn commissions when you shop through the links on this page, at no additional cost to you. Learn more.

Running a large language model on your own hardware used to mean wrestling with Python environments, CUDA drivers, and half-broken pip packages. Ollama changed all of that — and wrapping it in Docker makes the whole setup reproducible, portable, and easy to manage alongside your other self-hosted services. In this tutorial I'll walk you through exactly how I set up Ollama in Docker, wire it up with a persistent volume, expose a clean API, and optionally front it with Open WebUI so you have a proper chat interface. Whether you're on a Hetzner VPS with a handful of vCPUs or a homelab box with a consumer GPU, the same Compose file covers you.

Hardware Reality Check

Before you pull any images, be honest about your hardware. Ollama runs models entirely in RAM (or VRAM if you have a GPU). A 7B-parameter model quantised to Q4 needs roughly 4–5 GB of RAM to load and run inference. A 13B model needs about 8–9 GB. If your VPS has only 2 GB of RAM, you'll be limited to tiny models like phi3:mini or gemma2:2b — and even those will be slow on pure CPU.

My personal recommendation for a CPU-only VPS: at least 8 GB RAM and 4 vCPUs. I run llama3.2:3b comfortably on a Hetzner CX32 (4 vCPU / 8 GB) for lightweight tasks. For anything serious — coding assistance, long context — I use a local machine with a GPU. Keep that in mind as you choose your model.

Tip: If you're on a homelab machine with an NVIDIA GPU, Ollama will automatically use it when you pass the right runtime flags to Docker. I'll cover the GPU section below — but the CPU-only path works on any Linux VPS without any driver fuss.

Docker Compose Setup (CPU-Only)

I prefer Docker Compose over bare docker run commands because everything is version-controlled and reproducible. Create a project directory and drop in the following compose.yml:

mkdir -p ~/ollama && cd ~/ollama
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "127.0.0.1:11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    # CPU only — remove the 'deploy' block entirely for VPS usage
    # Uncomment below for NVIDIA GPU homelab machines:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "127.0.0.1:3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama_data:
  open_webui_data:

A few things I want to call out about this Compose file. I bind Ollama's port to 127.0.0.1:11434 rather than 0.0.0.0:11434. This is intentional — you do not want Ollama's API exposed to the public internet without authentication, because the API has no built-in auth. The same applies to Open WebUI on port 3000; we'll expose that through a reverse proxy with HTTPS in a moment. The OLLAMA_HOST=0.0.0.0 environment variable tells Ollama to listen on all interfaces inside the container so Open WebUI can reach it over the Docker network — that's different from the host-level port binding.

Bring the stack up:

docker compose up -d
docker compose logs -f ollama

The first start takes 10–20 seconds as Ollama initialises. Once you see Listening on [::]:11434 in the logs, you're ready to pull a model.

Pulling Your First Model

Ollama's model library lives at ollama.com/library. I usually start with llama3.2:3b for lightweight tasks or mistral:7b-instruct-q4_K_M when I want better reasoning quality. Pull models by execing into the container:

# Pull a small, fast model — good for CPU-only VPS
docker exec -it ollama ollama pull llama3.2:3b

# Pull a quantised 7B model for better quality
docker exec -it ollama ollama pull mistral:7b-instruct-q4_K_M

# List what you've pulled
docker exec -it ollama ollama list

The model files land inside the ollama_data named volume, so they survive container restarts and image upgrades. That's the key reason to use a named volume rather than relying on the container's writable layer — docker compose pull && docker compose up -d won't wipe your downloaded models.

Once a model is downloaded, test inference from the host:

curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "What is the capital of France?", "stream": false}'

You should get a JSON response within a few seconds. On my Hetzner CX32 this takes about 4–6 seconds for a short prompt with llama3.2:3b. That's completely acceptable for personal use.

Enabling GPU Acceleration on a Homelab Machine

If you have an NVIDIA GPU on a homelab server, the payoff is enormous — inference that takes 5 seconds on CPU takes under 0.3 seconds on a decent GPU. You need the NVIDIA Container Toolkit installed on the host first:

# On Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Then uncomment the deploy block in the Compose file above and run docker compose up -d again. Verify GPU access with:

docker exec -it ollama nvidia-smi
Watch out: AMD GPU support in Ollama's Docker image is available via the ollama/ollama:rocm tag, but it requires the ROCm runtime on the host and is significantly more complex to configure. Unless you're comfortable with ROCm, I'd recommend starting with CPU mode and validating the full stack before tackling GPU drivers.

Exposing Open WebUI Through Caddy

I prefer Caddy for reverse proxying because it handles Let's Encrypt certificates automatically and the config is dead simple. If you already have Caddy running on your VPS, add a block to your Caddyfile:

ai.yourdomain.com {
    reverse_proxy localhost:3000
}

That's literally it. Caddy fetches the TLS certificate, handles renewal, and proxies traffic to Open WebUI. Reload with sudo systemctl reload caddy and your chat interface is live at https://ai.yourdomain.com. On first load Open WebUI asks you to create an admin account — do that immediately before anyone else reaches the URL.

If you want to expose the Ollama API itself (for integrations like Continue.dev in VS Code or n8n workflows), I strongly recommend putting it behind basic auth or restricting it to Tailscale only. Never expose port 11434 directly to the internet.

Keeping Models and the Stack Updated

Ollama releases new model versions regularly and the Docker image itself gets updates. I handle image updates with Watchtower, but I update models manually so I can test them before switching my default. A simple cron job or a manual habit works fine:

# Update the Ollama image and restart the stack
docker compose pull
docker compose up -d

# Update a specific model
docker exec -it ollama ollama pull llama3.2:3b

# Remove old model versions you no longer use
docker exec -it ollama ollama rm mistral:7b-instruct-q4_K_M

Model files can be large — a quantised 7B model is 4–5 GB. Keep an eye on your volume disk usage with docker system df -v and prune old models you're not using. On a budget VPS with a 40 GB disk, this matters.

Connecting External Tools to Your Ollama API

One of the best things about Ollama's API is that it's OpenAI-compatible. Tools that support a custom OpenAI base URL — like Continue.dev, Fabric, or n8n's AI nodes — can point straight at your Ollama instance. If you're on a VPS, expose the API through Caddy with basic auth and use https://ai.yourdomain.com/api as your base URL. If you're on a homelab, use Tailscale and connect directly to http://homelab-hostname:11434 over the private network without exposing anything publicly.

Tip: Open WebUI supports multiple Ollama backends. If you have both a VPS and a local GPU machine, you can add both as backends in Open WebUI's settings and route different models to different machines — lightweight queries to the VPS, heavy inference to the local GPU.

Wrapping Up

This Docker Compose setup gives you a production-ready Ollama deployment in under 15 minutes. The named volumes protect your model files across upgrades, the localhost port bindings keep the API off the public internet by default, and Open WebUI gives you a polished chat interface without any extra configuration. From here, I'd suggest exploring two things: adding Tailscale to your stack so you can access the API from anywhere without exposing it publicly, and trying a coding-specific model like deepseek-coder-v2:16b if your hardware can handle it — it's genuinely useful as a local coding assistant that never phones home.

Discussion

```