Setting Up Ollama with Docker for Local LLM Inference on a VPS

CompactHost · June 13, 2026

Running large language models on your own hardware used to mean juggling Python virtual environments, CUDA drivers, and a pile of pip packages that inevitably broke each other. Ollama changed that, and running it inside Docker on a VPS makes the setup even more repeatable and portable. In this tutorial I'll walk through exactly how I containerise Ollama, keep model weights on a persistent volume, and expose the API safely — whether your VPS has a GPU or not.

Why Docker Rather Than a Bare Ollama Install?

I've done it both ways. A bare install on Ubuntu is perfectly fine on a machine you own and rarely reprovision, but on a VPS I reach for Docker because rollbacks are trivial, the NVIDIA Container Toolkit handles GPU passthrough cleanly, and I can version-pin the image so a surprise upstream update doesn't change model behaviour mid-project. Containerising Ollama also makes it straightforward to add a Caddy sidecar for TLS later without touching the Ollama process itself.

The one trade-off: the Docker image for Ollama is large — expect around 1.5 GB for the base image before you pull any models. Make sure your VPS has enough disk. I typically provision at least 40 GB on the root volume and mount a second block device for model storage.

VPS and Hardware Requirements

For CPU-only inference with smaller models (Llama 3.2 3B, Gemma 3 4B, Phi-4 mini) you can get away with a VPS that has 4 vCPUs and 8 GB RAM — something like a Hetzner CX32 or a RackNerd 8 GB KVM node works fine. Inference will be slow (expect 3–8 tokens/second), but it's completely usable for personal tools and automation scripts.
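
A quick way to check a candidate VPS against that baseline is sketched below. The function names are hypothetical, and the 7 GiB floor is an assumption of this sketch: plans sold as "8 GB" usually report slightly less than that to the operating system.

```shell
# Preflight check against the CPU-only baseline above: 4 vCPUs and 8 GB RAM.

# meets_minimum CPUS MEM_KIB -> exit status 0 if the machine clears the bar.
# Assumption: anything at or above 7 GiB counts as an "8 GB" plan.
meets_minimum() {
  local cpus="$1" mem_kib="$2"
  local min_cpus=4 min_mem_kib=$((7 * 1024 * 1024))
  [ "$cpus" -ge "$min_cpus" ] && [ "$mem_kib" -ge "$min_mem_kib" ]
}

# Read the live values from the running system (Linux only).
current_cpus()    { nproc; }
current_mem_kib() { awk '/^MemTotal:/ {print $2}' /proc/meminfo; }

# Usage:
#   if meets_minimum "$(current_cpus)" "$(current_mem_kib)"; then
#     echo "OK for small models on CPU"
#   else
#     echo "Below the 4 vCPU / 8 GB baseline"
#   fi
```

Keeping the thresholds in one function makes it easy to raise them later if you move to larger models.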

If you want faster throughput, look for a VPS with a shared NVIDIA GPU — Hetzner's GX2 line or OVH's T1 instances are reasonable starting points in 2026. The GPU path in this guide uses the NVIDIA Container Toolkit; the CPU-only path skips those steps entirely.

Step 1 — Install Docker and the NVIDIA Container Toolkit

I'll assume you're on Ubuntu 24.04 LTS. First, get Docker from the official repo rather than the snap package — the snap version has had volume permission headaches in the past.

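# Remove any conflicting Docker packages (looped, so one unknown package name doesn't abort the rest)
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do
  sudo apt-get remove -y "$pkg"
done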

# Install prerequisites
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg lsb-release

# Add Docker's GPG key and repository
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
  | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" \
  | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin

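# Allow your user to run Docker without sudo.
# newgrp applies the new group in a subshell right away; other sessions pick it up at next login.
sudo usermod -aG docker "$USER"
newgrp docker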

# --- GPU ONLY: install NVIDIA Container Toolkit ---
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Tip: Skip the NVIDIA Container Toolkit section entirely if your VPS is CPU-only. The Docker Compose file further down has a commented-out GPU section — just leave those lines commented and everything will work over CPU.
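
Before moving on, it can help to confirm that Docker, and the GPU runtime if you installed it, actually work. The sketch below is one way to do that; the function names are illustrative, and it assumes that finding nvidia-smi on the host means you are on the GPU path.

```shell
# Post-install sanity checks (function names are illustrative).

# Succeeds if the named command is on PATH.
have_cmd() { command -v "$1" >/dev/null 2>&1; }

# Prints "gpu" when the NVIDIA driver tools are present on the host, else "cpu".
detect_mode() {
  if have_cmd nvidia-smi; then echo gpu; else echo cpu; fi
}

# Runs the checks that match the detected mode.
verify_install() {
  have_cmd docker || { echo "docker not found on PATH" >&2; return 1; }
  docker run --rm hello-world >/dev/null || return 1
  if [ "$(detect_mode)" = gpu ]; then
    # A plain ubuntu image is enough: the toolkit injects nvidia-smi at runtime.
    docker run --rm --gpus all ubuntu nvidia-smi || return 1
  fi
  echo "install looks healthy ($(detect_mode) mode)"
}

# Usage: verify_install
```

If the GPU check fails while hello-world succeeds, the problem is most likely in the toolkit configuration rather than in Docker itself.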

Step 2 — Create the Docker Compose File

I keep all my Ollama config under /opt/ollama. This is the layout I use — a single compose.yaml and a named volume for models so they survive container rebuilds:

sudo mkdir -p /opt/ollama
sudo chown $USER:$USER /opt/ollama
cd /opt/ollama
cat > compose.yaml <<'EOF'
services:
  ollama:
    image: ollama/ollama:0.6.7          # pin a specific tag in production
    container_name: ollama
    restart: unless-stopped
    ports:
      - "127.0.0.1:11434:11434"         # bind to localhost only — do NOT expose to 0.0.0.0
    volumes:
      - ollama_models:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0             # listen on all interfaces inside the container
      - OLLAMA_KEEP_ALIVE=10m           # unload model from RAM after 10 min of inactivity
      - OLLAMA_NUM_PARALLEL=2           # allow 2 concurrent inference requests
    # --- Uncomment the block below for GPU passthrough ---
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]

volumes:
  ollama_models:
    driver: local
EOF

Watch out: Notice the port binding is 127.0.0.1:11434:11434, not 0.0.0.0:11434:11434. Ollama's API has no built-in authentication. If you bind to all interfaces on a public VPS you will be sharing your inference server — and your model storage — with the entire internet. Always proxy through Caddy or Nginx with authentication before exposing externally.
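
Because a wrong binding is easy to miss, here is a small sketch that checks what Docker actually published for the container. It assumes the container name ollama from the compose file above; the helper names are hypothetical.

```shell
# is_loopback_binding "HOST:PORT" -> status 0 only for an IPv4 or IPv6 loopback bind.
is_loopback_binding() {
  case "$1" in
    127.*:*|"[::1]":*) return 0 ;;
    *) return 1 ;;
  esac
}

# Inspects every published binding for port 11434 on the running container.
check_published_port() {
  bindings=$(docker port ollama 11434/tcp) || return 1
  for b in $bindings; do
    is_loopback_binding "$b" || { echo "publicly bound: $b" >&2; return 1; }
  done
  echo "port 11434 is bound to loopback only"
}

# Usage:
#   docker compose config --quiet   # validate compose.yaml before starting
#   check_published_port            # run after the container is up
```

A binding of 0.0.0.0:11434 or [::]:11434 fails the check, which is the case the warning above is about.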

Step 3 — Start the Container and Pull a Model

With the compose file in place, bringing Ollama up is a single command:

# Start Ollama in detached mode
docker compose up -d

# Watch the logs to confirm it started cleanly (Ctrl+C stops following; the container keeps running)
docker compose logs -f ollama

# Pull a model — Gemma 3 4B is a good starting point for CPU-only VPS
docker exec -it ollama ollama pull gemma3:4b

# Alternatively, pull Llama 3.2 3B for something even lighter
# docker exec -it ollama ollama pull llama3.2:3b

# Test inference from the command line
docker exec -it ollama ollama run gemma3:4b "Explain Docker volumes in one paragraph."

# Test the REST API from the host
curl http://localhost:11434/api/generate \
  -d '{"model":"gemma3:4b","prompt":"What is a VPS?","stream":false}'

When I first set this up on a Hetzner CX32 (4 vCPU, 8 GB RAM), Gemma 3 4B ran at about 6 tokens per second — slow but perfectly fine for a personal AI assistant or API backend. The model files landed in the ollama_models Docker volume at around 2.7 GB on disk.
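
To measure throughput on your own VPS rather than rely on a quoted figure, you can use the timing fields in the /api/generate response: eval_count is the number of generated tokens and eval_duration is the generation time in nanoseconds. The helper below is a minimal sketch; its name is hypothetical, and the usage lines assume jq is installed on the host.

```shell
# tokens_per_second EVAL_COUNT EVAL_DURATION_NS -> prints the rate with one decimal place.
tokens_per_second() {
  awk -v n="$1" -v ns="$2" \
    'BEGIN { if (ns > 0) printf "%.1f\n", n / (ns / 1e9); else print "n/a" }'
}

# Usage (assumes jq is available):
#   resp=$(curl -s http://localhost:11434/api/generate \
#     -d '{"model":"gemma3:4b","prompt":"What is a VPS?","stream":false}')
#   tokens_per_second "$(echo "$resp" | jq -r .eval_count)" \
#                     "$(echo "$resp" | jq -r .eval_duration)"
```

For example, 120 tokens generated in 20 seconds works out to 6.0 tokens per second.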

Step 4 — Managing Models and Disk Space

Models pile up fast if you're experimenting. A few commands that save me regularly:

# List all downloaded models and their sizes
docker exec ollama ollama list

# Remove a model you no longer need
docker exec ollama ollama rm phi4-mini

# Check how much space the volume is consuming
docker system df -v | grep ollama_models

# To inspect exactly where the volume lives on the host filesystem
# (Compose prefixes the volume with the project name, here the directory name "ollama"):
docker volume inspect ollama_ollama_models | grep Mountpoint

I set OLLAMA_KEEP_ALIVE=10m in the compose file because leaving a 4B model loaded permanently consumes ~3.5 GB of RAM. On a shared VPS that headroom matters. If you're on a machine with 16 GB+ RAM and you want instant responses, you can set it to -1 to keep the model loaded indefinitely.
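
The environment variable sets the default, but the API also accepts a keep_alive field per request, which is handy for unloading a model immediately or pinning it for one session. The payload builder below is a hypothetical helper, sketched to show the two value forms: bare numbers (seconds, with 0 and -1 as special cases) and quoted duration strings such as 30m.

```shell
# keep_alive_payload MODEL VALUE -> prints a JSON body for /api/generate.
# Numeric values are emitted bare; duration strings get JSON quotes.
keep_alive_payload() {
  local model="$1" keep="$2"
  case "$keep" in
    ''|*[!0-9-]*) keep="\"$keep\"" ;;
  esac
  printf '{"model":"%s","keep_alive":%s}\n' "$model" "$keep"
}

# Usage:
#   curl http://localhost:11434/api/generate -d "$(keep_alive_payload gemma3:4b 0)"    # unload now
#   curl http://localhost:11434/api/generate -d "$(keep_alive_payload gemma3:4b -1)"   # keep loaded
#   docker exec ollama ollama ps                                                       # list loaded models
```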

Step 5 — Exposing the API Securely with Caddy

If you want to call Ollama from outside the VPS — say, from your laptop or a home automation script — I strongly prefer Caddy over raw port exposure. Here's a minimal Caddyfile that puts basic auth in front of the API and handles TLS automatically:

# Install Caddy on the host (not in Docker, to keep routing simple)
sudo apt-get install -y debian-keyring debian-archive-keyring apt-transport-https
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' \
  | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' \
  | sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt-get update && sudo apt-get install -y caddy

# Generate a bcrypt password hash for basic auth
caddy hash-password --plaintext 'your-strong-password-here'
# Copy the output hash into the Caddyfile below

sudo tee /etc/caddy/Caddyfile > /dev/null <<'EOF'
llm.yourdomain.com {
    basic_auth /* {
        yourusername $2a$14$HASH_OUTPUT_FROM_CADDY_HASH_PASSWORD_GOES_HERE
    }
    reverse_proxy localhost:11434
    encode gzip
    log {
        output file /var/log/caddy/ollama-access.log
    }
}
EOF

sudo systemctl reload caddy

Point your domain's A record at the VPS IP, and Caddy will automatically obtain a Let's Encrypt certificate. After that your Ollama API is reachable at https://llm.yourdomain.com/api/generate with HTTP basic auth protecting it.
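
From the client side, the only change is that every request must carry credentials. The sketch below shows what basic auth sends on the wire and how to make the same call with curl; the helper name is hypothetical, and the username, password, and domain are the placeholders from the Caddyfile above.

```shell
# basic_auth_header USER PASSWORD -> prints the header that HTTP basic auth sends.
basic_auth_header() {
  printf 'Authorization: Basic %s\n' \
    "$(printf '%s:%s' "$1" "$2" | base64 | tr -d '\n')"
}

# Usage:
#   sudo caddy validate --config /etc/caddy/Caddyfile --adapter caddyfile   # check syntax before reloading
#   curl -u 'yourusername:your-strong-password-here' \
#     https://llm.yourdomain.com/api/generate \
#     -d '{"model":"gemma3:4b","prompt":"What is a VPS?","stream":false}'
```

The header is only base64-encoded, not encrypted, so the TLS certificate Caddy obtains is what actually protects the password in transit.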

Keeping the Container Updated

I don't use Watchtower for Ollama because auto-updating an LLM runtime mid-project can cause subtle behaviour changes. Instead I pin the image tag (as shown in the compose file) and update deliberately:

# Check the current image tag
docker inspect ollama --format '{{.Config.Image}}'

# Bump the pinned tag, pull the new image, and recreate the container
cd /opt/ollama
sed -i 's#ollama/ollama:0.6.7#ollama/ollama:0.7.0#' compose.yaml
docker compose pull
docker compose up -d --remove-orphans

# Models are safe — they live in the named volume, not the image layer
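
Rolling back is the same procedure with the previous tag. The helper below is a sketch under that assumption: the function name is hypothetical, and it rewrites whichever ollama/ollama tag is currently pinned, so you don't need to remember the exact version you are moving away from.

```shell
# set_ollama_tag FILE TAG -> rewrites the pinned ollama/ollama image tag in FILE.
# Only the tag changes; indentation and any trailing comment on the line are kept.
set_ollama_tag() {
  file="$1"; tag="$2"
  sed -E "s#(image:[[:space:]]*ollama/ollama:)[^[:space:]]+#\1${tag}#" "$file" > "$file.tmp" \
    && mv "$file.tmp" "$file"
}

# Usage (roll back to the previously pinned version):
#   cd /opt/ollama
#   set_ollama_tag compose.yaml 0.6.7
#   docker compose pull
#   docker compose up -d
```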

Conclusion

Running Ollama in Docker on a VPS is straightforward once you internalise two non-obvious rules: bind the port to localhost, and plan your disk layout before you start pulling 7B models. The compose file above gives you a solid base: persistent storage, safe API binding, configurable memory management, and a clear path to GPU passthrough when you're ready to upgrade hardware.

From here I'd recommend adding Open WebUI as a second service in the same compose.yaml for a ChatGPT-style interface, or wiring the Ollama API into n8n for workflow automation. Both sit neatly alongside the setup you've just built without any changes to the Ollama container itself.
