Setting Up Ollama with Docker for Local LLM Inference on a VPS
We earn commissions when you shop through the links on this page, at no additional cost to you. Learn more.
Running a large language model on a remote VPS — completely under your control, with no per-token billing — is genuinely one of the most satisfying things I've done in my homelab this year. Ollama makes it surprisingly approachable, and wrapping it in Docker keeps the installation clean and reproducible. In this tutorial I'll walk you through a production-ready Docker Compose setup for Ollama on a VPS, including Open WebUI for a ChatGPT-style interface and Caddy to put it all behind HTTPS.
I'll be specific about hardware requirements, networking gotchas, and the exact flags that tripped me up the first time. By the end you'll have a fully functional, password-protected AI inference server you can reach from anywhere.
Choosing a VPS for Ollama
The honest answer is: GPU VPS instances give you the best inference speed, but you can absolutely run 7B and 13B parameter models on CPU-only hardware if you're patient. I run llama3:8b on a Hetzner CCX33 (8 vCPU, 32 GB RAM) for development use cases where a few seconds of latency per response is acceptable. For anything interactive, I'd recommend either a GPU-enabled instance or sticking to quantised 7B models.
If you want GPU acceleration, look at providers like Hetzner (their GX2 line), OVH, or RunPod for spot-style GPU rentals. The NVIDIA CUDA toolkit needs to be installed on the host and Docker's NVIDIA container runtime must be present — I'll show you how to handle both. For CPU-only deployments, this guide works out of the box on any VPS with at least 16 GB RAM.
Preparing the Host
Start with a fresh Ubuntu 24.04 LTS VPS. Install Docker using the official convenience script, then add your user to the docker group:
# Install Docker Engine
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker
# Install Docker Compose plugin (bundled with modern Docker)
docker compose version
# If you have an NVIDIA GPU, install the container runtime
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
nvidia-smi on the host before touching Docker. If that command fails, the container won't see the GPU either — sort out the host driver first with sudo apt install -y nvidia-driver-535 and a reboot.The Docker Compose Stack
I prefer keeping Ollama and Open WebUI in the same Compose file so they share a private network and I only expose one port to Caddy. Create a working directory and drop this file in:
mkdir -p ~/ollama && cd ~/ollama
cat > compose.yml <<'EOF'
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: unless-stopped
# Remove or comment out the deploy block if you have no GPU
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
volumes:
- ollama_data:/root/.ollama
networks:
- ai
# Only bind to the internal network — do NOT expose port 11434 publicly
expose:
- "11434"
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
restart: unless-stopped
depends_on:
- ollama
environment:
- OLLAMA_BASE_URL=http://ollama:11434
- WEBUI_SECRET_KEY=change_this_to_a_long_random_string
- ENABLE_SIGNUP=false
volumes:
- webui_data:/app/backend/data
networks:
- ai
expose:
- "8080"
caddy:
image: caddy:2-alpine
container_name: caddy
restart: unless-stopped
ports:
- "80:80"
- "443:443"
- "443:443/udp"
volumes:
- ./Caddyfile:/etc/caddy/Caddyfile:ro
- caddy_data:/data
- caddy_config:/config
networks:
- ai
depends_on:
- open-webui
volumes:
ollama_data:
webui_data:
caddy_data:
caddy_config:
networks:
ai:
driver: bridge
EOF
Now write the Caddyfile. Replace ai.yourdomain.com with your actual subdomain — point its DNS A record at your VPS IP first:
cat > Caddyfile <<'EOF'
ai.yourdomain.com {
reverse_proxy open-webui:8080
encode gzip
header {
Strict-Transport-Security "max-age=31536000; includeSubDomains"
X-Frame-Options DENY
X-Content-Type-Options nosniff
}
}
EOF
I prefer Caddy here because automatic HTTPS is completely hands-off — it provisions and renews Let's Encrypt certificates without any cron jobs or certbot configuration. Traefik works too, but Caddy's configuration is far easier to read and reason about when something goes wrong at 2 am.
Pulling Your First Model
Start the stack, then use docker exec to pull a model into the Ollama container:
# Bring the stack up
docker compose up -d
# Watch logs to confirm everything started cleanly
docker compose logs -f --tail=50
# Pull a model — llama3.2:3b is a good starting point for CPU-only VPS
docker exec -it ollama ollama pull llama3.2:3b
# For a smarter model on a GPU instance:
docker exec -it ollama ollama pull llama3:8b
# List what you've pulled
docker exec ollama ollama list
Models are stored in the ollama_data named volume, so they survive container restarts and updates. A 3B model weighs around 2 GB; an 8B quantised to Q4 is roughly 4.7 GB. Make sure your VPS has enough disk — I allocate at least 40 GB for a comfortable working set of a few models.
expose (internal only) rather than ports for the Ollama service. Keep it that way.Creating Your Admin Account
With ENABLE_SIGNUP=false in the environment, the first user to register becomes the admin and then signups are locked. Navigate to https://ai.yourdomain.com, create your account, and you'll land in the Open WebUI chat interface. Select one of your pulled models from the dropdown and you're talking to your own private LLM.
Keeping Everything Updated
I use Watchtower in monitoring-only mode to get notified when new images are available, then update manually on a schedule:
# Pull latest images and recreate containers if images changed
docker compose pull
docker compose up -d --remove-orphans
# Prune old image layers to reclaim disk space
docker image prune -f
Run this monthly from a cron job or just manually when you want the latest Open WebUI features. Ollama itself releases frequently and model support often improves with each point release.
UFW Firewall Rules
Before going live, lock down the host firewall so only SSH, HTTP, and HTTPS are reachable:
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp comment 'SSH'
sudo ufw allow 80/tcp comment 'HTTP (Caddy redirect)'
sudo ufw allow 443/tcp comment 'HTTPS'
sudo ufw allow 443/udp comment 'HTTP/3 QUIC'
sudo ufw enable
sudo ufw status verbose
Port 11434 is never opened externally — it stays on the internal Docker bridge network where only containers in the ai network can reach it.
Performance Expectations
On a CPU-only CCX33 Hetzner instance, llama3.2:3b generates around 8–12 tokens per second, which is usable for coding assistance and summarisation tasks but feels sluggish for rapid back-and-forth chat. Bumping to a GPU instance with an NVIDIA A10G pushes that to 80–120 tokens per second on an 8B model. If you're on a budget, the 3B model on a beefy CPU VPS is a solid starting point — and you can always migrate the ollama_data volume to a GPU machine later without re-downloading models.
Next Steps
With Ollama running behind Caddy and Open WebUI, you have a solid foundation. From here, I'd recommend looking at adding Authelia in front of Open WebUI if you want proper SSO and multi-factor authentication instead of relying solely on Open WebUI's built-in user system. You might also explore the Ollama API directly — it's OpenAI-compatible, which means tools like aider, Continue.dev, and LiteLLM can point at your VPS instance with a simple base URL change. That's where self-hosted inference really starts to pay off.
Discussion