Setting Up Ollama on a VPS: Running Local LLMs in the Cloud
Running Ollama on a VPS gives you the best of both worlds: the privacy and control of a self-hosted LLM with the uptime and accessibility of a cloud server. I've been doing this on a Hetzner CPX31 for several months, and it's become one of the most useful things in my infrastructure. Whether you want a personal AI API endpoint, a shared assistant for a small team, or just want to stop paying OpenAI, this guide walks you through the whole setup from a bare Ubuntu 24.04 VPS to a Caddy-proxied, auth-protected Ollama instance serving models like Llama 3.1 and Mistral.
Choosing the Right VPS for Ollama
CPU-only LLM inference is slower than GPU inference, but it's perfectly usable for personal or low-traffic workloads. The main constraint is RAM — models need enough memory to load fully, or they'll swap to disk and crawl. Here's what I've found works:
- 7B models (Mistral 7B, Llama 3.1 8B): ~6–8 GB RAM minimum. A Hetzner CPX31 (4 vCPU, 8 GB RAM) handles these fine.
- 13B models: You really want 16 GB RAM. A Hetzner CPX41 or a RackNerd 16 GB KVM plan works well here.
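- 34B+ models: Not practical on CPU-only plans of this size. Unless you rent a dedicated GPU server, stick to smaller models or heavily quantized versions. (Hetzner's CAX series uses Ampere ARM CPUs, not GPUs, though those cores are surprisingly capable for inference.)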
I use Hetzner for Ollama specifically because their CPX line has fast NVMe storage, which matters when Ollama loads model weights from disk on first inference. If you're budget-constrained, RackNerd's annual deals often land an 8 GB RAM KVM VPS for under $40/year, which is enough for 7B models.
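The sizing rule of thumb above can be turned into a quick check to run on a candidate VPS. This is a rough sketch rather than a benchmark: suggest_tier and total_ram_gb are made-up helper names, and the thresholds simply restate the figures in the list (treating 6 to 8 GB as "quantized 7B only" is my reading of the "~6–8 GB minimum" line).

```shell
#!/bin/sh
# Hypothetical helper: map RAM in whole GB to the largest model tier
# the rule of thumb above allows. Thresholds restate the list, nothing more.
suggest_tier() {
  ram_gb="$1"
  if [ "$ram_gb" -ge 16 ]; then
    echo "13B"
  elif [ "$ram_gb" -ge 8 ]; then
    echo "7B-8B"
  elif [ "$ram_gb" -ge 6 ]; then
    echo "7B-8B (Q4 quantized only)"
  else
    echo "below the 7B minimum"
  fi
}

# Total RAM rounded up to whole GB, since an "8 GB" plan reports slightly
# less than 8 GiB in /proc/meminfo (MemTotal is given in kB).
total_ram_gb() {
  awk '/^MemTotal:/ { printf "%d\n", ($2 + 1048575) / 1048576 }' /proc/meminfo
}

# Usage on the VPS:
#   suggest_tier "$(total_ram_gb)"
```

Rounding up is a deliberate choice here: without it, an 8 GB plan would be classified one tier too low.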
Tip: Run ollama pull llama3.1:8b-instruct-q4_K_M to grab a lean version if RAM is tight.
Installing Ollama on Ubuntu 24.04
SSH into your VPS and run the official installer. I always review install scripts before piping them to a shell, but the Ollama one is clean:
# Update system first
sudo apt update && sudo apt upgrade -y
# Install Ollama (official installer)
curl -fsSL https://ollama.com/install.sh | sh
# Verify the service is running
sudo systemctl status ollama
# Pull your first model (Mistral 7B is a great starting point)
ollama pull mistral
# Test it works
ollama run mistral "Explain what a VPS is in one sentence."
The installer sets up a systemd service that runs Ollama on 127.0.0.1:11434 by default. That binding to localhost is intentional — you do not want port 11434 open to the internet without authentication. We'll fix the exposure properly with Caddy in the next section.
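To confirm the localhost binding, ss -ltn lists the listening TCP sockets. Below is a small sketch that classifies that output. check_local_only is a hypothetical helper name, and the parsing assumes the usual ss column layout, where the fourth field is Local Address:Port.

```shell
#!/bin/sh
# Hypothetical helper: read `ss -ltn` output on stdin and report whether
# the given port is bound only to loopback addresses.
check_local_only() {
  port="$1"
  awk -v port="$port" '
    $1 == "LISTEN" {
      n = split($4, parts, ":")
      if (parts[n] == port) {
        # Everything before the trailing ":port" is the bind address
        addr = substr($4, 1, length($4) - length(port) - 1)
        found = 1
        if (addr != "127.0.0.1" && addr != "[::1]") exposed = 1
      }
    }
    END {
      if (!found) print "not-listening"
      else if (exposed) print "exposed"
      else print "local-only"
    }
  '
}

# Usage on the VPS:
#   ss -ltn | check_local_only 11434
```

A result of "exposed" means Ollama is bound to a non-loopback address and is relying on the firewall alone for protection.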
After the pull completes, check that the API responds locally:
# Quick API test (runs on the VPS itself)
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "What is 2 + 2?",
"stream": false
}'
You should get a JSON response with a "response" field. If that works, the hard part is done — the rest is just proxying it securely.
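The same reply can double as a quick throughput check. With "stream": false, the JSON from /api/generate also carries eval_count (generated tokens) and eval_duration (in nanoseconds), and dividing one by the other gives tokens per second. The helper below is a minimal sketch of that arithmetic; tokens_per_sec is a name I made up, and you pass it the two numbers by hand.

```shell
#!/bin/sh
# Hypothetical helper: tokens per second from Ollama's eval_count and
# eval_duration (nanoseconds) fields.
tokens_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN {
    if (ns <= 0) { print "n/a"; exit }
    printf "%.1f\n", n * 1000000000 / ns
  }'
}

# Usage: copy the two values out of the JSON reply, for example
#   tokens_per_sec 120 12000000000
```

If jq happens to be installed, appending | jq '.eval_count, .eval_duration' to the curl command prints just those two fields.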
Installing Caddy and Proxying Ollama
I prefer Caddy over Nginx for setups like this because automatic HTTPS is zero-config and the Caddyfile syntax is dramatically cleaner. Install it from the official apt repository:
# Add Caddy's official apt repo
sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https curl
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' \
| sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' \
| sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update && sudo apt install caddy -y
# Confirm Caddy is running
sudo systemctl status caddy
Now configure the Caddyfile. I'm using HTTP basic auth here to keep it simple — if you want SSO, look at my Authentik tutorial instead. Replace ai.yourdomain.com with your actual subdomain and make sure it has an A record pointing to your VPS IP.
# Generate a hashed password for basic auth
# Replace 'yourpassword' with something strong
caddy hash-password --plaintext 'yourpassword'
# Copy the $2a$... output for use in the Caddyfile
# Edit the Caddyfile
sudo nano /etc/caddy/Caddyfile
# /etc/caddy/Caddyfile
ai.yourdomain.com {
    # Basic auth: paste the hash from caddy hash-password
    # (Caddy 2.8+ names this directive basic_auth; older releases use basicauth)
    basic_auth {
        youruser $2a$14$PASTE_YOUR_HASH_HERE
    }
    # Proxy to Ollama's local port
    reverse_proxy localhost:11434 {
        # Increase timeouts for long inference requests
        transport http {
            response_header_timeout 300s
            read_timeout 300s
        }
    }
    # Optional: restrict to your Tailscale IP or specific CIDR
    # @blocked not remote_ip 100.64.0.0/10
    # respond @blocked 403
}
# Check for config errors first
caddy validate --config /etc/caddy/Caddyfile
# Then reload Caddy to apply the config
sudo systemctl reload caddy
Locking Down the Firewall
By default you want only ports 22 (SSH), 80 (HTTP for ACME), and 443 (HTTPS) open. Port 11434 should never be exposed directly:
sudo ufw allow OpenSSH
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
# Never: sudo ufw allow 11434 ← don't do this
sudo ufw enable
sudo ufw status
Caddy handles the Let's Encrypt certificate automatically on first request to your domain, so as long as your DNS is pointed correctly you'll have HTTPS with no extra steps.
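Once DNS and the certificate are in place, an end-to-end check from another machine is worthwhile. The wrapper below is a sketch under a few assumptions: remote_generate is a hypothetical function name, OLLAMA_URL, OLLAMA_USER and OLLAMA_PASS are placeholder variables I introduced, and the defaults mirror the example domain and user from the Caddyfile.

```shell
#!/bin/sh
# Placeholder settings: override these with your own subdomain and credentials.
OLLAMA_URL="${OLLAMA_URL:-https://ai.yourdomain.com}"
OLLAMA_USER="${OLLAMA_USER:-youruser}"
OLLAMA_PASS="${OLLAMA_PASS:-yourpassword}"

# remote_generate MODEL PROMPT
# Keep PROMPT free of double quotes and backslashes: this sketch does no
# JSON escaping.
remote_generate() {
  payload=$(printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2")
  curl -sS -u "$OLLAMA_USER:$OLLAMA_PASS" \
    -H 'Content-Type: application/json' \
    -d "$payload" \
    "$OLLAMA_URL/api/generate"
}

# Usage from your laptop:
#   remote_generate mistral "What is 2 + 2?"
```

Sending the same request without -u should come back as HTTP 401 from Caddy, which confirms the auth layer is actually in front of Ollama.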
Pulling and Managing Models
Once the proxy is up, you can interact with Ollama from any machine using standard HTTP with your credentials. On the VPS itself, model management is straightforward:
# See what you have installed
ollama list
# Pull additional models
ollama pull llama3.1:8b # Meta's Llama 3.1 8B
ollama pull qwen2.5:7b # Alibaba's Qwen 2.5 — surprisingly good
ollama pull nomic-embed-text # Embedding model for RAG pipelines
ollama pull codellama:7b # Code-focused model
# Check how much disk space models are using
# (the system installer stores them under the ollama service user's home)
sudo du -sh /usr/share/ollama/.ollama/models/
# Remove a model you no longer need
ollama rm mistral
Model files live in /usr/share/ollama/.ollama/models/ when installed via the system installer (as opposed to the user home directory when running as your own user). Check with sudo systemctl cat ollama to see which user it's running as — this tells you where the model files actually live.
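That lookup can be scripted. The sketch below resolves a user's home directory from a passwd-format file and appends the default .ollama/models path. models_dir_for is a hypothetical helper, and it assumes the default location: if the service sets the OLLAMA_MODELS environment variable, that path takes precedence.

```shell
#!/bin/sh
# Hypothetical helper: print the default model directory for a user,
# based on that user's home directory in a passwd-format file.
models_dir_for() {
  user="$1"
  passwd_file="${2:-/etc/passwd}"
  home=$(awk -F: -v u="$user" '$1 == u { print $6 }' "$passwd_file")
  [ -n "$home" ] || return 1
  echo "$home/.ollama/models"
}

# Usage on the VPS:
#   svc_user=$(systemctl show -p User --value ollama)
#   sudo du -sh "$(models_dir_for "${svc_user:-$USER}")"
```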
Connecting Open WebUI for a Chat Interface
If you want a ChatGPT-style web interface rather than just a raw API, Open WebUI pairs perfectly with Ollama. I run it as a Docker container alongside Ollama on the same VPS:
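# Install Docker if you haven't already
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
# Log out and back in (or run: newgrp docker) so the group change takes effect
# Run Open WebUI connected to the host's Ollama instance.
# Caveat: Ollama listens on 127.0.0.1 by default, which a bridged container
# cannot reach. Either bind Ollama to an address the container can reach
# (OLLAMA_HOST in the systemd environment) and keep 11434 firewalled, or run
# the container with --network=host and OLLAMA_BASE_URL=http://127.0.0.1:11434
# (Open WebUI then listens on host port 8080 instead of 3000).
docker run -d \
--name open-webui \
--restart unless-stopped \
-p 127.0.0.1:3000:8080 \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
ghcr.io/open-webui/open-webui:main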
Then add a second block to your Caddyfile for the WebUI on a different subdomain, like chat.yourdomain.com, proxying to localhost:3000. Open WebUI has its own user authentication built in, so you don't necessarily need the basic auth layer there — though I still put it behind Tailscale for good measure.
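For reference, that second site block can be very short. The helper below prints one for a given subdomain so you can append it to the Caddyfile. chat_site_block is a made-up function name, and the block is a minimal sketch that assumes Open WebUI is published on localhost:3000 as in the docker run command above.

```shell
#!/bin/sh
# Hypothetical helper: print a Caddyfile site block that proxies a
# subdomain to Open WebUI. Arguments: DOMAIN [PORT, default 3000].
chat_site_block() {
  domain="$1"
  port="${2:-3000}"
  cat <<EOF
$domain {
    # Open WebUI handles its own logins, so no basic auth here
    reverse_proxy localhost:$port
}
EOF
}

# Usage on the VPS:
#   chat_site_block chat.yourdomain.com | sudo tee -a /etc/caddy/Caddyfile
#   caddy validate --config /etc/caddy/Caddyfile && sudo systemctl reload caddy
```

The new subdomain needs its own A record before Caddy can obtain a certificate for it.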
Performance Expectations and Tips
On a Hetzner CPX31 (4 AMD vCPU cores), Mistral 7B Q4 generates roughly 8–12 tokens per second. That's slow compared to a GPU but totally usable for non-interactive tasks like summarization, classification, or generating draft text. For interactive chat it's a little sluggish but workable. A few things that help:
- Set OLLAMA_NUM_PARALLEL=1 in the systemd environment to avoid splitting CPU resources across multiple concurrent requests.
- Use streaming responses when querying the API. It makes responses feel faster even if total time is the same.
- Consider Hetzner's ARM-based CAX21 (4 Ampere cores, 8 GB RAM) — Ollama runs on ARM64 natively and I've seen 15–20% better throughput on Ampere compared to equivalent AMD cores for transformer inference.
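Setting a variable "in the systemd environment" means adding an Environment= line to a drop-in file for the ollama service. The helper below is a small sketch that prints such a drop-in. ollama_override is a hypothetical name, and each argument is a NAME=value pair.

```shell
#!/bin/sh
# Hypothetical helper: print a systemd drop-in that sets environment
# variables for the ollama service. Each argument is a NAME=value pair.
ollama_override() {
  echo "[Service]"
  for pair in "$@"; do
    printf 'Environment="%s"\n' "$pair"
  done
}

# Usage on the VPS:
#   sudo mkdir -p /etc/systemd/system/ollama.service.d
#   ollama_override OLLAMA_NUM_PARALLEL=1 \
#     | sudo tee /etc/systemd/system/ollama.service.d/override.conf
#   sudo systemctl daemon-reload && sudo systemctl restart ollama
```

Note that tee overwrites the file, so pass every variable you want to keep in a single call.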
Tip: Add Environment="OLLAMA_KEEP_ALIVE=30m" under the [Service] section of /etc/systemd/system/ollama.service.d/override.conf (create the directory if needed) and reload with sudo systemctl daemon-reload && sudo systemctl restart ollama. This keeps a model loaded for 30 minutes after its last request, which eliminates the cold-start load time for your most-used model.
Conclusion
You now have a private, HTTPS-secured Ollama instance running on a VPS, accessible from anywhere without exposing port 11434 to the open internet. The total monthly cost on Hetzner is around €8–12 depending on the plan, which is competitive with even light OpenAI API usage once you factor in privacy and the fact that your data never leaves your server.
My recommended next steps: set up Watchtower to keep the Open WebUI container updated automatically (docker run -d --name watchtower -v /var/run/docker.sock:/var/run/docker.sock containrrr/watchtower --interval 86400), and look into the nomic-embed-text model if you want to build a local RAG pipeline — it's a small embedding model that pairs well with a vector DB like Chroma or Qdrant for document search over your own files.