Setting Up Ollama on a VPS: Running Local LLMs in the Cloud
Running Ollama on a VPS is one of those setups that sounds counterintuitive — isn't "local LLM" the point? — but it makes perfect sense once you think about it. I want private, always-on AI inference I can hit from any device without spinning up my home machine, and a VPS is exactly that. The catch is doing it securely: Ollama's API has zero authentication, so exposing port 11434 to the internet (which plenty of guides casually suggest by setting OLLAMA_HOST=0.0.0.0) is a disaster on a public server.
In this tutorial I'll walk through provisioning a suitable VPS, installing Ollama, pulling a model, and then locking the whole thing down behind Caddy with HTTPS and basic auth. By the end you'll have a private LLM endpoint you can query from curl, Open WebUI, or any OpenAI-compatible client.
Choosing the Right VPS
LLM inference is CPU and RAM heavy, not necessarily GPU heavy — at least for smaller quantized models. For running llama3.2:3b or mistral:7b-instruct-q4_K_M you need at least 8 GB RAM; 16 GB is more comfortable. I've been running Mistral 7B Q4 on a DigitalOcean CPU-Optimized Droplet with 8 vCPUs and 16 GB RAM and it handles one or two concurrent users without breaking a sweat. Response times are slower than GPU inference, but perfectly usable for personal or small-team workloads.
For a pure CPU setup, DigitalOcean's Droplets offer predictable monthly pricing and a dead-simple control panel. The CPU-Optimized 8 vCPU / 16 GB tier is a reasonable starting point. If you want to experiment without commitment, create your DigitalOcean account today; new accounts get free credit to play with.
If you want GPU inference, DigitalOcean's GPU Droplets (H100 slices) work fine with Ollama's CUDA backend. Hetzner also offers GPU instances in some regions. For budget-conscious testing, even a 4 GB RAM VPS will run qwen2.5:1.5b or gemma3:1b adequately.
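A quick back-of-envelope check helps when sizing a plan. As a rough rule of thumb (an approximation, not a benchmark), a Q4_K_M quantization stores about 4.5 bits per weight, and you want roughly double the weight size in total RAM to cover the KV cache, the runtime, and the OS:

```shell
# Rough RAM estimate for a quantized model (rule-of-thumb numbers, not exact)
awk 'BEGIN {
  params = 7e9          # parameter count (7B model)
  bpw    = 4.5          # approx. bits per weight for Q4_K_M
  gib    = 1024 ^ 3
  weights = params * bpw / 8 / gib
  printf "weights: ~%.1f GiB; plan for roughly 2x that in total RAM\n", weights
}'
```

For the 7B example this works out to roughly 3.7 GiB of weights, which is why 8 GB is the floor and 16 GB is comfortable.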
Initial Server Setup
I always start every new VPS with the same hardening pass. SSH in as root, create a non-root user, disable password auth, and bring the firewall up. Here's my full bootstrap sequence:
# Create a non-root user and add to sudo group
adduser deploy
usermod -aG sudo deploy
# Copy your SSH key to the new user
rsync --archive --chown=deploy:deploy ~/.ssh /home/deploy
# Harden SSH — edit /etc/ssh/sshd_config
sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
systemctl restart sshd
# Set up UFW
ufw default deny incoming
ufw default allow outgoing
ufw allow 22/tcp # SSH
ufw allow 80/tcp # HTTP (needed for Let's Encrypt challenge)
ufw allow 443/tcp # HTTPS
ufw enable
# Update packages
apt update && apt upgrade -y
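One pitfall with the sed edits above: if your sshd_config spells a directive differently, the pattern silently matches nothing. Verify before you close your root session (and test an SSH login as the new user from a second terminal before relying on the change):

```shell
# Confirm the hardening edits actually landed; both lines should end in "no"
grep -E '^(PermitRootLogin|PasswordAuthentication)' /etc/ssh/sshd_config
```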
Notice I'm not opening port 11434 in UFW. Ollama will only be reachable through the Caddy reverse proxy on 443. That's the whole security model here.
Installing Ollama
Ollama ships a one-liner installer that works perfectly on Ubuntu 22.04 and 24.04:
# Install Ollama (run as your non-root sudo user)
curl -fsSL https://ollama.com/install.sh | sh
# Pin Ollama explicitly to localhost so it can never be exposed by accident
# Create a systemd override
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<EOF
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
sudo systemctl enable ollama
# Verify it's running and bound correctly
ss -tlnp | grep 11434
# Should show: 127.0.0.1:11434 — NOT 0.0.0.0:11434
# Pull your first model
ollama pull mistral:7b-instruct-q4_K_M
# Quick smoke test
ollama run mistral:7b-instruct-q4_K_M "Say hello in one sentence."
Pinning OLLAMA_HOST=127.0.0.1:11434 is the critical piece of the security model. Ollama's API has no authentication, so if the service is ever exposed on a public interface, anyone who discovers your server's IP can talk directly to your LLM, including submitting expensive requests or extracting system prompt context from whatever you've configured.
The install script registers Ollama as a systemd service automatically. Set the OLLAMA_HOST override before pulling models or doing anything else; I've seen people expose port 11434 directly, pull a big model, and then find their inference endpoint indexed by Shodan within hours.
Installing Caddy and Configuring the Reverse Proxy
I prefer Caddy over Nginx or Traefik for this use case because its automatic HTTPS is genuinely zero-config — you give it a domain name and it handles Let's Encrypt certificates, renewals, HTTP-to-HTTPS redirects, and sane TLS defaults without any extra tooling. For a single-service setup like this it's perfect.
# Install Caddy from the official repo
sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https curl
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' \
| sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' \
| sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update && sudo apt install caddy -y
# Generate a bcrypt password hash for basic auth
# Replace "yourpassword" with something strong
caddy hash-password --plaintext "yourpassword"
# Copy the $2a$... hash output — you'll paste it into the Caddyfile
# Edit the Caddyfile
sudo tee /etc/caddy/Caddyfile <<'EOF'
llm.yourdomain.com {
basicauth /* {
# Username: ollama
ollama $2a$14$REPLACE_WITH_YOUR_HASH_OUTPUT_HERE
}
reverse_proxy 127.0.0.1:11434 {
header_up Host {upstream_hostport}
}
log {
output file /var/log/caddy/ollama-access.log
format json
}
}
EOF
sudo mkdir -p /var/log/caddy
sudo chown caddy:caddy /var/log/caddy
sudo systemctl reload caddy
sudo systemctl enable caddy
Once Caddy is running, hit https://llm.yourdomain.com/api/tags from your browser or curl — you should get a JSON list of your installed models after entering your credentials. Caddy will have already fetched and installed the Let's Encrypt certificate; no certbot, no cron jobs, no manual renewal.
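From the command line, the same check looks like this (llm.yourdomain.com, the username ollama, and yourpassword are placeholders for whatever you configured above):

```shell
# Query the protected endpoint with HTTP basic auth
curl -s -u ollama:yourpassword https://llm.yourdomain.com/api/tags

# Or build the Authorization header yourself, the way an API client would
AUTH=$(printf 'ollama:yourpassword' | base64)
curl -s -H "Authorization: Basic $AUTH" \
  https://llm.yourdomain.com/api/chat \
  -d '{"model":"mistral:7b-instruct-q4_K_M","messages":[{"role":"user","content":"Say hello."}],"stream":false}'
```

The `-u` flag and the hand-built header are equivalent; curl just base64-encodes `user:pass` for you.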
To use the endpoint from an OpenAI-compatible client (such as openai-python), set the base URL to https://llm.yourdomain.com and add an Authorization: Basic <base64(user:pass)> header. Ollama serves its native chat API at /api/chat and an OpenAI-compatible chat completions endpoint at /v1/chat/completions.
Managing Models and Storage
Models live in /usr/share/ollama/.ollama/models by default when installed via the system service. A 7B Q4 model is around 4.1 GB; a 13B Q4 is roughly 7.4 GB. On a VPS you're paying for disk, so be selective. I keep two or three models loaded and use ollama rm to prune ones I haven't touched in a while:
# List downloaded models and their sizes
ollama list
# Pull additional models
ollama pull llama3.2:3b # ~2 GB, fast on CPU
ollama pull qwen2.5-coder:7b # good for code tasks
# Remove a model you're not using
ollama rm mistral:7b-instruct-q4_K_M
# Check disk usage
du -sh /usr/share/ollama/.ollama/models/
# Move model storage to a larger volume if needed
# (Stop ollama first, then add OLLAMA_MODELS to the override)
sudo systemctl stop ollama
sudo mkdir -p /mnt/data/ollama/models
sudo chown -R ollama:ollama /mnt/data/ollama   # the service runs as the ollama user
sudo tee -a /etc/systemd/system/ollama.service.d/override.conf <<EOF
Environment="OLLAMA_MODELS=/mnt/data/ollama/models"
EOF
sudo systemctl daemon-reload
sudo systemctl start ollama
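If you want a quick total without digging through du output, you can sum the SIZE column of ollama list. Note this sketch assumes the current NAME / ID / SIZE / MODIFIED column layout and "N.N GB" / "NNN MB" size formatting, which could change between Ollama releases:

```shell
# Sum the SIZE column of `ollama list` into a single GB figure
ollama list | awk 'NR > 1 {
  if ($4 == "GB") total += $3
  else if ($4 == "MB") total += $3 / 1024
} END { printf "total: %.1f GB\n", total }'
```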
Keeping Ollama Updated
Ollama releases frequently — model support, performance improvements, and new API features land every few weeks. Because it's installed via the shell script rather than apt, updates aren't automatic. I run this monthly:
# Re-running the install script upgrades Ollama in place
curl -fsSL https://ollama.com/install.sh | sh
# Verify the version
ollama --version
# Restart the service after upgrade
sudo systemctl restart ollama
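If you trust the upstream install script enough to run it unattended (a judgment call; the schedule below is just an example), you can automate the monthly update with a cron.d entry:

```shell
# Run the updater at 03:00 on the 1st of each month
# cron.d field order: minute hour day-of-month month day-of-week user command
echo '0 3 1 * * root curl -fsSL https://ollama.com/install.sh | sh && systemctl restart ollama' \
  | sudo tee /etc/cron.d/ollama-update
```

The trade-off: unattended upgrades occasionally break things, so if your endpoint matters, stick with the manual monthly run.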
Performance Expectations on CPU-Only VPS
Honest numbers from my DigitalOcean CPU-Optimized 8 vCPU / 16 GB droplet: Mistral 7B Q4_K_M generates roughly 8–12 tokens per second for single-user inference. That's about a sentence every two seconds — slow compared to GPU inference but completely fine for chat, code review, or summarization tasks where you're reading as it streams. The qwen2.5:1.5b model hits 35–45 tokens/second on the same hardware, which feels snappy.
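To put those numbers in context: a common heuristic (it varies by tokenizer and language) is about 0.75 English words per token, so you can convert a measured rate, such as the eval rate that ollama run --verbose reports, into a reading speed:

```shell
# Convert tokens/s into words/min (0.75 words per token is a rough heuristic)
awk 'BEGIN {
  tps = 10              # measured tokens per second
  printf "~%d words/min (typical reading speed is ~200-300)\n", tps * 0.75 * 60
}'
```

At 10 tokens/s the model produces around 450 words a minute, comfortably ahead of reading pace, which is why streamed CPU inference feels fine in practice.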
If you need faster responses, look at GPU Droplets or consider Hetzner's cloud GPU lineup. But for a private, always-on assistant that costs $50–80/month all-in, CPU inference is a solid trade-off.
Next Steps
With Ollama running and secured, the natural next move is adding Open WebUI on the same server so you have a ChatGPT-style interface — I've covered that setup in detail in the Ollama + Open WebUI Docker Compose tutorial. You might also want to add fail2ban rules targeting the Caddy access log to rate-limit brute-force attempts against your basic auth endpoint — check out the fail2ban and CrowdSec hardening guide for that.
If you don't yet have a VPS to run this on, create a DigitalOcean account and spin up a Droplet — the whole setup from a fresh Ubuntu 24.04 image takes about 20 minutes.