Setting Up Ollama on a VPS: Run Local LLMs in the Cloud
We earn commissions when you shop through the links on this page, at no additional cost to you.
Running Ollama on your local homelab is great — until you need access from a laptop, a phone, or a teammate across the country. Putting Ollama on a VPS gives you a persistent, always-on inference endpoint that you fully control, with zero per-token costs and no data leaving your infrastructure. In this tutorial I'll walk you through everything: picking the right VPS, installing Ollama, locking down the API, and fronting it with Caddy so you get HTTPS out of the box.
Choosing the Right VPS for Ollama
Model inference is CPU and RAM heavy. A tiny $4/month box will technically run Ollama, but you'll be staring at the spinner for thirty seconds per response. My recommendations, in order of preference:
- Minimum for 7B models (Q4): 4 vCPUs, 8 GB RAM — expect 3–6 tokens/second.
- Comfortable for 13B models (Q4): 8 vCPUs, 16 GB RAM — expect 2–4 tokens/second.
- GPU droplet: Any NVIDIA GPU with 8+ GB VRAM will dwarf CPU inference speed.
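When sizing a box, I sanity-check against a back-of-the-envelope heuristic: roughly 0.6 bytes per parameter for Q4 weights, plus about 2 GB for KV cache and OS overhead. The coefficients are my own rough assumptions, not figures from Ollama:

```shell
# Rough RAM estimate for running a Q4-quantized model (heuristic only):
# ~0.6 bytes per parameter, plus ~2 GB for KV cache and OS overhead.
estimate_ram_gb() {  # usage: estimate_ram_gb <params_in_billions>
  awk -v p="$1" 'BEGIN { printf "%.0f\n", p * 0.6 + 2 }'
}
estimate_ram_gb 7    # → 6
estimate_ram_gb 13   # → 10
```

The 8 GB and 16 GB recommendations above deliberately leave headroom beyond these minimums for the OS, the reverse proxy, and concurrent requests.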
I've been using DigitalOcean for this workload and I'm happy with it. Their General Purpose Droplets give you dedicated vCPUs, which matters a lot for inference — shared-CPU instances throttle hard under sustained load. Create your DigitalOcean account today and you'll get started with solid infrastructure that won't let you down mid-generation.
For the OS, I always reach for Ubuntu 24.04 LTS. It's well-supported by Ollama's install script and has a long support window.
For a first model, llama3.2:3b-instruct-q4_K_M is a surprisingly capable choice: the 3B quantized model loads in under five seconds and responds at a comfortable pace for personal use.

Initial VPS Setup
Before touching Ollama, harden the server. These steps are non-negotiable — Ollama's API has no authentication by default, and an open port 11434 on the public internet is a gift to anyone who wants free inference on your bill.
# Update packages
sudo apt update && sudo apt upgrade -y
# Create a non-root user (replace 'deploy' with your preferred username)
sudo adduser deploy
sudo usermod -aG sudo deploy
# Copy your SSH key to the new user (run this from your LOCAL machine)
ssh-copy-id deploy@YOUR_VPS_IP
# Back on the VPS: lock down SSH
sudo nano /etc/ssh/sshd_config
# Set these values:
# PermitRootLogin no
# PasswordAuthentication no
# PubkeyAuthentication yes
sudo systemctl restart ssh
# Configure UFW firewall
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp # SSH
sudo ufw allow 80/tcp # HTTP (for Let's Encrypt challenge)
sudo ufw allow 443/tcp # HTTPS
sudo ufw enable
# Do NOT open port 11434 — Ollama will listen only on 127.0.0.1,
# with all access gated through your reverse proxy

Installing Ollama
Ollama provides a one-liner installer that handles the systemd service, binary placement, and GPU detection automatically. On a VPS without a GPU it still works perfectly — it just runs on CPU.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Verify the service started
sudo systemctl status ollama
# Ollama's service defaults to 127.0.0.1:11434, but pin the binding explicitly so it can never be exposed by accident
sudo systemctl edit ollama
# This opens a drop-in override file. Add these lines:
# [Service]
# Environment="OLLAMA_HOST=127.0.0.1:11434"
# Save and restart
sudo systemctl daemon-reload
sudo systemctl restart ollama
# Confirm it's only listening on localhost
ss -tlnp | grep 11434
# Expected: 127.0.0.1:11434
# Pull a model — llama3.2 is a great starting point
ollama pull llama3.2
# Quick smoke test
ollama run llama3.2 "Explain what a VPS is in one sentence."
The systemd override approach is important. The install script rewrites the main ollama.service unit on upgrade, so changes made directly to that file are lost, while a drop-in override file survives package upgrades.
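For reference, the finished drop-in file (it typically lands at /etc/systemd/system/ollama.service.d/override.conf) contains just:

```ini
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
```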
Installing Caddy as a Reverse Proxy
I prefer Caddy over Nginx for this use case because automatic HTTPS with Let's Encrypt requires zero additional configuration — you just give it a domain and it handles the certificate. Install it from the official Caddy repository to get the latest stable build:
# Add Caddy's official repository
sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https curl
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' \
| sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' \
| sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update && sudo apt install caddy -y
# Edit the Caddyfile
sudo nano /etc/caddy/Caddyfile
Replace the contents of /etc/caddy/Caddyfile with the following. Substitute ollama.yourdomain.com with your actual subdomain, and make sure it's already pointing at your VPS via an A record:
ollama.yourdomain.com {
    # Basic auth — creates a username/password gate in front of the API
    basic_auth * {
        # Generate a hashed password with: caddy hash-password
        youruser $2a$14$REPLACE_WITH_YOUR_CADDY_HASHED_PASSWORD
    }
    reverse_proxy 127.0.0.1:11434 {
        # Present Ollama with its own local address; Ollama validates the
        # Host header to guard against DNS rebinding and may reject the
        # public domain name
        header_up Host 127.0.0.1:11434
    }
    # Optional: restrict to specific IPs for extra security
    # @blocked not remote_ip 203.0.113.0/24
    # respond @blocked "Forbidden" 403
    log {
        output file /var/log/caddy/ollama-access.log
    }
}
# Check for config errors before reloading
sudo caddy validate --config /etc/caddy/Caddyfile
# Reload Caddy once the config validates
sudo systemctl reload caddy
To generate the hashed password for the basic auth block, run caddy hash-password on the server and paste the output in. Now your Ollama API is reachable at https://ollama.yourdomain.com with HTTPS and a password gate — a huge upgrade from a raw open port.
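Under the hood, curl's -u flag simply base64-encodes user:password into an Authorization header, so you can see exactly what will cross the wire (encoded, not encrypted; this is why the HTTPS layer matters):

```shell
# What curl -u sends: an HTTP Basic Authorization header containing
# the base64 of "user:password"
cred=$(printf '%s' 'youruser:yourpassword' | base64)
echo "Authorization: Basic ${cred}"
```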
Pulling Models and Testing the Remote API
With everything running, test the authenticated API from your local machine using curl:
# List available models via the API
curl -u youruser:yourpassword \
https://ollama.yourdomain.com/api/tags | python3 -m json.tool
# Send a chat completion request
curl -u youruser:yourpassword \
-X POST https://ollama.yourdomain.com/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"prompt": "What is the capital of France?",
"stream": false
}' | python3 -m json.tool
# Pull a new model remotely (triggers download on the VPS)
curl -u youruser:yourpassword \
-X POST https://ollama.yourdomain.com/api/pull \
-H "Content-Type: application/json" \
-d '{"name": "mistral"}'
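With "stream": false, /api/generate returns a single JSON object whose text lives in the response field. A quick local sketch of extracting just that field (the sample payload is illustrative, shaped like a real response):

```shell
# Pull the "response" field out of a generate payload; the sample below
# is a stand-in for what the authenticated curl call above returns
sample='{"model":"llama3.2","response":"Paris.","done":true}'
printf '%s' "$sample" \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["response"])'
# → Paris.
```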
You can also point Open WebUI at this endpoint. In Open WebUI's settings, set the Ollama base URL to https://ollama.yourdomain.com and supply the basic auth credentials; if the UI doesn't expose a separate field for them, embedding them in the URL (https://youruser:yourpassword@ollama.yourdomain.com) is a common workaround.
Managing Disk Space
This is the gotcha that bites everyone. Ollama stores models in /usr/share/ollama/.ollama/models by default. A single 7B Q4 model is around 4–5 GB. A 13B model is 7–8 GB. On a 25 GB root disk you'll run out fast.
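Before pulling a big model, it's worth checking whether it will actually fit. A small helper of my own (not an Ollama feature) that insists on 2 GB of headroom:

```shell
# Rough fit check: does <model_size_gb> plus 2 GB headroom fit on <path>?
fits() {  # usage: fits <model_size_gb> <path>
  avail_gb=$(df -Pk "$2" | awk 'NR==2 { print int($4/1024/1024) }')
  if [ "$avail_gb" -ge $(( $1 + 2 )) ]; then echo yes; else echo no; fi
}
fits 5 /    # prints yes or no depending on free space
```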
I mount a separate DigitalOcean Block Storage volume at /mnt/ollama-models and symlink the models directory:
# After attaching the volume, find its stable device path: DigitalOcean
# exposes volumes under /dev/disk/by-id/ as scsi-0DO_Volume_<name>
ls -l /dev/disk/by-id/
# Format and mount (substitute your volume's actual path)
sudo mkfs.ext4 /dev/disk/by-id/scsi-0DO_Volume_ollama-models
sudo mkdir -p /mnt/ollama-models
sudo mount /dev/disk/by-id/scsi-0DO_Volume_ollama-models /mnt/ollama-models
# Add to /etc/fstab for persistence; 'nofail' keeps the server booting
# even if the volume is detached
echo '/dev/disk/by-id/scsi-0DO_Volume_ollama-models /mnt/ollama-models ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab
# Move existing model data and symlink
sudo systemctl stop ollama
sudo mv /usr/share/ollama/.ollama/models /mnt/ollama-models/
sudo ln -s /mnt/ollama-models/models /usr/share/ollama/.ollama/models
sudo chown -R ollama:ollama /mnt/ollama-models/
sudo systemctl start ollama
Keeping Ollama Updated
Ollama doesn't auto-update through apt. The easiest approach is to re-run the install script when a new version drops — it detects the existing installation and upgrades in place without losing your models:
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl restart ollama
ollama --version
I check for new releases weekly. The Ollama GitHub releases page is the canonical source, and new model support often ships alongside runtime updates — worth staying current.
Wrapping Up
You now have a hardened, HTTPS-protected Ollama endpoint running in the cloud with proper firewall rules, a reverse proxy, and a dedicated model storage volume. The total monthly cost for a capable 8 GB Droplet plus a 50 GB block volume lands well under $25 — less than most cloud AI API budgets for any serious usage.
Next steps: consider adding Open WebUI in front of your endpoint for a polished chat interface, or explore fine-tuning workflows to customise models for your specific use case. If you want to run this whole stack on DigitalOcean, their Droplets offer dependable uptime with a 99.99% SLA and predictable monthly pricing — exactly what you want for a persistent inference server.