Setting Up Ollama on a VPS: Running Local LLMs in the Cloud
Running Ollama on a VPS is one of those setups that sounds counterintuitive — isn't "local LLM" the point? — but it makes perfect sense once you think about it. I want private, always-on AI inference I can hit from any device without spinning up my home machine, and a VPS is exactly that. The catch is doing it securely: Ollama's API has zero authentication, so exposing port 11434 to the internet (which plenty of guides casually suggest by setting OLLAMA_HOST=0.0.0.0) is a disaster on a public server.
In this tutorial I'll walk through provisioning a suitable VPS, installing Ollama, pulling a model, and then locking the whole thing down behind Caddy with HTTPS and basic auth. By the end you'll have a private LLM endpoint you can query from curl, Open WebUI, or any OpenAI-compatible client.
Choosing the Right VPS
LLM inference is CPU and RAM heavy, not necessarily GPU heavy — at least for smaller quantized models. For running llama3.2:3b or mistral:7b-instruct-q4_K_M you need at least 8 GB RAM; 16 GB is more comfortable. I've been running Mistral 7B Q4 on a DigitalOcean CPU-Optimized Droplet with 8 vCPUs and 16 GB RAM and it handles one or two concurrent users without breaking a sweat. Response times are slower than GPU inference, but perfectly usable for personal or small-team workloads.
For a pure CPU setup, DigitalOcean's Droplets offer predictable monthly pricing and a dead-simple control panel. The CPU-Optimized 8 vCPU / 16 GB tier is a reasonable starting point. If you want to experiment without commitment, create your DigitalOcean account today; new accounts get free credit to play with.
If you want GPU inference, DigitalOcean's GPU Droplets (H100 slices) work fine with Ollama's CUDA backend. Hetzner also offers GPU instances in some regions. For budget-conscious testing, even a 4 GB RAM VPS will run qwen2.5:1.5b or gemma3:1b adequately.
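A quick back-of-envelope check helps when sizing a plan. As a rough rule of thumb (an approximation, not a benchmark), a Q4_K_M quantization stores about 4.5 bits per weight, and you want roughly double the weight size in total RAM to cover the KV cache, the runtime, and the OS:

```shell
# Rough RAM estimate for a quantized model (rule-of-thumb numbers, not exact)
awk 'BEGIN {
  params = 7e9          # parameter count (7B model)
  bpw    = 4.5          # approx. bits per weight for Q4_K_M
  gib    = 1024 ^ 3
  weights = params * bpw / 8 / gib
  printf "weights: ~%.1f GiB; plan for roughly 2x that in total RAM\n", weights
}'
```

For the 7B example this works out to roughly 3.7 GiB of weights, which is why 8 GB is the floor and 16 GB is comfortable.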
Initial Server Setup
I always start every new VPS with the same hardening pass. SSH in as root, create a non-root user, disable password auth, and bring the firewall up. Here's my full bootstrap sequence:
# Create a non-root user and add to sudo group
adduser deploy
usermod -aG sudo deploy
# Copy your SSH key to the new user
rsync --archive --chown=deploy:deploy ~/.ssh /home/deploy
# Harden SSH — edit /etc/ssh/sshd_config
sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
systemctl restart sshd
# Set up UFW
ufw default deny incoming
ufw default allow outgoing
ufw allow 22/tcp # SSH
ufw allow 80/tcp # HTTP (needed for Let's Encrypt challenge)
ufw allow 443/tcp # HTTPS
ufw enable
# Update packages
apt update && apt upgrade -y
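One pitfall with the sed edits above: if your sshd_config spells a directive differently, the pattern silently matches nothing. Verify before you close your root session (and test an SSH login as the new user from a second terminal before relying on the change):

```shell
# Confirm the hardening edits actually landed; both lines should end in "no"
grep -E '^(PermitRootLogin|PasswordAuthentication)' /etc/ssh/sshd_config
```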
Notice I'm not opening port 11434 in UFW. Ollama will only be reachable through the Caddy reverse proxy on 443. That's the whole security model here.
Installing Ollama
Ollama ships a one-liner installer that works perfectly on Ubuntu 22.04 and 24.04:
# Install Ollama (run as your non-root sudo user)
curl -fsSL https://ollama.com/install.sh | sh
# Pin Ollama explicitly to localhost so it can never be exposed by accident
# Create a systemd override
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<EOF
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
sudo systemctl enable ollama
# Verify it's running and bound correctly
ss -tlnp | grep 11434
# Should show: 127.0.0.1:11434 — NOT 0.0.0.0:11434
# Pull your first model
ollama pull mistral:7b-instruct-q4_K_M
# Quick smoke test
ollama run mistral:7b-instruct-q4_K_M "Say hello in one sentence."
Pinning OLLAMA_HOST=127.0.0.1:11434 is the critical piece of the security model. Ollama's API has no authentication, so if the service is ever exposed on a public interface, anyone who discovers your server's IP can talk directly to your LLM, including submitting expensive requests or extracting system prompt context from whatever you've configured.
The install script registers Ollama as a systemd service automatically. Set the OLLAMA_HOST override before pulling models or doing anything else; I've seen people expose port 11434 directly, pull a big model, and then find their inference endpoint indexed by Shodan within hours.
Installing Caddy and Configuring the Reverse Proxy
I prefer Caddy over Nginx or Traefik for this use case because its automatic HTTPS is genuinely zero-config — you give it a domain name and it handles Let's Encrypt certificates, renewals, HTTP-to-HTTPS redirects, and sane TLS defaults without any extra tooling. For a single-service setup like this it's perfect.
# Install Caddy from the official repo
sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https curl
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' \
| sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' \
| sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update && sudo apt install caddy -y
# Generate a bcrypt password hash for basic auth
# Replace "yourpassword" with something strong
caddy hash-password --plaintext "yourpassword"
# Copy the $2a$... hash output — you'll paste it into the Caddyfile
# Edit the Caddyfile
sudo tee /etc/caddy/Caddyfile <<'EOF'
llm.yourdomain.com {
basicauth /* {
# Username: ollama
ollama $2a$14$REPLACE_WITH_YOUR_HASH_OUTPUT_HERE
}
reverse_proxy 127.0.0.1:11434 {
header_up Host {upstream_hostport}
}
log {
output file /var/log/caddy/ollama-access.log
format json
}
}
EOF
sudo mkdir -p /var/log/caddy
sudo chown caddy:caddy /var/log/caddy
sudo systemctl reload caddy
sudo systemctl enable caddy
Once Caddy is running, hit https://llm.yourdomain.com/api/tags from your browser or curl — you should get a JSON list of your installed models after entering your credentials. Caddy will have already fetched and installed the Let's Encrypt certificate; no certbot, no cron jobs, no manual renewal.
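From the command line, the same check looks like this (llm.yourdomain.com, the username ollama, and yourpassword are placeholders for whatever you configured above):

```shell
# Query the protected endpoint with HTTP basic auth
curl -s -u ollama:yourpassword https://llm.yourdomain.com/api/tags

# Or build the Authorization header yourself, the way an API client would
AUTH=$(printf 'ollama:yourpassword' | base64)
curl -s -H "Authorization: Basic $AUTH" \
  https://llm.yourdomain.com/api/chat \
  -d '{"model":"mistral:7b-instruct-q4_K_M","messages":[{"role":"user","content":"Say hello."}],"stream":false}'
```

The `-u` flag and the hand-built header are equivalent; curl just base64-encodes `user:pass` for you.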
To use the endpoint from an OpenAI-compatible client (such as openai-python), set the base URL to https://llm.yourdomain.com and add an Authorization: Basic <base64(user:pass)> header. Ollama serves its native chat API at /api/chat and an OpenAI-compatible chat completions endpoint at /v1/chat/completions.
Managing Models and Storage
Models live in /usr/share/ollama/.ollama/models by default when installed via the system service. A 7B Q4 model is around 4.1 GB; a 13B Q4 is roughly 7.4 GB. On a VPS you're paying for disk, so be selective. I keep two or three models loaded and use ollama rm to prune ones I haven't touched in a while:
# List downloaded models and their sizes
ollama list
# Pull additional models
ollama pull llama3.2:3b # ~2 GB, fast on CPU
ollama pull qwen2.5-coder:7b # good for code tasks
# Remove a model you're not using
ollama rm mistral:7b-instruct-q4_K_M
# Check disk usage
du -sh /usr/share/ollama/.ollama/models/
# Move model storage to a larger volume if needed
# (Stop ollama first, then add OLLAMA_MODELS to the override)
sudo systemctl stop ollama
sudo mkdir -p /mnt/data/ollama/models
sudo chown -R ollama:ollama /mnt/data/ollama   # the service runs as the ollama user
sudo tee -a /etc/systemd/system/ollama.service.d/override.conf <<EOF
Environment="OLLAMA_MODELS=/mnt/data/ollama/models"
EOF
sudo systemctl daemon-reload
sudo systemctl start ollama
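If you want a quick total without digging through du output, you can sum the SIZE column of ollama list. Note this sketch assumes the current NAME / ID / SIZE / MODIFIED column layout and "N.N GB" / "NNN MB" size formatting, which could change between Ollama releases:

```shell
# Sum the SIZE column of `ollama list` into a single GB figure
ollama list | awk 'NR > 1 {
  if ($4 == "GB") total += $3
  else if ($4 == "MB") total += $3 / 1024
} END { printf "total: %.1f GB\n", total }'
```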
Keeping Ollama Updated
Ollama releases frequently — model support, performance improvements, and new API features land every few weeks. Because it's installed via the shell script rather than apt, updates aren't automatic. I run this monthly:
# Re-running the install script upgrades Ollama in place
curl -fsSL https://ollama.com/install.sh | sh
# Verify the version
ollama --version
# Restart the service after upgrade
sudo systemctl restart ollama
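If you trust the upstream install script enough to run it unattended (a judgment call; the schedule below is just an example), you can automate the monthly update with a cron.d entry:

```shell
# Run the updater at 03:00 on the 1st of each month
# cron.d field order: minute hour day-of-month month day-of-week user command
echo '0 3 1 * * root curl -fsSL https://ollama.com/install.sh | sh && systemctl restart ollama' \
  | sudo tee /etc/cron.d/ollama-update
```

The trade-off: unattended upgrades occasionally break things, so if your endpoint matters, stick with the manual monthly run.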
Performance Expectations on CPU-Only VPS
Honest numbers from my DigitalOcean CPU-Optimized 8 vCPU / 16 GB droplet: Mistral 7B Q4_K_M generates roughly 8–12 tokens per second for single-user inference. That's about a sentence every two seconds — slow compared to GPU inference but completely fine for chat, code review, or summarization tasks where you're reading as it streams. The qwen2.5:1.5b model hits 35–45 tokens/second on the same hardware, which feels snappy.
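To put those numbers in context: a common heuristic (it varies by tokenizer and language) is about 0.75 English words per token, so you can convert a measured rate, such as the eval rate that ollama run --verbose reports, into a reading speed:

```shell
# Convert tokens/s into words/min (0.75 words per token is a rough heuristic)
awk 'BEGIN {
  tps = 10              # measured tokens per second
  printf "~%d words/min (typical reading speed is ~200-300)\n", tps * 0.75 * 60
}'
```

At 10 tokens/s the model produces around 450 words a minute, comfortably ahead of reading pace, which is why streamed CPU inference feels fine in practice.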
If you need faster responses, look at GPU Droplets or consider Hetzner's cloud GPU lineup. But for a private, always-on assistant that costs $50–80/month all-in, CPU inference is a solid trade-off.
Next Steps
With Ollama running and secured, the natural next move is adding Open WebUI on the same server so you have a ChatGPT-style interface — I've covered that setup in detail in the Ollama + Open WebUI Docker Compose tutorial. You might also want to add fail2ban rules targeting the Caddy access log to rate-limit brute-force attempts against your basic auth endpoint — check out the fail2ban and CrowdSec hardening guide for that.
If you don't yet have a VPS to run this on, create a DigitalOcean account and spin up a Droplet — the whole setup from a fresh Ubuntu 24.04 image takes about 20 minutes.