Setting Up Ollama on a VPS to Run Local LLMs in the Cloud

CompactHost · June 26, 2026

Running Ollama on a VPS gives you a persistent, always-on LLM endpoint you can reach from any device: no leaving your home PC on overnight, no worrying about your ISP blocking inbound connections. I've been running this setup on a Hetzner CAX21 ARM instance (4 vCPUs, 8 GB RAM) for several months, and it handles models like llama3.2:3b and phi4-mini without complaint. The trick is doing it securely: Ollama's API has no authentication of its own, so you must never expose port 11434 directly to the internet.

In this tutorial I'll walk you through installing Ollama on a fresh Ubuntu 24.04 VPS, locking it down with UFW, and putting Caddy in front of it with HTTPS and basic authentication so only you can reach it. No Docker required — though I'll mention the Docker path at the end for those who prefer it.

Choosing the Right VPS for LLM Inference

CPU-only inference is slower than GPU inference, but it's perfectly usable for 3B and 7B parameter models if you pick the right server. I strongly recommend at least 8 GB of RAM for a 7B model — the model weights alone eat about 4–5 GB in Q4 quantisation. My current go-to recommendations for budget LLM VPS setups:

Hetzner CAX21: 8 GB RAM, 4 ARM vCPUs, ~€8/month. Exceptional value. ARM inference with Ollama works great.
Hetzner CPX31: 8 GB RAM, 4 x86 vCPUs, ~€10/month if you need x86 compatibility.
RackNerd 8 GB KVM: cheaper upfront, but network performance can be inconsistent. Fine for personal use.
Contabo VPS M: 8 GB RAM, very cheap, but shared CPU contention can hurt inference speed noticeably.

If you want to run 13B models, bump to 16 GB RAM. For 70B models, you'll need a GPU instance, which is a whole different budget tier (Hetzner's GEX dedicated GPU servers or OVH GPU nodes). Stick to 3B–7B for a cost-effective always-on setup.
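
As a rough sanity check on the RAM figures above, the sketch below estimates weight size from parameter count and bits per weight, then compares it with a RAM budget. The 4.8 bits-per-weight figure for Q4 quantisation and the 2 GB headroom for the OS and context cache are my assumptions rather than published Ollama numbers, and both function names are hypothetical.

```shell
# Rule-of-thumb estimate (an assumption, not an official Ollama formula):
#   weights in GB ~= parameters (billions) x bits per weight / 8
estimate_weights_gb() {
    # $1 = parameters in billions, $2 = bits per weight
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Succeeds (exit 0) if the weights plus headroom fit in the given RAM.
fits_in_ram() {
    # $1 = weights in GB, $2 = RAM in GB, $3 = headroom in GB (assumed default: 2)
    awk -v w="$1" -v r="$2" -v h="${3:-2}" 'BEGIN { exit !(w + h <= r) }'
}

# Example: a 7B model at roughly 4.8 bits per weight on an 8 GB server
size=$(estimate_weights_gb 7 4.8)
if fits_in_ram "$size" 8; then
    echo "7B Q4 (~${size} GB) should fit in 8 GB"
else
    echo "7B Q4 (~${size} GB) is too large for 8 GB"
fi
```

The same check with 13 billion parameters fails on 8 GB and passes on 16 GB, which lines up with the sizing advice above. Treat the result as a first filter only; real memory use also depends on context length.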

Step 1: Initial Server Setup and UFW Firewall

Start from a clean Ubuntu 24.04 LTS install. The first thing I do on any new VPS is configure the firewall before installing anything that listens on a network port.

# Update the system
apt update && apt upgrade -y

# Install UFW if it's not already present
apt install -y ufw

# Deny everything inbound by default, allow outbound
ufw default deny incoming
ufw default allow outgoing

# Allow SSH (adjust if you've moved SSH to a non-standard port)
ufw allow 22/tcp comment 'SSH'

# Allow HTTP and HTTPS for Caddy
ufw allow 80/tcp comment 'HTTP'
ufw allow 443/tcp comment 'HTTPS'

# DO NOT open 11434: Ollama stays internal only

# Turn the firewall on (answer "y" at the prompt; the SSH rule above keeps you connected)
ufw enable

# Confirm the rules look right
ufw status verbose

Watch out: a stock Ollama install binds to 127.0.0.1:11434, but the moment OLLAMA_HOST is set to 0.0.0.0 (a common change when people want remote access) the API listens on every interface. If your VPS has no firewall at that point, the API is wide open: anyone can pull models, run inference, and exhaust your disk. Always configure UFW before starting Ollama.
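
To make the "never open 11434" rule checkable, here is a small sketch that reads the output of ufw status and flags an ALLOW rule for the Ollama port. The function names are hypothetical, and the pattern assumes the usual ufw status column layout (rule, action, source).

```shell
# Reads `ufw status` output on stdin; succeeds if the port has an ALLOW rule.
# Matches rules such as "11434", "11434/tcp" or "11434/tcp (v6)".
port_allowed() {
    # $1 = port number
    grep -Eq "^$1(/(tcp|udp))?[[:space:]].*ALLOW"
}

# Reads `ufw status` output on stdin and prints a verdict for the Ollama port.
audit_firewall() {
    if port_allowed 11434; then
        echo "WARNING: port 11434 is allowed through UFW"
        return 1
    fi
    echo "OK: port 11434 is not allowed through UFW"
}

# Usage (as root): ufw status | audit_firewall
```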

Step 2: Install Ollama

Ollama provides a one-liner installer that works on both x86_64 and ARM64. It installs a systemd service that starts automatically on boot.

# Install Ollama (official installer, works on ARM64 and x86_64)
curl -fsSL https://ollama.com/install.sh | sh

# Verify the service is running
systemctl status ollama

# Pull a model — llama3.2:3b is a good starting point for limited RAM
ollama pull llama3.2:3b

# Quick sanity check — run a prompt from the command line
ollama run llama3.2:3b "Summarise what a VPS is in one sentence."

The installer creates an ollama system user, drops the binary at /usr/local/bin/ollama, and writes a systemd unit at /etc/systemd/system/ollama.service. Models are stored in /usr/share/ollama/.ollama/models by default. On a VPS with limited disk, I always check available space with df -h before pulling larger models — a 7B Q4 model is around 4.5 GB.
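
The df -h check can be turned into a guard that runs before a pull. This is a minimal sketch: free_gb and has_room_for are hypothetical helpers, and the 2 GB safety margin is an assumption you may want to adjust.

```shell
# Prints free space, in whole GB, on the filesystem holding the given path.
free_gb() {
    # df -Pk gives POSIX-format output in 1 KiB blocks; column 4 is "Available"
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Succeeds if the path has room for a download of the given size plus a margin.
has_room_for() {
    # $1 = model size in GB, $2 = path to check, $3 = margin in GB (assumed default: 2)
    avail=$(free_gb "$2")
    awk -v need="$1" -v margin="${3:-2}" -v avail="$avail" \
        'BEGIN { exit !(need + margin <= avail) }'
}

# Example: check before pulling a ~4.5 GB model.
# Falls back to / if the Ollama directory does not exist yet.
models_dir=/usr/share/ollama
[ -d "$models_dir" ] || models_dir=/
if has_room_for 4.5 "$models_dir"; then
    echo "Enough space for a 4.5 GB model"
else
    echo "Not enough space: only $(free_gb "$models_dir") GB free"
fi
```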

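One important tweak: I pin the listen address explicitly. Ollama defaults to 127.0.0.1:11434, but an OLLAMA_HOST setting left behind by another guide or an earlier experiment would expose it on every interface, and I want UFW to be a belt-and-suspenders measure rather than the only line of defence. Add a systemd override: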

# Create a systemd override to bind Ollama to localhost only
mkdir -p /etc/systemd/system/ollama.service.d
cat > /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
EOF

# Reload and restart
systemctl daemon-reload
systemctl restart ollama

# Confirm it's only listening on loopback
ss -tlnp | grep 11434

You should see 127.0.0.1:11434 in the output, not 0.0.0.0:11434. That's the configuration I want.
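
If you'd rather not eyeball the ss output, the check can be scripted. The sketch below classifies how a port is bound based on ss -tln output; bind_scope is a hypothetical helper, and it assumes the local address sits in the fourth column, as it does in current iproute2 releases.

```shell
# Reads `ss -tln` output on stdin and classifies how a TCP port is bound.
# Prints "loopback", "exposed" or "not-listening".
bind_scope() {
    # $1 = port number
    awk -v port="$1" '
        {
            # Column 4 is Local Address:Port, e.g. 127.0.0.1:11434 or [::1]:11434
            n = split($4, parts, ":")
            if (n < 2 || parts[n] != port) next
            found = 1
            addr = substr($4, 1, length($4) - length(port) - 1)
            if (addr != "127.0.0.1" && addr != "[::1]") exposed = 1
        }
        END {
            if (!found) print "not-listening"
            else if (exposed) print "exposed"
            else print "loopback"
        }'
}

# Usage: ss -tln | bind_scope 11434
```

Anything other than "loopback" for port 11434 means the override did not take effect and is worth fixing before you continue.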

Step 3: Install Caddy and Configure HTTPS with Basic Auth

I prefer Caddy over Nginx for this use case because automatic HTTPS via Let's Encrypt is built in — zero certificate management. You point it at a domain and it handles the rest. Install Caddy from the official apt repository:

# Add the Caddy official repo
apt install -y debian-keyring debian-archive-keyring apt-transport-https curl
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' \
  | gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' \
  | tee /etc/apt/sources.list.d/caddy-stable.list
apt update && apt install -y caddy

# Generate a bcrypt hash of your password for basic auth
# Replace 'yourpassword' with something strong
caddy hash-password --plaintext 'yourpassword'
# Copy the $2a$... hash that appears — you'll paste it into the Caddyfile

Now write the Caddyfile. Replace llm.yourdomain.com with your actual domain (pointed at this VPS's IP in DNS), and paste in the bcrypt hash:

cat > /etc/caddy/Caddyfile << 'EOF'
llm.yourdomain.com {
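    # "basic_auth" is the current name; Caddy releases before 2.8 spell it "basicauth"
    basic_auth {
        # Format: username bcrypt-hash
        myuser $2a$14$REPLACE_THIS_WITH_YOUR_ACTUAL_HASH
    }

    reverse_proxy 127.0.0.1:11434 {
        # Ollama returns 403 for non-local Host headers when bound to loopback,
        # so present the upstream address instead of the public domain
        header_up Host localhost:11434
    }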

    # Tighten up headers
    header {
        X-Content-Type-Options nosniff
        X-Frame-Options DENY
        -Server
    }

    log {
        output file /var/log/caddy/ollama-access.log {
            roll_size 10mb
            roll_keep 5
        }
    }
}
EOF

# Create the log directory
mkdir -p /var/log/caddy
chown caddy:caddy /var/log/caddy

# Validate the config and reload
caddy validate --config /etc/caddy/Caddyfile
systemctl reload caddy

Caddy will automatically obtain and renew a Let's Encrypt certificate for your domain. Within a minute or two, https://llm.yourdomain.com should prompt for your username and password, then proxy through to Ollama.
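
It's worth confirming that a request without credentials is actually rejected. A minimal sketch of that check follows; the function names are hypothetical, and /api/tags (Ollama's model-listing route) is used here only because it is cheap to call.

```shell
# Maps the HTTP status of an unauthenticated request to a verdict.
anon_verdict() {
    case "$1" in
        401) echo "protected" ;;         # Caddy asked for credentials, as intended
        2??) echo "OPEN" ;;              # the API answered without credentials
        *)   echo "unexpected ($1)" ;;   # DNS, TLS or proxy problem worth investigating
    esac
}

# Sends a request WITHOUT credentials and prints the verdict.
check_anon_access() {
    # $1 = URL to probe; curl prints 000 if the connection itself fails
    code=$(curl -s -o /dev/null -w '%{http_code}' "$1")
    anon_verdict "$code"
}

# Usage (from your local machine):
#   check_anon_access https://llm.yourdomain.com/api/tags
```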

Tip: If you want to use this endpoint as an OpenAI-compatible API (for tools like Continue, Open WebUI, or LiteLLM), Ollama serves OpenAI-style routes under /v1, for example /v1/chat/completions. Set your tool's base URL to https://llm.yourdomain.com/v1 and the model name to whatever you've pulled locally, for example llama3.2:3b. Most OpenAI clients send an API key as a Bearer token rather than basic auth credentials, so check that your tool lets you supply a username and password or a custom Authorization header.
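
For tools or scripts that accept custom HTTP headers rather than a username and password, the basic auth credentials can be sent as a precomputed Authorization header. A minimal sketch, assuming the base64 utility from coreutils is available; basic_auth_header is a hypothetical helper name.

```shell
# Prints the HTTP header that carries basic auth credentials.
basic_auth_header() {
    # $1 = username, $2 = password
    # tr strips the trailing newline (and any line wrapping) that base64 adds
    printf 'Authorization: Basic %s\n' \
        "$(printf '%s:%s' "$1" "$2" | base64 | tr -d '\n')"
}

# Usage with curl against the OpenAI-compatible model listing:
#   curl -H "$(basic_auth_header myuser yourpassword)" \
#     https://llm.yourdomain.com/v1/models
```

The encoded value is only base64, not encryption, so keep it out of shared configs just as you would the password itself.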

Step 4: Pull Models and Test the API

With everything running, SSH into the server and pull the models you want to use. Here's a quick reference for models that work well on 8 GB RAM:

# Good models for 8 GB RAM VPS (Q4 quantisation)
ollama pull llama3.2:3b          # ~2 GB, fast, great for general chat
ollama pull phi4-mini             # ~2.5 GB, strong reasoning for its size
ollama pull qwen2.5-coder:7b     # ~4.5 GB, excellent for code completion
ollama pull mistral:7b            # ~4.1 GB, well-rounded general model

# List what you've pulled
ollama list

# Test the API directly from the server
curl http://127.0.0.1:11434/api/generate \
  -d '{"model":"llama3.2:3b","prompt":"Hello, what can you do?","stream":false}'

# Test through Caddy with auth (from your local machine)
curl -u myuser:yourpassword \
  https://llm.yourdomain.com/api/generate \
  -d '{"model":"llama3.2:3b","prompt":"What is self-hosting?","stream":false}'
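
Hand-written JSON like the payloads above breaks as soon as a prompt contains a double quote. The sketch below builds the /api/generate body with basic escaping. The helper names are hypothetical, and the escaping covers only backslashes and double quotes, so prompts with newlines or other control characters would need a proper JSON tool such as jq.

```shell
# Escapes backslashes and double quotes so a string can sit inside JSON quotes.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Prints a non-streaming request body for Ollama's /api/generate route.
generate_payload() {
    # $1 = model name, $2 = prompt
    printf '{"model":"%s","prompt":"%s","stream":false}\n' \
        "$(json_escape "$1")" "$(json_escape "$2")"
}

# Usage:
#   curl -u myuser:yourpassword https://llm.yourdomain.com/api/generate \
#     -d "$(generate_payload llama3.2:3b 'Explain "self-hosting" briefly.')"
```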

Optional: Add Open WebUI for a Chat Interface

If you want a browser-based chat UI rather than raw API access, Open WebUI pairs perfectly with this setup. I run it as a Docker container on the same VPS, pointing at the local Ollama instance:

# Install Docker if you haven't already
curl -fsSL https://get.docker.com | sh

# Run Open WebUI on the host network so it can reach Ollama on 127.0.0.1
# (a bridged container can't connect to a service bound to the host's loopback)
docker run -d \
  --name open-webui \
  --restart always \
  --network=host \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  ghcr.io/open-webui/open-webui:main

Then add a second block to your Caddyfile to proxy chat.yourdomain.com to 127.0.0.1:8080, the port Open WebUI listens on with host networking. UFW's default deny rule keeps 8080 closed to the outside, so the only way in is through Caddy. Open WebUI has its own user authentication system, so you can leave the basic auth directive out of that block if you prefer, since it handles login internally.
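
The second Caddyfile block described above could be generated along these lines. site_block is a hypothetical helper; pass it the local address Open WebUI actually listens on in your setup.

```shell
# Prints a Caddyfile site block that proxies a domain to a local upstream.
site_block() {
    # $1 = public domain, $2 = upstream address (host:port)
    printf '%s {\n    reverse_proxy %s\n}\n' "$1" "$2"
}

# Usage (as root), replacing PORT with the Open WebUI port:
#   site_block chat.yourdomain.com 127.0.0.1:PORT >> /etc/caddy/Caddyfile
#   caddy validate --config /etc/caddy/Caddyfile && systemctl reload caddy
```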

Keeping Models Updated and Disk Under Control

One thing that catches people out: Ollama doesn't auto-update models, and they consume significant disk. I run a quick audit monthly:

# See what's taking space
ollama list
du -sh /usr/share/ollama/.ollama/models/

# Remove a model you're no longer using
ollama rm mistral:7b

# Pull an updated version of a model
ollama pull llama3.2:3b

On a 40 GB VPS root disk, I typically keep two or three models maximum. If I want to experiment with something larger, I remove one first.
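
For the monthly audit, a one-number summary is handy. This sketch totals the sizes reported by ollama list. It assumes the column layout NAME, ID, SIZE, UNIT, MODIFIED (for example "llama3.2:3b a80c4f17acd5 2.0 GB 3 weeks ago"), which could change between Ollama releases, and total_models_gb is a hypothetical helper.

```shell
# Reads `ollama list` output on stdin and prints the total model size in GB.
total_models_gb() {
    awk 'NR > 1 {
             # Column 3 is the number, column 4 its unit
             if ($4 == "GB") total += $3
             else if ($4 == "MB") total += $3 / 1000
         }
         END { printf "%.1f\n", total }'
}

# Usage: ollama list | total_models_gb
```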

Wrapping Up

At this point you have Ollama running on a VPS, locked to localhost, fronted by Caddy with HTTPS and basic authentication, with models ready to serve. The total monthly cost for a usable always-on LLM endpoint is under €10 on Hetzner — far less than API fees if you're doing any meaningful volume of inference. The next steps I'd recommend: set up fail2ban to protect your SSH and Caddy endpoints, and if you want a proper chat UI, deploy Open WebUI with the Docker snippet above and add it behind Caddy with its own subdomain. Once you're comfortable with the setup, look at fine-tuning models for your specific workflows — that's where self-hosted LLMs really start to outshine the pay-per-token alternatives.
