Setting Up Ollama on a VPS to Run Local LLMs in the Cloud

Running Ollama on a VPS gives you a persistent, always-on LLM endpoint you can reach from any device — your laptop, phone, or a home server — without needing beefy local hardware running 24/7. I set this up on a Hetzner CCX23 (4 vCPU, 16 GB RAM) and it comfortably handles Llama 3.1 8B for personal use. This tutorial walks through the full setup: installing Ollama, locking it down with UFW, and putting Caddy in front of it with HTTPS and basic authentication so you're not leaving an open AI endpoint on the internet.

Choosing the Right VPS

CPU-only inference is totally viable for 7B and 8B parameter models if you have enough RAM. I recommend at least 16 GB RAM for comfortable headroom. For context, Llama 3.1 8B in Q4 quantization needs roughly 5–6 GB of RAM, so a 16 GB node leaves room for the OS, Caddy, and other services. If you want to run 30B+ models or need faster response times, look for a VPS with a dedicated GPU — Hetzner's GPU cloud (GX2 or above) or OVH's T1 GPU instances are worth checking. For CPU-only work, Hetzner's Arm64 CAX series is surprisingly good value in 2026.
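If you want to sanity-check that RAM figure for other model sizes, here is a rough back-of-the-envelope sketch. It assumes about 4.7 effective bits per weight for a Q4 quantization and about 1 GB of runtime overhead for context and buffers. Both numbers are my assumptions for illustration, not values published by Ollama, and estimate_ram_gb is just a helper name made up for this example.

```shell
# Rough RAM estimate for a quantized model (a sketch, not a guarantee).
# Args: parameters in billions, effective bits per weight, overhead in GB.
# The 4.7 bits/weight and 1 GB overhead used below are assumptions.
estimate_ram_gb() {
  awk -v p="$1" -v b="$2" -v o="$3" 'BEGIN { printf "%.1f\n", p * b / 8 + o }'
}

estimate_ram_gb 8 4.7 1    # an 8B model at Q4: prints 5.7
```

The result lands inside the 5–6 GB range quoted above. Longer context windows push the overhead term up, so treat the output as a floor rather than a ceiling.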

Tip: Hetzner's CCX (dedicated vCPU) series is noticeably faster than shared CX instances for LLM inference. The price difference is small for the smallest tiers and the improvement in token generation speed is real.

Installing Ollama on Ubuntu 24.04

SSH into your VPS and run the official install script. I prefer to audit the script before piping to shell, but the one-liner is fine for a fresh server:

# Update the system first
sudo apt update && sudo apt upgrade -y

# Install Ollama using the official installer
curl -fsSL https://ollama.com/install.sh | sh

# Verify the service is running
sudo systemctl status ollama

# Pull your first model — Llama 3.1 8B is a great starting point
ollama pull llama3.1:8b

# Test it works from the command line
ollama run llama3.1:8b "What is the capital of France?"

By default, Ollama listens on 127.0.0.1:11434. This is intentional — it should not be exposed directly to the internet. We'll keep it on localhost and route traffic through Caddy, which handles TLS termination and authentication for us.
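A quick way to confirm the API itself answers on localhost is to query it with curl from the VPS. This is a minimal sketch: /api/version and /api/tags are standard Ollama endpoints, while ollama_api is only a helper name for this example.

```shell
# Query a local Ollama endpoint; exits non-zero if the server is unreachable
# or returns an HTTP error.
# Usage: ollama_api <path> [base_url]
ollama_api() {
  curl -fsS --max-time 5 "${2:-http://127.0.0.1:11434}$1"
}
```

Run ollama_api /api/version to see the server version as JSON, or ollama_api /api/tags to list the models you have pulled.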

If you want Ollama to start automatically on reboot (the installer usually handles this, but let's be sure):

sudo systemctl enable ollama
sudo systemctl start ollama

# Check that it's actually listening on localhost only
ss -tlnp | grep 11434
# Expect the local address column to show 127.0.0.1:11434, not 0.0.0.0:11434 or *:11434

Watch out: Never set OLLAMA_HOST=0.0.0.0 in your environment without a firewall and authentication layer in front of it. Ollama's API has no built-in auth, and an exposed port 11434 will be found and abused quickly. Keep it bound to localhost at all times.

Firewall Setup with UFW

Before going further, harden the firewall. We only want SSH (22), HTTP (80), and HTTPS (443) exposed publicly. Port 11434 stays closed externally.

# Allow SSH (make sure this runs before enabling UFW!)
sudo ufw allow 22/tcp

# Allow web traffic for Caddy
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp

# Enable the firewall
sudo ufw enable

# Confirm the rules
sudo ufw status verbose

At this point port 11434 has no allow rule, so UFW's default deny-incoming policy (shown as "Default: deny (incoming)" in the verbose status) blocks it even if something later changes Ollama's bind address. Defence in depth.
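To double-check that nothing has opened the Ollama port, you can scan the rule list for it. This is a small sketch: ufw_allows_port is a hypothetical helper that reads ufw status output on stdin, and it assumes the usual layout where each rule line starts with the port.

```shell
# Reads `ufw status` output on stdin.
# Succeeds if the given port has an ALLOW rule, fails otherwise.
ufw_allows_port() {
  grep -Eq "^$1(/tcp|/udp)?( \(v6\))?[[:space:]]+ALLOW"
}
```

Use it as sudo ufw status | ufw_allows_port 11434 && echo "11434 is OPEN" || echo "11434 has no allow rule". For an outside view, a curl --max-time 5 against port 11434 on the VPS IP from your laptop should time out rather than connect.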

Installing Caddy and Protecting the Endpoint

I prefer Caddy over Nginx for this use case because automatic HTTPS is built in — no certbot, no cron jobs, no manual renewal. Caddy handles Let's Encrypt certificates completely on its own. Install it from the official Caddy repository:

# Add the Caddy repository
sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https curl
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' \
  | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' \
  | sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update
sudo apt install caddy

# Generate a bcrypt-hashed password for basic auth
# Replace 'yourpassword' with something strong
caddy hash-password --plaintext 'yourpassword'
# Copy the $2a$... hash that's printed — you'll use it in the Caddyfile

Now edit /etc/caddy/Caddyfile. Replace ollama.yourdomain.com with your actual subdomain (you'll need an A record pointing to your VPS IP):

sudo nano /etc/caddy/Caddyfile

ollama.yourdomain.com {
    # basic_auth is the current directive name; Caddy releases before 2.8 call it basicauth
    basic_auth {
        # username is 'ollama', replace the hash with your own
        ollama $2a$14$YOURHASHHERE
    }

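    # Generation endpoints get longer timeouts for slow LLM responses
    @streaming path /api/generate /api/chat
    handle @streaming {
        reverse_proxy localhost:11434 {
            # While bound to loopback, Ollama answers 403 to Host headers it
            # does not recognize, so present a local one (as Ollama's FAQ suggests)
            header_up Host localhost:11434
            transport http {
                read_timeout 300s
                write_timeout 300s
            }
        }
    }

    # Everything else uses the default proxy settings
    handle {
        reverse_proxy localhost:11434 {
            header_up Host localhost:11434
        }
    }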

    log {
        output file /var/log/caddy/ollama-access.log
    }
}

# Check for config errors first
caddy validate --config /etc/caddy/Caddyfile

# Then reload Caddy to apply the config
sudo systemctl reload caddy

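Once the A record resolves to your VPS, Caddy requests a TLS certificate from Let's Encrypt as soon as the config is loaded. The endpoint is typically live at https://ollama.yourdomain.com within about 30 seconds, with a valid cert and a basic auth gate. If the certificate request fails, check DNS first.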
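A simple way to confirm the auth gate is doing its job is to compare HTTP status codes with and without credentials. This is a minimal sketch; http_status is a helper name invented for this example, and any extra arguments are passed straight to curl.

```shell
# Print only the HTTP status code for a URL ("000" means no response at all).
# Usage: http_status <url> [extra curl options...]
http_status() {
  url="$1"
  shift
  curl -s -o /dev/null --max-time 10 -w '%{http_code}\n' "$@" "$url"
}
```

Running http_status https://ollama.yourdomain.com/api/tags should print 401, and the same call with -u ollama:yourpassword appended should print 200 once the proxy is forwarding correctly. A 403 with valid credentials usually points at the Host header Ollama sees rather than at Caddy's auth.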

Pulling Models and Managing Storage

Ollama stores models under /usr/share/ollama/.ollama/models by default. On a VPS with a small root disk this can fill up fast — Llama 3.1 8B Q4 is about 4.7 GB, and Mistral 7B is similar. I usually either use a VPS with a large volume mounted at that path, or change the models directory via the environment variable OLLAMA_MODELS.

To change the storage location, edit the Ollama systemd service:

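sudo systemctl edit ollama

# Add this block in the editor that opens:
[Service]
Environment="OLLAMA_MODELS=/mnt/data/ollama/models"

# The ollama service user must own the new directory, or pulls will fail
sudo mkdir -p /mnt/data/ollama/models
sudo chown -R ollama:ollama /mnt/data/ollama

sudo systemctl daemon-reload
sudo systemctl restart ollama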

Some models I find genuinely useful on a CPU-only VPS: llama3.1:8b for general tasks, mistral:7b for instruction following, nomic-embed-text for generating embeddings (tiny and fast), and qwen2.5-coder:7b for code assistance. All pull with ollama pull <modelname>.
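Since a full disk mid-download is annoying to clean up, I find it useful to check free space before pulling. The sketch below is one way to do that; has_free_gb is a hypothetical helper, and the 6 GB threshold in the usage line is an assumption you should adjust per model.

```shell
# Succeeds if the filesystem holding DIR has at least NEED_GB gigabytes free.
# Usage: has_free_gb <dir> <need_gb>
has_free_gb() {
  dir="$1"
  need_gb="$2"
  # df -Pk prints POSIX-format output in 1 KB blocks; column 4 is "Available"
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  awk -v a="$avail_kb" -v n="$need_gb" 'BEGIN { exit !(a / 1048576 >= n) }'
}
```

For example: has_free_gb /usr/share/ollama 6 && ollama pull llama3.1:8b. If you moved the models directory, pass that mount point instead.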

Testing the API Remotely

Once Caddy is serving the endpoint, you can hit the Ollama API from anywhere using curl with basic auth credentials:

# Request a single, non-streamed completion from your laptop or another server
curl -u ollama:yourpassword \
  -X POST https://ollama.yourdomain.com/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "prompt": "Explain ZFS in one paragraph.",
    "stream": false
  }'
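The reply is a JSON object with the generated text in its response field, alongside timing metadata. If jq is installed on the client (an assumption; it is not part of the setup above), a small wrapper keeps the quoting manageable and prints only the text. OLLAMA_URL, OLLAMA_USER, OLLAMA_PASS, build_body, and ask are names picked for this sketch.

```shell
# Build a non-streaming /api/generate request body; jq handles the JSON escaping.
# Usage: build_body <model> <prompt>
build_body() {
  jq -cn --arg model "$1" --arg prompt "$2" \
    '{model: $model, prompt: $prompt, stream: false}'
}

# Send a prompt and print only the generated text.
# Expects OLLAMA_URL, OLLAMA_USER and OLLAMA_PASS in the environment.
ask() {
  build_body "$1" "$2" |
    curl -fsS -u "$OLLAMA_USER:$OLLAMA_PASS" \
      -H 'Content-Type: application/json' \
      -d @- "$OLLAMA_URL/api/generate" |
    jq -r '.response'
}
```

With OLLAMA_URL=https://ollama.yourdomain.com and the credentials exported, ask llama3.1:8b "Explain ZFS in one paragraph." prints just the answer. Keeping the password in an environment variable also keeps it out of your shell history.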

You can also point Open WebUI at this URL by setting the Ollama base URL to https://ollama:yourpassword@ollama.yourdomain.com in its environment config, though I prefer keeping Open WebUI on the same VPS to avoid sending credentials in the URL. That's a separate tutorial, but the short version is: run Open WebUI as a Docker container on the same host with host networking (inside a bridged container, localhost refers to the container itself, not the VPS), set OLLAMA_BASE_URL=http://localhost:11434, and proxy that with Caddy on a different subdomain.

Keeping Ollama Updated

Ollama doesn't update itself automatically. When a new version drops (they ship frequently), re-run the install script — it handles upgrades cleanly:

curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl restart ollama
ollama --version

Tip: Set a monthly calendar reminder to update Ollama and re-pull any models you use regularly. Model quantizations improve over time and newer Ollama versions often include meaningful performance improvements for CPU inference.

Conclusion

At this point you have a properly secured Ollama instance running on your VPS: bound to localhost, firewalled with UFW, reverse-proxied through Caddy with automatic HTTPS and basic authentication. It's a clean, minimal setup that I've run in production for several months without issues. The natural next step is adding Open WebUI as a chat front-end on the same server — it transforms the raw API into a proper ChatGPT-like interface. After that, consider wiring this Ollama endpoint into n8n or other automation tools so your self-hosted workflows can call a local LLM without touching any external API.
