Setting Up Ollama on a VPS: Running Local LLMs in the Cloud

Running Ollama on a VPS gives you the best of both worlds: the privacy and control of a self-hosted LLM with the uptime and accessibility of a cloud server. I've been doing this on a Hetzner CPX31 for several months, and it's become one of the most useful things in my infrastructure. Whether you want a personal AI API endpoint, a shared assistant for a small team, or just want to stop paying OpenAI, this guide walks you through the whole setup from a bare Ubuntu 24.04 VPS to a Caddy-proxied, auth-protected Ollama instance serving models like Llama 3.1 and Mistral.

Choosing the Right VPS for Ollama

CPU-only LLM inference is slower than GPU inference, but it's perfectly usable for personal or low-traffic workloads. The main constraint is RAM — models need enough memory to load fully, or they'll swap to disk and crawl. Here's what I've found works:

I use Hetzner for Ollama specifically because their CPX line has fast NVMe storage, which matters when Ollama loads model weights from disk on first inference. If you're budget-constrained, RackNerd's annual deals often land an 8 GB RAM KVM VPS for under $40/year, which is enough for 7B models.

Tip: Ollama supports GGUF quantized models. The Q4_K_M quantization of a 7B model typically needs about 4.5 GB of RAM and delivers excellent quality. Use ollama pull llama3.1:8b-instruct-q4_K_M to grab a lean version if RAM is tight.
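As a rough way to check the RAM constraint before pulling a model, the sketch below compares a model's expected footprint with the memory the kernel reports as available. The function names mem_available_mb and model_fits are hypothetical helpers, not part of Ollama, and the 4600 MB figure in the usage note is only the roughly 4.5 GB estimate from the tip above.

```shell
# Hypothetical helpers (not part of Ollama).

# Read /proc/meminfo-style text on stdin and print MemAvailable in whole MB.
mem_available_mb() {
  awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }'
}

# Print "ok" if needed_mb fits within available_mb, else "too small".
# $1 = needed MB, $2 = available MB.
model_fits() {
  needed_mb=$1
  available_mb=$2
  if [ "$available_mb" -ge "$needed_mb" ]; then
    echo "ok"
  else
    echo "too small"
  fi
}
```

On the VPS this could be called as model_fits 4600 "$(mem_available_mb < /proc/meminfo)". Leave some headroom beyond the model size, since the OS and any other services need memory too.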

Installing Ollama on Ubuntu 24.04

SSH into your VPS and run the official installer. I always review install scripts before piping to bash, but the Ollama one is clean:

# Update system first
sudo apt update && sudo apt upgrade -y

# Install Ollama (official installer)
curl -fsSL https://ollama.com/install.sh | sh

# Verify the service is running
sudo systemctl status ollama

# Pull your first model (Mistral 7B is a great starting point)
ollama pull mistral

# Test it works
ollama run mistral "Explain what a VPS is in one sentence."

The installer sets up a systemd service that runs Ollama on 127.0.0.1:11434 by default. That binding to localhost is intentional — you do not want port 11434 open to the internet without authentication. We'll fix the exposure properly with Caddy in the next section.
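To confirm the localhost binding described above, one option is to inspect the listening sockets. This is a minimal sketch: check_loopback_only is a hypothetical helper that reads the output of ss -ltn on stdin and assumes the usual column layout, where the fourth field is the local address and port.

```shell
# Hypothetical helper: read `ss -ltn` output on stdin and report whether the
# given port ($1) is bound only to loopback addresses.
# Prints "loopback only", "exposed: <addr>", or "not listening".
check_loopback_only() {
  awk -v port="$1" '
    $1 == "LISTEN" && $4 ~ (":" port "$") {
      seen = 1
      # Anything other than 127.x.x.x or [::1] is reachable from outside.
      if ($4 !~ /^127\./ && $4 !~ /^\[::1\]/) {
        print "exposed: " $4
        bad = 1
      }
    }
    END {
      if (!seen) { print "not listening"; exit 1 }
      if (bad) exit 1
      print "loopback only"
    }'
}
```

Run it as ss -ltn | check_loopback_only 11434. A non-zero exit status means the port is either not open or bound to a non-loopback address.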

After the pull completes, check that the API responds locally:

# Quick API test (runs on the VPS itself)
curl http://localhost:11434/api/generate \
  -d '{
    "model": "mistral",
    "prompt": "What is 2 + 2?",
    "stream": false
  }'

You should get a JSON response with a "response" field. If that works, the hard part is done — the rest is just proxying it securely.
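If you script the setup, the API may not answer immediately after the service starts. The sketch below polls until it does. wait_for_ollama is a hypothetical helper name, and the defaults (30 attempts, 2 seconds apart) are arbitrary assumptions; /api/tags is Ollama's endpoint for listing installed models.

```shell
# Hypothetical helper: poll the Ollama API until it answers or we give up.
# $1 = base URL, $2 = max attempts, $3 = seconds between attempts.
wait_for_ollama() {
  base_url=${1:-http://localhost:11434}
  max_tries=${2:-30}
  delay=${3:-2}
  attempt=1
  while [ "$attempt" -le "$max_tries" ]; do
    # -f makes curl return non-zero on HTTP errors as well as connection failures.
    if curl -fsS "$base_url/api/tags" >/dev/null 2>&1; then
      echo "ready after $attempt attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "not ready after $max_tries attempts" >&2
  return 1
}
```

Calling wait_for_ollama with no arguments checks the default local address, so it can gate later steps such as ollama pull in a provisioning script.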

Installing Caddy and Proxying Ollama

I prefer Caddy over Nginx for setups like this because automatic HTTPS is zero-config and the Caddyfile syntax is dramatically cleaner. Install it from the official apt repository:

# Add Caddy's official apt repo
sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https curl
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' \
  | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg

curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' \
  | sudo tee /etc/apt/sources.list.d/caddy-stable.list

sudo apt update && sudo apt install caddy -y

# Confirm Caddy is running
sudo systemctl status caddy

Now configure the Caddyfile. I'm using HTTP basic auth here to keep it simple — if you want SSO, look at my Authentik tutorial instead. Replace ai.yourdomain.com with your actual subdomain and make sure it has an A record pointing to your VPS IP.

# Generate a hashed password for basic auth
# Replace 'yourpassword' with something strong
caddy hash-password --plaintext 'yourpassword'
# Copy the $2a$... output for use in the Caddyfile

# Edit the Caddyfile
sudo nano /etc/caddy/Caddyfile

# /etc/caddy/Caddyfile
ai.yourdomain.com {
    # Basic auth: paste the hash from caddy hash-password
    # (Caddy 2.8+ names this directive basic_auth; older releases use basicauth)
    basic_auth {
        youruser $2a$14$PASTE_YOUR_HASH_HERE
    }

    # Proxy to Ollama's local port
    reverse_proxy localhost:11434 {
        # Increase timeouts for long inference requests
        transport http {
            response_header_timeout 300s
            read_timeout 300s
        }
    }

    # Optional: restrict to your Tailscale IP or specific CIDR
    # @blocked not remote_ip 100.64.0.0/10
    # respond @blocked 403
}

# Check for config errors first
caddy validate --config /etc/caddy/Caddyfile

# Then reload Caddy to apply the config
sudo systemctl reload caddy

Watch out: The 5-minute timeouts in the reverse proxy config are not optional; they're essential. Ollama on a CPU VPS can take 60–180 seconds to generate a long response, and Caddy's default timeout will cut the connection before it finishes. I learned this the hard way with mysterious empty responses.
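To see how response length relates to that 300-second limit, here is a small worked calculation. The helper names are hypothetical. The only inputs are the response length in tokens and the generation speed, for which this guide reports roughly 8–12 tokens per second on a CPX31.

```shell
# Hypothetical helper: seconds needed to generate a response.
# $1 = response length in tokens, $2 = generation speed in tokens per second.
generation_seconds() {
  awk -v tokens="$1" -v tps="$2" 'BEGIN { printf "%.0f\n", tokens / tps }'
}

# Print "ok" if the response finishes within the timeout, else "would time out".
# $1 = tokens, $2 = tokens per second, $3 = proxy timeout in seconds.
fits_timeout() {
  needed=$(generation_seconds "$1" "$2")
  if [ "$needed" -le "$3" ]; then
    echo "ok"
  else
    echo "would time out"
  fi
}
```

For example, fits_timeout 1200 8 300 reports ok (about 150 seconds), while fits_timeout 3000 8 300 reports would time out (about 375 seconds). Prompt processing time comes on top of this, so treat the result as a lower bound.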

Locking Down the Firewall

By default you want only ports 22 (SSH), 80 (HTTP for ACME), and 443 (HTTPS) open. Port 11434 should never be exposed directly:

sudo ufw allow OpenSSH
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
# Never: sudo ufw allow 11434  ← don't do this
sudo ufw enable
sudo ufw status

Caddy handles the Let's Encrypt certificate automatically on first request to your domain, so as long as your DNS is pointed correctly you'll have HTTPS with no extra steps.
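With DNS, Caddy, and the firewall in place, a request from another machine could look like the sketch below. ask_ollama and json_escape are hypothetical helper names, and OLLAMA_URL, OLLAMA_USER, and OLLAMA_PASS are environment variables assumed here for illustration; set them to your own subdomain and basic auth credentials.

```shell
# Hypothetical helper: escape backslashes and double quotes for a JSON string.
# Limitation of this sketch: newlines and control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Send a non-streaming prompt through the Caddy proxy.
# $1 = model, $2 = prompt.
# Reads OLLAMA_URL, OLLAMA_USER and OLLAMA_PASS from the environment.
ask_ollama() {
  body=$(printf '{"model": "%s", "prompt": "%s", "stream": false}' \
    "$(json_escape "$1")" "$(json_escape "$2")")
  # --max-time mirrors the 300-second proxy timeout on the client side.
  curl -fsS --max-time 300 \
    -u "$OLLAMA_USER:$OLLAMA_PASS" \
    -H 'Content-Type: application/json' \
    -d "$body" \
    "$OLLAMA_URL/api/generate"
}
```

After exporting OLLAMA_URL=https://ai.yourdomain.com along with the username and password, ask_ollama mistral "What is 2 + 2?" should return the same JSON shape as the local test.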

Pulling and Managing Models

Once the proxy is up, you can interact with Ollama from any machine using standard HTTP with your credentials. But on the VPS itself, model management is straightforward:

# See what you have installed
ollama list

# Pull additional models
ollama pull llama3.1:8b          # Meta's Llama 3.1 8B
ollama pull qwen2.5:7b           # Alibaba's Qwen 2.5 — surprisingly good
ollama pull nomic-embed-text     # Embedding model for RAG pipelines
ollama pull codellama:7b         # Code-focused model

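# Check how much disk space models are using
# (system installer path; use ~/.ollama/models/ if Ollama runs as your own user)
sudo du -sh /usr/share/ollama/.ollama/models/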

# Remove a model you no longer need
ollama rm mistral

Model files live in /usr/share/ollama/.ollama/models/ when installed via the system installer (as opposed to the user home directory when running as your own user). Check with sudo systemctl cat ollama to see which user it's running as — this tells you where the model files actually live.
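One way to work out the model directory from the service user is sketched below. models_dir_from_passwd is a hypothetical helper that takes a passwd entry on stdin and appends the default .ollama/models suffix to the home directory. It assumes the OLLAMA_MODELS environment variable has not been set to move the models elsewhere.

```shell
# Hypothetical helper: given one passwd(5) line on stdin, print where that
# user's Ollama models live by default (<home>/.ollama/models).
# Field 6 of a passwd entry is the home directory.
models_dir_from_passwd() {
  awk -F: 'NR == 1 { print $6 "/.ollama/models" }'
}
```

On the VPS, svc_user=$(systemctl show -p User --value ollama) followed by getent passwd "$svc_user" | models_dir_from_passwd prints the directory to pass to du.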

Connecting Open WebUI for a Chat Interface

If you want a ChatGPT-style web interface rather than just a raw API, Open WebUI pairs perfectly with Ollama. I run it as a Docker container alongside Ollama on the same VPS:

# Install Docker if you haven't already
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER

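# Run Open WebUI connected to the local Ollama instance.
# Ollama only listens on 127.0.0.1, which a bridged container cannot reach,
# so use host networking and have Open WebUI listen on port 3000.
# ufw keeps port 3000 closed to the outside.
docker run -d \
  --name open-webui \
  --restart unless-stopped \
  --network=host \
  -e PORT=3000 \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main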

Then add a second block to your Caddyfile for the WebUI on a different subdomain, like chat.yourdomain.com, proxying to localhost:3000. Open WebUI has its own user authentication built in, so you don't necessarily need the basic auth layer there — though I still put it behind Tailscale for good measure.
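The second site block described above can be very short. The sketch below prints one from a shell function so it can be reviewed before it is appended. caddy_site_block is a hypothetical helper, and the generated block contains only a reverse_proxy directive, with no basic auth or IP restriction.

```shell
# Hypothetical helper: print a minimal Caddy site block that proxies a domain
# to a local port. $1 = domain, $2 = local port.
caddy_site_block() {
  printf '%s {\n    reverse_proxy localhost:%s\n}\n' "$1" "$2"
}
```

For example, caddy_site_block chat.yourdomain.com 3000 | sudo tee -a /etc/caddy/Caddyfile appends the block. Run caddy validate and reload Caddy afterwards.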

Performance Expectations and Tips

On a Hetzner CPX31 (4 AMD vCPU cores), Mistral 7B Q4 generates roughly 8–12 tokens per second. That's slow compared to a GPU but totally usable for non-interactive tasks like summarization, classification, or generating draft text. For interactive chat it's a little sluggish but workable. A few things that help:

Tip: You can set a system-level environment variable to make Ollama keep models loaded in memory between requests. Add a [Service] section containing Environment="OLLAMA_KEEP_ALIVE=30m" to /etc/systemd/system/ollama.service.d/override.conf (create the directory if needed), then reload with sudo systemctl daemon-reload && sudo systemctl restart ollama. This eliminates the cold-start load time for your most-used model as long as requests arrive within that 30-minute window.
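A systemd drop-in for this needs the variable wrapped in an Environment= line under a [Service] header. The sketch below prints such a file. keep_alive_dropin is a hypothetical helper name, and the duration is whatever you pass in.

```shell
# Hypothetical helper: print a systemd drop-in that sets OLLAMA_KEEP_ALIVE.
# $1 = keep-alive duration, e.g. 30m.
keep_alive_dropin() {
  printf '[Service]\nEnvironment="OLLAMA_KEEP_ALIVE=%s"\n' "$1"
}
```

To install it, create the directory with sudo mkdir -p /etc/systemd/system/ollama.service.d, then run keep_alive_dropin 30m | sudo tee /etc/systemd/system/ollama.service.d/override.conf, followed by the daemon-reload and restart.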

Conclusion

You now have a private, HTTPS-secured Ollama instance running on a VPS, accessible from anywhere without exposing port 11434 to the open internet. The total monthly cost on Hetzner is around €8–12 depending on the plan, which is competitive with even light OpenAI API usage once you factor in privacy and the fact that your data never leaves your server.

My recommended next steps: set up Watchtower to keep the Open WebUI container updated automatically (docker run -d --name watchtower -v /var/run/docker.sock:/var/run/docker.sock containrrr/watchtower --interval 86400), and look into the nomic-embed-text model if you want to build a local RAG pipeline — it's a small embedding model that pairs well with a vector DB like Chroma or Qdrant for document search over your own files.
