Setting Up Ollama on a VPS to Run Local LLMs Without a GPU


Most Ollama tutorials assume you have a beefy desktop GPU sitting under your desk. I don't always — and you might not either. What I've found is that a well-specced CPU-only VPS can absolutely run smaller LLMs for personal use, private API access, or scripted automation tasks. You won't be doing real-time chat at blazing speed, but for summarisation, code generation, and batch queries, it's completely workable.

This guide walks through installing Ollama on a plain Ubuntu 24.04 VPS with no GPU, picking models that actually perform well on CPU, and making the whole thing accessible securely. I'll cover the gotchas I ran into so you don't have to.

Choosing the Right VPS for CPU Inference

Before touching a single command, hardware selection matters a lot here. CPU inference is memory-bandwidth-bound more than anything else — RAM speed and the number of physical cores are your two biggest levers. I've had good results on Hetzner's CPX31 (4 vCPUs, 8 GB RAM) for tiny models and their CPX51 (16 vCPUs, 32 GB RAM) for anything in the 7B parameter range. RackNerd's high-RAM KVM plans are worth looking at if budget is tight. Contabo offers a lot of RAM per dollar but their shared CPU performance is inconsistent.

My personal recommendation: aim for at least 8 GB RAM and 4 dedicated vCPUs. 16 GB RAM opens up 7B models comfortably. If your VPS has less than 4 GB RAM, stop here — even a quantised 1B model will be painful.

Tip: Ollama uses quantised GGUF models by default. A Q4_K_M quantisation of a 7B model needs roughly 4.5 GB of RAM. A Q4_K_M of a 3B model needs about 2 GB. Plan your VPS RAM around whichever model you want to run, plus at least 2 GB headroom for the OS.
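
The arithmetic in that tip is easy to script. Here's a minimal sketch, assuming sizes in MB and the same 2 GB headroom rule of thumb; the function names model_fits and total_ram_mb are mine, not part of Ollama:

```shell
#!/bin/sh
# Rough fit check: model size plus OS headroom must not exceed total RAM.
# All values are in MB. The 2048 MB default headroom mirrors the 2 GB
# rule of thumb; it is an estimate, not a measured figure.
model_fits() {
    model_mb=$1
    total_mb=$2
    headroom_mb=${3:-2048}
    [ $((model_mb + headroom_mb)) -le "$total_mb" ]
}

# Total RAM in MB, read from /proc/meminfo (Linux only).
total_ram_mb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' /proc/meminfo
}

# Example: would a ~4.5 GB 7B Q4_K_M model fit on this machine?
if model_fits 4608 "$(total_ram_mb)"; then
    echo "7B Q4_K_M should fit"
else
    echo "7B Q4_K_M is too large for this machine"
fi
```

Treat the result as a first filter only: longer context windows add memory on top of the model weights.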

Installing Ollama on Ubuntu 24.04

Ollama ships a convenience installer script that works fine on a VPS. I know some people are wary of piping to bash, but the script just downloads a pre-built binary to /usr/local/bin/ollama and sets up a systemd service. You can inspect it at https://ollama.com/install.sh before running it if you prefer.

SSH into your VPS and run the following:

# Update packages first
sudo apt update && sudo apt upgrade -y

# Install the Ollama binary and create the systemd service
curl -fsSL https://ollama.com/install.sh | sh

# Confirm the service is running
sudo systemctl status ollama

# Check the version
ollama --version

The installer creates a dedicated ollama system user, places the binary at /usr/local/bin/ollama, and stores models under /usr/share/ollama/.ollama/models by default. If your VPS has a separate data volume mounted at, say, /mnt/data, you'll want to change the model directory before pulling anything large. Set the environment variable in the systemd override:

# Create a systemd override directory for the ollama service
sudo mkdir -p /etc/systemd/system/ollama.service.d

# Write an override that changes the model storage location
# and binds Ollama to localhost only (important for security)
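sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_MODELS=/mnt/data/ollama/models"
Environment="OLLAMA_HOST=127.0.0.1:11434"
EOF

# Apply the override
sudo systemctl daemon-reload && sudo systemctl restart ollama

The model path above is an example. Create the directory you choose and give the ollama user ownership of it (sudo chown -R ollama:ollama /mnt/data/ollama) before restarting.

Watch out: Ollama listens on 127.0.0.1:11434 unless OLLAMA_HOST says otherwise, and on a VPS it should stay that way. Setting OLLAMA_HOST=127.0.0.1:11434 explicitly in the service override guards against a later change to 0.0.0.0, which would expose the API to the internet. Never expose port 11434 publicly: there is no authentication on the Ollama API.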
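
To confirm what the service is actually listening on after a restart, I'd check the socket rather than trust the config. A small sketch, assuming ss from iproute2 is installed (it is on stock Ubuntu); the helper names are hypothetical:

```shell
#!/bin/sh
# Classify a listening address as loopback-only or exposed.
# Input is a "host:port" string such as the Local Address column of `ss -tln`.
is_loopback_addr() {
    case "$1" in
        127.*:*|"[::1]":*|localhost:*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print every address listening on port 11434 and flag anything non-loopback.
check_ollama_bind() {
    ss -tln 2>/dev/null | awk '$4 ~ /:11434$/ { print $4 }' | while read -r addr; do
        if is_loopback_addr "$addr"; then
            echo "ok: $addr (loopback only)"
        else
            echo "EXPOSED: $addr"
        fi
    done
}

check_ollama_bind
```

No output means nothing is listening on 11434 at all, which usually points to the service not running.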

Picking a Model That Works Without a GPU

This is where most guides gloss over the important bit. Not every model on the Ollama library is practical on CPU. Here's what I actually run and what I'd recommend:

  • gemma3:1b - Google's 1B model. Fast enough for near-real-time chat even on a 4-core VPS. Good for simple Q&A and quick scripts. Pulls about 815 MB.
  • llama3.2:3b - The sweet spot for an 8 GB RAM VPS. Noticeably smarter than 1B models, generates at roughly 5–10 tokens/sec on a 4-core CPU. Good for code generation and summarisation.
  • qwen2.5:7b - Excellent reasoning for its size. Needs 16 GB RAM to run comfortably. On a CPX51-class VPS, this generates at about 2–4 tokens/sec, which is slow but usable for batch tasks.
  • phi3.5:3.8b - Microsoft's Phi-3.5 mini model. Punches above its weight for coding tasks and runs on 8 GB RAM.

I'd avoid anything above 7B on a CPU-only VPS unless you're running batch jobs overnight. Pull a model like this:

# Pull a small model suitable for CPU inference
ollama pull gemma3:1b

# Or the 3B Llama model for better quality
ollama pull llama3.2:3b

# List what's downloaded
ollama list

# Run a quick test from the command line
ollama run llama3.2:3b "Summarise the benefits of self-hosting in two sentences."

# Check how much RAM the model is using after loading
ps aux --sort=-%mem | grep ollama | head -5
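
To put a number on generation speed on your own VPS, ollama run accepts a --verbose flag that prints timing statistics after each reply. The same figure can be derived from the API: a non-streaming /api/generate response reports eval_count (tokens generated) and eval_duration (nanoseconds). The helper below is a sketch of that division; the function name and the sample numbers are made up for illustration:

```shell
#!/bin/sh
# tokens_per_sec EVAL_COUNT EVAL_DURATION_NS
# rate = eval_count / eval_duration * 1e9 (eval_duration is in nanoseconds)
tokens_per_sec() {
    awk -v n="$1" -v d="$2" 'BEGIN {
        if (d > 0) printf "%.1f\n", n / d * 1e9
        else print "0.0"
    }'
}

# Illustrative numbers only: 120 tokens generated in 20 seconds
tokens_per_sec 120 20000000000   # prints 6.0

# The two fields come from a request like this one (run on the VPS):
# curl -s http://127.0.0.1:11434/api/generate \
#   -d '{"model":"llama3.2:3b","prompt":"Hello","stream":false}'
```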

Accessing the API Securely

Since Ollama is bound to 127.0.0.1, you need a way to reach it from outside. I prefer one of two approaches depending on the use case.

Option 1: SSH port forwarding — If you just need occasional access from your laptop, forward the port over SSH. No reverse proxy needed:

# Run this on your local machine, not the VPS
# Forwards localhost:11434 on your machine to localhost:11434 on the VPS
ssh -L 11434:127.0.0.1:11434 your-user@your-vps-hostname -N

# Now on your local machine you can hit the Ollama API as if it were local
curl http://localhost:11434/api/tags
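
For scripted use through the tunnel, the main chore is building a valid JSON body for /api/generate. A minimal sketch, assuming single-line prompts; json_escape and generate_payload are hypothetical helper names, and a tool like jq is the more robust choice if you have it installed:

```shell
#!/bin/sh
# Escape backslashes and double quotes so a single-line string can sit inside JSON.
# Limitation: newlines and control characters are not handled in this sketch.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build a non-streaming /api/generate request body from a model name and a prompt.
generate_payload() {
    printf '{"model":"%s","prompt":"%s","stream":false}' \
        "$(json_escape "$1")" "$(json_escape "$2")"
}

# Print the body so it can be inspected before sending
generate_payload llama3.2:3b 'Say "hi"'
echo

# Send it through the tunnel (the ssh -L command must be running):
# curl -s http://localhost:11434/api/generate \
#   -d "$(generate_payload llama3.2:3b 'Say "hi"')"
```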

Option 2: Caddy reverse proxy with authentication — I prefer Caddy because it handles SSL automatically and the config syntax is clean. If you want to expose the API to a specific app or to Open WebUI running on the same server, put Caddy in front of it with basic auth or restrict by IP.

# Install Caddy on Ubuntu
sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https curl
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' \
  | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' \
  | sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update && sudo apt install caddy -y

Then edit /etc/caddy/Caddyfile to proxy Ollama and require basic auth:

ollama.yourdomain.com {
    # basic_auth is the directive name since Caddy 2.8; older releases spell it basicauth
    basic_auth /* {
        # Generate a hash with: caddy hash-password --plaintext 'yourpassword'
        youruser $2a$14$REPLACE_WITH_HASHED_PASSWORD
    }
    reverse_proxy 127.0.0.1:11434
}

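After saving, run sudo systemctl reload caddy. Caddy will automatically obtain a Let's Encrypt certificate for your domain, provided its DNS record points at the VPS and ports 80 and 443 are reachable. OpenAI-compatible clients can then use https://ollama.yourdomain.com/v1 as the base URL, since Ollama serves its OpenAI-compatible endpoints under /v1. The client also has to send the basic-auth credentials with every request.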
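
Some clients can't do basic auth themselves but do let you set a custom header. In that case you can build the header by hand. A sketch, assuming the standard HTTP Basic scheme (base64 of user:password); alice and secret are placeholder credentials and basic_auth_header is a hypothetical helper:

```shell
#!/bin/sh
# Build the HTTP Basic Authorization header for a user/password pair.
basic_auth_header() {
    printf 'Authorization: Basic %s\n' \
        "$(printf '%s:%s' "$1" "$2" | base64 | tr -d '\n')"
}

# Placeholder credentials, for illustration only
basic_auth_header alice secret   # prints Authorization: Basic YWxpY2U6c2VjcmV0

# Usage with curl (curl can also do this itself with -u user:password):
# curl -H "$(basic_auth_header youruser yourpassword)" \
#   https://ollama.yourdomain.com/api/tags
```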

Performance Tuning for CPU Inference

Three settings make a real difference. The first two are environment variables that I set in the systemd override file shown earlier; the third is a model option:

  • OLLAMA_NUM_PARALLEL=1 - On a CPU-only server, running one request at a time is almost always faster than trying to parallelise. Multiple parallel requests will thrash RAM and slow everything down.
  • OLLAMA_MAX_LOADED_MODELS=1 - Keep only one model in RAM. Loading multiple models on a CPU VPS will exhaust your RAM quickly.
  • num_thread - Ollama picks a thread count automatically, but you can pin it. This is a model option rather than a server environment variable: pass it in the options field of an API request ("options": {"num_thread": 4}) or set it in a Modelfile with PARAMETER num_thread 4. On a 4-vCPU VPS I set this to 4.
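
Here's a sketch of a drop-in carrying the two concurrency limits. The file name tuning.conf and the helper write_tuning_dropin are my own choices; systemd merges every .conf file in the service's drop-in directory, so this can sit next to an existing override.conf:

```shell
#!/bin/sh
# Write a systemd drop-in with the two concurrency limits.
# $1 is the drop-in directory; for the real service that would be
# /etc/systemd/system/ollama.service.d (writing there needs root).
write_tuning_dropin() {
    dir=$1
    mkdir -p "$dir" || return 1
    cat > "$dir/tuning.conf" <<'EOF'
[Service]
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF
}

# Dry run into a scratch directory so the result can be inspected first
write_tuning_dropin "${TMPDIR:-/tmp}/ollama-dropin-preview"
cat "${TMPDIR:-/tmp}/ollama-dropin-preview/tuning.conf"

# To apply for real (as root), then reload and restart:
# write_tuning_dropin /etc/systemd/system/ollama.service.d
# systemctl daemon-reload && systemctl restart ollama
```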

Also make sure you're not running other memory-hungry services on the same VPS when doing inference. Docker containers for things like Nextcloud or Jellyfin should either be on a separate machine or stopped when you need the LLM to be responsive.

Monitoring Resource Usage

I keep a terminal open with htop during the first few runs to see exactly which cores are pegged and how much RAM is consumed. Install it if it's not already there:

sudo apt install -y htop

# Watch CPU and RAM in real time while running an inference
htop

# Or check the memory footprint of the model runner process
# (it only exists while a model is loaded; head guards against multiple PIDs)
grep -E "VmRSS|VmPeak" "/proc/$(pgrep -f 'ollama runner' | head -n 1)/status"

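If you see the VPS swapping to disk, your model is too large for the available RAM. Either pull a smaller quantisation (for example, ollama pull llama3.2:3b-instruct-q4_0; the tags available for each model are listed on its page in the Ollama library) or upgrade your VPS tier. Since the default tags are already 4-bit, switching to a smaller model usually frees more RAM than switching quantisation.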
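
A quick way to spot swap use without watching htop is to read it from /proc/meminfo. This is a sketch with a hypothetical helper name; non-zero swap shows pages have been pushed out at some point, while the si/so columns of vmstat 1 show whether swapping is happening right now:

```shell
#!/bin/sh
# swap_used_kb [MEMINFO_FILE]
# Prints swap in use (kB) as SwapTotal - SwapFree from a meminfo-style file.
swap_used_kb() {
    awk '/^SwapTotal:/ { t = $2 }
         /^SwapFree:/  { f = $2 }
         END { print t - f }' "${1:-/proc/meminfo}"
}

used=$(swap_used_kb)
if [ "${used:-0}" -gt 0 ]; then
    echo "swap in use: ${used} kB (the model may be too large for RAM)"
else
    echo "no swap in use"
fi
```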

Wrapping Up

Running Ollama on a CPU-only VPS is absolutely viable — it just requires choosing models that fit your RAM, locking the API to localhost, and having realistic expectations about generation speed. For personal automation, private API access, or experimenting with open-source models without handing data to a cloud provider, a €10–20/month Hetzner or RackNerd VPS does the job surprisingly well.

From here, I'd suggest exploring Open WebUI as a browser-based chat interface you can run alongside Ollama in a Docker Compose stack on the same VPS. You can also hook the http://localhost:11434 endpoint into tools like n8n or Home Assistant for local AI automation that never leaves your server.
