Setting Up Ollama on a VPS to Run Local LLMs Without a GPU
Most Ollama tutorials assume you have a beefy desktop GPU sitting under your desk. I don't always — and you might not either. What I've found is that a well-specced CPU-only VPS can absolutely run smaller LLMs for personal use, private API access, or scripted automation tasks. You won't be doing real-time chat at blazing speed, but for summarisation, code generation, and batch queries, it's completely workable.
This guide walks through installing Ollama on a plain Ubuntu 24.04 VPS with no GPU, picking models that actually perform well on CPU, and making the whole thing accessible securely. I'll cover the gotchas I ran into so you don't have to.
Choosing the Right VPS for CPU Inference
Before touching a single command, hardware selection matters a lot here. CPU inference is memory-bandwidth-bound more than anything else — RAM speed and the number of physical cores are your two biggest levers. I've had good results on Hetzner's CPX31 (4 vCPUs, 8 GB RAM) for tiny models and their CPX51 (16 vCPUs, 32 GB RAM) for anything in the 7B parameter range. RackNerd's high-RAM KVM plans are worth looking at if budget is tight. Contabo offers a lot of RAM per dollar but their shared CPU performance is inconsistent.
My personal recommendation: aim for at least 8 GB RAM and 4 dedicated vCPUs. 16 GB RAM opens up 7B models comfortably. If your VPS has less than 4 GB RAM, stop here — even a quantised 1B model will be painful.
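As a rough pre-flight check, the thresholds above can be turned into a small script. This is only a sketch: the function name `recommend_tier`, the message wording, and the "marginal" tier for boxes between 4 and 8 GB are my own assumptions, not anything Ollama provides.

```shell
#!/usr/bin/env bash
# Maps RAM (GB) and vCPU count onto the sizing guidance in this section.
# recommend_tier and its messages are illustrative names, not an Ollama tool.
recommend_tier() {
  local mem_gb=$1 cores=$2
  if [ "$mem_gb" -lt 4 ]; then
    echo "too small: under 4 GB RAM"
  elif [ "$mem_gb" -ge 16 ] && [ "$cores" -ge 4 ]; then
    echo "ok for 7B models"
  elif [ "$mem_gb" -ge 8 ] && [ "$cores" -ge 4 ]; then
    echo "ok for 1B-3B models"
  else
    # Assumption: anything in between is treated as marginal
    echo "marginal: stick to 1B models"
  fi
}

# MemTotal is reported in kB; round to the nearest GB because an
# "8 GB" VPS usually reports slightly less than 8 GB to the guest.
if [ -r /proc/meminfo ]; then
  mem_gb=$(awk '/^MemTotal:/ {printf "%d", ($2 / 1024 / 1024) + 0.5}' /proc/meminfo)
  cores=$(nproc)
  echo "RAM: ${mem_gb} GB, vCPUs: ${cores} -> $(recommend_tier "$mem_gb" "$cores")"
fi
```

Note that `nproc` counts vCPUs as the guest sees them, so it cannot tell you whether they are dedicated or shared; that still has to come from the provider's plan description.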
Installing Ollama on Ubuntu 24.04
Ollama ships a convenience installer script that works fine on a VPS. I know some people are wary of piping to bash, but the script just downloads a pre-built binary to /usr/local/bin/ollama and sets up a systemd service. You can inspect it at https://ollama.com/install.sh before running it if you prefer.
SSH into your VPS and run the following:
# Update packages first
sudo apt update && sudo apt upgrade -y
# Install the Ollama binary and create the systemd service
curl -fsSL https://ollama.com/install.sh | sh
# Confirm the service is running
sudo systemctl status ollama
# Check the version
ollama --version
The installer creates a dedicated ollama system user, places the binary at /usr/local/bin/ollama, and stores models under /usr/share/ollama/.ollama/models by default. If your VPS has a separate data volume mounted at, say, /mnt/data, you'll want to change the model directory before pulling anything large. Set the environment variable in the systemd override:
# Create a systemd override directory for the ollama service
sudo mkdir -p /etc/systemd/system/ollama.service.d
# Write an override that changes the model storage location
# and binds Ollama to localhost only (important for security)
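sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
# Example path on the data volume; the ollama user must be able to write to it
Environment="OLLAMA_MODELS=/mnt/data/ollama/models"
Environment="OLLAMA_HOST=127.0.0.1:11434"
EOF
# Reload systemd and restart Ollama so the override takes effect
sudo systemctl daemon-reload
sudo systemctl restart ollama
Make sure Ollama is never listening on 0.0.0.0:11434, which means it's exposed to the internet. On a VPS, always set OLLAMA_HOST=127.0.0.1:11434 in the service override. Never expose port 11434 publicly: there is no authentication on the Ollama API.
Picking a Model That Works Without a GPU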
This is where most guides gloss over the important bit. Not every model on the Ollama library is practical on CPU. Here's what I actually run and what I'd recommend:
- gemma3:1b — Google's 1B model. Fast enough for near-real-time chat even on a 4-core VPS. Good for simple Q&A and quick scripts. Pulls about 815 MB.
- llama3.2:3b — The sweet spot for an 8 GB RAM VPS. Noticeably smarter than 1B models, generates at roughly 5–10 tokens/sec on a 4-core CPU. Good for code generation and summarisation.
- qwen2.5:7b — Excellent reasoning for its size. Needs 16 GB RAM to run comfortably. On a CPX51-class VPS, this generates at about 2–4 tokens/sec which is slow but usable for batch tasks.
- phi3.5:mini — Microsoft's mini model. Punches above its weight for coding tasks and runs on 8 GB RAM.
I'd avoid anything above 7B on a CPU-only VPS unless you're running batch jobs overnight. Pull a model like this:
# Pull a small model suitable for CPU inference
ollama pull gemma3:1b
# Or the 3B Llama model for better quality
ollama pull llama3.2:3b
# List what's downloaded
ollama list
# Run a quick test from the command line
ollama run llama3.2:3b "Summarise the benefits of self-hosting in two sentences."
# Check how much RAM the model is using after loading
ps aux --sort=-%mem | grep ollama | head -5
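To check the tokens/sec figures on your own VPS rather than trusting mine, one option is to compute the rate from the `eval_count` and `eval_duration` fields (the latter in nanoseconds) that a non-streaming `/api/generate` response returns. The helper name `tokens_per_sec` is hypothetical, and the commented usage assumes `jq` is installed.

```shell
#!/usr/bin/env bash
# tokens_per_sec EVAL_COUNT EVAL_DURATION_NS
# Converts the two response fields into a generation rate.
tokens_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN {
    if (ns > 0) printf "%.2f\n", n / (ns / 1e9)
    else        print "0.00"
  }'
}

# Example with made-up numbers: 120 tokens generated in 20 seconds
tokens_per_sec 120 20000000000

# Against a running server (requires jq; left commented so the sketch runs offline):
#   resp=$(curl -s http://127.0.0.1:11434/api/generate \
#     -d '{"model":"llama3.2:3b","prompt":"Say hello.","stream":false}')
#   tokens_per_sec "$(echo "$resp" | jq .eval_count)" "$(echo "$resp" | jq .eval_duration)"
```

For a quick interactive check, `ollama run llama3.2:3b --verbose` prints similar timing statistics after each response.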
Accessing the API Securely
Since Ollama is bound to 127.0.0.1, you need a way to reach it from outside. I prefer one of two approaches depending on the use case.
Option 1: SSH port forwarding — If you just need occasional access from your laptop, forward the port over SSH. No reverse proxy needed:
# Run this on your local machine, not the VPS
# Forwards localhost:11434 on your machine to localhost:11434 on the VPS
ssh -L 11434:127.0.0.1:11434 your-user@your-vps-hostname -N
# Now on your local machine you can hit the Ollama API as if it were local
curl http://localhost:11434/api/tags
Option 2: Caddy reverse proxy with authentication — I prefer Caddy because it handles SSL automatically and the config syntax is clean. If you want to expose the API to a specific app or to Open WebUI running on the same server, put Caddy in front of it with basic auth or restrict by IP.
# Install Caddy on Ubuntu
sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https curl
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' \
| sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' \
| sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update && sudo apt install caddy -y
Then edit /etc/caddy/Caddyfile to proxy Ollama and require basic auth:
ollama.yourdomain.com {
    # Caddy 2.8+ names this directive basic_auth; older releases use basicauth
    basic_auth /* {
        # Generate a hash with: caddy hash-password --plaintext 'yourpassword'
        youruser $2a$14$REPLACE_WITH_HASHED_PASSWORD
    }
    reverse_proxy 127.0.0.1:11434
}
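After saving, run sudo systemctl reload caddy. Caddy will automatically obtain a Let's Encrypt certificate for your domain, provided its DNS record points at the VPS and ports 80 and 443 are reachable. Ollama serves its OpenAI-compatible endpoints under /v1, so an OpenAI-compatible client needs its base URL set to https://ollama.yourdomain.com/v1; clients that speak the native Ollama API use https://ollama.yourdomain.com.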
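Before pointing a public hostname at the proxy, it may be worth confirming that Ollama itself is only listening on loopback. The sketch below parses `ss -ltn` output for port 11434; the function name `ollama_bind_check` and its three status messages are my own invention.

```shell
#!/usr/bin/env bash
# Reads `ss -ltn` output on stdin and reports how port 11434 is bound.
# Column 4 of that output is the local address:port.
ollama_bind_check() {
  awk '$4 ~ /:11434$/ {
         found = 1
         # Anything other than IPv4 or IPv6 loopback counts as exposed
         if ($4 !~ /^(127\.0\.0\.1|\[::1\]):11434$/) exposed = 1
       }
       END {
         if (!found)  { print "not listening"; exit 2 }
         if (exposed) { print "exposed";       exit 1 }
         print "loopback only"
       }'
}

# Usage on the VPS:
#   ss -ltn | ollama_bind_check
```

A result of "exposed" means the OLLAMA_HOST override has not taken effect and the service needs a daemon-reload and restart. A firewall rule blocking 11434 is a sensible second layer either way.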
Performance Tuning for CPU Inference
A few environment variables make a real difference. I set these in the systemd override file shown earlier:
- OLLAMA_NUM_PARALLEL=1 — On a CPU-only server, running one request at a time is almost always faster than trying to parallelise. Multiple parallel requests will thrash RAM and slow everything down.
- OLLAMA_MAX_LOADED_MODELS=1 — Keep only one model in RAM. Loading multiple models on a CPU VPS will exhaust your RAM quickly.
- num_thread — Ollama auto-detects CPU threads, but you can pin the count explicitly. As far as I can tell this is a model option (PARAMETER num_thread in a Modelfile, or options.num_thread in an API request) rather than a documented service environment variable. On a 4-vCPU VPS I set it to 4.
Also make sure you're not running other memory-hungry services on the same VPS when doing inference. Docker containers for things like Nextcloud or Jellyfin should either be on a separate machine or stopped when you need the LLM to be responsive.
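For reference, here is one way the two service-level variables could be written out as a systemd drop-in. The function `tuning_dropin` and the file name `tuning.conf` are hypothetical; systemd reads any `*.conf` file in the drop-in directory, so you could equally add the two `Environment=` lines to the existing override.

```shell
#!/usr/bin/env bash
# Prints a systemd drop-in for the ollama service.
# Arguments: parallel requests, max loaded models (both default to 1).
tuning_dropin() {
  local parallel=${1:-1} max_loaded=${2:-1}
  cat <<EOF
[Service]
Environment="OLLAMA_NUM_PARALLEL=${parallel}"
Environment="OLLAMA_MAX_LOADED_MODELS=${max_loaded}"
EOF
}

# Preview the generated file
tuning_dropin 1 1

# To apply it on the VPS:
#   tuning_dropin 1 1 | sudo tee /etc/systemd/system/ollama.service.d/tuning.conf > /dev/null
#   sudo systemctl daemon-reload && sudo systemctl restart ollama
```

Printing the fragment first and piping it to `sudo tee` afterwards keeps the privileged step to a single, easily reviewed command.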
Monitoring Resource Usage
I keep a terminal open with htop during the first few runs to see exactly which cores are pegged and how much RAM is consumed. Install it if it's not already there:
sudo apt install -y htop
# Watch CPU and RAM in real time while running an inference
htop
# Or check the current memory footprint of the Ollama runner process(es)
# (a runner only exists while a model is loaded)
for pid in $(pgrep -f "ollama runner"); do grep -E "VmRSS|VmPeak" "/proc/$pid/status"; done
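If you see the VPS swapping to disk, your model is too large for the available RAM. Either pull a smaller quantisation or upgrade your VPS tier. Quantisation is part of the tag rather than a third colon-separated field, for example ollama pull llama3.2:3b-instruct-q4_0; check the model's Tags page in the Ollama library for the variants that actually exist.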
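If you'd rather have a number than eyeball htop, swap usage can be read straight from /proc/meminfo. A minimal sketch follows; `swap_used_mb` is a name I made up, and it takes an optional file argument purely so it can be tried against sample data.

```shell
#!/usr/bin/env bash
# swap_used_mb [MEMINFO_FILE]
# Prints swap currently in use, in MB, from a /proc/meminfo-style file.
swap_used_mb() {
  awk '/^SwapTotal:/ {t = $2}
       /^SwapFree:/  {f = $2}
       END {printf "%d\n", (t - f) / 1024}' "${1:-/proc/meminfo}"
}

# Report on the live system when /proc/meminfo is available
if [ -r /proc/meminfo ]; then
  echo "Swap in use: $(swap_used_mb) MB"
fi
```

A non-zero figure on its own is not alarming, since the kernel may have paged out idle data earlier. What matters is whether it grows during inference; `vmstat 1` shows live swap-in and swap-out activity in its si and so columns.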
Wrapping Up
Running Ollama on a CPU-only VPS is absolutely viable — it just requires choosing models that fit your RAM, locking the API to localhost, and having realistic expectations about generation speed. For personal automation, private API access, or experimenting with open-source models without handing data to a cloud provider, a €10–20/month Hetzner or RackNerd VPS does the job surprisingly well.
From here, I'd suggest exploring Open WebUI as a browser-based chat interface you can run alongside Ollama in a Docker Compose stack on the same VPS. You can also hook the http://localhost:11434 endpoint into tools like n8n or Home Assistant for local AI automation that never leaves your server.