Getting Started with Ollama: Running Local LLMs on Your VPS or Homelab

CompactHost · June 23, 2026


Running a large language model on your own hardware used to mean wrestling with Python virtual environments, CUDA drivers, and half-broken Hugging Face scripts. Ollama changed all of that. It packages model management, an inference runtime, and a local REST API into a single binary that installs in about a minute. In this tutorial I'll walk you through a complete setup, from bare server to chatting with Llama 3.1, whether you're on a Hetzner VPS, a spare mini-PC, or a full homelab tower.

What Ollama Actually Does

Ollama is essentially a model manager and inference server rolled into one. It stores quantised GGUF models under ~/.ollama/models, exposes a REST API on http://localhost:11434, and loads and unloads models from memory automatically. You interact with it through the ollama CLI or any HTTP client. Alongside its native API, Ollama serves an OpenAI-compatible endpoint under /v1, which means most tools built for OpenAI's chat completions API (LangChain, Open WebUI, Aider, Continue.dev) work against your local instance once you change the base URL.

I prefer Ollama over running llama.cpp directly because the model lifecycle management is so much cleaner. You pull models by name and tag, you update them by pulling the same tag again, and you rarely have to hunt for a GGUF file in the right quantisation yourself. For a VPS without a GPU, I always start with a 4-bit quantised model; given a few CPU cores, these run acceptably fast once they're loaded.

Hardware Minimums Worth Knowing

Before you pull a 40 GB model and watch your VPS grind to a halt, here's a realistic RAM baseline:

7B models (Q4_K_M): ~5 GB RAM. Runs fine on an 8 GB VPS or mini-PC. Llama 3.2 3B runs on 4 GB.
13B models (Q4_K_M): ~9 GB RAM. You want 16 GB to leave headroom.
70B models (Q4_K_M): ~40+ GB RAM. GPU strongly recommended; CPU-only is painful.
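
The figures above follow from simple arithmetic: parameter count times bits per weight, plus some working memory. Here is a rough sketch, assuming Q4_K_M averages about 4.85 bits per weight and that about 1 GB covers the KV cache and runtime at a short context. Both numbers are my approximations, not values Ollama publishes, and the function name is made up for illustration.

```python
def estimate_ram_gb(params_billion: float,
                    bits_per_weight: float = 4.85,   # assumed average for Q4_K_M
                    overhead_gb: float = 1.0) -> float:  # assumed KV cache + runtime
    """Rough RAM needed to run a quantised model, in GB (1 GB = 1e9 bytes)."""
    # 1e9 parameters * bits per weight / 8 bits per byte = GB of weights
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


for size in (7, 13, 70):
    print(f"{size}B: ~{estimate_ram_gb(size):.1f} GB")
```

With these assumptions the loop prints about 5.2, 8.9 and 43.4 GB, which lines up with the list. Longer context windows grow the KV cache, so treat the overhead term as a floor rather than a constant.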

For a VPS I generally recommend a Hetzner CX32 (4 vCPU, 8 GB RAM) as a minimum for 7B models, or the Arm-based CAX21 (4 vCPU, 8 GB RAM), which performs surprisingly well thanks to llama.cpp's ARM optimisations. CPU-only inference at 7B runs at roughly 5–15 tokens/second depending on the box. That is slow compared to a GPU, but fast enough for batch tasks, coding assistants, or API calls that don't need real-time streaming.
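
To see what that throughput means in practice, here is a small back-of-the-envelope sketch. The 500-token response length is a hypothetical figure chosen for illustration, and the calculation ignores prompt processing and model load time.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a response, ignoring prompt processing and model load."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# A hypothetical 500-token answer at both ends of the 5 to 15 tokens/second range
slow = generation_seconds(500, 5)    # 100.0 seconds
fast = generation_seconds(500, 15)   # about 33.3 seconds
print(f"{slow:.0f}s to {fast:.0f}s per 500-token response")
```

Half a minute to a couple of minutes per answer is fine for a queue of batch jobs, but it is the reason interactive chat feels sluggish on small CPU-only boxes.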

Installing Ollama on a Fresh Linux Server

I run this on Ubuntu 24.04 LTS. The official installer script is the simplest path and I've never had it break on a clean system:

# Install Ollama (runs as a systemd service automatically)
curl -fsSL https://ollama.com/install.sh | sh

# Verify it started correctly
systemctl status ollama

# Check the API is responding
curl http://localhost:11434/api/tags

After install, Ollama runs as a systemd service under the ollama user. The service binds to 127.0.0.1:11434 by default, which is what you want — you don't want this port open to the internet unprotected. Now pull your first model:

# Pull Llama 3.1 8B (recommended starting point, ~5 GB download)
ollama pull llama3.1:8b

# Or for a lighter model on a small VPS
ollama pull llama3.2:3b

# Pull Mistral 7B if you want a strong coder/reasoner
ollama pull mistral:7b-instruct-q4_K_M

# List what you've pulled
ollama list

# Run an interactive chat session
ollama run llama3.1:8b

Tip: The q4_K_M quantisation suffix is generally the sweet spot for CPU-only inference. It shrinks the weights to roughly a third of their FP16 size (about 4.5 to 5 bits per weight instead of 16) with only a modest loss in quality. When you pull a model without specifying a tag (e.g. ollama pull mistral), Ollama pulls the latest tag, which for most library models is typically a 4-bit quantisation.
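
If you script your pulls, it helps to be explicit about which tag you are asking for. The sketch below splits a model reference into name and tag, filling in latest when no tag is given, and makes a best-effort guess at the quantisation from the tag. Both helper names are hypothetical, and the quantisation guess relies on the naming convention used in this article's examples rather than on anything Ollama guarantees.

```python
import re
from typing import Optional, Tuple


def parse_model_ref(ref: str) -> Tuple[str, str]:
    """Split a library-style 'name:tag' reference; a missing tag means 'latest'."""
    name, sep, tag = ref.partition(":")
    return name, (tag if sep and tag else "latest")


def quantisation_hint(tag: str) -> Optional[str]:
    """Guess the quantisation from the last dash-separated part of a tag."""
    last = tag.rsplit("-", 1)[-1]
    if re.fullmatch(r"q\d\w*|fp16|f16", last, flags=re.IGNORECASE):
        return last
    return None  # tags such as '8b' or 'latest' do not name a quantisation


for ref in ("mistral", "llama3.1:8b", "mistral:7b-instruct-q4_K_M"):
    name, tag = parse_model_ref(ref)
    print(name, tag, quantisation_hint(tag))
```

When the hint comes back empty, ollama show on the pulled model reports the quantisation that was actually downloaded.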

Using the REST API

The CLI is useful for testing, but in practice you'll hit the REST API from your applications. Ollama exposes two main native endpoints: /api/generate for single-turn completions and /api/chat for multi-turn conversations with a messages array. The messages array uses the same role/content shape as OpenAI's, but the response format differs, so OpenAI SDKs should point at the separate /v1/chat/completions endpoint instead:

# Single-turn completion
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "prompt": "Explain ZFS copy-on-write in two sentences.",
    "stream": false
  }'

# Multi-turn chat (native API; the messages array uses OpenAI-style roles)
curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "stream": false,
    "messages": [
      {"role": "system", "content": "You are a helpful Linux sysadmin."},
      {"role": "user",   "content": "How do I check disk I/O wait on Ubuntu?"}
    ]
  }'

# OpenAI-compatible endpoint (drop-in replacement for OpenAI API)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ollama" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
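
The native examples above set "stream": false. If you leave streaming on, which is the default for /api/generate and /api/chat, the response arrives as newline-delimited JSON, one object per fragment, ending with an object whose done field is true. Here is a minimal sketch of stitching those fragments back together. The function name is hypothetical and the sample lines are hand-written and trimmed to the fields the code reads.

```python
import json
from typing import Iterable


def assemble_chat_stream(lines: Iterable[str]) -> str:
    """Join the content fragments of a streamed /api/chat response."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between objects
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final object carries stats, not more text
    return "".join(parts)


# Hand-written sample; real chunks also carry model, created_at and timing fields
sample = [
    '{"message": {"role": "assistant", "content": "Use "}, "done": false}',
    '{"message": {"role": "assistant", "content": "iostat -x 1"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(assemble_chat_stream(sample))
```

In a real client you would feed this the lines of the HTTP response body as they arrive, decoding each from bytes first, and print fragments as you go instead of joining them at the end.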

The /v1/chat/completions endpoint is particularly useful. Set your OpenAI base URL to http://localhost:11434/v1 and any API key string (Ollama ignores it), and you're done. I use this to point Continue.dev in VS Code at my homelab for free, private code completion.
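
If you'd rather not install an SDK, the same call works with nothing but the Python standard library. This is a minimal sketch: the helper names are my own, the API key is a placeholder, and chat() needs a running Ollama server with the model already pulled.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # change the host for a remote box


def build_chat_request(model: str, user_text: str,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible chat completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer ollama",  # placeholder; Ollama ignores the key
        },
        method="POST",
    )


def extract_reply(response_json: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion."""
    return response_json["choices"][0]["message"]["content"]


def chat(model: str, user_text: str) -> str:
    """Send the request and return the reply (requires a live server)."""
    request = build_chat_request(model, user_text)
    with urllib.request.urlopen(request, timeout=300) as resp:
        return extract_reply(json.load(resp))
```

Calling chat("llama3.1:8b", "Hello!") returns the model's reply as a string. The generous timeout is deliberate, since the first request after a cold start also has to load the model into RAM.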

Exposing Ollama to Your Local Network Safely

By default Ollama only listens on 127.0.0.1. If you want other machines on your LAN — or Open WebUI running in Docker — to reach it, you need to change the bind address. Edit the systemd service override:

# Create a systemd override for Ollama
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF

# Reload and restart
sudo systemctl daemon-reload
sudo systemctl restart ollama

# Confirm it's listening on all interfaces
ss -tlnp | grep 11434

Watch out: Binding to 0.0.0.0 exposes the Ollama API on every network interface on your machine. On a VPS this means the public internet can reach port 11434 unless a firewall blocks it, and the API has no authentication of its own. UFW applies the first matching rule, so allow your VPN interface first (for example sudo ufw allow in on tailscale0 to any port 11434 proto tcp, or wg0 for WireGuard), then add sudo ufw deny 11434/tcp. Alternatively, put Ollama behind a reverse proxy with authentication.
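
A quick sanity check for deployment scripts is to refuse to continue when OLLAMA_HOST points anywhere other than loopback and no firewall step has run. The sketch below only classifies the bind address. The function name is hypothetical, and it handles plain host, host:port and bracketed IPv6 forms, not URLs with a scheme.

```python
import ipaddress


def is_loopback_only(ollama_host: str) -> bool:
    """True if an OLLAMA_HOST value binds to the loopback interface only."""
    host = ollama_host.strip()
    if not host:
        return True  # unset: Ollama falls back to 127.0.0.1
    if host.startswith("["):        # bracketed IPv6, e.g. [::1]:11434
        host = host[1:].split("]", 1)[0]
    elif host.count(":") == 1:      # IPv4 address or hostname with a port
        host = host.split(":", 1)[0]
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # unknown hostname: assume it is reachable from outside


for value in ("127.0.0.1:11434", "0.0.0.0:11434", "[::1]:11434"):
    print(value, "->", "loopback only" if is_loopback_only(value) else "EXPOSED")
```

A stricter alternative to 0.0.0.0 is to bind OLLAMA_HOST to the VPN interface's own address, so the port never exists on the public interface in the first place.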

Managing Models and Keeping Disk Usage Sane

Models live under /usr/share/ollama/.ollama/models (when installed as a service) or ~/.ollama/models for a user install. A handful of 7B models will eat 15–25 GB quickly. A few commands you'll use regularly:

# List all downloaded models with their sizes
ollama list

# Remove a model you no longer need
ollama rm mistral:7b-instruct-q4_K_M

# Show model details (parameter count, quantisation, context length)
ollama show llama3.1:8b

# See what's currently loaded in memory
curl http://localhost:11434/api/ps

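# Unload one model from RAM right now (available in recent Ollama releases)
ollama stop llama3.1:8b

# Ollama keeps a model loaded for 5 minutes after its last request by default.
# Set OLLAMA_KEEP_ALIVE=0 to unload immediately after each request instead.
# Note: tee overwrites override.conf, so keep the OLLAMA_HOST line below only
# if you already exposed Ollama to your LAN in the previous section.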
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=0"
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama
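
The keep-alive setting can also be overridden per request, which is handy when you want the default five-minute cache most of the time but need memory back right now. Sending /api/generate a model name with keep_alive set to 0 and no prompt asks Ollama to unload that model. Here is a minimal sketch with hypothetical helper names; unload() needs a running server.

```python
import json
import urllib.request


def build_unload_request(model: str,
                         base_url: str = "http://localhost:11434",
                         ) -> urllib.request.Request:
    """Build a request asking Ollama to unload `model` immediately."""
    # No prompt: the request only changes how long the model stays loaded
    body = json.dumps({"model": model, "keep_alive": 0}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unload(model: str) -> None:
    """Send the unload request (requires a live server)."""
    with urllib.request.urlopen(build_unload_request(model), timeout=30) as resp:
        resp.read()  # drain the small JSON acknowledgement
```

After calling unload("llama3.1:8b"), the /api/ps endpoint shown earlier should no longer list the model.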

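If you're on a VPS with a limited root disk, I strongly recommend mounting a larger block volume and moving model storage there before you pull anything large. On Hetzner you can add a 50 GB volume for a few euros a month and mount it at /var/ollama. Then either symlink the default models directory to it or, more simply, set OLLAMA_MODELS=/var/ollama/models in the service environment. Whichever you choose, make sure the ollama user owns the directory (sudo chown -R ollama:ollama /var/ollama), or pulls will fail with permission errors.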
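
Before pulling something large, it is worth checking how much your models already occupy and how much room is left. The sketch below sums the size field (in bytes) from an /api/tags response and compares a planned download against free disk space. The function names, the sample sizes and the 5 GB safety reserve are all illustrative assumptions.

```python
import json
import shutil


def total_model_bytes(tags_json: str) -> int:
    """Sum the 'size' field of every model in an /api/tags response body."""
    models = json.loads(tags_json).get("models", [])
    return sum(m.get("size", 0) for m in models)


def fits_on_disk(model_bytes: int, path: str = "/",
                 reserve_bytes: int = 5 * 10**9) -> bool:
    """True if `path` has room for the download plus a safety reserve."""
    return shutil.disk_usage(path).free >= model_bytes + reserve_bytes


# Hand-written sample with made-up sizes; real entries carry more fields
sample = ('{"models": [{"name": "llama3.1:8b", "size": 4900000000},'
          ' {"name": "llama3.2:3b", "size": 2000000000}]}')
print(f"{total_model_bytes(sample) / 1e9:.1f} GB used by models")
```

Point the path argument at the mount that holds your models directory, for example /var/ollama if you moved storage to a separate volume.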

What to Do Next

With Ollama running and a model pulled, the most natural next step for most people is adding a chat UI. Open WebUI is the go-to option: a polished ChatGPT-style interface that connects to Ollama out of the box and runs as a Docker container. I've covered the full Open WebUI setup with Docker Compose and a Caddy reverse proxy separately on this site. The other direction worth exploring is the OpenAI-compatible API: once you've pointed a tool like Aider, LangChain, or your own Python scripts at http://localhost:11434/v1, you have a fully private AI backend with no per-token fees for every project on your homelab.

Start small — pull llama3.2:3b if you're on a tight RAM budget, verify the API responds, and build from there. The ecosystem around Ollama has matured enormously and almost everything that works with OpenAI works here too.
