Getting Started with Ollama: Running Local LLMs on Your VPS or Homelab
We earn commissions when you shop through the links on this page, at no additional cost to you. Learn more.
Ollama is the fastest path I've found to running open-source large language models like Llama 3, Mistral, Qwen, and Gemma on hardware you actually own. Whether that's a beefy homelab box in your spare room or a Hetzner VPS with a dedicated GPU, the setup process is nearly identical and surprisingly painless. In this tutorial I'll walk you through a complete, production-ready Ollama stack — bare-metal installation, model management, and a Docker Compose setup fronted by Open WebUI — so you can chat with your own private AI in under an hour.
Why Run LLMs Locally at All?
Privacy is the headline reason, but it's not the only one. When I started routing my AI queries through a local Ollama instance instead of the OpenAI API, I immediately stopped worrying about sensitive data leaving my network. But the secondary benefit — zero per-token cost — is equally compelling. A Hetzner AX52 bare-metal server (AMD Ryzen 9 5950X, 128 GB RAM) runs Llama 3.1 70B comfortably in Q4 quantization at a fixed monthly price. After the break-even point against API costs, every query is free.
The caveat is hardware: you need sufficient RAM for the model to fit entirely in memory (or VRAM if you have a GPU). A rough rule of thumb — a 7B model needs around 6–8 GB RAM in Q4, a 13B needs 10–12 GB, and a 70B needs around 40–48 GB. CPU inference is slower but totally usable for most tasks.
Installing Ollama Directly on the Host
The official install script handles everything on Debian, Ubuntu, and most RHEL-based distros. I always run this on a freshly hardened VPS after locking down SSH and setting up UFW. Run it as a non-root user with sudo privileges:
# Install Ollama using the official script
curl -fsSL https://ollama.com/install.sh | sh
# Verify the service started automatically
systemctl status ollama
# Pull your first model — llama3.2:3b is a great starter (2 GB)
ollama pull llama3.2:3b
# Run an interactive chat session
ollama run llama3.2:3b
Ollama installs as a systemd service called ollama and listens on 127.0.0.1:11434 by default. That local-only binding is intentional — you do not want the raw API exposed publicly without authentication. The service user is created automatically as ollama.
127.0.0.1:11434. If you change this to 0.0.0.0 via the OLLAMA_HOST environment variable so other containers or machines can reach it, make sure your firewall (UFW, iptables, or a cloud security group) blocks port 11434 from the public internet. Anyone who can hit that port can run arbitrary inference on your hardware and pull/delete your models.Once installed, test the REST API directly to confirm everything is working:
# Quick API smoke test
curl http://127.0.0.1:11434/api/generate \
-d '{
"model": "llama3.2:3b",
"prompt": "What is self-hosting in three sentences?",
"stream": false
}'
# List all locally available models
curl http://127.0.0.1:11434/api/tags
Setting Up the Full Stack with Docker Compose
For a proper homelab deployment I prefer to run Ollama and Open WebUI together in Docker Compose. It makes upgrades trivial, keeps the filesystem clean, and lets Watchtower handle automated image updates. The only time I deviate from this is when I need GPU pass-through on a machine where the NVIDIA Container Toolkit setup is already configured on the host — but even then, Docker Compose handles that gracefully.
Create a working directory and drop in this Compose file:
mkdir -p ~/ollama-stack && cd ~/ollama-stack
cat > compose.yml <<'EOF'
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: unless-stopped
volumes:
- ollama_data:/root/.ollama
# GPU pass-through — remove the deploy block entirely if CPU-only
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
# Expose to other containers on the internal network only
networks:
- ai-net
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
restart: unless-stopped
environment:
- OLLAMA_BASE_URL=http://ollama:11434
- WEBUI_SECRET_KEY=changeme-use-a-long-random-string
volumes:
- open_webui_data:/app/backend/data
ports:
- "127.0.0.1:3000:8080"
depends_on:
- ollama
networks:
- ai-net
watchtower:
image: containrrr/watchtower:latest
container_name: watchtower
restart: unless-stopped
volumes:
- /var/run/docker.sock:/var/run/docker.sock
command: --interval 86400 ollama open-webui
volumes:
ollama_data:
open_webui_data:
networks:
ai-net:
driver: bridge
EOF
docker compose up -d
Open WebUI will be reachable at http://localhost:3000 from the server itself. The OLLAMA_BASE_URL points to the ollama container by its service name — Docker's internal DNS takes care of resolution on the ai-net bridge network.
WEBUI_SECRET_KEY to a long random string before you go live. You can generate one with openssl rand -hex 32. Open WebUI uses this key to sign session tokens, so leaving it as the default is a real security risk.Pulling and Managing Models
With the stack running, you can pull models either through the Open WebUI interface or directly via the Ollama CLI inside the container:
# Pull models inside the running container
docker exec -it ollama ollama pull mistral:7b
docker exec -it ollama ollama pull qwen2.5:14b
docker exec -it ollama ollama pull llama3.1:8b
# List what you have
docker exec -it ollama ollama list
# Remove a model to free space
docker exec -it ollama ollama rm mistral:7b
# Check how much disk the model store is using
docker exec -it ollama du -sh /root/.ollama/models
Model files are stored in the ollama_data Docker volume, which persists across container restarts and image upgrades. When Watchtower pulls a new ollama/ollama image and recreates the container, your downloaded models stay exactly where they are.
Putting Caddy in Front for HTTPS Access
I prefer Caddy as a reverse proxy for Ollama stacks because it handles Let's Encrypt certificates automatically with zero configuration. Add this to your Caddyfile (assuming Caddy is already running on the same host):
# /etc/caddy/Caddyfile snippet — add this block
ai.yourdomain.com {
reverse_proxy localhost:3000
}
Reload Caddy with systemctl reload caddy and within seconds you'll have HTTPS-protected access to Open WebUI at https://ai.yourdomain.com. Caddy renews the certificate automatically. If you're on Tailscale, you can also use Tailscale's built-in HTTPS feature and skip the public domain entirely — I do this for my homelab instances that I don't want internet-facing at all.
Hardware Recommendations and Model Choices
Based on my own testing, here's what I'd recommend depending on your budget:
- Entry-level (8–16 GB RAM, CPU-only): Stick to 3B–7B models.
llama3.2:3bandmistral:7bwork well. Expect 5–15 tokens/second on a modern CPU. - Mid-range (32 GB RAM or a 12 GB VRAM GPU like RTX 3060/4060):
llama3.1:8b,qwen2.5:14b, orphi4:14ball run comfortably. GPU inference is 10–30x faster than CPU. - High-end (64–128 GB RAM or 24+ GB VRAM):
llama3.1:70bat Q4 quantization is genuinely excellent and competes with GPT-4 on many tasks.
If you're on a VPS specifically, Hetzner's GPU instances (GX2 with NVIDIA A16 GPUs) are a solid choice. RackNerd's budget KVM VPS plans work fine for smaller models if you're comfortable with CPU-only inference.
Next Steps
Once your stack is up and you've had a chance to chat with a few models, I'd suggest two natural next steps. First, check out the Open WebUI documentation on RAG (retrieval-augmented generation) — you can upload your own documents and have the model answer questions specifically about them, which transforms it from a generic chatbot into a genuinely useful personal knowledge tool. Second, explore the Ollama API to integrate your local LLM with other self-hosted apps: n8n for workflow automation, Nextcloud's AI assistant integration, or even a simple Python script that summarizes your daily RSS feed every morning. The REST API is OpenAI-compatible, so most tools that support the OpenAI SDK work out of the box by just changing the base URL to http://localhost:11434/v1.
Running your own LLM stack is one of those homelab projects that starts as a weekend experiment and quickly becomes infrastructure you rely on daily. Get the stack running, spend some time with the models, and then start automating.
Discussion