Running Open-Source AI Models Privately with Ollama and Open WebUI
Prompts you type into a commercial AI service are typically logged and stored on someone else's servers, and depending on the provider and your settings they may be used for training. I got tired of that reality about eighteen months ago, and since then I've been running every AI workload I can locally or on hardware I control. Ollama combined with Open WebUI gives you a ChatGPT-style experience where your data never leaves your machine, and the setup is genuinely straightforward once you know the gotchas. In this tutorial I'll walk you through a complete Docker Compose deployment that gets Ollama and Open WebUI running in under thirty minutes, with optional GPU passthrough for real-world inference speeds.
Why This Stack, and What Hardware You Actually Need
Ollama handles model management and inference, serving pre-quantised models from its library. Open WebUI is a polished web frontend (built with SvelteKit, not React) that talks to Ollama's REST API. It supports multi-user accounts, conversation history, model switching, and even image generation if you wire up Stable Diffusion. Together they replace a $20/month ChatGPT subscription for most everyday writing and coding tasks.
Hardware matters here, so I'll be honest. For CPU-only inference on a 7B–8B model like llama3.1:8b or mistral, you need at least 8 GB of RAM and you'll get 3–6 tokens per second on a modern Intel or AMD processor: usable, but slow. (llama3.2 itself only ships in 1B and 3B text sizes, which run faster.) If you have an NVIDIA GPU with 8 GB VRAM (a used RTX 3070 or better), you'll hit 40–80 tokens per second on the same model, which feels instant. I run this on an RTX 4070 Ti in my homelab tower and it's indistinguishable from the cloud services in terms of speed.
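To see why 8 GB of VRAM is the comfortable floor for this class of model, a back-of-the-envelope estimate helps: the weights take roughly parameters × bits per weight ÷ 8 bytes. The sketch below assumes about 4.5 bits per weight for a typical 4-bit quantisation, which is my approximation rather than a published figure, and the function name is made up for illustration. The KV cache and runtime overhead come on top, so leave a couple of gigabytes of headroom.

```shell
# Rough weight-memory estimate: params (billions) x bits per weight / 8 = GB.
# The 4.5 bits/weight default is an assumed average for 4-bit quantisation.
estimate_weights_gb() {
  params_b=$1        # model size in billions of parameters
  bits=${2:-4.5}     # average bits per weight (assumption, override as needed)
  awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_weights_gb 8       # 8B model, 4-bit quantised
estimate_weights_gb 8 16    # same model unquantised (FP16)
```

The gap between the two results is the reason quantised models are the default for consumer GPUs.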
If you're thinking about running this on a VPS rather than local hardware, DigitalOcean's GPU Droplets are a reasonable option for testing before you invest in local hardware — you pay only for what you use.
Prerequisites
- Docker Engine 24+ and Docker Compose v2 installed
- At least 16 GB of disk space free for models (plan on 5–8 GB per 7B model)
- For GPU: NVIDIA drivers 525+ and nvidia-container-toolkit installed
- Ubuntu 22.04 / 24.04 or Debian 12 (these instructions target those distros)
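If you'd rather check those prerequisites than trust your memory, something like the following preflight sketch works. The function names are my own, version_ge relies on sort -V from GNU coreutils, and the disk check assumes Docker's data lives under the default /var/lib/docker; adjust if you've moved it.

```shell
# Succeeds if version $1 >= version $2. Relies on `sort -V` (GNU coreutils).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print one line per prerequisite. Run this by hand on the target host.
preflight() {
  docker_v=$(docker version --format '{{.Server.Version}}' 2>/dev/null)
  if version_ge "${docker_v:-0}" 24; then echo "ok:   Docker Engine $docker_v"
  else echo "FAIL: Docker Engine 24+ required (found: ${docker_v:-none})"; fi

  if docker compose version >/dev/null 2>&1; then echo "ok:   Docker Compose v2"
  else echo "FAIL: 'docker compose' is not available"; fi

  # Assumes the default Docker data root; free space is reported in whole GB.
  free_gb=$(df -Pk /var/lib/docker 2>/dev/null | awk 'NR == 2 { print int($4 / 1048576) }')
  if [ "${free_gb:-0}" -ge 16 ]; then echo "ok:   ${free_gb} GB free for models"
  else echo "FAIL: want 16 GB free, found ${free_gb:-0} GB"; fi

  # Only relevant for the GPU variant; skipped when nvidia-smi is absent.
  drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)
  if [ -z "$drv" ]; then echo "skip: no NVIDIA driver detected (CPU-only is fine)"
  elif version_ge "$drv" 525; then echo "ok:   NVIDIA driver $drv"
  else echo "FAIL: NVIDIA driver 525+ required (found: $drv)"; fi
}
```

Paste it into a shell and run preflight; any FAIL line tells you what to fix before continuing.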
If you haven't installed the NVIDIA container toolkit yet, do that first:
# Install NVIDIA Container Toolkit (Ubuntu/Debian)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Configure the Docker runtime and restart
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify — should print your GPU name
sudo docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi
The Docker Compose File
I keep this stack in /opt/ollama/. Create that directory and drop in the following compose.yml. I've included both a GPU variant (the default) and a CPU-only fallback — just comment out the deploy block if you don't have a GPU.
mkdir -p /opt/ollama/data/open-webui
cd /opt/ollama
# /opt/ollama/compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "127.0.0.1:11434:11434" # bind to localhost only; do NOT expose publicly
    environment:
      - OLLAMA_KEEP_ALIVE=5m # keep model loaded in VRAM for 5 minutes after last request
      - OLLAMA_NUM_PARALLEL=2 # allow 2 concurrent inference requests
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    # CPU-only: remove the entire 'deploy' block above
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    depends_on:
      - ollama
    volumes:
      - ./data/open-webui:/app/backend/data
    ports:
      - "127.0.0.1:3000:8080" # also localhost-only; put Caddy or Nginx in front
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_SECRET_KEY=change-this-to-a-random-string-before-deploying
      - ENABLE_SIGNUP=true # set to false after creating your admin account
      - DEFAULT_MODELS=llama3.2:latest
volumes:
  ollama_data:
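The WEBUI_SECRET_KEY placeholder in the compose file needs replacing with a real random value before you start the stack. One way to produce one, sketched with nothing but coreutils (gen_secret is a name I made up; openssl rand -hex 32 gives an equivalent result if you have OpenSSL installed):

```shell
# Print a 64-character hex string built from 32 random bytes.
gen_secret() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

gen_secret; echo
```

Paste the output into compose.yml in place of the placeholder. If you'd rather keep secrets out of the compose file, Compose also reads a .env file in the project directory, so you could store WEBUI_SECRET_KEY there and reference it as ${WEBUI_SECRET_KEY}.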
Both ports are bound to 127.0.0.1 in this config, meaning they're not reachable from other machines. This is intentional. If you want to access Open WebUI from your network or the internet, put a reverse proxy (I use Caddy) in front of port 3000. Never expose port 11434 directly to the internet; the Ollama API has no authentication.
Bring the stack up:
cd /opt/ollama
docker compose up -d
# Watch the logs to confirm both containers started cleanly
docker compose logs -f
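Tailing logs works, but for scripts it's handier to wait until both services actually answer. Here is a small retry helper as a sketch; the function name is mine, and the usage lines assume curl is installed and that you kept the port bindings from the compose file. Ollama's /api/tags endpoint lists the local models, so a successful response means the API is up.

```shell
# Run a command until it succeeds: up to $1 attempts, $2 seconds apart.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  attempts=$1; delay=$2; shift 2
  n=1
  while ! "$@" >/dev/null 2>&1; do
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
  return 0
}

# Usage (not executed here):
#   retry 30 2 curl -fsS http://127.0.0.1:11434/api/tags && echo "Ollama is up"
#   retry 30 2 curl -fsS http://127.0.0.1:3000/          && echo "Open WebUI is up"
```

Open WebUI can take a little while on first start, so a generous attempt count is sensible.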
Pulling Your First Model
Open WebUI will prompt you to pull a model on first login, but I prefer doing it from the command line so I can see download progress. Ollama uses a model library at ollama.com/library. My go-to recommendations:
- llama3.2:3b: 2 GB, extremely fast on CPU, great for quick tasks
- llama3.1:8b: 5 GB, best all-around quality/speed balance on 8 GB VRAM
- mistral:7b-instruct-q4_K_M: excellent for coding and structured output
- deepseek-r1:8b: a distilled reasoning model, strong on step-by-step maths and logic for its size, though not on par with frontier hosted models
- nomic-embed-text: for RAG/embeddings, not chat
# Pull models through the running container
docker exec -it ollama ollama pull llama3.2:3b
docker exec -it ollama ollama pull llama3.1:8b
docker exec -it ollama ollama pull mistral:7b-instruct-q4_K_M
# List downloaded models
docker exec -it ollama ollama list
# Quick sanity test from the command line
docker exec -it ollama ollama run llama3.2:3b "Explain TCP/IP in one sentence"
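The same sanity test can go through Ollama's REST API on port 11434, which is the interface Open WebUI uses. The sketch below builds the JSON body for the /api/generate endpoint; the helper names are my own, and the escaping only covers backslashes and double quotes, so prompts containing newlines or control characters would need a proper JSON tool such as jq.

```shell
# Escape backslashes first, then double quotes, for use inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build a non-streaming /api/generate request body from a model name and prompt.
build_generate_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

build_generate_payload "llama3.2:3b" "Explain TCP/IP in one sentence"; echo

# Usage (not executed here; assumes curl and the localhost binding above):
#   curl -s http://127.0.0.1:11434/api/generate \
#     -d "$(build_generate_payload "llama3.2:3b" "Explain TCP/IP in one sentence")"
```

With stream set to false, Ollama returns a single JSON object whose response field holds the generated text.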
First Login and Hardening Open WebUI
Navigate to http://localhost:3000 (or your server's IP if you're accessing remotely through a reverse proxy). The first account you create automatically becomes the admin. After that, go straight into Admin Panel → Settings and:
- Set ENABLE_SIGNUP=false in your compose.yml and run docker compose up -d again, or toggle it off in the UI, so random users can't self-register.
- Create additional user accounts manually for anyone else in your household.
- Under Models, set a default model so new conversations don't start with a blank selector.
- If you want document uploads and RAG (retrieval-augmented generation), enable it under Documents and pull the nomic-embed-text model.
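One more piece of hardening is on the data side. The compose file bind-mounts ./data/open-webui, so Open WebUI's state lives on the host where you can snapshot it. Here is a sketch of a dated-copy backup with simple retention; the function name, the /mnt/backup destination and the script path in the cron line are all hypothetical. Copying a live SQLite file can catch it mid-write, so for anything important consider stopping the container briefly or using sqlite3's .backup command instead.

```shell
# Copy a file into a backup directory with a timestamp suffix and keep only
# the newest $3 copies (default 7). Ordering relies on the sortable timestamp.
backup_file() {
  src=$1; dest_dir=$2; keep=${3:-7}
  name=$(basename "$src")
  mkdir -p "$dest_dir" || return 1
  cp "$src" "$dest_dir/$name.$(date +%Y%m%d-%H%M%S)" || return 1
  # Newest first by name; everything after the first $keep entries is deleted.
  ls -1r "$dest_dir"/"$name".* 2>/dev/null | tail -n +$((keep + 1)) |
    while IFS= read -r old; do rm -f -- "$old"; done
}

# Usage (not executed here):
#   backup_file /opt/ollama/data/open-webui/webui.db /mnt/backup/open-webui 14
# Example crontab entry, daily at 03:15, assuming you saved a wrapper script:
#   15 3 * * * /opt/ollama/backup.sh
```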
Open WebUI stores its data in a SQLite database at ./data/open-webui/webui.db. Back that file up regularly; it's your entire conversation history. A simple daily cron that copies it to a different disk or to your Nextcloud is enough.
Putting Caddy in Front (Recommended)
I prefer Caddy because automatic HTTPS requires zero configuration. Add this to your existing Caddyfile or create a new one:
ai.yourdomain.com {
    reverse_proxy localhost:3000
    # Caddy automatically obtains and renews a Let's Encrypt cert
}
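Then reload Caddy with sudo systemctl reload caddy. If you don't have a public domain and you're using this on a private network, you can use Tailscale to get a stable hostname. As far as I know, recent Caddy releases can fetch certificates for *.ts.net names from the local Tailscale daemon; failing that, Caddy's internal CA (tls internal) or an ACME DNS challenge gets you HTTPS without opening any ports. Either way, everything stays off the public internet entirely.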
Managing Models and Disk Space
Models accumulate fast. A few housekeeping commands I use regularly:
# See what's downloaded and how large each model is
docker exec -it ollama ollama list
# Remove a model you no longer use
docker exec -it ollama ollama rm mistral:7b-instruct-q4_K_M
# Check total size of the model volume
docker system df -v | grep ollama_data
# One-off update of just the ollama and open-webui containers with Watchtower
docker run --rm \
  -v /var/run/docker.sock:/var/run/docker.sock \
  containrrr/watchtower --run-once ollama open-webui
I have Watchtower set up on a weekly schedule to keep both images current. Ollama ships model format updates fairly often, and Open WebUI gets new features every few weeks — staying current matters for this stack more than most.
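Back on disk space: ollama list shows per-model sizes but no total. Here is a small awk sketch that adds them up. It assumes the output keeps its current column layout (NAME, ID, then size and unit as separate fields) and that sizes are reported in decimal units; both are my reading of the output rather than a documented contract, and the function name is made up.

```shell
# Sum the SIZE column of `ollama list` output read from stdin; prints GB.
# Assumes a one-line header and fields: NAME ID SIZE UNIT MODIFIED...
sum_model_sizes() {
  awk 'NR > 1 && NF >= 4 {
         size = $3; unit = $4
         if (unit == "MB") size /= 1000
         else if (unit == "KB") size /= 1000000
         total += size
       }
       END { printf "%.1f GB\n", total }'
}

# Usage (not executed here). Note: no -t flag, so the pipe gets clean text.
#   docker exec ollama ollama list | sum_model_sizes
```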
Running This on a VPS
If you don't have suitable local hardware, a GPU-equipped VPS is a viable option. Create a DigitalOcean account and look at their GPU Droplets — an H100 instance is overkill for personal use, but even a CPU-optimised Droplet with 16 GB RAM will handle 3B and 7B models for light usage. The main tradeoff is cost: you'll pay hourly, so spin it down when you're not using it, or stick with local hardware for always-on deployments.
Wrapping Up
With this setup you have a fully private, self-hosted AI assistant that holds its own against the commercial offerings for most day-to-day tasks (writing, summarising, coding, brainstorming), with zero data leaving your infrastructure. On hardware you already own, the stack costs nothing beyond electricity, and you're not subject to rate limits, subscription price increases, or policy changes from a third party.
My suggested next steps: pull deepseek-r1:8b for anything that needs reasoning, enable the Documents feature in Open WebUI and experiment with uploading PDFs for RAG queries, and look at wiring Open WebUI into your WireGuard VPN so you can reach it securely from anywhere without exposing it to the public internet.