Setting Up Ollama with Docker: A Complete Guide to Local LLM Deployment
We earn commissions when you shop through the links on this page, at no additional cost to you.
I started running Ollama in my homelab six months ago, and it completely changed how I think about AI workloads. Instead of sending my data to OpenAI or Claude every time I need to write, code, or brainstorm, everything stays local. With Docker, deployment is straightforward—and if you want to expose it securely to the internet, a $40/year VPS from providers like RackNerd gives you a proper public endpoint without touching your home network.
In this guide, I'll walk you through deploying Ollama in Docker, configuring GPU acceleration if you have it, and running it alongside Open WebUI for a ChatGPT-like interface—all completely offline and under your control.
Why Docker for Ollama?
Before Docker, I installed Ollama directly on my Ubuntu server. Updates were fine, but managing dependencies, isolating resource usage, and switching between hardware setups became a mess. Docker solved that.
With Docker, Ollama becomes portable. I can spin up a new instance in seconds, run it on my NAS, my homelab VPS, or my local machine without recompiling anything. The official Ollama image handles all the heavy lifting. Plus, if you want a public-facing API endpoint (instead of just homelab access), you can quickly deploy the same container on a cheap VPS—no rebuilding required.
Prerequisites
- Docker and Docker Compose installed (I'm using version 25.x)
- At least 8 GB RAM (16+ GB recommended for larger models like Llama 2 13B)
- A GPU is optional but transforms performance. I use an RTX 4070, but even a GTX 1660 makes a difference
- Basic familiarity with Docker and compose files
If you're on a VPS and want to run this remotely, RackNerd's New Year deals offer solid specs—typically around $40 annually for a 2-core VPS with 4–8 GB RAM, which is enough for smaller models like Mistral or Phi. Check their current offerings at racknerd.com for available deals.
Simple Docker Setup (CPU Only)
Let's start with the simplest approach—running Ollama in a container without GPU acceleration. This works fine for small models or testing, though inference is slower.
docker run -d \
--name ollama \
-p 11434:11434 \
-v ollama:/root/.ollama \
ollama/ollama:latest
That's it. Ollama is now listening on port 11434. The volume mount persists downloaded models across container restarts.
Pull a model and test it:
docker exec ollama ollama pull mistral
docker exec ollama ollama run mistral "What is containerization?"
The first run downloads the model (Mistral is about 4 GB). Subsequent runs load it from cache.
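Besides docker exec, the container also exposes Ollama's REST API on the published port 11434, so any HTTP client can talk to it. Here's a minimal sketch using only the Python standard library; it assumes the container above is running and that you've already pulled mistral:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # the port published by docker run -p above

def build_generate_body(model: str, prompt: str) -> bytes:
    """JSON body for Ollama's /api/generate endpoint; stream=False asks
    for one complete JSON reply instead of a stream of chunks."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str) -> str:
    """POST a prompt to /api/generate and return the model's text reply."""
    req = urllib.request.Request(
        OLLAMA_URL + "/api/generate",
        data=build_generate_body(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate("mistral", "What is containerization?") returns the reply as a plain string. The /api/generate endpoint and its response field are part of Ollama's documented REST API.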
To see which models are installed, run:
docker exec ollama ollama list
This is useful when managing storage—models accumulate quickly.

GPU-Accelerated Setup with Docker Compose
If you have an NVIDIA GPU (AMD support is experimental), let's enable CUDA acceleration. This dramatically speeds up inference. I tested Mistral 7B on my RTX 4070: CPU mode took 8 seconds per response, GPU mode took 1.2 seconds. Huge difference.
First, ensure NVIDIA Docker runtime is installed:
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.3.1-runtime-ubuntu22.04 nvidia-smi
If that returns GPU info, you're good. If not, install the NVIDIA Container Toolkit following their official docs.
Now, here's my Docker Compose setup for Ollama with GPU support and Open WebUI for a web interface:
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    networks:
      - ollama_net

  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped
    networks:
      - ollama_net

volumes:
  ollama_data:
    driver: local
  webui_data:
    driver: local

networks:
  ollama_net:
    driver: bridge
Save this as docker-compose.yml and deploy:
docker compose up -d
Wait 30–60 seconds for services to start. Then open your browser to http://localhost:3000. Open WebUI will ask for a username and password on first login—set these once, and you'll have a ChatGPT-like interface talking to your local Ollama instance.
Pull a model via the web UI (Models → Pull from Ollama) or from the CLI:
docker exec ollama ollama pull llama2
docker exec ollama ollama pull neural-chat
I typically run Mistral (7B, fastest for my hardware) and Llama 2 (13B, more capable but slower). Both fit comfortably on a 24 GB GPU.
Note that OLLAMA_BASE_URL=http://ollama:11434 uses the Docker service name, not localhost; the two containers resolve each other by name on the ollama_net bridge network. If you change the network setup, update this URL.

Memory and Resource Management
LLMs are resource-hungry. Here's what I've learned:
- Model size: A 7B model typically needs 7–8 GB VRAM. A 13B model needs 13–15 GB. Larger models spill to system RAM, which kills performance
- Context window: Running with a long context (8K tokens) uses significantly more memory than defaults (2K tokens)
- Concurrent requests: By default, Ollama serializes requests. If you want parallel inference, set OLLAMA_NUM_PARALLEL=2 in the compose file, but add memory headroom
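As a sketch, the environment block of the compose file above might grow like this. OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS are documented Ollama server variables; the values here are just an example for a single mid-range GPU:

```yaml
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_NUM_PARALLEL=2        # serve two requests concurrently (needs extra VRAM)
      - OLLAMA_MAX_LOADED_MODELS=1   # evict the current model before loading another
```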
Monitor usage with:
docker stats ollama open-webui
I set memory limits in compose if I'm sharing hardware with other services:
deploy:
  resources:
    limits:
      memory: 20G
    reservations:
      memory: 16G
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
Exposing Ollama to the Internet (Safely)
If you want to access Ollama from outside your home network without exposing your homelab directly, deploy it on a cheap public VPS. RackNerd's annual plans (around $40/year for a decent 2-core VPS) are perfect for this.
On the VPS, deploy the same compose setup, then add a reverse proxy (Nginx or Caddy) with authentication. I prefer Caddy:
apt install -y caddy
# Edit /etc/caddy/Caddyfile
api.yourdomain.com {
    basicauth {
        username bcrypt_hash_here
    }
    reverse_proxy localhost:11434
}
systemctl reload caddy
Generate the bcrypt hash with caddy hash-password. Now only authenticated requests reach Ollama. This keeps your data on your own hardware while giving you flexible remote access.
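Once the proxy is up, clients have to send an Authorization header with every request. This is standard HTTP Basic auth (RFC 7617), so any HTTP library builds it for you; here's a sketch of what happens under the hood, with placeholder credentials and a hypothetical endpoint URL:

```python
import base64
import json
import urllib.request

def basic_auth_header(user: str, password: str) -> str:
    """HTTP Basic auth: base64-encode "user:password" and prefix "Basic "."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return "Basic " + token

def remote_generate(endpoint: str, user: str, password: str,
                    model: str, prompt: str) -> str:
    """Call an Ollama /api/generate endpoint sitting behind Caddy basicauth."""
    req = urllib.request.Request(
        endpoint + "/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": basic_auth_header(user, password),
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

For example, remote_generate("https://api.yourdomain.com", "username", "yourpassword", "mistral", "hello") would go through Caddy, which checks the credentials against the bcrypt hash before proxying to Ollama.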
Updating and Maintenance
Pull the latest Ollama image regularly:
docker compose pull ollama
docker compose up -d
This restarts the service with the new image. Models persist in the volume, so you don't lose them.
To delete old, unused models and reclaim disk space:
docker exec ollama ollama rm mistral
I store models on a dedicated 2 TB drive mounted at /var/lib/docker/volumes, so they don't clutter my system drive.
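If you'd rather point the model volume at a specific drive explicitly instead of relying on where Docker stores volumes, the named volume can be declared as a bind mount in the compose file. The device path below is a hypothetical mount point; the directory must exist before you run docker compose up:

```yaml
volumes:
  ollama_data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /mnt/models/ollama   # hypothetical mount point on the dedicated drive
```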
Performance Tuning
A few tweaks I've found helpful:
- Quantization: Use Q4 or Q5 quantized models (e.g., mistral:7b-instruct-q4_0) to reduce VRAM usage while keeping quality reasonable
- Temperature and top-k: In Open WebUI, lower temperature (0.3–0.5) for deterministic outputs, higher (0.8–1.0) for creativity
- Parallel requests: Ollama's defaults are conservative. If you have headroom, set OLLAMA_NUM_PARALLEL=2 to handle concurrent requests
What's Next?
Once Ollama is running, consider integrating it with other tools. I use it with Gitea (self-hosted Git) for code review automation, and I'm building a retrieval-augmented generation (RAG) system with Nextcloud for document Q&A. The API endpoint at port 11434 works with any application that speaks HTTP JSON.
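For multi-turn integrations like that RAG setup, Ollama also exposes a /api/chat endpoint that takes a message history instead of a bare prompt. A sketch of the request shape, with role names following the usual system/user/assistant convention:

```python
import json

def build_chat_body(model: str, messages: list, stream: bool = False) -> bytes:
    """JSON body for Ollama's /api/chat endpoint; `messages` is a list of
    {"role": ..., "content": ...} dicts, oldest first."""
    return json.dumps({"model": model, "messages": messages, "stream": stream}).encode()

# Example history: a system instruction followed by a user turn.
history = [
    {"role": "system", "content": "You are a concise homelab assistant."},
    {"role": "user", "content": "Summarize this document."},
]
body = build_chat_body("mistral", history)
```

POSTing that body to http://localhost:11434/api/chat returns the assistant's next message, which you append to the history for the following turn.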
If you want finer control over inference (model-specific parameters, batching strategies, or load balancing across multiple GPUs), explore vLLM or Text Generation WebUI. But for most homelab use cases, Ollama in Docker is the sweet spot—simple to deploy, plenty capable, and fully under your control.