Setting Up Ollama with Docker: A Complete Guide to Local LLM Deployment
We earn commissions when you shop through the links on this page, at no additional cost to you.
I started running Ollama in my homelab six months ago, and it completely changed how I think about AI workloads. Instead of sending my data to OpenAI or Claude every time I need to write, code, or brainstorm, everything stays local. With Docker, deployment is straightforward—and if you want to expose it securely to the internet, a $40/year VPS from providers like RackNerd gives you a proper public endpoint without touching your home network.
In this guide, I'll walk you through deploying Ollama in Docker, configuring GPU acceleration if you have it, and running it alongside Open WebUI for a ChatGPT-like interface—all completely offline and under your control.
Why Docker for Ollama?
Before Docker, I installed Ollama directly on my Ubuntu server. Updates were fine, but managing dependencies, isolating resource usage, and switching between hardware setups became a mess. Docker solved that.
With Docker, Ollama becomes portable. I can spin up a new instance in seconds, run it on my NAS, my homelab VPS, or my local machine without recompiling anything. The official Ollama image handles all the heavy lifting. Plus, if you want a public-facing API endpoint (instead of just homelab access), you can quickly deploy the same container on a cheap VPS—no rebuilding required.
Prerequisites
- Docker and Docker Compose installed (I'm using version 25.x)
- At least 8 GB RAM (16+ GB recommended for larger models like Llama 2 13B)
- A GPU is optional but transforms performance. I use an RTX 4070, but even a GTX 1660 makes a difference
- Basic familiarity with Docker and compose files
If you're on a VPS and want to run this remotely, RackNerd's New Year deals offer solid specs—typically around $40 annually for a 2-core VPS with 4–8 GB RAM, which is enough for smaller models like Mistral or Phi. Check their current offerings at racknerd.com for available deals.
Simple Docker Setup (CPU Only)
Let's start with the simplest approach—running Ollama in a container without GPU acceleration. This works fine for small models or testing, though inference is slower.
docker run -d \
--name ollama \
-p 11434:11434 \
-v ollama:/root/.ollama \
ollama/ollama:latest
That's it. Ollama is now listening on port 11434. The volume mount persists downloaded models across container restarts.
Pull a model and test it:
docker exec ollama ollama pull mistral
docker exec ollama ollama run mistral "What is containerization?"
The first run downloads the model (Mistral is about 4 GB). Subsequent runs load it from cache.
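Besides docker exec, the container also exposes Ollama's REST API on the published port 11434, so any HTTP client can talk to it. Here's a minimal sketch using only the Python standard library; it assumes the container above is running and that you've already pulled mistral:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # the port published by docker run -p above

def build_generate_body(model: str, prompt: str) -> bytes:
    """JSON body for Ollama's /api/generate endpoint; stream=False asks
    for one complete JSON reply instead of a stream of chunks."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str) -> str:
    """POST a prompt to /api/generate and return the model's text reply."""
    req = urllib.request.Request(
        OLLAMA_URL + "/api/generate",
        data=build_generate_body(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate("mistral", "What is containerization?") returns the reply as a plain string. The /api/generate endpoint and its response field are part of Ollama's documented REST API.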
To see which models are installed, run:
docker exec ollama ollama list
This is useful when managing storage—models accumulate quickly.

GPU-Accelerated Setup with Docker Compose
If you have an NVIDIA GPU (AMD support is experimental), let's enable CUDA acceleration. This dramatically speeds up inference. I tested Mistral 7B on my RTX 4070: CPU mode took 8 seconds per response, GPU mode took 1.2 seconds. Huge difference.
First, ensure NVIDIA Docker runtime is installed:
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.3.1-runtime-ubuntu22.04 nvidia-smi
If that returns GPU info, you're good. If not, install the NVIDIA Container Toolkit following their official docs.
Now, here's my Docker Compose setup for Ollama with GPU support and Open WebUI for a web interface:
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    networks:
      - ollama_net

  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped
    networks:
      - ollama_net

volumes:
  ollama_data:
    driver: local
  webui_data:
    driver: local

networks:
  ollama_net:
    driver: bridge
Save this as docker-compose.yml and deploy:
docker compose up -d
Wait 30–60 seconds for services to start. Then open your browser to http://localhost:3000. Open WebUI will ask for a username and password on first login—set these once, and you'll have a ChatGPT-like interface talking to your local Ollama instance.
Pull a model via the web UI (Models → Pull from Ollama) or from the CLI:
docker exec ollama ollama pull llama2
docker exec ollama ollama pull neural-chat
I typically run Mistral (7B, fastest for my hardware) and Llama 2 (13B, more capable but slower). Both fit comfortably on a 24 GB GPU.
Note that OLLAMA_BASE_URL=http://ollama:11434 uses the Docker service name, not localhost; the two containers resolve each other by name on the ollama_net bridge network. If you change the network setup, update this URL.

Memory and Resource Management
LLMs are resource-hungry. Here's what I've learned:
- Model size: A 7B model typically needs 7–8 GB VRAM. A 13B model needs 13–15 GB. Larger models spill to system RAM, which kills performance
- Context window: Running with a long context (8K tokens) uses significantly more memory than defaults (2K tokens)
- Concurrent requests: By default, Ollama serializes requests. If you want parallel inference, set OLLAMA_NUM_PARALLEL=2 in the compose file, but add memory headroom
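As a sketch, the environment block of the compose file above might grow like this. OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS are documented Ollama server variables; the values here are just an example for a single mid-range GPU:

```yaml
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_NUM_PARALLEL=2        # serve two requests concurrently (needs extra VRAM)
      - OLLAMA_MAX_LOADED_MODELS=1   # evict the current model before loading another
```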
Monitor usage with:
docker stats ollama open-webui
I set memory limits in compose if I'm sharing hardware with other services:
deploy:
  resources:
    limits:
      memory: 20G
    reservations:
      memory: 16G
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
Exposing Ollama to the Internet (Safely)
If you want to access Ollama from outside your home network without exposing your homelab directly, deploy it on a cheap public VPS. RackNerd's annual plans (around $40/year for a decent 2-core VPS) are perfect for this.
On the VPS, deploy the same compose setup, then add a reverse proxy (Nginx or Caddy) with authentication. I prefer Caddy:
apt install -y caddy
# Edit /etc/caddy/Caddyfile
api.yourdomain.com {
    basicauth {
        username bcrypt_hash_here
    }
    reverse_proxy localhost:11434
}
systemctl reload caddy
Generate the bcrypt hash with caddy hash-password. Now only authenticated requests reach Ollama. This keeps your data on your own hardware while giving you flexible remote access.
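Once the proxy is up, clients have to send an Authorization header with every request. This is standard HTTP Basic auth (RFC 7617), so any HTTP library builds it for you; here's a sketch of what happens under the hood, with placeholder credentials and a hypothetical endpoint URL:

```python
import base64
import json
import urllib.request

def basic_auth_header(user: str, password: str) -> str:
    """HTTP Basic auth: base64-encode "user:password" and prefix "Basic "."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return "Basic " + token

def remote_generate(endpoint: str, user: str, password: str,
                    model: str, prompt: str) -> str:
    """Call an Ollama /api/generate endpoint sitting behind Caddy basicauth."""
    req = urllib.request.Request(
        endpoint + "/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": basic_auth_header(user, password),
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

For example, remote_generate("https://api.yourdomain.com", "username", "yourpassword", "mistral", "hello") would go through Caddy, which checks the credentials against the bcrypt hash before proxying to Ollama.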
Updating and Maintenance
Pull the latest Ollama image regularly:
docker compose pull ollama
docker compose up -d
This restarts the service with the new image. Models persist in the volume, so you don't lose them.
To delete old, unused models and reclaim disk space:
docker exec ollama ollama rm mistral
I store models on a dedicated 2 TB drive mounted at /var/lib/docker/volumes, so they don't clutter my system drive.
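If you'd rather point the model volume at a specific drive explicitly instead of relying on where Docker stores volumes, the named volume can be declared as a bind mount in the compose file. The device path below is a hypothetical mount point; the directory must exist before you run docker compose up:

```yaml
volumes:
  ollama_data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /mnt/models/ollama   # hypothetical mount point on the dedicated drive
```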
Performance Tuning
A few tweaks I've found helpful:
- Quantization: Use Q4 or Q5 quantized models (e.g., mistral:7b-instruct-q4_0) to reduce VRAM usage while keeping quality reasonable
- Temperature and top-k: In Open WebUI, lower temperature (0.3–0.5) for deterministic outputs, higher (0.8–1.0) for creativity
- Parallel requests: Ollama's defaults are conservative. If you have headroom, set OLLAMA_NUM_PARALLEL=2 to handle concurrent requests
What's Next?
Once Ollama is running, consider integrating it with other tools. I use it with Gitea (self-hosted Git) for code review automation, and I'm building a retrieval-augmented generation (RAG) system with Nextcloud for document Q&A. The API endpoint at port 11434 works with any application that speaks HTTP JSON.
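For multi-turn integrations like that RAG setup, Ollama also exposes a /api/chat endpoint that takes a message history instead of a bare prompt. A sketch of the request shape, with role names following the usual system/user/assistant convention:

```python
import json

def build_chat_body(model: str, messages: list, stream: bool = False) -> bytes:
    """JSON body for Ollama's /api/chat endpoint; `messages` is a list of
    {"role": ..., "content": ...} dicts, oldest first."""
    return json.dumps({"model": model, "messages": messages, "stream": stream}).encode()

# Example history: a system instruction followed by a user turn.
history = [
    {"role": "system", "content": "You are a concise homelab assistant."},
    {"role": "user", "content": "Summarize this document."},
]
body = build_chat_body("mistral", history)
```

POSTing that body to http://localhost:11434/api/chat returns the assistant's next message, which you append to the history for the following turn.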
If you want finer control over inference (model-specific parameters, batching strategies, or load balancing across multiple GPUs), explore vLLM or Text Generation WebUI. But for most homelab use cases, Ollama in Docker is the sweet spot—simple to deploy, plenty capable, and fully under your control.