Running Local LLMs with Ollama in Docker Containers

I've been running Ollama with local language models on my homelab for six months now, and it's genuinely one of the most useful self-hosted setups I've deployed. The ability to run Llama 2, Mistral, or even Dolphin models entirely on your own hardware—without relying on OpenAI or Claude APIs—gives you complete privacy and zero recurring costs. Docker makes this almost trivially easy to set up and manage.

In this guide, I'll walk you through setting up Ollama in Docker, integrating it with Open WebUI for a ChatGPT-like interface, and optimizing it for your homelab hardware.

Why Ollama + Docker?

Before I containerized Ollama, I ran it bare-metal on Ubuntu. The problem? Model files scattered everywhere, version conflicts with dependencies, and scaling to multiple machines became a nightmare. Docker solved all of that.

Ollama is purpose-built for running LLMs locally. It downloads pre-quantized models (Llama 2 7B runs in ~4GB VRAM), handles GPU acceleration automatically if your hardware supports it, and exposes a simple REST API that other services can consume. When combined with Open WebUI, you get a fully functional AI chat interface comparable to ChatGPT—but it's yours.

I prefer Ollama over trying to run raw transformers or llama.cpp because the model management is dead simple: ollama pull mistral and it's ready. No wrestling with GGUF files or quantization formats.

Hardware Requirements

You don't need cutting-edge GPU hardware; I'm running this successfully on modest homelab gear.

Even without a GPU, modern quantized models (7B parameters) are genuinely usable. For non-realtime tasks like document analysis or code generation, CPU inference is fine.

Docker Compose Setup

Here's my complete production Docker Compose file. I'm running this on a dedicated VM from RackNerd with a single GPU, which keeps costs reasonable while giving me solid performance.

version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: always
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - CUDA_VISIBLE_DEVICES=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - ollama_network

  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    restart: always
    ports:
      - "8080:8080"
    environment:
      - OLLAMA_API_BASE_URL=http://ollama:11434/api
      - WEBUI_SECRET_KEY=your_random_secret_key_here
    volumes:
      - webui_data:/app/backend/data
    depends_on:
      - ollama
    networks:
      - ollama_network

volumes:
  ollama_data:
  webui_data:

networks:
  ollama_network:
    driver: bridge

Save this as docker-compose.yml in a new directory (I use /opt/ollama/). Then start the stack:

cd /opt/ollama
docker-compose up -d

# Pull a model (this will download ~4GB for Mistral 7B)
docker exec ollama ollama pull mistral

# Check it's running
docker logs ollama

Within 30 seconds, Open WebUI will be available at http://your-ip:8080. Create an account, and you're chatting with a local LLM.

Tip: If you don't have NVIDIA GPU support in Docker, remove the deploy.resources.reservations.devices section entirely. Ollama will fall back to CPU inference, which is slower but works. On Linux with an NVIDIA GPU, make sure the NVIDIA Container Toolkit is installed (apt install nvidia-container-toolkit on Debian/Ubuntu).
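Once the stack is up, it's worth verifying that the Ollama server actually answers before wiring anything else to it. Here's a minimal health-check sketch in Python (stdlib only; the looks_healthy helper is my own, but a healthy server does reply to its root endpoint with "Ollama is running"):

```python
import urllib.error
import urllib.request

def looks_healthy(status_code, body):
    # Ollama's root endpoint returns HTTP 200 with the body "Ollama is running"
    return status_code == 200 and "Ollama is running" in body

def ollama_is_up(base_url="http://localhost:11434"):
    """Return True if the Ollama server answers its root endpoint."""
    try:
        with urllib.request.urlopen(base_url, timeout=5) as resp:
            return looks_healthy(resp.status, resp.read().decode())
    except (urllib.error.URLError, OSError):
        return False
```

Handy as a quick liveness probe in cron jobs or monitoring scripts.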

Pulling and Managing Models

Ollama's model library is extensive. The ones I actually use regularly are mistral, neural-chat, and llama2:13b.

Pull models directly into the container:

docker exec ollama ollama pull neural-chat
docker exec ollama ollama pull llama2:13b
docker exec ollama ollama list

Models are stored in the ollama_data volume and persist across container restarts. I typically keep 2-3 models pulled at once; each model loaded into memory consumes its own VRAM, and Ollama unloads unused models automatically after five minutes of inactivity.
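If you'd rather query installed models over HTTP than through docker exec, Ollama's GET /api/tags endpoint returns the same list that ollama list prints. A small sketch (stdlib only; model_names is my helper, not part of Ollama):

```python
import json
import urllib.request

def model_names(tags_payload):
    # /api/tags returns {"models": [{"name": "mistral:latest", ...}, ...]}
    return [m["name"] for m in tags_payload.get("models", [])]

def installed_models(base_url="http://localhost:11434"):
    """Fetch and flatten the model list from Ollama's /api/tags endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return model_names(json.loads(resp.read()))
```

Useful when another service needs to confirm a model exists before sending prompts at it.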

Integrating with Your Homelab

The real power comes from consuming Ollama's API in other services. I have integrations running in several of my self-hosted apps.

The API is straightforward. Here's a quick example in Python:

import requests

OLLAMA_URL = "http://ollama:11434/api/generate"

def chat_with_llm(prompt, model="neural-chat"):
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False  # return the full reply at once, not token-by-token
    }

    # Generous timeout: CPU inference on long prompts can take minutes
    response = requests.post(OLLAMA_URL, json=payload, timeout=120)
    response.raise_for_status()
    return response.json().get("response", "No response")

# Usage
answer = chat_with_llm("Explain Docker in one sentence")
print(answer)

This is how I build intelligent features into my self-hosted apps without external API dependencies.
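The example above sets "stream": False for simplicity. For chat-style frontends you usually want tokens as they arrive: with streaming enabled, Ollama emits one JSON object per line, each carrying a "response" fragment and a final "done": true marker. A sketch of reassembling such a stream (the line format follows Ollama's documented /api/generate streaming output):

```python
import json

def assemble_stream(lines):
    """Join the "response" fragments from a streamed /api/generate reply.

    Each element of `lines` is one newline-delimited JSON object, as sent
    by Ollama when "stream" is true.
    """
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip keep-alive blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # final object carries timing stats, no more text
    return "".join(parts)
```

With requests you'd pass stream=True and feed response.iter_lines() into this function, printing fragments as they land.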

Performance Tuning and Monitoring

GPU memory is precious. Here's what I monitor and optimize:

Check current model usage:

docker exec ollama ollama ps

This shows which models are loaded and their VRAM footprint. If you're running multiple services competing for GPU memory, you'll see performance degradation.
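The same information is exposed over HTTP via GET /api/ps, which is convenient for dashboards. A sketch that summarizes VRAM use per loaded model in GiB (field names follow Ollama's /api/ps response; treat them as an assumption if your version differs):

```python
def vram_summary(ps_payload):
    # /api/ps returns {"models": [{"name": ..., "size_vram": <bytes>, ...}]}
    # Convert bytes to GiB, rounded to two decimals, keyed by model name.
    return {
        m["name"]: round(m.get("size_vram", 0) / 2**30, 2)
        for m in ps_payload.get("models", [])
    }
```

Pair it with a scrape from your monitoring stack to catch models lingering in VRAM longer than expected.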

Unload a model manually (newer Ollama releases include a stop subcommand for this):

docker exec ollama ollama stop model_name

Control model timeouts in docker-compose:

By default, Ollama keeps a model in VRAM for 5 minutes after it's used. If you're memory-constrained, reduce this in the Ollama environment:

environment:
  - OLLAMA_HOST=0.0.0.0:11434
  - OLLAMA_KEEP_ALIVE=2m

Set OLLAMA_KEEP_ALIVE=1m to unload unused models faster, or 0 to unload immediately (trades startup latency for VRAM).
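You can also override keep-alive per request rather than globally: /api/generate accepts a keep_alive field in the request body, so a one-off batch job can pass 0 to free VRAM the moment it finishes. A sketch of building such a payload (the helper function is mine; the field name comes from Ollama's API):

```python
def build_generate_payload(prompt, model="mistral", keep_alive="5m", stream=False):
    """Payload for POST /api/generate; keep_alive=0 unloads right after the reply."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": stream,
        "keep_alive": keep_alive,  # e.g. "5m", "1h", or 0 for immediate unload
    }
```

A nightly summarization job, for instance, could use keep_alive=0 so the daytime chat model gets the GPU back immediately.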

Watch out: If you're running Ollama on shared GPU hardware (e.g., a system where you also game or run video encoding), be aware that CUDA errors can cause Ollama to crash silently. Monitor docker logs ollama regularly. I've had mysterious hangs traced back to GPU driver updates—keep your NVIDIA drivers current.

Reverse Proxy Considerations

I expose Open WebUI through Caddy behind my home network, but the Ollama API itself stays internal-only. If you need to expose Ollama's API externally, be very careful about rate limiting and authentication—the default Ollama API has no built-in auth.

For Open WebUI, my Caddy config looks like:

chat.lab {
  reverse_proxy localhost:8080
  encode gzip
}

Simple, secure, and the WebUI handles its own login.

Troubleshooting

Model downloads hang: Large models take a while, and a flaky connection can stall a pull. Check your network, and if you're stuck, restart the container; interrupted pulls generally resume rather than starting from scratch.

Open WebUI shows "Ollama unavailable": Verify connectivity between containers: docker exec open-webui ping ollama. If you changed the network name, update the OLLAMA_API_BASE_URL environment variable.

Slow inference on CPU: Completely normal. A 7B model on CPU gives 2-5 tokens per second. If you need speed, invest in a used GPU. Even a GTX 1080 Ti (~$150-200 used) gives you 20+ tokens/sec.

Next Steps

Now that you have a local LLM running, the next logical move is integrating it with your other self-hosted services. I'd recommend building a simple API wrapper that handles authentication, rate limiting, and model selection—especially if you're exposing it to multiple services on your network.
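As a starting point for that wrapper, here's a deliberately minimal sketch: a shared-key check in front of a forward to the internal Ollama endpoint. The X-Api-Key header name is my own convention, not something Ollama defines, and a real deployment would add rate limiting and per-service keys:

```python
import hmac
import json
import urllib.request

OLLAMA_URL = "http://ollama:11434/api/generate"  # stays internal-only
EXPECTED_KEY = "change-me"  # load from an env var or secrets store in practice

def authorized(headers):
    """Timing-safe check of the client's X-Api-Key header (my convention)."""
    return hmac.compare_digest(headers.get("X-Api-Key", ""), EXPECTED_KEY)

def handle(headers, prompt, model="mistral"):
    """Forward an authorized prompt to Ollama; reject everything else."""
    if not authorized(headers):
        return {"error": "unauthorized"}
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())
```

Drop authorized() and handle() into whatever web framework you already run; hmac.compare_digest keeps the key comparison resistant to timing analysis.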

If you're planning to expand your homelab infrastructure and need reliable, affordable hosting for your other services (documentation, backups, monitoring), RackNerd's KVM VPS options offer excellent performance at scale. I currently run several production workloads on their infrastructure.

The real payoff comes when you stop thinking of Ollama as just a ChatGPT replacement and start using it as a building block for intelligent automation across your entire homelab.
