Setting Up Ollama Locally: Running LLMs on Your Homelab
I've been running Ollama in my homelab for the past eight months, and it's transformed how I work with AI. No more API calls, no more rate limits, no more sending my prompts to OpenAI's servers. Instead, I have privacy-respecting language models running on hardware I control—and the setup is far simpler than most people think.
In this guide, I'll walk you through installing Ollama, configuring it for your hardware, and setting up a web interface so you can actually use it. By the end, you'll have a local LLM server that rivals ChatGPT for most real-world tasks.
Why Ollama? Privacy, Cost, and Control
Before I dive into the technical setup, let me be clear about why this matters. Every prompt you send to ChatGPT, Claude, or Gemini is logged. You're paying per token, and usage adds up quickly if you're iterating on ideas, running analysis, or working with code generation. Ollama changes that equation entirely.
When I run Llama 2 or Mistral locally, my prompts stay on my hardware. There's no subscription. There's no API bill. The only cost is electricity and the hardware investment—which, if you already have a homelab, is essentially zero marginal cost.
I'm also not bound by rate limits or context windows that larger providers enforce. I can run 70-billion-parameter models if my hardware supports it. I can fine-tune models on my own data. I can run inference 24/7 without worrying about someone's terms of service.
Hardware Requirements: What You Actually Need
Here's the honest truth: Ollama works on surprisingly modest hardware. I started on an older Ryzen 5 with 32GB of RAM, and it was usable. But here's what I learned through trial and error:
- CPU inference: 8+ CPU cores, 16GB RAM minimum. Slow but works. Expect 2–5 tokens per second on smaller models.
- GPU acceleration (better): NVIDIA GPU with 6GB+ VRAM. Quadro P2000 and up, or RTX series. 10–50 tokens per second depending on model size.
- Production sweet spot: RTX 3070 or better (8GB+ VRAM), 32GB system RAM, modern CPU. This gives me 50–100+ tokens per second on Mistral, which feels instant.
If you don't have a GPU, don't be discouraged. CPU inference is viable for smaller models like Mistral 7B or Phi 3. It's slower, but it works.
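A handy back-of-the-envelope check before pulling a model: memory use is roughly parameters × bits-per-weight, plus runtime overhead. This is my own rule of thumb, not an official Ollama formula, but it shows why 7B models at 4-bit quantization fit comfortably in the VRAM ranges above:

```python
# Rough memory estimate for a quantized model: parameters × bits-per-weight,
# plus ~20% overhead for the KV cache and runtime buffers. Rule of thumb only.

def estimate_model_gb(params_billions: float, bits_per_weight: int = 4,
                      overhead: float = 1.2) -> float:
    """Approximate memory footprint in GB for a quantized model."""
    bytes_total = params_billions * 1e9 * (bits_per_weight / 8)
    return round(bytes_total * overhead / 1e9, 1)

for name, params in [("Mistral 7B", 7), ("Llama 2 13B", 13), ("Llama 2 70B", 70)]:
    print(f"{name}: ~{estimate_model_gb(params)} GB at 4-bit")
```

By this estimate a 7B model at 4-bit needs about 4.2GB, which matches why a 6GB card is the practical floor for GPU inference.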
Installing Ollama: The Simple Way
I recommend running Ollama in Docker rather than installing it directly on your host. This keeps your system clean and makes updates painless. Here's what I do:
```shell
mkdir -p ~/ollama-setup
cd ~/ollama-setup

# Create a docker-compose.yml file
cat > docker-compose.yml << 'EOF'
version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-server
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped
    # Uncomment the lines below if you have an NVIDIA GPU
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]

volumes:
  ollama_data:
EOF

# Start the service
docker-compose up -d
```
That's it. Ollama is now running on port 11434. You can check if it's working:
```shell
curl http://localhost:11434/api/tags
```
If you see an empty list of models, that's correct—we haven't downloaded any yet.
Pulling Your First Model
Ollama has a model library similar to Docker Hub. I typically start with Mistral 7B because it's fast, smart, and only 4.1GB:
```shell
# Pull the model (this downloads ~4GB)
docker exec ollama-server ollama pull mistral

# Run a quick test
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral",
    "prompt": "What is Ollama?",
    "stream": false
  }' | jq .response
```
The first time you do this, it'll take a few minutes. Ollama is downloading the model weights. After that, subsequent queries are instant.
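You can hit the same endpoint from a script instead of curl. Here's a minimal Python sketch using only the standard library; `build_generate_request` and `generate` are helper names I've made up for illustration, and the URL assumes the default port from the compose file above:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # adjust if your server lives elsewhere

def build_generate_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's /api/generate endpoint; stream=False returns
    a single JSON object instead of newline-delimited chunks."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """POST a prompt to the local Ollama server and return the response text."""
    payload = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate", data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

print(build_generate_request("mistral", "What is Ollama?"))
# print(generate("mistral", "What is Ollama?"))  # requires a running server
```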
Other models I've tested and recommend:
- Mistral 7B: My daily driver. Fast, coherent, 4.1GB. Best overall for homelabs.
- Llama 2 7B: Solid all-rounder, slightly slower than Mistral. 3.8GB.
- Neural Chat 7B: Optimized for conversation. Smaller footprint, good for resource-constrained setups.
- Mixtral 8x7B: If you have the VRAM, exceptional quality. 45GB total. Overkill for most homelabs.
Pro tip: keep the ollama_data volume on fast storage (NVMe if possible).

Adding a Web Interface with Open WebUI
The curl method works, but it's not user-friendly. I use Open WebUI (formerly Ollama WebUI) as my interface. One caveat: don't just append to the end of docker-compose.yml, because the new service needs to sit under the existing `services:` key and the new volume under `volumes:`. Edit the file so the full layout looks like this:

```yaml
services:
  ollama:
    # ... the ollama service from above, unchanged ...
  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_API_BASE_URL=http://ollama-server:11434/api
    depends_on:
      - ollama
    volumes:
      - webui_data:/app/backend/data
    restart: unless-stopped

volumes:
  ollama_data:
  webui_data:
```

Then bring the stack up:

```shell
docker-compose up -d
```
Now visit http://localhost:3000 in your browser. You'll see a ChatGPT-like interface. Create an account (it's local, just for access control), select your model from the dropdown, and start chatting.
This is where Ollama really shines. You get a polished interface without paying API fees, without rate limits, without anyone tracking your conversations.
GPU Acceleration (NVIDIA)
If you have an NVIDIA GPU, enable it. I saw a 20x speed improvement when I added my RTX 3070. First, install the NVIDIA Container Toolkit on your host (the older nvidia-docker2 package and its apt-key based repo are deprecated):

```shell
# For Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Test GPU access
docker run --rm --gpus all nvidia/cuda:11.8.0-runtime-ubuntu22.04 nvidia-smi
```
Then uncomment the GPU section in your docker-compose.yml (I showed it above), and restart:
```shell
docker-compose down
docker-compose up -d
```
Check the logs to confirm GPU is loaded:
```shell
docker logs ollama-server | grep -i gpu
```
You should see something like "NVIDIA GPU detected" in the output. Now your models will run at full speed.
Integrating Ollama with Your Existing Stack
The real power of Ollama is using it in your other self-hosted services. I've integrated it with:
- Nextcloud with AI features: Use Ollama for smart document summarization and tagging.
- Home automation: Process voice commands with transcription + Ollama inference.
- Personal wiki: Auto-generate table of contents, search summaries, and Q&A sections.
- Chat applications: Any app that calls an OpenAI-compatible endpoint can use Ollama. The API is drop-in compatible.
Most integrations use the standard OpenAI API format. Point them to http://localhost:11434/v1/ (note the `/v1/` endpoint), and they'll work as if you're using GPT.
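As a sketch of what "drop-in compatible" means in practice, here's the shape of a request to Ollama's OpenAI-style endpoint. The `chat_completion_request` helper is my own name for illustration, and the base URL assumes the default local setup:

```python
import json

# Ollama's /v1 routes accept the same JSON shape as OpenAI's chat completions
# API, so anything that speaks that format can be pointed at this base URL.
OLLAMA_OPENAI_BASE = "http://localhost:11434/v1"

def chat_completion_request(model: str, user_message: str) -> tuple[str, bytes]:
    """Return (url, body) for an OpenAI-compatible chat completion call."""
    url = f"{OLLAMA_OPENAI_BASE}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }).encode()
    return url, body

url, body = chat_completion_request("mistral", "Summarize my notes.")
print(url)
```

Client libraries that insist on an API key will accept any placeholder value, since Ollama ignores it.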
Storage and Performance Tuning
I made one mistake early on: storing ollama_data on a spinning hard drive. Model loading was painful. Moved it to an NVMe SSD, and startup time dropped from 30 seconds to 3 seconds.
Also, monitor VRAM usage. If your GPU doesn't have enough memory, Ollama falls back to CPU inference silently (and slowly). I check this regularly:
```shell
docker exec ollama-server nvidia-smi
```
If VRAM is maxed out and performance tanks, either reduce your model size or add more VRAM.
Backing Up Your Models
Your models are in the ollama_data Docker volume. Back it up like any other important data:
```shell
# Create a backup
docker run --rm -v ollama_data:/data -v $(pwd):/backup alpine \
  tar czf /backup/ollama_backup.tar.gz -C /data .

# Restore if needed
docker run --rm -v ollama_data:/data -v $(pwd):/backup alpine \
  tar xzf /backup/ollama_backup.tar.gz -C /data
```
This way, if you need to rebuild your system, you don't have to redownload 20GB+ of models.
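It's also worth verifying a backup before you actually need it. This Python sketch shows the idea; the `verify_backup` helper is my own, demonstrated on a throwaway archive rather than your real ollama_backup.tar.gz:

```python
import tarfile
import tempfile
from pathlib import Path

def verify_backup(archive: str) -> list[str]:
    """List members of a backup tarball, raising if the archive is corrupt."""
    with tarfile.open(archive, "r:gz") as tar:
        return [m.name for m in tar.getmembers()]

# Demo against a throwaway archive standing in for ollama_backup.tar.gz:
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "models"
    data.mkdir()
    (data / "manifest.txt").write_text("mistral\n")
    archive = Path(tmp) / "backup.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(data, arcname=".")
    members = verify_backup(str(archive))

print(members)
```

The same listing check works on the real archive with `tar tzf ollama_backup.tar.gz`.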
Wrapping Up
Ollama has given me independence from OpenAI, privacy by default, and zero API costs. The setup takes maybe 15 minutes, and it integrates seamlessly into any homelab. Whether you're building a personal AI assistant, automating workflows, or just experimenting with LLMs, running them locally is now genuinely practical.
Start with Mistral 7B, add Open WebUI for a polished interface, and if you have GPU hardware, enable it for instant results. From there, you can layer it into the rest of your self-hosted ecosystem.
Next steps: Set up Ollama today, then explore integrating it with Open WebUI (which I covered above). Once you're comfortable, try adding it to a personal project—maybe a smart document summarizer or a chat bot for your homelab documentation.