Setting Up Ollama Locally: Run Open-Source LLMs on Your Homelab
If you've been curious about running large language models without sending your data to OpenAI or relying on cloud APIs that cost money every month, Ollama is the answer. I've spent the last few weeks running Mistral, Llama 2, and even the new Phi models directly on my homelab hardware, and I can tell you: it actually works, it's surprisingly fast with the right setup, and it changes how you think about local AI.
This guide walks you through installing Ollama, choosing the right model for your hardware, integrating it with Open WebUI for a ChatGPT-like interface, and troubleshooting the gotchas I hit along the way.
Why Run Ollama at Home?
Before we dive into installation, let me be clear about why this matters. Running models locally means:
- No API costs: Zero dollars per token. Run inference as much as you want.
- Complete privacy: Your prompts never leave your network. No data collection, no logs at OpenAI's servers.
- Full control: Fine-tune models, create custom context, run specialized variants without restrictions.
- Offline capability: The model works even if your internet goes down.
The trade-off? You need decent hardware. A modern CPU works, but a GPU with VRAM (8GB minimum, 12GB+ ideal) will give you response times that actually feel snappy. I'm running this on a used RTX 3060 in my homelab, and I get responses in 2–4 seconds for most queries. Without GPU acceleration, expect 10–30 seconds per response depending on the model and your CPU.
Hardware Requirements
Ollama runs on Linux, macOS, and Windows. For a homelab, I strongly prefer Linux—either a dedicated server or inside Docker on an existing machine.
Minimum: 8GB RAM, 4-core CPU, 20GB disk space
Recommended: 16GB+ RAM, 8-core CPU, GPU with 8GB+ VRAM, 50GB+ disk space
GPU support is the big one. If you have an NVIDIA card, Ollama will detect it automatically and use CUDA. AMD users can use ROCm. Intel Arc also works. If you don't have a GPU, Ollama will fall back to CPU inference—slower, but functional for smaller models like Phi or a quantized Mistral.
For CPU-only setups, start with something small like mistral or neural-chat—both are fast and fit on modest hardware.
Installing Ollama
Installation is straightforward. On Linux, grab the installer from ollama.ai (now ollama.com). I prefer the containerized approach because I already run Docker on my homelab, but you can also install it as a system service.
Option 1: Direct Installation (Linux/macOS)
curl -fsSL https://ollama.ai/install.sh | sh
This downloads and installs Ollama, sets up a systemd service, and exposes the API on localhost:11434.
Option 2: Docker (Recommended for Homelab)
If you're already running Docker, this is cleaner. Create a docker-compose.yml:
version: '3.8'
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
environment:
- OLLAMA_HOST=0.0.0.0:11434
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
restart: unless-stopped
volumes:
ollama_data:
Then start it:
docker-compose up -d
The deploy block with the nvidia driver is optional—remove it if you don't have an NVIDIA card and Docker GPU support installed. If you do have an NVIDIA GPU but haven't set up Docker GPU support yet, that's a separate step: install the NVIDIA Container Toolkit first (it replaces the older nvidia-docker wrapper).
Verify Ollama is running:
curl http://localhost:11434/api/tags
You should get a JSON response like {"models":[]}—an empty models array. Good—now we download a model.
Choosing and Running a Model
Ollama's model library is at ollama.com/library. Each model has different sizes and quantizations. The format is modelname:variant. For example:
- mistral:latest – 7B model, optimized for speed
- llama2:13b – 13B model, better reasoning
- neural-chat:7b – Chat-optimized Mistral variant
- phi:latest – Tiny 2.7B model, CPU-friendly
- openchat:latest – Good for conversation
Pull a model (this downloads it and stores it in the volume):
curl http://localhost:11434/api/pull -d '{"name":"mistral"}'
Or if Ollama is installed locally:
ollama pull mistral
This downloads roughly 4GB. Grab a coffee. Once it's done, run a quick test:
ollama run mistral "What is the capital of France?"
Or via API:
curl http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "Why is the sky blue?",
"stream": false
}'
You'll get a JSON response with the generated text. If this works, Ollama is alive and working.
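If you'd rather call the API from code than curl, here's a minimal Python sketch of the same non-streaming request, using only the standard library. The URL and model name match the setup above; adjust them if yours differ.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama API address

def build_generate_payload(model, prompt, stream=False):
    """Build the JSON body that /api/generate expects."""
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(model, prompt):
    """Send a prompt to Ollama and return the 'response' field of the reply."""
    body = json.dumps(build_generate_payload(model, prompt)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example call (requires a running Ollama with mistral pulled):
# print(generate("mistral", "Why is the sky blue?"))
```

Nothing fancy—but it's the same building block every integration later in this post rests on.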
Watch your disk space: models are stored in the Docker volume (the ollama_data directory). If you run out of space mid-download, Ollama may hang. Delete incomplete models with ollama rm modelname and retry.
Add Open WebUI for a Chat Interface
The raw API is powerful, but if you want something that feels like ChatGPT, add Open WebUI. It's a beautiful, self-hosted web interface that connects to Ollama and gives you chat history, model selection, and more.
Update your docker-compose.yml:
version: '3.8'
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
environment:
- OLLAMA_HOST=0.0.0.0:11434
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
restart: unless-stopped
networks:
- webui
open-webui:
image: ghcr.io/open-webui/open-webui:latest
container_name: open-webui
ports:
- "8080:8080"
volumes:
- webui_data:/app/backend/data
environment:
- OLLAMA_API_BASE_URL=http://ollama:11434/api
depends_on:
- ollama
restart: unless-stopped
networks:
- webui
volumes:
ollama_data:
webui_data:
networks:
webui:
driver: bridge
Bring it up:
docker-compose up -d
Open your browser to http://localhost:8080. Sign up (first user becomes admin), and you'll see a ChatGPT-like interface. Select your model from the dropdown and start chatting. The interface includes chat history, model parameters (temperature, context length), and markdown rendering for code blocks.
Performance Tuning and Memory Management
After running this for a few weeks, here's what I learned about keeping it stable:
Context Length: Larger context windows are slower. Start with 2048 tokens and increase only if you need it. The default is usually fine.
Temperature: Lower values (0.3–0.5) give more consistent, focused answers. Higher values (0.8–1.0) make responses more creative but less reliable.
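Both knobs can be set per-request through the API's options field. A sketch, assuming Ollama's standard option keys temperature and num_ctx (the helper function name is mine):

```python
import json

def build_tuned_payload(model, prompt, temperature=0.4, num_ctx=2048):
    """Payload for /api/generate with per-request sampling options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": temperature,  # lower = more focused, consistent answers
            "num_ctx": num_ctx,          # context window size in tokens
        },
    }

# Print the JSON body you would POST to http://localhost:11434/api/generate:
print(json.dumps(build_tuned_payload("mistral", "Summarize my notes")))
```

Options passed this way override the model's defaults for that single request, which makes experimenting cheap.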
Keep Alive: By default, Ollama unloads models from VRAM after 5 minutes of inactivity. If you're running multiple models, you can control this with the OLLAMA_KEEP_ALIVE environment variable in Docker. Set it to "24h" to keep models loaded longer (useful if you swap between two models frequently), but this uses more memory.
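In the Docker setup above, that's just one extra line in the ollama service's environment block (a sketch—24h is an example value):

```yaml
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_KEEP_ALIVE=24h   # keep loaded models in memory for 24 hours
```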
GPU Memory: Monitor with nvidia-smi. If models are slow, check whether VRAM is maxed out. Smaller quantizations (3-bit or 4-bit) use less memory than 8-bit or full-precision weights.
Integration with Other Services
Once Ollama is running, you can integrate it into other homelab apps:
- ChatGPT-style UI: Use Open WebUI (covered above).
- Discord Bot: Query Ollama from a Discord bot using the API endpoint.
- Nextcloud: Use plugins that call Ollama for text summarization or autocomplete.
- Smart Home: Feed Ollama prompts from Home Assistant for natural language commands.
- Documentation: Build a local search tool that uses embeddings from Ollama to find similar docs.
The API is simple REST—any app that can make HTTP requests can use Ollama. Just point it to http://ollama:11434/api/generate and send a JSON payload with your model and prompt.
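When you do want token-by-token streaming (set "stream": true), Ollama replies with NDJSON—one JSON object per line, each carrying a chunk of the response. A small Python sketch of the client-side parsing:

```python
import json

def collect_stream(lines):
    """Accumulate 'response' chunks from an NDJSON stream into one string.

    `lines` is any iterable of JSON strings, e.g. the lines of an HTTP
    response body from /api/generate with "stream": true.
    """
    text = []
    for line in lines:
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):  # final object signals the end of the stream
            break
    return "".join(text)

# Simulated stream, shaped like Ollama's output:
sample = [
    '{"response": "Hello", "done": false}',
    '{"response": ", world", "done": false}',
    '{"response": "!", "done": true}',
]
print(collect_stream(sample))  # Hello, world!
```

This is what lets a Discord bot or dashboard show the answer as it's generated instead of waiting for the whole thing.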
Backing Up Your Models
Models live in the Docker volume (ollama_data in the compose file). If you lose this volume, you'll need to re-download everything. I back it up weekly to my NAS:
docker run --rm -v ollama_data:/data -v /mnt/backup:/backup \
alpine tar czf /backup/ollama_$(date +%Y%m%d).tar.gz -C /data .
This creates a compressed archive of all your models. To restore, reverse the process—extract the archive back into the volume (substitute your backup's date):
docker run --rm -v ollama_data:/data -v /mnt/backup:/backup \
alpine tar xzf /backup/ollama_YYYYMMDD.tar.gz -C /data
Troubleshooting Common Issues
Models downloading but never finishing: Check disk space. Ollama needs free space roughly 2x the model size during download. If you're out of space, clear old models or increase your volume.
Slow responses even with GPU: Check if CUDA is actually being used. Run docker logs ollama and look for lines mentioning GPU or CUDA. If you see "CPU only" or no GPU mentions, NVIDIA Docker isn't properly configured. Reinstall the NVIDIA Container Toolkit and verify with docker run --rm --gpus all nvidia/cuda:12.0.0-runtime-ubuntu22.04 nvidia-smi.
Out of memory errors: Reduce context length or switch to a smaller, quantized model. You can also set memory limits on the container in your compose file.
Port 11434 already in use: Change the port mapping in docker-compose (e.g., 11435:11434) and update the Open WebUI environment variable to match.
Next Steps: Going Deeper
Once you have Ollama running smoothly, here are some things I'm exploring next:
- Fine-tuning models on my own data for domain-specific knowledge.
- Running embeddings models alongside generative models for semantic search.
- Building a retrieval-augmented generation (RAG) pipeline that fetches documents from my Nextcloud and feeds them to Ollama.
- Exposing Ollama through a reverse proxy (Caddy) so I can access it securely from outside my network.
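To give a taste of the embeddings idea above: the sketch below assumes Ollama's /api/embeddings endpoint and an embedding-capable model you've pulled; the cosine-similarity helper is the part that does the "find similar docs" comparison.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"

def embed(model, text):
    """Request an embedding vector from Ollama's /api/embeddings endpoint."""
    body = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]

def cosine(a, b):
    """Cosine similarity between two vectors—higher means more similar text."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# Example (requires a running Ollama):
# docs = {name: embed("nomic-embed-text", text) for name, text in corpus.items()}
# then rank docs by cosine(query_vector, doc_vector)
```

Rank your documents by similarity to a query vector and you have the retrieval half of a RAG pipeline.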
For now, this setup gives you a fully private, cost-free, and remarkably capable AI assistant running entirely on your hardware. That's powerful.
If you're serious about self-hosting and don't have good hardware yet, consider a dedicated server. RackNerd offers affordable KVM VPS and dedicated servers worth a look.