Installing and Configuring Ollama Locally: A Complete Step-by-Step Guide
I've been running Ollama on my homelab for months now, and it's transformed how I think about AI infrastructure. Instead of querying OpenAI's API and paying per token, I'm running Mistral, Llama 2, and other powerful open-source models right on my hardware—completely offline, completely private. This guide covers everything I learned the hard way: installation, GPU acceleration, Docker integration, and real production gotchas.
Why Run Ollama Locally?
Before diving into setup, let me be honest about why this matters. Cloud APIs are expensive at scale. If you're experimenting with LLMs for automation, local testing, or privacy-sensitive work, running Ollama locally makes sense. I prefer this approach because:
- Zero inference costs — Once you download a model, running it is free.
- Complete privacy — Your prompts never leave your network.
- Instant responses — No API latency, just your hardware's speed.
- Fine-tuning potential — You can customize models for your use case.
The trade-off? You need decent hardware. A modern CPU works, but a GPU (Nvidia or AMD) dramatically improves performance. I'm using a used RTX 3060 in my setup, which cost about £150 and handles most models comfortably.
System Requirements
Ollama works on macOS, Linux, and Windows. I prefer Linux (Ubuntu 22.04 LTS) for server deployments, but the process is similar across platforms. Here's what I recommend:
- CPU: 4+ cores (8+ preferred)
- RAM: 16GB minimum (32GB if running larger models simultaneously)
- Disk: 100GB SSD (models vary in size; a large model like Llama 2 70B needs ~40GB)
- GPU (optional but recommended): Nvidia CUDA-capable GPU with 6GB+ VRAM, or AMD GPU
If you're looking to upgrade or add hardware to your homelab, I've had excellent experiences with RackNerd's KVM VPS offerings—they're affordable and powerful enough to run Ollama if you prefer not to use local hardware.
Installing Ollama on Linux
The official installation is straightforward. I prefer the curl method because it handles all dependencies automatically:
curl -fsSL https://ollama.ai/install.sh | sh
This installs the Ollama binary, sets up a systemd service, and creates the ollama user. Verify the installation:
ollama --version
systemctl status ollama
If you're on a system without systemd or prefer manual control, download the binary directly from ollama.ai. The service should start automatically and listen on http://localhost:11434.
To confirm that Ollama detected your GPU, check the service logs with journalctl -u ollama. If using Nvidia, ensure CUDA drivers are installed: nvidia-smi should show your GPU. For AMD GPUs, install ROCm libraries first.
Pulling and Running Your First Model
Ollama uses a model registry similar to Docker Hub. I started with Mistral 7B—small, fast, and surprisingly capable. Pull it with:
ollama pull mistral
The first pull downloads the model (about 4GB for Mistral 7B). Subsequent pulls are instant since Ollama caches layers. List your models:
ollama list
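If you'd rather check this over HTTP, the same information is exposed at the /api/tags endpoint. A small sketch using only the standard library (assumes the service is listening on the default port):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama listen address

def model_names(tags: dict) -> list:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags.get("models", [])]

def list_local_models(base_url: str = OLLAMA_URL) -> list:
    """Fetch the pulled models over HTTP, the API equivalent of `ollama list`."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return model_names(json.load(resp))
```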
Now run an interactive session:
ollama run mistral
Type your prompt and press Enter. The first response will be slower (models are loaded into memory), but subsequent responses are faster. Exit with Ctrl+D.
Running Ollama in Docker for Production
For my homelab, I containerize Ollama because it plays nicely with Docker Compose, reverse proxies, and monitoring. Here's my production setup:
version: '3.9'
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped
volumes:
  ollama_data:
I include Open WebUI (the web interface) because it's far better than raw API calls. Save this as docker-compose.yml and start:
docker-compose up -d
Verify both services are running:
docker-compose ps
Access Open WebUI at http://localhost:3000. On first load, create a user account. The web interface lets you select from pulled models, adjust temperature and context length, and manage conversation history—all locally.
To pass your GPU through to the container, install the NVIDIA Container Toolkit (formerly nvidia-docker) and update your /etc/docker/daemon.json to use the Nvidia runtime.
GPU Optimization and VRAM Management
Running models efficiently is about understanding VRAM. A 7B parameter model needs roughly 14GB of VRAM in fp16 precision (half-precision floats). My RTX 3060 has 12GB, so I use quantized models. Ollama handles this automatically—when you pull a model, it downloads the quantized version by default, which reduces size and VRAM usage with minimal quality loss.
Check available VRAM:
nvidia-smi
If a model won't load, Ollama will use system RAM (CPU inference), which is much slower. For my setup, I keep models under 13GB and stick to 7B or 13B parameter sizes. If you want larger models like Llama 2 70B, you'll need more VRAM or a more powerful GPU.
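You can sanity-check whether a model will fit before pulling it. Here's a back-of-the-envelope estimate from parameter count and quantization level; the bytes-per-parameter figures are rough rules of thumb, not exact Ollama numbers:

```python
# Rough VRAM estimate: parameters * bytes-per-parameter, plus ~20% overhead
# for the KV cache and activations. Approximations for planning only.
BYTES_PER_PARAM = {
    "fp16": 2.0,   # half precision
    "q8_0": 1.0,   # 8-bit quantization
    "q4_0": 0.5,   # 4-bit quantization (a common Ollama default)
}

def estimate_vram_gb(params_billions: float, quant: str = "q4_0",
                     overhead: float = 1.2) -> float:
    """Return an approximate VRAM requirement in GB."""
    bytes_total = params_billions * 1e9 * BYTES_PER_PARAM[quant]
    return round(bytes_total * overhead / 1e9, 1)

if __name__ == "__main__":
    for quant in ("fp16", "q8_0", "q4_0"):
        print(f"7B at {quant}: ~{estimate_vram_gb(7, quant)} GB")
```

This is why a 7B model that needs ~14GB in fp16 drops comfortably under my 12GB card at 4-bit quantization.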
To manually control which GPU is used, set environment variables before running:
export CUDA_VISIBLE_DEVICES=0
ollama run mistral
API Integration and Programmatic Access
The real power emerges when you integrate Ollama into scripts or applications. Ollama has its own native API, and recent versions also expose an OpenAI-compatible endpoint under /v1, so switching an existing OpenAI client over usually needs only a base URL change. Here's a Python example using the native /api/generate endpoint:
#!/usr/bin/env python3
import requests

def query_ollama(prompt, model="mistral"):
    url = "http://localhost:11434/api/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Sampling parameters belong inside "options" in the native API;
        # top-level temperature/top_p keys are ignored.
        "options": {
            "temperature": 0.7,
            "top_p": 0.9,
        },
    }
    response = requests.post(url, json=payload)
    if response.status_code == 200:
        data = response.json()
        return data.get("response", "")
    print(f"Error: {response.status_code}")
    return None

if __name__ == "__main__":
    result = query_ollama("Explain quantum computing in 50 words")
    print(result)
This script sends a prompt to Ollama and retrieves the response. Perfect for automation, chatbots, or analysis pipelines. I use similar scripts for summarizing logs, generating documentation, and testing NLP workflows—all without API costs.
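For long generations, set "stream": True instead and the API returns newline-delimited JSON fragments as they are produced. A small helper to reassemble them, assuming the /api/generate streaming schema where each line carries a "response" fragment and the final line has "done": true:

```python
import json

def collect_stream(lines):
    """Assemble the full response text from Ollama's streaming output.

    Accepts any iterable of JSON lines, e.g. the result of
    requests.post(url, json=payload, stream=True).iter_lines().
    """
    parts = []
    for line in lines:
        if not line:
            continue  # skip keep-alive blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```

Streaming makes interactive tools feel responsive: you can print each fragment as it arrives instead of waiting 30 seconds for the full completion.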
Model Selection for Your Use Case
Ollama hosts dozens of models. Here's what I've found works best:
- Mistral 7B: Best balance of speed and quality. 4GB, responds in 1-2 seconds on CPU.
- Neural Chat 7B: Optimized for conversation, slightly slower but more natural responses.
- Llama 2 7B/13B: Great for tasks, slightly slower than Mistral but excellent output quality.
- Code Llama: Specialized for code generation and explanation. Requires 13GB+ VRAM.
- Phi-2 (2.7B): Ultra-lightweight, runs on minimal hardware, surprisingly capable.
For my homelab, I keep three models: Mistral (general purpose), Code Llama (development), and Phi (embedded/lightweight tasks). Mix and match based on your hardware and workload.
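In my own scripts I route requests to whichever model fits the task. A trivial dispatcher along those lines — the model names are whatever `ollama list` shows on your machine, so adjust to taste:

```python
# Map task categories to locally pulled models. These names are examples;
# substitute whatever `ollama list` reports on your host.
MODELS = {
    "general": "mistral",
    "code": "codellama",
    "light": "phi",
}

def pick_model(task: str) -> str:
    """Pick a local model for a task category, falling back to general."""
    return MODELS.get(task, MODELS["general"])
```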
Connecting Through Caddy Reverse Proxy
I expose Ollama through Caddy for HTTPS and load balancing across my homelab services. Here's my Caddyfile configuration:
ollama.home.local {
    reverse_proxy localhost:11434 {
        header_up X-Forwarded-For {http.request.remote.host}
        header_up X-Forwarded-Proto {http.request.proto}
    }
}

webui.home.local {
    reverse_proxy localhost:3000
}
Now I access Ollama through https://ollama.home.local with automatic HTTPS certificates (via Caddy's internal CA). This keeps my API calls encrypted and allows me to layer authentication with Authelia if needed.
Persistence and Model Caching
By default, Ollama stores models in ~/.ollama. When running in Docker, mount this directory to persist models across container restarts:
volumes:
  - /data/ollama:/root/.ollama
This way, if you restart the container, your models are already cached and don't need to be re-downloaded. I keep my models on a fast SSD to minimize loading time.
Monitoring and Resource Limits
In Docker Compose, set resource limits to prevent Ollama from consuming all system resources:
deploy:
  resources:
    limits:
      cpus: '6'
      memory: 24G
    reservations:
      cpus: '4'
      memory: 16G
This ensures Ollama doesn't starve other services. Monitor actual usage with docker stats and adjust limits based on your observations.
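If you want to watch usage programmatically rather than eyeballing docker stats, the `--format '{{json .}}'` flag makes it emit one JSON object per container. A small parser for the memory column (field names match Docker's stats template; assumes the usual GiB/MiB units):

```python
import json

def parse_mem_usage(stats_line: str):
    """Parse one line of `docker stats --no-stream --format '{{json .}}'`
    and return (container_name, used_gib, limit_gib)."""
    UNITS = {"GiB": 1.0, "MiB": 1.0 / 1024}
    row = json.loads(stats_line)
    used, limit = row["MemUsage"].split(" / ")

    def to_gib(value: str) -> float:
        for unit, factor in UNITS.items():
            if value.endswith(unit):
                return float(value[: -len(unit)]) * factor
        raise ValueError(f"unknown unit in {value!r}")

    return row["Name"], to_gib(used), to_gib(limit)
```

Feed each line to this function from a cron job and you can alert when Ollama creeps toward its memory limit.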
Troubleshooting Common Issues
Model won't load / OOM errors: Your model exceeds available VRAM. Switch to a smaller model or quantized version. Ollama handles quantization automatically, but verify with ollama show mistral.
Slow inference: You're likely using CPU inference. Verify the GPU is detected with nvidia-smi. Install CUDA drivers if missing. For AMD, install ROCm.
API timeouts: Increase the timeout in your client code. Large models may take 30+ seconds on first inference.
Port 11434 already in use: Change the port mapping in Docker Compose or manually run Ollama on a different port: OLLAMA_HOST=localhost:11435 ollama serve.
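For the timeout case specifically, I wrap API calls in a retry helper with a generous timeout. A sketch of that pattern — the `post` parameter is injectable purely so the function can be exercised without a running server:

```python
import time
import requests

def query_with_retry(payload, url="http://localhost:11434/api/generate",
                     retries=3, timeout=120, post=requests.post):
    """POST to Ollama with a large timeout and simple exponential backoff.

    First inference on a cold model can take 30+ seconds, hence the
    deliberately generous default timeout.
    """
    for attempt in range(retries):
        try:
            response = post(url, json=payload, timeout=timeout)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # wait 1s, then 2s, between attempts
```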
Next Steps: Integration and Scaling
With Ollama running, you can now:
- Build a local RAG (Retrieval Augmented Generation) pipeline using Ollama + a vector database like Milvus or Weaviate.
- Integrate with Nextcloud or Gitea for document summarization and code review automation.
- Set up multi-GPU inference by running multiple Ollama instances on separate GPUs.
- Deploy Open WebUI with authentication using Authelia for secure team access.
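The RAG idea in the first bullet boils down to ranking documents by embedding similarity. A minimal retrieval step, assuming the embeddings have already been computed — in a real pipeline they would come from Ollama's /api/embeddings endpoint and live in a vector database like Milvus or Weaviate:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, docs):
    """Rank (text, embedding) pairs by similarity to the query vector.

    The top results would then be stuffed into the model's prompt as
    context before calling /api/generate.
    """
    return sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
```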
I'm working on all of these in my homelab, and the cost savings compared to cloud APIs are significant. If you need additional server capacity for hybrid deployments (edge inference + cloud fallback), RackNerd's VPS options offer excellent value.
The future of AI in homelab infrastructure isn't waiting for cloud providers—it's running locally, privately, and affordably on hardware you control.