Installing and Configuring Ollama for Local LLM Inference on Your Homelab
Running large language models locally on your homelab transforms how you interact with AI—no cloud subscriptions, no API rate limits, and complete data privacy. I've spent the last two months configuring Ollama across various hardware setups, and I want to walk you through the entire process from installation through optimization. Whether you're starting with a Raspberry Pi or running a full GPU-accelerated server, this guide will get Ollama running reliably.
Why Run Ollama Locally?
When I first started experimenting with local LLMs, my concern was performance. But Ollama changed that equation entirely. Instead of sending every query to OpenAI or Anthropic, I can now run Mistral, Llama 2, and other models right on my own hardware. The privacy angle matters too—no chat histories floating in corporate databases, no third-party training on your queries.
The cost analysis is straightforward: a $40/year VPS from RackNerd paired with a used GPU-equipped server can replace a steady stream of metered API calls. For my documentation generation and code review workflows, I've already saved hundreds of dollars in API costs.
Checking Your Hardware Requirements
Ollama is surprisingly flexible. I've successfully deployed it on:
- Bare metal Linux (x86_64): My primary workstation with 32GB RAM and RTX 3060 Ti
- Docker containers: Running on both CPU and GPU configurations
- Older hardware: Even a 4GB-RAM Raspberry Pi can run small quantized models (slowly)—though 7B models realistically want more memory than that
For practical use, I recommend at least 8GB of system RAM. If you're using GPU acceleration (which I strongly suggest), your VRAM directly limits which models run smoothly. A 6GB GPU can handle Mistral 7B at 4-bit quantization, while 8GB opens up Llama 2 13B. For larger models like Mixtral 8x7B, plan on 24GB of VRAM—or accept splitting layers between GPU and CPU, which costs speed.
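If you're not sure what your card offers, a quick query tells you (NVIDIA shown here; AMD users would reach for `rocm-smi` instead). The fallback message is just for machines without a GPU:

```shell
# Report GPU model and total VRAM before picking a model size (NVIDIA)
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
else
  echo "no NVIDIA GPU detected"
fi
```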
Installing Ollama on Linux
Installation is refreshingly simple. The official installer handles nearly everything, but I always prefer understanding what goes where on my systems.
# Download and run the official installer
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
# Start the Ollama service (the installer usually enables it for you)
sudo systemctl start ollama
sudo systemctl enable ollama
# Check status
systemctl status ollama
The installer creates a systemd service that runs Ollama as the `ollama` user. It listens on `127.0.0.1:11434` by default, which is perfect for local development. The service includes automatic restart on failure, which matters if you're running this 24/7 like I do.
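A quick way to confirm the service is answering: Ollama's root endpoint returns a short plain-text banner ("Ollama is running") when the server is healthy.

```shell
# Hit the root endpoint; a healthy server replies with a short banner
curl -s --max-time 2 http://127.0.0.1:11434/ || echo "service not reachable"
```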
If you want to access Ollama from other machines on your network, you'll need to modify the configuration. I create a small systemd override:
# Create override directory
sudo mkdir -p /etc/systemd/system/ollama.service.d/
# Create environment override file
cat << 'EOF' | sudo tee /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF
# Reload and restart
sudo systemctl daemon-reload
sudo systemctl restart ollama
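After the restart, it's worth confirming the override actually took effect before debugging firewalls. The LAN address below is a placeholder—substitute your server's real IP.

```shell
# Confirm systemd picked up the environment override
systemctl show ollama --property=Environment

# Then, from another machine on the LAN (192.168.1.50 is a placeholder):
curl -s --max-time 2 http://192.168.1.50:11434/ || echo "not reachable from LAN"
```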
Installing Your First Model
Now that the service is running, let's download and run a model. I start everyone with Mistral 7B—it's small enough to fit on modest hardware but smart enough for real work.
# Pull the Mistral model (first run ~4GB download)
ollama pull mistral
# Run it interactively
ollama run mistral
# Example prompt:
# >>> Explain Docker in one sentence.
# Docker packages applications with all dependencies into isolated containers.
The first pull downloads the model's quantized weights—Ollama uses the GGUF format, which allows efficient inference on consumer hardware. The models live in `~/.ollama/models/` when you run Ollama as your own user, or under `/usr/share/ollama/.ollama/models/` when it runs as the systemd service.
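To see what's on disk at any point—useful once you start hoarding models—`ollama list` shows every pulled model with its tag and size:

```shell
# List pulled models with their tags and on-disk sizes
ollama list 2>/dev/null || echo "ollama not installed"
```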
For a more production-like setup, I run the model in the background and interact via API:
# Start Ollama service (already running if you enabled systemd)
# Make an API request
curl -X POST http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "What is self-hosting?",
"stream": false
}'
The API returns JSON with the generated text. Perfect for automation—I've hooked this into my documentation system to auto-generate answers to common questions.
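If you only want the generated text rather than the whole JSON envelope, piping through `jq` keeps automation scripts tidy (this assumes `jq` is installed; it isn't part of Ollama):

```shell
# Extract just the "response" field from the API reply
curl -s http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "What is self-hosting?",
  "stream": false
}' | jq -r '.response'
```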
Running Ollama in Docker
If you prefer containerized deployments (and honestly, most of my production stuff runs in containers), Ollama has excellent Docker support.
# GPU-accelerated version (NVIDIA)
docker run -d \
--name ollama \
--gpus all \
-p 11434:11434 \
-v ollama:/root/.ollama \
ollama/ollama:latest
# CPU-only version
docker run -d \
--name ollama \
-p 11434:11434 \
-v ollama:/root/.ollama \
ollama/ollama:latest
# Pull a model inside the container
docker exec ollama ollama pull mistral
# Test the API
curl -X POST http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "Hello"
}'
I prefer this approach because it isolates Ollama from my system dependencies. The volume mount persists models across container restarts, and the port forwarding works seamlessly with reverse proxies.
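One thing worth checking with the GPU variant: that the container actually sees the card. The NVIDIA container toolkit injects `nvidia-smi` into the container, and the server log reports what it detected at startup.

```shell
# List GPUs visible from inside the container
docker exec ollama nvidia-smi -L 2>/dev/null || echo "container or GPU not available"

# The startup log also records detected GPUs and memory
docker logs ollama 2>&1 | grep -i gpu | head -n 5
```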
Optimizing Performance and VRAM Usage
Running LLMs locally means managing memory carefully. Ollama has several parameters I tune based on hardware:
A note on which knob lives where: flash attention is toggled by an environment variable on the server, while thread count and GPU layer count (`num_thread`, `num_gpu`) are per-model options—you pass them in the API request or bake them into a Modelfile, not the environment.
# Enable flash attention on the server (if your hardware supports it)
export OLLAMA_FLASH_ATTENTION=1
ollama serve
# Or within the systemd service:
# Edit /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
# Pass num_thread / num_gpu per request via the options object:
curl http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "Hello",
"stream": false,
"options": { "num_thread": 8, "num_gpu": 28 }
}'
The `num_gpu` option controls how many layers of the model run on GPU versus CPU. On my RTX 3060 Ti with 8GB VRAM, I set this to 25-30 for Mistral, which keeps VRAM around 6GB while maintaining responsiveness.
I monitor resource usage with a simple script that runs periodically:
#!/bin/bash
# ollama-monitor.sh
while true; do
echo "=== $(date) ==="
ps -C ollama -o pid=,pcpu=,pmem=,rss= | \
awk '{print "PID: "$1" CPU: "$2"% MEM: "$3"% RSS: "$4"KB"}'
# GPU stats (NVIDIA); note -F', ' so awk splits the CSV correctly
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total \
--format=csv,nounits,noheader | \
awk -F', ' '{print "GPU Util: "$1"% | VRAM: "$3"MB / "$4"MB ("$2"%)"}'
sleep 5
done
Integrating Ollama with Open WebUI
The raw API works, but for daily use, I run Open WebUI—a beautiful ChatGPT-like interface that talks to Ollama.
# Using Docker Compose
cat << 'EOF' > docker-compose.yml
version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_FLASH_ATTENTION=1
    restart: unless-stopped
    networks:
      - ollama-net

  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    volumes:
      - webui_data:/app/backend/data
    restart: unless-stopped
    networks:
      - ollama-net

volumes:
  ollama_data:
  webui_data:

networks:
  ollama-net:
    driver: bridge
EOF
# Start the stack
docker-compose up -d
Now visit `http://localhost:3000` and you'll see a clean interface. Create an account (stored locally), select your model, and start chatting. This is exactly how I prototype before moving queries into production code.
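Both images move quickly, so I refresh the stack every few weeks. The named volumes mean models and chat history survive the recreate:

```shell
# Pull newer images and recreate only the containers that changed
docker compose pull
docker compose up -d
```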
Securing Ollama on Your Network
If you're exposing Ollama to your home network, I strongly recommend authentication. I use Caddy as a reverse proxy with basic auth:
# Caddy configuration (save as Caddyfile)
ollama.home.local {
    tls internal
    basic_auth {
        # Caddy requires a bcrypt hash, not a plaintext password.
        # Generate one with: caddy hash-password
        user <bcrypt-hash-from-caddy-hash-password>
    }
    reverse_proxy localhost:11434
}
# Start Caddy
caddy start
# Now browse to https://ollama.home.local and enter the credentials at the prompt
For more paranoid setups, I run Ollama behind a full Authelia zero-trust proxy. But for home networks, basic auth over HTTPS is sufficient—for an internal hostname like this, `tls internal` has Caddy issue the certificate from its own local CA.
Next Steps: Moving Beyond Basics
Once Ollama is stable, I recommend:
- Experiment with different models: Try Llama 2, Neural Chat, Orca. Each has different strengths for different tasks.
- Set up monitoring: Use Prometheus to track inference times and resource usage. This data helps you rightsize hardware.
- Create custom model files: Ollama supports Modelfiles that let you tune parameters, system prompts, and temperature for specific workflows.
- Integrate with applications: Hook Ollama into your self-hosted tools—I've added it to my documentation system, note-taking app, and code review workflow.
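As a taste of the Modelfile item above, here's a minimal sketch. The variant name `mistral-review` and the system prompt are just examples—adapt both to your workflow:

```shell
# Build a custom variant with pinned parameters and a system prompt
cat > Modelfile <<'EOF'
FROM mistral
PARAMETER temperature 0.3
PARAMETER num_gpu 28
SYSTEM You are a terse code reviewer. Point out bugs first.
EOF

ollama create mistral-review -f Modelfile
ollama run mistral-review
```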
Running local LLMs requires patience during setup, but once operational, the privacy, cost savings, and creative control are worth it. Start with a single model on modest hardware, get comfortable with the API, then scale up as your needs evolve.