Installing and Configuring Ollama for Local LLM Inference

I've been running Ollama on my homelab for the better part of a year now, and honestly, it's transformed how I work with AI. No rate limits, no API costs, no data leaving my network. If you're tired of paying OpenAI per token or waiting for ChatGPT rate limits to reset, local LLM inference is the move. This guide walks you through installation, GPU acceleration, and integrating it with Open WebUI for a ChatGPT-like experience on your own hardware.

Why Run Ollama Locally?

Before we dig into the setup, let me be honest about why this matters. Cloud AI is convenient, but it's expensive and creates privacy concerns. When I moved my AI workloads local, my monthly API costs dropped to essentially zero—just electricity. Ollama gives you access to models like Llama 2, Mistral, Neural Chat, and Code Llama, all running on hardware you control.

The trade-off? Hardware requirements. You need a reasonably modern CPU, and ideally a GPU. I prefer running this on a dedicated machine or a beefy homelab box, not on the same system hosting critical services. If you're looking to run this on a VPS, that's possible too—budget around $40/year for a machine with 8GB RAM and room to grow; providers like RackNerd offer seasonal deals that make this very affordable.

System Requirements and Hardware Considerations

Ollama runs on Linux, macOS, and Windows, but Linux (especially Ubuntu 22.04 or newer) is my preferred target. The main constraints are RAM, GPU VRAM, and disk space: models weigh in at several gigabytes each, and whatever doesn't fit in VRAM spills over to system RAM and runs much slower.

I run Ollama on an old RTX 2070 Super and a Ryzen 5 3600, and it comfortably runs Mistral 7B and quantized Llama 2 13B models. If you only have a CPU, start with smaller models like Mistral 7B or Orca Mini.

Tip: If your current homelab doesn't have headroom, a $40–50/year VPS with 4 vCPU and 8GB RAM from providers like RackNerd (check their NewYear deals) is a cost-effective way to run a dedicated Ollama instance without affecting your home infrastructure.
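Before picking model sizes, it's worth taking a quick inventory of what your box actually has. A minimal sketch (the GPU query is NVIDIA-specific and degrades gracefully if there's no GPU):

```bash
#!/bin/bash
# Quick hardware inventory before choosing model sizes

# CPU core count
nproc

# Total RAM
free -h | awk '/^Mem:/ {print $2}'

# Free disk space on the root filesystem (models are several GB each)
df -h / | awk 'NR==2 {print $4}'

# NVIDIA GPU name and VRAM, if present
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null \
  || echo "No NVIDIA GPU detected"
```

Roughly: 8GB RAM and 4+ cores gets you CPU-only 7B inference; a GPU with 8GB+ VRAM makes it actually pleasant.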

Installation on Ubuntu/Linux

Let's install Ollama. I'll walk through Ubuntu 22.04, but the process is similar on other Debian-based distros.

#!/bin/bash
# Update system packages
sudo apt update && sudo apt upgrade -y

# Install required dependencies
sudo apt install -y curl git jq

# Download and install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Verify installation
ollama --version

# Start Ollama service
sudo systemctl start ollama
sudo systemctl enable ollama

# Check service status
sudo systemctl status ollama

The install script sets up Ollama as a systemd service, so it'll start on boot automatically. You can verify it's running by checking `localhost:11434`:

curl http://localhost:11434/api/tags

You should get a JSON response. If you see `curl: (7) Failed to connect`, the service didn't start. Check logs with:

journalctl -u ollama -n 50 -f

GPU Acceleration Setup

This is where the magic happens. Without GPU acceleration, inference on larger models is painfully slow. I'm an NVIDIA person (CUDA), but I'll cover AMD ROCm briefly too.

NVIDIA CUDA Setup

First, make sure your GPU is detected and has CUDA drivers installed:

nvidia-smi

If that fails, install the NVIDIA drivers:

sudo apt install -y nvidia-driver-545
sudo reboot

# After reboot, verify
nvidia-smi

Next, install CUDA Toolkit (Ollama supports CUDA 11.8+):

#!/bin/bash
# Add NVIDIA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
sudo apt update

# Install CUDA Toolkit
sudo apt install -y cuda-toolkit-12-3

# Add to PATH
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# Verify
nvcc --version

Once CUDA is installed, restart the Ollama service and it'll automatically detect and use your GPU:

sudo systemctl restart ollama

Check that CUDA is being used by pulling a model and watching the GPU:

ollama pull mistral &
# In another terminal
watch nvidia-smi

You should see GPU memory being consumed.

Watch out: If Ollama doesn't detect your GPU after CUDA setup, check that CUDA libraries are in the correct path. Ollama needs `libcuda.so.1` to be findable. You can verify with `ldconfig -p | grep libcuda`.

AMD ROCm Setup

For AMD GPUs (RDNA, RDNA 2, etc.), install ROCm:

wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update
sudo apt install -y rocm-hip-sdk

Then restart Ollama. ROCm support is newer, so if you hit issues, check the Ollama GitHub for your specific GPU model.

Downloading and Running Models

Now comes the fun part. Ollama's model library is hosted at ollama.ai/library. I'll pull a couple of my go-to models:

#!/bin/bash
# Download Mistral 7B (fast, good quality)
ollama pull mistral

# Download Llama 2 13B (larger, slower, higher quality)
ollama pull llama2

# Download a smaller model for quick responses
ollama pull neural-chat

# List downloaded models
ollama list

Each model is quantized to fit in memory. `mistral:latest` is actually `mistral:7b-instruct-q4_K_M`—the `q4_K_M` part means 4-bit quantization, which shrinks the model to roughly a quarter of its full 16-bit size with minimal quality loss. On my RTX 2070 Super, 7B models fit comfortably and quantized 13B models just about squeeze in; with more VRAM, you can run 30B+ models.
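To put rough numbers on that, here's a back-of-envelope VRAM estimate. This is a hypothetical helper I use for napkin math, not an Ollama command: weights take parameters × bits ÷ 8 bytes, plus roughly 20% for KV cache and runtime overhead.

```bash
#!/bin/bash
# estimate_vram_gb: rough VRAM needed for a quantized model.
# Weights take params * bits / 8 bytes; pad ~20% for KV cache and overhead.
estimate_vram_gb() {
  local params_b=$1 bits=$2
  awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.1f\n", p * b / 8 * 1.2 }'
}

estimate_vram_gb 7 4    # Mistral 7B at q4   -> ~4.2 GB
estimate_vram_gb 13 4   # Llama 2 13B at q4  -> ~7.8 GB
estimate_vram_gb 7 16   # 7B unquantized fp16 -> ~16.8 GB
```

That's why an 8GB card handles 7B with room to spare but is right at the edge for 13B at q4.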

Running the Ollama Server and Testing Inference

The Ollama daemon runs on `localhost:11434` by default. Test it directly with the REST API:

#!/bin/bash
# Simple inference request
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "What is the capital of France?",
  "stream": false
}' | jq .response

# Or with streaming (watch the response come in real-time)
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Explain quantum computing in 3 sentences.",
  "stream": true
}'

The API is straightforward JSON. The `stream: true` option gives you tokens as they're generated—useful for interactive applications. Performance depends on your hardware; expect 10–50 tokens/second on a typical home GPU.
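Besides `/api/generate`, Ollama also exposes a chat-style endpoint, `/api/chat`, which takes a messages array of role/content pairs—this is what most chat frontends talk to. A quick sketch (assumes the daemon is running locally; the fallback echo is just so the snippet fails gracefully):

```bash
#!/bin/bash
# Multi-turn chat request against Ollama's /api/chat endpoint
PAYLOAD='{
  "model": "mistral",
  "messages": [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "stream": false
}'

curl -s http://localhost:11434/api/chat -d "$PAYLOAD" \
  || echo "Ollama not reachable on :11434"
```

The system message lets you set behavior per conversation without baking it into a custom model.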

Integrating with Open WebUI for a ChatGPT-Like Experience

Raw API calls are great for automation, but I prefer Open WebUI for interactive use. It's a self-hosted web interface that connects to Ollama and gives you ChatGPT-style chat, document uploads, and more.

I run Open WebUI in Docker (Ollama runs natively on the host). Here's my Docker Compose setup:

version: '3.8'
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_API_BASE_URL=http://host.docker.internal:11434/api
      - WEBUI_SECRET_KEY=your-secret-key-here
    volumes:
      - ./open-webui-data:/app/backend/data
    restart: unless-stopped
    networks:
      - ollama-net

networks:
  ollama-net:
    driver: bridge

On macOS and Windows (Docker Desktop), `host.docker.internal` resolves to the host automatically. On Linux it doesn't by default: either add `extra_hosts: ["host.docker.internal:host-gateway"]` to the service (Docker 20.10+), or point the URL at the host's bridge IP (e.g., `172.17.0.1`) or your LAN IP. If you're unsure which IP to use, check:

ip route show | grep default

Start it with:

docker-compose up -d

Then navigate to `http://localhost:3000`, create an account, and select Mistral or another model. You'll get a full chat interface that feels like ChatGPT.

Performance Tuning and Configuration

Ollama respects environment variables for fine-tuning. Here's what I typically adjust:

#!/bin/bash
# Edit the Ollama systemd service
sudo systemctl edit ollama

# Add these environment variables in the [Service] section:
# Environment="OLLAMA_NUM_PARALLEL=2"
# Environment="OLLAMA_NUM_GPU=1"
# Environment="OLLAMA_MAX_LOADED_MODELS=2"

# Then restart
sudo systemctl restart ollama

OLLAMA_NUM_PARALLEL controls concurrent requests (default 4; reduce if you have limited VRAM). OLLAMA_MAX_LOADED_MODELS limits how many models stay loaded in memory simultaneously—useful if you run multiple models and want to avoid swapping.
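For reference, after `sudo systemctl edit ollama` the override file (typically `/etc/systemd/system/ollama.service.d/override.conf`) ends up looking like this—the values are my starting points for an 8GB card, not gospel:

```ini
[Service]
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
```

Run `sudo systemctl daemon-reload && sudo systemctl restart ollama` after saving for the changes to take effect.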

For remote access, bind Ollama to your LAN IP instead of localhost:

sudo systemctl edit ollama
# Add: Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl restart ollama

I prefer keeping Ollama on localhost and using a reverse proxy (Caddy or Nginx) with authentication for remote access, but the choice is yours.
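For the reverse-proxy route, here's a minimal Caddyfile sketch. The hostname and bcrypt hash are placeholders—generate a real hash with `caddy hash-password`—and the directive is spelled `basicauth` in older Caddy v2 releases:

```
ollama.example.com {
    basic_auth {
        admin <paste-output-of-caddy-hash-password-here>
    }
    reverse_proxy localhost:11434
}
```

Caddy handles TLS automatically for the hostname, so Ollama itself can stay bound to localhost.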

Monitoring and Resource Management

Watch GPU usage while running inference:

watch -n 1 nvidia-smi

Check system memory and disk:

free -h
df -h /path/to/ollama/models

By default, Ollama stores models in `~/.ollama/models`—note that when the install script sets it up as a systemd service, that's the `ollama` user's home, typically `/usr/share/ollama/.ollama/models`. If you're running out of space, move the directory to a larger drive and symlink it:

sudo systemctl stop ollama
sudo mv ~/.ollama/models /mnt/storage/ollama-models
ln -s /mnt/storage/ollama-models ~/.ollama/models
sudo systemctl start ollama
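Alternatively, skip the symlink and point Ollama at the new location directly—`OLLAMA_MODELS` is the environment variable Ollama reads for its model directory. Add it via `sudo systemctl edit ollama`:

```ini
[Service]
Environment="OLLAMA_MODELS=/mnt/storage/ollama-models"
```

Make sure the `ollama` service user can read and write that path, then restart the service.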

Common Issues and Troubleshooting

Ollama service won't start: Check `journalctl -u ollama -n 50`. Common causes: missing CUDA libraries, permission issues, or port 11434 already in use.

GPU not detected: Verify `nvidia-smi` works first. Then restart Ollama. Check logs for "CUDA device not found".

Models download very slowly: Ollama's CDN is usually fast, but if you're throttled, pull one model at a time. Each model downloads once and stays cached.

Inference is slow: This is often normal on CPU. If you have a GPU, verify it's actually being used by watching `nvidia-smi` during inference.