Installing and Configuring Ollama with GPU Acceleration on Ubuntu

Running large language models locally used to require enterprise hardware and deep expertise. Ollama has changed that entirely. I've spent the last few weeks pushing Ollama through its paces on a modest Ubuntu machine with an NVIDIA GPU, and the results are genuinely impressive—inference speeds that rival cloud APIs, no network latency, and complete privacy. This guide walks you through the real installation process, the GPU setup that actually works, and the gotchas I've encountered along the way.

Why GPU Acceleration Matters for Ollama

Ollama can run on CPU alone, but the difference is night and day. A 7B-parameter model on a modest CPU might manage only a few tokens per second, while GPU acceleration via CUDA pushes the same model to tens of tokens per second—fast enough for genuine real-time interaction. I tested this with Mistral 7B on both a 6-core CPU and an older NVIDIA GTX 1070, and the GPU was roughly 15× faster. If you're considering self-hosting an LLM, GPU acceleration isn't optional—it's the difference between a usable system and an academic exercise.

Prerequisites: What You Actually Need

You'll need an NVIDIA GPU (AMD and Intel Arc support is improving, but NVIDIA is the mature path). The minimum I'd recommend is something from the GTX 10-series or newer, with at least 4GB VRAM. I'm using an RTX 3070 with 8GB, which handles 13B models comfortably. Check your GPU's VRAM first—an 8GB card can hold a quantized 7B model with breathing room, but 4GB is tight. You'll also need Ubuntu 20.04 LTS or newer, and about 50GB of free disk space (models themselves range from 3GB to 50GB+).
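Disk space is easy to misjudge, so here's a quick pre-flight sketch. It assumes models will live in the default ~/.ollama location; the VRAM query only works once NVIDIA drivers are installed, which is fine—we handle that in Step 1.

```shell
# Free space on the filesystem holding ~/.ollama (models default there)
df -h ~ | tail -1

# Total VRAM, if the NVIDIA driver is already installed
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.total --format=csv,noheader
else
  echo "nvidia-smi not found yet; drivers get installed in Step 1"
fi
```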

Before we start, verify your GPU:

lspci | grep -i nvidia
nvidia-smi  # If this command exists, you may already have drivers installed

Tip: If nvidia-smi doesn't work, don't worry—we'll install the driver stack as part of this process. A fresh Ubuntu install without any NVIDIA tools is actually cleaner to work with.

Step 1: Install NVIDIA Drivers and CUDA Toolkit

The critical part here is getting CUDA installed correctly. Many guides skip this or hand-wave it. I prefer using apt because it ties CUDA to your system's package manager, making updates painless.

sudo apt update
sudo apt install -y nvidia-driver-550

# Reboot to load the driver
sudo reboot

After reboot, verify the driver loaded:

nvidia-smi

You should see GPU memory and driver version. Now install the CUDA toolkit (12.1 in this guide; check NVIDIA's download page for the current release). The repository path below targets Ubuntu 22.04 (ubuntu2204), so adjust it to match your release:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-1

# Add CUDA to PATH
echo 'export PATH=/usr/local/cuda-12.1/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

Verify CUDA installation:

nvcc --version

You should see "Cuda compilation tools, release 12.1". If this fails, check your PATH or run source ~/.bashrc again.

Watch out: Different NVIDIA driver versions ship with different CUDA compatibility. Driver 550 works with CUDA 12.1+. If you installed an older driver, you may need to downgrade CUDA or upgrade the driver. Always check the NVIDIA compatibility matrix before mixing versions.
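One way to check this before mixing versions: the "CUDA Version" printed in nvidia-smi's banner is the newest CUDA release the installed driver can serve, not what's actually installed. A small sketch to pull both values out:

```shell
# Driver version as reported by the kernel module
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Maximum CUDA version this driver supports (from the banner)
nvidia-smi | grep -o 'CUDA Version: [0-9.]*'
```

If the banner's CUDA version is lower than the toolkit you plan to install, upgrade the driver first.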

Step 2: Install Ollama

Ollama's installation is refreshingly simple. Their official installer handles most of the heavy lifting:

curl -fsSL https://ollama.com/install.sh | sh

This creates an `ollama` systemd service. Start it immediately:

sudo systemctl start ollama
sudo systemctl enable ollama

Check that it's running and listening on port 11434:

curl http://localhost:11434/api/tags

You should get back a JSON response with an empty models list. If you get a connection refused error, the service didn't start. Check logs with sudo journalctl -u ollama -n 50.

Step 3: Verify GPU Acceleration is Active

This is where many people fail—Ollama installs fine, but GPU acceleration silently fails and nobody notices until they're frustrated. Let's verify it's working. Pull a small model first:

ollama pull mistral

This downloads Mistral 7B (about 4.1GB). While it's downloading, open another terminal and run:

watch -n 1 nvidia-smi

This shows your GPU stats every second. During the download and model loading, you should see GPU memory filling up and utilization increasing. If you see all zeros, GPU acceleration isn't working.

Once Mistral finishes downloading, prompt it:

ollama run mistral "What is the capital of France?"

Watch nvidia-smi in the other terminal. You should see GPU utilization spike to 80–100% and memory increase. If GPU memory stays flat, the model is running on CPU. In that case, check:

# Check if CUDA libraries are accessible
ldconfig -p | grep libcuda

If nothing shows up, CUDA isn't properly installed. Go back and verify the CUDA installation steps.
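Newer Ollama builds can also tell you directly where a loaded model landed: `ollama ps` prints a PROCESSOR column such as `100% GPU` (the exact column layout may vary by version). A small sketch of that check:

```shell
# Load the model, then see where Ollama placed it
ollama run mistral "hello" >/dev/null
ollama ps   # look at the PROCESSOR column

# Scriptable version of the same check
if ollama ps | grep -q '100% GPU'; then
  echo "model is fully on the GPU"
else
  echo "model is partly or fully on CPU"
fi
```

A split like `40%/60% CPU/GPU` means the model didn't fit entirely in VRAM and part of it is running on the CPU.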

Step 4: Pull and Configure Models

Ollama's library includes everything from tiny 1B models to massive 70B models. I recommend starting with Mistral 7B (solid reasoning, good speed) or Neural Chat 7B (fine-tuned for conversation). On a 4GB GPU, even a 4-bit quantized 7B model is a tight fit, so stick to aggressive quantization or smaller models. For 8GB+, you can try 13B models like Llama 2 13B.

# Pull a few models to test
ollama pull neural-chat
ollama pull neural-chat:7b-v3.2-q4_0  # Quantized variant

The `q4_0` suffix means 4-bit quantization, which shrinks the weights to roughly a quarter of their full-precision size with minimal quality loss. I've tested both, and honestly, the 4-bit versions are impressive—you lose almost nothing in reasoning while gaining major speed.
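The VRAM math is a simple back-of-envelope: weight footprint is roughly parameters × bits / 8 (ignoring the KV cache and runtime overhead, which add another 1–2GB in practice):

```shell
# Approximate weight size in GB for a 7B model at different bit widths.
# 7e9 params * bits / 8 bits-per-byte; 4-bit comes out to ~3.5 GB,
# which is why a q4 7B fits on an 8GB card with room to spare.
for bits in 16 8 4; do
  awk -v b="$bits" 'BEGIN { printf "7B at %2d-bit: ~%.1f GB of weights\n", b, 7 * b / 8 }'
done
```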

To list all pulled models:

ollama list

By default, Ollama stores models in ~/.ollama/models. If your system has multiple disks, you can symlink this directory to a faster drive or NVMe, or point Ollama at a different location with the OLLAMA_MODELS environment variable (used in Step 5).
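A sketch of the symlink approach—stop the service first so nothing writes mid-move, and note that /mnt/nvme is a placeholder for your own fast mount point:

```shell
# Relocate the model store to a faster disk, then symlink it back.
sudo systemctl stop ollama
mv ~/.ollama/models /mnt/nvme/ollama-models   # /mnt/nvme is an example path
ln -s /mnt/nvme/ollama-models ~/.ollama/models
sudo systemctl start ollama
ollama list   # should still show your models
```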

Step 5: Configure Ollama for Your Use Case

Ollama's API listens only on localhost (127.0.0.1:11434) by default, which is secure but limits access. If you want to access it from other machines on your network—or run it on a VPS (around $40/year from providers like RackNerd)—you need to bind it to 0.0.0.0.

Edit the systemd service:

sudo systemctl edit ollama

Add these lines in the [Service] section:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/opt/ollama/models"
Environment="CUDA_VISIBLE_DEVICES=0"

The CUDA_VISIBLE_DEVICES=0 line explicitly pins Ollama to GPU 0 (useful if you have multiple GPUs). Save and reload:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Test the API endpoint from the terminal:

curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral",
    "prompt": "Explain GPU acceleration in one sentence",
    "stream": false
  }'

You'll get back a JSON response with the model's output. If you're accessing from another machine, replace localhost with your machine's IP address.

Watch out: Exposing port 11434 to the internet or untrusted networks is a security risk—anyone can call your API and use your GPU resources. Always use a firewall rule (ufw) to restrict access, or run Ollama behind a reverse proxy like Caddy with authentication enabled.
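For example, with ufw you can restrict the port to your LAN (192.168.1.0/24 is a placeholder—substitute your own subnet):

```shell
# Deny everything inbound by default, then open only what you need.
sudo ufw default deny incoming
sudo ufw allow ssh
sudo ufw allow from 192.168.1.0/24 to any port 11434 proto tcp
sudo ufw enable
sudo ufw status verbose
```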

Performance Tuning and Monitoring

Once everything's running, I monitor two key metrics: VRAM usage and tokens per second. Run a simple prompt and watch nvidia-smi:

watch -n 0.5 nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu,utilization.memory --format=csv,noheader,nounits

A well-tuned Mistral 7B should use 5–6GB VRAM and generate 80–120 tokens/second on an RTX 3070. If you're seeing lower speeds, check for CPU bottlenecks or thermal throttling (watch the power draw and temperature in nvidia-smi). Full GPU memory combined with low token throughput suggests you're hitting VRAM limits and Ollama is offloading part of the model to system RAM.
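You don't have to eyeball throughput, either: the API's non-streaming response includes `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), so you can compute tokens/second directly. A sketch, assuming `jq` is installed:

```shell
# Ask for a completion and compute tokens/second from the response stats.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "mistral", "prompt": "Explain GPU acceleration in one sentence", "stream": false}' \
  | jq '{tokens: .eval_count, tok_per_sec: (.eval_count / .eval_duration * 1e9)}'
```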

For production use on a VPS or persistent system, I also recommend setting up memory limits to prevent OOM kills:

sudo systemctl edit ollama

Add to the [Service] section (note that CPUQuota is per core: 80% caps the service at 0.8 of a single CPU, so use something like 400% if you want to allow four cores):

[Service]
MemoryMax=16G
CPUQuota=80%

Next Steps: Integration and APIs

Now that Ollama's running with GPU acceleration, you can integrate it with frontend tools. Open WebUI is the obvious choice—it's a beautiful, self-hosted ChatGPT alternative that connects directly to Ollama. Deploy it with Docker in minutes (I run it alongside Ollama on the same machine). You can also call Ollama's REST API from any application: Python scripts, Node.js services, even simple bash loops.
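The Docker one-liner for Open WebUI looks like the sketch below; verify the image tag and flags against the project's current README before relying on it. The `host-gateway` mapping lets the container reach Ollama running on the host:

```shell
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
# Then browse to http://localhost:3000; inside the container, Ollama is
# reachable at http://host.docker.internal:11434
```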

For serious homelab setups, consider running Ollama on a dedicated machine (even a used laptop with a decent GPU works) and accessing it from other devices on your network. This separates inference load from your main systems and lets you experiment with different models without affecting production services.

If you're running on a resource-constrained VPS and want remote Ollama access, you can tunnel it securely through Tailscale or a reverse proxy. The API is stateless, so scaling becomes straightforward once you have one instance working smoothly.
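If you'd rather not expose port 11434 at all, a plain SSH tunnel works too (user@vps is a placeholder for your own server):

```shell
# Forward the remote Ollama port to this machine; keep the session open.
ssh -N -L 11434:localhost:11434 user@vps

# In another local terminal, talk to the remote instance as if it were local:
curl http://localhost:11434/api/tags
```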
