Installing and Configuring Ollama with GPU Acceleration on Ubuntu
Running large language models locally used to require enterprise hardware and deep expertise. Ollama has changed that entirely. I've spent the last few weeks pushing Ollama through its paces on a modest Ubuntu machine with an NVIDIA GPU, and the results are genuinely impressive: inference speeds that rival cloud APIs, no network latency, and complete privacy. This guide walks you through the real installation process, the GPU setup that actually works, and the gotchas I've encountered along the way.
Why GPU Acceleration Matters for Ollama
Ollama can run on CPU alone, but the difference is night and day. A 7B parameter model on a typical desktop CPU might manage only a few tokens per second. With GPU acceleration via CUDA, you can see tens of tokens per second, fast enough for genuine real-time interaction. I tested this with Mistral 7B on both a 6-core CPU and an older NVIDIA GTX 1070, and the GPU was roughly 15× faster. If you're considering self-hosting an LLM, GPU acceleration isn't optional; it's the difference between a usable system and an academic exercise.
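You can measure this on your own hardware once everything below is installed: Ollama's `--verbose` flag prints timing statistics after each response (this assumes you've already pulled the mistral model, which we do later in this guide):

```shell
# Run a short prompt and print timing stats afterwards
ollama run --verbose mistral "Say hello in five words."
# The stats block includes a line like:
#   eval rate:    NN.NN tokens/s
```

The "eval rate" line is the number to compare between CPU-only and GPU runs.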
Prerequisites: What You Actually Need
You'll need an NVIDIA GPU (AMD support via ROCm is maturing and Intel Arc support is experimental, but NVIDIA is the mature path). The minimum I'd recommend is something from the GTX 10-series or newer, with at least 4GB VRAM. I'm using an RTX 3070 with 8GB, which handles 13B models comfortably. Check your GPU's VRAM first: an 8GB card can hold a quantized 7B model with breathing room, but 4GB is tight. You'll also need Ubuntu 20.04 LTS or newer, and about 50GB free disk space (models themselves vary from 3GB to 50GB+).
Before we start, verify your GPU:
lspci | grep -i nvidia
nvidia-smi # If this command exists, you may already have drivers installed
Step 1: Install NVIDIA Drivers and CUDA Toolkit
The critical part here is getting CUDA installed correctly. Many guides skip this or hand-wave it. I prefer using apt because it ties CUDA to your system's package manager, making updates painless.
sudo apt update
sudo apt install -y nvidia-driver-550
# Reboot to load the driver
sudo reboot
After reboot, verify the driver loaded:
nvidia-smi
You should see GPU memory and driver version. Now install the CUDA toolkit; I used 12.1, but check NVIDIA's site for the current release:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-1
# Add CUDA to PATH
echo 'export PATH=/usr/local/cuda-12.1/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
Verify CUDA installation:
nvcc --version
You should see "Cuda compilation tools, release 12.1". If this fails, check your PATH or run source ~/.bashrc again.
Step 2: Install Ollama
Ollama's installation is refreshingly simple. Their official installer handles most of the heavy lifting:
curl -fsSL https://ollama.com/install.sh | sh
This creates an `ollama` systemd service (the installer usually starts it for you). Make sure it's started and enabled:
sudo systemctl start ollama
sudo systemctl enable ollama
Check that it's running and listening on port 11434:
curl http://localhost:11434/api/tags
You should get back a JSON response with an empty models list. If you get a connection refused error, the service didn't start; check logs with `sudo journalctl -u ollama -n 50`.
Step 3: Verify GPU Acceleration is Active
This is where many people fail—Ollama installs fine, but GPU acceleration silently fails and nobody notices until they're frustrated. Let's verify it's working. Pull a small model first:
ollama pull mistral
This downloads Mistral 7B (about 4.1GB). While it's downloading, open another terminal and run:
watch -n 1 nvidia-smi
This shows your GPU stats every second. The download itself won't touch the GPU; memory fills and utilization rises only when the model loads on first run. If nvidia-smi shows all zeros while a prompt is being answered, GPU acceleration isn't working.
Once Mistral finishes downloading, prompt it:
ollama run mistral "What is the capital of France?"
Watch nvidia-smi in the other terminal. You should see GPU utilization spike to 80–100% and memory increase. If GPU memory stays flat, the model is running on CPU. In that case, check:
# Check if CUDA libraries are accessible
ldconfig -p | grep libcuda
If nothing shows up, CUDA isn't properly installed. Go back and verify the CUDA installation steps.
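Ollama can also report this directly: `ollama ps` lists loaded models and where they're running (the exact column layout may vary between versions):

```shell
# While a model is still loaded (shortly after a prompt), list running models
ollama ps
# The PROCESSOR column reads "100% GPU" when fully offloaded,
# or a CPU/GPU split when the model doesn't fit entirely in VRAM.
```

A partial split here is the clearest sign that you need a smaller or more aggressively quantized model.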
Step 4: Pull and Configure Models
Ollama's library includes everything from tiny 1B models to massive 70B models. I recommend starting with Mistral 7B (solid reasoning, good speed) or Neural Chat 7B (fine-tuned for conversation). On a 4GB GPU, stick to smaller models or aggressively quantized 7B variants; with 8GB+, you can try 13B models like Llama 2 13B.
# Pull a few models to test
ollama pull neural-chat
ollama pull neural-chat:7b-v3.2-q4_0 # Quantized variant
The `q4_0` suffix means 4-bit quantization, which uses roughly a quarter of the VRAM of the full 16-bit weights with minimal quality loss. I've tested both, and honestly, the 4-bit versions are impressive—you lose almost nothing in reasoning while gaining major speed.
To list all pulled models:
ollama list
By default, Ollama stores models in ~/.ollama/models; when it runs as the systemd service installed above, that's the ollama user's home, typically /usr/share/ollama/.ollama/models. If you're on a system with multiple disks, you can symlink this directory to a faster drive or NVMe, or point the OLLAMA_MODELS environment variable at it.
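A symlink move might look like this, assuming the systemd service's default store at /usr/share/ollama/.ollama/models and a placeholder NVMe mount at /mnt/nvme; adjust both paths to your setup and stop the service first so nothing is writing to the directory:

```shell
# Stop Ollama so the model store isn't in use
sudo systemctl stop ollama
# Move the store to the fast disk and link it back into place
sudo mv /usr/share/ollama/.ollama/models /mnt/nvme/ollama-models
sudo ln -s /mnt/nvme/ollama-models /usr/share/ollama/.ollama/models
sudo systemctl start ollama
```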
Step 5: Configure Ollama for Your Use Case
Ollama's API listens only on localhost (127.0.0.1:11434) by default, which is secure but limits access. If you want to access it from other machines on your network—or run it on a VPS (around $40/year from providers like RackNerd)—you need to bind it to 0.0.0.0. Be aware that the API has no built-in authentication, so only expose it on networks you trust, or put it behind a reverse proxy or firewall.
Edit the systemd service:
sudo systemctl edit ollama
Add these lines in the [Service] section:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/opt/ollama/models"
Environment="CUDA_VISIBLE_DEVICES=0"
The CUDA_VISIBLE_DEVICES=0 line explicitly pins Ollama to GPU 0 (useful if you have multiple GPUs). Save and reload:
sudo systemctl daemon-reload
sudo systemctl restart ollama
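You can confirm the override took effect by asking systemd for the environment it passes to the service (output formatting can differ slightly between systemd versions):

```shell
# Show the environment variables systemd applies to the ollama unit
systemctl show ollama --property=Environment
# The output should include OLLAMA_HOST=0.0.0.0:11434 and the
# other variables from your override file.
```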
Test the API endpoint from the terminal:
curl -X POST http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "mistral",
"prompt": "Explain GPU acceleration in one sentence",
"stream": false
}'
You'll get back a JSON response with the model's output. If you're accessing from another machine, replace localhost with your machine's IP address.
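With `"stream": true` (the API's default), the same endpoint returns newline-delimited JSON chunks instead of one blob; a minimal sketch:

```shell
# Stream a response as newline-delimited JSON chunks (-N disables curl buffering)
curl -N http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral", "prompt": "Count to three", "stream": true}'
# Each line is a JSON object carrying a "response" fragment;
# the final line has "done": true plus timing stats.
```

Streaming is what you want for chat-style frontends, since tokens appear as they're generated.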
Performance Tuning and Monitoring
Once everything's running, I monitor two key metrics: VRAM usage and tokens per second. Run a simple prompt and watch nvidia-smi:
watch -n 0.5 nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu,utilization.memory --format=csv,noheader,nounits
A well-tuned Mistral 7B should use 5–6GB VRAM and generate 80–120 tokens/second on an RTX 3070. If you're seeing lower speeds, check for CPU bottlenecks or thermal and power-limit throttling. Low token throughput despite plenty of GPU memory in use usually means the model doesn't fully fit in VRAM and some layers are running on the CPU.
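One more knob worth knowing: the `keep_alive` request parameter controls how long a model stays resident in VRAM after a request (the service-wide default can be set with the OLLAMA_KEEP_ALIVE environment variable). Keeping a model loaded avoids the multi-second reload on the first prompt after idle:

```shell
# Ask Ollama to keep mistral loaded indefinitely after this request
# (keep_alive: -1 means "never unload"; a value like "10m" also works)
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral", "prompt": "warm up", "keep_alive": -1, "stream": false}'
```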
For production use on a VPS or persistent system, I also recommend setting up memory limits to prevent OOM kills:
sudo systemctl edit ollama
# Add to [Service] section:
# MemoryMax=16G
# CPUQuota=80%
Next Steps: Integration and APIs
Now that Ollama's running with GPU acceleration, you can integrate it with frontend tools. Open WebUI is the obvious choice—it's a beautiful, self-hosted ChatGPT alternative that connects directly to Ollama. Deploy it with Docker in minutes (I run it alongside Ollama on the same machine). You can also call Ollama's REST API from any application: Python scripts, Node.js services, even simple bash loops.
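A typical Open WebUI deployment alongside a host-installed Ollama looks like the following; check the Open WebUI docs for the current image tag and flags, and note that the `--add-host` mapping is what lets the container reach Ollama on the host:

```shell
# Run Open WebUI in Docker, pointing it at the host's Ollama instance
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
# Then browse to http://localhost:3000 and connect it to
# http://host.docker.internal:11434 in the settings.
```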
For serious homelab setups, consider running Ollama on a dedicated machine (even a used laptop with a decent GPU works) and accessing it from other devices on your network. This separates inference load from your main systems and lets you experiment with different models without affecting production services.
If you're running on a resource-constrained VPS and want remote Ollama access, you can tunnel it securely through Tailscale or a reverse proxy. The API is stateless, so scaling becomes straightforward once you have one instance working smoothly.