Resource Optimization: Running Ollama and Docker Services on Limited VPS Resources

When I started running Ollama and Docker services on budget VPS instances, I quickly learned that you can't just deploy production configurations and expect them to work. A $40/year RackNerd box with 1–2 GB RAM and a single vCPU is completely different from a developer laptop. But with the right tuning, quantization strategies, and container orchestration, I've gotten full LLM inference pipelines running smoothly on machines that would normally be considered too weak. This post covers what actually works.

Understanding Your Resource Constraints

Before you optimize, you need to know what you're working with. SSH into your VPS and run these commands to establish a baseline.


# Check available memory
free -h

# Check CPU count and specs
nproc
lscpu

# Check disk space
df -h

# Monitor real-time resource usage
top -b -n 1 | head -20

When I checked my RackNerd instance, I saw exactly 1.9 GB total RAM, with about 1.3 GB available after the OS and basic services. Disk was 20 GB. This is genuinely tight for Ollama: the default llama2 tag is roughly 3.8 GB even with its built-in 4-bit quantization, and the unquantized fp16 weights of a 7B model would be over 13 GB.

The key insight: you need to quantize your models and tune kernel parameters. You can't run full-precision models on 2 GB RAM. Period.
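
The arithmetic makes this concrete. Here's a rough sketch (model_size_gb is my own throwaway helper, not an Ollama tool) estimating weight size from parameter count and bits per weight; Q4_0 works out to roughly 4.5 bits per weight once per-block scales are included:

```shell
# Hypothetical sizing helper: estimate model weight size in GB from
# parameter count (in billions) and bits per weight. Ignores file
# metadata and KV-cache overhead, so treat the result as a floor.
model_size_gb() {
  awk -v p="$1" -v b="$2" \
    'BEGIN { printf "%.1f\n", p * 1e9 * b / 8 / 1e9 }'
}

model_size_gb 7 16    # Llama 2 7B at fp16  -> 14.0
model_size_gb 7 4.5   # same weights at ~Q4_0 -> 3.9
```

Even at 4 bits, a 7B model barely fits next to the OS in 2 GB plus swap, which is why small models like Phi-2 and TinyLlama matter on these boxes.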

Quantization: The Game Changer

Model quantization reduces the precision of weights and activations, shrinking model size dramatically while keeping inference quality acceptable. Instead of 32-bit floats, you use 8-bit or 4-bit integers. Ollama handles this automatically if you pull the right model tags.
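
As a toy illustration of what 4-bit quantization does (not Ollama's actual GGUF scheme, which quantizes in blocks with per-block scale factors), here's a single weight snapped to one of 16 levels and back:

```shell
# Map a weight in [-1, 1] onto 16 levels (4 bits) and dequantize,
# showing the small rounding error that Q4 formats accept.
quantize_q4() {
  awk -v w="$1" 'BEGIN {
    scale = 2.0 / 15                  # step between the 16 levels
    q = int((w + 1) / scale + 0.5)    # nearest 4-bit code, 0..15
    printf "%.3f\n", q * scale - 1    # reconstructed weight
  }'
}

quantize_q4 0.337   # -> 0.333 (error of ~0.004)
```

Across billions of weights, accepting that rounding error is what buys the 3-4x size reduction.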

When I run Ollama, I always use quantized variants. Here's what I actually use:

Pull quantized models like this:


# Pull a model whose default tag ships 4-bit (Q4) quantized
ollama pull phi

# Pull an even smaller one for fallback
ollama pull tinyllama:latest

# List what you have
ollama list

The difference is real. Phi-2 at fp16 is about 5.4 GB of weights; the Q4 variant is about 1.6 GB. That's roughly a 70% size reduction with minimal quality loss. On a 2 GB VPS, this is the difference between "possible" and "impossible."

Tip: Always start with Q4_0 quantization. It's the sweet spot between size and quality. Q5_K and Q6_K give better quality but take up more space and are slower. Save those for when you have 8+ GB RAM.

Kernel Tuning and Swap Configuration

Even with quantization, you'll hit memory pressure. Linux can use swap (disk-based virtual memory) to absorb it, but swap needs proper tuning. At the default swappiness of 60, the kernel swaps fairly eagerly, and an inference workload that constantly touches gigabytes of model weights will thrash badly.

I tune these parameters on every budget VPS I set up:


# Check current swap
free -h

# If no swap exists, create a 4-8GB file-based swap
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Make it persistent
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

# Verify
swapon --show

# Now tune kernel parameters
# Lower vm.swappiness to avoid thrashing (default is 60)
echo 'vm.swappiness = 10' | sudo tee -a /etc/sysctl.conf

# Keep directory and inode caches in memory longer (default is 100)
echo 'vm.vfs_cache_pressure = 50' | sudo tee -a /etc/sysctl.conf

# Increase max memory map areas (Ollama needs this)
echo 'vm.max_map_count = 262144' | sudo tee -a /etc/sysctl.conf

# Apply changes immediately
sudo sysctl -p

What these do: vm.swappiness = 10 tells the kernel to prefer RAM and swap only under real pressure (avoiding thrashing), vfs_cache_pressure = 50 lets directory and inode caches stay in memory longer, and max_map_count prevents memory-mapping errors when Ollama maps large model files.

After tuning, I can run inference with occasional swap usage without the system freezing. It's slower than pure RAM, but it works reliably.
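
For sizing the swap file itself, I follow the common rule of thumb: 2x RAM on boxes up to 2 GB, 1x RAM above that. A sketch with a hypothetical helper:

```shell
# Hypothetical helper: suggest a swap size (MB) from total RAM (MB),
# using the 2x-up-to-2GB rule of thumb.
swap_size_mb() {
  if [ "$1" -le 2048 ]; then
    echo $(( $1 * 2 ))
  else
    echo "$1"
  fi
}

swap_size_mb 1900   # -> 3800, so the 4G fallocate above is about right
```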

Docker Compose for Ollama + Services

Here's my minimal production setup: Ollama in a container, Open WebUI for the interface, and Watchtower to auto-update. Everything fits in 2 GB.


# Create working directory
mkdir -p ~/ollama-stack
cd ~/ollama-stack

# Create the compose file
cat > docker-compose.yml << 'EOF'
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=1
      - OLLAMA_NUM_THREAD=2
      - OLLAMA_MODELS=/root/.ollama/models
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: '0.8'
          memory: 1200M
        reservations:
          cpus: '0.5'
          memory: 800M

  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - webui_data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 300M
        reservations:
          cpus: '0.2'
          memory: 150M

  watchtower:
    image: containrrr/watchtower:latest
    container_name: watchtower
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    command: --interval 86400 ollama open-webui
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: '0.2'
          memory: 100M

volumes:
  ollama_data:
  webui_data:
EOF

The key optimization here: explicit memory limits and CPU caps. Ollama gets 1.2 GB max (with 800 MB reserved), Open WebUI gets 300 MB, Watchtower gets 100 MB. This prevents any single service from consuming all RAM and crashing the system. One caveat: the legacy docker-compose v1 binary ignores the deploy: section unless you run it with --compatibility; the Docker Compose v2 plugin (docker compose) applies these limits directly.
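
Whenever I change these limits, I re-check that the hard caps still leave headroom for the kernel and host services. The arithmetic is trivial but worth making explicit (values are the ones from the compose file above, in MB):

```shell
# Sum the hard memory limits and compare against usable RAM.
ollama_mb=1200; webui_mb=300; watchtower_mb=100
usable_mb=1900   # free -h reported ~1.9 GB total on this box

total_mb=$(( ollama_mb + webui_mb + watchtower_mb ))
echo "limits total: ${total_mb} MB, headroom: $(( usable_mb - total_mb )) MB"
# -> limits total: 1600 MB, headroom: 300 MB
```

If the headroom drops much below a couple hundred MB, something will eventually get OOM-killed.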

Start it with:


# Start the stack
docker-compose up -d

# Wait ~30 seconds for Ollama to initialize
sleep 30

# Pull a model from inside the container
docker exec ollama ollama pull phi

# Check logs
docker-compose logs ollama

Access the web UI at http://your-vps-ip:3000.

Monitoring and Preventing OOM Kills

The Linux kernel's Out-Of-Memory (OOM) killer terminates processes when RAM runs out. This is brutal and silent—your Ollama container just dies. I monitor this constantly.

Check if you're getting OOM kills:


# Check kernel logs for OOM killer activity
dmesg | grep -i "killed process" | tail -10

# Or watch in real-time
dmesg -w | grep -i "killed"

If you see OOM kills, immediately:

  1. Use a smaller quantized model (Q4 instead of Q6)
  2. Increase swap size
  3. Reduce OLLAMA_NUM_PARALLEL to 1 (run one request at a time)
  4. Set tighter memory limits on other services

Watch out: Never assume "it's fine" just because the VPS boots. Monitor for 48 hours under load. OOM kills often happen during spikes—nights when cron jobs run, or if multiple inference requests hit simultaneously.
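
To make the log check above scriptable (say, from a daily cron job that alerts you), here's a hypothetical helper that counts OOM entries in kernel log text read from stdin:

```shell
# Count OOM-killer lines in kernel log text. In real use, feed it
# live output:  dmesg | count_oom_kills
count_oom_kills() {
  grep -ci "killed process" || true   # '|| true': grep exits 1 on zero matches
}

printf 'Out of memory: Killed process 812 (ollama)\nusb 1-1: reset\n' \
  | count_oom_kills   # -> 1
```

A nonzero count over a 24-hour window is your signal to drop to a smaller quant or grow swap.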

Real Performance Numbers

On my 2 GB, 1-vCPU RackNerd VPS, inference is nowhere near cloud GPU speeds, but it's usable for text summarization, simple Q&A, and automation tasks. I use this setup to generate blog content drafts, write emails, and debug code snippets. Not real-time applications, but genuinely helpful.

Budget VPS Recommendation

If you're shopping for a VPS to run this setup, RackNerd offers excellent value. For around $40/year, you can get a VPS with 1 GB RAM, 1 vCPU, 20–25 GB SSD, and 1 Gbps bandwidth. It's tight, but it works. Check RackNerd's latest promotions (they run deals especially around holidays and new-year events)—I've seen better specs at similar prices during sales.

If you can stretch to $60–80/year, get 2 GB RAM. The difference between 1 GB and 2 GB is the difference between "technically possible" and "actually pleasant to use."

Next Steps

Start with the Docker Compose setup above. Pull a Q4 model, let it run for a week, and watch your resource usage with docker stats and kernel logs. You'll learn exactly where your limits are. If you hit OOM kills, quantize further. If you have spare headroom, try running two models simultaneously (one fast, one accurate).

The real skill in self-hosting on a budget isn't buying bigger hardware—it's understanding your constraints deeply enough to optimize around them. After you get Ollama stable, you can add more Docker services (Nextcloud, Gitea, Vaultwarden) on top of the same stack, sharing kernel memory carefully.
