Building a Resilient Homelab Infrastructure with Load Balancing and Failover
Most homelabbers run everything on a single box. When that machine hiccups, your entire stack goes dark. I've been there—3 AM downtime, wife angry, no Jellyfin. After building out multiple servers and learning the hard way, I've designed a homelab that actually stays up. Here's how to add load balancing and failover to your self-hosted setup without needing enterprise hardware or complexity.
Why Resilience Matters at Home
A homelab isn't just for tinkering anymore. I run Nextcloud for family documents, Vaultwarden for passwords, and media services that people actually rely on. When infrastructure fails, it's not just downtime—it's credibility gone. Resilience means:
- Zero-downtime deployments: Update services without users noticing
- Hardware fault tolerance: One server dies, traffic shifts automatically
- Graceful degradation: Services degrade instead of vanishing entirely
- Confidence to experiment: Break things on one node while others handle traffic
The good news? You don't need a three-node Kubernetes cluster. Docker Compose, Caddy, and a second cheap VPS can get you 99% of the way there.
The Architecture I'm Running
My current setup spans two systems:
- Primary homelab: Intel NUC with 32GB RAM running Docker Compose, Nextcloud, Immich, Gitea, Vaultwarden
- Secondary VPS: RackNerd public VPS for around $40/year, running failover instances and edge services
- Reverse proxy: Caddy on both nodes with health checks and automatic failover
- DNS: Pi-hole with weighted round-robin pointing to both endpoints
The VPS handles traffic when the homelab reboots or crashes. It's not pretty, but it works. More importantly, the family doesn't notice when my lab goes down for maintenance.
Load Balancing with Caddy
Caddy makes reverse proxy load balancing almost trivial. I prefer Caddy over Traefik here because the health check logic is cleaner and the configuration is readable.
Here's a real Caddyfile that balances three backend services across my primary and secondary nodes:
nextcloud.home.local {
    reverse_proxy localhost:8080 backup.vpn.internal:8080 {
        # always prefer the local instance; only use the VPS when it's down
        lb_policy first
        health_uri /status.php
        health_port 8080
        health_interval 5s
        health_timeout 2s
    }
}

vaultwarden.home.local {
    reverse_proxy localhost:8000 backup.vpn.internal:8000 {
        lb_policy first
        health_uri /alive
        health_port 8000
        health_interval 5s
        health_timeout 2s
    }
}

immich.home.local {
    reverse_proxy localhost:3001 backup.vpn.internal:3001 {
        lb_policy first
        health_uri /api/server/health
        health_port 3001
        health_interval 10s
        health_timeout 3s
    }
}

# plain-HTTP catch-all; handy as a target for external uptime probes
:80 {
    respond 200
}
Let me break this down:
- reverse_proxy localhost:8080 backup.vpn.internal:8080: defines two upstreams, the local instance and the VPS reached over the VPN
- lb_policy first: Caddy's default policy spreads requests across all healthy upstreams; first makes it always prefer the first (local) upstream and only send traffic to the VPS when the local one fails its health check
- health_uri: the endpoint Caddy probes to verify the service is actually alive
- health_interval 5s: probe every 5 seconds (tune based on your tolerance for false positives)
- health_timeout 2s: if the probe doesn't respond within 2 seconds, mark the upstream down
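The selection logic behind the first policy is easy to reason about. Here's a rough shell sketch of "first healthy upstream" failover; the upstream names and the stubbed health statuses are mine for illustration, not real Caddy internals:

```shell
# Sketch of first-available upstream selection, as in Caddy's `first` policy.
# pick_upstream takes "name=status" pairs and echoes the first healthy one.
pick_upstream() {
  for upstream in "$@"; do
    name=${upstream%%=*}
    status=${upstream##*=}
    # stand-in for Caddy's health probe (health_uri + health_timeout)
    if [ "$status" = "up" ]; then
      echo "$name"
      return 0
    fi
  done
  echo "no healthy upstream" >&2
  return 1
}

# Primary down, backup up: traffic shifts to the VPS.
pick_upstream "localhost:8080=down" "backup.vpn.internal:8080=up"
# → backup.vpn.internal:8080
```

The key property is that when both upstreams are healthy, the local one always wins, so the VPS only carries traffic during an actual outage.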
Use application-level health endpoints: /status.php for Nextcloud, /api/server/health for Immich. Avoid generic container health checks; they don't tell you whether your app is actually working.

Docker Compose Failover Setup
On my VPS, I run a minimal Docker Compose stack with just the critical services. These act as warm standbys, pulling data from persistent storage shared between nodes.
version: '3.8'

services:
  nextcloud:
    image: nextcloud:latest
    ports:
      - "8080:80"
    environment:
      NEXTCLOUD_TRUSTED_DOMAINS: "nextcloud.home.local"
      POSTGRES_HOST: "postgres"
      POSTGRES_DB: "nextcloud"
      POSTGRES_USER: "nextcloud"
      POSTGRES_PASSWORD: "${DB_PASSWORD}"
    volumes:
      - /mnt/shared/nextcloud:/var/www/html
      - /mnt/shared/nextcloud-data:/var/www/html/data
    depends_on:
      - postgres
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost/status.php"]
      interval: 10s
      timeout: 5s
      retries: 3

  vaultwarden:
    image: vaultwarden/server:latest
    ports:
      - "8000:80"
    environment:
      DOMAIN: "https://vaultwarden.home.local"
      ADMIN_TOKEN: "${ADMIN_TOKEN}"
      DATABASE_URL: "postgresql://vaultwarden:${DB_PASSWORD}@postgres/vaultwarden"
    volumes:
      - /mnt/shared/vaultwarden:/data
    depends_on:
      - postgres
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost/alive"]
      interval: 10s
      timeout: 5s
      retries: 3

  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_PASSWORD: "${DB_PASSWORD}"
      POSTGRES_INITDB_ARGS: "--encoding=UTF8"
    volumes:
      - /mnt/shared/postgres:/var/lib/postgresql/data
    restart: always
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 3

  immich-server:
    image: ghcr.io/immich-app/immich-server:latest
    ports:
      - "3001:3001"
    environment:
      DB_HOSTNAME: "postgres"
      DB_USERNAME: "immich"
      DB_PASSWORD: "${DB_PASSWORD}"
      DB_NAME: "immich"
      REDIS_HOSTNAME: "redis"
    volumes:
      - /mnt/shared/immich:/usr/src/app/upload
    depends_on:
      - postgres
      - redis
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3001/api/server/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  redis:
    image: redis:7-alpine
    restart: always
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3
Key decisions I made here:
- Shared storage: /mnt/shared is NFS or Samba mounted from the primary homelab, so the VPS always sees current data without running a separate replication job
- Single PostgreSQL instance: I route all writes to the primary homelab database; reads from the VPS are eventually consistent
- Health checks: every service has a Compose health check so Docker knows when to restart it
- No volumes on local disk: everything points to /mnt/shared, so if the VPS reboots, apps resume with current data
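For the shared mount itself, a soft NFS mount is worth considering so a dead primary makes I/O fail fast instead of hanging the VPS. A sketch of the client-side setup follows; the server name, export path, and the demo file path are my assumptions:

```shell
# Hypothetical /etc/fstab entry for the shared NFS mount.
# `soft` with short timeouts makes I/O error out quickly when the primary is gone,
# so containers fail their health checks instead of hanging on dead storage.
FSTAB_LINE='homelab.vpn.internal:/export/shared  /mnt/shared  nfs  soft,timeo=30,retrans=2,_netdev  0  0'

# Staged to a demo file here; on the real VPS, append to /etc/fstab and run `mount /mnt/shared`.
echo "$FSTAB_LINE" > /tmp/fstab.demo
grep -c 'soft' /tmp/fstab.demo
```

The trade-off: soft mounts can truncate writes that are cut off mid-flight, which is tolerable for media and file sync but is exactly why you wouldn't want a database's data directory behind one.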
Networking: Tailscale for Secure Failover
I use Tailscale to connect the homelab and VPS securely without opening ports. Caddy on each node reaches the other over the Tailscale mesh:
- Homelab Caddy: runs on the primary, tries localhost first, fails over to backup.vpn.internal (VPS Tailscale IP)
- VPS Caddy: runs on the secondary, tries backup.vpn.internal first, fails over to homelab.vpn.internal (primary Tailscale IP)
Traffic flows like this:
1. User requests nextcloud.home.local
2. DNS resolves to the homelab public IP (or the VPS, if using DNS failover)
3. Caddy on that node checks localhost:8080 (the local Nextcloud)
4. If it's up, traffic stays local. If it's down, Caddy forwards to the VPS over the Tailscale tunnel
5. The VPS Nextcloud container serves the request from shared storage
Setup is simple:
# On both homelab and VPS
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up
# Note the assigned IP addresses
tailscale ip -4
# In your Caddyfile, use the Tailscale IPs
# Homelab: 100.x.x.1 (homelab.vpn.internal)
# VPS: 100.x.x.2 (backup.vpn.internal)
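The homelab.vpn.internal and backup.vpn.internal names are just labels for Tailscale IPs. One way to wire them up is a hosts entry on each node; the 100.64.x.x addresses below are placeholders for whatever tailscale ip -4 actually reports:

```shell
# Map the Tailscale IPs to the hostnames used in the Caddyfile.
# These addresses are placeholders; substitute your real `tailscale ip -4` output.
HOMELAB_IP=100.64.0.1
VPS_IP=100.64.0.2

# Written to a demo file here; on a real node this would be appended to /etc/hosts.
printf '%s homelab.vpn.internal\n%s backup.vpn.internal\n' "$HOMELAB_IP" "$VPS_IP" > /tmp/hosts.demo
cat /tmp/hosts.demo
```

Alternatively, Tailscale's MagicDNS gives every node a stable name automatically, if you'd rather not manage hosts files at all.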
Tailscale handles encryption, NAT traversal, and access control automatically. No port forwarding needed.
Health Checks and Monitoring
Failover only works if you know when services are down. I monitor three layers:
Layer 1: Container health checks — Docker Compose's built-in checks restart dead services.
Layer 2: Caddy health checks — Caddy probes endpoints and routes around failures.
Layer 3: Application monitoring — Prometheus + Grafana alerts me when services degrade. I have an Uptime Kuma instance that checks critical endpoints every 60 seconds and pages me if anything is down for more than 5 minutes.
For a minimal setup, Uptime Kuma alone is enough. It's self-hosted, lightweight, and integrates with Discord/Telegram for alerts.
Testing Failover (Critical!)
Never assume your failover works until you've tested it. Here's my checklist:
- Stop the primary service: docker-compose stop nextcloud. Does Caddy route to the VPS within 10 seconds?
- Kill the primary container: docker kill nextcloud. Does Docker restart it? Does failover work while it's restarting?
- Network partition: unplug the homelab Ethernet. Does DNS failover kick in? Can you still access services on the VPS?
- Database failure: stop PostgreSQL. Do dependent services degrade gracefully or crash? Can you recover without manual intervention?
- Storage failure: unmount the NFS share. Do services retry or fail fast? How long before Caddy notices?
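To put a number on that first check, a small polling loop works well for measuring how long failover actually takes. The probe below is stubbed so the script is self-contained; the commented-out curl lines show what a real run would do, and the URL is only an example:

```shell
# Measure failover time in polls. `probe` is a stub that becomes healthy on
# the 3rd attempt; in a real test, swap in the curl line and run this right
# after stopping the primary service.
probe() {
  # real version: curl -fsS -m 2 http://nextcloud.home.local/status.php >/dev/null
  [ "$1" -ge 3 ]
}

polls=0
until probe "$((polls + 1))"; do
  polls=$((polls + 1))
  # real version: sleep 1
done
echo "recovered after $((polls + 1)) polls"
# → recovered after 3 polls
```

With a 1-second sleep, the poll count approximates seconds of downtime, which you can compare against the health_interval + health_timeout budget in the Caddyfile.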
I run these tests monthly. Every time, I find something. Last month, Nextcloud was configured with a hardcoded localhost database connection; the VPS instance could never have taken over because it couldn't reach a database at all. That would have been bad in a real outage.
Cost Considerations
Your failover VPS doesn't need to be powerful. Mine is a $40/year RackNerd box with 2 vCores and 4GB RAM. It's enough to run Nextcloud, Vaultwarden, and a few smaller services. Check RackNerd's current deals—they often have promotions that undercut their standard pricing.
Breakdown of my annual costs:
- Homelab NUC: $500 (one-time, already owned)
- Failover VPS: $40/year
- Tailscale: Free (I use the personal tier; even the paid tier is $60/year)
- Electricity: ~$150/year for the homelab
Total: ~$200/year for real HA. Compare that to managed Nextcloud hosting ($99/month), Vaultwarden hosting ($5/month), and Immich hosting