Building a Resilient Homelab Infrastructure with Load Balancing and Failover

Most homelabbers run everything on a single box. When that machine hiccups, your entire stack goes dark. I've been there—3 AM downtime, wife angry, no Jellyfin. After building out multiple servers and learning the hard way, I've designed a homelab that actually stays up. Here's how to add load balancing and failover to your self-hosted setup without needing enterprise hardware or complexity.

Why Resilience Matters at Home

A homelab isn't just for tinkering anymore. I run Nextcloud for family documents, Vaultwarden for passwords, and media services that people actually rely on. When infrastructure fails, it's not just downtime—it's credibility gone. Resilience here means those services stay up through reboots, updates, and the occasional hardware failure, without anyone noticing.

The good news? You don't need a three-node Kubernetes cluster. Docker Compose, Caddy, and a second cheap VPS can get you 99% of the way there.

The Architecture I'm Running

My current setup spans two systems: a primary homelab server that runs the full stack, and a cheap VPS acting as a warm standby that runs only the critical services.

The VPS handles traffic when the homelab reboots or crashes. It's not pretty, but it works. More importantly, the family doesn't notice when my lab goes down for maintenance.

Load Balancing with Caddy

Caddy makes reverse proxy load balancing almost trivial. I prefer Caddy over Traefik here because the health check logic is cleaner and the configuration is readable.

Here's a real Caddyfile that balances three backend services across my primary and secondary nodes:

nextcloud.home.local {
  reverse_proxy localhost:8080 backup.vpn.internal:8080 {
    # prefer the first healthy upstream: local stays primary, the VPS is failover
    lb_policy first
    health_uri /status.php
    health_port 8080
    health_interval 5s
    health_timeout 2s
  }
}

vaultwarden.home.local {
  reverse_proxy localhost:8000 backup.vpn.internal:8000 {
    lb_policy first
    health_uri /alive
    health_port 8000
    health_interval 5s
    health_timeout 2s
  }
}

immich.home.local {
  reverse_proxy localhost:3001 backup.vpn.internal:3001 {
    lb_policy first
    health_uri /api/server/health
    health_port 3001
    health_interval 10s
    health_timeout 3s
  }
}

:80 {
  # bare 200 responder, handy as a target for external uptime checks
  respond 200
}

Let me break this down:

  - Each site lists two upstreams: the local container first, then the warm standby on the VPS (reached via the Tailscale hostname backup.vpn.internal).
  - health_uri is the application endpoint Caddy probes; health_port pins the probe to the backend's port.
  - health_interval and health_timeout decide how quickly a dead upstream gets marked down; at 5s/2s, failover happens within seconds.
  - The :80 block is a bare 200 responder, useful as a target for external uptime checks.

Tip: Use application-specific health endpoints when possible. /status.php for Nextcloud, /api/server/health for Immich. Avoid generic container health checks—they don't tell you if your app is actually working.
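Before wiring these into Caddy, it's worth poking the endpoints by hand. A quick sketch, reusing the hosts and ports from the Caddyfile above (adjust for your own stack):

```shell
# Probe each application's health endpoint the same way Caddy will.
# Hosts and ports match the Caddyfile above.
check() {
  if curl -fsS --max-time 2 "$1" >/dev/null 2>&1; then
    echo "up:   $1"
  else
    echo "down: $1"
  fi
}

results=$(
  check http://localhost:8080/status.php
  check http://localhost:8000/alive
  check http://localhost:3001/api/server/health
)
echo "$results"
```

If any line reports down while the container is running, the app itself is broken, which is exactly the failure a generic container check would miss.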

Docker Compose Failover Setup

On my VPS, I run a minimal Docker Compose stack with just the critical services. These act as warm standbys, pulling data from persistent storage shared between nodes.

services:
  nextcloud:
    image: nextcloud:latest
    ports:
      - "8080:80"
    environment:
      NEXTCLOUD_TRUSTED_DOMAINS: "nextcloud.home.local"
      POSTGRES_HOST: "postgres"
      POSTGRES_DB: "nextcloud"
      POSTGRES_USER: "nextcloud"
      POSTGRES_PASSWORD: "${DB_PASSWORD}"
    volumes:
      - /mnt/shared/nextcloud:/var/www/html
      - /mnt/shared/nextcloud-data:/var/www/html/data
    depends_on:
      - postgres
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost/status.php"]
      interval: 10s
      timeout: 5s
      retries: 3

  vaultwarden:
    image: vaultwarden/server:latest
    ports:
      - "8000:80"
    environment:
      DOMAIN: "https://vaultwarden.home.local"
      ADMIN_TOKEN: "${ADMIN_TOKEN}"
      DATABASE_URL: "postgresql://vaultwarden:${DB_PASSWORD}@postgres/vaultwarden"
    volumes:
      - /mnt/shared/vaultwarden:/data
    depends_on:
      - postgres
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost/alive"]
      interval: 10s
      timeout: 5s
      retries: 3

  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_PASSWORD: "${DB_PASSWORD}"
      POSTGRES_INITDB_ARGS: "--encoding=UTF8"
    volumes:
      - /mnt/shared/postgres:/var/lib/postgresql/data
    # Note: the nextcloud, vaultwarden, and immich databases and roles are not
    # created automatically; seed them once with a script placed in the image's
    # /docker-entrypoint-initdb.d directory.
    restart: always
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 3

  immich-server:
    image: ghcr.io/immich-app/immich-server:latest
    ports:
      - "3001:3001"
    environment:
      DB_HOSTNAME: "postgres"
      DB_USERNAME: "immich"
      DB_PASSWORD: "${DB_PASSWORD}"
      DB_NAME: "immich"
      REDIS_HOSTNAME: "redis"
    volumes:
      - /mnt/shared/immich:/usr/src/app/upload
    depends_on:
      - postgres
      - redis
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3001/api/server/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  redis:
    image: redis:7-alpine
    restart: always
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3

Key decisions I made here:

  - All state lives in bind mounts under /mnt/shared, storage both nodes can reach, so the standby serves the same data as the primary.
  - restart: always brings every container back after a VPS reboot with no intervention.
  - Container healthchecks hit the same endpoints Caddy probes, so Docker and the proxy agree on what "healthy" means.
  - A single shared Postgres instance backs all three apps, keeping the standby's footprint small.

Watch out: Database conflicts happen if you run writes on both nodes. I keep the primary as the source of truth. The VPS is read-mostly, and I manually failover writes only during total primary failure. If you need true multi-primary replication, you're in Kubernetes territory.
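That "primary as source of truth" model implies something has to push data to the standby. Here's a minimal sketch of the kind of cron job that could do it; backup.vpn.internal and /mnt/shared are carried over from the setup above, and it defaults to a dry run so nothing copies until you set DRY_RUN=0:

```shell
#!/bin/sh
# One-way data sync, primary -> standby, meant for cron on the primary only.
# backup.vpn.internal and /mnt/shared are assumptions from the setup above.
set -eu
SRC="/mnt/shared/"
DEST="backup.vpn.internal:/mnt/shared/"
# -a preserves ownership/permissions for container data; --delete mirrors removals
SYNC_CMD="rsync -az --delete $SRC $DEST"
if [ "${DRY_RUN:-1}" = "1" ]; then
  plan="would run: $SYNC_CMD"
  echo "$plan"
else
  $SYNC_CMD
fi
```

Schedule it on the primary only, e.g. a crontab line like `*/15 * * * * DRY_RUN=0 /usr/local/bin/sync-shared.sh` (the script path is hypothetical). Running it in both directions reintroduces the exact conflict problem described above.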

Networking: Tailscale for Secure Failover

I use Tailscale to connect the homelab and VPS securely without opening ports. Caddy on each node reaches the other over the Tailscale mesh.

Traffic flows like this:

  1. User requests nextcloud.home.local
  2. DNS resolves to homelab public IP (or VPS if using DNS failover)
  3. Caddy on that node checks localhost:8080 (local Nextcloud)
  4. If up, traffic stays local. If down, Caddy forwards to VPS over Tailscale tunnel
  5. VPS Nextcloud container serves the request from shared storage
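To confirm which node actually answered a given request, I find it handy to tag responses. This sketch assumes you add a distinguishing response header in each node's Caddyfile, e.g. `header X-Served-By homelab` on the primary and `header X-Served-By vps` on the standby; the X-Served-By name is my own convention, not anything Caddy requires:

```shell
# Ask the proxy and report which node served the response.
# Requires the X-Served-By header described above to be set in each Caddyfile.
hdr=$(curl -sI --max-time 5 https://nextcloud.home.local 2>/dev/null \
  | grep -i '^x-served-by' || true)
if [ -n "$hdr" ]; then
  echo "$hdr"
else
  echo "no X-Served-By header (proxy unreachable?)"
fi
```

Run it before and after stopping the local container; the header should flip from the primary's value to the standby's.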

Setup is simple:

# On both homelab and VPS
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up

# Note the assigned IP addresses
tailscale ip -4

# In your Caddyfile, use the Tailscale IPs
# Homelab: 100.x.x.1 (homelab.vpn.internal)
# VPS: 100.x.x.2 (backup.vpn.internal)

Tailscale handles encryption, NAT traversal, and access control automatically. No port forwarding needed.

Health Checks and Monitoring

Failover only works if you know when services are down. I monitor three layers:

Layer 1: Container health checks — Docker Compose's built-in checks restart dead services.

Layer 2: Caddy health checks — Caddy probes endpoints and routes around failures.

Layer 3: Application monitoring — Prometheus + Grafana alerts me when services degrade. I have an Uptime Kuma instance that checks critical endpoints every 60 seconds and pages me if anything is down for more than 5 minutes.

For a minimal setup, Uptime Kuma alone is enough. It's self-hosted, lightweight, and integrates with Discord/Telegram for alerts.
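If you want a second, dead-simple check outside Uptime Kuma, a cron-able probe is a few lines of shell. The default endpoint here matches the Nextcloud setup above, and the webhook is a placeholder (e.g. a Discord webhook URL); leave it empty to just record the status:

```shell
#!/bin/sh
# Tiny cron fallback: probe one endpoint, record the result, and optionally
# page a webhook. URL and WEBHOOK are placeholders; override via environment.
URL="${URL:-http://localhost:8080/status.php}"
WEBHOOK="${WEBHOOK:-}"
if curl -fsS --max-time 5 "$URL" >/dev/null 2>&1; then
  status="ok: $URL"
else
  status="down: $URL"
  if [ -n "$WEBHOOK" ]; then
    # Post a simple JSON alert; Discord-style webhooks accept a "content" field
    curl -fsS -X POST -H 'Content-Type: application/json' \
      -d "{\"content\":\"ALERT: $URL unreachable\"}" "$WEBHOOK" >/dev/null 2>&1 || true
  fi
fi
echo "$status"
```

It's no substitute for a real monitor, but it catches the embarrassing case where the monitoring stack itself is down.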

Testing Failover (Critical!)

Never assume your failover works until you've tested it. Here's my checklist:

  1. Stop the primary's containers and confirm Caddy shifts traffic to the VPS.
  2. Reboot the primary entirely and watch for dropped requests mid-failover.
  3. Kill Caddy on the primary to exercise DNS-level failover.
  4. Fail back to the primary and verify nothing was written only to the standby.
  5. Confirm the monitoring stack alerted at every step.

I run these tests monthly. Every time, I find something. Last month, Nextcloud had a hardcoded localhost database connection—the standby could never reach its database, so failover silently didn't work. That would have been bad in a real outage.
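To keep the monthly drill honest, script it. This is a sketch that defaults to only printing its steps; the container name and hostname are from my stack, and setting DRY_RUN=0 runs it for real:

```shell
#!/bin/sh
# Failover drill: take the primary's Nextcloud down, verify the proxy still
# answers, then bring it back. Defaults to a dry run that only prints steps.
set -eu
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    steps="${steps}would run: $*
"
  else
    "$@"
  fi
}
steps=""
run docker compose stop nextcloud
run curl -fsS --max-time 5 https://nextcloud.home.local/status.php
run docker compose start nextcloud
printf '%s' "$steps"
```

The important part is step two: if the curl through the proxy fails while the local container is stopped, failover is broken and you found out on your schedule, not at 3 AM.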

Cost Considerations

Your failover VPS doesn't need to be powerful. Mine is a $40/year RackNerd box with 2 vCores and 4GB RAM. It's enough to run Nextcloud, Vaultwarden, and a few smaller services. Check RackNerd's current deals—they often have promotions that undercut their standard pricing.

All told, it comes to ~$200/year for real HA. Compare that to managed Nextcloud hosting ($99/month), Vaultwarden hosting ($5/month), and Immich hosting