Building a Resilient Homelab Infrastructure with Load Balancing and Failover
Most homelabbers run everything on a single box. When that machine hiccups, your entire stack goes dark. I've been there—3 AM downtime, wife angry, no Jellyfin. After building out multiple servers and learning the hard way, I've designed a homelab that actually stays up. Here's how to add load balancing and failover to your self-hosted setup without needing enterprise hardware or complexity.
Why Resilience Matters at Home
A homelab isn't just for tinkering anymore. I run Nextcloud for family documents, Vaultwarden for passwords, and media services that people actually rely on. When infrastructure fails, it's not just downtime—it's credibility gone. Resilience means:
- Zero-downtime deployments: Update services without users noticing
- Hardware fault tolerance: One server dies, traffic shifts automatically
- Graceful degradation: Services degrade instead of vanishing entirely
- Confidence to experiment: Break things on one node while others handle traffic
The good news? You don't need a three-node Kubernetes cluster. Docker Compose, Caddy, and a second cheap VPS can get you 99% of the way there.
The Architecture I'm Running
My current setup spans two systems:
- Primary homelab: Intel NUC with 32GB RAM running Docker Compose, Nextcloud, Immich, Gitea, Vaultwarden
- Secondary VPS: RackNerd public VPS for around $40/year, running failover instances and edge services
- Reverse proxy: Caddy on both nodes with health checks and automatic failover
- DNS: Pi-hole with weighted round-robin pointing to both endpoints
The VPS handles traffic when the homelab reboots or crashes. It's not pretty, but it works. More importantly, the family doesn't notice when my lab goes down for maintenance.
Load Balancing with Caddy
Caddy makes reverse proxy load balancing almost trivial. I prefer Caddy over Traefik here because the health check logic is cleaner and the configuration is readable.
Here's a real Caddyfile that balances three backend services across my primary and secondary nodes:
nextcloud.home.local {
    reverse_proxy localhost:8080 backup.vpn.internal:8080 {
        # always prefer the local instance; only use the VPS when it's down
        lb_policy first
        health_uri /status.php
        health_port 8080
        health_interval 5s
        health_timeout 2s
    }
}

vaultwarden.home.local {
    reverse_proxy localhost:8000 backup.vpn.internal:8000 {
        lb_policy first
        health_uri /alive
        health_port 8000
        health_interval 5s
        health_timeout 2s
    }
}

immich.home.local {
    reverse_proxy localhost:3001 backup.vpn.internal:3001 {
        lb_policy first
        health_uri /api/server/health
        health_port 3001
        health_interval 10s
        health_timeout 3s
    }
}

# plain-HTTP catch-all; handy as a target for external uptime probes
:80 {
    respond 200
}
Let me break this down:
- reverse_proxy localhost:8080 backup.vpn.internal:8080: defines two upstreams, the local instance and the VPS reached over the VPN
- lb_policy first: Caddy's default policy spreads requests across all healthy upstreams; first makes it always prefer the first (local) upstream and only send traffic to the VPS when the local one fails its health check
- health_uri: the endpoint Caddy probes to verify the service is actually alive
- health_interval 5s: probe every 5 seconds (tune based on your tolerance for false positives)
- health_timeout 2s: if the probe doesn't respond within 2 seconds, mark the upstream down
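The selection logic behind the first policy is easy to reason about. Here's a rough shell sketch of "first healthy upstream" failover; the upstream names and the stubbed health statuses are mine for illustration, not real Caddy internals:

```shell
# Sketch of first-available upstream selection, as in Caddy's `first` policy.
# pick_upstream takes "name=status" pairs and echoes the first healthy one.
pick_upstream() {
  for upstream in "$@"; do
    name=${upstream%%=*}
    status=${upstream##*=}
    # stand-in for Caddy's health probe (health_uri + health_timeout)
    if [ "$status" = "up" ]; then
      echo "$name"
      return 0
    fi
  done
  echo "no healthy upstream" >&2
  return 1
}

# Primary down, backup up: traffic shifts to the VPS.
pick_upstream "localhost:8080=down" "backup.vpn.internal:8080=up"
# → backup.vpn.internal:8080
```

The key property is that when both upstreams are healthy, the local one always wins, so the VPS only carries traffic during an actual outage.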
Use application-level health endpoints: /status.php for Nextcloud, /api/server/health for Immich. Avoid generic container health checks; they don't tell you whether your app is actually working.

Docker Compose Failover Setup
On my VPS, I run a minimal Docker Compose stack with just the critical services. These act as warm standbys, pulling data from persistent storage shared between nodes.
version: '3.8'

services:
  nextcloud:
    image: nextcloud:latest
    ports:
      - "8080:80"
    environment:
      NEXTCLOUD_TRUSTED_DOMAINS: "nextcloud.home.local"
      POSTGRES_HOST: "postgres"
      POSTGRES_DB: "nextcloud"
      POSTGRES_USER: "nextcloud"
      POSTGRES_PASSWORD: "${DB_PASSWORD}"
    volumes:
      - /mnt/shared/nextcloud:/var/www/html
      - /mnt/shared/nextcloud-data:/var/www/html/data
    depends_on:
      - postgres
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost/status.php"]
      interval: 10s
      timeout: 5s
      retries: 3

  vaultwarden:
    image: vaultwarden/server:latest
    ports:
      - "8000:80"
    environment:
      DOMAIN: "https://vaultwarden.home.local"
      ADMIN_TOKEN: "${ADMIN_TOKEN}"
      DATABASE_URL: "postgresql://vaultwarden:${DB_PASSWORD}@postgres/vaultwarden"
    volumes:
      - /mnt/shared/vaultwarden:/data
    depends_on:
      - postgres
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost/alive"]
      interval: 10s
      timeout: 5s
      retries: 3

  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_PASSWORD: "${DB_PASSWORD}"
      POSTGRES_INITDB_ARGS: "--encoding=UTF8"
    volumes:
      - /mnt/shared/postgres:/var/lib/postgresql/data
    restart: always
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 3

  immich-server:
    image: ghcr.io/immich-app/immich-server:latest
    ports:
      - "3001:3001"
    environment:
      DB_HOSTNAME: "postgres"
      DB_USERNAME: "immich"
      DB_PASSWORD: "${DB_PASSWORD}"
      DB_NAME: "immich"
      REDIS_HOSTNAME: "redis"
    volumes:
      - /mnt/shared/immich:/usr/src/app/upload
    depends_on:
      - postgres
      - redis
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3001/api/server/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  redis:
    image: redis:7-alpine
    restart: always
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3
Key decisions I made here:
- Shared storage: /mnt/shared is NFS or Samba mounted from the primary homelab, so the VPS always sees current data without running a separate replication job
- Single PostgreSQL instance: I route all writes to the primary homelab database; reads from the VPS are eventually consistent
- Health checks: every service has a Compose health check so Docker knows when to restart it
- No volumes on local disk: everything points to /mnt/shared, so if the VPS reboots, apps resume with current data
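For the shared mount itself, a soft NFS mount is worth considering so a dead primary makes I/O fail fast instead of hanging the VPS. A sketch of the client-side setup follows; the server name, export path, and the demo file path are my assumptions:

```shell
# Hypothetical /etc/fstab entry for the shared NFS mount.
# `soft` with short timeouts makes I/O error out quickly when the primary is gone,
# so containers fail their health checks instead of hanging on dead storage.
FSTAB_LINE='homelab.vpn.internal:/export/shared  /mnt/shared  nfs  soft,timeo=30,retrans=2,_netdev  0  0'

# Staged to a demo file here; on the real VPS, append to /etc/fstab and run `mount /mnt/shared`.
echo "$FSTAB_LINE" > /tmp/fstab.demo
grep -c 'soft' /tmp/fstab.demo
```

The trade-off: soft mounts can truncate writes that are cut off mid-flight, which is tolerable for media and file sync but is exactly why you wouldn't want a database's data directory behind one.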
Networking: Tailscale for Secure Failover
I use Tailscale to connect the homelab and VPS securely without opening ports. Caddy on each node reaches the other over the Tailscale mesh:
- Homelab Caddy: runs on the primary, tries localhost first, fails over to backup.vpn.internal (VPS Tailscale IP)
- VPS Caddy: runs on the secondary, tries backup.vpn.internal first, fails over to homelab.vpn.internal (primary Tailscale IP)
Traffic flows like this:
1. User requests nextcloud.home.local
2. DNS resolves to the homelab public IP (or the VPS, if using DNS failover)
3. Caddy on that node checks localhost:8080 (the local Nextcloud)
4. If it's up, traffic stays local. If it's down, Caddy forwards to the VPS over the Tailscale tunnel
5. The VPS Nextcloud container serves the request from shared storage
Setup is simple:
# On both homelab and VPS
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up
# Note the assigned IP addresses
tailscale ip -4
# In your Caddyfile, use the Tailscale IPs
# Homelab: 100.x.x.1 (homelab.vpn.internal)
# VPS: 100.x.x.2 (backup.vpn.internal)
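The homelab.vpn.internal and backup.vpn.internal names are just labels for Tailscale IPs. One way to wire them up is a hosts entry on each node; the 100.64.x.x addresses below are placeholders for whatever tailscale ip -4 actually reports:

```shell
# Map the Tailscale IPs to the hostnames used in the Caddyfile.
# These addresses are placeholders; substitute your real `tailscale ip -4` output.
HOMELAB_IP=100.64.0.1
VPS_IP=100.64.0.2

# Written to a demo file here; on a real node this would be appended to /etc/hosts.
printf '%s homelab.vpn.internal\n%s backup.vpn.internal\n' "$HOMELAB_IP" "$VPS_IP" > /tmp/hosts.demo
cat /tmp/hosts.demo
```

Alternatively, Tailscale's MagicDNS gives every node a stable name automatically, if you'd rather not manage hosts files at all.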
Tailscale handles encryption, NAT traversal, and access control automatically. No port forwarding needed.
Health Checks and Monitoring
Failover only works if you know when services are down. I monitor three layers:
Layer 1: Container health checks — Docker Compose's built-in checks restart dead services.
Layer 2: Caddy health checks — Caddy probes endpoints and routes around failures.
Layer 3: Application monitoring — Prometheus + Grafana alerts me when services degrade. I have an Uptime Kuma instance that checks critical endpoints every 60 seconds and pages me if anything is down for more than 5 minutes.
For a minimal setup, Uptime Kuma alone is enough. It's self-hosted, lightweight, and integrates with Discord/Telegram for alerts.
Testing Failover (Critical!)
Never assume your failover works until you've tested it. Here's my checklist:
- Stop the primary service: docker-compose stop nextcloud. Does Caddy route to the VPS within 10 seconds?
- Kill the primary container: docker kill nextcloud. Does Docker restart it? Does failover work while it's restarting?
- Network partition: unplug the homelab Ethernet. Does DNS failover kick in? Can you still access services on the VPS?
- Database failure: stop PostgreSQL. Do dependent services degrade gracefully or crash? Can you recover without manual intervention?
- Storage failure: unmount the NFS share. Do services retry or fail fast? How long before Caddy notices?
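To put a number on that first check, a small polling loop works well for measuring how long failover actually takes. The probe below is stubbed so the script is self-contained; the commented-out curl lines show what a real run would do, and the URL is only an example:

```shell
# Measure failover time in polls. `probe` is a stub that becomes healthy on
# the 3rd attempt; in a real test, swap in the curl line and run this right
# after stopping the primary service.
probe() {
  # real version: curl -fsS -m 2 http://nextcloud.home.local/status.php >/dev/null
  [ "$1" -ge 3 ]
}

polls=0
until probe "$((polls + 1))"; do
  polls=$((polls + 1))
  # real version: sleep 1
done
echo "recovered after $((polls + 1)) polls"
# → recovered after 3 polls
```

With a 1-second sleep, the poll count approximates seconds of downtime, which you can compare against the health_interval + health_timeout budget in the Caddyfile.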
I run these tests monthly. Every time, I find something. Last month, Nextcloud was configured with a hardcoded localhost database connection; the VPS instance could never have taken over because it couldn't reach a database at all. That would have been bad in a real outage.
Cost Considerations
Your failover VPS doesn't need to be powerful. Mine is a $40/year RackNerd box with 2 vCores and 4GB RAM. It's enough to run Nextcloud, Vaultwarden, and a few smaller services. Check RackNerd's current deals—they often have promotions that undercut their standard pricing.
Breakdown of my annual costs:
- Homelab NUC: $500 (one-time, already owned)
- Failover VPS: $40/year
- Tailscale: Free (I use the personal tier; even the paid tier is $60/year)
- Electricity: ~$150/year for the homelab
Total: ~$200/year for real HA. Compare that to managed Nextcloud hosting ($99/month), Vaultwarden hosting ($5/month), and Immich hosting