Monitoring Your Homelab: Prometheus, Grafana, and Alertmanager Setup Guide
When I first started running serious services in my homelab, I had no visibility into what was happening. A service would silently fail, and I wouldn't notice for hours. Now I run Prometheus, Grafana, and Alertmanager as the backbone of my infrastructure monitoring, and I sleep better at night knowing I'll get a Slack message the moment something goes wrong.
This guide walks you through the exact setup I use: a fully containerized monitoring stack with real alerting rules, dashboards that actually look good, and email/Slack notifications that won't spam you. I'll show you the Docker Compose file, the configuration files that matter, and the common gotchas that ate my time so you don't have to lose yours.
Why This Stack?
I chose Prometheus + Grafana + Alertmanager because they're lightweight enough for a homelab but enterprise-grade enough to trust. Prometheus scrapes metrics from your services at regular intervals (this setup uses 15 seconds), stores them in a time-series database, and keeps them for 15 days by default (the compose file below bumps that to 30). Grafana visualizes that data beautifully. Prometheus itself evaluates the alerting rules; Alertmanager handles the deduplication and routing, deciding which alerts you care about and where they go.
Unlike commercial monitoring SaaS, you own the data. There's no bill surprise when you monitor 500 metrics instead of 50. And if you're comfortable with Docker (and if you're self-hosting, you should be), the deployment is straightforward.
The Complete Docker Compose Stack
I run all three components in a single Docker Compose file. Here's what I actually use in production:
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    restart: unless-stopped
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    restart: unless-stopped
    networks:
      - monitoring
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    restart: unless-stopped
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    restart: unless-stopped
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    restart: unless-stopped
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  monitoring:
    driver: bridge
This stack includes Node Exporter (for OS metrics like CPU, memory, disk) and cAdvisor (for Docker container metrics). Both are important—Node Exporter tells you if your disk is filling up; cAdvisor shows you which container is eating 80% of your RAM.
Prometheus Configuration
Create a prometheus.yml file in the same directory as your compose file. This tells Prometheus what to scrape and where the alerting rules live:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'homelab'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - 'alert-rules.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'docker'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']
The key lines: scrape_interval: 15s means Prometheus will ask each exporter for fresh metrics every 15 seconds. Lower this if you want faster alerts; raise it if you want less database write pressure on old hardware.
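To put rough numbers on that trade-off, here's a back-of-the-envelope sketch. The series count and bytes-per-sample figures are illustrative assumptions (a node-exporter + cAdvisor setup often exposes a few thousand series, and Prometheus typically compresses samples to a couple of bytes each); your actual numbers will vary.

```python
# Rough estimate of how scrape_interval affects daily write volume.
# num_series and bytes_per_sample are assumed figures for illustration.

def daily_storage_mb(num_series: int, scrape_interval_s: int,
                     bytes_per_sample: float = 2.0) -> float:
    """Approximate MB written per day by one Prometheus server."""
    samples_per_day = num_series * (86_400 / scrape_interval_s)
    return samples_per_day * bytes_per_sample / 1_000_000

for interval in (15, 30, 60):
    print(f"{interval:>3}s scrape interval: "
          f"~{daily_storage_mb(5_000, interval):.0f} MB/day")
```

Doubling the interval halves both the write volume and the resolution of your graphs, so pick the slowest interval you can live with.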
Alert Rules That Actually Work
Create alert-rules.yml. These rules define the conditions that trigger alerts. I keep mine practical and not too noisy:
groups:
  - name: homelab
    interval: 15s
    rules:
      - alert: HostHighCpuLoad
        expr: '(1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.80'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host {{ $labels.instance }} CPU load is high"
          description: "CPU usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"

      - alert: HostHighMemoryUsage
        expr: '(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host {{ $labels.instance }} memory usage is high"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      - alert: HostDiskRunningOutOfSpace
        expr: '(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.10'
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} disk space is low"
          description: "Only {{ $value | humanizePercentage }} of disk space is available"

      - alert: HostHighDiskIOUtilization
        expr: 'rate(node_disk_io_time_seconds_total[5m]) > 0.9'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk I/O on {{ $labels.instance }}"

      - alert: ContainerHighMemoryUsage
        expr: 'container_memory_usage_bytes{name!=""} > 1073741824'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} using >1GB RAM"
          description: "Current usage: {{ $value | humanize }}B"

      - alert: ContainerRestarted
        expr: 'changes(container_start_time_seconds{name!=""}[5m]) > 0'
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} restarted"
I tuned these after running the stack for a week. The for: 5m clause prevents alerts from firing on transient spikes—if CPU spiked for 10 seconds, you don't want a notification. But if it stays high for 5 minutes, something's actually wrong.
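If you're curious how the for: clause behaves under the hood, here's a toy model of the pending-to-firing state machine. This is not Prometheus's actual code, just the logic: the expression must evaluate true on every consecutive tick for the full duration, and a single false evaluation resets the timer.

```python
# Toy model of Prometheus's `for:` clause. With a 15s evaluation_interval,
# `for: 5m` means 20 consecutive true evaluations before firing.

def alert_state(evaluations: list[bool], for_intervals: int) -> str:
    """evaluations: expression results at each evaluation tick, in order."""
    consecutive = 0
    for result in evaluations:
        consecutive = consecutive + 1 if result else 0
    if consecutive == 0:
        return "inactive"
    if consecutive >= for_intervals:
        return "firing"
    return "pending"

# A brief CPU spike (one true tick) never fires with for: 5m:
print(alert_state([False, True, False, False], 20))  # inactive
# Sustained load does:
print(alert_state([True] * 25, 20))                  # firing
```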
Alertmanager Configuration
Create alertmanager.yml to define where and how alerts get sent. Here's a setup that routes critical alerts to email and warnings to Slack:
global:
  resolve_timeout: 5m
  slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'
  smtp_from: '[email protected]'
  smtp_smarthost: 'smtp.example.com:587'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true

route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  routes:
    - match:
        severity: critical
      receiver: 'email-critical'
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 4h

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#homelab-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'email-critical'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#homelab-warnings'
        color: 'warning'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
The inhibit_rules section is subtle but important: if a critical alert is firing, don't also send the related warning. This prevents alert fatigue.
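To make that behavior concrete, here's a simplified sketch of the matching logic. Real Alertmanager works with arbitrary source/target matchers and alert lifecycle state; this toy version hard-codes the critical-suppresses-warning case from the config above.

```python
# Simplified model of Alertmanager inhibition: a warning is suppressed
# when some critical alert is firing with identical values for every
# label listed in `equal`. Illustrative only, not Alertmanager's code.

def is_inhibited(target: dict, firing: list[dict], equal: list[str]) -> bool:
    if target.get("severity") != "warning":
        return False
    return any(
        src.get("severity") == "critical"
        and all(src.get(k) == target.get(k) for k in equal)
        for src in firing
    )

firing = [{"alertname": "HostDiskRunningOutOfSpace",
           "severity": "critical", "instance": "nas:9100"}]
warning = {"alertname": "HostDiskRunningOutOfSpace",
           "severity": "warning", "instance": "nas:9100"}
print(is_inhibited(warning, firing, ["alertname", "instance"]))  # True
```

Note that a warning from a different instance would still be delivered, because the `equal` labels would not all match.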
Setting Up Grafana Dashboards
You can manually create dashboards in Grafana's web UI (port 3000), but I prefer to provision them. Create a directory structure:
mkdir -p grafana/provisioning/dashboards
mkdir -p grafana/provisioning/datasources
Create grafana/provisioning/datasources/prometheus.yml:
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
Dashboards themselves are JSON files, so grafana/provisioning/dashboards needs two things: a small YAML provider file telling Grafana where to load dashboards from, and the dashboard JSON files themselves. You can export JSON from any dashboard you build in the UI, or grab one from Grafana Labs (the "Node Exporter Full" dashboard is excellent). Update your compose file to mount this provisioning directory as I showed earlier, and dashboards will auto-load on startup.
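A minimal provider file can look like this (a sketch; the path under options must match where the JSON files end up inside the container, which with the mount above is the provisioning directory itself):

```yaml
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1

providers:
  - name: 'homelab'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    options:
      path: /etc/grafana/provisioning/dashboards
```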
Running It All
Start the stack with:
docker-compose up -d
Check that everything came up:
docker-compose ps
Visit http://localhost:9090 for Prometheus (you'll see your targets under Status → Targets), http://localhost:3000 for Grafana (default login is admin/changeme—change it immediately), and http://localhost:9093 for Alertmanager's dashboard.
Look at the Prometheus expression browser and run a query like node_cpu_seconds_total. If you see data, your scrapers are working. If not, check the Prometheus logs with docker-compose logs prometheus.
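To build intuition for what queries on node_cpu_seconds_total actually compute, here's the arithmetic behind the HostHighCpuLoad expression, worked on two hypothetical counter samples (made-up numbers, single CPU, five minutes apart):

```python
# node_cpu_seconds_total{mode="idle"} is a counter of seconds spent idle.
# rate(...[5m]) is roughly (later - earlier) / window, and 1 - idle_rate
# is the busy fraction the HostHighCpuLoad rule alerts on.
# The counter values below are hypothetical.

window_s = 300                            # the [5m] range
idle_t0, idle_t1 = 12_000.0, 12_045.0     # idle counter, 5 minutes apart

idle_rate = (idle_t1 - idle_t0) / window_s   # fraction of time spent idle
cpu_busy = 1 - idle_rate

print(f"idle rate: {idle_rate:.2f}, busy: {cpu_busy:.2f}")
print("would alert (> 0.80):", cpu_busy > 0.80)
```

Only 45 of the 300 seconds were idle, so the host was about 85% busy, which clears the 0.80 threshold; if it stays there for the full for: 5m window, the alert fires.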