Monitoring Your Homelab: Prometheus, Grafana, and Alertmanager Setup Guide

When I first started running serious services in my homelab, I had no visibility into what was happening. A service would silently fail, and I wouldn't notice for hours. Now I run Prometheus, Grafana, and Alertmanager as the backbone of my infrastructure monitoring—and I sleep better at night knowing I'll get a notification the moment something goes wrong.

This guide walks you through the exact setup I use: a fully containerized monitoring stack with real alerting rules, dashboards that actually look good, and email/Slack notifications that won't spam you. I'll show you the Docker Compose file, the configuration files that matter, and the common gotchas that ate my time so you don't have to lose yours.

Why This Stack?

I chose Prometheus + Grafana + Alertmanager because they're lightweight enough for a homelab but enterprise-grade enough to trust. Prometheus scrapes metrics from your services at regular intervals (15 seconds by default), stores them in a time-series database, and keeps them for 15 days by default. Grafana visualizes that data beautifully. Alertmanager handles delivery: Prometheus evaluates the alerting rules, and Alertmanager groups, deduplicates, and routes the resulting alerts so you only hear about the ones you care about, through the channels you choose.

Unlike commercial monitoring SaaS, you own the data. There's no bill surprise when you monitor 500 metrics instead of 50. And if you're comfortable with Docker (and if you're self-hosting, you should be), the deployment is straightforward.

The Complete Docker Compose Stack

I run all three components in a single Docker Compose file. Here's what I actually use in production:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    restart: unless-stopped
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    restart: unless-stopped
    networks:
      - monitoring
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    restart: unless-stopped
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    restart: unless-stopped
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    restart: unless-stopped
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  monitoring:
    driver: bridge

This stack includes Node Exporter (for OS metrics like CPU, memory, disk) and cAdvisor (for Docker container metrics). Both are important—Node Exporter tells you if your disk is filling up; cAdvisor shows you which container is eating 80% of your RAM.
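
To see the difference concretely, here are example queries you can run in the Prometheus expression browser once the stack is up (PromQL; the label filters assume the exporters configured above):

```promql
# Node Exporter: fraction of the root filesystem still available
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}

# cAdvisor: the five hungriest containers by memory (name!="" drops aggregate series)
topk(5, container_memory_usage_bytes{name!=""})
```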

Prometheus Configuration

Create a prometheus.yml file in the same directory as your compose file. This tells Prometheus what to scrape and where the alerting rules live:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'homelab'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - 'alert-rules.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'docker'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']

The key lines: scrape_interval: 15s means Prometheus will ask each exporter for fresh metrics every 15 seconds. Lower this if you want faster alerts; raise it if you want less database write pressure on old hardware.

Watch out: the default retention is 15 days unless you override it. If you're monitoring a homelab with lots of services, the TSDB can easily accumulate 10–20 GB on disk, so keep an eye on usage. I bumped mine to 30 days (--storage.tsdb.retention.time=30d in the compose file) because storage is cheap compared to the regret of losing a month of metrics when you need to debug a recurring issue.
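
If you'd rather cap disk usage than time, Prometheus also supports a size-based retention limit; whichever limit is hit first wins. A sketch of the extra flag in the Prometheus command list (the 10GB value is an example, tune it to your disk):

```yaml
command:
  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--storage.tsdb.path=/prometheus'
  - '--storage.tsdb.retention.time=30d'
  # Cap total TSDB size; the oldest blocks are dropped first when the limit is hit
  - '--storage.tsdb.retention.size=10GB'
```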

Alert Rules That Actually Work

Create alert-rules.yml. These rules define the conditions that trigger alerts. I keep mine practical and not too noisy:

groups:
  - name: homelab
    interval: 15s
    rules:
      - alert: HostHighCpuLoad
        expr: '(1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.80'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host {{ $labels.instance }} CPU load is high"
          description: "CPU usage is {{ humanizePercentage $value }} on {{ $labels.instance }}"

      - alert: HostHighMemoryUsage
        expr: '(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host {{ $labels.instance }} memory usage is high"
          description: "Memory usage is {{ humanizePercentage $value }}"

      - alert: HostDiskRunningOutOfSpace
        expr: '(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.10'
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} disk space is low"
          description: "Disk space is only {{ humanizePercentage $value }} available"

      - alert: HostHighDiskIOUtilization
        expr: 'rate(node_disk_io_time_seconds_total[5m]) > 0.9'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk I/O on {{ $labels.instance }}"

      - alert: ContainerHighMemoryUsage
        expr: 'container_memory_usage_bytes{name!=""} > 1073741824'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} using >1GB RAM"
          description: "Current usage: {{ humanize $value }}B"

      - alert: ContainerRestarted
        expr: 'changes(container_start_time_seconds{name!=""}[5m]) > 0'
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} restarted"

I tuned these after running the stack for a week. The for: 5m clause prevents alerts from firing on transient spikes—if CPU spiked for 10 seconds, you don't want a notification. But if it stays high for 5 minutes, something's actually wrong.

Tip: Start with severity labels (critical, warning, info) and use them in your notification routing. Critical alerts should wake you up; warnings should go to Slack or email; info can be logged but not sent.
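
In Alertmanager terms, "logged but not sent" means routing info alerts to a receiver that has no notification configs, so they stay visible in the Alertmanager UI without paging anyone. A sketch (the 'null' receiver name is just a convention, not a built-in):

```yaml
route:
  routes:
    - match:
        severity: info
      receiver: 'null'

receivers:
  # A receiver with no *_configs entries silently swallows notifications
  - name: 'null'
```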

Alertmanager Configuration

Create alertmanager.yml to define where and how alerts get sent. Here's a setup that routes critical alerts to email and warnings to Slack:

global:
  resolve_timeout: 5m
  slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'
  smtp_from: '[email protected]'
  smtp_smarthost: 'smtp.example.com:587'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true

route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  routes:
    - match:
        severity: critical
      receiver: 'email-critical'
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 4h

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#homelab-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'email-critical'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: 'CRITICAL: {{ .GroupLabels.alertname }}'

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#homelab-warnings'
        color: 'warning'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

The inhibit_rules section is subtle but important: if a critical alert is firing, don't also send the related warning. This prevents alert fatigue.
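
Another fatigue-reducer worth knowing about: Alertmanager (v0.24+) supports time intervals, so non-critical routes can be muted overnight. A sketch, assuming you want warning notifications held between 22:00 and 07:00 (the alerts still fire and resolve normally; only the notifications are suppressed):

```yaml
time_intervals:
  - name: nights
    time_intervals:
      - times:
          - start_time: '22:00'
            end_time: '07:00'

route:
  routes:
    - match:
        severity: warning
      receiver: 'slack-warnings'
      mute_time_intervals: ['nights']
```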

Setting Up Grafana Dashboards

You can manually create dashboards in Grafana's web UI (port 3000), but I prefer to provision them. Create a directory structure:

mkdir -p grafana/provisioning/dashboards
mkdir -p grafana/provisioning/datasources

Create grafana/provisioning/datasources/prometheus.yml:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

Then create a dashboard provider file in grafana/provisioning/dashboards/ that tells Grafana where to load dashboard JSON from, and drop your dashboard JSON files next to it (the "Node Exporter Full" dashboard from Grafana Labs is an excellent starting point). The compose file mounts this provisioning directory as shown earlier, so dashboards auto-load on startup.
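
A minimal provider file, assuming you name it grafana/provisioning/dashboards/dashboards.yml and keep the dashboard JSON files in the same mounted directory:

```yaml
apiVersion: 1

providers:
  - name: 'homelab'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      # Path inside the container; matches the provisioning mount in the compose file
      path: /etc/grafana/provisioning/dashboards
```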

Running It All
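
One gotcha before you start: the compose file bind-mounts individual config files, and if any of them doesn't exist yet, Docker quietly creates an empty directory at that path instead of failing, which leads to confusing container crash loops. A quick pre-flight sketch (file names match the mounts above):

```shell
#!/bin/sh
# Verify the bind-mounted config files exist before bringing the stack up.
for f in prometheus.yml alert-rules.yml alertmanager.yml; do
  if [ ! -f "$f" ]; then
    echo "missing: $f"
  fi
done
```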

Start the stack with:

docker-compose up -d

Check that everything came up:

docker-compose ps

Visit http://localhost:9090 for Prometheus (you'll see your targets under Status → Targets), http://localhost:3000 for Grafana (default login is admin/changeme—change it immediately), and http://localhost:9093 for Alertmanager's dashboard.

Open the Prometheus expression browser and run a query like node_cpu_seconds_total. If you see data, your scrapers are working. If not, check the Prometheus logs with docker-compose logs prometheus. It's also worth validating your config files up front: promtool check config prometheus.yml and amtool check-config alertmanager.yml (both tools ship in the respective images) will catch YAML mistakes before they bite.

Beyond the Basics