
Writing Prometheus alerting rules that fire on real problems, not on noise

A working set of alerting rules in Prometheus — file layout, recording rules to keep the queries fast, severity routing labels, and the `for` durations that separate flaps from incidents.

A Prometheus alerting setup goes wrong in one of two directions. Either the rules are so loose that one bad week trains the on-call to ignore the channel, or they are so strict and so few that real problems sail past for hours. The fix is mechanical: a small file layout, a discipline of recording rules feeding alerting rules, severity labels that map cleanly to receivers, and `for` clauses that match the timescale of the thing being measured. This article shows the shape we use for client clusters.

How to verify

Inspect what Prometheus actually has loaded — the file on disk and the file in memory can diverge if a reload was skipped.

promtool check rules /etc/prometheus/rules.d/*.yml
curl -fsS http://127.0.0.1:9090/api/v1/rules | jq '.data.groups[] | {name, interval, file, rules: (.rules | length)}'
curl -fsS http://127.0.0.1:9090/api/v1/alerts | jq '.data.alerts[] | {labels: .labels, state, activeAt}'
curl -fsS -G http://127.0.0.1:9090/api/v1/query --data-urlencode 'query=ALERTS{alertstate="firing"}' | jq '.data.result[] | .metric'

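If `promtool check rules` passes but `/api/v1/rules` does not list the file, Prometheus did not pick up the change. Reload with `curl -XPOST http://127.0.0.1:9090/-/reload` (the endpoint is only available when Prometheus runs with `--web.enable-lifecycle`), then confirm the `rule_files` entry in `/api/v1/status/config`.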

What’s happening

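Prometheus evaluates each rule group every `evaluation_interval` (1m by default; the example `prometheus.yml` shipped with Prometheus sets it to 15s), unless the group sets its own `interval`. Rules come in two kinds. Recording rules compute a value and write it back as a new time series, which is useful when the same expensive query is referenced by many alerts or dashboards. Alerting rules evaluate an expression; every series the expression returns becomes a pending alert, and if it keeps being returned for the whole `for` duration, the alert fires and is sent to Alertmanager.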
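
As a rough sketch of where these intervals live, the fragment below sets the global value in `prometheus.yml` and overrides it for one group. The interval values and the group name `slow-recording` are illustrative assumptions, not recommendations from this article.

```yaml
# prometheus.yml (fragment); values are illustrative assumptions
global:
  scrape_interval: 30s       # how often targets are scraped
  evaluation_interval: 30s   # how often rule groups are evaluated by default

rule_files:
  - /etc/prometheus/rules.d/*.yml

# A rule file (separate file); "slow-recording" is a hypothetical group name
# groups:
#   - name: slow-recording
#     interval: 2m           # overrides evaluation_interval for this group only
#     rules: []
```

A `for` duration is only checked at evaluation time, so it is effectively rounded up to the next evaluation of the group.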

The `for` clause is the lever that separates a flap from an incident. A condition true for 30 seconds during a deploy is not the same as a condition true for ten minutes during a saturation event. A reasonable default is 5–10 minutes for “service degraded” alerts, 1–2 minutes for “service down” alerts on critical paths, and 0 (immediate) only for conditions where a single sample is already an incident (replication broken, lock file present).

Labels on the alerting rule shape the Alertmanager routing. We use two labels and one annotation consistently: the labels `severity` (critical / warning / info) and `team` (the on-call rotation responsible), and the annotation `runbook_url` (a link the on-call can open at 3 a.m.). Alertmanager routes on `severity` and `team`; the runbook URL goes into the notification template.
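
To make the routing side concrete, here is a minimal sketch of an Alertmanager route tree that matches on those two labels. The receiver names `default-chat` and `web-pager` are hypothetical, and the receivers are left without integrations; a real configuration would add its own notifier settings.

```yaml
# alertmanager.yml (fragment); receiver names are hypothetical
route:
  receiver: default-chat          # fallback when no child route matches
  group_by: [alertname, team]
  routes:
    - matchers:
        - 'severity = "critical"'
        - 'team = "web"'
      receiver: web-pager
    - matchers:
        - 'severity = "warning"'
      receiver: default-chat

receivers:
  - name: default-chat
  - name: web-pager
```

Because both matchers are plain equality checks on labels set in the rule file, a reviewer can tell where an alert will land by reading the rule alone.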

The procedure

  1. Lay the files out by concern, not by team. A flat rules.d/ with one file per subsystem keeps the diffs reviewable.

    /etc/prometheus/rules.d/
      recording-node.yml
      recording-http.yml
      alerts-node.yml
      alerts-http.yml
      alerts-postgres.yml
      alerts-nginx.yml
      alerts-blackbox.yml

    Point prometheus.yml at the directory:

    rule_files:
      - /etc/prometheus/rules.d/*.yml

  2. Write recording rules for any expression used by more than one alert or panel. The expression is then evaluated once per interval and stored, rather than recomputed by every alert and dashboard query, and the alert gets a stable name to refer to.

    # /etc/prometheus/rules.d/recording-http.yml
    groups:
      - name: http-recording
        interval: 30s
        rules:
          - record: job:http_requests:rate5m
            expr: sum by (job, route) (rate(http_requests_total[5m]))
          - record: job:http_request_duration_seconds:p95_5m
            expr: |
              histogram_quantile(0.95,
                sum by (job, route, le) (
                  rate(http_request_duration_seconds_bucket[5m])
                )
              )
          - record: job:http_errors:rate5m
            expr: sum by (job, route) (rate(http_requests_total{status=~"5.."}[5m]))

  3. Write alerting rules that read from the recording rules. Each alert has labels (routing) and annotations (human-readable).

    # /etc/prometheus/rules.d/alerts-http.yml
    groups:
      - name: http-alerts
        interval: 30s
        rules:
          - alert: HighErrorRate
            expr: |
              (job:http_errors:rate5m / job:http_requests:rate5m) > 0.05
            for: 10m
            labels:
              severity: critical
              team: web
            annotations:
              summary: "{{ $labels.job }}/{{ $labels.route }} error rate > 5%"
              description: "5xx rate on {{ $labels.route }} sustained above 5% for 10 minutes."
              runbook_url: "https://runbooks.example.com/http-high-errors"
    
          - alert: HighLatencyP95
            expr: job:http_request_duration_seconds:p95_5m > 1.5
            for: 10m
            labels:
              severity: warning
              team: web
            annotations:
              summary: "{{ $labels.job }}/{{ $labels.route }} p95 > 1.5s"
              description: "p95 latency on {{ $labels.route }} above 1.5s for 10 minutes."
              runbook_url: "https://runbooks.example.com/http-high-latency"

  4. Match the `for` duration to the failure mode. A network blip that causes one missed scrape should not page, so pair `up == 0` with `for: 5m` for “instance down,” not `for: 0s`. Staleness checks put the tolerance in the threshold instead: a nightly backup that has not succeeded in 36 hours should alert, while one that has not run in the last 30 minutes is simply between runs, so `for: 0s` is fine there.

    - alert: InstanceDown
      expr: up{job="node"} == 0
      for: 5m
      labels: { severity: critical, team: platform }
      annotations:
        summary: "{{ $labels.instance }} down"
        runbook_url: "https://runbooks.example.com/instance-down"
    
    - alert: BackupStale
      expr: time() - backup_last_success_timestamp_seconds > 36 * 3600
      for: 0s
      labels: { severity: warning, team: platform }
      annotations:
        summary: "Backup {{ $labels.job }} has not succeeded in >36h"
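
    One gap worth sketching: if the backup job stops exporting its metric entirely, the subtraction in `BackupStale` returns nothing and the alert stays silent. A companion rule along these lines would cover that case. The alert name, the `job="backup"` selector and the one-hour `for` are assumptions for illustration.

    ```yaml
    - alert: BackupMetricMissing
      # absent() returns a series with value 1 when nothing matches the selector.
      # The job label value "backup" is a hypothetical example.
      expr: absent(backup_last_success_timestamp_seconds{job="backup"})
      for: 1h
      labels: { severity: warning, team: platform }
      annotations:
        summary: "No backup_last_success_timestamp_seconds series for job backup"
    ```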

  5. Validate and reload. `promtool check rules /etc/prometheus/rules.d/*.yml` returns non-zero on any error; CI should run it on every change. Then reload with `curl -XPOST http://127.0.0.1:9090/-/reload` if `--web.enable-lifecycle` is set; otherwise send the Prometheus process a SIGHUP or restart it.
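
    As one possible shape for that CI step, the workflow below runs both promtool commands from the official container image. It assumes GitHub Actions and a repository with `rules.d/` and `rules-tests/` at its root; the workflow path, job name and layout are hypothetical and should be adapted to whatever CI system is in use.

    ```yaml
    # .github/workflows/prometheus-rules.yml (hypothetical path and repo layout)
    name: prometheus-rules
    on:
      pull_request:
        paths:
          - "rules.d/**"
          - "rules-tests/**"
    jobs:
      promtool:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Check rule syntax
            run: |
              docker run --rm -v "$PWD:/work" -w /work --entrypoint promtool \
                prom/prometheus:latest check rules rules.d/*.yml
          - name: Run rule unit tests
            run: |
              docker run --rm -v "$PWD:/work" -w /work --entrypoint promtool \
                prom/prometheus:latest test rules rules-tests/*.yml
    ```

    Pinning the image to the Prometheus version running in production, rather than `latest`, keeps the check from accepting syntax the server does not understand.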

  6. Unit-test the rules. Use `promtool test rules` with a YAML fixture so a refactor of the recording rule cannot silently change which alerts fire.

    # /etc/prometheus/rules-tests/http.yml
    rule_files:
      - ../rules.d/recording-http.yml
      - ../rules.d/alerts-http.yml
    evaluation_interval: 30s
    tests:
      - name: high error rate trips after 10 minutes
        interval: 30s
        input_series:
          - series: 'http_requests_total{job="web",route="/checkout",status="200"}'
            values: '1000+1000x40'
          - series: 'http_requests_total{job="web",route="/checkout",status="500"}'
            values: '100+100x40'
        alert_rule_test:
          - eval_time: 11m
            alertname: HighErrorRate
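            exp_alerts:
              - exp_labels: { severity: critical, team: web, job: web, route: /checkout }
                # promtool compares annotations as well, so list them as rendered.
                # The route label already starts with "/", hence the double slash.
                exp_annotations:
                  summary: "web//checkout error rate > 5%"
                  description: "5xx rate on /checkout sustained above 5% for 10 minutes."
                  runbook_url: "https://runbooks.example.com/http-high-errors"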
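
    The fixture above covers the firing path. A sketch of the two quiet paths, in a hypothetical second fixture, checks that the alert is still pending before the `for` duration elapses and that a 1% error rate never fires. The file name `http-quiet.yml` is an assumption; `promtool` only reports firing alerts, so an empty `exp_alerts` is the expected result in both cases.

    ```yaml
    # /etc/prometheus/rules-tests/http-quiet.yml (hypothetical file name)
    rule_files:
      - ../rules.d/recording-http.yml
      - ../rules.d/alerts-http.yml
    evaluation_interval: 30s
    tests:
      - name: high error rate is still pending at 5 minutes
        interval: 30s
        input_series:
          - series: 'http_requests_total{job="web",route="/checkout",status="200"}'
            values: '1000+1000x40'
          - series: 'http_requests_total{job="web",route="/checkout",status="500"}'
            values: '100+100x40'
        alert_rule_test:
          - eval_time: 5m
            alertname: HighErrorRate
            exp_alerts: []   # condition holds, but for: 10m has not elapsed
      - name: one percent errors never fires
        interval: 30s
        input_series:
          - series: 'http_requests_total{job="web",route="/checkout",status="200"}'
            values: '1000+1000x40'
          - series: 'http_requests_total{job="web",route="/checkout",status="500"}'
            values: '10+10x40'   # about 1% of requests, below the 5% threshold
        alert_rule_test:
          - eval_time: 15m
            alertname: HighErrorRate
            exp_alerts: []
    ```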

Common pitfalls

  • Putting business logic inside alerting expressions instead of in a recording rule; debugging a false page becomes “rewrite the PromQL in your head at 3 a.m.”
  • Using a `for` shorter than the scrape interval; consecutive rule evaluations can then see the same scraped sample, so a single bad sample is enough to fire and the `for` gives no real protection.
  • Omitting severity or team labels and relying on Alertmanager regex to route; routing becomes spooky-action-at-a-distance.
  • A runbook_url that points to a missing wiki page; the on-call ends up paging the author for context, which is the failure mode the runbook was supposed to prevent.
  • An `expr` that aggregates away a label it should alert on, for example a `sum by (job)` where `by (cluster, job)` was needed; alerts collapse across clusters and a single noisy environment swamps the receiver.
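
The last pitfall is easiest to see side by side. The sketch below assumes the scraped series carry a `cluster` label; the group name and the record name `cluster_job:http_errors:rate5m` are hypothetical.

```yaml
groups:
  - name: http-recording-per-cluster
    rules:
      # Drops cluster: every environment is summed into one series per job and route.
      - record: job:http_errors:rate5m
        expr: sum by (job, route) (rate(http_requests_total{status=~"5.."}[5m]))
      # Keeps cluster: alerts built on this series fire per environment.
      - record: cluster_job:http_errors:rate5m
        expr: sum by (cluster, job, route) (rate(http_requests_total{status=~"5.."}[5m]))
```

An alert built on the second series carries `cluster` in its labels, so Alertmanager can group or route on it.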

For Alertmanager routing tied to the labels you set here, see alertmanager-config.


Stack Harbor maintains the alerting rule set for clients in our managed operations tier — recording rules in version control, alerts tested with promtool test rules in CI, and every rule with a real runbook URL the on-call can open.