A Prometheus alerting setup goes wrong in one of two directions. Either the rules are so loose that one bad week trains the on-call to ignore the channel, or they are so strict and so few that real problems sail past for hours. The fix is mechanical: a small file layout, a discipline of recording rules feeding alerting rules, severity labels that map cleanly to receivers, and for clauses that match the timescale of the thing being measured. This article shows the shape we use for client clusters.

How to verify

Inspect what Prometheus actually has loaded — the file on disk and the file in memory can diverge if a reload was skipped.

promtool check rules /etc/prometheus/rules.d/*.yml
curl -fsS http://127.0.0.1:9090/api/v1/rules | jq '.data.groups[] | {name, interval, file, rules: (.rules | length)}'
curl -fsS http://127.0.0.1:9090/api/v1/alerts | jq '.data.alerts[] | {labels: .labels, state, activeAt}'
curl -fsS 'http://127.0.0.1:9090/api/v1/query?query=ALERTS{alertstate="firing"}' | jq '.data.result[] | .metric'

If promtool check rules passes but /api/v1/rules does not list the file, Prometheus did not pick up the change — reload via curl -XPOST http://127.0.0.1:9090/-/reload and check /api/v1/status/config.

What’s happening

Prometheus evaluates rule files every evaluation_interval (15s by default). Rules come in two kinds. Recording rules compute a value and write it back as a new time series — useful when the same expensive query is referenced by many alerts or dashboards. Alerting rules evaluate a boolean condition and, if true continuously for the for duration, fire an alert to Alertmanager.

The for clause is the lever that separates a flap from an incident. A condition true for 30 seconds during a deploy is not the same as a condition true for ten minutes during a saturation event. A reasonable default is 5–10 minutes for “service degraded” alerts, 1–2 minutes for “service down” alerts on critical paths, and 0 (immediate) only for events you cannot afford to miss for any sample (replication broken, lock file present).

Labels on the alerting rule shape the Alertmanager routing. We use three labels consistently: severity (critical / warning / info), team (the on-call rotation responsible), and runbook_url (a link the on-call can open at 3 a.m.). Alertmanager routes on severity and team; the runbook URL goes into the notification template.

The procedure

Lay the files out by concern, not by team. A flat rules.d/ with one file per subsystem keeps the diffs reviewable.

/etc/prometheus/rules.d/
  recording-node.yml
  recording-http.yml
  alerts-node.yml
  alerts-http.yml
  alerts-postgres.yml
  alerts-nginx.yml
  alerts-blackbox.yml

Point prometheus.yml at the directory:

rule_files:
  - /etc/prometheus/rules.d/*.yml

Write recording rules for any expression used by more than one alert or panel. They are cheaper, and they give the alert a stable name to refer to.

# /etc/prometheus/rules.d/recording-http.yml
groups:
  - name: http-recording
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job, route) (rate(http_requests_total[5m]))
      - record: job:http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (job, route, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )
      - record: job:http_errors:rate5m
        expr: sum by (job, route) (rate(http_requests_total{status=~"5.."}[5m]))

Write alerting rules that read from the recording rules. Each alert has labels (routing) and annotations (human-readable).

# /etc/prometheus/rules.d/alerts-http.yml
groups:
  - name: http-alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          (job:http_errors:rate5m / job:http_requests:rate5m) > 0.05
        for: 10m
        labels:
          severity: critical
          team: web
        annotations:
          summary: "{{ $labels.job }}/{{ $labels.route }} error rate > 5%"
          description: "5xx rate on {{ $labels.route }} sustained above 5% for 10 minutes."
          runbook_url: "https://runbooks.example.com/http-high-errors"

      - alert: HighLatencyP95
        expr: job:http_request_duration_seconds:p95_5m > 1.5
        for: 10m
        labels:
          severity: warning
          team: web
        annotations:
          summary: "{{ $labels.job }}/{{ $labels.route }} p95 > 1.5s"
          description: "p95 latency on {{ $labels.route }} above 1.5s for 10 minutes."
          runbook_url: "https://runbooks.example.com/http-high-latency"

Match the for duration to the failure mode. A network blip causing one missed scrape should not page. Use up == 0 with for: 5m for “instance down,” not for: 0s. A backup that has not run in 36 hours should page; one that has not run in 30 minutes (because it runs nightly) should not.

- alert: InstanceDown
  expr: up{job="node"} == 0
  for: 5m
  labels: { severity: critical, team: platform }
  annotations:
    summary: "{{ $labels.instance }} down"
    runbook_url: "https://runbooks.example.com/instance-down"

- alert: BackupStale
  expr: time() - backup_last_success_timestamp_seconds > 36 * 3600
  for: 0s
  labels: { severity: warning, team: platform }
  annotations:
    summary: "Backup {{ $labels.job }} has not succeeded in >36h"

Validate and reload. promtool check rules /etc/prometheus/rules.d/*.yml returns non-zero on any error; CI should run it on every change. Then curl -XPOST http://127.0.0.1:9090/-/reload — or restart, if --web.enable-lifecycle is not set.

Unit-test the rules. Use promtool test rules with a YAML fixture so a refactor of the recording rule cannot silently change which alerts fire.

# /etc/prometheus/rules-tests/http.yml
rule_files:
  - ../rules.d/recording-http.yml
  - ../rules.d/alerts-http.yml
evaluation_interval: 30s
tests:
  - name: high error rate trips after 10 minutes
    interval: 30s
    input_series:
      - series: 'http_requests_total{job="web",route="/checkout",status="200"}'
        values: '1000+1000x40'
      - series: 'http_requests_total{job="web",route="/checkout",status="500"}'
        values: '100+100x40'
    alert_rule_test:
      - eval_time: 11m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels: { severity: critical, team: web, job: web, route: /checkout }

Common pitfalls

Putting business logic inside alerting expressions instead of in a recording rule; debugging a false page becomes “rewrite the PromQL in your head at 3 a.m.”
Using a for shorter than the scrape interval; the alert fires on the first sample anyway and there is no real protection.
Omitting severity or team labels and relying on Alertmanager regex to route; routing becomes spooky-action-at-a-distance.
A runbook_url that points to a missing wiki page; the on-call ends up paging the author for context, which is the failure mode the runbook was supposed to prevent.
expr that references a label like cluster without the corresponding by (cluster) grouping; alerts collapse across clusters and a single noisy environment swamps the receiver.

For Alertmanager routing tied to the labels you set here, see alertmanager-config.

Stack Harbor maintains the alerting rule set for clients in our managed operations tier — recording rules in version control, alerts tested with promtool test rules in CI, and every rule with a real runbook URL the on-call can open.