A Traefik install with no observability is a black box that occasionally serves traffic. The minimum you need in production is three things: Prometheus metrics scraped from a private endpoint, a JSON access log shipped to whatever you use for logs, and traces flowing to a collector so you can correlate slow upstream requests with their entrypoint hop. This article walks through wiring all three, plus the dashboards and alerts that make the data actionable.
How to verify
# Metrics endpoint should answer on a private interface only
curl -s http://127.0.0.1:9100/metrics | head -20
curl -s http://127.0.0.1:9100/metrics | grep -E '^traefik_entrypoint_requests_total' | head
# Access log should exist and be JSON
sudo tail -3 /opt/traefik/logs/access.log | jq .
# A test request should show up in the log within a second
curl -sI https://app.example.com/ >/dev/null
sudo tail -1 /opt/traefik/logs/access.log | jq '{ts: .StartUTC, code: .DownstreamStatus, dur_ms: (.Duration/1000000)}'
If /metrics also answers on the host's external address, the metrics entrypoint is bound to 0.0.0.0 instead of 127.0.0.1. If the access log is in Common Log Format rather than JSON, the jq commands above will fail, and the json pipeline stage used later for Loki will not be able to parse it.
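The same access-log check can be scripted for a health probe or CI step. This is a minimal sketch, not part of Traefik: the function name and the sample line are made up for illustration. It assumes the JSON field names used in the jq command above, and that Duration is in nanoseconds, as the division by 1000000 implies.

```python
import json


def summarize_access_line(line: str) -> dict:
    """Parse one JSON access-log line into the fields checked with jq above.

    Hypothetical helper for illustration; raises ValueError when the line is
    not a JSON object, which is what a Common Log Format line looks like here.
    """
    try:
        entry = json.loads(line)
    except json.JSONDecodeError as exc:
        raise ValueError("line is not JSON; is accessLog.format set to json?") from exc
    if not isinstance(entry, dict):
        raise ValueError("line is JSON but not an object")
    return {
        "ts": entry.get("StartUTC"),
        "code": entry.get("DownstreamStatus"),
        # Duration is assumed to be nanoseconds; convert to milliseconds.
        "dur_ms": entry.get("Duration", 0) / 1_000_000,
    }


# Illustrative sample line, not real traffic.
SAMPLE = '{"StartUTC":"2024-05-01T14:03:00Z","DownstreamStatus":200,"Duration":12500000}'

if __name__ == "__main__":
    print(summarize_access_line(SAMPLE))
```

Feeding it the last line of the access log after a test request gives the same view as the jq one-liner, with an explicit error when the log is not JSON.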
What’s happening
Traefik exposes three observability surfaces and they are configured independently.
Metrics come from a dedicated entrypoint. Best practice is to bind it on 127.0.0.1 (or a Kubernetes ClusterIP without external exposure) so Prometheus scrapes locally over an SSH tunnel or in-cluster, never via the public LB. The metric families that matter are traefik_entrypoint_* (per-entrypoint traffic, useful for total RPS and connection counts), traefik_service_* (per-backend, useful for upstream health), traefik_router_* (per-route, often noisy and high-cardinality), and traefik_config_* (reload counters; a count that keeps climbing means something is churning).
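To see how many series each of those families contributes on a given install, the /metrics output can be counted by prefix. The sketch below is a hand-rolled line counter, not a full Prometheus exposition parser; the function name and the sample text (including its label values) are illustrative assumptions.

```python
from collections import Counter

# The four metric family prefixes named in the text.
FAMILIES = (
    "traefik_entrypoint_",
    "traefik_service_",
    "traefik_router_",
    "traefik_config_",
)


def series_per_family(exposition: str) -> Counter:
    """Count sample lines per metric family in Prometheus text format."""
    counts = Counter()
    for raw in exposition.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        for prefix in FAMILIES:
            if line.startswith(prefix):
                counts[prefix + "*"] += 1
                break
    return counts


# Illustrative sample; real output has many more lines.
SAMPLE = """\
# TYPE traefik_entrypoint_requests_total counter
traefik_entrypoint_requests_total{code="200",entrypoint="websecure",method="GET",protocol="http"} 1024
traefik_entrypoint_requests_total{code="502",entrypoint="websecure",method="GET",protocol="http"} 3
traefik_service_requests_total{code="200",method="GET",protocol="http",service="app@file"} 1000
traefik_config_reloads_total 7
"""

if __name__ == "__main__":
    for family, count in series_per_family(SAMPLE).most_common():
        print(f"{family:24} {count}")
```

Run against the real endpoint output, a traefik_router_* count that dwarfs the others is the signal to turn router labels off.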
Access logs are written to a file in either CLF or JSON; use JSON in production. The default field set includes status code, duration, request path, upstream URL, and client IP. Individual request headers can be kept, dropped, or redacted via accessLog.fields.headers.names. The bufferingSize setting batches writes, which helps at high RPS but means the most recent few lines may not be on disk yet.
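The three header modes are easiest to reason about with a small model. The function below is not Traefik code; it is a sketch of the keep / drop / redact semantics as described in the Traefik documentation, where redact replaces the value with the literal string REDACTED. The function name and the sample headers are hypothetical.

```python
def apply_header_modes(headers: dict, names: dict, default_mode: str = "drop") -> dict:
    """Model which request headers end up in the access log.

    headers:      incoming request headers (name -> value)
    names:        per-header mode overrides, as in accessLog.fields.headers.names
    default_mode: mode for headers not listed in names
    """
    # Header names are case-insensitive, so compare in lower case.
    overrides = {name.lower(): mode for name, mode in names.items()}
    logged = {}
    for name, value in headers.items():
        mode = overrides.get(name.lower(), default_mode)
        if mode == "keep":
            logged[name] = value
        elif mode == "redact":
            logged[name] = "REDACTED"  # header is visible, value is not
        elif mode != "drop":
            raise ValueError(f"unknown mode {mode!r} for header {name!r}")
        # "drop": the header is omitted from the log entirely
    return logged


if __name__ == "__main__":
    request_headers = {
        "Authorization": "Bearer secret-token",
        "Cookie": "session=abc123",
        "X-Forwarded-For": "203.0.113.7",
        "User-Agent": "curl/8.0",
    }
    modes = {"Authorization": "drop", "Cookie": "redact", "X-Forwarded-For": "keep"}
    print(apply_header_modes(request_headers, modes, default_mode="keep"))
```

With a default of keep, only headers that are explicitly listed as drop or redact are protected, so any new sensitive header has to be added to the list by hand.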
Tracing in Traefik v3 is via OpenTelemetry exclusively (the legacy Jaeger/Zipkin/Datadog blocks are gone). A tracing block points at an OTLP collector and emits spans for every request through Traefik, propagating the trace context to upstream services that are themselves instrumented.
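Propagation is easiest to verify by looking at the header an upstream receives. The sketch below assumes W3C Trace Context propagation (the traceparent header, which is OpenTelemetry's default propagator) and parses the version 00 layout; the function and class names are hypothetical, and the sample value is the example from the W3C specification.

```python
import re
from typing import NamedTuple, Optional

# version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # Bit 0 of trace-flags is the "sampled" flag.
    sampled = bool(int(flags, 16) & 0x01)
    return TraceContext(version, trace_id, parent_id, sampled)


if __name__ == "__main__":
    print(parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"))
```

Logging the parsed trace_id in an upstream service is enough to join its logs to the span Traefik emitted for the same request.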
The procedure
- Add the metrics entrypoint and Prometheus producer to static config.

    entryPoints:
      metrics:
        address: "127.0.0.1:9100"

    metrics:
      prometheus:
        entryPoint: metrics
        addEntryPointsLabels: true
        addServicesLabels: true
        addRoutersLabels: false   # high cardinality if many routers
        buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

  Set addRoutersLabels: true only if you have fewer than a hundred routers; otherwise the histogram series count balloons.
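- Configure JSON access log with sensible header handling.

    accessLog:
      filePath: /var/log/traefik/access.log
      format: json
      bufferingSize: 100
      fields:
        defaultMode: keep
        headers:
          defaultMode: keep
          names:
            Authorization: drop
            Cookie: redact
            X-Forwarded-For: keep

  The verification commands and the promtail config read /opt/traefik/logs/access.log. That matches this filePath only if /var/log/traefik is mounted from /opt/traefik/logs on the host (for example, when Traefik runs in a container); otherwise use one path throughout.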
- Configure OTel tracing. Point at your collector: Tempo, a Jaeger collector, a Datadog agent, or the OTel collector itself.

    tracing:
      otlp:
        grpc:
          endpoint: "otel-collector.observability.svc.cluster.local:4317"
          insecure: true
      sampleRate: 0.1   # 10% sampling at high RPS
- Ship the access log to Loki / Elasticsearch / S3. With Loki and promtail:

    # /etc/promtail/config.yml
    scrape_configs:
      - job_name: traefik
        static_configs:
          - targets: [localhost]
            labels:
              job: traefik
              __path__: /opt/traefik/logs/access.log
        pipeline_stages:
          - json:
              expressions:
                status: DownstreamStatus
                method: RequestMethod
                host: RequestHost
                duration_ns: Duration   # Traefik logs Duration in nanoseconds
          - labels:
              status:
              method:
              host:
- Add the Prometheus scrape. Because the endpoint listens on 127.0.0.1, the scraper has to reach it from the same host (a local Prometheus or agent, or the SSH tunnel mentioned earlier) or in-cluster via a ServiceMonitor.

    # prometheus scrape config
    scrape_configs:
      - job_name: traefik
        scrape_interval: 15s
        static_configs:
          - targets: ['127.0.0.1:9100']
- Build the four alerts that actually fire usefully. Adjust thresholds for your traffic shape.

    groups:
      - name: traefik
        rules:
          - alert: TraefikDown
            expr: up{job="traefik"} == 0
            for: 2m
          - alert: Traefik5xxRate
            expr: |
              sum(rate(traefik_service_requests_total{code=~"5.."}[5m]))
                / sum(rate(traefik_service_requests_total[5m])) > 0.02
            for: 5m
          - alert: TraefikLatencyP99
            expr: |
              histogram_quantile(0.99,
                sum by (le, service) (rate(traefik_service_request_duration_seconds_bucket[5m]))
              ) > 1.0
            for: 10m
          - alert: TraefikConfigReloadFailures
            expr: increase(traefik_config_reloads_failure_total[10m]) > 0
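The TraefikLatencyP99 rule depends on how histogram_quantile turns bucket counts into a number, and that is why the bucket boundaries chosen in the metrics config matter. The sketch below re-implements the linear interpolation Prometheus documents for classic histograms; it is an illustration, not Prometheus code, and the bucket counts are invented.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
             ending with (math.inf, total), like the le="+Inf" bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Bucket bounds from the metrics config above; counts are illustrative.
SAMPLE_BUCKETS = [
    (0.05, 600), (0.1, 800), (0.25, 900), (0.5, 950),
    (1, 980), (2.5, 995), (5, 999), (10, 1000), (math.inf, 1000),
]

if __name__ == "__main__":
    print(histogram_quantile(0.99, SAMPLE_BUCKETS))
```

With these sample counts the 99th percentile lands inside the 1 to 2.5 second bucket, so the reported value is an interpolation across a 1.5 second span. If the alert threshold sits inside a wide bucket, consider adding a boundary near it.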
Operational notes
- The traefik_router_* series is high-cardinality: every router becomes a label value. Disable router labels on installs with many short-lived routes, or you will OOM Prometheus.
- JSON access log fields include RouterName, ServiceName, RequestAddr, OriginContentSize, DownstreamContentSize, all useful for traffic analysis; do not strip them.
- Cookie: redact keeps the Cookie header in the log but replaces its value with REDACTED, which is useful for “did a cookie come in at all” debugging without leaking the token.
- TLS-encrypted access from Prometheus to the metrics endpoint is overkill for 127.0.0.1, but if you must scrape across the network, expose metrics behind the regular dashboard auth chain.
- Tracing at 100% sample rate at high RPS costs both Traefik CPU and collector ingestion. Start at 0.05-0.10 and only increase for incident reproduction.
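The router-label warning can be made concrete with back-of-the-envelope arithmetic. The sketch below is a rough worst-case estimate, not a measurement. It assumes the router duration histogram is labelled by router, status code, method, and protocol, and that every combination is actually observed. The function name and the default counts are illustrative.

```python
def estimate_histogram_series(
    routers: int,
    buckets: list,
    codes: int = 5,
    methods: int = 3,
    protocols: int = 1,
) -> int:
    """Worst-case series count for one histogram with router labels on."""
    # One series per finite bucket, plus le="+Inf", plus _sum and _count.
    per_label_combination = len(buckets) + 1 + 2
    return routers * codes * methods * protocols * per_label_combination


# Bucket bounds from the metrics config in the procedure.
BUCKETS = [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

if __name__ == "__main__":
    for router_count in (10, 100, 1000):
        print(router_count, estimate_histogram_series(router_count, BUCKETS))
```

Real installs see fewer combinations than this upper bound, but the growth is linear in the number of routers, which is why short-lived routes are the dangerous case.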
Observability is what turns a Traefik install from “it serves traffic” into “we can answer why the p99 spiked at 14:03”. For the operational view of these signals — the dashboards that show whether the install is healthy at a glance, the alerts that go to a real on-call — we wire and own them as part of managed operations. The companion dashboard-side hardening lives in traefik-dashboard-secure.