Traefik observability — Prometheus metrics, access logs, and OpenTelemetry tracing

Wire Traefik for production observability — Prometheus on a private entrypoint, JSON access logs to Loki, OTel traces to a collector, with the dashboards that matter.

A Traefik install with no observability is a black box that occasionally serves traffic. The minimum you need in production is three things: Prometheus metrics scraped from a private endpoint, a JSON access log shipped to whatever you use for logs, and traces flowing to a collector so you can correlate slow upstream requests with their entrypoint hop. This article walks through wiring all three, plus the dashboards and alerts that make the data actionable.

How to verify

# Metrics endpoint should answer on a private interface only
curl -s http://127.0.0.1:9100/metrics | head -20
curl -s http://127.0.0.1:9100/metrics | grep -E '^traefik_entrypoint_requests_total' | head
# Access log should exist and be JSON
sudo tail -3 /opt/traefik/logs/access.log | jq .
# A test request should show up in the log within a second
curl -sI https://app.example.com/ >/dev/null
sudo tail -1 /opt/traefik/logs/access.log | jq '{ts: .StartUTC, code: .DownstreamStatus, dur_ms: (.Duration/1000000)}'

If /metrics also answers on the host's public address (a 0.0.0.0 bind) instead of only on 127.0.0.1, the metrics entrypoint is on the wrong interface. If the access log is in Common Log Format rather than JSON, Loki or any other structured log store cannot extract fields from it without extra parsing work.
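
The JSON-versus-CLF check can also be scripted. The following is a minimal Python sketch, assuming the field names used in the jq filter above (StartUTC, DownstreamStatus, and Duration in nanoseconds). The function name `summarize_line` and both sample lines are invented for illustration and are not real Traefik output.

```python
import json


def summarize_line(line: str) -> dict:
    """Classify one access-log line and, if it is JSON, pull out the key fields."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        # CLF lines are plain text, so they fail JSON parsing.
        return {"format": "clf"}
    if not isinstance(entry, dict):
        return {"format": "clf"}
    return {
        "format": "json",
        "ts": entry.get("StartUTC"),
        "code": entry.get("DownstreamStatus"),
        # Duration is in nanoseconds; convert to milliseconds like the jq filter.
        "dur_ms": entry.get("Duration", 0) / 1_000_000,
    }


# Hypothetical sample lines for illustration only.
json_line = '{"StartUTC": "2024-01-01T14:03:00Z", "DownstreamStatus": 200, "Duration": 12500000}'
clf_line = '10.0.0.1 - - [01/Jan/2024:14:03:00 +0000] "GET / HTTP/1.1" 200 512'

print(summarize_line(json_line))
print(summarize_line(clf_line))
```

In practice you would feed it the last few lines of the real access log instead of the samples.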

What’s happening

Traefik exposes three observability surfaces and they are configured independently.

Metrics come from a dedicated entrypoint. Best practice is to bind it on 127.0.0.1 (or a Kubernetes ClusterIP without external exposure) so Prometheus scrapes it locally, over an SSH tunnel, or in-cluster, and never via the public LB. The metric families that matter are traefik_entrypoint_* (per-entrypoint traffic, useful for total RPS and connection counts), traefik_service_* (per-backend, useful for upstream health), traefik_router_* (per-route, often noisy and high-cardinality), and traefik_config_* (reload counters; a count that keeps climbing means something is churning).
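
To see how many series each of these families contributes, you can count the lines of a /metrics scrape by prefix. The sketch below does this in Python over a small hypothetical sample; the sample values, label sets, and the helper name `series_per_family` are assumptions made for illustration.

```python
from collections import Counter

FAMILIES = (
    "traefik_entrypoint_",
    "traefik_service_",
    "traefik_router_",
    "traefik_config_",
)


def series_per_family(exposition: str) -> Counter:
    """Count sample lines per metric family in Prometheus text exposition format."""
    counts = Counter()
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and the # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        for prefix in FAMILIES:
            if line.startswith(prefix):
                counts[prefix + "*"] += 1
                break
    return counts


# Hypothetical scrape output for illustration only.
SAMPLE = """\
# HELP traefik_config_reloads_total Config reloads
# TYPE traefik_config_reloads_total counter
traefik_config_reloads_total 3
traefik_entrypoint_requests_total{code="200",entrypoint="websecure",method="GET",protocol="http"} 1042
traefik_entrypoint_requests_total{code="502",entrypoint="websecure",method="GET",protocol="http"} 7
traefik_service_requests_total{code="200",method="GET",protocol="http",service="app@file"} 1035
"""

counts = series_per_family(SAMPLE)
for family, n in sorted(counts.items()):
    print(family, n)
```

Run against a real scrape, a traefik_router_* count that dwarfs the others is the signal to leave router labels off.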

Access logs are written in either Common Log Format (CLF) or JSON, to stdout by default or to a file when filePath is set. Always use JSON in production. The default field set includes status code, duration, request path, upstream URL, and client IP. You can keep, drop, or redact individual headers via accessLog.fields.headers. The bufferingSize setting holds that many log lines in memory before writing them out, which helps at high RPS but means the most recent few lines may not be on disk yet.

Tracing in Traefik v3 is via OpenTelemetry exclusively (the legacy Jaeger/Zipkin/Datadog blocks are gone). A tracing block points at an OTLP collector and emits spans for every request through Traefik, propagating the trace context to upstream services that are themselves instrumented.
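
To make trace context propagation concrete: OpenTelemetry's default propagator is W3C Trace Context, which carries the trace in a traceparent request header. Assuming your setup uses that default, an instrumented upstream would receive a header shaped like the one below. This Python sketch parses the version-00 layout; the function and type names are hypothetical, and the sample value is the example from the W3C specification.

```python
import re
from typing import NamedTuple, Optional

# Version-00 layout: 00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version-00 traceparent header; return None if it is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid per the W3C Trace Context specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    # The lowest flag bit marks the trace as sampled.
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx)
```

The trace_id is the value to search for in your tracing backend when correlating a slow upstream request with its Traefik span.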

The procedure

  1. Add the metrics entrypoint and the Prometheus metrics exporter to the static config.

    entryPoints:
      metrics:
        address: "127.0.0.1:9100"
    metrics:
      prometheus:
        entryPoint: metrics
        addEntryPointsLabels: true
        addServicesLabels: true
        addRoutersLabels: false  # high cardinality if many routers
        buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

    Set addRoutersLabels: true only if you have fewer than a hundred routers; otherwise the histogram series count balloons.

  2. Configure JSON access log with sensible header handling.

    accessLog:
      filePath: /var/log/traefik/access.log  # must resolve to the file you tail and ship (/opt/traefik/logs/access.log in the other examples)
      format: json
      bufferingSize: 100
      fields:
        defaultMode: keep
        headers:
          defaultMode: keep
          names:
            Authorization: drop
            Cookie: redact
            X-Forwarded-For: keep

  3. Configure OTel tracing. Point at your collector: Tempo, a Jaeger collector, a Datadog agent, or the OTel Collector itself.

    tracing:
      otlp:
        grpc:
          endpoint: "otel-collector.observability.svc.cluster.local:4317"
          insecure: true
      sampleRate: 0.1   # 10% sampling at high RPS

  4. Ship the access log to Loki / Elasticsearch / S3. With Loki and promtail:

    # /etc/promtail/config.yml
    scrape_configs:
      - job_name: traefik
        static_configs:
          - targets: [localhost]
            labels:
              job: traefik
              __path__: /opt/traefik/logs/access.log
        pipeline_stages:
          - json:
              expressions:
                status: DownstreamStatus
                method: RequestMethod
                host: RequestHost
                duration_ns: Duration  # Traefik logs Duration in nanoseconds
          - labels:
              status:
              method:
              host:
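
  5. Add the Prometheus scrape. On a single host, Prometheus has to run on the same machine (or reach the port through a tunnel) to scrape 127.0.0.1; in-cluster, use a ServiceMonitor instead. Port 9100 is also node_exporter's default, so pick a different port if both run on the same host.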

    # prometheus scrape config
    - job_name: traefik
      static_configs:
        - targets: ['127.0.0.1:9100']
      scrape_interval: 15s

  6. Build the four alerts that actually fire usefully. Adjust thresholds for your traffic shape.

    groups:
      - name: traefik
        rules:
          - alert: TraefikDown
            expr: up{job="traefik"} == 0
            for: 2m
          - alert: Traefik5xxRate
            expr: |
              sum(rate(traefik_service_requests_total{code=~"5.."}[5m]))
                / sum(rate(traefik_service_requests_total[5m])) > 0.02
            for: 5m
          - alert: TraefikLatencyP99
            expr: |
              histogram_quantile(0.99,
                sum by (le, service) (rate(traefik_service_request_duration_seconds_bucket[5m]))
              ) > 1.0
            for: 10m
          - alert: TraefikConfigReloadFailures
            expr: increase(traefik_config_reloads_failure_total[10m]) > 0
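
The TraefikLatencyP99 alert depends on how histogram_quantile estimates a quantile from bucket counters, which in turn depends on the bucket boundaries from step 1. The sketch below is a simplified Python re-implementation of the linear interpolation described in the Prometheus documentation, not Prometheus code. The cumulative counts are hypothetical.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if len(buckets) < 2:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies beyond the last finite bucket: report that bound.
                return buckets[-2][0]
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Bucket bounds from step 1 plus +Inf; the counts are hypothetical.
SAMPLE = [
    (0.05, 500), (0.1, 800), (0.25, 900), (0.5, 950), (1, 980),
    (2.5, 995), (5, 999), (10, 1000), (math.inf, 1000),
]

p99 = histogram_quantile(0.99, SAMPLE)
print(f"p99 estimate: {p99:.2f}s")
```

With these counts the 99th percentile falls in the 1 to 2.5 second bucket, so the estimate is interpolated between those two bounds. If your latency target sits inside a wide bucket, adding a boundary near the alert threshold makes the estimate tighter.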

Operational notes

  • The traefik_router_* series is high-cardinality — every router becomes a label value. Disable router labels on installs with many short-lived routes, or you will OOM Prometheus.
  • JSON access log fields include RouterName, ServiceName, RequestAddr, OriginContentSize, DownstreamContentSize — all useful for traffic analysis; do not strip them.
  • Cookie: redact keeps the Cookie header in the log but replaces its value with REDACTED, which is useful for “did a cookie come in” debugging without leaking the token.
  • TLS between Prometheus and the metrics endpoint is overkill for 127.0.0.1, but if you must scrape across the network, put the metrics endpoint behind the same TLS and authentication chain that protects the dashboard.
  • Tracing at 100% sample rate at high RPS costs both Traefik CPU and collector ingestion — start at 0.05-0.10 and only increase for incident reproduction.
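
The effect of the header settings from step 2 is easier to reason about with a small simulation. This Python sketch mimics the keep / drop / redact modes on a dictionary of request headers. It is an illustration of the behaviour described above, not Traefik's implementation; the function name, the case-insensitive matching, and the sample header values are assumptions.

```python
REDACTED = "REDACTED"


def apply_header_modes(
    headers: dict[str, str],
    default_mode: str,
    names: dict[str, str],
) -> dict[str, str]:
    """Return the headers as they would appear in the log under the given modes."""
    # Assumption of this sketch: header names are matched case-insensitively.
    overrides = {name.lower(): mode for name, mode in names.items()}
    logged = {}
    for name, value in headers.items():
        mode = overrides.get(name.lower(), default_mode)
        if mode == "keep":
            logged[name] = value
        elif mode == "redact":
            # The header stays visible, the value does not.
            logged[name] = REDACTED
        elif mode != "drop":
            raise ValueError(f"unknown mode: {mode}")
    return logged


# Hypothetical request headers for illustration only.
request_headers = {
    "Authorization": "Bearer abc123",
    "Cookie": "session=xyz",
    "X-Forwarded-For": "203.0.113.7",
    "User-Agent": "curl/8.0",
}
modes = {"Authorization": "drop", "Cookie": "redact", "X-Forwarded-For": "keep"}

logged = apply_header_modes(request_headers, "keep", modes)
print(logged)
```

Note that with defaultMode: keep, any header you did not list (User-Agent here) is logged in full, so sensitive custom headers need their own drop or redact entry.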

Observability is what turns a Traefik install from “it serves traffic” into “we can answer why the p99 spiked at 14:03”. For the operational view of these signals — the dashboards that show whether the install is healthy at a glance, the alerts that go to a real on-call — we wire and own them as part of managed operations. The companion dashboard-side hardening lives in traefik-dashboard-secure.