A Traefik install with no observability is a black box that occasionally serves traffic. The minimum you need in production is three things: Prometheus metrics scraped from a private endpoint, a JSON access log shipped to whatever you use for logs, and traces flowing to a collector so you can correlate slow upstream requests with their entrypoint hop. This article walks through wiring all three, plus the dashboards and alerts that make the data actionable.

How to verify

# Metrics endpoint should answer on a private interface only
curl -s http://127.0.0.1:9100/metrics | head -20
curl -s http://127.0.0.1:9100/metrics | grep -E '^traefik_entrypoint_requests_total' | head
# Access log should exist and be JSON
sudo tail -3 /opt/traefik/logs/access.log | jq .
# A test request should show up in the log within a second (with bufferingSize set, only once the buffer flushes)
curl -sI https://app.example.com/ >/dev/null
sudo tail -1 /opt/traefik/logs/access.log | jq '{ts: .StartUTC, code: .DownstreamStatus, dur_ms: (.Duration/1000000)}'

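The curl checks above only prove that /metrics answers on loopback. To confirm it is not reachable from outside as well, inspect the listener with ss -ltn | grep 9100: a bind on 0.0.0.0, [::] or * instead of 127.0.0.1 means the metrics entrypoint is on the wrong interface. If the access log is in common (CLF) format rather than JSON, Loki or any other structured log store cannot parse its fields.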
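
For a scripted version of the format check, the following is a minimal sketch. It assumes a JSON record is recognizable by its DownstreamStatus field (the field used in the jq command above); the sample lines are hypothetical, not real Traefik output.

```python
import json


def is_json_access_line(line: str) -> bool:
    """Return True if the line parses as a JSON object with a DownstreamStatus field."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    return isinstance(record, dict) and "DownstreamStatus" in record


def json_fraction(lines: list[str]) -> float:
    """Fraction of non-empty lines that look like JSON access records."""
    non_empty = [ln for ln in lines if ln.strip()]
    if not non_empty:
        return 0.0
    return sum(is_json_access_line(ln) for ln in non_empty) / len(non_empty)


# Hypothetical sample: one JSON record and one CLF-style record.
sample = [
    '{"DownstreamStatus": 200, "Duration": 1500000, "RequestHost": "app.example.com"}',
    '10.0.0.1 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 200 12 "-" "-"',
]
print(json_fraction(sample))
```

Anything below 1.0 on a real log suggests the format was switched at some point or that another process writes to the same file.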

What’s happening

Traefik exposes three observability surfaces and they are configured independently.

Metrics come from a dedicated entrypoint. Best practice is to bind it on 127.0.0.1 (or a Kubernetes ClusterIP without external exposure) so Prometheus scrapes it on the same host, over an SSH tunnel, or in-cluster, never via the public LB. The metric families that matter are traefik_entrypoint_* (per-entrypoint traffic, useful for total RPS and connection counts), traefik_service_* (per-backend, useful for upstream health), traefik_router_* (per-route, often noisy and high-cardinality), and traefik_config_* (reload counters; a count that keeps climbing means something is churning).
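
To see how the families compare on a given install, it helps to count series per family in the scraped text. The sketch below does that for Prometheus text-format output; the sample exposition is made up for illustration, and the helper name is hypothetical.

```python
from collections import Counter

FAMILIES = (
    "traefik_entrypoint_",
    "traefik_service_",
    "traefik_router_",
    "traefik_config_",
)


def series_per_family(exposition: str) -> Counter:
    """Count sample lines per Traefik metric family in Prometheus text format."""
    counts = Counter()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        for prefix in FAMILIES:
            if line.startswith(prefix):
                counts[prefix.rstrip("_")] += 1
                break
    return counts


# Made-up sample; in practice, pass the body returned by the /metrics endpoint.
sample = """\
# TYPE traefik_entrypoint_requests_total counter
traefik_entrypoint_requests_total{code="200",entrypoint="websecure",method="GET",protocol="http"} 42
traefik_service_requests_total{code="200",method="GET",protocol="http",service="app@file"} 40
traefik_service_requests_total{code="502",method="GET",protocol="http",service="app@file"} 2
traefik_config_reloads_total 3
"""
print(series_per_family(sample))
```

A traefik_router count that dwarfs the others is the signal to leave router labels off.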

Access logs are written in either CLF (common) or JSON format, to stdout by default or to a file when filePath is set. Always JSON in production. The default field set includes status code, duration, request path, upstream URL, and client IP. You can keep, drop, or redact individual headers via accessLog.fields.headers.names. The bufferingSize setting holds that many lines in memory before writing, which helps at high RPS but means the most recent lines may not be on disk yet.

Tracing in Traefik v3 is via OpenTelemetry exclusively (the legacy Jaeger/Zipkin/Datadog blocks are gone). A tracing block points at an OTLP collector and emits spans for every request through Traefik, propagating the trace context to upstream services that are themselves instrumented.

The procedure

Add the metrics entrypoint and Prometheus producer to static config.

entryPoints:
  metrics:
    address: "127.0.0.1:9100"
metrics:
  prometheus:
    entryPoint: metrics
    addEntryPointsLabels: true
    addServicesLabels: true
    addRoutersLabels: false  # high cardinality if many routers
    buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

Set addRoutersLabels: true only if you have fewer than a hundred routers; otherwise the histogram series count balloons.
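
A rough way to see why the count balloons is to multiply it out. This sketch assumes each router histogram is split by a handful of (code, method, protocol) combinations; the figure of 6 combinations is an illustrative guess, not a measured value.

```python
def histogram_series(routers: int, label_combos: int, buckets: int) -> int:
    """Upper-bound series count for one per-router duration histogram.

    Each label combination yields one series per configured bucket,
    plus the +Inf bucket, plus the _sum and _count series.
    """
    per_combo = buckets + 1 + 2
    return routers * label_combos * per_combo


buckets = 8  # length of the buckets list in the config above
# Assumed: 6 observed (code, method, protocol) combinations per router.
for routers in (20, 100, 1000):
    print(routers, histogram_series(routers, label_combos=6, buckets=buckets))
```

Under these assumptions a hundred routers already produce several thousand series for a single histogram, before any counters are added.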

Configure JSON access log with sensible header handling.

accessLog:
  filePath: /opt/traefik/logs/access.log  # same path the verify commands and promtail read
  format: json
  bufferingSize: 100
  fields:
    defaultMode: keep
    headers:
      defaultMode: keep
      names:
        Authorization: drop
        Cookie: redact
        X-Forwarded-For: keep
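
The effect of the three header modes can be sketched in a few lines. This is an illustration of the keep/drop/redact semantics, not Traefik's implementation; the case-insensitive name matching and the function name are assumptions of the sketch.

```python
REDACTED = "REDACTED"


def apply_header_policy(
    headers: dict[str, str], default_mode: str, names: dict[str, str]
) -> dict[str, str]:
    """Return the headers as they would be logged under keep/drop/redact modes."""
    # Assumption: header names are matched case-insensitively.
    overrides = {name.lower(): mode for name, mode in names.items()}
    logged = {}
    for name, value in headers.items():
        mode = overrides.get(name.lower(), default_mode)
        if mode == "keep":
            logged[name] = value
        elif mode == "redact":
            logged[name] = REDACTED  # header stays visible, value is masked
        elif mode != "drop":
            raise ValueError(f"unknown mode: {mode}")
    return logged


request_headers = {
    "Authorization": "Bearer secret-token",
    "Cookie": "session=abc123",
    "X-Forwarded-For": "203.0.113.7",
    "User-Agent": "curl/8.5.0",
}
policy = {"Authorization": "drop", "Cookie": "redact", "X-Forwarded-For": "keep"}
print(apply_header_policy(request_headers, "keep", policy))
```

With the policy from the config above, Authorization disappears, Cookie is logged with a masked value, and everything else passes through unchanged.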

Configure OTel tracing. Point it at anything that accepts OTLP: Tempo, a Jaeger collector, the Datadog agent, or an OpenTelemetry Collector.

tracing:
  otlp:
    grpc:
      endpoint: "otel-collector.observability.svc.cluster.local:4317"
      insecure: true
  sampleRate: 0.1   # 10% sampling at high RPS
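
To make the sampleRate setting concrete, here is a sketch of ratio-based sampling keyed on the trace ID, in the style of OpenTelemetry's trace-ID ratio samplers. Whether Traefik uses exactly this scheme is an assumption; the point is that a decision derived from the trace ID is the same at every hop.

```python
import random


def sampled(trace_id_hex: str, sample_rate: float) -> bool:
    """Decide deterministically from the trace ID whether to keep a trace."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be within [0, 1]")
    # Take the low 64 bits of the 128-bit trace ID and reduce them to 63 bits.
    low63 = int(trace_id_hex[-16:], 16) >> 1
    return low63 < int(sample_rate * (1 << 63))


# Random 128-bit trace IDs with a fixed seed so the run is repeatable.
rng = random.Random(0)
trace_ids = [f"{rng.getrandbits(128):032x}" for _ in range(10_000)]
kept = sum(sampled(t, 0.1) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

At a rate of 0.1 roughly one trace in ten is kept, so a 5,000 RPS entrypoint would still emit on the order of 500 traces per second.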

Ship the access log to Loki / Elasticsearch / S3. With Loki and promtail:

# /etc/promtail/config.yml
scrape_configs:
  - job_name: traefik
    static_configs:
      - targets: [localhost]
        labels:
          job: traefik
          __path__: /opt/traefik/logs/access.log
    pipeline_stages:
      - json:
          expressions:
            status: DownstreamStatus
            method: RequestMethod
            host: RequestHost
            duration_ns: Duration  # Traefik logs Duration in nanoseconds
      - labels:
          status:
          method:
          host:
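
What the json and labels stages do to a single line can be sketched as follows. This is an approximation of the pipeline's behaviour rather than promtail code; it relies on Duration being in nanoseconds, as the jq conversion in the verify step implies, and the sample line is hypothetical.

```python
import json

# Mirrors the json stage: extracted name -> source field in the log record.
EXPRESSIONS = {
    "status": "DownstreamStatus",
    "method": "RequestMethod",
    "host": "RequestHost",
    "duration": "Duration",
}
# Mirrors the labels stage: only these extracted values become stream labels.
LABELS = ("status", "method", "host")


def extract(line: str) -> tuple[dict[str, str], float]:
    """Return (labels, duration in milliseconds) for one JSON access log line."""
    record = json.loads(line)
    extracted = {name: record.get(field) for name, field in EXPRESSIONS.items()}
    labels = {name: str(extracted[name]) for name in LABELS}
    duration_ms = extracted["duration"] / 1_000_000  # nanoseconds -> milliseconds
    return labels, duration_ms


# Hypothetical log line with only the fields the pipeline reads.
line = (
    '{"DownstreamStatus": 502, "RequestMethod": "GET", '
    '"RequestHost": "app.example.com", "Duration": 2500000}'
)
print(extract(line))
```

Duration is deliberately left out of the labels: every distinct value would create a new stream, so it belongs in the log line and is parsed at query time.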

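Add the Prometheus scrape, either against the host’s loopback address (Prometheus on the same host, or reaching the port through a tunnel) or in-cluster via a ServiceMonitor. Note that 9100 is also node_exporter’s default port, so move one of the two if both run on the same host.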
```
# prometheus scrape config
- job_name: traefik
  static_configs:
    - targets: ['127.0.0.1:9100']
  scrape_interval: 15s
```

Build the four alerts that actually fire usefully. Adjust thresholds for your traffic shape.

groups:
  - name: traefik
    rules:
      - alert: TraefikDown
        expr: up{job="traefik"} == 0
        for: 2m
      - alert: Traefik5xxRate
        expr: |
          sum(rate(traefik_service_requests_total{code=~"5.."}[5m]))
            / sum(rate(traefik_service_requests_total[5m])) > 0.02
        for: 5m
      - alert: TraefikLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum by (le, service) (rate(traefik_service_request_duration_seconds_bucket[5m]))
          ) > 1.0
        for: 10m
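      # traefik_config_reloads_failure_total is documented for Traefik v2; confirm your
      # version still exports it (grep the /metrics output), or this alert never fires.
      - alert: TraefikConfigReloadFailures
        expr: increase(traefik_config_reloads_failure_total[10m]) > 0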
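
The latency alert depends on how histogram_quantile turns bucket counts into a number. The sketch below approximates the linear interpolation Prometheus documents for that function; it is not Prometheus source code, and the bucket counts are invented for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The highest bucket must be the +Inf bucket, as in Prometheus histograms.
    """
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies beyond the last finite bucket: report that bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Invented cumulative counts over the bucket bounds from the metrics config.
observed = [
    (0.05, 400), (0.1, 700), (0.25, 900), (0.5, 960), (1, 985),
    (2.5, 995), (5, 999), (10, 1000), (math.inf, 1000),
]
p99 = histogram_quantile(0.99, observed)
print(f"p99 = {p99:.2f}s, alert condition met: {p99 > 1.0}")
```

The estimate can only be as precise as the bucket layout: with bounds at 1 and 2.5, any p99 in that range is interpolated, which is why the bucket list should bracket the alert threshold closely.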

Operational notes

The traefik_router_* series is high-cardinality — every router becomes a label value. Disable router labels on installs with many short-lived routes, or you will OOM Prometheus.
JSON access log fields include RouterName, ServiceName, RequestAddr, OriginContentSize, DownstreamContentSize — all useful for traffic analysis; do not strip them.
Cookie: redact keeps the Cookie header in the log but replaces its value with REDACTED, which is enough for “did a cookie come in at all” debugging without leaking the token.
TLS-encrypted access from Prometheus to the metrics endpoint is overkill for 127.0.0.1, but if you must scrape across the network, expose metrics behind the regular dashboard auth chain.
Tracing at 100% sample rate at high RPS costs both Traefik CPU and collector ingestion — start at 0.05-0.10 and only increase for incident reproduction.

Observability is what turns a Traefik install from “it serves traffic” into “we can answer why the p99 spiked at 14:03”. For the operational view of these signals — the dashboards that show whether the install is healthy at a glance, the alerts that go to a real on-call — we wire and own them as part of managed operations. The companion dashboard-side hardening lives in traefik-dashboard-secure.