Traefik retry middleware — safe retries, idempotency, and what you should not retry

Configure the Traefik retry middleware: attempts, initial interval, and the HTTP methods where retries help versus where they cause duplicate writes.

The Traefik retry middleware retries failed backend calls at the edge, smoothing over single-replica blips without the client ever seeing them. It is the right primitive for read-heavy traffic against fleets that occasionally drop a connection. It is the wrong primitive — actively dangerous — for any request whose handler is not idempotent. A retried POST is a duplicate write. This article covers the configuration shape, what counts as a retry-worthy failure, and the safe vs unsafe patterns.

How to verify

# Check the middleware is wired
curl -s http://127.0.0.1:8082/api/http/middlewares | jq '.[] | select(.type=="retry")'
# Watch retries happen on a flaky backend
docker logs traefik 2>&1 | grep -i retry | tail
# Prometheus exposes retry counts per service
curl -s http://127.0.0.1:9100/metrics | grep -E 'traefik_service_retries_total'
# A request that ultimately succeeds after retries should show one entry in the access log
# with RetryAttempts > 0
sudo tail -1 /opt/traefik/logs/access.log | jq '{status: .DownstreamStatus, attempts: .RetryAttempts}'
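
To look beyond the last log line, the same two fields can be tallied over the whole file. The following is a minimal sketch, assuming the access log is written as JSON with one object per line and carries the `DownstreamStatus` and `RetryAttempts` fields used above. The `summarize_retries` helper and its output keys are hypothetical names chosen for illustration.

```python
import json
from typing import Dict, Iterable


def summarize_retries(lines: Iterable[str]) -> Dict[str, int]:
    """Tally retry activity from JSON access-log lines (hypothetical helper)."""
    summary = {
        "requests": 0,             # parsed log entries
        "retried": 0,              # entries with RetryAttempts > 0
        "retry_attempts": 0,       # sum of RetryAttempts over all entries
        "retried_then_failed": 0,  # retried entries that still ended in a 5xx
    }
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        if not isinstance(entry, dict):
            continue
        attempts = int(entry.get("RetryAttempts") or 0)
        status = int(entry.get("DownstreamStatus") or 0)
        summary["requests"] += 1
        if attempts > 0:
            summary["retried"] += 1
            summary["retry_attempts"] += attempts
            if status >= 500:
                summary["retried_then_failed"] += 1
    return summary
```

Calling `summarize_retries(open("/opt/traefik/logs/access.log"))` returns the four counters. A high `retried_then_failed` count relative to `retried` suggests the retries are not finding a healthy replica.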

What’s happening

The retry middleware sits between the router and the service. When a backend call fails with a network error (connection refused, connection reset, read timeout, EOF before any response), Traefik picks a different backend from the service’s load-balancer pool (if available) and retries. The two knobs are the total number of attempts and the initial interval between them.

Critically, Traefik retries only on transport failures. A backend that returns HTTP 500 is NOT retried; the 500 is a real response that completed the request cycle. This is the right default for safety — retrying a 500 in front of an idempotent endpoint helps; retrying it in front of a non-idempotent one corrupts data. If you want status-code-based retries you have to do it differently (a plugin, or backend-side handling).
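
One form the handling outside the middleware can take is a small wrapper in the calling service that retries on 5xx, but only for methods the article treats as safe. This is a sketch under stated assumptions, not a Traefik feature: `call_with_status_retry`, the `send` callable, and the doubling backoff are all hypothetical. The method set mirrors the read-side router used in this article.

```python
import time
from typing import Callable, Tuple

# Methods treated as safe to retry, mirroring the read-side router in this article.
RETRYABLE_METHODS = {"GET", "HEAD", "OPTIONS"}


def call_with_status_retry(
    method: str,
    send: Callable[[], Tuple[int, str]],
    attempts: int = 3,
    initial_interval_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() and retry on 5xx responses, only for retryable methods.

    send() is a hypothetical callable returning (status_code, body).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    retryable = method.upper() in RETRYABLE_METHODS
    wait = initial_interval_s
    result = send()
    for _ in range(attempts - 1):
        status, _body = result
        if status < 500 or not retryable:
            break  # success, client error, or a method we refuse to retry
        sleep(wait)
        wait *= 2  # simple doubling backoff (an assumption for this sketch)
        result = send()
    return result
```

A write request passes through this wrapper exactly once regardless of the status it gets back, which keeps the duplicate-write risk out of the retry path.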

The retry path picks a different backend on each attempt, so a fleet of three replicas with one dead and two healthy will succeed on attempt 2. If all replicas are dead, the retry middleware does not help: you get a 502 after burning the retry budget. The retry budget is per-request and there is no global rate limit on retries, so a flaky backend that returns transport errors to every request sees up to 3x or 5x the backend connection attempts (one per configured attempt).
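
The amplification is simple arithmetic. The sketch below assumes the pessimistic case where every failing request exhausts all of its attempts; `backend_attempt_rate` and its parameters are hypothetical names for illustration.

```python
def backend_attempt_rate(request_rate: float, failure_fraction: float, attempts: int) -> float:
    """Worst-case backend attempts per second.

    Assumes a failing request uses every configured attempt and a
    healthy request uses exactly one.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    if not 0.0 <= failure_fraction <= 1.0:
        raise ValueError("failure_fraction must be between 0 and 1")
    healthy = request_rate * (1.0 - failure_fraction)
    failing = request_rate * failure_fraction * attempts
    return healthy + failing
```

At 100 requests per second with every request failing and `attempts: 3`, the backends see 300 connection attempts per second. With half the requests failing and five attempts, the figure is also 300.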

The procedure

  1. Wire the middleware. Three attempts with a 100ms initial interval is the working default; the interval grows exponentially.

    http:
      middlewares:
        retry-safe:
          retry:
            attempts: 3
            initialInterval: 100ms
      routers:
        api-reads:
          rule: "Host(`api.example.com`) && Method(`GET`, `HEAD`, `OPTIONS`)"
          entryPoints: [websecure]
          service: api-backend
          middlewares: [retry-safe]
  2. Apply selectively by HTTP method. The Method(...) matcher on the router rule keeps retry off the write-side path entirely.

    http:
      routers:
        api-reads:
          rule: "Host(`api.example.com`) && Method(`GET`, `HEAD`, `OPTIONS`)"
          entryPoints: [websecure]
          service: api-backend
          middlewares: [retry-safe]
        api-writes:
          rule: "Host(`api.example.com`) && Method(`POST`, `PUT`, `PATCH`, `DELETE`)"
          entryPoints: [websecure]
          service: api-backend
          # no retry on writes
  3. For backends that expose explicit idempotency (an Idempotency-Key header), you can be braver and retry writes too — but only when the backend actually deduplicates by key. The retry middleware does not inspect headers; you trust the backend.

    http:
      middlewares:
        retry-idempotent:
          retry:
            attempts: 2
            initialInterval: 50ms
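      routers:
        api-idempotent:
          # HeaderRegexp matches any non-empty key; Header() only matches a literal value.
          # (Traefik v2 rule syntax spells this HeadersRegexp.)
          rule: "Host(`api.example.com`) && HeaderRegexp(`Idempotency-Key`, `.+`)"
          # Default priority is derived from rule length, so the longer api-writes rule
          # would otherwise take precedence for POSTs that carry the header.
          priority: 1000
          entryPoints: [websecure]
          service: api-backend
          middlewares: [retry-idempotent]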
  4. Set health checks on the service so dead backends are removed from the pool quickly and retries actually find a healthy replica.

    http:
      services:
        api-backend:
          loadBalancer:
            servers:
              - url: "http://10.0.1.10:8080"
              - url: "http://10.0.1.11:8080"
              - url: "http://10.0.1.12:8080"
            healthCheck:
              path: /healthz
              interval: 5s
              timeout: 2s
              scheme: http
  5. Monitor the retry counter. A sudden rise in traefik_service_retries_total against a single service is an early signal of upstream instability — alert on it.

    - alert: TraefikHighRetryRate
      expr: |
        sum by (service) (rate(traefik_service_retries_total[5m])) > 5
      for: 5m
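
Step 3 only works when the backend really deduplicates by key. The sketch below shows the shape of that check, assuming a handler that returns a `(status, body)` pair. `IdempotencyStore` and `run_once` are hypothetical names, and the in-memory dictionary is for illustration only. Because retries land on a different replica, a real deployment would need a store shared by all replicas, such as a database table with a unique constraint on the key.

```python
import threading
from typing import Callable, Dict, Tuple

Response = Tuple[int, str]  # (status code, body)


class IdempotencyStore:
    """In-memory dedup table keyed by Idempotency-Key (illustrative only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._responses: Dict[str, Response] = {}

    def run_once(self, key: str, handler: Callable[[], Response]) -> Response:
        """Run handler for the first request with this key; replay the stored
        response for any later request carrying the same key."""
        # Holding the lock while the handler runs serializes all writes.
        # That keeps the sketch simple but is too coarse for production.
        with self._lock:
            if key in self._responses:
                return self._responses[key]
            response = handler()
            self._responses[key] = response
            return response
```

A retried request with the same key gets the stored response back, and the side effect (the charge, the row insert) runs only once.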

Common pitfalls

  • Retrying POST/PUT/PATCH without idempotency guarantees — a payment request that times out gets billed twice; a user creation gets two rows.
  • High attempts (10+) — every transport error costs 10x the backend connections; the retry middleware can amplify a partial outage into a complete one.
  • Forgetting that retries pick different backends only if the service has multiple servers — a single-backend service retries against the same dead host; you get the same failure 3 times.
  • Combining retry with a tight client-side timeout — the client gives up before Traefik finishes retrying, then retries client-side, multiplying request volume during incidents.
  • Treating retry as a substitute for backend health checks — if your service has no healthCheck:, dead replicas stay in the pool and retries waste cycles on them.
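
The client-timeout pitfall can be checked with a rough budget: the client should wait longer than the worst case Traefik might spend retrying. The sketch below is an estimate under stated assumptions. The per-attempt timeout is a hypothetical parameter standing for however long one failed attempt can take, the backoff multiplier of 1.5 is an assumed default, and jitter is ignored.

```python
from typing import List


def backoff_waits(attempts: int, initial_interval_s: float, multiplier: float = 1.5) -> List[float]:
    """Pauses between attempts, ignoring jitter. The multiplier is an assumption."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # There is one pause between each pair of consecutive attempts.
    return [initial_interval_s * multiplier ** i for i in range(attempts - 1)]


def worst_case_edge_time(
    attempts: int,
    per_attempt_timeout_s: float,
    initial_interval_s: float,
    multiplier: float = 1.5,
) -> float:
    """Upper bound on time spent at the edge if every attempt times out."""
    waits = backoff_waits(attempts, initial_interval_s, multiplier)
    return attempts * per_attempt_timeout_s + sum(waits)


def client_timeout_is_safe(client_timeout_s: float, edge_time_s: float) -> bool:
    """True when the client waits longer than the edge's worst case."""
    return client_timeout_s > edge_time_s
```

With three attempts, a 2 s per-attempt timeout, and a 100 ms initial interval, the estimate is roughly 6.25 s under these assumptions. A client with a 5 s timeout would give up first and start its own retries.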

Retry is one of the cheapest resilience knobs at the edge: used correctly, end users see fewer transient failures; used incorrectly, it doubles writes and amplifies outages. The discipline of routing read versus write traffic to different routers with different middleware chains is part of how Stack Harbor runs managed operations. For rate limiting as the complementary middleware see traefik-rate-limit-middleware.