HAProxy can saturate a 100 Gbps NIC on commodity hardware when tuned correctly. Out of the box, the package defaults are conservative — fine for a few thousand requests per second, suboptimal at 50k+. This article covers the production knobs we touch: thread count, CPU pinning, maxconn, kernel sysctls, TLS session caching, and the measurement discipline that decides which knobs to turn before turning them.
How to verify
Before tuning, measure. The current ceiling is not what the docs say — it is what your hardware reports under load:
echo "show info" | sudo socat /run/haproxy/admin.sock - | grep -E 'Nbthread|Maxconn|CurrConns|CurrSslConns|Process_num'
echo "show stat" | sudo socat /run/haproxy/admin.sock - | head -2
ss -s
top -H -p "$(pgrep -d, -f 'haproxy.*-Ws')"
sysctl net.core.somaxconn fs.file-max
show info exposes process limits and current concurrency. ss -s shows global socket counts; if the timewait figure on its TCP line runs to tens of thousands or more and most of those sockets face your backends, you may have a kernel ephemeral-port pressure problem. top -H shows per-thread CPU: if one thread is at 100% and the others are idle, you have a pinning or thread-distribution issue.
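As a rough sketch of turning that output into a single number, the helper below reads show info text on standard input and reports CurrConns as a percentage of Maxconn. The function name conn_usage and the sample values are illustrative assumptions, not part of HAProxy.

```shell
# conn_usage: reads "show info" output on stdin and prints how much of
# Maxconn is currently in use. Exact key matching skips Hard_maxconn.
conn_usage() {
    awk -F': ' '
        $1 == "Maxconn"   { max = $2 }
        $1 == "CurrConns" { cur = $2 }
        END {
            if (max > 0)
                printf "CurrConns=%d Maxconn=%d usage=%.1f%%\n", cur, max, 100 * cur / max
            else
                print "Maxconn not found in input"
        }'
}

# Made-up sample values standing in for real socket output.
printf 'Maxconn: 200000\nCurrConns: 50000\n' | conn_usage

# Against a live process (socket path as used in this article):
#   echo "show info" | sudo socat /run/haproxy/admin.sock - | conn_usage
```

Recording this figure before and after each change gives you one comparable number per test run.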
What’s happening
HAProxy 2.0+ is multi-threaded by default. One process, N threads, all sharing listener sockets. Each connection is handled by one thread for its lifetime. The defaults: nbthread matches the detected CPU count. On a multi-socket machine, 2.4 and later bind the threads to the CPUs of a single NUMA node unless you set nbthread or cpu-map yourself.
The major performance levers:
- Thread count (nbthread): more threads = more concurrent connections, up to the point where context-switch overhead exceeds throughput gain. Typically equals CPU count up to 8-16; beyond that, returns diminish unless you also pin.
- CPU pinning (cpu-map): pin threads to specific cores. Stops the kernel from moving them across NUMA nodes and improves cache locality.
- Maxconn: total concurrent connections. Bounded by fs.file-max, ulimit -n, and physical memory.
- Kernel sysctls: somaxconn, tcp_max_syn_backlog, ephemeral port range, tcp_fin_timeout, tcp_tw_reuse.
- TLS session caching: tune.ssl.cachesize, tune.ssl.lifetime, ticket keys. Skip the full handshake for returning clients.
- HTTP/2 multiplexing: one TCP connection from the client carries many requests. Reduces connection count but increases per-connection CPU.
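To make the first two levers concrete, here is a minimal sketch that counts physical cores from lscpu -p=CORE,SOCKET output and prints a matching global section. The function names and the sample topology are hypothetical. The sketch also assumes that CPUs 0 to N-1 are the ones you want to pin to, which you should confirm against your own topology.

```shell
# physical_cores: reads `lscpu -p=CORE,SOCKET` output on stdin and counts
# unique (core, socket) pairs, so hyper-thread siblings are counted once.
physical_cores() {
    grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# gen_global N: prints a global section with N threads pinned one-to-one
# to CPUs 0..N-1 (assumption: those are the CPUs you want).
gen_global() {
    printf 'global\n    nbthread %d\n    cpu-map auto:1/1-%d 0-%d\n' \
        "$1" "$1" "$(( $1 - 1 ))"
}

# Made-up lscpu output for a 4-core, 8-thread, single-socket machine.
sample='# CORE,SOCKET
0,0
1,0
2,0
3,0
0,0
1,0
2,0
3,0'

n=$(printf '%s\n' "$sample" | physical_cores)
gen_global "$n"

# On a real host:
#   gen_global "$(lscpu -p=CORE,SOCKET | physical_cores)"
```

Generating the two directives from one number keeps nbthread and the cpu-map range from drifting apart when the hardware changes.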
The discipline: tune one variable at a time, measure, document. We have seen production HAProxy degraded by a “tuning” that turned on every flag from a blog post.
The procedure
- Match nbthread to physical cores. The default is fine if your box is dedicated to HAProxy:

      global
          nbthread 8

  On a shared box (HAProxy plus some other workload), set nbthread below the CPU count so the kernel has cores for other things. lscpu tells you what is available.

- Pin threads to cores. Especially important on NUMA hardware. The pattern is cpu-map auto:1/1-N 0-(N-1):

      global
          nbthread 8
          cpu-map auto:1/1-8 0-7

  This pins threads 1 to 8, one each, to CPUs 0 to 7. For NUMA, use numactl --hardware to see the node topology and pin the threads to a single node:

      cpu-map auto:1/1-4 0-3
      cpu-map auto:1/5-8 8-11

  (This assumes CPUs 0-3 and 8-11 sit on the same NUMA node, as they do when 8-11 are the hyper-thread siblings of 0-3. Check your own topology before copying the numbers.)
- Set maxconn intentionally. A maxconn 100 in defaults is too low for this kind of load, and the global value, when unset, is derived from ulimit -n. Pick a number consistent with your file-descriptor limits:

      global
          maxconn 200000

  Then raise the service's descriptor limit to match:

      sudo systemctl edit haproxy
      # add:
      # [Service]
      # LimitNOFILE=1048576
      sudo systemctl daemon-reload
      sudo systemctl restart haproxy

  Each connection consumes at least 2 FDs (client + backend), plus a few for logging and sockets. 200,000 connections therefore need 400,000 FDs for the connection pairs alone; budget roughly 500,000 with overhead, and bump the limit to 1M to leave headroom.
- Kernel sysctls. Add to /etc/sysctl.d/99-haproxy.conf:

      net.core.somaxconn = 65535
      net.ipv4.tcp_max_syn_backlog = 65535
      net.ipv4.ip_local_port_range = 1024 65535
      net.ipv4.tcp_fin_timeout = 15
      net.ipv4.tcp_tw_reuse = 1
      net.core.netdev_max_backlog = 65535
      net.ipv4.tcp_max_tw_buckets = 2000000
      net.ipv4.tcp_slow_start_after_idle = 0
      fs.file-max = 2000000

  Then load the file:

      sudo sysctl -p /etc/sysctl.d/99-haproxy.conf

  tcp_tw_reuse = 1 is the most important of these for high-concurrency outbound traffic (HAProxy connecting to backends): it lets the kernel safely reuse TIME_WAIT sockets for new outgoing connections.

- TLS session caching. A full TLS handshake is expensive in CPU; resumption is cheap. Tune the session cache:

      global
          tune.ssl.cachesize 100000
          tune.ssl.lifetime 600
          tune.ssl.default-dh-param 2048

  100k cached sessions at ~200 bytes each is ~20 MB. The lifetime defaults to 5 minutes (300 seconds); 10 minutes is the sweet spot for typical web traffic.
- HTTP/2 tuning. Per-stream concurrency matters. The tune.h2.* settings belong in the global section, not in a frontend:

      global
          tune.h2.max-concurrent-streams 100
          tune.h2.max-frame-size 16384

      frontend fe_https
          bind *:443 ssl crt /etc/haproxy/certs/site.pem alpn h2,http/1.1

  The defaults are typically fine; raise max-concurrent-streams only if you confirm your clients open more than 100 streams per connection.

- Measure under load. Use a real load generator (wrk2, hey, vegeta) against a copy of production traffic. Watch show info in a second terminal while the test runs:

      wrk -t12 -c10000 -d60s --latency https://lb.example.com/
      watch -n1 'echo "show info" | sudo socat /run/haproxy/admin.sock - | grep -E "Maxconn|CurrConns|MaxConnRate|Sess"'

  MaxConnRate is the high-water mark for new connections per second, so it tells you about accept rate, not concurrency. For the concurrency ceiling, compare CurrConns with Maxconn: if CurrConns stays well below Maxconn, you have headroom; if it pegs at Maxconn, you have hit a ceiling.
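The sizing arithmetic in the maxconn and TLS cache steps can be sketched as a few shell helpers. The function names and the flat 1000-descriptor overhead are assumptions for illustration; the 2-FDs-per-connection and 200-bytes-per-session figures are the ones used above.

```shell
# fd_needed MAXCONN [OVERHEAD]: two descriptors per proxied connection
# (client side + backend side) plus a fixed allowance for listeners, logs
# and the stats socket. The default of 1000 is an illustrative assumption.
fd_needed() {
    echo $(( 2 * $1 + ${2:-1000} ))
}

# fd_check MAXCONN LIMIT: compares the requirement with a descriptor limit
# such as the LimitNOFILE value in the systemd override.
fd_check() {
    need=$(fd_needed "$1")
    if [ "$need" -le "$2" ]; then
        echo "ok: need $need of $2"
    else
        echo "too low: need $need, limit $2"
    fi
}

# ssl_cache_mb ENTRIES: approximate session cache memory in MB,
# at roughly 200 bytes per cached entry.
ssl_cache_mb() {
    echo $(( $1 * 200 / 1000000 ))
}

fd_check 200000 1048576   # prints: ok: need 401000 of 1048576
fd_check 200000 65536     # prints: too low: need 401000, limit 65536
ssl_cache_mb 100000       # prints: 20

# The limit actually applied to a running process can be read with:
#   grep 'Max open files' /proc/<haproxy-pid>/limits
```

Running the check whenever maxconn changes catches the case where the configuration was raised but the systemd override was not.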
Common pitfalls
- Setting nbthread higher than the CPU count produces context-switch storms and lower throughput. Match physical cores; do not over-subscribe.
- maxconn in defaults is per-frontend; the global maxconn is the process-wide cap. A defaults maxconn 50000 with global maxconn 30000 means the global cap wins and you waste configured capacity.
- Sysctl files (/etc/sysctl.conf and anything under /etc/sysctl.d/) are read only at boot, so editing one changes nothing on the running kernel until you run sysctl -p on it. The reverse also holds: a value set by hand with sysctl -w is live immediately but is lost at reboot unless you also persist it in a file under /etc/sysctl.d/.
- TLS session cache stats are visible via show info: SslCacheLookups and SslCacheMisses. If misses approach lookups, the cache is too small or the lifetime too short.
- Connecting HAProxy → backend on a private IP with tcp_tw_reuse = 0 exhausts ephemeral ports under sustained connection rates >5k/sec. Setting it to 1 fixes this; the older tcp_tw_recycle is dangerous behind NAT and was removed in Linux 4.12.
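Two of these pitfalls reduce to simple arithmetic, sketched below under stated assumptions. The port estimate assumes every closed backend connection leaves a local TIME_WAIT entry for 60 seconds (the fixed Linux duration) and that all traffic goes from one source IP to one backend address and port. The function names are hypothetical.

```shell
# max_rate_without_reuse LOW HIGH: rough ceiling on new outbound connections
# per second to a single backend ip:port when TIME_WAIT sockets are not
# reused. Each port is assumed to stay unavailable for 60 seconds.
max_rate_without_reuse() {
    echo $(( ($2 - $1 + 1) / 60 ))
}

# ssl_cache_miss_pct: reads "show info" output on stdin and prints
# SslCacheMisses as a percentage of SslCacheLookups.
ssl_cache_miss_pct() {
    awk -F': ' '
        $1 == "SslCacheLookups" { lookups = $2 }
        $1 == "SslCacheMisses"  { misses = $2 }
        END {
            if (lookups > 0) printf "%.1f\n", 100 * misses / lookups
            else             print "n/a"
        }'
}

max_rate_without_reuse 1024 65535    # prints 1075
max_rate_without_reuse 32768 60999   # prints 470 (a commonly seen default range)

# Made-up counters standing in for real socket output.
printf 'SslCacheLookups: 4000\nSslCacheMisses: 1000\n' | ssl_cache_miss_pct   # prints 25.0

# On a real host:
#   max_rate_without_reuse $(sysctl -n net.ipv4.ip_local_port_range)
#   echo "show info" | sudo socat /run/haproxy/admin.sock - | ssl_cache_miss_pct
```

The per-destination ceiling scales with the number of distinct backend addresses, so a farm with several servers tolerates a proportionally higher aggregate rate before ports run out.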
Stack Harbor establishes a tuning baseline per environment — measured under simulated production traffic — and tracks the knobs in a per-client document. We re-measure after major HAProxy upgrades because tuning that worked on 2.4 may be wrong for 3.0. This is part of the operational layer we maintain for Managed Operations clients.