HAProxy can saturate a 100 Gbps NIC on commodity hardware when tuned correctly. Out of the box, the package defaults are conservative — fine for a few thousand requests per second, suboptimal at 50k+. This article covers the production knobs we touch: thread count, CPU pinning, maxconn, kernel sysctls, TLS session caching, and the measurement discipline that decides which knobs to turn before turning them.

How to verify

Before tuning, measure. The current ceiling is not what the docs say — it is what your hardware reports under load:

echo "show info" | sudo socat /run/haproxy/admin.sock - | grep -E 'Nbthread|Maxconn|CurrConns|CurrSslConns|Process_num'
echo "show stat" | sudo socat /run/haproxy/admin.sock - | head -2
ss -s
top -H -p "$(pgrep -d, -f 'haproxy.*-Ws')"
sysctl net.core.somaxconn fs.file-max

show info exposes process limits and current concurrency. ss -s shows global socket counts; if the timewait figure is in the tens of thousands and climbing, suspect ephemeral-port pressure on connections to backends, since the port range holds at most about 64,000 ports per destination. top -H shows per-thread CPU: if one thread is at 100% while the others idle, you have a pinning or thread-distribution issue.
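To make the TIME_WAIT check repeatable, a small helper can pull the number out of ss -s. This is a sketch only: the function name timewait_count is ours, and it assumes the summary line contains the word timewait followed by a count, which is how iproute2 prints it.

```shell
# timewait_count: read `ss -s` output on stdin and print the TIME_WAIT count.
# Assumes a summary line such as:
#   TCP:   5120 (estab 900, closed 4100, orphaned 0, timewait 4096)
# Some older iproute2 builds print "timewait 4096/0"; only the first number is kept.
timewait_count() {
  sed -n 's/.*timewait \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Usage on a live host (not run here):
#   ss -s | timewait_count
#   cat /proc/sys/net/ipv4/ip_local_port_range   # low and high port, for comparison
```

Comparing the printed count with the size of the local port range gives a quick sense of how close outbound connections are to running out of source ports.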

What’s happening

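HAProxy has supported threads since 1.8, and since 2.0 threading is on by default: one process, N threads, all sharing listener sockets. Each connection is handled by one thread for its lifetime. By default nbthread matches the number of CPUs the process is allowed to run on. On multi-node NUMA machines, HAProxy 2.4 and later bind to the CPUs of a single node by default unless you set affinity yourself, so spreading threads across nodes requires explicit cpu-map configuration.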

The major performance levers:

Thread count (nbthread): more threads let HAProxy process more connections in parallel, up to the point where contention and context-switch overhead exceed the throughput gain. Typically equals CPU count up to 8-16; beyond that, returns diminish unless you also pin.
CPU pinning (cpu-map): pin threads to specific cores. Stops the kernel from moving them across NUMA nodes and improves cache locality.
Maxconn: total concurrent connections. Bounded by fs.file-max, ulimit -n, and physical memory.
Kernel sysctls: somaxconn, tcp_max_syn_backlog, ephemeral port range, tcp_fin_timeout, tcp_tw_reuse.
TLS session caching: tune.ssl.cachesize, tune.ssl.lifetime, ticket keys. Skip the full handshake for returning clients.
HTTP/2 multiplexing: one TCP connection from the client carries many requests. Reduces connection count but increases per-connection CPU.

The discipline: tune one variable at a time, measure, document. We have seen production HAProxy degraded by a “tuning” that turned on every flag from a blog post.

The procedure

Match nbthread to physical cores. Default is fine if your box is dedicated to HAProxy:
```
global
    nbthread 8
```
On a shared box (HAProxy + some other workload), set nbthread below CPU count so the kernel has cores for other things. lscpu tells you what is available.
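One way to get the physical core count this step refers to is to count unique core/socket pairs in lscpu's parsable output. A minimal sketch; physical_cores is a name chosen here for illustration, not an HAProxy or util-linux command.

```shell
# physical_cores: read `lscpu -p=CORE,SOCKET` output on stdin and print the
# number of distinct (core, socket) pairs, so hyper-thread siblings that
# share a core are counted once.
physical_cores() {
  grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# Usage on a live host (not run here):
#   lscpu -p=CORE,SOCKET | physical_cores
```

The result is a starting value for nbthread on a dedicated box; on a shared box, subtract whatever the other workload needs.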
Pin threads to cores. Especially important on NUMA hardware. The pattern is cpu-map auto:1/1-N 0-(N-1):
```
global
    nbthread 8
    cpu-map auto:1/1-8 0-7
```
This pins each thread (numbered 1 to 8) to CPU cores 0 to 7. For NUMA, use numactl --hardware to see node topology and pin threads to a single node:
```
cpu-map auto:1/1-4 0-3
cpu-map auto:1/5-8 8-11
```
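(This assumes CPUs 0-3 and 8-11 sit on the same NUMA node, for example when 8-11 are the hyper-thread siblings of 0-3. CPU numbering varies by machine, so check numactl --hardware or lscpu before copying these ranges.)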
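Because CPU numbering differs between machines, it helps to list which logical CPUs belong to a node before writing cpu-map lines. The helper below is a sketch under the assumption that lscpu's parsable output with the CPU, CORE, and NODE columns is available; cpus_on_node is a hypothetical name.

```shell
# cpus_on_node NODE: read `lscpu -p=CPU,CORE,NODE` output on stdin and print
# the logical CPU ids that belong to NODE as a comma-separated list.
cpus_on_node() {
  awk -F, -v node="$1" '
    /^#/ { next }                                  # skip lscpu header comments
    $3 == node { out = out (n++ ? "," : "") $1 }   # append matching CPU id
    END { print out }
  '
}

# Usage on a live host (not run here):
#   lscpu -p=CPU,CORE,NODE | cpus_on_node 0
```

The printed ids are the candidates for the right-hand side of a cpu-map line when all threads should stay on that node.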
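Set maxconn intentionally. A small maxconn in defaults (for example maxconn 100) caps every frontend that inherits it and is too low for production; when the global maxconn is unset, HAProxy 2.0 and later derive it from ulimit -n. Pick a number explicitly and keep it consistent with file-descriptor limits: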
```
global
    maxconn 200000
```
```
sudo systemctl edit haproxy
# add:
# [Service]
# LimitNOFILE=1048576
sudo systemctl daemon-reload
sudo systemctl restart haproxy
```
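A proxied connection uses at least 2 FDs (client side plus backend side), and the process needs more for listeners, health checks, logging, and the stats socket. 200,000 connections therefore need at least 400,000 FDs plus that overhead; setting the limit to about 1M leaves headroom.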
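The file-descriptor arithmetic can be written down as a small helper. This is a rough lower bound, not HAProxy's own calculation: fd_budget is a hypothetical name, and the default overhead of 1000 descriptors is an assumption to adjust for your number of listeners and health checks.

```shell
# fd_budget MAXCONN [OVERHEAD]: rough lower bound on file descriptors needed.
# Two FDs per proxied connection (client side + backend side), plus an
# assumed fixed allowance for listeners, checks, logs, and the stats socket.
fd_budget() {
  maxconn=$1
  overhead=${2:-1000}
  echo $(( maxconn * 2 + overhead ))
}

# Usage (not run here):
#   fd_budget 200000
#   grep 'Max open files' /proc/"$(pgrep -o haproxy)"/limits   # limit of the oldest haproxy process
```

If the limit reported for the running process is below the budget, raise LimitNOFILE before raising maxconn.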

Kernel sysctls. Add to /etc/sysctl.d/99-haproxy.conf:

net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_tw_reuse = 1
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_slow_start_after_idle = 0
fs.file-max = 2000000

sudo sysctl -p /etc/sysctl.d/99-haproxy.conf

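tcp_tw_reuse = 1 matters most for high-rate outbound connections (HAProxy connecting to backends): it lets the kernel reuse a TIME_WAIT socket for a new outgoing connection when TCP timestamps make that safe. It has no effect on inbound connections and depends on net.ipv4.tcp_timestamps being enabled, which is the Linux default.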
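To see why TIME_WAIT reuse matters, it helps to estimate how many new connections per second a single source IP can sustain toward one backend address and port without it. The sketch below assumes Linux's fixed 60-second TIME_WAIT and that every closed connection leaves its source port in that state; tw_rate_ceiling is a name invented for this example.

```shell
# tw_rate_ceiling LOW HIGH [TW_SECONDS]: approximate sustained new-connection
# rate (per second) toward one destination ip:port before the ephemeral port
# range is exhausted, when every closed connection holds its port in TIME_WAIT.
tw_rate_ceiling() {
  low=$1
  high=$2
  tw_seconds=${3:-60}   # Linux keeps sockets in TIME_WAIT for 60 seconds
  echo $(( (high - low + 1) / tw_seconds ))
}

# Usage (not run here):
#   tw_rate_ceiling 1024 65535     # range from the sysctl file above
#   tw_rate_ceiling 32768 60999    # common distribution default range
```

With the widened range the ceiling is on the order of a thousand connections per second per destination, which is why busy proxies either enable tcp_tw_reuse or keep backend connections alive.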

TLS session caching. The CPU cost of a full TLS handshake is large; resumption is small. Tune the session cache:
```
global
    tune.ssl.cachesize 100000
    tune.ssl.lifetime 600
    tune.ssl.default-dh-param 2048
```
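100,000 cache blocks at roughly 200 bytes each come to about 20 MB. tune.ssl.cachesize counts blocks, and a session stored with a client certificate takes more than one. The lifetime defaults to 300 seconds (5 minutes); 10 minutes is a reasonable starting point for typical web traffic, to be confirmed against your cache miss rate. tune.ssl.default-dh-param is not a cache setting; it sets the DH parameter size used for DHE key exchange.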
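HTTP/2 tuning. Per-stream concurrency matters. The tune.h2.* settings are global directives, so they go in the global section; the frontend only needs h2 in its ALPN list:
```
global
    tune.h2.max-concurrent-streams 100
    tune.h2.max-frame-size 16384

frontend fe_https
    bind *:443 ssl crt /etc/haproxy/certs/site.pem alpn h2,http/1.1
```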
The defaults are typically fine; raise max-concurrent-streams only if you confirm your clients open more than 100 streams per connection.
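Measure under load. Use a real load generator (wrk, wrk2, hey, vegeta) against a replay or close approximation of production traffic. Watch show info while the test runs:
```
# plain wrk syntax; wrk2 additionally requires a target rate via -R
wrk -t12 -c10000 -d60s --latency https://lb.example.com/
watch -n1 'echo "show info" | sudo socat /run/haproxy/admin.sock - | grep -E "Maxconn|CurrConns|ConnRate|Sess"'
```
CurrConns is the live connection count and Maxconn is the configured ceiling: if CurrConns stays well below Maxconn at peak load, you have headroom; if it pegs at Maxconn, new connections are waiting in the kernel backlog and you have hit a ceiling. MaxConnRate is a different measure, the highest rate of new connections per second observed, and is useful for sizing handshake capacity and backlogs.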
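The headroom check can be reduced to one number by comparing CurrConns with Maxconn from the same show info output. A minimal sketch, assuming the "Name: value" line format that show info prints; conn_usage is a hypothetical helper name.

```shell
# conn_usage: read "show info" output on stdin and print current connections
# as a share of the configured process-wide maxconn.
conn_usage() {
  awk -F': ' '
    $1 == "CurrConns" { cur = $2 }
    $1 == "Maxconn"   { max = $2 }
    END { if (max > 0) printf "%d/%d (%.1f%%)\n", cur, max, 100 * cur / max }
  '
}

# Usage on a live host (not run here):
#   echo "show info" | sudo socat /run/haproxy/admin.sock - | conn_usage
```

Sampling this once per second during a load test shows whether the process approaches its connection ceiling before CPU or latency degrade.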

Common pitfalls

Setting nbthread higher than CPU count produces context-switch storms and lower throughput. Match physical cores; do not over-subscribe.
maxconn in defaults is per-frontend; the global maxconn is the process-wide cap. A defaults maxconn 50000 with global maxconn 30000 means the global cap wins and you waste configured capacity.
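Values set with sysctl -w change only the running kernel and are lost at reboot. Files such as /etc/sysctl.conf and those under /etc/sysctl.d/ are applied at boot, and sysctl -p FILE applies one immediately. Keep the settings in a file under /etc/sysctl.d/ so they survive reboot, and run sysctl -p on it so they are live now.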
TLS session cache stats are visible via show info: SslCacheLookups and SslCacheMisses. If misses approach lookups, the cache is too small or the lifetime too short.
With tcp_tw_reuse = 0, HAProxy connecting to a backend on a private IP at sustained rates above roughly 5k connections/sec exhausts ephemeral ports, because each closed connection holds its source port in TIME_WAIT. Setting it to 1 addresses this; the older tcp_tw_recycle is dangerous behind NAT and was removed in Linux 4.12.
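The TLS session cache counters mentioned above can be turned into a miss ratio with a few lines of awk. This is a sketch that assumes show info prints SslCacheLookups and SslCacheMisses as "Name: value" lines; ssl_cache_miss_ratio is a name chosen for this example.

```shell
# ssl_cache_miss_ratio: read "show info" output on stdin and print session
# cache misses as a percentage of lookups, or "n/a" when there were none.
ssl_cache_miss_ratio() {
  awk -F': ' '
    $1 == "SslCacheLookups" { lookups = $2 }
    $1 == "SslCacheMisses"  { misses = $2 }
    END {
      if (lookups > 0) printf "%.1f%%\n", 100 * misses / lookups
      else print "n/a"
    }
  '
}

# Usage on a live host (not run here):
#   echo "show info" | sudo socat /run/haproxy/admin.sock - | ssl_cache_miss_ratio
```

A ratio that stays high after the cache has had time to warm up suggests raising tune.ssl.cachesize or tune.ssl.lifetime, one at a time, and measuring again.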

Stack Harbor establishes a tuning baseline per environment — measured under simulated production traffic — and tracks the knobs in a per-client document. We re-measure after major HAProxy upgrades because tuning that worked on 2.4 may be wrong for 3.0. This is part of the operational layer we maintain for Managed Operations clients.