A slow TLS handshake jammed our whole outbound queue

The outbound queue is supposed to be flat. On a normal day it sits near zero, with little spikes when a big sender hands us a batch, and then it drains in seconds. The graph is so boring we mostly don’t look at it.

One Tuesday it wasn’t flat. The queue depth started climbing around 9:40, a clean diagonal line going up and to the right, and it didn’t stop. By the time the alert fired we had tens of thousands of messages waiting that should have been long gone. None of them were bouncing. They were just sitting there, deferred, waiting for a delivery slot that never came.

What we saw

The first useful signal was that delivery wasn’t failing, it was slow. Our delivery workers pull a message off the queue, look up the recipient’s MX, open a connection, do the SMTP conversation, and move on. The workers were all busy. CPU was fine, memory was fine, the network was fine. They were busy waiting.

A quick look at what they were waiting on told the story. Almost every worker was stuck mid-delivery to the same destination domain. One inbound server at a single mail provider was accepting our TCP connection, agreeing to start TLS, and then never finishing the handshake. The socket stayed open. Our worker sat there with the connection half-built, holding a delivery slot, waiting for a ServerHello that wasn’t coming.

We thought we had a timeout for this. That was the frustrating part. We had set a generous timeout on the whole delivery attempt and a tighter one on the initial TCP connect. What we did not have was a timeout on the TLS handshake itself. The TCP connect succeeded quickly, which satisfied the connect timeout, and then the handshake hung in a state that the outer timeout was slow to catch because of how the library surfaced it. Each stuck delivery held its worker for far longer than it should have.

Head-of-line blocking, the email edition

Here’s the shape of the problem. We had a fixed pool of delivery workers, shared across every destination. Mail to a thousand different domains all draws from the same pool. When one destination started swallowing connections, every worker that happened to grab a message for that domain got stuck. Those workers were now unavailable for the other 999 domains.

It didn’t take many. With workers getting pinned one by one and not coming back, the pool drained over a few minutes. Once every worker was stuck on the bad destination, nothing else could be delivered at all, even though the rest of the internet was perfectly happy to accept our mail. A single misbehaving server had taken hostage a pool that was meant to be shared.

The messages for that one provider were a small fraction of our volume. They jammed everything anyway.
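To make that concrete, here is a toy model in Python. It is an illustration only: the pool size, the traffic mix, and the function name are made up for the example, healthy deliveries are assumed to finish within one tick, and a message to the hung destination is assumed to pin its worker for good.

```python
def messages_until_pool_is_pinned(workers, bad_every):
    """Toy model: count messages handled before every worker is stuck.

    Healthy deliveries finish within one tick. Every `bad_every`-th message
    goes to the hung destination and pins its worker for good.
    """
    free = workers
    handled = 0
    while free > 0:
        # Each free worker takes the next message off the shared queue.
        batch = range(handled + 1, handled + free + 1)
        pinned = sum(1 for msg in batch if msg % bad_every == 0)
        handled += free
        free -= pinned
    return handled


# 50 workers, 1 message in 100 going to the hung destination.
print(messages_until_pool_is_pinned(workers=50, bad_every=100))  # 5000
```

In this model, 1% of the traffic is enough to pin all 50 workers after 5,000 messages, and nothing is delivered after that.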

Getting delivery moving again

The immediate fix was blunt. We pushed a config change that told delivery to stop attempting that one destination for a while, deferring its mail explicitly so it would retry later. The moment those attempts stopped grabbing workers, the pool freed up and the backlog for everyone else drained in about a minute. The graph went from a steady climb back to its boring flat line.
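The idea behind that kind of hold can be sketched in a few lines of Python. This is a hypothetical illustration, not the actual config mechanism: the `DestinationHolds` class, the `next_action` helper, and the injected clock are all invented for the example.

```python
import time


class DestinationHolds:
    """Temporary 'do not attempt' list, keyed by destination domain."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so the expiry logic can be tested
        self._until = {}      # domain -> time at which the hold expires

    def hold(self, domain, seconds):
        self._until[domain.lower()] = self._clock() + seconds

    def is_held(self, domain):
        key = domain.lower()
        expiry = self._until.get(key)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._until[key]  # hold has lapsed, resume normal delivery
            return False
        return True


def next_action(holds, recipient_domain):
    """Decide before a worker is committed, so held mail never takes a slot."""
    return "defer" if holds.is_held(recipient_domain) else "attempt"
```

The important detail is where the check happens: before a worker opens a connection, not after.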

That bought us time to fix the actual bug instead of the symptom.

The real fix

Two changes went in over the next couple of days.

First, a real timeout on the TLS handshake itself, separate from the connect timeout and the overall attempt timeout. A handshake that doesn’t complete in a few seconds is a handshake that isn’t going to complete. We’d been treating “connected” as the milestone to protect, when the milestone that matters is “finished negotiating.”

# before: only the bookends were guarded
connect_timeout       = 10s
delivery_timeout      = 300s

# after: guard the step that actually hung
connect_timeout       = 10s
tls_handshake_timeout = 8s
delivery_timeout      = 300s
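For readers who want to see what a handshake-level guard can look like in code, here is a minimal sketch using Python’s standard `ssl` module. It is an illustration under assumptions, not the delivery code from this incident: the function name is invented, and the socket is assumed to be already connected, with the peer having already agreed to STARTTLS.

```python
import socket
import ssl

TLS_HANDSHAKE_TIMEOUT = 8.0  # seconds, mirroring the config value above


def tls_handshake_with_timeout(sock, context, server_hostname,
                               timeout=TLS_HANDSHAKE_TIMEOUT):
    """Upgrade a connected socket to TLS, giving up if the handshake stalls."""
    # The socket timeout applies to the blocking handshake call below.
    sock.settimeout(timeout)
    tls_sock = context.wrap_socket(
        sock,
        server_hostname=server_hostname,
        do_handshake_on_connect=False,  # run the handshake explicitly instead
    )
    try:
        tls_sock.do_handshake()
    except OSError:
        # Covers socket.timeout and ssl.SSLError: release the connection so
        # the caller can defer the message and free its delivery slot.
        tls_sock.close()
        raise
    return tls_sock
```

A caller would typically catch `socket.timeout` here, treat it as a temporary failure, and defer the message. After a successful handshake it would set whatever timeout it wants for the rest of the SMTP conversation.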

Second, and more important, per-destination concurrency limits. No single destination is allowed to occupy more than a set fraction of the delivery pool at once. If a domain starts misbehaving, it can tie up its own slice and no more. The other destinations keep flowing. This is the change that means a repeat of this exact incident is a minor blip instead of a full stop.
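A per-destination cap is simple to sketch. The version below is a hypothetical Python illustration, assuming a threaded worker pool. The `DestinationLimiter` class, its parameters, and the percent-based share are invented for the example and say nothing about how the real limit is implemented.

```python
import threading
from collections import defaultdict


class DestinationLimiter:
    """Caps how many workers one destination may occupy at the same time."""

    def __init__(self, pool_size, max_share_percent):
        # Integer math avoids float rounding; every destination gets >= 1 slot.
        self.cap = max(1, pool_size * max_share_percent // 100)
        self._in_flight = defaultdict(int)
        self._lock = threading.Lock()

    def try_acquire(self, domain):
        """Return True if a worker may start a delivery to `domain` now."""
        with self._lock:
            if self._in_flight[domain] >= self.cap:
                return False  # over its slice: defer this message instead
            self._in_flight[domain] += 1
            return True

    def release(self, domain):
        """Call when the delivery attempt ends, whether it succeeded or not."""
        with self._lock:
            if self._in_flight[domain] > 0:
                self._in_flight[domain] -= 1
            if self._in_flight[domain] == 0:
                del self._in_flight[domain]
```

A worker would call `try_acquire` before opening a connection and `release` in a `finally` block. With a pool of 10 and a 20% share, a hung destination can hold at most 2 workers while the other 8 keep delivering.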

We also added an alert on per-destination delivery latency, not just queue depth. Queue depth told us we had a problem after it was already large. Watching how long deliveries to each destination are taking would have caught the one slow provider while the queue was still small.
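As a rough sketch of that kind of check, the Python below keeps a sliding window of recent delivery durations per destination and flags the ones whose median is over a threshold. The class name, the window size, the minimum sample count, and the choice of median are all assumptions made for the example.

```python
from collections import defaultdict, deque
from statistics import median


class DeliveryLatencyTracker:
    """Tracks recent delivery durations per destination domain."""

    def __init__(self, window=50, min_samples=5):
        self._min_samples = min_samples
        # Keep only the most recent `window` durations for each destination.
        self._samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, domain, seconds):
        self._samples[domain].append(seconds)

    def slow_destinations(self, threshold_seconds):
        """Domains with enough samples whose median duration is too high."""
        slow = []
        for domain, samples in self._samples.items():
            if (len(samples) >= self._min_samples
                    and median(samples) > threshold_seconds):
                slow.append(domain)
        return sorted(slow)
```

One caveat with this shape: a delivery that never finishes never reports a duration, so a real version would probably also need to look at the age of attempts that are still in flight.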

What we took away from it

The bug wasn’t exotic. A missing timeout on an intermediate step, plus a shared resource with no per-tenant limit, is a combination that shows up everywhere once you know to look for it. Connection pools, thread pools, database clients, anything with a fixed number of slots handed out to work of varying duration. One slow consumer with no cap will eventually starve the fast ones.

The lesson we keep relearning, and apparently needed to learn once more: put the timeout on the operation that can actually hang, not on the convenient boundary nearby. And never let one customer, one destination, or one tenant draw from a shared pool without a ceiling. The internet will always send you at least one server having a bad day. Your job is to make sure its bad day stays its own.