Testing a mail server means lying to it convincingly

A web app gets a request, does some work, returns a response. You can test most of it with a fake request object and a few assertions.

A mail server is harder to pin down, because the thing it talks to is the rest of the internet, and the rest of the internet is full of senders that ignore the spec. The interesting behavior lives in how our system reacts to a connection that opens and then says nothing for ninety seconds, or sends a MAIL FROM with a domain that doesn’t resolve, or pipelines six commands before we’ve answered the first one. You can’t unit-test your way to confidence about any of that. You have to produce the bad behavior on purpose.

So a large share of our test code does exactly one thing: it pretends to be a sender, usually a rude one.

The cheapest useful test is a script

Before any framework, the workhorse is a tiny client that speaks raw SMTP and lets us control every byte and every pause. Something close to this:

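import socket
import time

def session(host, port, script):
    # The 10-second timeout applies to each socket operation, not the whole session.
    s = socket.create_connection((host, port), timeout=10)
    f = s.makefile("rwb")

    def expect(code):
        # Read one reply line and check that it starts with the wanted status code.
        line = f.readline().decode()
        assert line.startswith(code), f"wanted {code}, got: {line!r}"
        return line

    def send(line, pause=0):
        # Send one command, then optionally stay silent for `pause` seconds.
        f.write(line.encode() + b"\r\n")
        f.flush()
        if pause:
            time.sleep(pause)

    try:
        for step in script:
            step(expect, send)
    finally:
        s.close()  # release the connection even when an assertion fails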

With that in hand, a test is just a script of expect and send calls. The valuable ones are where we misbehave deliberately:

def slow_loris(expect, send):
    expect("220")
    send("HELO test.invalid")
    expect("250")
    # Open a transaction, then go quiet and see if the timeout fires.
    send("MAIL FROM:<probe@test.invalid>", pause=120)
    expect("250")  # the reply to MAIL FROM itself, assuming the server accepts this sender
    expect("421")  # we should hang up, not hold the connection open forever

If that test stops getting a 421 and the read times out instead, someone changed a timeout and didn’t mean to. We’ve caught real regressions this way that no amount of mocking would have surfaced, because the bug was in the timing, not the logic.
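
The same harness covers the pipelining case mentioned earlier. What follows is a minimal sketch, assuming the expect and send callables from the session function above; the pipeline helper, the rude_burst name, and the expected reply codes are all hypothetical. Because the client says HELO rather than EHLO, the server never advertised the PIPELINING extension, so the burst is a protocol violation and the right reaction is a policy decision.

```python
def pipeline(commands, expected):
    """Build a script step that sends every command before reading any reply."""
    def step(expect, send):
        expect("220")             # wait for the greeting only
        for command in commands:
            send(command)         # fire the whole burst without reading
        for code in expected:
            expect(code)          # then require one reply per command, in order
    return step

# Assumed policy: the server tolerates the burst and answers in order.
# Swap in rejection codes if your server punishes unadvertised pipelining.
rude_burst = pipeline(
    commands=[
        "HELO test.invalid",
        "MAIL FROM:<probe@test.invalid>",
        "RCPT TO:<postmaster@test.invalid>",
        "QUIT",
    ],
    expected=["250", "250", "250", "221"],
)
```

It runs like any other script, by passing [rude_burst] as the script argument to session. Whichever codes end up in the expected list, the useful assertion is that replies arrive one per command and in order.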

Greylisting is a clock, so test the clock

Greylisting works by giving a first-time sender a temporary 4xx rejection and trusting that a real mail server will retry a few minutes later while most junk won’t. The whole mechanism is a stopwatch with opinions, which makes it miserable to test against wall-clock time.

The fix is the usual one: never let the code read the clock directly. Our greylist asks an injected clock for the time, and in tests we hand it a fake we can advance by hand.

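clock = FakeClock(start=0)
gl = Greylist(clock=clock, delay=300)
# A greylist key: client IP, envelope sender, envelope recipient (example values).
triplet = ("192.0.2.10", "sender@example.org", "user@example.net")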

assert gl.check(triplet) == "DEFER"   # first sight, come back later
clock.advance(120)
assert gl.check(triplet) == "DEFER"   # too soon
clock.advance(200)
assert gl.check(triplet) == "PASS"    # 320s later, welcome in

A test that would have taken five real minutes runs in under a millisecond, and it’s deterministic, which matters more. Flaky time-based tests train people to hit rerun until they pass, and that habit eventually hides a real failure.
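
For readers who want to see the moving parts, here is one way a FakeClock and a Greylist with that interface could be written. It is a sketch under assumptions, not the real implementation: the now() method, the in-memory dict, and the example triplet are all invented for illustration, and a production greylist would also expire old entries and persist its state.

```python
class FakeClock:
    """Test clock: time only moves when the test says so."""
    def __init__(self, start=0):
        self._now = start

    def now(self):
        return self._now

    def advance(self, seconds):
        self._now += seconds


class Greylist:
    """Defer a triplet until `delay` seconds have passed since first sight."""
    def __init__(self, clock, delay):
        self.clock = clock        # injected; never reads the system clock itself
        self.delay = delay
        self.first_seen = {}      # triplet -> time of first attempt

    def check(self, triplet):
        now = self.clock.now()
        # setdefault records the first sighting and returns it on later calls.
        first = self.first_seen.setdefault(triplet, now)
        return "PASS" if now - first >= self.delay else "DEFER"


clock = FakeClock(start=0)
gl = Greylist(clock=clock, delay=300)
# Example triplet: client IP, envelope sender, envelope recipient.
triplet = ("192.0.2.10", "sender@example.org", "user@example.net")
```

In production the same Greylist would be handed a clock whose now() wraps the system time; the class itself cannot tell the difference, which is the point.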

Fake the network, not the protocol

It’s tempting to mock out the SMTP layer entirely and test “the logic” in isolation. We’ve found that mocks at that level mostly test our assumptions about SMTP, not SMTP. When a sender does something we didn’t anticipate, the mock happily does the anticipated thing and the test stays green while production burns.

So we draw the line lower down. DNS, our reputation lookups, and outbound delivery get faked, because those are slow, flaky, or talk to the real world. The SMTP conversation itself runs for real against a server listening on localhost. The server doesn’t know it’s in a test. That gap, between what the server believes and what’s actually true, is where we get to be cruel to it.
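
A compressed illustration of where that line sits: the resolver is a fake injected into the server, while the client talks to it over a real TCP socket on localhost. Everything named here (FakeResolver, Handler, start_server, converse) is hypothetical, the toy Handler stands in for a real server only so the sketch runs on its own, and the 450 reply for an unresolvable sender domain is an assumed policy.

```python
import socket
import socketserver
import threading

class FakeResolver:
    """Stand-in for DNS: answers from a fixed set, never touches the network."""
    def __init__(self, known):
        self.known = {d.lower() for d in known}

    def resolves(self, domain):
        return domain.lower() in self.known

class Handler(socketserver.StreamRequestHandler):
    """Toy SMTP-ish server: just enough protocol to show the injection point."""
    def reply(self, text):
        self.wfile.write(text.encode() + b"\r\n")

    def handle(self):
        self.reply("220 localhost test server")
        for raw in self.rfile:
            line = raw.decode("ascii", "replace").strip()
            verb = line.upper()
            if verb.startswith("HELO"):
                self.reply("250 hello")
            elif verb.startswith("MAIL FROM:"):
                domain = line.rstrip(">").rpartition("@")[2]
                if self.server.resolver.resolves(domain):
                    self.reply("250 ok")
                else:
                    self.reply("450 sender domain not found")  # assumed policy
            elif verb.startswith("QUIT"):
                self.reply("221 bye")
                return
            else:
                self.reply("500 unrecognized")

def start_server(resolver):
    # Port 0 asks the OS for any free port, so parallel test runs do not collide.
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), Handler)
    server.daemon_threads = True
    server.resolver = resolver  # the fake goes in here; the wire stays real
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def converse(port, lines):
    """Send each line over a real socket and collect one reply per line."""
    replies = []
    with socket.create_connection(("127.0.0.1", port), timeout=5) as s:
        f = s.makefile("rwb")
        replies.append(f.readline().decode().rstrip())  # greeting
        for line in lines:
            f.write(line.encode() + b"\r\n")
            f.flush()
            replies.append(f.readline().decode().rstrip())
    return replies

server = start_server(FakeResolver({"example.org"}))
port = server.server_address[1]
replies = converse(port, ["HELO test.invalid", "MAIL FROM:<probe@test.invalid>", "QUIT"])
server.shutdown()
server.server_close()
```

The only thing the test controls is what the resolver claims to know. Framing, buffering, and connection teardown all happen on a real socket, so surprises at that level still show up.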

The senders we keep on file

Over the years we’ve collected a small zoo of captured bad behavior: the client that sends bare line feeds instead of CRLF, the one that issues RSET after every recipient, the load balancer health check that connects and immediately drops without a QUIT. Each one is a fixture now. When a sender out there does something strange enough to cause a ticket, the fix isn’t done until that strangeness is a test that fails before the fix and passes after.
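
One plausible shape for such fixtures is plain data: the exact bytes a client put on the wire, replayed untouched. The sketch below is an assumption about layout rather than a description of the real fixture files; the FIXTURES dict, the replay helper, and the payloads are illustrative, not real captures. It does not read replies between chunks, which a fuller version would need to do.

```python
import socket

# Illustrative payloads, not real captures: each fixture lists the exact
# byte chunks a misbehaving client would put on the wire, in order.
FIXTURES = {
    "bare_lf": [
        b"HELO test.invalid\n",
        b"MAIL FROM:<probe@test.invalid>\n",
    ],
    "rset_after_every_rcpt": [
        b"HELO test.invalid\r\n",
        b"MAIL FROM:<probe@test.invalid>\r\n",
        b"RCPT TO:<one@test.invalid>\r\n",
        b"RSET\r\n",
    ],
    "drop_without_quit": [],  # connect, say nothing, vanish
}

def replay(sock, name):
    """Send a fixture's bytes untouched, then drop the connection without QUIT."""
    for chunk in FIXTURES[name]:
        sock.sendall(chunk)
    sock.close()
```

Keeping the payloads as raw bytes matters: a helper that builds commands from strings would quietly repair the bare line feeds that made the original sender interesting.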

The suite grows slowly and on purpose. It isn’t a monument to coverage numbers. It’s a record of every way the internet has surprised us, kept around so it can’t surprise us the same way twice.