Heisenbugs in the Terminal: Could Simulation Testing Have Caught Ghostty?

Mitchell Hashimoto debugged Ghostty’s worst memory leak. A 37 GB leak. Ten days of uptime. The culprit hid in the scrollback buffer—a Heisenbug that only appeared under load, invisible during testing.

Mitchell’s work was surgical. VM tags. Malloc analysis. But the debugging happened after users reported it.

Could simulation testing have found it before then? In minutes?

Yes.

The Bug

Ghostty stores terminal lines in a PageList—pages of memory, doubly-linked. Two kinds:

Standard: Recycled from a pool. Fast.
Non-Standard: Large mmap blocks. For emoji-heavy lines. Rare.

When scrollback overflows, Ghostty takes the oldest page and reuses it as the newest.

Here’s the trap: a non-standard page gets pruned. Its metadata flips to ‘Standard Size.’ The underlying memory stays large. When that page dies, the allocator glances at the metadata, sees ‘Standard,’ and tosses it back into the pool. Never calls munmap. The OS never reclaims it.
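A toy model makes the trap concrete. This is a minimal Python sketch, not Ghostty's code: Page, ToyPageList, and the 4096-byte STANDARD_SIZE are hypothetical stand-ins. It tracks what the metadata claims separately from what was actually mapped.

```python
from dataclasses import dataclass

STANDARD_SIZE = 4096  # hypothetical standard page size in bytes


@dataclass
class Page:
    claimed_size: int  # what the metadata says
    actual_size: int   # what was really mapped


class ToyPageList:
    """Toy model of the reuse-then-free path; names are illustrative."""

    def __init__(self):
        self.pool = []     # pages the allocator believes are standard
        self.unmapped = 0  # bytes handed back to the "OS"

    def reuse_oldest(self, page: Page) -> Page:
        # The bug: metadata is reset, the backing memory is not.
        page.claimed_size = STANDARD_SIZE
        return page

    def free(self, page: Page) -> None:
        if page.claimed_size == STANDARD_SIZE:
            self.pool.append(page)             # trusts the metadata
        else:
            self.unmapped += page.actual_size  # stand-in for munmap


def leaked_bytes(pages: ToyPageList) -> int:
    # Memory parked in the pool beyond what a standard page accounts for.
    return sum(p.actual_size - STANDARD_SIZE for p in pages.pool)


pl = ToyPageList()
big = Page(claimed_size=65536, actual_size=65536)  # a non-standard page
pl.free(pl.reuse_oldest(big))
print(leaked_bytes(pl))  # prints 61440
```

One reuse, one free, and 60 KB is stranded in the pool. The munmap branch never runs.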

The leak was born.

Why Testing Missed It

Non-standard pages are rare, and the bug needs one to be the oldest page at the moment scrollback wraps. Even then the leak grows silently, a few MB at a time. The trigger that exposed it was specific: Claude Code’s dense output produces non-standard pages at scale.

No hand-written test catches this. Nobody thinks to write: ‘Generate 100,000 emoji lines, overflow scrollback, loop the pages, verify OS reclamation.’ Without already knowing the bug, that’s not testing. That’s guessing.

Simulation Testing

Deterministic Simulation Testing (DST) doesn’t rely on hand-written test cases. It builds a universe and breaks it.

Compress time. Real users saw the leak after ten days. A simulator decouples clock time from logic. Instead of default configs, it fuzzes them.

Set scrollback_limit = 5. Page reuse fires after every few lines of output. The rare event becomes routine.

Fuzz chaos. Stream ASCII, Unicode, emojis, control sequences. Imitate the shape of Claude Code’s output. Mix in rapid scrolling.
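The config and input fuzzing described above might look something like this. A minimal Python sketch: the alphabets, the 1 to 50 range for scrollback_limit, and the function names are all assumptions chosen for illustration. The one property that matters is that every random choice flows from a single seed.

```python
import random

ASCII = "abcdefghijklmnopqrstuvwxyz "
WIDE = ["é", "日", "本", "語"]
EMOJI = ["😀", "🚀", "👩‍💻", "🏳️‍🌈"]          # includes multi-codepoint graphemes
CONTROL = ["\x1b[2J", "\x1b[1;31m", "\x1b[0m", "\r"]


def gen_config(rng: random.Random) -> dict:
    # Tiny limits make page reuse routine instead of rare.
    return {"scrollback_limit": rng.randint(1, 50)}


def gen_line(rng: random.Random) -> str:
    # Pick one alphabet per line, then a random-length run from it.
    alphabet = rng.choice([ASCII, WIDE, EMOJI, CONTROL])
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(1, 80)))


def gen_workload(seed: int, lines: int = 500):
    rng = random.Random(seed)  # all randomness flows from this one seed
    config = gen_config(rng)
    return config, [gen_line(rng) for _ in range(lines)]


config, workload = gen_workload(seed=42)
```

Same seed, same config, same 500 lines. That is what makes a failure replayable later.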

Assert invariants. The key. A DST harness wraps the allocator. Because the simulator owns everything, it knows what’s in the pool:

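// Harness invariant: every page on the free list must be backed by a
// standard-sized allocation, whatever its metadata claims.
// MemoryPool, free_list, actual_size, and standard_size are illustrative names.
fn check_pool_integrity(pool: *const MemoryPool) void {
    var it = pool.free_list.first;
    while (it) |node| : (it = node.next) {
        if (node.actual_size != standard_size) {
            @panic("Corrupted: non-standard page in pool");
        }
    }
}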

When the bug tries to recycle a non-standard page as standard, the check fires. Immediately. Deterministically. With a seed that reproduces the run.

Replay

Traditional debugging: hope the bug shows up again. DST: rerun the failing seed, say 0x8F3A2.

Set scrollback_limit = 50. Generate 500 lines of Unicode. Wrap a non-standard page. Free it. The assertion fires.

Runtime: 40 milliseconds. Stack trace points to the metadata mismatch. Done.
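Replay can be sketched in a few lines. This is a toy Python simulation under loud assumptions: pages are [claimed, actual] pairs, 5% of new pages are non-standard, and run and first_failing_seed are hypothetical names. It models the buggy prune path, searches seeds until the invariant breaks, then reruns the failing seed.

```python
import random

STANDARD = 1  # size in arbitrary units; anything larger is non-standard


def run(seed: int, steps: int = 500):
    """Return the step at which the pool invariant breaks, or None."""
    rng = random.Random(seed)
    limit = rng.randint(1, 50)  # fuzzed scrollback limit, in pages
    pages = []                  # [claimed, actual] pairs, oldest first
    pool = []
    for step in range(steps):
        size = 8 if rng.random() < 0.05 else STANDARD
        pages.append([size, size])
        if len(pages) > limit:
            old = pages.pop(0)
            old[0] = STANDARD   # buggy prune: resets metadata only
            pool.append(old)    # free path trusts the metadata
            # Invariant: the pool holds only truly standard pages.
            if old[1] != STANDARD:
                return step
    return None


def first_failing_seed(max_seeds: int = 1000):
    for seed in range(max_seeds):
        if run(seed) is not None:
            return seed
    return None


seed = first_failing_seed()
first = run(seed)
replay = run(seed)
print(f"seed {seed:#x} fails at step {first}; replay fails at step {replay}")
```

The two runs report the same step because nothing in run depends on wall-clock time, threads, or anything else outside the seed.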

The Time Cost

The issue was reported three weeks ago. A fix landed 15 hours ago. Three weeks of users hitting 37 GB leaks. Three weeks of investigation. Mitchell’s debugging was meticulous. It had to be.

Simulation would have caught it in minutes. Not because simulation is magical. Because it doesn’t wait for rare conditions. It creates them.

Type Confusion

Mitchell’s assumption made sense: ‘Standard pages are common. Optimize them.’ But assumptions hide bugs. DST forces verification: does the common-case optimization fail catastrophically when the uncommon case shows up? Across millions of scenarios?

One HN commenter named it: type confusion. The system lied about a page’s type. DST catches lies.

Stop Debugging

Mitchell’s fix was clean: munmap non-standard pages instead of recycling them. His debugging was brilliant. But debugging is what happens when testing fails.
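The shape of that fix, as a toy Python model rather than Ghostty's actual code. FixedPageList, take_standard, and the 4096-byte STANDARD_SIZE are hypothetical, and falling back to a fresh allocation when the pool is empty is an assumption made for the sketch.

```python
from dataclasses import dataclass

STANDARD_SIZE = 4096  # hypothetical standard page size in bytes


@dataclass
class Page:
    actual_size: int  # bytes really backing this page


class FixedPageList:
    """Toy model of the fixed prune path; names are illustrative."""

    def __init__(self):
        self.pool = [Page(STANDARD_SIZE)]  # pre-filled pool of standard pages
        self.unmapped = 0                  # bytes returned to the "OS"

    def take_standard(self) -> Page:
        # Assumption: allocate fresh when the pool is empty.
        return self.pool.pop() if self.pool else Page(STANDARD_SIZE)

    def prune_oldest(self, page: Page) -> Page:
        if page.actual_size != STANDARD_SIZE:
            self.unmapped += page.actual_size  # stand-in for munmap
            return self.take_standard()        # never recycle the big page
        return page                            # standard pages are reused as-is


pl = FixedPageList()
newest = pl.prune_oldest(Page(actual_size=65536))
```

The decision keys off the real backing size, so there is no metadata left to lie.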

Stop waiting for users to report 37 GB leaks. Build systems that find their own bugs.

The terminal is not a UI. It’s a distributed database with strict consistency. Simulate it like one.