
Deterministic Simulation Testing

Published: Feb 28, 2026
Updated: May 3, 2026
Punta Cana, Dominican Republic

Someone asked me a simple question:

What is the difference between chaos testing and deterministic simulation testing?

Here is my answer.

Chaos testing attacks a live or staging system and asks, ‘Does it survive?’

Deterministic simulation testing builds a fake world around the system and asks, ‘Can I replay the exact failure?’

Both are useful. They answer different questions.

Chaos testing gives you confidence in a running system. DST gives you control over the causes of failure. It removes the usual hiding places: timing, scheduling, network jitter, retries, partial IO, bad inputs, and ‘works on my machine.’

I first wrote about deterministic software on 2025-04-22. Since then, the idea has become sharper for me.

DST is not random testing with a seed. The seed is just the handle.

DST is a debugging pipeline.

It builds a deterministic world. It injects faults. It checks invariants. It writes artifacts. It replays failures. It minimizes them. Then it promotes the stable failures into named regressions.

The real product is reproducibility.

DST Turns Bugs Into Coordinates

A flaky bug is usually a story:

‘The agent got stuck after a tool call, but only sometimes.’

A DST bug should be a coordinate:

‘Run seed 0x11, scenario seed 0x11, replay log entry 184.’

That coordinate is the point. Not randomness. Not ceremony. Not a bigger test suite for its own sake.

The best public example I have read is TigerBeetle’s src/vopr.zig. I rechecked it while writing this. The latest commit touching that file is 3aa232f, committed on 2026-04-26.

TigerBeetle takes a seed, creates a PRNG from it, and derives the world from that PRNG. That world is not a single switch like ‘drop packets.’ It includes cluster shape, client count, workload shape, network delay, packet loss, replayed packets, partitions, storage latency, storage faults, replica crashes, restarts, reformats, pauses, unpauses, and rolling upgrades.
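The pattern can be sketched in a few lines of Python (all names here are hypothetical, not TigerBeetle's actual Zig code): one PRNG is seeded once, and every parameter of the world is a pure function of that PRNG, so the same seed always yields the same universe.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class World:
    replica_count: int
    client_count: int
    packet_loss: float
    storage_fault_rate: float
    crash_schedule: tuple  # ticks at which a replica crashes and restarts

def generate_world(seed: int) -> World:
    # One PRNG, seeded once: every field below is derived from it, in a
    # fixed order, so the world is a deterministic function of the seed.
    rng = random.Random(seed)
    return World(
        replica_count=rng.randint(3, 7),
        client_count=rng.randint(1, 8),
        packet_loss=rng.uniform(0.0, 0.3),
        storage_fault_rate=rng.uniform(0.0, 0.05),
        crash_schedule=tuple(sorted(rng.sample(range(10_000), k=rng.randint(0, 4)))),
    )

# Same seed, same world: the property everything else depends on.
assert generate_world(0x11) == generate_world(0x11)
```

Note what is absent: no wall clock, no ambient randomness, no environment reads. The seed is the only input.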

That is the first lesson.

A serious simulator does not inject one failure. It generates a universe where failures interact.

A retry bug might need packet loss. A recovery bug might need storage corruption. An upgrade bug might need a replica to crash at the wrong moment. An agent bug might need a tool timeout, a partial provider response, a context-window decision, and a retry budget.

If each of those is controlled by normal wall-clock time and ambient IO, you get folklore.

If each is controlled by a seed, you get a coordinate.

The Simulator Needs A World Model

Most teams start too small.

They add a fake clock. They mock one API response. Then they call it deterministic.

That is not enough.

The simulator needs a world rich enough to produce the failures production produces.

For a distributed database, that means replicas, standbys, clients, network paths, storage devices, requests, releases, and recovery behavior.

For an agent system, the world model should include:

  • user commands
  • model responses
  • streaming chunks
  • tool calls
  • tool results
  • permission decisions
  • context-window pressure
  • retries
  • provider errors
  • cancellation
  • process boundaries
  • artifact storage
  • timeout behavior
  • degraded runtime behavior

The list changes by product. The principle does not.

If production can observe it, delay it, corrupt it, drop it, reorder it, retry it, or partially complete it, the simulator should eventually model it.

I am seeing this in a personal project I am working on now. The testing stack separates broad simulation from narrower DST regressions. The broad simulator explores seeds and scenarios. DST owns replay, fork, explore, sequencing, fail-soft behavior, provider simulation, and degraded runtime behavior.

That split matters.

Broad simulation is the discovery machine.

DST is the regression machine.

You need both. Broad exploration finds strange behavior. Named deterministic regressions keep the strange behavior from returning.

You also need fuzz testing beside DST.

DST is best at external nondeterminism: time, IO, network shape, provider behavior, tool behavior, scheduling, storage, crashes, retries, and recovery. Fuzz testing is best at internal logic: parsers, serializers, sizing, escaping, truncation, state transitions, arithmetic, and local invariants.

Together, they attack heisenbugs from both sides. DST makes the outside world reproducible. Fuzzing makes each component hard to surprise.

That combination can catch most, if not all, of the heisenbugs that slip between unit tests, integration tests, and production telemetry.

It is also how I now structure my client projects. DST handles the external world. Fuzz testing hardens the internal logic. The pair gives you much better odds against bugs that appear only when timing, malformed input, retries, and edge-case state meet.

I will draft a separate post on fuzz testing later. I think it is too important to compress into one section here, especially because fuzzing is one of the foundations that makes simulation testing practical.

Check Safety, Then Check Liveness

TigerBeetle also taught me to separate safety from liveness.

First, the simulator runs in a fault-heavy mode. Replicas crash and restart. Storage fails. The network partitions. Packets disappear or replay. The system must keep processing requests without violating safety.

Then the simulator switches to liveness mode.

It heals a core set of replicas, disables the disruptive faults, and asks whether the system converges. If the system should be recoverable, it must recover. If it converges, TigerBeetle validates durable state, including append-only files, against final checksum expectations.

That is much stronger than ‘the test did not crash.’

For agent systems, the equivalent split is:

  1. Safety phase: inject provider failures, tool faults, timeouts, bad outputs, token pressure, permission denials, and cancellation.
  2. Liveness phase: remove the artificial pressure and verify the agent can finish, fail cleanly, or reach a stable terminal state.
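The two-phase structure can be sketched with a toy harness (the state shape and fault rate are illustrative): faults stay on through the safety phase, where only invariants are checked, then faults turn off and the liveness phase demands convergence.

```python
import random

def run_simulation(seed: int, safety_ticks: int = 200, liveness_ticks: int = 50) -> dict:
    rng = random.Random(seed)
    state = {"pending": 5}  # toy workload: five units of work to finish

    def step(state: dict, faults_enabled: bool) -> None:
        if faults_enabled and rng.random() < 0.4:
            return  # fault injected: this step is dropped, progress stalls
        if state["pending"] > 0:
            state["pending"] -= 1
        # Safety invariant: pending work never goes negative.
        assert state["pending"] >= 0, "safety violation"

    # Phase 1 (safety): faults on. The system may stall, but it must never
    # do anything forbidden.
    for _ in range(safety_ticks):
        step(state, faults_enabled=True)

    # Phase 2 (liveness): faults off. Once the world becomes fair, the
    # system must converge to a stable terminal state.
    for _ in range(liveness_ticks):
        step(state, faults_enabled=False)

    assert state["pending"] == 0, "liveness violation: no convergence after heal"
    return state
```

The safety phase never asserts progress, and the liveness phase never injects faults. Mixing the two is how simulators end up unable to say which property actually failed.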

This gives you two different questions:

  • Safety: did the system do anything forbidden?
  • Liveness: once the world became fair, did the system make progress?

Both matter.

An agent that never violates permissions but loops forever is broken.

An agent that completes the task by ignoring a denied permission is also broken.

DST infrastructure should make both failures visible.

Assertions Are Part Of The System

TigerBeetle’s VOPR refuses unsupported build modes that disable assertions.

That is not just a simulator choice. It is part of TigerStyle, TigerBeetle’s engineering philosophy.

TigerBeetle treats assertion failures as programmer errors. The correct response to corrupt code is to crash. In a database, continuing from an inconsistent state is more dangerous than stopping the process and forcing recovery. They have explained this posture publicly too, including in this TigerBeetle engineering talk.

That is the right posture.

Assertions are not decoration. They are sensors. They tell the simulator or production system when an invariant has been violated.

The mistake is to turn them off once code gets ‘serious.’ DST flips that. The simulator is only as useful as the invariants it checks. The production system is only as safe as the corrupt states it refuses to continue from.

Good invariants are plain:

  • token accounting never goes negative
  • tool calls per turn stay bounded
  • denied tools do not execute later in the same path
  • replay logs have no sequence gaps
  • state transitions remain valid
  • terminal states are consistent
  • recovery attempts stay bounded
  • artifacts never escape their artifact directory
  • a supposedly replayable seed has a replay log

These are not unit tests in the usual sense. They are laws of the simulated universe.

The more laws you encode, the more useful each generated world becomes.
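A few of the laws above can be sketched as plain predicates checked after every transition (the state shape and field names are hypothetical):

```python
def check_invariants(state: dict) -> None:
    assert state["tokens_used"] >= 0, "token accounting went negative"
    assert state["tool_calls_this_turn"] <= state["max_tool_calls"], \
        "tool calls per turn exceeded bound"
    assert not (state["executed_tools"] & state["denied_tools"]), \
        "a denied tool executed later in the same path"
    # Replay logs must have no sequence gaps.
    seqs = [entry["seq"] for entry in state["replay_log"]]
    assert seqs == list(range(len(seqs))), "sequence gap in replay log"

state = {
    "tokens_used": 120,
    "tool_calls_this_turn": 2,
    "max_tool_calls": 8,
    "executed_tools": {"read_file"},
    "denied_tools": {"delete_repo"},
    "replay_log": [{"seq": 0}, {"seq": 1}, {"seq": 2}],
}
check_invariants(state)  # passes: no law is broken
```

In a simulator, `check_invariants` runs after every step of every seed, which is what makes each generated world a test of every law at once.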

The Architecture Pattern

The hard part of DST is not generating random numbers.

The hard part is drawing a clean boundary around nondeterminism.

The personal project uses a pattern I now consider the baseline: dependency injection plus a pure state machine.

All entropy flows through a context:

  • live context for production time, IDs, and randomness
  • simulation context for seeded PRNG, deterministic time, and deterministic IDs
  • replay context for recorded values
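The context pattern can be sketched like this (interface and class names are illustrative, not from any real project): production and simulation implement the same small surface, and the core code never reaches past it.

```python
import itertools
import random
import time
import uuid
from typing import Protocol

class Context(Protocol):
    def now(self) -> float: ...
    def new_id(self) -> str: ...
    def rand(self) -> float: ...

class LiveContext:
    """Production: real clock, real randomness, real IDs."""
    def now(self) -> float: return time.time()
    def new_id(self) -> str: return str(uuid.uuid4())
    def rand(self) -> float: return random.random()

class SimContext:
    """Simulation: everything derived from the seed, so runs replay exactly."""
    def __init__(self, seed: int):
        self._rng = random.Random(seed)
        self._clock = 0.0
        self._ids = itertools.count()
    def now(self) -> float:
        self._clock += 0.01  # time advances only when the simulator says so
        return self._clock
    def new_id(self) -> str: return f"sim-{next(self._ids)}"
    def rand(self) -> float: return self._rng.random()

# Two simulations with the same seed observe an identical world.
a, b = SimContext(42), SimContext(42)
assert [a.rand() for _ in range(3)] == [b.rand() for _ in range(3)]
```

A replay context would implement the same interface but return recorded values instead of generated ones, which is why the boundary has to cover IDs and time, not just randomness.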

Then the core system becomes a state machine:

transition(state, input) -> state

The transition does not read the clock. It does not call the network. It does not inspect global mutable state. It does not depend on hash-map iteration order. It receives state and input, then returns state.

That gives you replay.

If every external observation becomes an input, and every input is appended to an ordered log with sequence numbers and initial-state metadata, a failure stops being a memory. It becomes a file.

Once you have that file, you can do four powerful things:

  1. Replay the exact run.
  2. Replay to a specific sequence.
  3. Fork at a sequence and inject a hypothetical input.
  4. Explore alternative futures from the forked state.
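All four operations fall out of one pure function plus one ordered log. A minimal sketch (event shapes and names are hypothetical):

```python
from typing import Optional

def transition(state: dict, event: dict) -> dict:
    # Pure: no clock, no network, no globals. New state from old state + input.
    new = dict(state)
    if event["type"] == "tool_result":
        new["results"] = state["results"] + [event["value"]]
    elif event["type"] == "error":
        new["failed"] = True
    return new

def replay(initial: dict, log: list, upto: Optional[int] = None) -> dict:
    state = initial
    for event in log[:upto]:
        state = transition(state, event)
    return state

initial = {"results": [], "failed": False}
log = [
    {"seq": 0, "type": "tool_result", "value": "ok"},
    {"seq": 1, "type": "error"},  # the recorded failure
]

# 1. Replay the exact run: the failure reproduces.
assert replay(initial, log)["failed"]

# 2. Replay to just before the failure, then 3. fork with a hypothetical input.
forked = transition(
    replay(initial, log, upto=1),
    {"seq": 1, "type": "tool_result", "value": "retry ok"},
)

# 4. Explore the alternative future: this branch succeeds.
assert not forked["failed"]
```

The forked branch answering "would a successful retry have avoided the failure?" is exactly the precise knowledge the fork-and-explore step below is after.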

This is the time-travel part of DST.

It is also where many implementations fail. They seed the PRNG, but leave time, IDs, tool output, filesystem state, provider behavior, and scheduling outside the deterministic boundary. Then the seed reproduces only part of the bug.

A partial seed is not enough.

The simulator has to own the world.

The Workflow

The workflow should be simple.

First, swarm.

Run many seeds. Vary the world. Track seed and scenario seed. Record tool coverage, fault counts, outcomes, failure signatures, build identity, and replayability.

Second, capture artifacts.

Every interesting or failing seed should produce enough state to inspect later. In practice, that means a seed index, per-seed summaries, replayable input logs, and omission reasons when a log is intentionally not captured.
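The swarm-and-capture steps can be sketched together (the `% 7` failure rule is a stand-in for a real simulation, and the artifact schema is illustrative): every failing seed writes a replayable artifact, and a seed index records every outcome.

```python
import json
import os
import tempfile

def simulate(seed: int) -> dict:
    failed = seed % 7 == 0  # toy stand-in for a real simulation outcome
    return {"failed": failed, "replay_log": [{"seq": 0, "seed": seed}]}

artifact_dir = tempfile.mkdtemp()
index = []
for seed in range(20):
    outcome = simulate(seed)
    index.append({"seed": seed, "failed": outcome["failed"]})
    if outcome["failed"]:
        # Persist enough state to replay this seed later: seed, build
        # identity, and the full input log.
        path = os.path.join(artifact_dir, f"seed-{seed}.json")
        with open(path, "w") as f:
            json.dump({"seed": seed, "build": "abc123",
                       "replay_log": outcome["replay_log"]}, f)

with open(os.path.join(artifact_dir, "index.json"), "w") as f:
    json.dump(index, f)
```

The index answers "which seeds are interesting?" and the per-seed files answer "what exactly happened?"; neither question should require re-running the swarm.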

Third, replay.

The internal debug command should resolve either a direct log path or an artifact-backed seed. Mixed forms should fail. Bad artifacts should fail cleanly. Non-replayable artifacts should explain why.

Fourth, fork and explore.

Replay to the point before failure. Branch. Try a different provider result. Try a different tool result. Try a recovery input. If the branch succeeds, you have learned something precise about the failure.

Fifth, minimize.

A huge failing seed is useful, but a smaller reproducer is better. Search should consume the artifact, find a smaller or clearer failing case, and emit a minimized replayable log.
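A simple greedy form of this search can be sketched as follows (the failure predicate is a toy stand-in for "replaying this log reproduces the failure signature"): drop one log entry at a time and keep the drop only if the failure persists.

```python
def fails(log: list) -> bool:
    # Toy predicate: this bug needs both a 'timeout' and a 'retry' event.
    kinds = {event["type"] for event in log}
    return {"timeout", "retry"} <= kinds

def minimize(log: list) -> list:
    minimized = list(log)
    i = 0
    while i < len(minimized):
        candidate = minimized[:i] + minimized[i + 1:]
        if fails(candidate):
            minimized = candidate  # entry was irrelevant to the failure: drop it
        else:
            i += 1                 # entry is needed to reproduce: keep it
    return minimized

log = [{"type": t} for t in ["start", "tool", "timeout", "tool", "retry", "end"]]
small = minimize(log)
assert fails(small) and len(small) == 2  # only 'timeout' and 'retry' survive
```

This is a linear-pass cousin of delta debugging; a real minimizer would also shrink values inside entries, not just delete entries, but the contract is the same: the output must still replay to the same failure.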

Sixth, promote.

Do not leave stable failures as tribal knowledge in .vopr-artifacts.

Promote them into named DST regressions. That is how discovery feeds regression.

This is the loop:

Broad simulation discovers.

Artifacts preserve.

Replay explains.

Search minimizes.

DST regression protects.

A Practical Checklist

If I were building DST infrastructure from scratch today, I would start with this checklist:

  • Build one nondeterminism boundary for time, random values, IDs, and external observations.
  • Make the core transition pure enough to replay.
  • Use deterministic data structures or deterministic ordering.
  • Log every external input with sequence numbers and initial-state metadata.
  • Generate worlds, not isolated failures.
  • Model the faults production actually produces.
  • Split safety checks from liveness checks.
  • Treat assertions and invariants as simulator sensors.
  • Persist artifacts for every interesting failure.
  • Include seed, scenario seed, build identity, config, summary, and replay log path.
  • Make replay, fork, and explore artifact-aware.
  • Minimize failing seeds into smaller replayable cases.
  • Promote stable discoveries into named regressions.
  • Keep broad exploration and deterministic regression tests as separate layers.

That is the difference between chaos testing and DST.

Chaos testing asks whether the system survives.

DST gives you the coordinates of the failure.

Content Attribution: 95% by Alpha, 5% by Claude
  • 95% by Alpha: Original draft and core concepts
  • 5% by Claude: Content editing and refinement
  • Note: Estimated 5% AI contribution based on 100% lexical similarity and minor polishing.