
Simulation Testing: Is Your Software Built To Fix Bugs or Predict Them?

Published: Sep 4, 2025
Vancouver, Canada

We treat AI agents like magic spells. We write a prompt, whisper an incantation of context, and hope the black box spits out something useful. When it works, we call it intelligence. When it fails, we call it a hallucination and try again. This isn’t engineering. It’s tinkering.

We are so mesmerized by the dream of the autonomous agent that we’ve forgotten every hard-won lesson from decades of building complex systems. We are shipping first and asking questions later, building the digital equivalent of bridges without blueprints and hoping they don’t collapse.

As an ex-civil engineer, I learned this lesson viscerally. We would never, ever touch construction without engineering simulation. No responsible engineer would pour concrete for a bridge without first modeling load distributions, stress points, and failure modes. We’d simulate wind loads, seismic activity, thermal expansion—every force that nature could throw at our structure. The liability alone would be catastrophic, but more importantly, it would be unethical. Lives depend on that rigor. Yet here we are, deploying AI agents to production with less care than we’d apply to a residential deck.

Alan Kay saw this coming fifty years ago. He watched software developers build systems with a ‘fix-it-later’ mentality and recognized it not as a new paradigm, but as a failure to learn from every other mature engineering discipline. His mandate was simple and severe: you are not allowed to build what you cannot first simulate.

The future of robust, reliable agentic AI isn’t a better LLM. It’s a better process. It’s time we stopped fixing bugs and started predicting them out of existence.

From Fabrication to Simulation

Mature engineering—the kind that builds jet engines and skyscrapers—follows a clear path: Design-Simulate-Build (CAD-SIM-FAB). You design the system, you simulate its behavior under every conceivable stress, and only when you know it will work do you begin the expensive process of fabrication.

In civil engineering, this isn’t optional—it’s legally mandated. We use finite element analysis to model every beam, every joint, every foundation. We apply safety factors of 1.5x, 2x, sometimes 5x the expected loads. We simulate a century of weather patterns, model soil liquefaction during earthquakes, calculate fatigue cycles from daily temperature swings. The simulation isn’t just a test; it’s the proof that our design meets code. Without it, you don’t get a building permit. Without it, you don’t get insurance. Without it, you’re not an engineer—you’re a gambler.

Software, and especially agentic AI, has this dangerously backward. Our process is Build-Test-Lament.

  1. Build: We cobble together an agent with a prompt and a few tools.
  2. Test: We run it against live APIs and production data.
  3. Lament: We spend 80% of our time debugging the unpredictable mess that results.

This is the core distinction between an autonomous agent and an engineered agentic workflow. An agent, given a vague goal, is a probabilistic, exploratory system. It’s a million lines of code we hope works. A workflow is deterministic and designed. But even workflows are often built without being rigorously simulated in the environments where they will run.

Kay’s philosophy forces us to confront this. To him, software is not a craft to be tinkered with; it’s a system to be understood. And understanding comes from simulation.

What is Simulation for an AI Agent?

Simulation testing isn’t just a better version of unit testing. It is a fundamentally different way to design and validate agentic systems. It means creating a fully deterministic, observable, and controllable universe for your agent to live in before it ever touches the real world.

This is what it looks like in practice.

1. Simulate the Environment: Complete System Replication

Instead of letting an agent run wild on your actual filesystem or call live APIs, you build a high-fidelity replica of your entire system.

  • Filesystem: A virtual, in-memory filesystem (like pyfakefs) where the agent can read, write, and delete files without consequence. You control the entire state.
  • APIs & Services: Mock servers that replicate the behavior of real-world APIs, from successful responses to network errors and authentication failures.
  • Databases: Test containers that spin up an identical, ephemeral version of your production database for each simulation run.

The goal is to model the entire system the agent will interact with. This eliminates network latency, rate limits, and non-determinism, allowing for thousands of tests to be run in seconds, all with perfectly repeatable outcomes.
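To make this concrete, here is a minimal sketch in Python. The run_agent function is a hypothetical placeholder for your own agent; the filesystem is faked with pyfakefs and the external API is a plain unittest.mock stub, so the test is hermetic and gives the same answer every time.

```python
# Minimal sketch, assuming a hypothetical `run_agent` entry point that
# stands in for your own agent. The filesystem is virtual (pyfakefs) and
# the API is a mock, so nothing real is touched.
import os
from unittest.mock import Mock

from pyfakefs.fake_filesystem_unittest import Patcher


def run_agent(task: str, fs_root: str, api_client) -> None:
    """Placeholder agent: deletes *.log files only if the policy allows it."""
    policy = api_client.get_cleanup_policy()
    if policy["allow_delete"]:
        for name in os.listdir(fs_root):
            if name.endswith(".log"):
                os.remove(os.path.join(fs_root, name))


def test_agent_only_deletes_what_policy_allows():
    with Patcher() as patcher:
        # Seed the virtual filesystem: the agent sees these files as real.
        patcher.fs.create_file("/var/app/app.log", contents="old log data")
        patcher.fs.create_file("/var/app/config.yaml", contents="service: billing")

        # Stub the external API: deterministic response, no network.
        api = Mock()
        api.get_cleanup_policy.return_value = {"allow_delete": True}

        run_agent("clean up old logs", "/var/app", api)

        # Assert on the state of the simulated world, not the real disk.
        assert not os.path.exists("/var/app/app.log")   # removed, as allowed
        assert os.path.exists("/var/app/config.yaml")   # never touched
        api.get_cleanup_policy.assert_called_once()
```

Because the whole environment lives in memory, you can run hundreds of variations of this test per second, each starting from an identical, known state.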

2. Simulate User Interactions: The Prompt Laboratory

The most overlooked aspect of agent testing is simulating the prompts themselves. I’m not just talking about testing with ‘hello world’—I mean simulating the full spectrum of human interaction patterns, from the perfectly clear to the utterly chaotic.

In practice, this means:

  • Prompt Variations: Testing how your agent handles the same request phrased 20 different ways. ‘Delete the file,’ ‘remove that file,’ ‘get rid of the file,’ ‘can you delete the file please?’—each potentially triggering different behaviors.
  • Ambiguous Instructions: Simulating vague user requests like ‘fix the bug’ or ‘make it better’ to ensure your agent asks for clarification rather than guessing.
  • Adversarial Inputs: Testing prompt injection attempts, contradictory instructions, and requests that would violate safety boundaries.
  • Context Degradation: Simulating how prompts interact with varying amounts of context—what happens when the user refers to ‘that thing we discussed’ after 50 messages?

The key insight is that prompts are inputs that need the same rigorous testing as any other system input. You wouldn’t deploy an API without testing malformed JSON; why deploy an agent without testing malformed instructions?
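A sketch of what this looks like as a test suite, using pytest parametrization. The plan_action function here is a toy stand-in for your agent's intent resolution; the point is the shape of the prompt laboratory, not the keyword matching.

```python
# Minimal sketch of a prompt laboratory. `plan_action` is an illustrative
# placeholder for whatever maps a user request to the agent's next tool call.
import pytest


def plan_action(prompt: str) -> str:
    """Placeholder planner: stand-in for your agent's intent resolution."""
    text = prompt.lower()
    if "delete" in text or "remove" in text or "get rid" in text:
        return "delete_file"
    return "ask_for_clarification"


@pytest.mark.parametrize("prompt", [
    "Delete the file",
    "remove that file",
    "get rid of the file",
    "can you delete the file please?",
])
def test_same_request_many_phrasings(prompt):
    # Every phrasing of the same request must resolve to the same action.
    assert plan_action(prompt) == "delete_file"


@pytest.mark.parametrize("prompt", [
    "fix the bug",
    "make it better",
])
def test_ambiguous_requests_ask_for_clarification(prompt):
    # Vague instructions should never trigger a destructive tool call.
    assert plan_action(prompt) == "ask_for_clarification"
```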

3. Simulate Time: The Deterministic Clock

The most powerful tool in simulation is control over time. In a simulated environment, you can replace real-world time with a deterministic clock.

This allows you to:

  • Instantly Advance Time: Test logic for timeouts, retries, and scheduled events without sleep() calls that cripple your test suite.
  • Freeze Time: Pause the entire universe at the exact moment an error occurs to inspect the agent’s state.
  • Rewind Time: Step backward to see the chain of events that led to a failure.

This decoupling of the simulation clock is the key to creating tests that are not just fast, but completely predictable.
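One way to get there is to make the agent depend on a clock interface instead of calling time.time() or time.sleep() directly. Here is a minimal sketch with illustrative names (FakeClock, wait_for_api) rather than any particular library:

```python
# Minimal sketch of a deterministic clock. Time only moves when the test
# says so, so a 30-second timeout elapses instantly.
class FakeClock:
    def __init__(self, start: float = 0.0):
        self.now = start

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        # Advancing a counter replaces real waiting.
        self.now += seconds


def wait_for_api(clock, is_ready, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Poll until `is_ready()` returns True or the (simulated) timeout expires."""
    deadline = clock.time() + timeout
    while clock.time() < deadline:
        if is_ready():
            return True
        clock.sleep(interval)
    return False


def test_timeout_without_real_waiting():
    clock = FakeClock()
    # An API that never becomes ready: the full timeout is exercised in
    # microseconds because the clock is simulated.
    assert wait_for_api(clock, is_ready=lambda: False) is False
    assert clock.time() >= 30.0
```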

4. Simulate Failure: Engineering for Resilience

The real world is chaos. APIs fail. Files get locked. LLMs hallucinate. A simulation is where you make that chaos your ally. Kay’s principle was to design with failure as a primary consideration, and simulation is the tool to do it.

Instead of hoping your agent handles a 503 Service Unavailable error, you engineer the simulated API to throw that error at a specific, repeatable moment. You test how the agent behaves under every fault condition until you are certain it is resilient.

  • What happens if the LLM returns malformed JSON?
  • What happens if a critical file is missing?
  • What happens if the agent loses its authentication token mid-workflow?

In a simulation, these aren’t bugs; they are test cases.
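A sketch of scripted failure injection. FlakyAPI and fetch_with_retry are illustrative stand-ins: the fake service raises a 503-style error on exactly the first two calls, and the test proves the retry logic absorbs them.

```python
# Minimal sketch of deterministic fault injection: the failure happens at a
# specific, repeatable moment rather than whenever the network feels like it.
class ServiceUnavailable(Exception):
    """Stand-in for an HTTP 503 from a real client library."""


class FlakyAPI:
    def __init__(self, failures_before_success: int):
        self.calls = 0
        self.failures = failures_before_success

    def get(self, path: str) -> dict:
        self.calls += 1
        if self.calls <= self.failures:
            raise ServiceUnavailable(path)
        return {"status": "ok", "path": path}


def fetch_with_retry(api, path: str, attempts: int = 3) -> dict:
    """The behavior under test: bounded retries, then fail loudly."""
    last_error = None
    for _ in range(attempts):
        try:
            return api.get(path)
        except ServiceUnavailable as exc:
            last_error = exc
    raise last_error


def test_agent_survives_two_503s():
    api = FlakyAPI(failures_before_success=2)
    assert fetch_with_retry(api, "/jobs")["status"] == "ok"
    assert api.calls == 3  # two failures plus one success, and no more
```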

5. Simulate Agent Effects: Dual Impact Testing

When an agent executes, it creates two distinct categories of effects that both need to be simulated and validated:

Direct Effects: These are the intended outcomes—the files modified, the API calls made, the database records updated. In simulation, you track every state change the agent makes and validate it against expected outcomes. Did it modify the right files? Did it call the APIs in the correct sequence? Did it handle transactions properly?

Indirect Effects: These are the ripple effects that are often ignored until they cause production incidents:

  • Resource Consumption: Memory usage, CPU cycles, API rate limit consumption
  • Cascading Behaviors: How one agent’s actions affect other systems or agents
  • State Accumulation: Temporary files, cache buildup, log explosion
  • Timing Dependencies: Race conditions between parallel operations

The simulation must capture both categories. You’re not just testing ‘did the agent complete the task?’ but also ‘what was the cost of completion?’ and ‘what mess did it leave behind?’ This dual-effect simulation is what separates toy demos from production-ready systems.
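One way to sketch this is an effect ledger that instrumented tools report into, so a single test can assert on both categories at once. EffectLedger is an illustrative structure, not a library; a real setup would have the simulated filesystem and API record into it automatically.

```python
# Minimal sketch of dual-impact accounting: direct effects (what the agent
# did) and indirect effects (what it cost, what it left behind).
from dataclasses import dataclass, field


@dataclass
class EffectLedger:
    files_written: list = field(default_factory=list)  # direct effects
    api_calls: list = field(default_factory=list)      # direct effects
    temp_files: set = field(default_factory=set)       # indirect: state accumulation
    tokens_used: int = 0                                # indirect: resource consumption

    def record_write(self, path: str) -> None:
        self.files_written.append(path)
        if "/tmp/" in path:
            self.temp_files.add(path)

    def record_delete(self, path: str) -> None:
        self.temp_files.discard(path)

    def record_api_call(self, endpoint: str, tokens: int = 0) -> None:
        self.api_calls.append(endpoint)
        self.tokens_used += tokens


def test_agent_completes_without_leaving_a_mess():
    ledger = EffectLedger()

    # In a real run, instrumented tools would call the ledger; here we
    # replay a recorded trace to keep the sketch self-contained.
    ledger.record_api_call("/llm/complete", tokens=1200)
    ledger.record_write("/workspace/report.md")
    ledger.record_write("/tmp/agent-scratch.json")
    ledger.record_delete("/tmp/agent-scratch.json")

    # Direct effects: the intended outcome happened, in the right order.
    assert "/workspace/report.md" in ledger.files_written
    assert ledger.api_calls == ["/llm/complete"]

    # Indirect effects: the cost stays within budget and nothing is left behind.
    assert ledger.tokens_used < 5_000
    assert not ledger.temp_files
```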

6. Simulate Conversational History: Temporal Context Testing

Real-world agent interactions aren’t one-shot commands—they’re ongoing conversations that evolve over time. Your simulation must account for this temporal dimension.

This means simulating:

  • Multi-turn Dialogues: How does the agent handle a conversation that spans 50 messages? Does it maintain context, or does it forget critical information from message 3 when you reach message 45?
  • Context Window Management: What happens when the conversation exceeds the model’s context window? Does the agent gracefully summarize and continue, or does it lose critical state?
  • Interrupted Workflows: Simulate a user starting a task, leaving for a day, then returning. Can the agent resume coherently? What about system restarts between conversations?
  • Temporal References: Test how the agent handles time-based references—‘like we did yesterday,’ ‘the file I mentioned earlier,’ ‘undo what you just did.’

By simulating conversations over time, you discover whether your agent is truly stateful or just pretending. You find out if it can handle the messy reality of human interaction patterns—the interruptions, clarifications, and context switches that define real work.
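Here is a minimal sketch of one of these scenarios, the interrupted workflow. Session is a hypothetical stand-in whose state is plain data, so it can be saved before a simulated restart and restored afterwards; the test checks that a temporal reference ("that file") still resolves.

```python
# Minimal sketch of temporal-context testing across a simulated restart.
# `Session` and its tiny reference resolver are illustrative placeholders.
import json


class Session:
    def __init__(self, state=None):
        self.state = state or {"last_file": None}

    def handle(self, message: str) -> str:
        msg = message.lower()
        if msg.startswith("open "):
            self.state["last_file"] = message.split(" ", 1)[1]
            return f"opened {self.state['last_file']}"
        if "that file" in msg:
            if self.state["last_file"] is None:
                return "which file do you mean?"
            return f"deleting {self.state['last_file']}"
        return "ok"

    def save(self) -> str:
        return json.dumps(self.state)

    @classmethod
    def restore(cls, blob: str) -> "Session":
        return cls(json.loads(blob))


def test_reference_survives_a_restart():
    session = Session()
    session.handle("open /workspace/report.md")

    # Simulate the user leaving for the day and the system restarting.
    blob = session.save()
    resumed = Session.restore(blob)

    # "That file" must still resolve after the interruption.
    assert resumed.handle("delete that file") == "deleting /workspace/report.md"
```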

The Bridge Example: Why Simulation Matters

Let me make this concrete with an example from my civil engineering days. When designing a bridge, we don’t just calculate the maximum load—we simulate the entire lifecycle. We model:

  • Dynamic Loading: Not just the weight of cars, but the resonance from synchronized footsteps (yes, pedestrians can set a bridge swaying—look up the Millennium Bridge wobble)
  • Material Degradation: How concrete carbonation and rebar corrosion affect strength over 50 years
  • Extreme Events: The 500-year flood, the maximum credible earthquake, the derecho windstorm

For one project, our simulation revealed that a seemingly robust design would experience catastrophic resonance at a specific wind speed—something we’d never have caught with static calculations. We redesigned the deck cross-section, adding wind baffles that would have seemed unnecessary without simulation.

This is exactly what we should be doing with AI agents. Just as we simulate a bridge’s response to a 7.0 earthquake before breaking ground, we should simulate our agents’ responses to API failures, data corruption, and edge cases before deployment. The difference is that in software, we can run these simulations thousands of times per second, exploring failure modes that would take centuries to encounter naturally.

From Black Box to Glass Box

The dominant paradigm for agent development today is the ‘black box.’ You give the agent a prompt, it does something, and you inspect the final output—the commit, the file, the database entry. If it’s wrong, you tweak the prompt and run it again.

This isn’t pair programming; it’s an autopsy. You can’t see the agent’s reasoning, correct its course, or understand why it made a mistake.

A simulation-first approach transforms the agent into a ‘glass box.’ The simulation environment is not just a test harness; it’s a live, interactive cockpit for development. You can:

  • Visualize Internal State: See the agent’s plan, its tool calls, and its interaction with the simulated environment in real-time.
  • Introspect and Intervene: Pause the execution, inspect the agent’s memory, and even modify its state before resuming.
  • Ask ‘What If?’: Interactively change the environment while the agent is running to see how it adapts.

This fulfills Kay’s vision of a system where the UI is a malleable ‘costume’ for the underlying model. The true system is the simulation, and our development tools are just windows into it.
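A sketch of what a glass-box driver can look like: instead of calling the agent as one opaque function, the simulation steps it one action at a time and hands control to an inspector between steps. The names here (run_stepwise, the inspector callback) are illustrative, not a particular framework.

```python
# Minimal sketch of a stepwise "glass box" driver: every step is observable,
# and the inspector may examine or rewrite state before the next action runs.
def run_stepwise(plan, state, on_step=None):
    """Execute one action per iteration, pausing between them."""
    for index, action in enumerate(plan):
        if on_step:
            on_step(index, action, state)  # introspect, and possibly intervene
        action(state)
    return state


def test_intervene_mid_run():
    trace = []
    plan = [
        lambda s: s.update(plan="archive old logs"),
        lambda s: s.update(deleted=s["dry_run"] is False),
    ]

    def inspector(index, action, state):
        trace.append(dict(state))       # visualize internal state at each step
        if index == 1:
            state["dry_run"] = True     # ask "what if?" by changing the world mid-run

    final = run_stepwise(plan, {"dry_run": False}, on_step=inspector)

    assert final["deleted"] is False    # the intervention changed the outcome
    assert len(trace) == 2              # every step was observable
```

Hook the inspector callback up to a UI instead of a test and you have the interactive cockpit described above.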

The Path Forward: Simulating the Future

Building agentic systems without simulation is like writing code without a debugger. It’s an exercise in frustration that produces complicated, brittle artifacts. We’ve built an industry on fixing bugs when we should have been focused on preventing them.

This is the challenge ahead:

  • Stop building agents. Start building simulations of agents. Your primary development artifact should not be the agent itself, but the simulated world that proves its correctness.
  • Demand better tools. We need frameworks and platforms that treat simulation as a first-class citizen, not an afterthought. We need live, interactive environments for agent development.
  • Embrace the engineering mindset. Shift your focus from the magic of the LLM to the discipline of systems design. Define the ‘meaning’ of your agent—its simplest, verifiable implementation—and test rigorously against it.

The computer revolution, as Kay famously said, hasn’t happened yet. We are still tinkering with powerful materials we don’t fully understand. The path to real progress—to building reliable, scalable, and truly intelligent systems—is not through bigger models or cleverer prompts. It’s through the rigorous, predictable, and insightful world of simulation.

The best way to predict the future of your software is to simulate it first.


The principles in this post are heavily inspired by Alan Kay’s directives on simulation-centric development, which you can find compiled here.

  Let an Agentic AI Expert Review Your Code

I hope you found this article helpful. If you want to take your agentic AI to the next level, consider booking a consultation or subscribing to premium content.

Content Attribution: 50% by Alpha, 25% by Claude Opus 4.1, 25% by Gemini 2.5 Pro
  • 50% by Alpha: Core concept, initial title, civil engineering perspective and examples, and final review.
  • 25% by Claude Opus 4.1: Final draft based on Alpha's feedback.
  • 25% by Gemini 2.5 Pro: Initial draft based on Alan Kay directives, revised with civil engineering additions.