
Simulation Testing: Is Your Software Built To Fix Bugs or Predict Them?

Published: Sep 4, 2025
Vancouver, Canada

We treat AI agents like magic spells. We write a prompt, whisper an incantation of context, and hope the black box spits out something useful. When it works, we call it intelligence. When it fails, we call it a hallucination and try again. This isn’t engineering. It’s tinkering.

We are so mesmerized by the dream of the autonomous agent that we’ve forgotten every hard-won lesson from decades of building complex systems. We are shipping first and asking questions later, building the digital equivalent of bridges without blueprints and hoping they don’t collapse.

As an ex-civil engineer, I learned this lesson viscerally. We would never, ever touch construction without engineering simulation. No responsible engineer would pour concrete for a bridge without first modeling load distributions, stress points, and failure modes. We’d simulate wind loads, seismic activity, thermal expansion—every force that nature could throw at our structure. The liability alone would be catastrophic, but more importantly, it would be unethical. Lives depend on that rigor. Yet here we are, deploying AI agents to production with less care than we’d apply to a residential deck.

Alan Kay saw this coming fifty years ago. He watched software developers build systems with a ‘fix-it-later’ mentality and recognized it not as a new paradigm, but as a failure to learn from every other mature engineering discipline. His mandate was simple and severe: you are not allowed to build what you cannot first simulate.

The future of robust, reliable agentic AI isn’t a better LLM. It’s a better process. It’s time we stopped fixing bugs and started predicting them out of existence.

From Fabrication to Simulation

Mature engineering—the kind that builds jet engines and skyscrapers—follows a clear path: Design-Simulate-Build (CAD-SIM-FAB). You design the system, you simulate its behavior under every conceivable stress, and only when you know it will work do you begin the expensive process of fabrication.

In civil engineering, this isn’t optional—it’s legally mandated. We use finite element analysis to model every beam, every joint, every foundation. We apply safety factors of 1.5x, 2x, sometimes 5x the expected loads. We simulate a century of weather patterns, model soil liquefaction during earthquakes, calculate fatigue cycles from daily temperature swings. The simulation isn’t just a test; it’s the proof that our design meets code. Without it, you don’t get a building permit. Without it, you don’t get insurance. Without it, you’re not an engineer—you’re a gambler.

Software, and especially agentic AI, has this dangerously backward. Our process is Build-Test-Lament.

  1. Build: We cobble together an agent with a prompt and a few tools.
  2. Test: We run it against live APIs and production data.
  3. Lament: We spend 80% of our time debugging the unpredictable mess that results.

This is the core distinction between an autonomous agent and an engineered agentic workflow. An agent, given a vague goal, is a probabilistic, exploratory system. It’s a million lines of code we hope works. A workflow is deterministic and designed. But even workflows are often built without being rigorously simulated in the environments where they will run.

Kay’s philosophy forces us to confront this. To him, software is not a craft to be tinkered with; it’s a system to be understood. And understanding comes from simulation.

What is Simulation for an AI Agent?

Simulation testing isn’t just a better version of unit testing. It is a fundamentally different way to design and validate agentic systems. It means creating a fully deterministic, observable, and controllable universe for your agent to live in before it ever touches the real world.

This is what it looks like in practice.

1. Simulate the Environment: Complete System Replication

Instead of letting an agent run wild on your actual filesystem or call live APIs, you build a high-fidelity replica of your entire system.

  • Filesystem: A virtual, in-memory filesystem (like pyfakefs) where the agent can read, write, and delete files without consequence. You control the entire state.
  • APIs & Services: Mock servers that replicate the behavior of real-world APIs, from successful responses to network errors and authentication failures.
  • Databases: Test containers that spin up an identical, ephemeral version of your production database for each simulation run.

The goal is to model the entire system the agent will interact with. This eliminates network latency, rate limits, and non-determinism, allowing for thousands of tests to be run in seconds, all with perfectly repeatable outcomes.
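To make this concrete, here is a minimal sketch in Python. The run_agent function is a hypothetical placeholder for your own agent; the filesystem is faked with pyfakefs and the external API is a plain unittest.mock stub, so the test is hermetic and gives the same answer every time.

```python
# Minimal sketch, assuming a hypothetical `run_agent` entry point that
# stands in for your own agent. The filesystem is virtual (pyfakefs) and
# the API is a mock, so nothing real is touched.
import os
from unittest.mock import Mock

from pyfakefs.fake_filesystem_unittest import Patcher


def run_agent(task: str, fs_root: str, api_client) -> None:
    """Placeholder agent: deletes *.log files only if the policy allows it."""
    policy = api_client.get_cleanup_policy()
    if policy["allow_delete"]:
        for name in os.listdir(fs_root):
            if name.endswith(".log"):
                os.remove(os.path.join(fs_root, name))


def test_agent_only_deletes_what_policy_allows():
    with Patcher() as patcher:
        # Seed the virtual filesystem: the agent sees these files as real.
        patcher.fs.create_file("/var/app/app.log", contents="old log data")
        patcher.fs.create_file("/var/app/config.yaml", contents="service: billing")

        # Stub the external API: deterministic response, no network.
        api = Mock()
        api.get_cleanup_policy.return_value = {"allow_delete": True}

        run_agent("clean up old logs", "/var/app", api)

        # Assert on the state of the simulated world, not the real disk.
        assert not os.path.exists("/var/app/app.log")   # removed, as allowed
        assert os.path.exists("/var/app/config.yaml")   # never touched
        api.get_cleanup_policy.assert_called_once()
```

Because the whole environment lives in memory, you can run hundreds of variations of this test per second, each starting from an identical, known state.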

2. Simulate User Interactions: The Prompt Laboratory

The most overlooked aspect of agent testing is simulating the prompts themselves. I’m not just talking about testing with ‘hello world’—I mean simulating the full spectrum of human interaction patterns, from the perfectly clear to the utterly chaotic.

In practice, this means:

  • Prompt Variations: Testing how your agent handles the same request phrased 20 different ways. ‘Delete the file,’ ‘remove that file,’ ‘get rid of the file,’ ‘can you delete the file please?’—each potentially triggering different behaviors.
  • Ambiguous Instructions: Simulating vague user requests like ‘fix the bug’ or ‘make it better’ to ensure your agent asks for clarification rather than guessing.
  • Adversarial Inputs: Testing prompt injection attempts, contradictory instructions, and requests that would violate safety boundaries.
  • Context Degradation: Simulating how prompts interact with varying amounts of context—what happens when the user refers to ‘that thing we discussed’ after 50 messages?

The key insight is that prompts are inputs that need the same rigorous testing as any other system input. You wouldn’t deploy an API without testing malformed JSON; why deploy an agent without testing malformed instructions?
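A sketch of what this looks like as a test suite, using pytest parametrization. The plan_action function here is a toy stand-in for your agent's intent resolution; the point is the shape of the prompt laboratory, not the keyword matching.

```python
# Minimal sketch of a prompt laboratory. `plan_action` is an illustrative
# placeholder for whatever maps a user request to the agent's next tool call.
import pytest


def plan_action(prompt: str) -> str:
    """Placeholder planner: stand-in for your agent's intent resolution."""
    text = prompt.lower()
    if "delete" in text or "remove" in text or "get rid" in text:
        return "delete_file"
    return "ask_for_clarification"


@pytest.mark.parametrize("prompt", [
    "Delete the file",
    "remove that file",
    "get rid of the file",
    "can you delete the file please?",
])
def test_same_request_many_phrasings(prompt):
    # Every phrasing of the same request must resolve to the same action.
    assert plan_action(prompt) == "delete_file"


@pytest.mark.parametrize("prompt", [
    "fix the bug",
    "make it better",
])
def test_ambiguous_requests_ask_for_clarification(prompt):
    # Vague instructions should never trigger a destructive tool call.
    assert plan_action(prompt) == "ask_for_clarification"
```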

3. Simulate Time: The Deterministic Clock

The most powerful tool in simulation is control over time. In a simulated environment, you can replace real-world time with a deterministic clock.

This allows you to:

  • Instantly Advance Time: Test logic for timeouts, retries, and scheduled events without sleep() calls that cripple your test suite.
  • Freeze Time: Pause the entire universe at the exact moment an error occurs to inspect the agent’s state.
  • Rewind Time: Step backward to see the chain of events that led to a failure.

This decoupling of the simulation clock is the key to creating tests that are not just fast, but completely predictable.
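One way to get there is to make the agent depend on a clock interface instead of calling time.time() or time.sleep() directly. Here is a minimal sketch with illustrative names (FakeClock, wait_for_api) rather than any particular library:

```python
# Minimal sketch of a deterministic clock. Time only moves when the test
# says so, so a 30-second timeout elapses instantly.
class FakeClock:
    def __init__(self, start: float = 0.0):
        self.now = start

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        # Advancing a counter replaces real waiting.
        self.now += seconds


def wait_for_api(clock, is_ready, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Poll until `is_ready()` returns True or the (simulated) timeout expires."""
    deadline = clock.time() + timeout
    while clock.time() < deadline:
        if is_ready():
            return True
        clock.sleep(interval)
    return False


def test_timeout_without_real_waiting():
    clock = FakeClock()
    # An API that never becomes ready: the full timeout is exercised in
    # microseconds because the clock is simulated.
    assert wait_for_api(clock, is_ready=lambda: False) is False
    assert clock.time() >= 30.0
```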

4. Simulate Failure: Engineering for Resilience

The real world is chaos. APIs fail. Files get locked. LLMs hallucinate. A simulation is where you make that chaos your ally. Kay’s principle was to design with failure as a primary consideration, and simulation is the tool to do it.

Instead of hoping your agent handles a 503 Service Unavailable error, you engineer the simulated API to throw that error at a specific, repeatable moment. You test how the agent behaves under every fault condition until you are certain it is resilient.

  • What happens if the LLM returns malformed JSON?
  • What happens if a critical file is missing?
  • What happens if the agent loses its authentication token mid-workflow?

In a simulation, these aren’t bugs; they are test cases.
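A sketch of scripted failure injection. FlakyAPI and fetch_with_retry are illustrative stand-ins: the fake service raises a 503-style error on exactly the first two calls, and the test proves the retry logic absorbs them.

```python
# Minimal sketch of deterministic fault injection: the failure happens at a
# specific, repeatable moment rather than whenever the network feels like it.
class ServiceUnavailable(Exception):
    """Stand-in for an HTTP 503 from a real client library."""


class FlakyAPI:
    def __init__(self, failures_before_success: int):
        self.calls = 0
        self.failures = failures_before_success

    def get(self, path: str) -> dict:
        self.calls += 1
        if self.calls <= self.failures:
            raise ServiceUnavailable(path)
        return {"status": "ok", "path": path}


def fetch_with_retry(api, path: str, attempts: int = 3) -> dict:
    """The behavior under test: bounded retries, then fail loudly."""
    last_error = None
    for _ in range(attempts):
        try:
            return api.get(path)
        except ServiceUnavailable as exc:
            last_error = exc
    raise last_error


def test_agent_survives_two_503s():
    api = FlakyAPI(failures_before_success=2)
    assert fetch_with_retry(api, "/jobs")["status"] == "ok"
    assert api.calls == 3  # two failures plus one success, and no more
```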

5. Simulate Agent Effects: Dual Impact Testing

When an agent executes, it creates two distinct categories of effects that both need to be simulated and validated:

Direct Effects: These are the intended outcomes—the files modified, the API calls made, the database records updated. In simulation, you track every state change the agent makes and validate it against expected outcomes. Did it modify the right files? Did it call the APIs in the correct sequence? Did it handle transactions properly?

Indirect Effects: These are the ripple effects that are often ignored until they cause production incidents:

  • Resource Consumption: Memory usage, CPU cycles, API rate limit consumption
  • Cascading Behaviors: How one agent’s actions affect other systems or agents
  • State Accumulation: Temporary files, cache buildup, log explosion
  • Timing Dependencies: Race conditions between parallel operations

The simulation must capture both categories. You’re not just testing ‘did the agent complete the task?’ but also ‘what was the cost of completion?’ and ‘what mess did it leave behind?’ This dual-effect simulation is what separates toy demos from production-ready systems.
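One way to sketch this is an effect ledger that instrumented tools report into, so a single test can assert on both categories at once. EffectLedger is an illustrative structure, not a library; a real setup would have the simulated filesystem and API record into it automatically.

```python
# Minimal sketch of dual-impact accounting: direct effects (what the agent
# did) and indirect effects (what it cost, what it left behind).
from dataclasses import dataclass, field


@dataclass
class EffectLedger:
    files_written: list = field(default_factory=list)  # direct effects
    api_calls: list = field(default_factory=list)      # direct effects
    temp_files: set = field(default_factory=set)       # indirect: state accumulation
    tokens_used: int = 0                                # indirect: resource consumption

    def record_write(self, path: str) -> None:
        self.files_written.append(path)
        if "/tmp/" in path:
            self.temp_files.add(path)

    def record_delete(self, path: str) -> None:
        self.temp_files.discard(path)

    def record_api_call(self, endpoint: str, tokens: int = 0) -> None:
        self.api_calls.append(endpoint)
        self.tokens_used += tokens


def test_agent_completes_without_leaving_a_mess():
    ledger = EffectLedger()

    # In a real run, instrumented tools would call the ledger; here we
    # replay a recorded trace to keep the sketch self-contained.
    ledger.record_api_call("/llm/complete", tokens=1200)
    ledger.record_write("/workspace/report.md")
    ledger.record_write("/tmp/agent-scratch.json")
    ledger.record_delete("/tmp/agent-scratch.json")

    # Direct effects: the intended outcome happened, in the right order.
    assert "/workspace/report.md" in ledger.files_written
    assert ledger.api_calls == ["/llm/complete"]

    # Indirect effects: the cost stays within budget and nothing is left behind.
    assert ledger.tokens_used < 5_000
    assert not ledger.temp_files
```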

6. Simulate Conversational History: Temporal Context Testing

Real-world agent interactions aren’t one-shot commands—they’re ongoing conversations that evolve over time. Your simulation must account for this temporal dimension.

This means simulating:

  • Multi-turn Dialogues: How does the agent handle a conversation that spans 50 messages? Does it maintain context, or does it forget critical information from message 3 when you reach message 45?
  • Context Window Management: What happens when the conversation exceeds the model’s context window? Does the agent gracefully summarize and continue, or does it lose critical state?
  • Interrupted Workflows: Simulate a user starting a task, leaving for a day, then returning. Can the agent resume coherently? What about system restarts between conversations?
  • Temporal References: Test how the agent handles time-based references—‘like we did yesterday,’ ‘the file I mentioned earlier,’ ‘undo what you just did.’

By simulating conversations over time, you discover whether your agent is truly stateful or just pretending. You find out if it can handle the messy reality of human interaction patterns—the interruptions, clarifications, and context switches that define real work.
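Here is a minimal sketch of one of these scenarios, the interrupted workflow. Session is a hypothetical stand-in whose state is plain data, so it can be saved before a simulated restart and restored afterwards; the test checks that a temporal reference ("that file") still resolves.

```python
# Minimal sketch of temporal-context testing across a simulated restart.
# `Session` and its tiny reference resolver are illustrative placeholders.
import json


class Session:
    def __init__(self, state=None):
        self.state = state or {"last_file": None}

    def handle(self, message: str) -> str:
        msg = message.lower()
        if msg.startswith("open "):
            self.state["last_file"] = message.split(" ", 1)[1]
            return f"opened {self.state['last_file']}"
        if "that file" in msg:
            if self.state["last_file"] is None:
                return "which file do you mean?"
            return f"deleting {self.state['last_file']}"
        return "ok"

    def save(self) -> str:
        return json.dumps(self.state)

    @classmethod
    def restore(cls, blob: str) -> "Session":
        return cls(json.loads(blob))


def test_reference_survives_a_restart():
    session = Session()
    session.handle("open /workspace/report.md")

    # Simulate the user leaving for the day and the system restarting.
    blob = session.save()
    resumed = Session.restore(blob)

    # "That file" must still resolve after the interruption.
    assert resumed.handle("delete that file") == "deleting /workspace/report.md"
```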

The Bridge Example: Why Simulation Matters

Let me make this concrete with an example from my civil engineering days. When designing a bridge, we don’t just calculate the maximum load—we simulate the entire lifecycle. We model:

  • Dynamic Loading: Not just the weight of cars, but the resonance from synchronized footsteps (yes, pedestrians can set a bridge swaying—look up the Millennium Bridge wobble)
  • Material Degradation: How concrete carbonation and rebar corrosion affect strength over 50 years
  • Extreme Events: The 500-year flood, the maximum credible earthquake, the derecho windstorm

For one project, our simulation revealed that a seemingly robust design would experience catastrophic resonance at a specific wind speed—something we’d never have caught with static calculations. We redesigned the deck cross-section, adding wind baffles that would have seemed unnecessary without simulation.

This is exactly what we should be doing with AI agents. Just as we simulate a bridge’s response to a 7.0 earthquake before breaking ground, we should simulate our agents’ responses to API failures, data corruption, and edge cases before deployment. The difference is that in software, we can run these simulations thousands of times per second, exploring failure modes that would take centuries to encounter naturally.

From Black Box to Glass Box

The dominant paradigm for agent development today is the ‘black box.’ You give the agent a prompt, it does something, and you inspect the final output—the commit, the file, the database entry. If it’s wrong, you tweak the prompt and run it again.

This isn’t pair programming; it’s an autopsy. You can’t see the agent’s reasoning, correct its course, or understand why it made a mistake.

A simulation-first approach transforms the agent into a ‘glass box.’ The simulation environment is not just a test harness; it’s a live, interactive cockpit for development. You can:

  • Visualize Internal State: See the agent’s plan, its tool calls, and its interaction with the simulated environment in real-time.
  • Introspect and Intervene: Pause the execution, inspect the agent’s memory, and even modify its state before resuming.
  • Ask ‘What If?’: Interactively change the environment while the agent is running to see how it adapts.

This fulfills Kay’s vision of a system where the UI is a malleable ‘costume’ for the underlying model. The true system is the simulation, and our development tools are just windows into it.
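A sketch of what a glass-box driver can look like: instead of calling the agent as one opaque function, the simulation steps it one action at a time and hands control to an inspector between steps. The names here (run_stepwise, the inspector callback) are illustrative, not a particular framework.

```python
# Minimal sketch of a stepwise "glass box" driver: every step is observable,
# and the inspector may examine or rewrite state before the next action runs.
def run_stepwise(plan, state, on_step=None):
    """Execute one action per iteration, pausing between them."""
    for index, action in enumerate(plan):
        if on_step:
            on_step(index, action, state)  # introspect, and possibly intervene
        action(state)
    return state


def test_intervene_mid_run():
    trace = []
    plan = [
        lambda s: s.update(plan="archive old logs"),
        lambda s: s.update(deleted=s["dry_run"] is False),
    ]

    def inspector(index, action, state):
        trace.append(dict(state))       # visualize internal state at each step
        if index == 1:
            state["dry_run"] = True     # ask "what if?" by changing the world mid-run

    final = run_stepwise(plan, {"dry_run": False}, on_step=inspector)

    assert final["deleted"] is False    # the intervention changed the outcome
    assert len(trace) == 2              # every step was observable
```

Hook the inspector callback up to a UI instead of a test and you have the interactive cockpit described above.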

The Path Forward: Simulating the Future

Building agentic systems without simulation is like writing code without a debugger. It’s an exercise in frustration that produces complicated, brittle artifacts. We’ve built an industry on fixing bugs when we should have been focused on preventing them.

This is the challenge ahead:

  • Stop building agents. Start building simulations of agents. Your primary development artifact should not be the agent itself, but the simulated world that proves its correctness.
  • Demand better tools. We need frameworks and platforms that treat simulation as a first-class citizen, not an afterthought. We need live, interactive environments for agent development.
  • Embrace the engineering mindset. Shift your focus from the magic of the LLM to the discipline of systems design. Define the ‘meaning’ of your agent—its simplest, verifiable implementation—and test rigorously against it.

The computer revolution, as Kay famously said, hasn’t happened yet. We are still tinkering with powerful materials we don’t fully understand. The path to real progress—to building reliable, scalable, and truly intelligent systems—is not through bigger models or cleverer prompts. It’s through the rigorous, predictable, and insightful world of simulation.

The best way to predict the future of your software is to simulate it first.


The principles in this post are heavily inspired by Alan Kay’s directives on simulation-centric development, which you can find compiled here.

  Let an Agentic AI Expert Review Your Code

I hope you found this article helpful. If you want to take your agentic AI to the next level, consider booking a consultation or subscribing to premium content.

Content Attribution: 50% by Alpha, 25% by Claude Opus 4.1, 25% by Gemini 2.5 Pro
  • 50% by Alpha: Core concept, initial title, civil engineering perspective and examples, and final review.
  • 25% by Claude Opus 4.1: Final draft based on Alpha's feedback.
  • 25% by Gemini 2.5 Pro: Initial draft based on Alan Kay directives, revised with civil engineering additions.