AI Agents, meet Test Driven Development
April 22, 2025

Historically with Test Driven Development (TDD), the thing that you’re testing is predictable. You expect the same outputs given a known set of inputs.
With AI agents, it’s not that simple. Outcomes vary, so tests need flexibility. Instead of asserting exact answers, you’re evaluating behaviors, reasoning, and decision-making (e.g., tool selection). That calls for nuanced success criteria like scores, ratings, and user satisfaction, not just pass/fail tests.
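As a rough sketch, a scored check might look something like this. Here `run_agent` and the rubric are hypothetical placeholders, not a real agent or framework; the point is that the test grades behavior against a threshold instead of matching an exact string:

```python
# A minimal sketch of a score-based check, assuming a hypothetical run_agent()
# that reports which tool the agent picked and its final answer.

def run_agent(prompt: str) -> dict:
    # Placeholder for a real agent call.
    return {"tool": "web_search", "answer": "Paris is the capital of France."}

def score_response(result: dict) -> float:
    # Grade behavior, not exact strings: sensible tool choice plus the key fact.
    score = 0.0
    if result["tool"] == "web_search":
        score += 0.5
    if "paris" in result["answer"].lower():
        score += 0.5
    return score

def test_capital_question():
    result = run_agent("What is the capital of France?")
    # Pass when the behavior scores above a threshold, not on an exact match.
    assert score_response(result) >= 0.8
```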
And internal evals aren’t enough. You need to make continuous adjustments based on real-world feedback.
So how can you build a process around this?
When software is deterministic, the same inputs always produce the same outputs, so a test can simply assert an exact result.
But the output of LLMs is nondeterministic. So what does testing look like?
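As a first sketch of the shift, the contrast might look like this; `call_llm` is a hypothetical stand-in for a real model call, and the property checks are illustrative:

```python
# Deterministic code: the same inputs always give the same output,
# so the test can assert an exact value.
def add(a: int, b: int) -> int:
    return a + b

def test_add():
    assert add(2, 2) == 4  # holds on every run

# Nondeterministic LLM output: sample a few times and assert on properties
# (required facts, length, format) instead of an exact string.
# call_llm() is a hypothetical stand-in for a model call.
def call_llm(prompt: str) -> str:
    return "You can return the item within 30 days for a full refund."

def test_refund_answer():
    outputs = [call_llm("What is your refund policy?") for _ in range(3)]
    for text in outputs:
        assert "refund" in text.lower()  # the key fact is present
        assert len(text) < 500           # the answer stays concise
```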