Testing Pyramid of AI Agents | Block Engineering Blog
January 14, 2026

I’m a huge advocate for software testing and have written and spoken quite a bit about the testing pyramid. Unit tests at the bottom. Integration tests in the middle. UI tests at the top. Fewer tests as you go up, because they’re slower, flakier, and more expensive.
That model worked well because it gave teams a shared mental model for reasoning about confidence, coverage, and tradeoffs. It helped people stop writing brittle UI tests and start investing where it mattered.
But now that I work on an AI agent and have to write tests for it, that pyramid has stopped making sense, because agents change what “working” even means.
Testing AI-based systems, or large-language-model-based systems, is challenging in no small part because their output, unlike that of traditional software, is not deterministic. So how do we go about testing an AI-based system? Here are some thoughts from Anthropic’s Angie Jones.
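To make the non-determinism problem concrete, here is a minimal sketch of one common response: assert *properties* of a model's output rather than its exact text. The `summarize` function below is a hypothetical stand-in for an agent call (real agents would call an LLM); the randomness simulates the fact that two runs rarely produce identical wording.

```python
import random

# Hypothetical stand-in for an agent/LLM call. Two runs rarely
# produce identical text, so exact-match assertions are the wrong tool.
def summarize(text: str) -> str:
    opener = random.choice(["In short,", "To summarize,", "Briefly,"])
    return f"{opener} {text.split('.')[0].strip()}."

def test_summary_properties():
    source = "The testing pyramid ranks tests by cost. It favors unit tests."
    summary = summarize(source)
    # Assert properties of the output, not its exact wording.
    assert len(summary) < len(source)            # it actually condenses
    assert "testing pyramid" in summary.lower()  # it keeps the key subject
    assert summary.endswith(".")                 # it is well-formed

test_summary_properties()
```

A test like this passes across runs even though the wording varies, which is the basic shift non-deterministic systems force on a test suite.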