

Agent Observability: Monitoring and Understanding Agents at Internet Scale
Daniel Nadasi Principal Engineer Google
Agent usage is exploding (if you haven't noticed). It is transforming the work of developers and other roles alike, who are creating enormous volumes of new autonomous, dynamic decision-making programs that can do extraordinary things but can also hallucinate, misunderstand and, in the worst case, cause real damage.
In this talk I'll discuss Google's broad approach to figuring out what it is that agents are doing: what actions they are taking, how risky they are, what data is necessary to make good decisions, and how we can scale our approach to an enterprise the size of Google through automated monitoring and policy. I'll also discuss how we've been tackling the fundamental and critical problem of how to achieve both speed and safety.

Our AI Hallucinated in Production: How We Fixed It With Evals
Yicheng Guo Senior Machine Learning Engineer REA Group
We shipped one of REA Group’s first generative AI features to production: Property Highlights, which turns long real-estate listings into three skimmable takeaways. The demo was easy; real traffic wasn’t—hallucinations showed up in front of real users.
This talk covers how we built an evaluation stack to launch safely at scale. Basic guardrails (three bullets, length limits) didn’t catch the failures that mattered: made-up features, off-brand tone, and useless copy. We built a review tool for side-by-side prompt/model testing, defined a rubric for factuality, usefulness, and language quality, and scaled it with an LLM-as-judge calibrated to expert reviews to score thousands of listings daily. We then tied evals to real user feedback and business metrics, including a 10% engagement lift.
You’ll get a practical pipeline and a repeatable way to iterate on LLM features using evals, not vibes.
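As a rough illustration of the "basic guardrails" the abstract mentions (three bullets, length limits), here is a minimal sketch. The function name `check_highlights` and the 120-character limit are assumptions made for this example; the abstract gives neither a number nor an implementation.

```python
from typing import List

# Assumed limit for illustration; the abstract does not state a number.
MAX_BULLET_CHARS = 120


def check_highlights(bullets: List[str], max_chars: int = MAX_BULLET_CHARS) -> List[str]:
    """Return a list of guardrail violations; an empty list means the output passes."""
    problems = []
    # Structural rule: exactly three takeaways.
    if len(bullets) != 3:
        problems.append(f"expected 3 bullets, got {len(bullets)}")
    # Length rule: each bullet must be non-empty and within the limit.
    for i, bullet in enumerate(bullets, start=1):
        text = bullet.strip()
        if not text:
            problems.append(f"bullet {i} is empty")
        elif len(text) > max_chars:
            problems.append(f"bullet {i} exceeds {max_chars} characters")
    return problems
```

A check like this only inspects shape. A set of three short bullets that includes a made-up feature still returns an empty list, which is the kind of failure the abstract says needed a rubric and an LLM-as-judge to catch.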

The Application Layer Is the New Research Lab
Abdul Karim Applied AI Scientist
In the pre-genAI era, vertical product teams handed insights to a separate R&D group, who shipped a new model two quarters later. That handoff is now a bug. Agentic systems are built from dozens of model calls, judges, tools, and harness decisions, and every one of those is a hyperparameter. The product surface and the training surface are the same surface. This talk argues that every vertical AI company is now its own applied research lab. I walk through what that function actually ships (custom judges, scenario benchmarks, data flywheels, harness tuning), where the thesis breaks (most domains are not Cursor), and how to staff for it without losing engineering velocity.

Orbital Lasers vs For Loops: Economically Matching Models to Tasks
Stephen Sennett AWS Community Hero & Lead Consultant at V2 AI V2 AI
Most developers pick their AI model the same way: use the biggest, smartest one available for everything. Bash script? Opus. Dockerfile? Whatever's at the top of the dropdown. Then they hit their usage limits halfway through the day and lose the productivity gains they were chasing. After too many cases of my workflow pausing because I had hit my Claude subscription's limit, I started asking a different question: what model does this task actually need? The answer, for a surprising number of daily tasks, was something far smaller, faster, and cheaper.
This talk shares a practical framework for model selection built from real development work across cloud infrastructure, scripting, code generation, and documentation. I'll walk through concrete comparisons across model tiers — from frontier models through mid-range options down to lightweight and even local models — covering output quality, speed, cost, and the dimension most benchmarks ignore: actual impact on developer velocity. You'll walk away with a mental model for matching tasks to appropriate tiers, an honest look at where cheap models genuinely fall short, and a case for why thoughtful model selection is an engineering discipline, not just a cost optimisation exercise.

Your AI Can’t Engineer (Yet)
Theodoros Galanos Generative AI Leader Aurecon
Large language models excel at code—but engineering isn't just code. When you ask an AI to calculate short-circuit currents per IEC 60909 or size a pavement per Austroads 2022, you're asking it to operate outside its training distribution. The result: confident answers that miss unit conversions, ignore standard-specific constraints, and fail the "gotchas" that trip up junior engineers.
At Aurecon, a multinational engineering consultancy, we found that two-thirds of project rework stems from controllable errors—dimensional mistakes, specification mismatches, standards compliance failures. These are exactly the errors AI should catch. But how do you know if your AI assistant is actually reliable on engineering tasks?
This talk introduces aecbench, an open benchmark suite born from Aurecon's quality engineering practice. With tasks across 12 engineering disciplines—electrical, civil, structural, geotechnical, and more—it maps the capability space AI needs to inhabit: deterministic calculations with standards compliance, mixed problems requiring judgment, and verification workflows that catch errors before they become rework.
But benchmarks aren't just for measurement. Each task is an environment of experience—a structured space where agents learn what "correct" means in engineering. Deterministic tasks provide dense reward signals. Complexity tiers enable curriculum learning. "Gotchas" become adversarial scenarios that force understanding over pattern matching.
I'll showcase results comparing frontier models, custom agentic harnesses, and early RL fine-tuning experiments on real engineering tasks—plus how the community can contribute challenges to the open benchmark and run agents on the private leaderboard.

Flue: The Agent Harness Framework
Michael Hart Senior Principal Engineer Cloudflare
[Flue](https://flueframework.com) is a programmable, open source agent harness, able to represent any autonomous agent or workflow, from simple chatbots to entire coding platforms.
In this talk we'll touch on the rapidly growing world of agent harnesses in general, and how Flue fits into this landscape.
As well as a technical deep dive, we'll discuss how companies have been using Flue, including how we're using it at Cloudflare.