
Evaluation Precedes Evolution: Rubrics as the Load-Bearing Infrastructure of Self-Improving Agents
Tanya Dixit Forward Deployed Engineer Google
The 2025–2026 wave of "self-evolving" agents (prompt-tuning loops, memory accumulation, agent swarms, GEPA, ReasoningBank) shares a structure that is sometimes lost in the jargon: every one of them is hill-climbing on a judge. The judge is the fitness function. When it's sharp, the agent compounds. When it's vague, the loop drifts confidently in the wrong direction.
This talk argues that rubrics, not prompts or scaffolds, are the load-bearing infrastructure of agent improvement. We'll walk through three concrete failures from recent work: prompt optimizers that regressed without rollback (OpenAI), memory systems that hurt performance as they grew (ReasoningBank), and 18 months of capability gains that delivered almost no reliability gain (Princeton). All three share a root cause: the rubric was the bottleneck, and nobody was looking at it.
Then we'll build one. Five principles for a rubric that can actually drive evolution — stack deterministic before semantic, score failures explicitly, measure beyond accuracy, version the rubric itself, keep it cheap. You'll leave with a checklist you can apply to your next agent before you ship a single optimization loop.
In collaboration with Pouya Ghiasnezhad Omran.
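To make the abstract's principles concrete, here is a minimal sketch of a rubric that runs deterministic checks before any semantic judge, records failures under explicit labels, and stamps every score with a rubric version. It is an illustration under stated assumptions, not the speakers' implementation: the names `RUBRIC_VERSION`, `Score`, `evaluate`, and `semantic_judge`, the JSON-output requirement, and the 0.5 pass threshold are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

RUBRIC_VERSION = "0.1.0"  # hypothetical tag; bump it whenever the rubric changes


@dataclass
class Score:
    rubric_version: str
    passed: bool
    value: float
    failure: Optional[str]  # explicit failure label, None on success


def is_valid_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def evaluate(output: str, semantic_judge: Callable[[str], float]) -> Score:
    # Deterministic checks run first: they are cheap and reproducible.
    if not output.strip():
        return Score(RUBRIC_VERSION, False, 0.0, "empty_output")
    if not is_valid_json(output):
        return Score(RUBRIC_VERSION, False, 0.0, "invalid_json")
    # The (assumed expensive) semantic judge is called only for outputs
    # that survived the deterministic layer.
    value = semantic_judge(output)
    passed = value >= 0.5  # assumed threshold, for illustration only
    return Score(RUBRIC_VERSION, passed, value,
                 None if passed else "low_semantic_score")
```

Because the semantic judge is passed in as a plain callable, it can be stubbed in tests, and the deterministic layer keeps malformed outputs from ever reaching it.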
Shipping Sandboxed Workers for Notion Agents
Adam Hudson Software Engineer Notion
In this talk, we will share how we built a platform at Notion that allows developers to extend AI agents with custom code. The system enables developers to write small programs that give their agents access to tools such as internal APIs and external services.
We will focus on the engineering decisions that made the first version practical to ship: what we chose to build, what we deliberately left out, and the operational and safety constraints that shaped the design.
We will walk through the developer experience from local development through deployment and execution, and discuss how we approached packaging, distribution, and running user-supplied code in an isolated environment.
From there, we will explore the boundaries required to safely support untrusted tool code in production, including capability constraints, governance around who can manage and attach extensions, and the guardrails needed to keep agent-driven execution safe and observable.
Finally, we will share lessons from bringing the system out of its early stages, including the operational challenges we encountered and the changes we made along the way.
Close Your Agentic Loop
Moss Ebeling Head of AI Engineering Optiver Asia Pacific
Every time you've told an agent that it broke the layout of your website, output the wrong schema, or failed an invariant, you are the feedback loop. The teams achieving the best outcomes right now are focused on building better systems: automated feedback that allows agents to check their own work. Join to learn what closed-loop design looks like and how you can build real leverage.
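As a rough illustration of the closed-loop idea described above, the sketch below replaces the human with an automated check whose error message is fed back into the next attempt. Everything here is an assumption made for the example: `closed_loop`, `check_schema`, the required `title` key, and the `generate(feedback)` calling convention are hypothetical and stand in for whatever agent and validator a team actually uses.

```python
import json
from typing import Callable, Optional, Tuple


def check_schema(candidate: str) -> Optional[str]:
    """Return None if the candidate passes, else an error message for the agent."""
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict) or "title" not in data:
        return "missing required key: title"
    return None


def closed_loop(
    generate: Callable[[Optional[str]], str],
    check: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Call generate() until check() passes, feeding each failure back in."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)  # the agent sees the last failure, if any
        feedback = check(candidate)     # automated check stands in for the human
        if feedback is None:
            return candidate, attempt
    raise RuntimeError(
        f"no passing output after {max_attempts} attempts: {feedback}"
    )
```

The attempt cap matters: without it, a check the agent cannot satisfy turns the loop into an unbounded spend.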
How Many Agents Are Too Many? The Hidden Cost of Multi-Agent Systems
Anannya Roy Chowdhury GenAI Developer Advocate AWS
Multi-agent systems promise scalability and smarter reasoning—but in production, more agents often mean more cost, latency, and failure. This talk shares real-world engineering lessons, metrics, and architectural trade-offs to help you decide when multi-agent designs add value—and when a simpler approach performs better.
Kill the God Agent
Adesh Gairola Co-founder & CTO raxIT Labs
Your multi-agent system probably has one orchestrator with access to every tool, every database, every API. If that agent gets injected, the entire toolchain is compromised. Guardrails won't save you. In this session, learn three architectural patterns that move agent security from hope to proof: how to isolate agent capabilities so no single agent holds all the keys, how to scope authorization per task using cryptographic tokens that survive prompt injection, and how to enforce policies outside the LLM using a formally verified engine that intercepts actions in microseconds. Walk away with patterns you can apply to your agent architecture this week.
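One way to picture per-task scoped authorization enforced outside the LLM is the sketch below. It is not the speaker's design: it uses a plain HMAC signature from the Python standard library in place of whatever token scheme and formally verified policy engine the session covers, and `SECRET`, `issue_token`, and `authorize` are hypothetical names.

```python
import hashlib
import hmac
import json
from typing import List, Tuple

SECRET = b"demo-only-secret"  # assumption: held by the policy layer, never shown to the model


def issue_token(task_id: str, allowed_tools: List[str]) -> Tuple[str, str]:
    """Sign the set of tools that one task is allowed to call."""
    payload = json.dumps(
        {"task": task_id, "tools": sorted(allowed_tools)}, sort_keys=True
    )
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload, sig


def authorize(payload: str, sig: str, tool: str) -> bool:
    """Runs outside the LLM: verify the signature, then check the tool scope."""
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, sig):
        return False  # payload was altered or signed with a different key
    return tool in json.loads(payload)["tools"]
```

An injected prompt can change what the agent asks for, but it cannot produce a valid signature for a wider scope without the key, so the check holds regardless of what the model outputs.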