AI Engineer Melbourne 2026 — AI Engineering (Day 1 Midday)

Evaluation Precedes Evolution: Rubrics as the Load-Bearing Infrastructure of Self-Improving Agents

Tanya Dixit, Forward Deployed Engineer, Google

The 2025–2026 wave of "self-evolving" agents (prompt-tuning loops, memory accumulation, agent swarms, GEPA, ReasoningBank) shares a structure that is sometimes lost in the jargon: every one of them is hill-climbing on a judge. The judge is the fitness function. When it's sharp, the agent compounds. When it's vague, the loop drifts confidently in the wrong direction.

This talk argues that rubrics, not prompts or scaffolds, are the load-bearing infrastructure of agent improvement. We'll walk through three concrete failures from recent work: prompt optimizers that regressed without rollback (OpenAI), memory systems that hurt performance as they grew (ReasoningBank), and 18 months of capability gains that delivered almost no reliability gain (Princeton). All three share a root cause: the rubric was the bottleneck, and nobody was looking at it.

Then we'll build one, using five principles for a rubric that can actually drive evolution: stack deterministic checks before semantic ones, score failures explicitly, measure beyond accuracy, version the rubric itself, and keep it cheap. You'll leave with a checklist you can apply to your next agent before you ship a single optimization loop.
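As a rough illustration of three of these principles (deterministic checks first, explicit failure labels, a versioned rubric), here is a minimal Python sketch. It is not the speakers' implementation: `RUBRIC_VERSION`, `Verdict`, `DETERMINISTIC_CHECKS`, and `evaluate` are hypothetical names, and the semantic judge is passed in as a plain callable standing in for whatever model-based grader a team actually uses.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

RUBRIC_VERSION = "v1"  # hypothetical tag; bump it whenever a check changes


@dataclass
class Verdict:
    rubric_version: str
    score: float
    failure: Optional[str]  # explicit failure label, or None on success


def is_valid_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except ValueError:
        return False


# Cheap deterministic checks, run in order before any semantic judging.
# Each entry is (failure label, predicate that must hold).
DETERMINISTIC_CHECKS = [
    ("empty_output", lambda out: bool(out.strip())),
    ("invalid_json", is_valid_json),
]


def evaluate(output: str, semantic_judge: Callable[[str], float]) -> Verdict:
    for failure_name, check in DETERMINISTIC_CHECKS:
        if not check(output):
            # Record the failure by name instead of a bare low score.
            return Verdict(RUBRIC_VERSION, 0.0, failure_name)
    # Only outputs that pass every cheap check reach the (assumed) judge.
    return Verdict(RUBRIC_VERSION, semantic_judge(output), None)
```

Because every verdict carries the rubric version, scores produced under different versions of the rubric can be kept apart when comparing runs.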

In collaboration with Pouya Ghiasnezhad Omran.

12:50 pm

Beyond Forgetful Bots: Architectural Patterns for Persistent, Proactive Claw-Style AI Agents

Navan Tirupathi, CTO, Architecture and AI Expert, Arivminds

Most AI agents are reactive chatbots: great for one-off queries, but they reset, forget, and lack initiative, so they fail in real-world uses such as personal assistants or autonomous workflows.

This talk dives into the battle-tested architecture of Claw-family agents (OpenClaw and lightweight forks like NanoClaw, PicoClaw, TinyClaw, IronClaw, ZeroClaw), which power persistent, proactive systems that run 24/7 on your devices. Drawing from real implementations, we'll unpack core patterns:

Hub-and-Spoke Separation: A stateless gateway routes inputs (messages, heartbeats, cronjobs, hooks, webhooks) while adapters normalize diverse channels (WhatsApp, Discord) and enforce typed protocols/security handshakes.

Ephemeral vs. Persistent State: Transient context (system prompts, recent interactions) stays token-efficient; durable memory (append-only logs + curated facts) uses hybrid retrieval (semantic + keyword) with flush safeguards to survive compaction/restarts.

Runtime Loop & Extensibility: RPC streaming for task queuing/execution; plugin discovery (tools, providers, memories) and Markdown-based skills (SOPs) enable hot-loading without recompiles, plus multi-agent delegation for collaboration.

Security Boundaries: Defense-in-depth with network isolation, sandboxed sessions, identity pairing, and safeguards against injection/poisoning.

Proactivity and Deployment: Inputs trigger autonomous actions; architectures span local native, VPS/Docker, or cloud for low-resource edge devices.
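The hub-and-spoke pattern in the list above can be sketched in a few lines of Python. This is an illustrative sketch only, not code from any Claw-family project: the `Event` type, the adapter functions, and the raw payload field names (`from`, `body`, `author`, `content`) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Event:
    """Typed, channel-independent message the agent core consumes."""
    channel: str
    sender: str
    text: str


# Adapters (the spokes) normalize channel-specific payloads.
# The payload field names below are hypothetical.
def from_whatsapp(raw: dict) -> Event:
    return Event("whatsapp", raw["from"], raw["body"])


def from_discord(raw: dict) -> Event:
    return Event("discord", raw["author"], raw["content"])


ADAPTERS: Dict[str, Callable[[dict], Event]] = {
    "whatsapp": from_whatsapp,
    "discord": from_discord,
}


def gateway(channel: str, raw: dict, handler: Callable[[Event], str]) -> str:
    """Stateless hub: pick an adapter, normalize, hand off to the handler."""
    adapter = ADAPTERS.get(channel)
    if adapter is None:
        raise ValueError(f"no adapter for channel: {channel}")
    return handler(adapter(raw))
```

The gateway keeps no state between calls, so supporting a new channel means registering one more adapter rather than touching the routing logic.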

1:10 pm

Shipping Sandboxed Workers for Notion Agents

Adam Hudson, Software Engineer, Notion

In this talk, we will share how we built a platform at Notion that allows developers to extend AI agents with custom code. The system enables developers to write small programs that give their agents access to tools such as internal APIs and external services.

We will focus on the engineering decisions that made the first version practical to ship: what we chose to build, what we deliberately left out, and the operational and safety constraints that shaped the design.

We will walk through the developer experience from local development through deployment and execution, and discuss how we approached packaging, distribution, and running user-supplied code in an isolated environment.

From there, we will explore the boundaries required to safely support untrusted tool code in production, including capability constraints, governance around who can manage and attach extensions, and the guardrails needed to keep agent-driven execution safe and observable.

Finally, we will share lessons from bringing the system out of its early stages, including the operational challenges we encountered and the changes we made along the way.

1:30 pm

Close your agentic loop

Moss Ebeling, Head of AI Engineering, Optiver Asia Pacific

Every time you've told an agent it broke the layout of your website, output the wrong schema, or failed an invariant, you were the feedback loop. The teams achieving the best outcomes right now are focused on building better systems: automated feedback that allows agents to check their own work. Join to learn what closed-loop design looks like and how you can build real leverage.
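A minimal Python sketch of what such a closed loop might look like, assuming the automated feedback is a schema check. The names `REQUIRED_KEYS`, `check_schema`, and `closed_loop` are hypothetical, and `generate` stands in for a call to whatever agent is being driven.

```python
import json
from typing import Callable, Optional, Tuple

REQUIRED_KEYS = {"id", "price"}  # hypothetical schema for the example


def check_schema(output: str) -> Optional[str]:
    """Return None if the output passes, else a feedback message."""
    try:
        data = json.loads(output)
    except ValueError as exc:
        return f"not valid JSON: {exc}"
    if not isinstance(data, dict):
        return "expected a JSON object"
    missing = REQUIRED_KEYS - set(data)
    if missing:
        return f"missing keys: {sorted(missing)}"
    return None


def closed_loop(
    generate: Callable[[Optional[str]], str], max_attempts: int = 3
) -> Tuple[str, int]:
    """Feed check failures back to the agent until it passes or gives up."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        output = generate(feedback)       # the agent sees the last failure
        feedback = check_schema(output)   # the automated check replaces the human
        if feedback is None:
            return output, attempt
    raise RuntimeError(f"still failing after {max_attempts} attempts: {feedback}")
```

The attempt cap matters: without it, an agent that cannot satisfy the check would retry indefinitely.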

1:50 pm

How Many Agents Are Too Many? The Hidden Cost of Multi-Agent Systems

Anannya Roy Chowdhury, GenAI Developer Advocate, AWS

Multi-agent systems promise scalability and smarter reasoning—but in production, more agents often mean more cost, latency, and failure. This talk shares real-world engineering lessons, metrics, and architectural trade-offs to help you decide when multi-agent designs add value—and when a simpler approach performs better.

2:10 pm

Kill the God Agent

Adesh Gairola, Co-founder & CTO, raxIT Labs

Your multi-agent system probably has one orchestrator with access to every tool, every database, every API. If that agent gets injected, the entire toolchain is compromised. Guardrails won't save you. In this session, learn three architectural patterns that move agent security from hope to proof: how to isolate agent capabilities so no single agent holds all the keys, how to scope authorization per task using cryptographic tokens that survive prompt injection, and how to enforce policies outside the LLM using a formally verified engine that intercepts actions in microseconds. Walk away with patterns you can apply to your agent architecture this week.
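One way to picture per-task authorization enforced outside the model is an HMAC-signed capability token checked at the tool boundary. The Python sketch below is an assumption for illustration, not the speaker's design and not a formally verified engine: `SECRET`, `issue_token`, and `authorize` are hypothetical names, and a real deployment would also need key management and token expiry.

```python
import hashlib
import hmac
import json

# Hypothetical demo key. It lives in the policy layer, never in a prompt,
# so injected text cannot read or re-sign it.
SECRET = b"demo-only-secret"


def _sign(claims: dict) -> str:
    payload = json.dumps(claims, sort_keys=True).encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def issue_token(task_id: str, allowed_tools: list) -> dict:
    """Mint a token that names exactly the tools this task may call."""
    claims = {"task": task_id, "tools": sorted(allowed_tools)}
    return {"claims": claims, "sig": _sign(claims)}


def authorize(token: dict, tool: str) -> bool:
    """Check run by the tool layer before executing any agent action."""
    expected = _sign(token["claims"])
    if not hmac.compare_digest(expected, token["sig"]):
        return False  # claims were altered after issuance
    return tool in token["claims"]["tools"]
```

Even if an injected prompt persuades the agent to request a tool outside its task, the request fails at `authorize`, because the agent cannot produce a valid signature for widened claims.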
