AI in Software Delivery beyond Copilot: reimagining software delivery

From Typing to Value: A First Aha and Today’s Shifting Ground
Sarah contrasts early-career pride in “real programming” (C++ headers, pointers, and memory management) with a later ThoughtWorks epiphany: real programming is delivering value early to customers. She draws a parallel to the present, describing a second aha as AI reshapes software delivery and forces teams to rethink what matters. Setting expectations, she stresses how fast the landscape shifts—slides were updated the night before—and likens adoption to working in quicksand. This frames the talk’s theme: focus less on keystrokes and more on outcomes amid rapid AI-driven change.
Framing the Conversation: Three AI Loops and Why the Inner Loop Matters
Sarah introduces a simple structure to avoid “everything everywhere all at once” discussions: three loops of AI. She defines the inner loop (builder productivity in software development), the middle loop (business process optimization across functions like call centers and finance), and the outer loop (AI inside products: personalization, chatbots, agentic offerings). She narrows the talk to the inner loop to keep a clear focus on how AI changes day-to-day developer work. This framing anchors the rest of the presentation in practical, delivery-oriented impacts.
Why Copilot Isn’t Sticking: Adoption Frictions and the Human Change Curve
Sarah addresses a common leadership question—“We gave teams Copilot, why aren’t they using it?”—by unpacking barriers. Early-stage tool maturity, training-data gaps for niche stacks, developer experience, problem complexity, information overload, inconsistent outputs, and tolerance for errors all shape uptake. She pushes back on replacing collaboration by highlighting pair programming’s enduring value, especially for juniors. The segment pivots from tools to change management: leaders can accelerate individuals’ change curves by guiding, not just provisioning software.
Hype vs. Data: Productivity Claims and the Toolchain’s Evolution
Sarah examines bold productivity claims, noting current evidence is largely perceptual (e.g., positive signals in the DORA survey) and tied to early capabilities like “autocomplete on steroids.” Using a train-versus-faster-horse analogy, she argues we’re only a few iterations into realizing true gains. She traces the trajectory from autocomplete and chat toward agents and richer context providers, citing emerging IDEs (Cursor, Windsurf, Cline) and agents that can raise PRs. This context situates inner-loop gains as real but uneven, with more evolution to come.
Demo Deconstructed: Prompt-to-Code on the Mars Rover Kata
Sarah describes a prompt-to-code demo using Cline to solve ThoughtWorks’ classic “Mars Rover” interview exercise. With a brief problem statement and constraints (Java/JavaScript, Maven, tests), the agent planned the solution, scaffolded the project, and wrote code and tests with human approvals at each step. As a seasoned reviewer, she says the resulting solution would merit an immediate interview—her second big aha on AI’s coding potential. She tempers the excitement by noting repeatability gaps: a similar porting task saved 97% of the effort one day and failed completely the next, underscoring volatility.
Autonomous Agents and the Quality Question: Repeatability, Duplication, and Refactoring
Reporting on repeated autonomous agent trials, Sarah notes agents consistently produced working solutions yet missed important quality issues—like duplicated code—in most runs. She highlights broader signals from industry data: code volume surged after code assistants, while moved (refactored) code dropped to near zero. The takeaway is clear: AI accelerates adding code but not stewarding it, risking rising complexity and total cost of ownership. This segment raises the central inner-loop challenge—go faster without eroding maintainability.
Leading Through Quicksand: Guardrails, Rituals, and Measurement
Sarah outlines concrete leadership practices to harness AI responsibly: maintain human code reviews (don’t offload to peers), monitor code quality, shift-left on testing, and run “AI gone wrong” rituals to share failures and lessons. Psychological safety is essential so teams can experiment and learn in uncertain terrain. She recommends measuring both sentiment and flow with developer experience tools (blending qualitative surveys with repo and ticket data), including modules for AI productivity. The message aligns with the talk’s theme: disciplined engineering amplifies AI’s benefits.
Beyond Coding: Eliminating Waste and Extending AI Across the SDLC
Faster coding shifts the bottleneck to backlog, testing, and tech debt, so Sarah urges attacking system-wide waste—often reducing cycle time dramatically before AI even enters. She then broadens the aperture: apply AI across the entire software delivery lifecycle—planning, requirements, design, testing, deployment, and operations. Borrowing from education, she offers three questions to guide teams: what thinking is fundamental and must not be outsourced, what is mechanical and can be, and which AI tools fit best. This reframes adoption as end-to-end productivity, not just faster typing.
Team-Level Augmentation: From ChatGPT One-Offs to Haiven’s Shared Context
Sarah maps GenAI “superpowers” (translation, knowledge retrieval, brainstorming, summarization and clustering) to SDLC activities, noting the tool market skews heavily toward coding assistants. Because many teams default to individual ChatGPT usage, she introduces Haiven, an open-source prompt collection designed to bring shared context to team workflows. Haiven supports ideation, pessimistic scenario analysis, requirements breakdown, and threat modeling so groups can explore and decide together. This segment shows how to scale augmentation from individuals to teams, consistent with the inner-loop theme.
The Road Ahead: AI-Native Development, Legacy Modernization, and Phoenix Code
Looking forward, Sarah explores how AI can tackle hard engineering problems: modernization, understanding COBOL at scale via CodeConcise (ASTs, a code graph, RAG), and even reverse-engineering black boxes from binaries. She envisions AI-native software development—AI at the center, seamless human–agent collaboration, responsible practices, and leadership literacy—already visible in startups using agentic tools like Cursor and Lovable. The near term keeps humans in the loop; longer term, end-to-end prompting could enable autonomous delivery and “Phoenix” code that self-heals and adapts to library upgrades. She closes with a call for empathetic leadership to build windmills, not walls, as teams cross the adoption hump.
Thank you, Andrea, and thank you room.
It's a pretty exciting time to be in tech right now, I've got to say. But do you remember your first 'aha' moment?
I'll tell you mine. I was working as a grad developer in an organization. We were doing C++ coding, and I was doing real programming. I was writing header files. I was dealing with pointers and references and memory management.
Now, we had other teams in our organization that did C-Sharp.
So my boss one day sent me across to a C-Sharp training course.
Oh, what did I find in that C-Sharp training course? I'll tell you. They weren't doing real programming.
They were tab, tab, tabity tab, and all of a sudden, their signature blocks appeared. They didn't need to think about references and pointers. They didn't even know about garbage collection.
They were not real programmers.
So I went back to my boss, and we were all pretty excited. Now, I'll tell you what: it wasn't until I joined ThoughtWorks that, in the first eight weeks of working on a project, I delivered far more working software to our customer than I ever had in my three years as a C++ developer.
And that was the first time I had an epiphany.
Maybe real programming wasn't about the typety, typety, typety. Maybe real programming was about getting value to the customers and doing that quickly and early so that they could actually make use of the software that we were producing.
And I'm having that same epiphany right now when we're thinking about the role that AI plays in software delivery.
What I'm seeing through the tool space and through lots of people experimenting, is some really interesting innovation going on in our industry right now. And I think it's the second time in my career that I've had such a huge 'aha' moment. So today I want to talk to you a little bit about what those movements are and what the future might look like.
Now, before I begin though, I'm going to tell you that everything I say today is wrong. Well, not wrong, but it'll be outdated before I sit down. And I can guarantee that, because in the time it takes me to write this deck I have to continuously change it and update it. In fact, as late as last night I was adding new information to it. So please take everything I say with a grain of salt, and don't quote me three weeks down the track when I'm wrong.
Because really, we're working in quicksand.
This is the thing that I'm feeling the most. We're working in quicksand.
We're trying to get a nice stable base for our teams to join this revolution, but we're fundamentally working on shifting sands. Every three months a new breakthrough comes out: new tool sets, new ways to use this. So not only are we trying to change hearts and minds within our organizations; what we're trying to teach them, and how they need to go about it, fundamentally changes too. So who feels like this at the moment? Every time a conversation about AI comes up, it's everything, everywhere, all at once. I get that.
And so the more I think about it, and the more conversations I have, I feel like we need just a little bit of structure around what we're talking about. Because if we don't, we go off in lots of different directions. So here's the framework that I tend to use when I talk about this with organizations: there are three loops of AI. The first is the Inner Loop, builder productivity - that's how we build and develop software. The second is the Middle Loop, business process optimization.
That's the role AI plays in things like call centers, improving claims processing, working in finance teams, or even working between business units and different parts of our business. Then there's the third, the Outer Loop, which is about the role AI plays in our products.
So it might be through full personalization.
It might be through chat bots. Or it might be a new agentic ecosystem offering that you have.
Today, I want to focus in on the inner loop. So those other two are really interesting spaces to be in as well, and I can definitely talk a lot about them. But we're going to focus and zero in on this Inner Loop, this builder productivity.
When we think about this change curve that we're on, and when I talk to organizations, most organizations are at about this level of maturity or adoption when it comes to AI and software delivery.
There's either an awareness of, or a focus on, AI-assisted coding, and the emphasis is very much on it being assisted and on driving adoption throughout the organization.
That's a really good place to start.
But when I speak to a lot of tech leaders who are at this part of the adoption cycle, the most common question I get is: "I've given my teams GitHub Copilot, why aren't they using it? Why aren't we getting better adoption?" So this is what I tell them. There are many factors affecting the adoption of GitHub Copilot in teams right now. The first one is the LLMs and tools themselves: we're really in the early stages, with lots of promise and maybe less maturity. It's still quite early in the development of these tools.
Other factors include the prevalence of your tech stack in the training data. If you're working in a very proprietary language, I can almost guarantee it won't be present in the training data; if you're working in Java, JavaScript, or Python, it's much better represented. The experience of your developers also matters. I've heard many stories about junior developers going a lot longer with a code assistant than they normally would before they have to interrupt a senior developer. I'm going to pause right there, because I fundamentally disagree with that, as someone who has promoted pair programming for such a long time. If nothing else, that is the very reason why we have advocated pair programming for at least 18 years.
The complexity of the problem. It's just too big. It's not a boilerplate solution.
We haven't solved this before, and therefore the code assistants just aren't helping us.
There's so much information. It's just really difficult.
It's like drinking from a fire hose right now.
The repeatability of results: you run a prompt one day and you don't get the same answers the next. And often it comes down to our tolerance for errors, and also our tolerance for a slowdown as people learn to use these tools.
So these are just some very broad, generalized factors that are limiting adoption. This is what I tell other tech leaders. But more importantly, it's not about tools.
More importantly, it's about rewiring how we work and how we think. Now, we know about change management; we went through it when we brought Agile to the industry. You will go through a change curve, and it's not just about giving someone a tool and expecting them to use it. Every individual goes through their own version of this change curve, and how they go through it is largely determined by how you as a tech leader help them navigate it. If they're doing it by themselves, they will go through it at their own pace. If you do it together as an organization, you can help accelerate that change curve.
Now, let's face it, with all the hype that's out there, there are some pretty bold claims. But are they actually fact or fiction? This is the number one reason I hear from developers for being slow to adopt these tools: they're hearing the hype, and it's not playing out.
Let's unpack the most significant one that's probably been hurled your way.
This is a stat that loves to do the rounds.
A 55% productivity gain through GitHub Copilot. Who believes that?
Let's look into it. All right, so the DORA Report. I love the DORA Report. It gives us a snapshot of how teams are feeling about different aspects.
Last year, one of the questions they included in the survey was about people's perception of AI. We're still working very much in the realm of perception rather than measured results, but perception is a large part of what we do anyway. We're seeing 75% of respondents report positive productivity gains. So people who are using the tools do actually seem to find them useful.
So that's good. And I want to take you back to where the industry was with the toolchain when that survey went out. Copilot was really all about autocomplete. It was a glorified tab, tab, tab. So it was helping on very small parts, where the unit of assistance was really at the method level.
And it was really just a chat.
But that really just got us slightly faster horses. So what we're really doing right now is rethinking and trying to imagine what the new car, or the new train, is. A side tangent on trains: the train set out to be a better horse - a faster, more productive horse - but in its first version a human could walk faster than it. It wasn't until about the third or fourth iteration of the train that you actually started to see the productivity improvements. And that's roughly where we are right now with the AI toolchain.
We can see the promise; we've still got to go through a couple of iterations before we can actually live up to the productivity gains. We're on this trajectory: we started with autocomplete on steroids and chat, and now we're working towards agents and context providers through new IDEs like Cursor, Windsurf, and Cline. And with the latest announcements, we're now moving to much more autonomous agents like OpenAI Codex, which just go and create PRs for you. So this is where we are at the moment. So we've got 'prompt to code' - this should mean the video starts playing...
I want to just briefly look at what this might be. This is a video of someone solving a problem with Cline, with Claude as the model behind it. Again, this was taken in the January-February timeframe; things have moved on since then.
But this is when my second 'aha' moment came.
Because what this program is doing right now is solving one of ThoughtWorks' well-known interview questions.
So as part of our interview processes, we gave candidates a problem to solve, the Mars Rover problem.
I can say this with freedom now because it's been exposed on the internet so many times and we don't use it anymore. But it was a really neat problem for showing the level of quality a programmer put into their work. I have been on the reviewing end of many, many of these coding assignments, and it's very easy for me to assess them now: I can look at them and very quickly see how well written the code actually is. This was prompted with just a problem statement, plus constraints that we want to use Maven, that we want Java or JavaScript, and that we want tests. And then Cline has taken over: created a plan, created the folder structure it was going to have, created the code snippets, all the while prompting to ask, "Do you accept this change, or do you need to alter what we're doing?" And in the end, the code it produced was so impressive that, as a blind reviewer, I would have said, "I want to see that person straight away." That is how impressive this thing is.
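For readers who don't know the kata: a rover on a grid has an (x, y) position and a heading, and executes a command string of L (turn left), R (turn right), and M (move forward) instructions. Below is a minimal Java sketch of the kind of solution involved - an illustration added for context under the classic rules, not the code Cline generated in the demo, and all class and method names are made up.

```java
// Illustrative sketch of the Mars Rover kata (not the demo's generated code).
public class MarsRover {

    private enum Heading { N, E, S, W } // ordered clockwise

    private int x;
    private int y;
    private Heading heading;

    public MarsRover(int x, int y, Heading heading) {
        this.x = x;
        this.y = y;
        this.heading = heading;
    }

    // Execute a command string of L, R and M instructions.
    public void execute(String commands) {
        for (char c : commands.toCharArray()) {
            switch (c) {
                case 'L' -> heading = Heading.values()[(heading.ordinal() + 3) % 4];
                case 'R' -> heading = Heading.values()[(heading.ordinal() + 1) % 4];
                case 'M' -> move();
                default -> throw new IllegalArgumentException("Unknown command: " + c);
            }
        }
    }

    private void move() {
        switch (heading) {
            case N -> y++;
            case E -> x++;
            case S -> y--;
            case W -> x--;
        }
    }

    public String position() {
        return x + " " + y + " " + heading; // e.g. "1 3 N"
    }

    public static void main(String[] args) {
        MarsRover rover = new MarsRover(1, 2, Heading.N);
        rover.execute("LMLMLMLMM");
        System.out.println(rover.position()); // expected: 1 3 N
    }
}
```

The interesting part of the demo is not the solution itself but that the agent planned the project structure, wrote the tests, and produced code of this kind of quality from a short prompt.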
Now, as impressive as it is, the problem starts to come with repeatability, because another team used Claude Code to port a tool of ours called CodeConcise into a different language. They found that it saved them 97% of the work. Then they went to do it again the next day, and it failed completely. You can read that report - there's a QR code on the slide. We've published this on martinfowler.com.
Oh, no, this one was published on thoughtworks.com, I think.
But this is one thing that we want to do as Thoughtworks: continuously publish what we're seeing and what we're finding, so you can follow along with our experiments. So now we're getting to the autonomous agents - we wanted to test out some of these autonomous coding assistants. Birgitta, who heads up our AI FSD experiment team, had a look at a lot of the autonomous agents that are out there right now, but specifically at Codex. She got it to do the same task over and over again, running it repeatedly to see what the results were like and how repeatable it was. The good news is that the agents came up with a working solution every time. But unfortunately, across six runs, only twice did the agent find the existing piece of code it could have reused, and the result was duplicated code. So, the bold claim about being more productive - I think this is true; I think it is making us go faster. But now the question that we've got to answer is about the quality of the code.
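To make that duplication issue concrete, here is a hypothetical Java example of the failure mode - it is not code from the experiment, and the names are invented. The generated method works, but it re-implements a check that already exists in the code base instead of reusing it, which is exactly the kind of thing a human reviewer still needs to catch.

```java
// Hypothetical illustration of agent-introduced duplication (not from the experiment).
public class CustomerService {

    // Existing, hand-written helper already in the code base.
    static boolean isValidEmail(String email) {
        return email != null && email.matches("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$");
    }

    // Agent-generated method: it works, but it re-implements the email check
    // inline instead of calling isValidEmail - duplicated logic that quietly
    // grows the code base and its total cost of ownership.
    static String registerCustomer(String name, String email) {
        if (email == null || !email.matches("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$")) {
            throw new IllegalArgumentException("Invalid email: " + email);
        }
        return name + " <" + email + ">";
    }

    public static void main(String[] args) {
        System.out.println(registerCustomer("Ada", "ada@example.com"));
    }
}
```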
This is another report, which GitClear put out towards the end of last year. They measured code over a number of years, and here's what they found.
They found that at the point in time when Copilot and other code assistants came about, the amount of code added to code bases increased.
However, at the same time, the amount of moved code decreased to 0%. What does that tell us? It tells us refactoring is not taking place.
It tells us coding assistants are really great at adding new code, but because they're so fast at doing that, our code bases are growing at a really rapid rate.
And we know what happens when code bases get too big and too unwieldy. We have a total cost of ownership problem.
It's really easy to see the typical AI missteps at the commit level, where AI is not doing the right job: it just doesn't work - the tests don't pass, it doesn't compile, or it doesn't solve the problem. It's a lot harder to see the overall complexity it's introducing into your code base, or the lack of reuse you're getting through your system.
The verbose or redundant tests that it's creating.
So, as we go through this adoption curve and move through the adoption cycle, as tech leaders this is the thing you need to be most acutely aware of.
Your teams will yell at you when things aren't working.
They'll know how to do a change at an individual level.
You need to make sure certain practices are still taking place, like reviewing code. Don't offload that to other team members: if you're going to use code assistants to generate the code, that's great, but you have to be the reviewer - don't offload that to your colleagues. At the team level, you need to really focus right now on monitoring your code quality, shifting left on testing, and doing 'AI gone wrong' rituals - bring the team together and say, this is where it worked and this is where it didn't. But most importantly, introduce psychological safety into your team. Let the teams fail and learn as they go through this. We're working in quicksand.
No one knows the solutions right now. Someone described it to me as a bit like the Emperor's New Clothes, and I can see that. GenAI is an indiscriminate amplifier: it gives us gold or garbage, and it cranks both up to 11. So here are some things that you can actually do within your teams.
We really like GetDX as a product. It comes from the same people who wrote the Accelerate book and originally came up with the DORA metrics, including Dr. Nicole Forsgren.
They have a tool now that helps teams look at quality measures across their system. It looks at things like speed and throughput, effectiveness, quality, and impact, and it does that through both qualitative survey answers and quantitative data, by actually looking at Jira and your Git repos as well. They've also introduced a new module that starts to track AI productivity.
Keep an eye on quality in your teams; don't just measure speed and productivity, because otherwise we're going to be building a problem for ourselves to clean up later on. And engineering practices still matter - if anything, more than ever. All of the engineering practices that made XP good - things like clean code, fast feedback, simplicity, and repeatability - actually make AI even better for you.
And all of these other things like vertical slices, simple design, pair programming, test-driven development, these are all things that will help you on your AI journey.
All right. But once we start coding faster, we're going to put a lot of pressure on the rest of the system. Because if we can code faster and have a higher throughput, how can we fill the backlog faster? If we can code faster, how can we test faster? And how can we make sure our technical debt is kept in check?
Because what we know is that even without AI, we have a huge amount of waste within our systems. Most organizations that I go to have only about 30% value-added delivery in their systems. We've been working hard with organizations to get them to around 60% value-add, and if you can do that, you're actually introducing about a 30 to 50% increase in productivity.
So you can go from a 26-day cycle time to a 16-to-20-day cycle time just by removing the waste within your systems. And this is the next stage that organizations go through on this adoption journey. Once you've got over the adoption hurdle, teams start to think about how to accelerate productivity - not just for coders, but for the whole team. So now we're looking at the role that AI can play within the whole software delivery lifecycle: planning, requirements, design, other aspects of software engineering, testing, deployment, and then operations.
I want to pause for a second, because I think now is the time to introduce some really interesting questions that were not created by people in tech.
My sister is a teacher at a private girls' school in Brisbane. She has a PhD in archaeology and is now their ancient history teacher. And so she, of all people, has been tasked with working out the role of AI in education for their students - which I find very ironic, I will say. But because she is so academic, and because she has such a strong focus on understanding the past, she has come up with three questions that every subject is asked. The first one is: What thinking is fundamental to my subject and should not be outsourced?
What thinking is mechanical and can be outsourced to expedite learning?
And then: What is the best AI tool to use? I love these questions because I think they're universal across all professions, including our own. So the first one: What is fundamental to software engineering or software delivery that should not be outsourced? What thinking is mechanical and can be outsourced to expedite software delivery? And: What is the best AI tool to use? I think these are really great framing questions that you can ask your teams, because I think everyone's struggling with the question: Will my job disappear?
And I don't believe your jobs will disappear, but I do think that they will change. And so this is a great framing device that teams can use to help them through this adoption journey. So then we can have a look at the superpowers of GenAI: translation, finding knowledge, brainstorming and ideation, and summarization and clustering. And we can look at the parts of our software delivery lifecycle and work out which of our activities can be augmented with AI. It's really hard right now because - and we're keeping track of all the tools out there that help with software delivery - an overwhelming majority are code-based tools. They're coding assistants, and there are far fewer tools that help the other roles on our team or the other activities in our team. So the tool landscape is pretty thin.
There are a couple out there - Figma helps with wireframing, Lovable helps with GenAI prototyping - and lots of people in the startup game are using these to go and test their ideas; it's really impressive to see the things they can do. But most teams are actually reaching for ChatGPT to get going with this. The problem with doing that is that it happens at an individual level and not really at a team level. So we've kind of got an answer for that right now. We've created a tool - well, it's not a tool, it's a collection of prompts - and stuck it on the internet. It's open source.
You can go grab a copy of Haiven. It aims to help teams have a collective context as they're working through, chatting about, and exploring things like ideation - coming up with ideas for a new product; a pessimistic but realistic view of what might go wrong; requirements analysis, breaking down stories and epics with the majority of the story written for you; and threat modeling, thinking through the different things that could go wrong and coming up with scenarios for how you might treat them.
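As an illustration only - this wording is hypothetical and not taken from the actual prompt library - a requirements-breakdown prompt in such a collection might read something like: "Given the following epic, break it down into user stories with acceptance criteria, flag any assumptions you are making, and list the open questions the team should resolve before committing." The point is that the prompt is shared and versioned by the team, so everyone explores from the same context rather than each person improvising in their own chat window.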
And so that is how some teams are getting to that accelerated stage.
Now, I want to touch really briefly on what the future could look like. Once we're working well with each of the tools, and once the tools have caught up with what we're trying to do, this is where we start to get augmented. And this is where we can start asking the question: How can GenAI help us solve those hard engineering problems we've been unable to tackle manually? Things like modernization.
Things like being able to understand COBOL code bases in a way that lets us chat with them, with a ChatGPT-style interface on the front. We've solved that problem - we wrote about it on martinfowler.com - with a tool called CodeConcise, which takes COBOL abstract syntax trees, puts them into a graph, and puts a chat interface and a RAG-based model on top of that, so you can actually start to query it and have a look at the code. So now you can start to understand legacy code bases. We've also been thinking about things like reverse-engineering a black box, because we've heard lots of clients - and one in particular - tell us that they've lost the code of something that's running in production. They've got the binaries, but no code. So how can AI accelerate reverse-engineering that black box? We've been running these experiments, and we've got some really interesting results - enough for us to want to move from a dummy project into actually working with a client. We believe it is possible.
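To make the shape of that pipeline a little more tangible, here is a deliberately simplified Java sketch of the general pattern being described: parse code units into nodes of a graph, attach summaries, retrieve the relevant nodes for a question, and assemble them into a prompt. This is not CodeConcise's actual implementation - the class names are invented, and a naive keyword-overlap score stands in for the embedding-based retrieval a real RAG pipeline would use.

```java
import java.util.*;
import java.util.stream.*;

// Simplified sketch of a "code graph + RAG" pattern (not CodeConcise itself).
// Code units (e.g. COBOL paragraphs) become graph nodes with summaries; a
// retrieval step picks relevant nodes and stuffs them into an LLM prompt.
public class LegacyCodeQA {

    record CodeNode(String id, String summary, List<String> callees) {}

    private final Map<String, CodeNode> graph = new HashMap<>();

    void add(CodeNode node) {
        graph.put(node.id(), node);
    }

    // Score nodes by keyword overlap with the question (a stand-in for
    // embedding similarity), then pull in direct callees via the graph edges.
    List<CodeNode> retrieve(String question, int topK) {
        Set<String> words = new HashSet<>(Arrays.asList(question.toLowerCase().split("\\W+")));
        List<CodeNode> top = graph.values().stream()
                .sorted(Comparator.comparingLong((CodeNode n) ->
                        words.stream().filter(w -> n.summary().toLowerCase().contains(w)).count())
                        .reversed())
                .limit(topK)
                .collect(Collectors.toList());
        List<CodeNode> expanded = new ArrayList<>(top);
        for (CodeNode n : top) {
            n.callees().stream().map(graph::get).filter(Objects::nonNull)
                    .filter(c -> !expanded.contains(c)).forEach(expanded::add);
        }
        return expanded;
    }

    // Assemble the retrieved context into a prompt; a real system would send
    // this to an LLM rather than print it.
    String buildPrompt(String question) {
        String context = retrieve(question, 2).stream()
                .map(n -> n.id() + ": " + n.summary())
                .collect(Collectors.joining("\n"));
        return "Answer using only this context:\n" + context + "\n\nQuestion: " + question;
    }

    public static void main(String[] args) {
        LegacyCodeQA qa = new LegacyCodeQA();
        qa.add(new CodeNode("CALC-PREMIUM",
                "Calculates the insurance premium from age and risk class.",
                List.of("LOOKUP-RISK")));
        qa.add(new CodeNode("LOOKUP-RISK",
                "Looks up the risk class for a policy holder.", List.of()));
        qa.add(new CodeNode("PRINT-STATEMENT",
                "Formats and prints the annual statement.", List.of()));
        System.out.println(qa.buildPrompt("How is the premium calculated?"));
    }
}
```

The point of the sketch is the architecture: the graph carries structure the model cannot hold in context, and retrieval decides which slice of a very large code base the model gets to see for each question.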
But there are many other things that we can solve in this engineering space, and that's where our focus is right now. Because what we want to get to is a position of AI-native software development, where it's not just an upgrade, it's a reinvention: a fundamental shift in the way we conceive, develop, and deliver software.
What might that look like?
Teams where AI is at the center of creation, with AI integrated deeply into the core; an emphasis on seamless collaboration between humans and AI agents; ethical and responsible development of AI; and enhanced AI literacy across leadership teams. We might be sitting there thinking that future is very far ahead of us, but it's not. And I know it's not, because I know this is what startups are doing out of the gate. They're using Cursor and Lovable and a whole bunch of other agentic tools that are around, and they're just getting on with the job. It's just us enterprises that, I feel, are a lot slower moving on this journey. So what that might look like is prompting your way through each of the stages. At the moment, that means guarded outputs and human intervention - humans still in the loop, looking things over. But it's not too hard to imagine a world where, once we've solved all of the different problems along this track, we can actually go a lot faster end to end, quite autonomously - from requirements all the way through to deployment. If we get there, we might be able to get 'Phoenix' software: self-healing code.
Every time a new version of a library comes out, you don't have to spend a whole bunch of time working the upgrade into your code base - it just automatically happens.
So that, we think, is what the future might look like.
We've just got to get there. We've got to get there with different layers in our organization, different thinking, and most importantly, through empathetic leadership. And I'm hoping that we can talk about that a little bit more on the panel afterwards.
Because the future is bright, the future is interesting, but we just have to get through this adoption hump right now.
And lastly, and this is the last thing I'll leave you with, this lovely quote: "When the winds of change blow, some build walls, others build windmills." Build the windmill.
GenAI is transforming the way we build software, but code generation tools like Copilot are just the tip of the iceberg. Software delivery is much more than writing code—it encompasses planning, design, testing, deployment, and continuous improvement. The real opportunity lies in applying AI across the entire software delivery lifecycle to unlock greater efficiency, enhance collaboration, and drive innovation.
At Thoughtworks, we’ve spent the past year exploring how to take an AI-first approach to software delivery. Through hands-on experiments, we’ve discovered whether the bold claims are fact or fiction, uncovered what works (and what doesn’t), and identified where untapped potential lies. In this session, we’ll share up-to-date real-world insights, practical strategies, and our vision for a future where AI is seamlessly integrated into every stage of delivery—not just for developers, but for product teams, designers, and operations.
You’ll leave with a clear-eyed view of what’s real and what’s hype in AI for software delivery—plus practical, actionable strategies to move beyond code generation and start using AI to innovate faster, smarter, and across your entire SDLC.