Building an AI team when no one knows anything about AI

Introduction to AI at Atlassian

Matt Colman introduces himself and shares his experience transitioning to an AI-focused team at Atlassian. He discusses his initial feelings of imposter syndrome upon joining an AI team and receiving guidance from more experienced colleagues. Colman emphasizes the joy of learning together with his team and offers advice on approachable resources to understand AI.

Building the AI Team and Learning Together

Colman details the initial steps his team took to familiarize themselves with AI concepts, including using online courses, following AI experts, and sharing resources. This chapter emphasizes the importance of collaborative learning and adapting to new technologies. He also highlights their study group's impact on reducing anxiety about AI and shares how they collaboratively learned using various resources.

Introducing Autodev

Colman introduces Autodev, a product that automates tasks from Jira issues to pull requests. He demonstrates its functionalities, including how it generates unit tests and performs code translations. The chapter also delves into the challenges of building around LLMs, such as their indeterministic nature and cost considerations, while recognizing their potential to revolutionize workflows.

Challenges and Adaptations in AI Development

Colman discusses adapting traditional development loops to accommodate AI's unpredictability, highlighting the need for offline and online evaluations. He describes using SWE Bench for benchmarking and explains the importance of building internal datasets and leveraging tools like Statsig to make informed decisions on development processes.

Evaluations and Tooling for AI Teams

This chapter explores the significance of tools and evaluations in streamlining AI development. Colman details how his team built custom evaluation services and discusses the necessity of creating dashboards for analyzing code quality. Emphasis is placed on continual adaptation and learning to manage the complexities of AI development efficiently.

Final Thoughts on AI Implementation

Colman wraps up with reflections on the rapidly evolving landscape of software engineering due to AI technologies. He offers encouragement to those feeling imposter syndrome, underscores the importance of problem-solving skills, and calls for continual investment in team tools and learning. The session concludes with an invitation for further discussions on AI, as well as Colman’s light-hearted interests outside of work.

Look, Web Directions.

Absolutely love this conference.

I've been here the last four years.

I'm a little bit embarrassed today though, because my outfit isn't really matching the slides like the last two presenters.

So apologies for that.

A little bit awkward.

Building an AI team when no one knows anything about AI.

That's what we're talking about today.

I'm Matt.

I do front end.

I'm a software engineer / manager flip-flopper.

I've gone back and forth a few times.

I did a talk about that a couple of years ago.

Previously worked at Blake and Domain and now Atlassian for four years.

And so at the start of this year, I received an offer from my manager to change to a new AI team.

And I thought, AI's pretty cool, pretty sparkly.

So I didn't think about it for too long.

I just said, yes, sure, why not?

I tend to say yes to opportunities, let's do it.

I'm probably not the only one in the room who's in a similar situation.

Has anyone else been thrown into a new AI team this year?

I can see a couple.

Okay.

Yeah, so it's happening, right?

Like it's good fun.

And so my new manager, he's actually from Meta, and he did have a little bit of AI experience.

And so he was telling me, okay, it's an AI code generation thing.

You'll have 14 engineers.

We've got three products in mind.

Good luck.

And so I said, all right, though I was feeling a little bit unsure.

I asked, what can I do to get up to speed?

Give me some reading or something.

So he sent me a few links, a few YouTube videos.

The first one I clicked on was this one from Stanford University.

So I'll play a very small clip of this.

[Narrator] So at inference time, our decoding algorithm would define a function to select a token from this distribution.

So we've discussed that we can use the language model to compute this P, which is the next token distribution.

And then G here, based on our notation is the decoding algorithm, which helps us select what token we're actually going to use for Yt.

So the obvious decoding algorithm is to greedily choose the highest probability token as Yt hat for each timestep.

So while the... [Matt] Alright, so pretty basic math there.
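
If you want the gist of that clip without the notation: greedy decoding just asks the model for the next-token probabilities at each step and picks the most likely one. Here's a minimal sketch in TypeScript, where `nextTokenDistribution` is a hypothetical stand-in for the language model:

```typescript
// Greedy decoding: at every step, pick the single most probable next token.
// `nextTokenDistribution` stands in for the language model: given the tokens
// so far, it returns a probability for each candidate next token.
type Distribution = Map<string, number>;

function greedyDecode(
  nextTokenDistribution: (context: string[]) => Distribution,
  prompt: string[],
  maxTokens: number,
  stopToken = "<eos>"
): string[] {
  const tokens = [...prompt];
  for (let t = 0; t < maxTokens; t++) {
    const dist = nextTokenDistribution(tokens);
    // The "obvious" decoding algorithm from the lecture: argmax over P.
    let best = stopToken;
    let bestP = -Infinity;
    for (const [token, p] of dist) {
      if (p > bestP) {
        best = token;
        bestP = p;
      }
    }
    if (best === stopToken) break;
    tokens.push(best);
  }
  return tokens;
}
```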

Was anyone not following along with that math?

A few dummies here then, heh, that's alright.

So I actually watched this whole lecture.

It went for an hour and 20 minutes.

Pretty much the entire thing went straight over my head.

And then when it finished, it said, would you like to watch the next lecture, which is the advanced topic?

And it was at that point where I thought like major imposter here.

I was pretty scared at that point.

So I reached out to a few other managers at Atlassian who had been through a similar experience because there's a lot of AI teams spinning up at Atlassian.

I certainly wasn't the first.

So I reached out to a colleague who had been through this much earlier, and we had a really good chat.

He assured me, he said, Matt, you don't need to know the math.

But you do need to know how LLMs work at a high level.

So maybe focus on that.

And so that was good advice.

And then actually I also found some other YouTube videos; if you just Google "explain LLMs to me like I'm a five year old", that is a much better place to start than the Stanford lecture.

I highly recommend that.

And so whilst I felt totally out of my depth with my imposter syndrome, it didn't last very long.

I actually quickly jumped into the team and the project and realized pretty quickly that whilst technology is changing, software development will always be about problem solving and providing value to the customer.

And I think I'm pretty good at that, so I didn't stress for too long.

Actually, the whole team suffered very similar imposter syndrome.

This is the team here, we're all quite beautiful.

And so what we did is we just established that early, right?

We said, we're all in this together.

None of us know AI.

And so we started learning together, and this was something that really calmed the whole team down, because we knew we were in this together.

We started looking at courses; deeplearning.ai was a good one.

We started following AI people and groups on Twitter and other platforms.

We looked at research papers from arxiv.org.

Importantly, we started sharing not only the papers, but how we found the papers.

Podcasts.

There are heaps of great AI podcasts out there.

I've just listed here This Day in AI, which is a great one because it's an Aussie podcast.

I actually listened to the most recent episode on the way here today, and check it out, because there is a Claude vs ChatGPT 4.0 rap battle.

It's stunning.

You really want to check that out.

We also had an internal Slack AI study group, which is where we started, but it's still going today.

It's a really great group.

It just reminds us again that the industry is moving incredibly fast.

It's really hard to keep up.

We're in this together.

We're sharing tips together.

So that's a study group that just keeps going.

Okay.

So look, this is not a product pitch at all.

I'm not trying to sell you anything, but, just for context, this is what we're working on.

It's a product called Autodev, and it basically takes a Jira issue and goes from the issue straight to the pull request.

It is absolute magic.

If it works.

So let's have a look at where we're at today.

[Matt narrating a screen cast] Hello, Web Directions.

Okay, here I am on my Jira board.

Let's go and pick up a task.

Click into here.

Now I've got the task which is called Generate unit tests for the handle result function.

This is our shiny new button right here.

Code with AI.

Let's give it a go.

Okay, so it pops up this modal and it has selected the right repository for me.

I can edit that if I like or I can edit the issue description if I like.

But I'm going to go straight to generate plan.

Okay, so it's done.

We have a plan generated here for how we should create unit tests for the function.

And we have the code here.

So how did this work behind the scenes?

What we did is we cloned the repository in a secure environment.

We then gave the LLM a whole bunch of tools to go and search that repository for the handle result function.

Once it found the handle result function and the file, then we handed that back to the LLM to go and generate the code.
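
That repo-search step is essentially a tool-calling loop. This is not Autodev's actual implementation, just a rough sketch of the pattern, with `callLLM` and `searchRepo` as hypothetical stand-ins for the real services:

```typescript
// A rough sketch of the "clone the repo, give the LLM search tools" loop.
// `callLLM` and `searchRepo` are hypothetical stand-ins, not Autodev's real API.
interface ToolCall { tool: "searchRepo"; query: string; }
interface LLMResponse { toolCall?: ToolCall; code?: string; }

async function generateCodeForIssue(
  issue: string,
  repoPath: string,
  callLLM: (messages: string[]) => Promise<LLMResponse>,
  searchRepo: (repoPath: string, query: string) => Promise<string>,
  maxSteps = 10 // cap the loop to bound latency and cost
): Promise<string> {
  const messages = [`Task: ${issue}`];
  for (let step = 0; step < maxSteps; step++) {
    const response = await callLLM(messages);
    if (response.toolCall) {
      // The model asked to search the cloned repo (e.g. for handleResult).
      const found = await searchRepo(repoPath, response.toolCall.query);
      messages.push(`Tool result: ${found}`);
    } else if (response.code) {
      return response.code; // hand the generated code back for review
    }
  }
  throw new Error("No code produced within the step budget");
}
```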

As you can see, it's come up with some pretty good unit tests here.

And if I wanted to add anything or change anything, I could talk to Autodev right here and make some changes.

So what else can Autodev do?

It could also convert JavaScript to TypeScript.

It could make some UI changes.

It could clean up some feature flags.

It can add some logging.

There's a whole range of things Autodev can do.

[Matt in the room] Alright, so that's the product today.

It's pretty exciting.

Like it's really fun to work on.

It's really exciting stuff.

But what's interesting is like in traditional web development, we've been moving towards functional programming.

That is, using pure functions, where given an input the output will always be the same, without any side effects occurring.

That's what my mentors were telling me for many years.

They were just hammering in the value of pure functions.

But now we have the indeterminism of LLMs.

And then testing paradigms have been getting incrementally better and better.

We're trying to make the test faster and more reliable.

But now we have slow responses from LLMs.

And companies at the moment are doing a lot of cost saving initiatives.

We've got to save money.

Introducing the most expensive thing I've ever seen: LLMs.

So it feels like we've taken about 10 steps backwards in many ways.

But of course, we've also taken a giant leap forwards in terms of the power of LLMs.

So this is something that we really have to adapt to and adjust.

So if you take the traditional dev loop and you apply it to using LLMs, it can look something like this.

So we write some code, we test it locally, we wait, we get rate limited because everyone else is doing the same thing and we're all hitting LLMs.

So you get rate limited a lot.

So then you wait some more.

Eventually a response comes back and the result looks good.

So you ship it.

Then alerts start going off because the tests are flaky.

Customers start complaining because the results are flaky.

We only saw this work once with an indeterministic LLM.

So maybe it doesn't work that often.

So it's a sad time if you try to apply the traditional dev loop.

We learned pretty quickly the dev loop has to change.

So our approach today is offline evaluations and online evaluations.

Offline evaluation is where we evaluate the performance of our solution against a static data set.

This is pretty familiar if you're into data science or machine learning, but it's less familiar if you're a software engineer.

We run our offline evaluations regularly during off peak hours to save on cost.
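
At its core, an offline evaluation is just a loop over a fixed dataset: run the system on each case, score the output, and report one number you can compare run to run. A minimal sketch, with the dataset shape and scoring function purely illustrative:

```typescript
// Minimal offline-evaluation runner: score a solution against a static data set.
interface EvalCase {
  id: string;
  input: string;    // e.g. a Jira issue description
  expected: string; // e.g. a ground-truth diff or answer
}

async function runOfflineEval(
  dataset: EvalCase[],
  solve: (input: string) => Promise<string>,           // the system under test
  score: (actual: string, expected: string) => number  // 0..1 per case
): Promise<number> {
  let total = 0;
  for (const c of dataset) {
    const actual = await solve(c.input);
    const caseScore = score(actual, c.expected);
    console.log(`${c.id}: ${caseScore.toFixed(2)}`);
    total += caseScore;
  }
  return total / dataset.length; // one comparable number per run
}
```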

SWE Bench. Who's heard of SWE Bench?

How good is SWE Bench?

If you haven't checked it out, there's a QR code: swebench.com.

Check it out, it's a wild ride over there at SWE Bench.

SWE Bench stands for Software Engineer Benchmark.

I don't know if you've heard the term SWE recently, but I've heard it a lot.

We're all called SWEs now.

We're not called devs anymore.

So SWE Bench is this giant repository of Python tasks.

They're mostly single large files and they all have unit tests for those tasks.

And so you can take your LLM coding agent, run it on SWE Bench, and see how it performs.

And so the measurement for passing is that all the unit tests pass for that task.

So if you can get the unit tests to pass, you get a tick.

So today Autodev is scoring 36.45 percent on SWE Bench Verified 500.

We are not at the top of the leaderboard.

I don't know who's winning at the moment.

Maybe it's 42 percent or something like that.

It's good fun, but we did realize that because they're all Python tasks in single files, it's actually not very representative of what we want, because ultimately we want Autodev to go GA for our Jira customers.

So what we've built is an internal data set of 369 issues.

These are all Atlassian solved issues of easy to medium complexity.

There's a variety of tasks and different languages.

There's a variety of repos from small to large mono repo.

And we test this by similarity score.

So compared to SWE Bench, which runs the unit tests, in this one, because we actually have real tasks that Atlassians have solved, we have a ground truth PR, a target to aim towards.

So we run Autodev against the same issue and we ask, do those two diffs look similar?

If they do, you get a tick.
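
The talk doesn't spell out how that similarity score is computed, so the sketch below is just one possible approach: a Jaccard similarity over the changed lines of the two diffs.

```typescript
// One possible similarity score: Jaccard overlap of the changed lines in two
// unified diffs. Illustrative only; not necessarily how Autodev scores diffs.
function changedLines(diff: string): Set<string> {
  return new Set(
    diff
      .split("\n")
      .filter((l) => /^[+-]/.test(l) && !/^(\+\+\+|---)/.test(l)) // skip file headers
      .map((l) => l.slice(1).trim())
      .filter((l) => l.length > 0)
  );
}

function diffSimilarity(generatedDiff: string, groundTruthDiff: string): number {
  const a = changedLines(generatedDiff);
  const b = changedLines(groundTruthDiff);
  if (a.size === 0 && b.size === 0) return 1;
  let intersection = 0;
  for (const line of a) if (b.has(line)) intersection++;
  const union = a.size + b.size - intersection;
  return intersection / union; // 1.0 means the two diffs touch the same lines
}
```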

We also build a lot of custom data sets.

So we build these from online metrics and customer feedback.

These help us test specific issues such as hallucinations, lack of context, low file recall, et cetera.

This allows us to really focus on the problems that we see today and try to solve them in a way that actually is significant and that we can trust.

Statsig is a great tool.

I don't know if you've used it.

If you haven't, check it out.

We use it for offline evaluations, online experiments and feature flagging.

Statsig stands for statistically significant, which is very important for indeterministic LLMs.

What we can see here on my right, your left: these are all our experiments that are running.

So the filled colors with the ticks, those are experiments we've shipped.

The gray bars that are filled with the cross, they failed.

They were no good.

And the white ones are active.

Now on this side, this is really interesting for Statsig.

It's quite unique to Statsig.

We've got the gray bars here, which you can see they intersect with that neutral line.

That means that we don't have enough data yet, and we can't statistically significantly say that it's good or bad.

So they stay gray.

The green ones are Statsig positive, the red ones Statsig negative.

This is how we run experiments.
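
Under the hood, "statistically significant" for a pass/fail style metric comes down to something like a two-proportion z-test: given the sample sizes, is the gap between control and treatment too big to be noise? Statsig does this for you, but a hand-rolled sketch looks roughly like this:

```typescript
// Two-proportion z-test: is treatment's rate significantly different from
// control's, or do we simply not have enough data yet (the grey bars)?
function twoProportionZTest(
  successesA: number, totalA: number, // control
  successesB: number, totalB: number  // treatment
): { delta: number; z: number; significant: boolean } {
  const pA = successesA / totalA;
  const pB = successesB / totalB;
  const pooled = (successesA + successesB) / (totalA + totalB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / totalA + 1 / totalB));
  const z = (pB - pA) / se;
  // |z| > 1.96 corresponds to p < 0.05 for a two-sided test.
  return { delta: pB - pA, z, significant: Math.abs(z) > 1.96 };
}

// e.g. 240 merged PRs out of 1000 generations vs 275 out of 1000 (made-up numbers):
// twoProportionZTest(240, 1000, 275, 1000).significant
// decides whether the bar goes green/red or stays grey.
```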

Every single change we do, because LLMs are involved, we have to do it this way.

Everything we do runs through a Statsig experiment.

So at the end of it, when it looks like this, we get together as a group, or we have a channel with some approvers to discuss the positives and the negatives, and then we decide if we ship.

But we always start with a hypothesis.

So if the green bar was our hypothesis, then we say, that's great.

What about these red bars?

Are they serious regressions?

Or does it work out?

And then we decide to ship.

We've also discovered that because we are actually building a new way for developers to work with Autodev, we've had to build some new tools to move at speed.

And probably for anything you're doing with LLMs, investing in tools is really important.

So we've built our own evaluation service and user interface for analyzing and debugging both offline evaluations and online jobs.

There are similar off the shelf evaluation platforms out there.

So go and check them out before building your own, because it takes a while to build your own.

Yeah, just Google or ChatGPT that, and you'll see some of the options.

So in this screenshot you can see evaluations running; green's obviously good, red's bad.

This gives us an idea of how the evaluations are doing, but importantly we can click into them and study the things that are going wrong.

But also study the things that are going right.

Online evaluations are not really new at all to software engineers.

However, in the age of indeterministic LLMs, they are more valuable than ever.

Dashboards.

We have loads of dashboards.

They're so helpful.

They're so important to what we do.

We don't just keep an eye on the numbers, but importantly, we click into those numbers and find out more.

So take our code quality rate, for example: 24%.

This means that for every successful code generation, 24 percent of the time that leads to a merged pull request.

Now the ability to click through to that job and investigate and understand the 24 percent but also the 76 percent that did not lead to a merged pull request is very critical to our strategy.
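
As a concrete picture of that metric, code quality rate is just merged pull requests divided by successful code generations; the field names below are illustrative, not Autodev's real schema:

```typescript
// Code quality rate: of the generations that succeeded, how many ended up in a
// merged pull request? Field names are illustrative, not Autodev's real schema.
interface GenerationJob {
  id: string;
  generationSucceeded: boolean;
  prMerged: boolean;
}

function codeQualityRate(jobs: GenerationJob[]): number {
  const succeeded = jobs.filter((j) => j.generationSucceeded);
  if (succeeded.length === 0) return 0;
  const merged = succeeded.filter((j) => j.prMerged);
  return merged.length / succeeded.length; // e.g. 0.24 for the 24% in the talk
}
```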

We continue to create new queries and dashboards weekly to focus on specific problems.

And so actually we saw an example just this week where we changed our underlying model.

And when we did that, we had a statistically significant positive result on our internal data set.

But really interesting was that the online metrics also showed that some problems that we were focusing on completely disappeared.

So this is huge for us, because we had teams working on fixing those problems, and then this model comes along and, just like that, they disappeared.

So obviously we can now shift those teams to work on something else.

This will continue to happen.

I'm sure we're all seeing this when you're working on something.

Oh, how am I going to solve this? Prompt engineering, chain of thought.

And then this new model comes along and you just delete all that code that you were working on.

It's good fun though.

It's good.

You just got to smile.

SWEs are evolving.

SWEs.

I love the term.

As I said, we're all called SWEs now because of the MLEs, the Machine Learning Engineers.

That's why we're not called devs anymore.

But it's really interesting.

We're actually learning data science skills now.

So recall, precision, F1 score.

These are terms I hadn't heard about before, but now we're using them a lot.
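
For anyone meeting these terms for the first time, the standard definitions are short. In a file-retrieval setting, precision asks how many of the files the agent pulled in were actually needed, recall asks how many of the needed files it pulled in, and F1 is their harmonic mean:

```typescript
// Standard definitions. For file retrieval: a true positive is a file the agent
// retrieved that was actually needed, a false positive is one it retrieved but
// didn't need, and a false negative is a needed file it missed.
function precisionRecallF1(tp: number, fp: number, fn: number) {
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}
```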

We're building data dashboards a lot.

So my Databricks and SQL skills are getting better, even though I use AI to write the SQL.

We're learning traditional machine learning skills.

Investigation and debugging skills are more critical than ever, which is great for us as software engineers coming into this.

Just know that your debugging skills are going to be super important and that's what we're good at.

SWEs are writing better task descriptions than ever, which is great because now the AI is writing more of the code.

We actually use Autodev in our team on a day to day basis or at least a weekly basis.

And that means that you need to write a really good task description if you want the code to come out the other end in a good way.

Lots of cultural shifts as well.

As we shift to LLMs, the sharing culture has really lifted.

We have weekly org wide sharing sessions, brown bags, demos, things that we do anyway, but they're just lifted because we need to share a lot.

Focus on data: weekly check-ins on data and results. Data, again, has always been useful; it's just lifted.

And focus on customer feedback.

We have weekly feedback analysis.

And now feedback actually does more: you don't just look at it and say, oh, okay, that's interesting, maybe I'll do something, maybe I won't.

It powers our custom data sets now.

If a customer comes in and says, Oh, this thing's broken.

We categorize that, we add it to the categorized data set.

And then the human who's unhappy with Autodev will go along and write the pull request themselves.

And again, now we have a target.

Okay, we say, all right, they just showed us what we need to get to.

So now we have a target.
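
In practice that feedback loop is just a data shape: a categorized example whose ground truth is the PR the human eventually wrote. The names below are hypothetical, not the real Autodev schema:

```typescript
// Rough shape of a feedback-driven eval case: the complaint is categorized, and
// the PR the human ends up writing becomes the target. Names are hypothetical.
type FeedbackCategory = "hallucination" | "lack-of-context" | "low-file-recall" | "other";

interface FeedbackEvalCase {
  issueKey: string;            // the Jira issue the customer ran Autodev on
  category: FeedbackCategory;  // how the complaint was bucketed
  autodevDiff: string;         // what Autodev produced at the time
  groundTruthDiff: string;     // the PR the unhappy human wrote themselves
}

// Each category becomes its own custom data set, so an experiment can show
// whether a change moves the needle on, say, low file recall specifically.
function byCategory(cases: FeedbackEvalCase[]): Map<FeedbackCategory, FeedbackEvalCase[]> {
  const buckets = new Map<FeedbackCategory, FeedbackEvalCase[]>();
  for (const c of cases) {
    const bucket = buckets.get(c.category) ?? [];
    bucket.push(c);
    buckets.set(c.category, bucket);
  }
  return buckets;
}
```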

It's really interesting and it's really important to listen to the customers.

Okay, so some final thoughts.

Software engineering is changing.

That's obvious.

I didn't need to write that.

Your existing problem solving skills are more valuable than ever.

So as I said, if you're coming into an AI team, it's all good.

Don't feel the imposter syndrome.

Problem solving is really key.

Do measure the performance of your solution with offline evaluations.

You'll learn that very quickly.

It's really important.

It's critical.

Invest in new tools to increase dev productivity.

I really recommend jumping on this one very quickly.

Otherwise devs will be running LLMs locally.

They'll be like scratching their head.

They'll be pulling their hair out.

Like it's super frustrating.

If you ask the devs of a new AI team, are they enjoying the dev loop?

The answer will be no.

Initially, like it's really tough, but invest in those tools.

Make sure that you can get the dev loop to a point where you enjoy what you're doing.

That's really important.

And learn together.

That's really important as well.

Keep it fun, learn together and get rid of that imposter syndrome.

Alright, so look, if you'd like to know more about building an AI team or you want a live demo of Autodev after this, then let's chat, I'd be happy to show you.

Or if AI is not really your thing, then look, this is me, I've got two amazing kids, love playing golf and I'm a proud owner of a pizza oven, so come and chat with me about that.

That is it, thank you very much.

Applause.

  • Frontend
  • Software engineer / manager flip flopper
  • Previously Blake and Domain
  • Atlassian for 4 years

Feb 1 '24

Jira dev integrations -> new AI team

  • AI code generation
  • 14 engineers
  • 3 products in mind
  • Good luck

Basics of natural language generation

  • At inference time, our decoding algorithm defines a function to select a token from this distribution:
Screencast of a lecture on machine learning

Mathematical formula illustrating the decoding function used in natural language generation with a focus on selecting tokens based on probability distribution.

Imposter

Animated character in a suit, curled up on the ground, with a caption reading "whimpering."

Whilst technology is changing, software development will always be about problem solving and providing value to the customer

Illustration of a light bulb and a cartoon character with a robot waving.

Learning together

Image of five people sitting at a table, happily engaging in a discussion, with a laptop and several books in front of them.

Autodev PoC

Jira issue to Pull Request

Image of a person making a magical gesture with text "MAGIC" overlayed.
Screenshot of a Jira project board named "Autodev Team24" displaying tasks separated into columns: "To Do", "In Progress", "In Review", and "Done".

Devs had to adjust

  • Indeterminism
  • Slow responses
  • Cost

Traditional dev loop with LLMs

  1. Write some code
  2. Test it locally
  3. Wait
  4. Get rate limited
  5. Wait some more
  6. Result looks good
  7. Ship it
  8. Alerts go off - Tests are flakey
  9. Customer complaints - Results are flakey
  10. :sad:
Image of a large crying face emoji

The dev loop has to change

  1. Offline Evaluations
  2. Online Evaluations

Offline Evaluations

Offline Evaluations

SWE Bench

  • All Python
  • Mostly large single files
  • All unit tests must pass
  • Autodev: 36.45% on SWE Bench Verified 500
Illustration of a podium with people celebrating, a QR code labeled “SWE-bench: Evaluate Language Models on Open Source Software Tasks” with the URL swebench.com at the bottom right, and an illustration of a robot and a person with a speech bubble.

Offline Evaluations

Internal data set

  • 369 issues
  • Atlassian solved issues of easy to medium complexity
  • Variety of tasks and different languages
  • Variety of repos from small to large monorepo
  • Tested by similarity score

Offline Evaluations

Custom Data sets

  • Help test specific issues such as:
    • Hallucinations
    • Lack of context
    • Low file recall
    • Language server use
    • Command line tools use

Offline Evaluations

Statsig

Image showing two charts: the left one with a list of features and progress bars next to them, some with checkmarks, and the right chart shows a bar graph with color-coded percentages in green and red, indicating statistical results.

Offline Evaluations

Invest in tools

Dashboard of recent evals

Online Evaluations

Online Evaluations

Dashboards

A dashboard displaying performance metrics including percentage figures for issue resolved rate, code quality rate, issue generation rate, and PR merge rate. There is a graph showing code quality rate over time.

SWEs are evolving

  • Learning data science skills
    • Recall, Precision, F1 score
    • Building data dashboards
  • Learning some traditional machine learning skills
  • Investigation / Debugging skills are more critical than ever!
  • SWEs writing better task descriptions as AI is writing more of the code
Illustration of human evolution, starting from an ape to a modern human using a computer, with the caption "Somewhere, something went terribly wrong."

Cultural shifts

  • Sharing culture
    • Weekly org wide sharing sessions
    • Brownbags
    • Demos
  • Focus on data
    • Weekly checkins on data and results
  • Focus on customer feedback
    • Weekly feedback analysis
    • Custom data sets crafted from feedback
Illustration of four people around a large light bulb, symbolizing collaboration and ideas.

Final thoughts

  • Software engineering is changing
  • Your existing problem solving skills are more valuable than ever
  • Measure performance of your solution with offline evaluations
  • Invest in new tools to increase dev productivity
  • Learn together

Let’s chat!

Collage of images showing two children with artwork, two people on a golf course, pizza in a wood-fired oven, and a pizza topped with arugula and cheese.