Building a Performance Culture
Hi everyone.
I'm Claire, an Engineering Manager at SafetyCulture, and today I'll be talking about building a performance culture.
Firstly, let's dive into why performance is important.
In my experience, I've worked in environments with millions of users visiting each day.
Handling that load and traffic was important, because if we didn't, it would hurt the user experience and have business impacts.
So jumping back in time: when Twitter was experiencing a lot of load, users would often see the 'fail whale' whenever there were scaling issues.
And then moving forward to a more recent example, Signal also experienced something similar.
As their user base was growing - (you can see here they're really excited by that) - but they also had scaling challenges that they needed to overcome.
So these were a couple of examples where back-end performance is important, but as part of the user experience, front-end performance is just as important.
So diving into an example here: Google conducted a survey and found that 53% of people will abandon a site if it takes longer than 3 seconds to load.
As a part of that same survey, they found that 46% of people said waiting for pages to load was one of the things they disliked the most when browsing on mobile devices.
And lastly, as a part of that survey, they also found that mobile sites took an average of 19 seconds to load over 3G connections.
Diving into another example: Pinterest also conducted a series of experiments to improve their website performance back in 2015.
And as a part of that, they worked to improve the mobile landing page and improved the performance by 60%.
This led to a sign-up conversion rate improvement of 40%.
And after this, the team worked on productionizing these improvements, which led to the biggest increase in user acquisition in the following year.
So now I want to touch on: What does performance culture look like?
You've got examples out there where teams are working on performance, but what does it really look like inside?
Jumping back a few years: I first encountered the topic of performance in a team at Fairfax, which looked after the Sydney Morning Herald, and performance was baked into the processes and everything that we did.
For example, if we were building features, we'd be considering performance impact.
Before rolling out features, we would be performance testing and tuning if any of the benchmarks were not met.
And then as we rolled out the changes into production, we'd also monitor these and see what was happening and respond as we needed to.
I can remember engineers clustering around a dashboard on a busy day when there was a lot of load, and that's just something that we lived and breathed.
Then jumping forward a few years, I found myself in a team where we were looking at front-end performance.
And in this scenario, it wasn't just one team's focus.
It was actually a focus globally in the organization.
Being a classifieds site, the impact of ads was something we were considering: we wanted to understand the impact on the user experience, as well as find places where we could optimize things.
So A/B testing some of our changes and improvements, and then leveraging some of the technical advancements that were coming out at that time.
So for example, HTTP/2 was rolling out, and we wanted to leverage some of those improvements in our optimizations.
So we've talked about what performance culture looks like.
But let's backtrack now to how that can start.
And that can be different depending on the environment you're in.
So more often than not, you may not have a clear idea of what the performance is like at that point in time.
You may or may not be testing, and you could find that this could be quite a manual and ad hoc process.
Customers might start complaining, particular pages might take a while to load, and this is when that starts to surface.
Another way that this could surface is that the user base is growing.
And this is where performance becomes more and more of a focus.
So at SafetyCulture this started in a few different ways.
The scalability of our systems is something that's important, and that's because our user base is growing and we're going through a period of growth, which is very exciting.
But at the same time, we needed to understand our limits and if we could support the growth that we were expecting.
On the other side of that, we also wanted to understand: What was the impact to the front end?
And we just didn't know.
And so one of the things that we wanted to do was run a bunch of performance tests to understand: what was the performance like right now?
So, talking about how this started at SafetyCulture: team members would reach out and ask, "Hey, you know, do we have any metrics on how our application is performing on specific pages?"
And we had a few select engineers who ran those tests on an ad hoc basis.
So it was very manual and it wasn't until we had a quality-themed hackathon that we started diving into that process.
We wanted to embed that process into our day-to-day workloads.
And by doing that, we really started to bring performance to the forefront of our minds in our day-to-day work.
As part of that, we also wanted to push a lot more of that performance-metrics information to our engineers, which really involved automating that process as part of their daily changes.
And so, why do we do that?
We want to ensure that we've got a way to conveniently present that information to our engineers, so that they can view and analyze their changes and how they impact performance.
And so that's what we've done at SafetyCulture.
We've updated our CI/CD pipelines such that upon any change that is merged and deployed to our staging and production environments, we run these sets of performance tests.
And then we've also got infrastructure capturing the results of these tests.
We then have dashboards that we've created to visualize and show the trends of these metrics over a period of time.
And then we've also got reports that we're sending out to teams on a weekly basis.
And that focuses again on proactively pushing that information to our engineers, and that allows us to then get them to engage in the performance analysis process.
The automation piece also allows us to be a lot more flexible with how we run the tests and what we configure to test: we can run the tests with various configuration settings.
In particular, location and network connection speed are quite important to us here at SafetyCulture.
We've got various customers all around the world with different profiles, and you can imagine a frontline worker, for example, who's in a mining site or construction site where they don't have great internet connectivity.
And we really need to understand: What is the user experience like within those environments?
And then we can start to improve and iterate on our application, on our platform.
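As a rough sketch of what that kind of test matrix can look like (the profile names, URLs, and throttling values here are my own illustrative assumptions, not SafetyCulture's real configuration), a CI job could fan each merged change out over a set of network profiles and drive a tool such as Lighthouse:

```python
# Illustrative network profiles; the throttling numbers are assumptions.
NETWORK_PROFILES = {
    "slow-3g": {"rtt_ms": 300, "throughput_kbps": 400},
    "regular-4g": {"rtt_ms": 40, "throughput_kbps": 10240},
}

def build_lighthouse_args(url, profile_name):
    """Build one Lighthouse CLI invocation for a URL/network combination."""
    p = NETWORK_PROFILES[profile_name]
    return [
        "lighthouse", url,
        "--output=json",
        f"--throttling.rttMs={p['rtt_ms']}",
        f"--throttling.throughputKbps={p['throughput_kbps']}",
    ]

# In CI, each merged change would run the whole URL x network matrix:
urls = ["https://example.com/login", "https://example.com/dashboard"]
matrix = [(u, n) for u in urls for n in NETWORK_PROFILES]
```

Each invocation in the matrix produces a JSON report that the dashboards and weekly reports can then aggregate.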
Another thing that you'll find as you're working on starting performance culture is understanding your surroundings.
And so, for example, if you're at a startup, the environment will be very different to other types of organizations.
In this scenario it's important to understand what matters.
As a startup, you're most likely going to be focusing on product/market fit and understanding the needs of the customer and finding solutions to meet that.
And so performance may or may not be a focus at this point in time.
However you could identify: What are the key pages to focus on?
And a great way to do that is just understanding the user journey as they're going through the funnel.
For example, landing pages and booking forms are indicators of customers using your site and of the success of your business.
The other thing that you could do is look at the bounce rates of these pages and find those opportunities where you could start optimizing performance.
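As a rough sketch of that prioritization (the page names and numbers here are made up for illustration), you could weight each page's bounce rate by its traffic to surface the highest-impact candidates:

```python
# Hypothetical analytics data: (page, monthly visits, bounce rate)
pages = [
    ("/landing", 120_000, 0.58),
    ("/booking", 45_000, 0.41),
    ("/pricing", 80_000, 0.35),
]

def optimization_candidates(pages):
    """Rank pages by visits * bounce rate: the bounced visits you could win back."""
    return sorted(pages, key=lambda p: p[1] * p[2], reverse=True)

ranked = optimization_candidates(pages)
```

With these numbers, the landing page tops the list: it loses the most visits overall, so optimizing it first gives the biggest return.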
And then for scale-ups or organizations where you're experiencing higher load and more users using the site, understanding the limits of the systems is going to be a focus.
This is a quote from someone in the team, who mentioned that "we just didn't know what the limits were." And so one of the things that we needed to do was backtrack and understand the current landscape and our limits, and from there kick-start a bunch of activities to start improving scalability as the user base grows.
So now I'm going to pass over to Kevin to talk a little bit more about that.
So for us at SafetyCulture, it was really important to recognize and build awareness of where our architecture would begin to fail.
With a growing business like ours, a scale-up adding users on an ongoing basis, the requirements on our system are growing and growing. It's really important to maintain an awareness of where you're going to start to see problems, and to take action as soon as you can, so that users who are onboarding, and users who have been on your platform for a while, continue to have the best experience possible.
So for us, that was about taking a look at the systems that we had in place and asking ourselves: "Are they still right for us? Do they still serve us? Do they still serve our customers? Are our customers still getting the best possible experience?
And as we continue to add users at the current rate, or if that growth curve were to tick up, what would that look like, and at what point would we start to see problems?" So once we'd identified a goal, once we knew where we were going with the growth of the company and of our user base, the next thing to do was to set some performance targets.
And we thought about that along two different dimensions.
It's the leading and lagging indicators of performance.
The lagging indicator is really the user experience: you can look at that and say, "Hey, users are starting to have a really poor experience."
And you can see that experience starting to degrade as a result of the scalability challenges, or scalability issues that you might be experiencing.
That's a lagging indicator - it tells you that something has already gone wrong.
For us, we wanted to look at leading indicators.
We wanted to look at indicators that told us we were starting to see a degradation in performance, which allowed us to then take action and make a change.
So that degradation didn't lead to a degradation in the user experience.
So what might that look like?
We set goals, for example, on SLAs and SLOs for response times from our data stores.
SLAs and SLOs for response times between microservices. Paying attention to data at that level allowed us to take corrective action, so that as we made changes in our back-end architecture, we could do so with a view to avoiding issues with, and not degrading, the user experience.
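A minimal sketch of such a leading indicator (the thresholds and latency samples here are illustrative, not our real SLOs): check a service's 95th-percentile response time against its SLO, so you can act before users feel it.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    rank = max(0, round(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

def slo_breached(samples, slo_ms, pct=95):
    """Leading indicator: flag when the pct-th percentile exceeds the SLO."""
    return percentile(samples, pct) > slo_ms

# Illustrative response times (ms) between two microservices
latencies = [12, 15, 14, 18, 22, 19, 250, 16, 17, 13]
breached = slo_breached(latencies, slo_ms=100)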
As you're working through performance, you also need to understand how to communicate the value of this to other people, and that's going to be different for the audience.
So for example, if you're talking to marketing, they most likely want to understand bounce rates and time spent on pages.
If you're talking to customer experience, they want to know: What is the user experience going to be like?
They'd be looking at things like page responsiveness, NPS (Net Promoter Score), and other indicators to demonstrate this.
For stakeholders they want to understand limitations.
For example: will we be able to grow and meet customer needs?
Another way you can do that is looking at benchmarks and comparisons to competitors in the market.
An effective way to do this is showing it visually through a filmstrip, where you can see how the page loads and the performance impacts of this.
And another example here is how you could demonstrate benchmarks to competitors.
This is taken from SpeedCurve, so you can see that similar organizations' load times are compared against each other, to just give you an indication of where your organization could sit.
So at SafetyCulture, we conducted a bunch of different activities.
So as I mentioned earlier, we started with an audit.
Just to understand where we're at.
And then we documented those findings.
So looking at the problems that we found, as well as opportunities.
Following on from that we drafted a plan.
So, diving into particular areas that we felt that we could continue investigating or start to roll out performance improvements.
We also had senior leadership buy-in and that's quite key.
We had a director who was interested in understanding what was the front end performance across a few of our front-end repositories, and then dived in to understand the performance results.
From there, they also communicated this across other leaders in senior leadership.
And if you didn't pick that up from the earlier points: communication is key. Find different ways to communicate this across the group, so that you can drive engagement and awareness of performance.
As you're building up performance culture, it's important to also have a plan and figure out: Where do you start?
As a part of that, you will be identifying different areas to tackle.
So for example: What are the problems and opportunities that you could tackle?
Can you break down the bundle size?
What are the impacts of third party scripts and other things that you've found from your performance report?
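As a toy sketch of that kind of analysis (the asset names and sizes here are invented, not real bundler output), you could group bundle assets by origin and see how much weight third-party scripts contribute:

```python
# Hypothetical bundle report: (asset, size in KB, is_third_party)
assets = [
    ("app.js", 410, False),
    ("vendors~analytics.js", 280, True),
    ("vendors~ads.js", 350, True),
    ("styles.css", 95, False),
]

def size_by_origin(assets):
    """Total KB for first-party vs third-party assets."""
    totals = {"first_party": 0, "third_party": 0}
    for _, size_kb, third_party in assets:
        key = "third_party" if third_party else "first_party"
        totals[key] += size_kb
    return totals

totals = size_by_origin(assets)
```

In this made-up example the third-party scripts outweigh the first-party code, which would make them an obvious place to start.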
As a part of that, there'll be metrics that you want to focus on.
At SafetyCulture we've decided that Cumulative Layout Shift was something that we weren't going to focus on.
And it wasn't because it's not an important metric.
It's just that our pages are not organized in a way where content will be shifted down by other content.
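As an illustrative sketch of that kind of metric selection (the metric scores and weights are my own assumptions, not Lighthouse's or SafetyCulture's), a team-level performance score could aggregate only the metrics the team chose to focus on, leaving CLS out:

```python
# Per-metric scores from one performance test run, 0-100 (illustrative)
run = {"LCP": 72, "TBT": 64, "FCP": 81, "CLS": 99}

# Metrics this team focuses on, with assumed weights; CLS deliberately excluded
WEIGHTS = {"LCP": 0.4, "TBT": 0.4, "FCP": 0.2}

def team_score(run, weights=WEIGHTS):
    """Weighted average over the chosen metrics only."""
    return sum(run[m] * w for m, w in weights.items()) / sum(weights.values())
```

The point of the sketch is that the weighting encodes the team's decision about which metrics matter for their pages.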
So as a part of that, we also looked at pages that we could tackle.
So for example, landing pages, pages that were visited more often, or where there were more shared components, these were areas where we felt we could make an impact if we started picking up optimizations in these areas.
The other thing to pay attention to is the data that's coming out of performance reports.
So as you're adding more features in, it's important to revisit the results from these tests and see other areas that you could start diving into.
The other thing here is OKRs.
So something that we've been doing here at SafetyCulture is adding the performance score into the OKRs.
So teams can revisit every quarter and identify areas that they can start improving in.
As you're rolling out performance culture in your teams, the other thing you want to think about is: How will you drive change and keep that momentum going in your organization?
So a great way to do this is form a team around it.
Something that you can look at, for example, is finding the performance champions who can really champion the importance of performance and the value that it brings.
Finding the subject matter experts to guide others, for example, who will help do the investigations and who can help guide others in making improvements in their systems.
And remember, there'll be many steps and changes to drive some of these initiatives forward, and having a team helps with driving this.
I now want to touch on the examples of performance teams in other organizations.
So Etsy is a great example.
They live and breathe performance.
For example, they've got dashboards, they write up their findings, and they contribute to the tech community.
They really show others what performance is like.
Jumping into the example from earlier at eBay Classifieds Group - as I mentioned, this was an initiative that was a focus globally.
So there were different teams formed worldwide to look into performance and also bringing that back within the group.
For example, there was a slack channel where people would post questions and share their learnings as they were diving into performance.
A more recent example is at Yoox, where they have a Performance Guild that specializes in optimization; they have a great example on their Medium page where they talk about some of the optimizations they've rolled out.
At SafetyCulture, we have a framework called World-Class Engineering, which is a great way for engineers to bring initiatives to improve engineering, and front-end performance was one of those initiatives.
And now I'm going to pass over to Kevin to talk a little bit more about World-Class Engineering.
So, in a technology organization like ours, it's super important that you're always moving forward, changing, growing, learning, and adapting how you build software.
In the tech industry, if you're not moving forward, you're actually moving backwards, because the landscape around you is moving so quickly, and moving and changing almost on a daily basis, that it's necessary to continue to improve and continuously improve on a daily basis to keep up with that change.
So, the World-Class Engineering framework is a framework which brings people together around initiatives.
And through those initiatives, people across our teams look to make change in some aspect of how we run our teams and how we build software, or in how we connect with the greater software engineering community in the Australian market.
The engagement within the team is going to help you continue to build the performance culture and a great way to do that is knowledge sharing.
So whether it's through lunch and learns, workshops, documentation, or training, find the method that's going to suit your environment.
As you're working on optimizations and learning new things, share that back into the organization.
Another way you can do this is just putting this into the team processes.
So at SafetyCulture, we have something called "Ops Reviews" and that's a forum where the team gets together regularly to review the health of the system.
Looking at things like dashboards, incidents, things that have happened over a period of time, and identifying the optimizations and improvements that they can make.
As a part of that, the front end performance has become a focus for some teams to pull in as a part of that process.
Another thing that we've done is look at how we divide up the work so that teams find it easier to contribute to performance.
And I'm going to now pass over to Tom to talk a little bit more about how we did that.
One of the challenges that we have with really large-scale engineering challenges is: How do you break that up into small parts that software engineers can contribute to in a meaningful way?
And the analogy we discussed internally was: if you were at a party and they were serving a chicken dish, they wouldn't just get a roast chicken and put it on a platter and take it around to the people at the party.
They would beforehand prepare it out in the back and put it in a nice bite-sized dish that you could pick up neatly off the tray and eat and enjoy.
We have the same challenge with software engineers.
How do we take that large, large task and split it up into smaller pieces that they can just pick up in a sprint and do quickly without having to understand all the details of the greater piece of work?
Another thing that we did at SafetyCulture was look at how to automate performance tests.
So this was something that we ran manually and ad hoc.
For example, if someone wanted to understand the performance of a page, they'd ping someone to run a test and find out.
So this was something that we saw as an opportunity to take it away and automate this.
So then that way teams could focus on the data that was coming out of those reports and understand the trends, and then make it easier for them to find the opportunities to improve performance.
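A minimal sketch of spotting such a trend (the threshold and timings are illustrative): compare the median of recent automated runs against a baseline, and flag a regression for the team to investigate.

```python
from statistics import median

def regression(baseline_ms, recent_runs_ms, tolerance=0.10):
    """Flag when the median of recent runs exceeds the baseline by > tolerance."""
    return median(recent_runs_ms) > baseline_ms * (1 + tolerance)

# Illustrative page-load times (ms) from nightly automated tests
baseline = 1800
recent = [1950, 2100, 2040, 1990, 2075]
needs_attention = regression(baseline, recent)
```

Using a median over several runs, rather than a single result, keeps one noisy test from triggering a false alarm.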
So those were examples of how you could do this internally, but there's also a tech community that can benefit from this.
And a great way of doing that is writing up some of the findings and the learnings that you've been doing for your performance testing and performance improvements.
Something that's really important is remembering to celebrate the improvements.
So as you're working through things, celebrate the achievements and the hard work that the team's been working on, share that with the group so that everyone can see the progress that you're making.
It's also important to remember that this is a continual journey and will go through a series of different iterations.
Remember that you'll be learning about different ways you could be optimizing different metrics, learning how you communicate with others, and just remember that it's a continual process.
So that's the end of my talk.
I hope that there were some points in here that can help you with driving the performance culture in your environment.