Observability is for User Happiness

(audience applauding) - Thanks.

Oof, all right, thanks so much.

So this talk is Observability and User Happiness. Like Tim said, my name's Emily Nakashima.

I'm the Director of Engineering at Honeycomb and I've worked on web performance since about 2013 and I've worked on observability since about 2017. So first of all, just a quick question for you all. How many people think of web performance as, like, one of the reasons that they have their job? Okay, it's most of you, that's good, that makes sense. Probably could've seen that one coming.

How many of you think that observability is part of your job? Okay yeah, a few hands, pretty good, pretty good. How many of you have heard that word, but you're like not really sure if it applies to you? Okay, a few more, yeah, thank you for being honest. And then hands up if you're like, that word is nonsense, you just made that up. What does that even mean? All right, yeah, thank you.

Yes, like, what is this, a telescope conference? Observability, this is nonsense.

This is a fair response, this is what I thought when I first heard about observability, but I'm gonna try and convince you that we are all observability practitioners in addition to being performance practitioners. I'm gonna take just a few minutes to lay out the relationship between web perf and observability, I think it's useful to have it all out there, and then I'm gonna talk a little bit more about the applications, particularly around what our users experience when they use our apps. So does anyone recognize this map? Yes, that's right, it's a map of where developers come from. (audience laughing) Down here, you'll see that this is designers and product managers. We've got Frontend Island up here.

That's where the frontend developers live.

This is Fullstackville, somewhere in between, right? There's Web Perf Island, specialized very close to Frontend Island, but kind of its own thing.

Of course, the all-important Isle of Browser Vendors, small but mighty, and then we've got Backend Engineer Island, very close to the Island of Ops, getting closer every day I hear, and then finally off over here, there is Site Reliability Engineering Island, SRE Island, which is close to Ops Island, but people don't like to, you know, they don't really wanna talk too much about that. And so I found Web Perf somewhere right over here and Observability is coming from way over here and we know that somewhere there's this like secret tunnel inside of Google that connects SRE Island to Browser Vendor Island, but nobody will tell me more about it, and they won't let the rest of us in there, so we're gonna pretend that it doesn't exist. And the thing I noticed about these two communities is that we often talk about the same things, but we speak a slightly different language. So just to kind of put some definition around this, the term observability originally comes from control theory, so we say an observable system is one whose internal state can be deeply understood just by observing its outputs.

Charity Majors is my boss at Honeycomb and she's our CTO and she always has a way of saying things shorter and more memorably than I will and she says, "Can I ask new questions about my system from the outside and understand what is happening on the inside, all without shipping any new code?" So that's a really good thing to aspire to right there. Sounds good, right? We all want that, we wanna be able to ask any question about how our code is behaving, how our users are experiencing our site, and get an answer. And maybe some of us feel like we already have that. We have telemetry, we have metrics, we have Real User Monitoring, RUM.

We have something that looks a lot like that, but maybe we talk about that a little differently. So a really important thing I'm gonna do is talk to you about birds, yeah, to explain the relationship between observability and web perf, I wanna tell you a story about birds. On the left, you will see a red-tailed hawk, one of the most common birds of prey in North America. On the right, that is a peregrine falcon.

This is a common falcon found on nearly every continent, including here in Europe. They can often look very similar, so if you just take a second, try to list the things in your head that you can see in common.

Okay, like I can think of five right away, right? They both have dark brown feathers on their wings, they have the short, sharp beak, they've got a brown head, they've got like white feathers in the front with like these dark bars or spots, and they're both relatively large birds, like they're both about the size of the new MacBook Pro if you turn it on its side, that's like, yeah. The range of these birds overlaps in North America, so it's quite common to confuse them, but they're different in that the hawk is part of the family Accipitridae, which includes eagles, and the falcon is part of the family Falconidae, and they both used to be part of the order Falconiformes, which is falcon shapes, thank you, science. But then there was drama in the bird community like five or 10 years ago.

They started doing this genetic research and looked at bird genomes and what they saw was that the genome suggested that actually falcons are more closely related to parrots than they are to hawks. So someone I follow on Twitter, and I looked forever for this tweet and couldn't find it, called the falcons "evolutionarily convergent murder parrots" because they were like, you know, it's like just a parrot that's really good at killing things, which I found very charming.

So turns out these birds look the same, even though they're not closely related, because they evolved to do the same thing, which is hunt for small animals and birds from the sky. When you get really good at eating mice outdoors, apparently this is just what you look like. And I think that's just like web perf and observability. Right, so we've got our web perf practitioners, our observability practitioners, and like one thing that we notice when we look back at this definition is that it's not a product, it's not a job, it's a system property just like performance. When I say a system property, remember when your customer's browser is running like two megabytes of JavaScript that you wrote, the browser is part of your distributed system. So when I say systems, I'm not just talking about back-end systems. Like my customer's browser is running my software just like AWS is, right? Charity thinks a lot about systems engineering and she's always saying like, "Nines don't matter if users aren't happy," meaning having five nines of uptime for your servers is totally meaningless if your customers aren't having a good experience, and this is, of course, like the exact same arc that we've been on in the performance community over the past five or 10 years.

Customers don't care if your onload handler fires after one second if there's nothing on the page that they wanna look at, right? And as load time metrics have become less and less relevant to user experience, we've adapted and worked harder and harder to find things that really do correlate to a good user experience, like first meaningful paint, long tasks, you can see these APIs getting much more targeted around things users really care about. And it turns out that we built communities around caring about these specific system properties, but there's just this huge amount of overlap, and it turns out that we actually, like, just care about, we have a lot of the same problems.

We, oh yeah, lots of emotional energy going into the standards process, not sure how to balance my real job with what they think my job is.

(audience laughing) I think these numbers look a little wrong, I've heard that one before.

This is my team every day.

And, of course, just the thing that ties us all together is this care about UX, right? So if you take one thing away from this talk, I think it's that we should all join forces and go to conferences together, and I wanna leave you with this photo of 80 birds of prey on an airplane together. It's from a news story, it's a long, long story. Anyway, this talk: first we're gonna talk about birds for five minutes, done, you made it, no more. We're gonna talk about data models after that. I think that observability is kind of tied to a specific set of data models.

I think that's relevant to performance work that we do, and then I'd like to talk about the tools that we use to make sure we're focused on user happiness. I think that would be SLOs in the observability community, and then performance budgets for web perf folks, and finally I'm gonna talk about how we use this data to drive performance optimizations and a little bit about observability for UX design. So kind of looking forward to see what you can do with this data.

If I talk about anything that seems useful or interesting, all of the links from this talk and the slides are available at this bit link, so if you wanna go there, you can get all the bird photos and there's also just links to lots of articles and that kind of thing. And I really love the idea that like some ornithologist somewhere is gonna try and get that bit link and then be like so mad that someone has it for computer things, so please click on it, make me feel like it was worthwhile. All right, data models.

This is an area where the web perf and observability community, I think, have slightly different strengths. The web perf community is super sophisticated about data, like we're used to using all different kinds of data. We distinguish between RUM metrics and synthetic data, and we have really mature tooling.

There's great vendor tools out there that you can pay for and there's just also lots of great open source stuff like WebPagetest, Boomerang, Lighthouse, like a lot of Lighthouse is open source and we've spent all these years and years getting performance best practices built into our tools and frameworks that we use every day. So, Steve Souders has always been giving those talks about web performance best practices and it wasn't that like the average web developer got so much better at doing all the things in Steve's list. It's that people who worked on Bootstrap, Ruby on Rails, Webpack, all took the things that we were recommending in this community and built them into those tools, which I think is really cool.

Likewise, a lot of people in this community have been doing great work for years and years to get better specs for performance measurement in the browser, which I am super grateful for 'cause I use that stuff every day.

And working to get those APIs implemented in the browser or across the different browsers is really, it's a big effort.

The observability community's a lot younger. There's not that same maturity around tooling. Part of the goal of specification efforts like OpenTracing, OpenCensus, OpenTelemetry is to start to standardize the way that we collect some of this observability data so that we can start that same process that the perf community has already been involved in.

Working with framework makers to get better instrumentation hooks built into the code of those frameworks, so that we can all get better data out of them. But of course, we do both hugely care about data quality and data type and I think that's one thing that Observability really has right.

Events are the fundamental data unit of observability. And when I say events, I don't mean DOM events. For the purposes of this talk, I'm really sorry, that word has like nine different meanings. Just forget that you've ever heard of DOM events. Monday, they'll still be there.

So an event is a set of keys and values that describe a unit of work for your system. Oftentimes that work is an HTTP request in a browser app. It could be like a user interaction, so it could be like clicking on a button and then all of the event handling that happens after that. It really depends on what's important to your system. So you can kind of contrast that with logs. Like, you can log anytime.

You can console log from any point in your application. It doesn't have to represent a particular thing. You can log when you get a request.

You can log when you're idle.

You can log because you encountered a weird condition in your code.

You might even log twice for a slow operation, like once at the start and once at the end. In the event world, you don't get to do that. You have to be very disciplined about writing once per unit of work.

If an error happens, you encounter a weird condition, that's just metadata that you attach to the event rather than logging again.

And then you have to capture the duration of the event, so if the operation is slow, how long it took. They're gonna look something like this and you're like wait, that's just JSON, like don't try to tell me this is something new. Yes, you can represent events really well with JSON, and any metadata that's ever been relevant to a bug or incident at your company should probably be included. So this event has fields related to the HTTP request. There's fields about our environment, so that would be like the build ID, what environment are we in? Are we in production? And then there's a few fields that have user context, like the user email, user ID. These are like high cardinality fields that are really important to helping us understand who was experiencing what.
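
To make that concrete, a single event might be represented as a plain JavaScript object like the sketch below. The field names here are illustrative, not an actual Honeycomb schema:

```javascript
// A hypothetical observability event describing one HTTP request.
// Field names are illustrative; real schemas vary by tool.
const evt = {
  // Request fields
  "request.method": "GET",
  "request.path": "/api/datasets",
  "response.status_code": 200,
  "duration_ms": 312,
  // Environment fields
  "build_id": "2019-11-07-abc123",
  "env": "production",
  // High-cardinality user context
  "user.id": 42,
  "user.email": "user@example.com",
};

// The whole thing serializes cleanly to JSON for sending over the network.
const wire = JSON.stringify(evt);
```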

I'm gonna use that term high cardinality a few more times, so I'm just gonna clarify what I mean by that. A low cardinality field is like HTTP status, it's a field that has few unique values, so your app might only use like a half-dozen different status codes.

You can contrast that with a high cardinality field, which is gonna have many unique values, so some of the most common examples in the apps that a lot of us work with might be things like email address, IP address, user agent.

Feature flags are a really interesting one because feature flags are, of course, not high cardinality by themselves, they're like a boolean value, but oftentimes you wanna look at the metrics of all the different feature flags combined together and if you wanna try and like tag your metrics with that, what you end up with is something that's pretty high cardinality.
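
A quick way to see the low- versus high-cardinality distinction in code, using made-up events:

```javascript
// Cardinality of a field = how many unique values it takes across events.
function cardinality(events, field) {
  return new Set(events.map((e) => e[field])).size;
}

const events = [
  { "response.status_code": 200, "user.email": "a@example.com" },
  { "response.status_code": 200, "user.email": "b@example.com" },
  { "response.status_code": 500, "user.email": "c@example.com" },
];

cardinality(events, "response.status_code"); // low cardinality: 2 unique values
cardinality(events, "user.email");           // high cardinality: one per user
```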

The more specific these are to your domain or application, the better, so for us, we tag all of our events with the like plan type that the customer is on. We wanna know if they're on the free plan, or like they're doing a super fancy like paid proof-of-concept.

We care about all of our users, but we might deal with the requests differently, depending on who it's coming from.

So this is the data that's gonna let us see who is having a good experience, who's having a bad experience and often guess at why. There's a lot of people in the observability community that like to talk about the three pillars of observability, logs, metrics, and traces.

And actually, Cindy Sridharan, if you follow her on Twitter, she's @copyconstruct, she's one of the best writers about observability and she has great content about the three pillars in her book that's linked to.

But I wanna stress, like observability isn't about tools, it's not a list of products you can buy, so you don't have to use all these things to be an observability practitioner, right? The same goes for performance tools.

Like there's so many great tools out there and if you watch this talk and you're like I can do all that. Like great, you can just check this box and be like yes, I am doing observability for user happiness, so I don't wanna convince you to change your tools or your instrumentation.

The goal is to get to this place, where you can answer questions about your users' experience, and not where you buy a specific set of tools, so you don't need all this if you don't want it. I am a big believer personally in the power of that event data model for everything. You could derive all those pillars from events, if you want to.

So logs, obviously: you can just take that JSON object and it's a log.

I should say I'm a big fan of structured logs, so let me tell you, the top log there is a structured log and you can see that it's a set of keys and values.

The bottom is a log that represents the same thing. That's an unstructured log, and the difference between those two is that you can pipe that top one into all kinds of tools to do great analysis. That bottom one, you can grep that with a regex, and we know that regex makes people cry, so you can decide which one is better for you. Anyway, so events can be logs if you're disciplined about how you structure your logs. You're gonna add quantitative data to these events, so you can derive metrics from those events by looking at the contents of those fields. And then, if you think about traces, about like distributed tracing tools, they're actually just a set of events that have metadata that tells you how to arrange them in like parent-child relationships, right? So we call those events spans when they appear in a trace, but they're the same thing.
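
As a sketch of deriving metrics from events, here's a toy percentile calculation over the duration_ms values of a batch of events; it's illustrative only, not any vendor's implementation:

```javascript
// Derive a simple metric (a percentile of duration_ms) from event data.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  // Nearest-rank method: the smallest value covering p% of observations.
  const idx = Math.min(
    sorted.length - 1,
    Math.ceil((p / 100) * sorted.length) - 1
  );
  return sorted[idx];
}

// duration_ms pulled out of six hypothetical events.
const durations = [120, 95, 300, 87, 2400, 150];
percentile(durations, 50); // 120
percentile(durations, 95); // 2400 (that long-tail slow request)
```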

And, of course, you hear that and you go wait. Actually, if traces are just events with special fields that describe their relationship to other events, like maybe we should just use traces? And yes, yes, you should.

That's actually what we do for our browser app. We started with just an event data model and we switched to tracing and it's been great. We have a React app with a Go backend and we wrote our own custom tracing code because it was before some of these things were ready, but I think that if you're getting started now, I would leverage one of these open API standards. If you haven't heard of OpenTelemetry yet, this is the merger of OpenCensus and OpenTracing. They kind of looked at those two open standards and got the best of them and put them together. And there are vendor options in the space that are good too. We did have to do the instrumentation ourselves. That is, we had to look at our React app and figure out what to measure and how to measure it.
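
The talk doesn't show the custom tracing code, but the core idea, that spans are just events carrying trace and parent IDs plus a duration, can be sketched in a few lines. This is hypothetical illustration, not Honeycomb's actual implementation:

```javascript
let nextId = 1;
const newId = () => String(nextId++);

// A span is just an event with trace/parent IDs and timing attached.
function startSpan(name, { traceId = newId(), parentId = null } = {}) {
  const span = {
    name,
    "trace.trace_id": traceId,
    "trace.span_id": newId(),
    "trace.parent_id": parentId,
    start: Date.now(),
  };
  return {
    span,
    // Child spans inherit the trace ID and point at this span as parent.
    child(childName) {
      return startSpan(childName, {
        traceId: span["trace.trace_id"],
        parentId: span["trace.span_id"],
      });
    },
    // Errors and weird conditions become metadata on the one event,
    // rather than extra log lines.
    end(fields = {}) {
      span.duration_ms = Date.now() - span.start;
      Object.assign(span, fields);
      return span;
    },
  };
}

const root = startSpan("page_load");
const req = root.child("fetch_data");
const reqSpan = req.end({ "response.status_code": 200 });
const rootSpan = root.end();
```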

There are open source projects to do some browser tracing instrumentation on GitHub. There is, unfortunately, not yet a super popular standardized library that gets great traces out of your browser app without a lot of work that's like available to everyone, so there's no like Boomerang for browser traces yet, but one day, I'd like to create one.

If you wanna see example code, there's some links in there. Someone named Ryan Lynch also put together a really nice demo app that is runnable, if you wanna try and like play around with this yourself. In this super simple case, one of our browser traces might look like this. Each of these rectangles is a span or an event. The top span is the page load measured from the navigation start to the load event. And then, you can see that this page made a request for data, that's that next line. The user clicked a button, and then that button actually navigated them away from the page.

That last thing is the unload event.

Super simple.

If we think about like how to make generalized tracing instrumentation, what we're doing is capturing a span on page load, so that's always our boot span and we're using the navigation timing to figure out how to draw that.

Whenever we do a single page app navigation, like a lot of folks use React now, and so that's actually a really significant user interaction, so we make sure we capture that.

Any change to the URL with a history API, that's a span. We wanna capture meaningful user interactions like clicks. We wanna take note of errors.

We also use a dedicated error monitoring tool, but seeing these in the context of everything else that's going on the page is really useful, and then we capture the page unload, including how long it took 'cause we wanna know if there's slow unload handlers or anything like that.
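
The boot span described above might be derived from a Navigation Timing entry roughly like this. The entry is mocked here; in a real browser it would come from performance.getEntriesByType("navigation")[0], and the span field names are illustrative:

```javascript
// Turn a Navigation Timing entry into a page-load ("boot") span.
function pageLoadSpan(entry, url) {
  return {
    name: "page_load",
    url,
    start: entry.startTime,
    duration_ms: entry.loadEventEnd - entry.startTime,
    "timing.response_end_ms": entry.responseEnd - entry.startTime,
    "timing.dom_content_loaded_ms":
      entry.domContentLoadedEventEnd - entry.startTime,
  };
}

// Mocked entry; in a browser: performance.getEntriesByType("navigation")[0]
const mockEntry = {
  startTime: 0,
  responseEnd: 420,
  domContentLoadedEventEnd: 900,
  loadEventEnd: 1500,
};
const bootSpan = pageLoadSpan(mockEntry, "/home");
```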

And then, the trace obviously, shows this user's whole experience for this particular page view. But, of course, remember that there's a lot of metadata on each of those events, so you're pulling in things like... Obviously, everything has to have a type and the duration. The names of these fields are slightly different depending on which open standard you're using, but they all kind of have the same idea. So in the page load case, I'm gonna add things about that actual page, the navigation timing data, the user agent, the device type, the window dimensions, and then like for that button click span.

I'm gonna add things about that user interaction. So it could be like which button they clicked on the page, where, what it was supposed to do, and then how long it took from clicking the button to like handling the event to like rendering something onto the screen. And, of course, all these tracing tools will also let you look at the page load event and graph across multiple events.

So we can take that same event and go what was the average page load time for the whole site? So this is like a whole year of page load times. You can see with that vertical stripe pattern that people really don't use the product very much on the weekends and that most of our page loads are under three seconds, but there's kind of this long tail of slow requests. These can also get really complicated, so we have pages with tons and tons of user interaction. So this is a series of complex interactions with the product kind of shown in the trace.

These are interesting as one-offs, but they're actually really good for answering like larger behavioral questions that we all wanna know that we've been talking about for years.

So you might say let's look at my users who had like the fastest interactions with this form and the users who had the slowest interactions with this form, like how many converted on this page view? How many bounced? Is there a correlation between having that slow interaction and bouncing? I know that we've all been trying to do that different ways for a long time and this model supports that really well.
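
That kind of question can be posed directly against raw event data. An illustrative sketch, splitting page views into fast and slow cohorts and comparing bounce rates (the field names and threshold are made up):

```javascript
// Compare bounce rates for fast vs. slow interactions.
// Each event is a page view with an interaction duration and a bounced flag.
function bounceRateByCohort(events, thresholdMs) {
  const rate = (list) =>
    list.length === 0 ? 0 : list.filter((e) => e.bounced).length / list.length;
  const fast = events.filter((e) => e.duration_ms <= thresholdMs);
  const slow = events.filter((e) => e.duration_ms > thresholdMs);
  return { fast: rate(fast), slow: rate(slow) };
}

const views = [
  { duration_ms: 80, bounced: false },
  { duration_ms: 120, bounced: false },
  { duration_ms: 1400, bounced: true },
  { duration_ms: 2100, bounced: true },
  { duration_ms: 1900, bounced: false },
];
bounceRateByCohort(views, 500); // fast cohort: 0 bounces, slow cohort: 2 of 3
```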

So I wanna just clarify that there's another convergent evolution thing here, like these do look a lot like what you see in the network panel of the browser developer tools, but they're actually pretty different.

First of all, the network panel is just focused on network data.

It's also just one person's data, usually your own data, and it's focused on just requests with a few key events, page load, DOM ready, thrown in there. Distributed tracing is gonna be RUM data, right? So we're capturing real user metrics about their experience in production, and we don't wanna collect up quite so much data. We have to send it over the network, and if anyone has ever downloaded a HAR file from their developer tools, they know how big those get. They're often many, many megabytes.

And we, obviously, don't wanna send a bunch of PII or user content, either.

We wanna use these for debugging, so we want metadata rather than data.

These won't always be enough to figure out why your user is having a good or bad experience, but you generally get a lot of detail about what's going on in their individual page views. There is some overlap here with like session replay products that are available now from some vendors.

And I would also say if those are working for you and you feel like you can do this kind of stuff, great, like I said, this isn't a talk trying to convince you to pick a specific tool. It's about being able to answer questions.

As a team, the next thing we wanna add is being able to do instrumentation around render metrics on our React App.

We've tried a couple of different ways to do this. We try to capture both the React reconciliation time, and then also how long it takes to flush those changes to the DOM and recalculate styles and layout. And that async nature of how React handles that is really great for user experience, but it also makes it really hard to figure out which code was connected to which layout.

So Conrad Irwin of Superhuman recently wrote up this really nice post kind of digging into all the complexity there.

And if you're interested in that, I would definitely recommend checking that out. One of the best things about these two communities is that we have come up with really disciplined ways to hold ourselves to these numbers that we think users care about.

So this is another fun example of where these communities have two slightly different approaches to the same thing. In the web performance community, we've done such a nice job with performance budgets. There's a lot of good tools out there for these. Like I said, Lighthouse has built-in support. And if you use Webpack, you can use the performance hints feature and actually build compliance into your JavaScript build, so your build can actually fail if you go over budget and you can go like wait, wait, we can't deploy to production, we have to go figure out why I just committed another copy of Lodash by mistake, right? So that's the docs for that Webpack feature. We use a performance budget ourselves at Honeycomb. To do ours, we actually take data from the traces that I showed you.

So one of the pieces of data we collect on page load is the size of the CSS and JavaScript on the page. We use the resource timing API to calculate that and then we beacon it back with the trace, and then we can do this top graph.

That's the CSS size uncompressed.

The bottom graph is the JavaScript size and then whenever those two values go over a set number, which I've kind of drawn on there with the dotted line, I get an email alert and then I know to go like schedule time in our project management tool to make sure we work on performance.

I like the system 'cause it's really simple. It took maybe an hour or two to set up the whole thing and get it deployed to production.
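
That budget check might be sketched like this: sum uncompressed asset sizes from Resource Timing entries (mocked here; in a browser they'd come from performance.getEntriesByType("resource")) and flag anything over budget. The thresholds and entries are illustrative, not Honeycomb's real numbers:

```javascript
// Sum decoded (uncompressed) bytes per asset type and check against budgets.
function checkBudgets(entries, budgets) {
  const totals = {};
  for (const e of entries) {
    const ext = e.name.split(".").pop(); // "js", "css", ...
    totals[ext] = (totals[ext] || 0) + e.decodedBodySize;
  }
  const overages = {};
  for (const [ext, budget] of Object.entries(budgets)) {
    if ((totals[ext] || 0) > budget) overages[ext] = totals[ext] - budget;
  }
  return { totals, overages };
}

// Mocked entries; in a browser: performance.getEntriesByType("resource")
const mockEntries = [
  { name: "/assets/app.js", decodedBodySize: 900000 },
  { name: "/assets/vendor.js", decodedBodySize: 700000 },
  { name: "/assets/app.css", decodedBodySize: 180000 },
];
// JS budget 1.5 MB, CSS budget 250 KB: JS is over, so trigger the alert email.
const result = checkBudgets(mockEntries, { js: 1500000, css: 250000 });
```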

The observability community uses something different. They use SLOs, or service level objectives, for this purpose and they are a different beast entirely.

I think the best way to get a handle on exactly what these are is to read all three relevant chapters in this book. So, there's a lot to digest to put in your head. There's also a really good video that I link to that will kind of sum up how to go through this process yourself, but the short version is you sit down with all of your business stakeholders and you have a conversation about all of the facets of your system behavior that your customers care about.

So, you're gonna talk about latency, you're gonna talk about errors, you're gonna talk about everything.

You ask your stakeholders and your engineers what are our service level indicators? So what do we measure to understand how we're doing? What are the numbers that our users really care about? Maybe it's load time, maybe it's errors.

And then you say okay, we're gonna measure that. And then you say like what are our service level agreements? Is there anything that we've promised to our customers? This is usually a really unimpressive number because it's written into a contract somewhere and you have to give back money if you miss it. So maybe we said we really care about the response time of buyback requests and we've promised our customers it'll be under 10 seconds 99% of the time.

But we know that that number doesn't actually sound very good, like that's actually a pretty bad experience. So we're also gonna come up with this service level objective, which is the number that we think is actually gonna keep our users happy, like the target that we wanna have.

So we maybe say like response time should be under one second 99.9% of the time. An interesting, like, cultural difference between performance budgets and SLOs.

For performance budgets, my sense is that people just always wanna be under their budget.

For SLOs, we wanna be under the number, but we don't wanna shoot for being as far under the number as possible.

If we're too far under, it probably means we're spending time on performance and reliability that our business would really rather us spend on new features or tech debt or something else. A really cool detail that I really liked from the SRE book is that Google has actually done planned downtime for systems that are beating their SLOs by too much. They don't want internal engineers to start to assume that those systems are more reliable than they're designed to be, so they've actually, like, turned some systems off with a little bit of advance notice to make sure that the other systems that rely on them have built-in handling for when that service is down. That was just like mind-blowing to me.

And, of course, users of SLOs typically expect to get alerts when they're close to exhausting their budget. These are called burn alerts.

So we wanna get an alert when we have like three days of budget remaining, or when we've started going through our budget at a faster rate, so we can schedule remediation. And then, of course, we wanna get like a higher priority, like pages when we're about to run out.

So if we have an hour of budget remaining, I wanna get paged, so that I can take emergency measures. And you can see where this is going.
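
The arithmetic behind error budgets and burn alerts is simple enough to sketch. The thresholds here are illustrative, not anyone's production values:

```javascript
// How much failure does an SLO target allow over a window?
function errorBudgetMinutes(sloPercent, windowDays) {
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloPercent / 100) * windowMinutes;
}

// Project hours until the remaining budget is exhausted at the current
// burn rate, then map that to an alert level.
function hoursUntilExhausted(budgetRemaining, burnPerHour) {
  return burnPerHour <= 0 ? Infinity : budgetRemaining / burnPerHour;
}

function alertLevel(hoursLeft) {
  if (hoursLeft <= 1) return "page"; // about to run out: emergency measures
  if (hoursLeft <= 72) return "warn"; // ~three days left: schedule remediation
  return "ok";
}

errorBudgetMinutes(99.9, 30); // ~43.2 minutes of failure allowed per 30 days
alertLevel(hoursUntilExhausted(43.2, 0.2)); // slow burn: plenty of time
alertLevel(hoursUntilExhausted(43.2, 50)); // fast burn: page someone
```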

Like there's a lot of moving parts here.

So another hawks versus falcon scenario.

The superpower of performance budgets is we've tried really hard to make them easy. People like yourselves have built really great tools for them, it's easy to get started, they're easy to understand.

The trouble that I've actually honestly had with performance budgets is that they're so easy to set up that like you often don't take the time to really get that good business buy-in.

So a lot of times, you're like oh, we're over our performance budget, and you go like try to talk to your product manager and they're like, but you've scheduled this feature, like go work on that.

SLOs, like part of the hard part of getting started is having those conversations, but it means that you have that business buy-in up front. They're also extremely flexible, so you really can just set them up to measure anything that your company or your users care about. For us, we might measure the query response time, but we also might measure the durability of query results, or the correctness of query results, like we really can do anything.

And, of course, that all means that building support for SLOs within your tooling is really hard.

There's no tool that I can tell you to go download right now that will make it really easy for you to track these. We have a blog post in the links that is about what it took for us to do this for ourselves internally, and that was really a process that took months, honestly like months to get set up and get started, but it's substantial.

So I hope that as the observability community matures, we can take all those lessons from performance budgets and apply them to SLOs and help people get started more easily and have just a better experience of being able to set them up.

So, of course, none of this data matters if it's not actionable, so I wanna talk about how we use this data to do performance optimization.

My company is small.

We have eight software engineers.

Our web application traffic is pretty low, so we're not gonna do the kinds of like niche optimizations that like a top 100 site might.

And we care about the experience of basically every individual customer on the site. So I wanna talk about one case where good observability was super useful to us and then one case where it didn't really help when we kind of went back to our older set of tools. So we have a regular product, where customers send us data from their systems and then they query it and then we show it to them in the UI.

So that's the customer system on the left.

They send us data to our API and then their browser, that's that little square at the top that is gonna request it from our UI.

Very simple, great.

There's a more secure version of the product that some people use, so those customers use an encryption proxy. That's those boxes in the middle; there's two of them just for redundancy, 'cause we have lots of ops people on our team and they want both of them on the diagram. The proxy encrypts all their data, so what goes out to us, the vendor, is only encrypted data; we can't see it.

And the interesting thing about this product, or this architecture, is how users' browsers interact with it.

The laptop on the right is actually now gonna make two different requests: one request to our UI server to get that encrypted data, and then also a request to the proxy that's running inside their own VPN to be able to decrypt that data. And then they'll actually be combined together, like the UI will be combined in the browser with the decrypted content.

So a little bit of an interesting architecture. It's not super common out there.

And we had the problem where this experience was slow for just one team just on the secure architecture. And it was the kind of thing that we couldn't reproduce ourselves, but they were like no, it's really bad, like you really need to fix it.

So we wanted to solve it for them.

We looked at various loading metrics and other performance metrics that we track, and we couldn't see any difference in request time. This is the top 50 customers broken out; I think one of the pink lines is this customer, but the point is that you can't really tell, and it doesn't really matter.

This is a log scale, so things are a little more closely clumped on the y-axis, but there really wasn't too much of a difference at either scale.

So we started looking at their traces, and what's going on just jumps out at you right away.

There's literally only two customer interactions on here: one click to do a request for data and then another click, and you can just see that there's request after request after request coming out.

What's happening is you can actually see that stair-step pattern, two requests at a time. Because they're on this secure architecture, they're making more requests, like double the number of requests of an average customer, and they were on the high end of total amount of data, so they were just trying to process a lot of data that way. So we zoomed out on this page, counted the number of requests per domain, and realized that we were actually hitting the browser limit of concurrent requests per domain, which is six requests at a time.

We could have done domain sharding, but it turned out that there was an easier solution, which was just to be smarter about batching these requests together. And that was enough to bring this team's performance down to being about as fast as everyone else's. I should also say you could solve this problem with HTTP/2; that would also help. But this was the kind of case that might have been hard to spot with our old tools, and all of a sudden it was super easy, and it made a huge difference for this customer. Embarrassingly enough, we also ran into that exact same problem with our instrumentation code. It can send a lot of data.
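The batching fix can be sketched roughly like this. HTTP/1.1 browsers typically cap concurrent connections per origin at around six, so many small fetches queue up; collecting them into one call avoids the queue. The batch endpoint, payload shape, and thresholds here are hypothetical:

```javascript
// Collect individual data queries and send them as one batched request
// instead of one fetch each, so we stop queueing behind the browser's
// ~6 concurrent connections per origin. The sendBatch callback would be
// something like: (batch) => fetch('/batch', { method: 'POST',
// body: JSON.stringify(batch) }) against a hypothetical endpoint.
function makeBatcher(sendBatch, { maxSize = 20, delayMs = 10 } = {}) {
  let pending = [];
  let timer = null;

  function flush() {
    if (pending.length === 0) return;
    const batch = pending;
    pending = [];
    clearTimeout(timer);
    timer = null;
    sendBatch(batch);
  }

  return {
    // Queue a query; flush when the batch is full, or after a short
    // delay so stragglers still go out promptly.
    add(query) {
      pending.push(query);
      if (pending.length >= maxSize) flush();
      else if (!timer) timer = setTimeout(flush, delayMs);
    },
    flush,
  };
}
```

The small delay trades a few milliseconds of latency for far fewer requests, which is exactly the trade that brought this customer's page in line with everyone else's.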

We actually saw it slow down performance when we first rolled it out.

We started sending those request spans and the lesson was the same.

Batch the requests together, so we now send one request for every 20 or so spans. And, of course, we use the Beacon API to make sure that they get sent on page unload.

And finally, we do sometimes do non-trivial operations to calculate some of this data that we wanna collect, so just make sure that that's wrapped in a requestIdleCallback or setTimeout, so that you're not getting in the way of handling user interactions.
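Those three tips fit together in a small sketch. The `/telemetry` endpoint is an assumption, and the batch size of 20 comes from the talk; the browser APIs (`navigator.sendBeacon`, `requestIdleCallback`) are guarded so this degrades gracefully:

```javascript
// Ship a batch of spans. sendBeacon queues the payload even while the
// page is unloading, which plain fetch doesn't guarantee. The /telemetry
// URL is a placeholder.
function shipSpans(spans, url = '/telemetry') {
  const body = JSON.stringify(spans);
  if (typeof navigator !== 'undefined' && navigator.sendBeacon) {
    navigator.sendBeacon(url, body);
  }
}

// Buffer spans and flush every batchSize, mirroring "one request for
// every 20 or so spans". transport is injectable for testing.
function makeSpanRecorder(transport = shipSpans, batchSize = 20) {
  let buffer = [];
  return {
    record(span) {
      buffer.push(span);
      if (buffer.length >= batchSize) {
        transport(buffer);
        buffer = [];
      }
    },
    flush() {
      if (buffer.length > 0) {
        transport(buffer);
        buffer = [];
      }
    },
  };
}

// Non-trivial attribute computation goes in an idle callback (with a
// setTimeout fallback) so it never blocks user interactions.
function whenIdle(fn) {
  if (typeof requestIdleCallback === 'function') requestIdleCallback(fn);
  else setTimeout(fn, 0);
}

// In the browser, flush whatever is left when the page goes away.
if (typeof addEventListener === 'function') {
  const recorder = makeSpanRecorder();
  addEventListener('pagehide', () => recorder.flush());
}
```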

Here's an example of something where tracing didn't help. We had this nice big data table in our UI that shows customer provided data.

So it's a view of all their event data.

It can be arbitrarily wide because customers can send us events with as many fields as they like.

So if they send us 2,000 different field types, we might have 2,000 columns on this table.

And we had some customers report that it was really slow. People would go load this table, they'd like click that tab and then they would just stare at this spinner for seconds and then they would be like I think something's going wrong, so they would try and click something else and then the browser would be like no, like you're just gonna sit here.

It would become unresponsive.

We love new performance APIs, so we were like yes, we have this great thought: we're gonna instrument with the Long Task API, which is relatively new.

This is a great idea.

We are still tweaking this instrumentation a little bit, and we do wanna do that; it's gonna be really interesting. But we thought about it a little more and there's two things that kind of don't line up here. One, our instrumentation really thrives on context, and we can't actually capture a lot of context with this API; you kind of know where that long task is running within the browser, but that's it. We could count up the total duration of long tasks and see if we had more or fewer of them, and we could see if they always occur after the same button click or something like that. Two, this is a case where we didn't actually need our tooling to show us where things were going wrong.
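A minimal version of that Long Task instrumentation might look like this. As the talk notes, the entries carry little context, basically a duration and coarse attribution, so about all you can do is summarize them and attach the summary to events you were already sending. The wiring is hedged for environments without `'longtask'` support:

```javascript
// Summarize Long Task API entries into the aggregates the talk mentions:
// how many long tasks, total time, and the worst offender.
function summarizeLongTasks(entries) {
  return {
    count: entries.length,
    totalMs: entries.reduce((sum, e) => sum + e.duration, 0),
    longestMs: entries.reduce((max, e) => Math.max(max, e.duration), 0),
  };
}

// Browser wiring: collect 'longtask' entries as they happen. Guarded and
// wrapped in try/catch because not every environment supports this type.
if (typeof PerformanceObserver !== 'undefined') {
  const tasks = [];
  try {
    new PerformanceObserver((list) => tasks.push(...list.getEntries()))
      .observe({ type: 'longtask', buffered: true });
  } catch (e) {
    // 'longtask' isn't supported here; instrumentation is best-effort.
  }
}
```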

We had full reports, we could reproduce the issue, so this is really cool, we're still gonna set it up. We don't need it in this case.

We just went back to the handy-dandy browser dev tools that we know and love.

So here's the Chrome performance tools.

I bet you can spot what's going wrong there. We spent 10 seconds processing data and updating state in React many more times than we needed to.

That's that yellow part.

And then we're spending five seconds, and then three seconds, and then one second recalculating all the styles and doing layout; that's that purple section. And it turns out the TL;DR is that if you let your browser lay out 50,000 DOM elements of unknown height and width, and you want that to all be in a flexible layout, that's gonna be really slow, so don't do that. We gotta go back and virtualize this table and use fixed heights and widths.
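The fix mentioned, virtualizing with fixed heights, works because fixed row heights turn "which rows are visible" into simple arithmetic, so the browser only ever lays out a few dozen rows instead of 50,000. This is a generic sketch of that windowing calculation, not Honeycomb's actual table code:

```javascript
// Given the scroll position and a fixed row height, compute which rows a
// virtualized table actually needs to render. Overscan adds a few extra
// rows above and below the viewport so fast scrolling doesn't flash blanks.
function visibleRange(scrollTop, viewportHeight, rowHeight, rowCount, overscan = 5) {
  const first = Math.max(0, Math.floor(scrollTop / rowHeight) - overscan);
  const last = Math.min(
    rowCount - 1,
    Math.ceil((scrollTop + viewportHeight) / rowHeight) + overscan
  );
  return { first, last, rendered: last - first + 1 };
}
```

Even for a 50,000-row table, the browser only touches the handful of rows this range returns on each scroll frame.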

But the point is just observability is really good for us to be able to ask any question about our system, but sometimes we don't actually need it, and sometimes our existing tools are better for figuring out how to actually fix a problem. Like we can know that something is slow, but we have lots of tools that are in our toolbox to figure out how to actually fix it.

Finally, I wanna talk a little bit about using this kind of stuff to look forward and figure out how to do the right things for the user as you kind of approach design and approach thinking about what your app should look like, what it should feel like.

We all add instrumentation because we hope it will help us catch problems, but it's certainly not only useful for debugging. So I wanna tell you the story of how we did this. We've had this sidebar on the right side of the page for a long time. It shows a historical list of each query that you've ever run in the product, and we found that users didn't always know what it meant, or really wanna use it.

They were just kind of like, what is that thing over there? Why does it have so many lines in it? So we decided to think about redesigning it, and our designer was like, what do we wanna change about it? Would users maybe rather just have that whole thing go away and have more space to see whatever the main visualization is in the middle of the page? And the cool thing about our tooling is that even though we hadn't thought about this question ahead of time, we could answer it really quickly, within a couple of minutes, just 'cause we had the data there already. So one piece of data we were already capturing was our users' browser window heights and widths and their screen heights and widths, so the height and width of the window, but then also the screen around it.

So we heat map those and then we took that same data and we used it to make one more calculation, which is what is the percentage of the screen that that window takes up? So you can see that like the kind of densest clustering is right about at that 90% line.

And, in fact, the median value was a little over 85%. Looking at what was on screen, I noticed that in most cases browser and operating-system chrome tend to take up like five to 8% of the screen on a display of roughly laptop size. So something like 94% is the value that you're gonna get when the user has expanded their window all the way and they're not using full-screen mode.
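The derived field behind that heat map is a one-liner. In the browser you'd feed in `window.innerWidth`/`window.innerHeight` and `screen.width`/`screen.height`; the field names here are illustrative:

```javascript
// What fraction of the physical screen does the browser window occupy?
// This is the "percentage of screen" calculation from the talk, computed
// per event so it can later be sliced by page, customer, etc.
function windowScreenPct({ winW, winH, screenW, screenH }) {
  return Math.round(((winW * winH) / (screenW * screenH)) * 100);
}
```

Capturing this on every event, rather than as a site-wide analytics aggregate, is what later made the per-page breakdown possible.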

So this tells us that most users are making their windows as large as possible to view these graphs.

And there's a point of nuance here that I don't wanna get missed, which is that all this metadata that we collect on our events is there so that we can arbitrarily slice across these. So Google Analytics will also tell you like what screen sizes do people use to view your site, which is a great fact to know, but it's not really what you wanna know in this case. You wanna understand what window sizes users are using on average and then you wanna understand how it relates to the size on the specific page you're redesigning.

So what we found was that once we broke it down by page type, it comes through really clearly.

The window size for this page is the highest average window size for any page on our entire site by about 30,000 pixels.

So that suggests to me that users are like making sure that their browser window was extra large just to view this page.

So knowing that, that helps us understand how to think about the sidebar problem.

We decided to make the thumbnail graph slightly wider 'cause we were like people wanna see a bigger graph, so they probably want that everywhere.

And then, of course, we also chose to make it collapsible to get out of your way.

So if you really wanna see that graph bigger, you can just collapse it, and then here, a giant graph. And I think that this is a nice design intervention driven by real user data, because it was so quick and it made sense once all the pieces came together.

I don't wanna talk you out of doing user research or user testing.

Those are still the gold standards and actually after we did this, we took the time to talk to some users and make sure that what we were hearing from them matched what the data was suggesting.

But, it's really nice to be able to use data to validate your hypotheses.

So to close the loop, we looked back and asked ourselves: did this decrease the percentage of users who made their windows larger on this page? It did a little bit.

Not enough to bring it down to match all the other pages, so we know there's a little more work to do, but it did seem to help a little bit. So finally, I think a lot of people hear this and they're like, yeah, this is kind of weird, you're doing this thing that looks like something else I've seen before. And then this is the natural question that people start to think about.

You have this like performance data that has all of this user context.

Are you not just doing product analytics? And I think the answer to that is, yes, maybe we are. As we've shifted more and more towards user-perceived performance metrics, we all have to understand our user experience with increasing sophistication.

Yeah, I think there was a time when developers could be pretty far removed from the user experience and still do a pretty good job, and I think that's going away, especially as our apps get more sophisticated and more interactive. I've noticed that more teams are starting to use the term product engineer instead of full-stack engineer or front-end engineer for people who build the actual core products that customers use. And I really like that it acknowledges that we are making engineering decisions that really deeply affect how users experience our products. This is a theme that we've heard a few times today, and I think it's really true.

We really are so much closer to the user than we used to be.

And so, I welcome our gleeful future.

We're not siloed off into DevOps people and performance people and front-end engineers and back-end engineers.

We're all product engineers that have these different specialties that we bring together and together we're all working on building this great user experience.

And, of course, the slide that you've waited 35 minutes for. If you do wanna be able to tell these two birds apart, yes, the red-tailed hawk has these very round, broad wings. You often see the feathers kind of splayed out at the end, almost like fingertips.

They have a shorter, wider tail.

Sometimes it looks red, it doesn't always... And then if you look at the peregrine falcon, it has wings that look much pointier.

They come to a point at the end, but they also have that little kind of bump shape. And then the tail is longer and narrower, and that is because they are designed to dive out of the sky at high speed to catch small animals.

So look for those pointy aerodynamic shapes and then you can be pretty sure you've identified a falcon instead of a hawk. And that is the second most important thing I want you to take home from this talk.

The first most important thing is that we're all working on user experience now. Thank you very much.

(audience applauding) - Thanks, Emily.

Do you have time for a few questions? First off, are we the hawks or the falcons? - (laughs) I spent a long time thinking about that actually. And then I was like, this is gonna be really polarizing, like I can't-- - Maybe. - I really like the phrase murder parrot.

- Me too, that was pretty cool.

- I think that that has to go to the observability community 'cause like they're newer and it's just like out of nowhere, surprise, murder parrot! - All right, that's fair.

All right, we'll allow it, I guess.

- Well, now we're all observability practitioners. - We're all murder parrots? - Yes. - Yeah.

(laughs) More statements I never expected to hear her say. Okay, so your team, how many people did you say are on your team? - We have eight engineers and one designer. - Okay.

So that's engineers from the entire product and stuff like that.

The data analysis, right? 'Cause when you're talking about this kind of stuff, collecting that data but then you really need to be able to analyze it and take the time to understand it and to make sure that it's telling the story and how to read it and decipher it.

Is that also done by your team directly? - That is.

I think that is the place where you don't wanna talk too much about the product that you work on, and so, there's a component of what we build that we also sell to customers that we use for that, but I also think there are great open source options in the space.

- So for organizations where that isn't what they're doing, what do you recommend? It almost feels like something people need to have now. I mean, detailed information is great.

I think we've had an explosion in the performance side of like deeper understanding and collection of metrics over the last few years in particular, but it does feel increasingly like people need like a dedicated team or a dedicated person that can go through and do that data analysis. The companies that are using it, is that something that you're seeing? Is that something that you're recommending folks are doing on this side of things? - I do see the companies that are kind of at the 200 people and up size breaking out dedicated teams for this. That said, I actually like it a lot more when those people are embedded with the teams that are building the product, like people who have that expertise are just the special person on that team.

I know this is a debate with SREs, too.

I kind of feel like it should be the same thing: rather than having an SRE team that's siloed away, you have someone on the team who can share that expertise with everyone else. 'Cause ideally you want everyone to have it. - Sure.

- You want everyone to be empowered to answer their own questions.

- In your experience, how important is it to have a tight gap between hey, we see an issue in our SLO, or we see an issue popping up and the team being able to understand this is how you fix it? Like are there things or recommendations that you have for folks to like help make sure that at their own company is getting from crap, something's wrong, to oh, here's why to close that gap as much as possible? - Yeah, I mean, I think that the big takeaway from me seeing the SLO approach in action was that that business buy-in means that you get pressure from the business to solve those things quickly, but you also get support.

So for a team that is missing its SLOs all the time and having a hard time debugging,

I think it's easier for them to go make the case to upgrade their tools, or to potentially bring on more people, if they just need more engineers to work on the problem. And so, I don't know that I've seen a specific like technological or process solution, but at least having the business see what's going on and care and have that feedback loop really helps. - And then this was more, I guess, a specific thing to what you were showing, but you were showing the traces for the request, right? It just said request.

I was curious, are you collecting more detail about what each request is? Or is that not really the purpose of that trace? Have you found that it's like a data overload kind of thing? - I had a hard time coming up with a view to show both the trace view and everything that is attached, but on all those spans there's as much metadata as you can imagine, so for a request, there's the actual URL, for one.

There's like a generalized URL shape.

There's all kinds of stuff that you would derive from headers and that kind of thing. We actually do combine this together with some back-end data, so there's-- - So it's all there-- - Yeah, there's a lot there. - Okay, awesome.

All right, well, thank you very much, Emily. That was fantastic. - Thank you so much.

(audience applauding)