Observability is for User Happiness

Within the observability community, there’s a saying, “nines don’t matter if users aren’t happy,” meaning that 99.999% server uptime is a pointless goal if our customers aren’t having a fast, smooth, productive experience. But how do we know if users are happy? As members of the web performance community, we’ve been thinking about the best ways to answer that question for years.

Now the observability community is asking the same questions, but coming at them from the opposite side of the stack. What can we learn from each other? Emily will talk about how approaching web performance through the lens of observability has changed the way her team thinks about performance instrumentation and optimization. She’ll cover the nuts & bolts of how Honeycomb instrumented its customer-facing web app, and she’ll show how the Honeycomb team is using this data to find and fix some of its trickiest performance issues, optimize customer productivity, and drive the design of new features.

Observability is for User Happiness

Emily Nakashima, Director of Engineering, Honeycomb.io

Emily has always been a frontend engineer who loves to hang out with ops… which is possibly a bit unusual. She had a hard time adequately explaining why the roles have so much in common, until a boss commented:

Nines don’t matter if users aren’t happy. – Charity Majors

No matter which part of the stack we’re working on, we all have the same job – to deliver a great result, a great experience, to the user.

So the question is how do we know if the users are happy?

This is where observability comes in. But what does that mean, exactly? The term comes from control theory.

An observable system is one whose internal state can be deeply understood just by observing its outputs.

For the web this is probably more like…

An observable client app is one whose user experience can be deeply understood just by observing its outputs.

Looking at this definition you’ll realise it’s not a job, it’s a system property… and you can’t buy system properties. No matter who tells you otherwise!

These are things like usability, accessibility, performance, observability… You can buy tools that help you get there, but you can’t just swipe your credit card and get them delivered to you.

So this talk is about how you get there.

Some people like a three-pillar model of observability: logs, metrics and traces. Emily doesn’t agree with this – you can buy all these products and still have questions about your systems.

Emily has a much scarier graphic 😉 A range of tools are involved across the traditional concerns of frontend, backend and ops tools.

Emily will focus on events and distributed tracing, as those two parts give you a lot of insight.

Distributed tracing sounds a bit scary but the concepts are reasonably easy. It started with logs.

If there’s one big tip today, it’s to move to structured logs. Traditional one-line log formats require a ton of regex to pull information out, while structured key/value pairs are much easier to work with. Also add request IDs everywhere so you can link different logs together.

127.0.0.1 - [12/Oct/2017 17:36:36] "GET / HTTP/1.1" 200 -

vs

{
  "upstream_address": "127.0.0.1",
  "hostname": "my-awesome-appserver",
  "date": "2017-10-12T17:36:36",
  "request_method": "GET",
  "request_path": "/",
  "status": 200
}

Maybe you should also capture the duration of each request as well as the time it occurred.

So now you’ve gone from a single-line log to something that captures an event. Events are the fundamental data unit of observability – they tell us about units of work in our system.

Note that events do not mean DOM events in this talk! Events are often one http request, but it will depend on the work your system is doing.

The next way to add value to this data is to identify parent/child relationships between events. That naturally leads you to start visualising things, which makes it easier to understand cause and effect, and to see which parts are running fast or slow.
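
As a rough illustration (the field names here are illustrative, not any particular vendor’s schema), a parent and a child event might look like this – the shared trace_id and the child’s parent_id are what let a tool draw them as a waterfall:

const parentSpan = {
  trace_id: "abc123",
  span_id: "span-1",
  name: "GET /dashboard",
  timestamp: "2017-10-12T17:36:36Z",
  duration_ms: 411,
};

const childSpan = {
  trace_id: "abc123",
  span_id: "span-2",
  parent_id: "span-1", // points back at the parent span
  name: "db.query",
  timestamp: "2017-10-12T17:36:36.120Z",
  duration_ms: 93,
};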

Most people will have seen something that does this kind of visualisation, but it’s worth digging into why traces are useful. This is also why the three pillars don’t work so well – logs and metrics are largely redundant if you have good traces.

This is why Emily’s diagram has such a large bubble for Distributed Tracing, it can encapsulate so much other data.

You may have been wondering about logging duration – that’s not ‘normal’ for logging, and people generally don’t want to hand-write the code required to do it. The way to simplify this is with a standard and a library – most people are using OpenTelemetry right now.
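
As a rough sketch of what the library gives you – this uses the OpenTelemetry JavaScript API and assumes a tracer provider has already been configured at app startup (setup details vary by version) – wrapping a unit of work in a span takes only a few lines, and the duration is recorded for you:

import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("my-app");

async function tracedFetch(url: string): Promise<Response> {
  // startSpan records the start time; end() records the end time, so the
  // duration comes for free instead of being hand-logged.
  const span = tracer.startSpan("http_request");
  span.setAttribute("request.url", url);
  try {
    const response = await fetch(url);
    span.setAttribute("response.status", response.status);
    return response;
  } finally {
    span.end();
  }
}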

Most people are using this as a server-side tool, but what about using this in the browser? How might we use this for a complex React app?

We can definitely do this – we can pull out spans for fetching the bundle or fonts, running the bundle, rendering components, etc. It does take some code – there’ll be a link at the end. There isn’t a popular library yet.

It is really satisfying once you’ve got this up and running.
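
There isn’t one canonical way to do this yet, but as a minimal sketch (reusing the OpenTelemetry tracer from above – the attribute names are illustrative), you can turn the browser’s Resource Timing entries for things like the bundle and fonts into spans:

import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("browser-app");

// After load, convert each Resource Timing entry (bundle, fonts, images...)
// into a span. A fuller setup would also parent these under one page-load span.
window.addEventListener("load", () => {
  for (const entry of performance.getEntriesByType("resource")) {
    const resource = entry as PerformanceResourceTiming;
    const span = tracer.startSpan(`resource ${resource.initiatorType}`, {
      startTime: performance.timeOrigin + resource.startTime, // epoch ms
    });
    span.setAttribute("resource.url", resource.name);
    span.setAttribute("resource.duration_ms", resource.duration);
    span.end(performance.timeOrigin + resource.responseEnd);
  }
});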

When we create events (spans) – see the sketch after this list:

  • On page load
  • On history state change (SPA navigation)
  • On significant user actions
  • On error (also send to error monitoring tools)
  • On page unload
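
A rough sketch of wiring those trigger points up in the browser – the sendEvent helper is hypothetical and stands in for whatever creates a span or queues an event in your setup:

// Hypothetical helper: creates a span / queues an event in your telemetry setup.
declare function sendEvent(name: string, attributes?: Record<string, unknown>): void;

// On page load
window.addEventListener("load", () => sendEvent("page_load", { path: location.pathname }));

// On history state change (SPA navigation) – popstate covers back/forward;
// pushState needs to be wrapped, or hooked via your router.
window.addEventListener("popstate", () => sendEvent("spa_navigation", { path: location.pathname }));

// On error – also forward this to your error-monitoring tool.
window.addEventListener("error", (e) => sendEvent("client_error", { message: e.message }));

// On page unload – pagehide is more reliable than unload in modern browsers.
window.addEventListener("pagehide", () => sendEvent("page_unload", { path: location.pathname }));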

Bringing this information together builds a good picture of the user’s experience.

The exact contents of spans vary according to the tool you’re using, but there’s usually a type, a duration and a range of relevant metadata. There can be a lot of depth to this information, and it allows you to compare data over time, using whatever cuts make sense.

A typical example would be heat mapping the total page load duration over a long period.

Traces can get really complicated, particularly when there’s a lot of user interaction. They are interesting as one-offs, but you can also aggregate the data and look for correlations – e.g. did slow page loads lead to lower conversions?

This all looks a lot like the information you get out of your browser’s network panel, but a key difference is that your browser can only ever show you one person’s data – and it doesn’t capture as much context.

Fundamentally, browser network data is synthetic data, not RUM (Real User Monitoring) data from your production environment. The network tab’s data is also extremely dense, you don’t want to capture Personally Identifying Information, and so on… so your own tracing can cut the data down so you aren’t handling a lot of overhead.

There is some overlap with session replay solutions. If they’re working for you, there’s no problem sticking with them.

So what next? Capturing more about the effect of interactions within applications. So far we don’t have this solved, although Conrad Irwin has a great blog post about this.

So what do you actually do with this data? Emily’s company is small, so the focus is on the customer.

Fast queries are really important for their customers – sub-second responses are good. But even so there was feedback that some queries blew out to multiple seconds.

They set a target of one second, but found the UI was polling for results on a one-second interval, so it could never meet that target. So they knew to shorten the polling; and because they’d instrumented response times, they knew the median query took about 411ms and many were faster.

So, putting it all together, they dropped the polling interval to 250ms – a 20-line code change instead of launching an entire project to build an alternative. 19 out of 20 queries were faster after that change.
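
The actual code isn’t in the talk, but the shape of the change is easy to picture – something like a poll loop whose interval constant dropped from 1000ms to 250ms (everything here is illustrative, not Honeycomb’s real code):

const POLL_INTERVAL_MS = 250; // previously 1000 – the whole fix, more or less

interface QueryResult { complete: boolean; rows?: unknown[]; }

// Hypothetical API call that reports whether the query has finished.
declare function fetchQueryStatus(queryId: string): Promise<QueryResult>;

async function pollForResult(queryId: string): Promise<QueryResult> {
  while (true) {
    const result = await fetchQueryStatus(queryId);
    if (result.complete) return result;
    await new Promise((resolve) => setTimeout(resolve, POLL_INTERVAL_MS));
  }
}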

There is a blog post with more details but there are two things to take away:

  1. It doesn’t matter how fast your backend is if you don’t pass that benefit along to the user.
  2. The story feels almost silly – a little data and a small change had a big benefit – but do we have enough data to find all of these gains? Probably not.

Honeycomb has two versions of its product – one that queries directly from the browser, and another that queries via an encrypted proxy. Not many people use the secure version, but they are a very important minority of users. It was slow for just one team – and that team was a really important customer – yet they could not reproduce the problem.

They started looking at traces and the answer popped out – something was blocking the data requests. It turned out the JavaScript managing the requests was complex, and because it runs on a single thread it was delaying the requests. They batched the logic and improved performance.
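
The talk doesn’t show the code, but the general shape of that kind of fix is to get the network requests out the door first and do the request-management bookkeeping in one batch afterwards, rather than interleaving expensive synchronous work between requests on the single JS thread – roughly (names are illustrative):

// Hypothetical batched state update for the in-flight requests.
declare function updateRequestState(queries: string[]): void;

async function runQueries(queries: string[]): Promise<Response[]> {
  // Kick every request off immediately...
  const inFlight = queries.map((q) => fetch(`/api/query?q=${encodeURIComponent(q)}`));

  // ...then do the bookkeeping once, in a single batch, while they're in flight.
  updateRequestState(queries);

  return Promise.all(inFlight);
}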

How to find the needle in the haystack? Use the appropriate data:

  • For breadth use metrics (a horizontal slice across traffic)
  • For depth use tracing (deep cross section of a single interaction)

Common questions:

  • Privacy – don’t collect every bit of data you can, question whether you need each piece, choose the least sensitive options, avoid PII
  • Performance (will this slow my app?) – done well, it won’t. Batch requests, use the Beacon API for non-blocking sends, and use requestIdleCallback or setTimeout for slow calculations (see the sketch after this list)
  • Sampling – if you have a really large amount of data, you can work with a representative sample
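
A minimal sketch of that batching approach (the /telemetry endpoint and event shape are hypothetical): queue events in memory, do the work during idle time, and flush with the non-blocking Beacon API:

const queue: object[] = [];
let flushScheduled = false;

function record(event: object) {
  queue.push(event);
  if (flushScheduled) return;
  flushScheduled = true;
  // Defer the actual work until the main thread is idle (with a fallback).
  if ("requestIdleCallback" in window) {
    requestIdleCallback(flush);
  } else {
    setTimeout(flush, 0);
  }
}

function flush() {
  flushScheduled = false;
  if (queue.length === 0) return;
  const payload = JSON.stringify(queue.splice(0, queue.length));
  // sendBeacon hands the request to the browser without blocking the page.
  navigator.sendBeacon("/telemetry", payload);
}

// Make sure anything still queued goes out when the user leaves the page.
window.addEventListener("pagehide", flush);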

Observability is not just for performance and bug fixing – it’s great for getting back to the question of “are the users happy?” It’s also good for UX tracking:

  • Refresh/reload tracking – excessive reloads can indicate something is wrong. They tracked ctrl+r/cmd+r and found things like people hammering the user invite page.
  • Rage clicking – you can guess what this means! Rapid re-clicking on a single element can indicate a high level of frustration. A common trigger is elements that load data but don’t show a spinner (a rough detection sketch follows).
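
A rough sketch of detecting rage clicks (the thresholds and the sendEvent helper are illustrative, not a standard):

// Hypothetical helper that queues a telemetry event.
declare function sendEvent(name: string, attributes?: Record<string, unknown>): void;

const WINDOW_MS = 1000; // clicks must land within this window
const THRESHOLD = 3;    // this many clicks on the same element counts as "rage"

let lastTarget: EventTarget | null = null;
let clickTimes: number[] = [];

document.addEventListener("click", (e) => {
  const now = Date.now();
  if (e.target !== lastTarget) {
    lastTarget = e.target;
    clickTimes = [];
  }
  clickTimes = clickTimes.filter((t) => now - t < WINDOW_MS);
  clickTimes.push(now);
  if (clickTimes.length >= THRESHOLD) {
    const tag = e.target instanceof Element ? e.target.tagName : "unknown";
    sendEvent("rage_click", { element: tag, path: location.pathname });
    clickTimes = [];
  }
});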

You can also use observability to drive design.

Honeycomb had a sidebar showing query history, but the designers weren’t sure what users wanted there (if anything). They looked at data about the screen-to-window size ratio and found users were making that page larger than any other screen in the app.
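
Capturing that kind of signal is cheap – a couple of extra attributes on whatever page-load event you already send (the sendEvent helper is hypothetical):

// Hypothetical helper that queues a telemetry event.
declare function sendEvent(name: string, attributes?: Record<string, unknown>): void;

window.addEventListener("load", () => {
  sendEvent("page_load", {
    path: location.pathname,
    window_width: window.innerWidth,
    screen_width: window.screen.width,
    // How much of the available screen the user gives this page.
    width_ratio: window.innerWidth / window.screen.width,
  });
});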

So they made the sidebar easier to read quickly; but also collapsible so people could tuck it away when they didn’t need it.

Then they went back into the data afterwards to see whether fewer people were enlarging the window on that page – and there was a small improvement.

Isn’t this just product analytics? Pretty much. As our apps get more complicated, the tooling has to get more powerful.

Emily likes the emergence of the term “Product Engineer” in preference to ‘full stack’ etc. It’s better if we are not all siloed away from each other.

When you look at production data, you too are an observability practitioner. Welcome to the club!

@eanakashima | bit.ly/user-happiness