Performance versus security, or Why we can’t have nice things

(upbeat music) - Thanks, John.

Hi, I'm Yoav Weiss.

I work at Google as part of the Chrome team on many web performance-related things.

I also co-chair the Web Performance Working Group at the W3C, and as part of that, I get a lot of questions of the nature of, "Why can't we just do this thing? Why can't we as site owners see how much data third parties are downloading for our users? Why can't we limit the data that third parties can trigger? Why can't we just get information on every paint the user sees? Why can't we avoid CORS for my favourite case, which doesn't really need it?" And the list goes on.

And I keep telling folks that the reasons we cannot do all those things are user security and privacy.

And more often than not, the answer is, "I get it, but..." which shows that they don't really get it.

That is perfectly understandable.

If you don't make a living out of web security, there's a good chance you don't know what the different threat models on the web are, and as a result, don't know what browsers are defending against. So, I wanted to talk a bit today about that and outline what bad things bad people can do with our users' data if browsers don't do the right thing.

So I also wanna outline what this talk is not about. I won't be talking about server side security. I won't be talking about defence in depth, so no CSP and SRI.

I'm not talking about third-party tracking so I won't be covering Safari's ITP or Chrome's Privacy Sandbox, and won't talk about fingerprinting.

I think those subjects were covered by Marcus's talk the other day.

What will I be talking about? I wanna cover the broader categories of attacks and attack surfaces that browsers are defending against. I wanna show you some attack vectors and provide some examples of cases where browsers weren't super careful and information was leaked.

And then, at least for some of those cases, I'll talk about the opt-ins that enable browsers to expose some information, in cases where we have enough guarantees and enough confidence that exposing that information won't put users at risk.

I'm hoping that this will give you some insights into the constraints under which browsers operate, and hopefully help you answer your own "Why can't we just do this?" question the next time it comes around. So what are the different threat categories we're operating with? We have history leaks, cross-site leaks, and finally speculative execution attacks, which are a dangerous subset of cross-site leaks. The first on the list is history leaks.

What do I mean by history leaks? Well, let's say you're really into kittens, and as someone who's really into kittens, you sometimes go to a site called kittenhub.com, but maybe you don't want other people to know you're into kittens and you definitely don't want every website you visit to know you're into kittens and then start sending you kitten-related advertising.

So the browser needs to make sure that information on which websites you visited is not leaked between different sites, and like everything security-related, that ends up being harder than it initially looks.

Probably the oldest form of history leaks is the :visited style.

Initially you could apply any kind of style to your visited links, which is great from a design perspective: you could make them larger, give them different fonts, anything CSS can do really. But as it turned out, that also enables any site to collect your browsing history for any other site, by adding a link to that URL and checking its layout dimensions, its computed style or, somewhat trickier, the time it takes to paint.
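To make that concrete, here's a rough sketch of the kind of history-sniffing check that used to work before browsers locked this down. The URL is just an example, and in modern browsers getComputedStyle reports the unvisited style regardless, so this probe no longer works.

    // Historical attack sketch: probe whether a URL is in the user's history.
    // Assumes a stylesheet rule like:  a:visited { color: rgb(255, 0, 0); }
    function probablyVisited(url) {
      const link = document.createElement('a');
      link.href = url;
      document.body.appendChild(link);
      const color = getComputedStyle(link).color;
      link.remove();
      return color === 'rgb(255, 0, 0)'; // modern browsers always report the unvisited colour
    }

    probablyVisited('https://kittenhub.com/');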

As a result, Mozilla back in the day locked down visited links so that only a very few styles that don't impact layout can be applied to them, made it so that getComputedStyle stopped revealing the visited state, and ensured that all of their painting operations take a fixed amount of time.

All these security measures took a long while, somewhere between 2002 and 2010, to get in place, with somewhat of a cat-and-mouse game between browsers and attackers.

That means that visited link styling is highly limited. It also means that painting those links can be slower than it needs to be, because of timing attacks on their painting. In my personal opinion, those restrictions are still way too lax, and we should have limited visited link styling to same-site links, so that if you're going back and forth within a site you won't click the same link twice, but you wouldn't be able to tell, for example, whether a link you're seeing in a "best links of the week" article is a link you previously saw on a social network or whatnot.

I suspect that users wouldn't have noticed the difference, we could have had visited links with significantly better styles, and our users' history would have been significantly safer.

As it is, these links constantly run a risk of leaking information by inadvertently triggering paints and through the various Timing APIs.

As we said, one of the implications of applying different styles at all is that they may take more or less time to compute, enabling timing attacks to sniff out that state. There's also a risk of triggering a paint in one case and not triggering it in another, which can be observable through those Timing APIs, as well as through "requestAnimationFrame". Browsers are for the most part defending against that, but it's adding a tonne of complexity for a use case out of which users are not getting a tonne of value. When designing the various browser performance APIs, and especially the ones that represent the visual metrics, we have to take that into account and make sure that they don't accidentally expose the visited state.

A few years ago, I wrote about the different browser caches, the various checks that each of them runs, and how they help to improve performance.

I wrote it in the form of a request, the box thing on the left on the slides, and its journey to find its resource, looking into the different caches and eventually finding what it's looking for. My kid did the drawings, which were awesome. I'm a big caching fan because caching is great for performance and the fastest request is the one that's never made. But caching is not always unicorns and rainbows. Caching also has a dark side: caching attacks. Assuming that visited links are blocked, caching attacks are one of the best ways to sniff out a user's history.

Under many conditions, it's very easy to use timing attacks in order to know if a certain resource was fetched from the cache or from the servers, even with very coarse timers.

That means that if a site wants to know if you recently visited kittenhub.com, it can load a static resource from that site, see if it's cached and deduce from that if you recently visited that site and by extension, if you're into kittens.
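As an illustration, a probe along these lines could be used; the URL and threshold are made up, and real attacks have to be careful not to populate the cache with the probe itself.

    // Sketch of a cache-timing probe: fetch a static resource from the target
    // site and guess from the duration whether it came from the HTTP cache.
    async function probablyCached(url) {
      const start = performance.now();
      await fetch(url, { mode: 'no-cors', credentials: 'include' });
      const duration = performance.now() - start;
      return duration < 20; // illustrative threshold; real attacks calibrate per device and network
    }

    probablyCached('https://kittenhub.com/logo.png');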

It's an attack that's less deterministic than visited attacks as caches get evicted, and they can also potentially be populated just from running this check.

But determined attackers can inspect the cache state without polluting it, which results in the cache state being used very similarly to visited links, which is pretty bad. The only thing that sites could do about that for many years was to avoid caching their resources if those could be used to reveal such history, that is, if the sites considered themselves sensitive, but that's a very tricky definition.

Safari was actually the first browser to recognise that issue and resolve it as part of their overall privacy efforts back in 2013. The way in which they resolved it was through cache partitioning, or double-keyed caching.

Now, I can hear you thinking, "What does double-keyed caching mean?" Right now, resources are cached based only on their URLs, which enables reuse across sites, but in a partitioned cache model, resources are cached using both their URLs and the name of the top-level site they are fetched from, so the host name of that top-level site.
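Put differently, the cache key conceptually changes roughly like this (hypothetical URLs):

    // Single-keyed cache (old model): one entry, shared by every site.
    //   key = "https://fonts.example/roboto.woff2"
    //
    // Partitioned, double-keyed cache: the top-level site is part of the key,
    // so news.example and evil.example each get their own copy.
    //   key = ("https://news.example", "https://fonts.example/roboto.woff2")
    //   key = ("https://evil.example", "https://fonts.example/roboto.woff2")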

That prevents caching across sites and prevents the leaks. If a random site can't tell which resources were loaded in other contexts, it cannot know if a user visited those other contexts. Cache partitioning also tackled other threats: privacy threats such as cross-site communication to bypass cookie restrictions, and security threats such as cross-site search, where evil.com loads victim.com in an iframe with query parameters and the loaded resources tell evil.com if, for example, a search query returned actual results.

So Safari deployed that years ago, but other browsers avoided going down that path, mostly due to performance concerns around disabling cross-site caching.

But more recently, when Chrome engineers looked at the aggregated data, it seemed that the performance benefits of cross-site caching don't necessarily outweigh the user benefits of avoiding it. Some use cases were hit more than others; for example, resources of popular font providers that are used across many sites have seen a larger miss rate in the cache-partitioning case than the general population. But all in all, it seems like a performance hit that browsers are willing to take in order to improve our users' privacy and security. One more thing about cache partitioning: it goes beyond just caching.

There are many small caches in the platform that can be used to maintain state: socket pools, DNS caches, service worker installations, et cetera.

Speaking of service worker installations.

Another recent example of cross-origin state leak is related to one of the APIs in the Web Performance Working Group.

Resource Timing has an attribute called workerStart, which, when a service worker is installed, gives you the time it takes for the worker to start up, which is an extremely important metric for sites that adopt service workers.

And when the service worker is not installed, it gives you a value of zero.

For some resources, there is no issue there, because all the resources go through the site's own service worker, so no cross-origin leak, but iframe resources go through their own service worker yet get reported to the parent frame as resource timing entries.

That meant that if evil.com were to embed victim.com that has a service worker, it could see if the user visited victim.com in the past or not, which was bad.
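A sketch of what that (since fixed) probe might have looked like, with victim.com standing in for a site that registers a service worker:

    // Pre-fix behaviour: embed the victim and inspect the iframe's Resource
    // Timing entry from the parent page.
    const iframe = document.createElement('iframe');
    iframe.src = 'https://victim.com/';
    iframe.onload = () => {
      const [entry] = performance.getEntriesByName('https://victim.com/');
      // workerStart > 0 meant a service worker was already installed,
      // i.e. the user had visited victim.com before.
      const visitedBefore = entry && entry.workerStart > 0;
    };
    document.body.appendChild(iframe);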

And that was a bug in both spec and implementation that we then fixed.

That covers it for history leaks which are a major threat, but the next major threat is cross-origin leaks. What do I mean by cross-origin leaks? In essence, we're talking about cases where one site you're visiting can deduce information about you from another site. It can be a login state, so sites understand if you're a user of a certain social network, but can also be something that goes beyond that, like which category on that social network site you fit into, which can reveal your gender, age, political affiliation and more.

So it's pretty bad, and obviously browsers should block cross-origin and cross-site leaks as much as possible.

In order to prevent such leaks, browsers have put in place something called the same-origin policy.

That prevents one site from simply reading data from other origins that it shouldn't be able to read. The mechanism that prevents that information leak while still enabling information sharing between different origins is called CORS, or Cross-Origin Resource Sharing. It basically means that resources that are supposed to be readable across origins opt into that.

Once a resource has opted into CORS and allowed a certain site to load it, that site can then fully read that resource. The assumption is that developers wouldn't enable CORS on resources that expose the user's state to untrusted origins, and in particular, it's tricky to enable it by accident for resources that carry along cookies, so credentialed resources.
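For completeness, the opt-in looks roughly like this: the resource's server sends a response header naming the origins that may read it, and the page requests it in CORS mode (the origins and URL here are examples).

    // Server side, on the resource's response:
    //   Access-Control-Allow-Origin: https://example.com
    //
    // Client side: an explicit CORS request; the response body is then fully readable.
    fetch('https://api.other-site.example/data.json', { mode: 'cors' })
      .then((response) => response.json())
      .then((data) => console.log(data));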

The same-origin policy and CORS protect us from direct leaks, but there are still side channels.

One of the most common ways to exfiltrate state across sites is to figure out resource size differences. Let's say we have a shopping site to which you provided some information about yourself at some point, and which is doing some personalisation based on that. Maybe their A/B testing shows that older folks react better to larger icons, so they change their icons accordingly based on cookie state.

That makes perfect sense and there's nothing wrong with that.

But if you were to combine that with a browser vulnerability that tells you the resource size of those icons, then what happens? That would enable any site that can exploit that vulnerability to get that personal and sensitive information about that site's users.

Another example is cross-site search.

If a site has an API that enables a logged-in user to send search queries and get responses, the response size can reveal if a certain search query returned results. So if your email provider enables such an API, it could allow a malicious site to search your inbox for the word "kitten".

So yeah, exposing resource size information is bad. What kind of idiot would have bugs that expose that information? Well, turns out it can be a bit tricky.

Resource timing entries were partially exposing the response's status code, by only exposing entries that had a status code below 400, which was somehow allowed by the spec.

I had it on my to-do list for a while to change that and expose all entries regardless of status code, but I also had other things on my to-do list and this didn't seem very high priority, so the task lagged. And then this bug came in and was assigned to me. Turns out someone used the video element, which sends range requests without being CORS enabled, as well as a service worker that returned arbitrary responses to those video requests, in order to trigger range requests of arbitrary size to certain resources, which weren't necessarily video resources.

Basically, when they sent a range request that was below the resource's content size, they got a 206 status code from the server, and that created a resource timing entry.
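Heavily simplified, the core check looked something like this; the actual exploit used a video element plus a service worker to coerce the range requests, and this just shows the Resource Timing side of it (victimUrl is a placeholder).

    // Sketch of the (since fixed) probe: did the range request for bytes 0..n
    // of the victim resource produce a Resource Timing entry?
    function producedEntry(victimUrl) {
      // Pre-fix: a 206 response created an entry, an error status did not.
      return performance.getEntriesByName(victimUrl).length > 0;
    }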

Then, when they sent a range request that was above the content size, they got a 404 status code, which did not create an entry. They then ran a binary search on the various potential sizes for the resource and were able to fairly quickly find the content size of arbitrary resources using that trick. That was good enough motivation for me to fix the issue, though not before feeling pretty bad about myself. Another place that exposed content sizes accidentally is the Cache API.

When you're working with service workers, you sometimes wanna cache resources that weren't fetched using CORS.

That can happen because you couldn't set CORS headers on those resources, you don't wanna expose them, or they're CSS background images or other resources that cannot be CORS enabled. That also means that the response you get back from those resources is an opaque response. You can cache those responses in the Cache API but cannot inspect them any further.

But it turns out that the Cache API has a quota, and initial implementations naively took the size of the resource into account when calculating that quota, which then enabled malicious sites to calculate the size of any opaque response, which we already determined is pretty bad.
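A rough sketch of that (since mitigated) probe, using the storage estimate API and a hypothetical URL:

    // Measure how much reported quota usage grows when an opaque (no-cors)
    // response is put into the Cache API.
    async function guessOpaqueSize(url) {
      const before = (await navigator.storage.estimate()).usage;
      const cache = await caches.open('probe');
      await cache.put(url, await fetch(url, { mode: 'no-cors' }));
      const after = (await navigator.storage.estimate()).usage;
      // Pre-fix, this delta tracked the opaque resource's real size.
      return after - before;
    }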

So as a result of that, caching opaque responses is nowadays extremely expensive, because browsers pad them and basically treat them as being of an arbitrarily large size, regardless of the resource's actual size.

Beyond simple, or less simple, bugs, we also have features that are blocked on resource size exposure. Many years ago, I worked with Tim Kadlec on something we called Content Performance Policy, which was meant to provide developers with control over their third parties and their users' experience regarding those third parties. That proposal evolved and part of it ended up as Transfer Size Policy. In short, it was meant to give developers a way to terminate third parties that are loading excessive amounts of data, and get events when that happens, which would enable them to know their third parties are failing and present something to the user, talk to their third parties, et cetera.

Unfortunately, we couldn't find a way to expose that information without also enabling potential size leaks of arbitrary resources on the web.

So the feature was abandoned.

Another feature that fell victim to this limitation is the performance.memory API.

Long-running web applications tend to accumulate memory leaks, and it's hard for them to detect those leaks in the wild. We were hoping to provide an API that exposes the memory that a certain renderer consumed so that site owners can know that, but it turns out that that memory is directly impacted by the resources that the renderer loads, so exposing the renderer memory also exposes content size and can reveal user state as a result.

We're now reviving a similar effort, but one that will be limited to isolated contexts, which we'll talk about soon.

Beyond resource sizes, other information about the resource can also give out details.

We talked about how HTTP status exposure enabled revealing resource sizes, but it can also reveal other things about the user, depending on the site's logic.

And when working on Element Timing and Largest Contentful Paint, recent performance metric additions, we required a Timing-Allow-Origin opt-in in order to expose cross-origin images' actual rendering time, because the rendering time can vary depending on which image was actually delivered by that cross-origin server.

And when "Timing-Allow-Origin" opt-in is not present, we only expose the images load time, which is already exposed anyway.

Those various information leaks are the reason we have "Timing-Allow-Origin" as an opt-in to enable individual resources to indicate to the browser that they don't contain any sensitive information, and enable the browser to report more information about them.

So, timing of the different points in their loading process, their paints or sizes, et cetera.
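Setting it is a single response header on the resource; for example (origins are placeholders):

    // On the third-party resource's response:
    //   Timing-Allow-Origin: https://example.com, https://staging.example.com
    // or, for a truly public resource:
    //   Timing-Allow-Origin: *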

"Timing-Allow-Origin" setting that is a significantly lower bar than setting CORS, and while it does expose some resource characteristics, it exposes significantly less than CORS, which enables just RFR reading the files' content. We covered cross-origin leaks, and now when it comes to security threats, I kept the best for last.

Speculative execution attacks.

Remember when we talked about caching attacks, turns out that CPUs also have caches.

Over the years, we've seen CPU-level caching attacks. For example, a few years ago we saw attacks that exploited the CPU caches and forced us to make various modifications to timer resolution, but we've never seen anything as dramatic as Meltdown and Spectre.

These attacks were discovered about two and a half years ago, and completely shattered previous expectations around multi-tenant computing in general, and the web and its cross-site implications in particular. I'm not gonna dive into the full details, but basically, when modern CPUs see an if statement, they can speculatively execute both branches in order to save time, which can result in huge performance improvements for those CPUs. Once they know which branch was actually taken, which part was supposed to be executed, they drop all the results from the part that wasn't supposed to be executed. But it turns out that one side effect of that execution that was never supposed to happen is that it can keep things around in the CPU cache, and those caching side effects are then observable by programmes running on that CPU. On a practical level, that means that these attacks enable malicious websites to extract any information that resides in the same renderer process as the malicious website.

Browsers have a bunch of mitigations to make that harder, but at the end of the day, the main remedy is to make sure that unneeded or sensitive cross-origin information never finds itself in the same process as malicious content to begin with. Chrome was relatively fortunate on that front. When Spectre hit, Chrome had already launched Site Isolation on desktop, where Site Isolation means that cross-site iframes on a single page go to a different renderer process than the parent frame, making it hard to exploit Spectre beyond individual resources.

You cannot load an entire site in an iframe and then read that site's content, so that's good. But while Chrome had that in place for desktop, it wasn't in effect on mobile due to memory constraints, and other browsers didn't have architectures that supported that at all, but they're all now working towards that goal.

But Site Isolation doesn't help us against the loading and reading of subresources. So Cross-Origin Read Blocking, or CORB, is a mechanism that blocks the loading of subresources that contain atypical data MIME types.

So if you tried to load non-CORS-enabled JSON as, for example, an image in order to read its content, that would get blocked outside of the renderer process. Beyond that, Spectre attacks read the CPU cache state through timing attacks, so high-resolution timers were facilitating them. While the platform has many, many different timers, most of them implicit, the most obvious and high-precision ones were disabled or coarsened.

performance.now() was coarsened to about one millisecond in most browsers, and later on its resolution was somewhat restored, at least in Chrome, where Site Isolation was in effect.

SharedArrayBuffers were disabled anywhere other than Chrome on desktop, and are now in the process of being turned back on in isolated contexts, which we'll talk about soon.

That's because SharedArrayBuffers can effectively be used as highly precise timers, by incrementing a value in one thread and reading it from another.
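A sketch of that trick, assuming SharedArrayBuffer is available and a small dedicated worker (worker.js here is hypothetical):

    // worker.js would do something like:
    //   onmessage = (e) => { const c = new Int32Array(e.data); while (true) Atomics.add(c, 0, 1); };
    //
    // Main thread: the counter becomes a clock far finer than a coarsened performance.now().
    const sharedBuffer = new SharedArrayBuffer(4);
    const counter = new Int32Array(sharedBuffer);
    const worker = new Worker('worker.js');
    worker.postMessage(sharedBuffer);

    const before = Atomics.load(counter, 0);
    // ...operation being timed...
    const after = Atomics.load(counter, 0);
    const elapsedTicks = after - before;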

How can we turn those features back on? Well, high precision timers can be important for performance measurements. SharedArrayBuffers can be critical to quickly move information between the main thread and workers and to enable cross thread communication in WebAssembly, which is a pretty big deal.

So we want to create new types of secure contexts, or as my colleague Mike West calls them, "securer contexts", and make sure that we have isolated contexts in which the use of random credentialed cross-origin resources is limited, and therefore ones where we can use all those somewhat risky features, like SharedArrayBuffers and high-precision timers, without much concern.

Also, we talked earlier about the memory measurement API. That's another feature we wanna enable, but restrict to isolated contexts to make sure that it cannot sniff resource sizes of content it's not supposed to.

JS Self-Profiling is another API that is rather risky to expose in an open cross-origin environment where cross-origin scripts can be loaded, but that would be safe to expose in an isolated context, because in isolated contexts we know that no third-party script was loaded without knowing that it would be embedded, so we can expose details on that script's lifetime and runtime.

We talked about CORS earlier, which is a server-side opt-in that makes a cross-origin resource completely readable. That's a very high bar.

So not something we wanna require in order to define isolated contexts.

So to tackle that, we're extending Cross-Origin-Resource-Policy to include a cross-origin value that indicates that a resource is supposed to be loaded in cross-origin contexts, even if it's credentialed, but that it doesn't reveal anything about the user. We want to use Cross-Origin-Opener-Policy in order to enable websites to commit to never trying to directly manipulate cross-origin pages they open; enabling that allows browsers to spin those cross-origin pages off to a separate process, even in browsers that don't have Site Isolation architectures. And then we also want to use Cross-Origin-Embedder-Policy to make sure that all the resources that are loaded in a page have CORP enabled and therefore all willingly opt in to being embedded.
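Concretely, the header combination for such an isolated (cross-origin isolated) context looks roughly like this:

    // On the top-level document:
    //   Cross-Origin-Opener-Policy: same-origin
    //   Cross-Origin-Embedder-Policy: require-corp
    //
    // On cross-origin subresources that are fine with being embedded:
    //   Cross-Origin-Resource-Policy: cross-origin
    //
    // The page can then check whether it actually got that isolation:
    if (self.crossOriginIsolated) {
      // SharedArrayBuffer, high-precision timers, etc. can be used here.
    }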

We have CORS, TAO, CORP, and COOP and COEP to go along with it.

Those are a lot of acronyms.

How do they all fit together? What are you as web developers supposed to set your resources to? Right now it's not super clearly defined.

For public resources that don't require any credentialed requests, you can set CORS on them, but because CORS is special, that would also impact the way in which they're used, so you'll need to request them with the crossorigin attribute, and that's not always even possible.

For example, CSS background images cannot be CORS enabled.

CORP can be used in cases where exposing some details about the resource is fine, for example its size, or even its content if the attacker is really determined and willing to use complex speculative execution attacks, but where the resource cannot be CORS enabled due to operational complexity.

Finally, TAO can be used when exposing the timing about the resource is fine and doesn't reveal anything about the user. If all that sounds a bit vague, that's because it kinda is.

We still need to figure that last part out. Right now, none of those opt-ins implies the others, and ideally we'd want to clarify all of them and layer them on top of each other, so that, for example, Cross-Origin Resource Sharing also implies CORP, and ideally also somehow implies timing, but that's still yet to be determined.

We want to make it very clear what permissions you're giving the browser with each one of those opt-ins, in order for you to have a better sense of what cross-origin access, if any, you want to provide for your resources. In the process, we might move some of the attributes that are currently governed by Timing-Allow-Origin, for example the transfer size, to be blocked on CORP checks or something else. So, to summarise.

Adding APIs to browsers is trickier than it looks because browsers have a duty to protect the user's information.

And since today a lot of the user's information lives on the web, we must protect that information on websites from other websites. Fundamental changes are coming.

Cache partitioning, isolated contexts, and opt-in rationalisation change the way that resources are loaded on the web, in ways that you should be aware of.

And finally, we can't have everything.

Features like Transfer Size Policy, or knowing how much data users are downloading in third-party resources, run into a fundamental problem that we cannot solve without convincing our third parties to opt in (upbeat music begins) to revealing the size of their non-credentialed resources. And with that, thank you.