What could you do with a neural network in your browser?
Hello everyone.
I'm Ningxin Hu, a Software Engineer at Intel.
I'm participating in the W3C Machine Learning for the Web Community Group and Working Group.
There we are working on a new web standard, the Web Neural Network API (WebNN), which helps web applications and frameworks access purpose-built machine learning hardware.
Agenda-wise, I'll first introduce the observations we have so far.
What's the problem?
There is a performance gap between the web and native for AI applications.
Then I'll talk about the Web Neural Network API proposal itself, and how it can fill the gap by introducing a new domain-specific abstraction.
We have also implemented webnn-polyfill and webnn-native, which let you start experimenting with the WebNN API today.
I'd like to share the results of running the code and give some interesting demos.
As you know, in the last decade machine learning, and in particular deep learning, has become increasingly important and is widely applied in areas like computer vision, natural language processing, and speech recognition.
Nowadays, thanks to emerging JavaScript machine learning frameworks such as TensorFlow.js, ONNX Runtime Web, MediaPipe, and OpenCV.js, web applications can easily incorporate these innovative usages by running machine learning models in the web browser.
Under the hood, those frameworks usually leverage WebAssembly, WebGL, and even the new WebGPU to run the machine learning computation on the CPU and GPU respectively.
On the other hand, to meet the exponential increase in computing demand for machine learning workloads, hardware architecture innovation is advancing very fast.
Machine learning extensions have been added to general-purpose devices such as the CPU and GPU, and a bunch of new dedicated machine learning accelerators are emerging, such as the NPU (Neural Processing Unit), VPU (Vision Processing Unit), and TPU (Tensor Processing Unit), just to name a few.
These dedicated accelerators not only help optimize performance but also reduce power consumption for machine learning tasks.
By taking advantage of these new hardware features, native platforms achieve very good performance and have been successfully deployed in real production.
So how about the Web?
We tested MobileNetV2, an image classification neural network widely used in client scenarios, as the workload, and measured the inference throughput with batch size 1.
For hardware devices, we chose a laptop and a smartphone, both equipped with specialized machine learning hardware.
The CPU of the laptop supports Vector Neural Network Instructions, known as VNNI.
The smartphone has a low-power digital signal processor (DSP) optimized for machine learning.
According to the charts, there is a big performance gap between Web and native.
For instance, on the laptop, native CPU inference is about 5.6 times faster than WebAssembly for float32 precision.
The reason is that native code can use 256-bit vector instructions and optimize the memory layout for that vector width, whereas WebAssembly only has 128-bit vector instructions available.
Native GPU inference is about eight times faster than WebGL for float16 precision.
That's because the hardware-specific machine learning kernels within the GPU driver are not available to WebGL.
On the smartphone, we observed similar results.
Lower-precision inference, known as quantization, is a widely used technique to optimize inference performance and reduce power consumption.
VNNI and the DSP are designed to accelerate exactly that.
Using 8-bit precision on the laptop, native inference accelerated by VNNI can be 12 times faster than the best of the web.
And on the smartphone, 8-bit inference can be even 16 times faster.
Unfortunately, neither VNNI nor the DSP is exposed to web apps through any generic web API.
JavaScript machine learning frameworks cannot take advantage of these hardware features.
That leads to a big performance gap.
However, due to the architectural diversity of this new machine learning hardware, it is quite challenging to expose it through generic CPU or GPU compute web APIs.
So we propose the Web Neural Network API as an abstraction for neural networks in web browsers.
It aims to bridge the innovations in the software and hardware ecosystems, providing a solution that scales across hardware platforms and works with any framework for web-based machine learning experiences.
As illustrated in the architecture diagram, web browsers may implement the WebNN API using the native machine learning APIs available in the operating systems.
This architecture allows JavaScript frameworks to tap into cutting-edge machine learning innovations in the operating system and the hardware platform underneath without being tied to platform-specific capabilities, bridging the gap between software and hardware through a hardware-agnostic abstraction layer.
At the heart of a neural network is a computational graph of mathematical operations.
These operations are the building blocks of modern machine learning technologies in computer vision, natural language processing, and robotics.
The WebNN API is a specification for constructing and executing computational graphs of neural networks.
It provides web applications with the capability to create, compile, and run machine learning networks in the web browser.
The WebNN API may be implemented in web browsers using the available native operating system machine learning APIs for the best performance and reliability of results.
The execution of a WebNN graph can interoperate with kernels written in WebAssembly, WebGL, and WebGPU shaders.
With that, frameworks have the flexibility to use WebNN for hardware acceleration while still using WebAssembly/WebGL/WebGPU for custom operation support.
There are three major interfaces within the WebNN API: MLContext, MLGraph, and MLGraphBuilder.
The MLContext interface represents the global state of WebNN API execution.
One important piece of context state is the underlying execution device.
The execution device manages the resources and facilitates the compilation and execution of the neural network graph.
With the WebNN API, frameworks can specify a device preference such as CPU or GPU; later, it will also support AI accelerators.
Frameworks can also set a power preference of high performance or low power.
The implementation then selects the best hardware device available on the platform according to those preference settings.
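In code, context creation looks roughly like this. This is a sketch: the option names below follow the draft spec at the time of this talk and have since evolved.

```js
// A sketch of context creation with device and power preferences.
// Both are hints; the implementation maps them to the best available device.
const context = navigator.ml.createContext({
  devicePreference: 'gpu',             // or 'cpu'
  powerPreference: 'high-performance'  // or 'low-power'
});
```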
The MLGraph interface represents a compiled computational graph and exposes a compute method to perform inference.
The MLGraphBuilder interface serves as a builder, also known as a factory, to create an MLGraph.
An MLOperand is a representation of the data that flows within the computational graph, which includes input values, trained weights (also known as constants), intermediate values, and the output values of the inference.
At inference time, every MLOperand is bound to actual data; in machine learning terms, these are tensors.
Tensors are essentially multi-dimensional arrays.
The representation of tensors is implementation-dependent, but it typically includes the array data stored in some buffer plus metadata describing the data type and shape.
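For example, the descriptor for a 3-by-4 float32 tensor, which would be backed by a Float32Array of length 12 at compute time, might look like this sketch:

```js
// An operand descriptor pairs a data type with a shape.
const desc = {type: 'float32', dimensions: [3, 4]};
```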
The MLGraphBuilder interface enables the creation of MLOperands.
A key part of the MLGraphBuilder interface is the operations, such as conv2d(), pooling, and matmul().
The operations have functional semantics.
Each operation invocation returns a distinct new value without changing the value of any other MLOperand.
The build() method of the MLGraphBuilder interface is used to compile and optimize the computational graph.
The key purpose of the graph compilation step is to enable optimizations that span two or more operations,
such as operation or loop fusion.
For example, the three operations in the graph, conv2d, element-wise add, and relu, might be fused into one kernel depending on the hardware implementation.
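As an illustrative sketch of such a fusable chain, assuming a builder created as described above (the shapes and values here are made up for the example, and builder.constant is used to supply the trained weights):

```js
// conv2d -> add -> relu: each call returns a new MLOperand and mutates
// nothing, so an implementation is free to fuse all three into one kernel.
const input = builder.input('input', {type: 'float32', dimensions: [1, 1, 5, 5]});
const filter = builder.constant(
    {type: 'float32', dimensions: [1, 1, 3, 3]}, new Float32Array(9).fill(0.1));
const bias = builder.constant(
    {type: 'float32', dimensions: [1]}, new Float32Array([0.5]));
const conv = builder.conv2d(input, filter);
const sum = builder.add(conv, bias);            // element-wise add with broadcasting
const out = builder.relu(sum);
const fusedGraph = builder.build({'out': out}); // fusion can happen here
```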
The compute() method of the MLGraph interface is used to execute the compiled computational graph.
That is, to perform inference.
The caller supplies the input values using MLNamedInputs, binding the input MLOperands to their values.
The caller also supplies pre-allocated buffers for the output MLOperands using MLNamedOutputs.
The buffers can be CPU buffers or GPU buffers.
That avoids unnecessary cross-device data movement when interoperating with kernels written in WebAssembly and WebGL/WebGPU.
Let's use a simple example to demonstrate how to use the WebNN API to build and execute a computational graph.
This is the graph we'll use. It contains only one operation, MatMul, which stands for matrix multiplication. It takes two inputs and produces one output. Input a is a float matrix of shape 3 by 4; input b is another float matrix of shape 4 by 3.
According to matrix multiplication, the output c is a float matrix of shape 3 by 3.
To build that graph with the WebNN API, first we need to create the context through navigator.ml.createContext.
createContext takes options where you can specify the device and power preference, as I mentioned.
With the context, we can create a builder instance for graph building and execution.
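In code, these first steps look roughly like this; the snippets in this walkthrough build up one program, following the draft API at the time of this talk (context creation has since become asynchronous):

```js
// Step 1: create the context (preference options omitted here)
// and a graph builder bound to it.
const context = navigator.ml.createContext();
const builder = new MLGraphBuilder(context);
```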
For input a, we can use builder.input to create an input operand for it.
builder.input takes two arguments: the first is the input name, and the second is an operand descriptor that describes the data type and shape.
For input a, the data type is 'float32' and the dimensions are 3 by 4. We can create the input operand for b in a similar way.
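So the two input operands look like this:

```js
// Step 2: declare the inputs by name, with data type and shape.
const a = builder.input('a', {type: 'float32', dimensions: [3, 4]});
const b = builder.input('b', {type: 'float32', dimensions: [4, 3]});
```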
Then we can call builder.matmul, the matrix multiplication operation, with a and b; it returns c.
That is the output operand of this operation. The next step is to build the graph.
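That is:

```js
// Step 3: c = a x b, an output operand of shape [3, 3].
const c = builder.matmul(a, b);
```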
It can be done by calling builder.build.
We need to supply the output operand of the graph along with its name.
builder.build compiles and optimizes the graph in a hardware-specific way and returns a compiled graph that we can compute with. For the sake of simplicity, we put the values in CPU buffers here, using typed arrays.
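In code:

```js
// Step 4: compile the graph, naming the output operand 'c'.
const graph = builder.build({'c': c});
```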
For input a, because its shape is 3 by 4, we allocate a Float32Array of length 12; here we fill it with the value 1.0 just as an example.
Similarly, for input b we allocate another Float32Array filled with the value 2.0. We also need to allocate a buffer for the output c.
As its shape is 3 by 3, it requires a buffer of length 9.
It doesn't need to be initialized with values because it will receive the results of the graph computation.
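The buffer allocation looks like this:

```js
// Step 5: allocate typed arrays. Inputs get example values;
// the output buffer is left zero-initialized.
const bufferA = new Float32Array(3 * 4).fill(1.0);
const bufferB = new Float32Array(4 * 3).fill(2.0);
const bufferC = new Float32Array(3 * 3);
```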
Then we can invoke graph.compute with these buffers. The first argument of graph.compute is the named inputs, which bind the input buffers to the input operands by name. The second argument is the named outputs, which bind the output buffers to the output operands by name. After graph.compute completes, the results are placed in bufferC.
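Putting it together (a sketch; the exact shape of the input/output binding dictionaries varied across spec drafts):

```js
// Step 6: execute the compiled graph; buffers bind to operands by name.
graph.compute({'a': bufferA, 'b': bufferB}, {'c': bufferC});
// bufferC now holds nine elements, each 8.0 (sum over 4 of 1.0 * 2.0).
console.log(bufferC);
```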
This sample is available online as one of the webnn-samples.
It can run in today's web browsers thanks to webnn-polyfill.
Feel free to check it out through the link in the slide. Although the previous example is very simple, it illustrates the essential steps for any complex neural network.
There are more real-use-case samples available in the webnn-samples GitHub repository. They include computer vision ones, like handwritten digit classification, image classification, style transfer, object detection, and semantic segmentation, as well as non-computer-vision ones, like noise suppression.
All of these samples can run in today's web browsers with webnn-polyfill.
The webnn-polyfill itself leverages TensorFlow.js ops implemented in WebGL and WebAssembly.
You can also run these WebNN API samples on native machine learning APIs through WebNN-native.
WebNN-native provides several building blocks: it provides C/C++ headers that have a one-to-one mapping with the WebNN IDL, and applications, including JavaScript runtimes or web browsers, can use them to implement the WebNN API.
For example, WebNN-native provides a Node.js addon that JavaScript applications can incorporate.
WebNN-native also has several backends that use the native machine learning APIs. So far it has a DirectML backend for Windows 10 and an OpenVINO backend for Linux and Windows.
More backends are on the way, thanks to the community's contributions.
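As a rough sketch of how an app might wire that in (the module name and setup here are hypothetical; see the webnn-native repository for the actual binding):

```js
// Hypothetical wiring of the WebNN-native Node.js addon so that code
// written against navigator.ml runs unchanged in Node.js or Electron.
const webnn = require('webnn-native');  // assumed addon entry point
globalThis.navigator = {ml: webnn.ml};  // stand in for the browser API
// ...the MatMul sample above now runs on a native backend,
// e.g. DirectML on Windows 10 or OpenVINO on Linux/Windows.
```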
I'd like to demonstrate running a few WebNN samples with webnn-polyfill and webnn-native.
So I will use WebNN object detection example as a demo.
So first I will run this example in the web browser with webnn-polyfill.
So there are some configurations.
For layouts, there are NCHW and NHWC, which stand for channels-first and channels-last input image layout.
So I'll use NCHW for this demo.
Then for models you have options: you can choose Tiny YOLO V2 or SSD MobileNet V1.
So I will use MobileNet V1.
So it will load and build the model, then do the compute.
There are some time measurements here.
The build time is 257 milliseconds; the inference time is 130 milliseconds.
So this is the performance of the polyfill, and this is actually the result:
a person detected with a bounding box in the image, and there is a kite. Now I will turn on the live camera; we'll get a media stream from the camera and do the detection.
So it detects that I'm a person; very nice.
The FPS is around seven.
So that's the polyfill.
Next I'll demonstrate Electron with WebNN-native. This is the entry point of an Electron app for the same object detection sample. First I run npm install, which installs the Electron/Node.js binding of WebNN-native; once it's done, I start the app.
So this is an electron app running the exact same WebNN object detection sample.
So the only difference is that it is using the Node.js addon of WebNN-native.
So let's see its performance.
Its inference time is much improved; it's reduced to 7.87 ms.
If I turn on the live camera, you see the FPS is around 47, around 50 now, right?
So you can see the rendering is very smooth.
It can recognize the chair, and it can also recognize that I have a cup.
Here are the corresponding performance numbers for WebNN-native on the laptop, and for the earlier Chromium WebNN prototype on the smartphone.
As these numbers show, by introducing a domain-specific abstraction for neural networks and relying on native machine learning APIs, WebNN helps access the optimized software and hardware for machine learning and closes the gap between the web and native.
The Web Neural Network API is being standardized within the W3C Machine Learning for the Web Working Group.
Thanks for watching, and I'm looking forward to your feedback.