Web apps of the future with Web AI

Introduction to Web AI

Jason Mayes, Web AI Lead at Google, introduces the concept of Web AI, defining it as using machine learning models client-side in a web browser. He emphasizes the distinction between Web AI and Cloud AI, highlighting the benefits of on-device processing.

The Growth and Importance of Web AI

Jason shares statistics demonstrating the rapid growth of Web AI, citing over 1.2 billion downloads of Google's web AI libraries. He stresses the importance of upskilling in this area, as customer expectations for AI features in web applications are increasing.

Superpowers of Web AI: Privacy, Offline Capability, and Low Latency

Jason discusses the key advantages of Web AI, including enhanced privacy, offline functionality, and low latency. He illustrates these benefits with real-world examples, such as remote physiotherapy and product placement verification in supermarkets.

Cost Savings, Frictionless Experience, and Reach of Web AI

Jason highlights the cost-effectiveness of Web AI compared to Cloud AI, emphasizing the elimination of expensive cloud-based resources. He discusses the frictionless user experience offered by web-based AI and its vast reach across billions of browser-enabled devices.

Real-World Example: Video Conferencing and Cost Analysis

Using video conferencing as a case study, Jason demonstrates the potential cost savings of Web AI. He breaks down the calculations for background blurring, showing how on-device processing can significantly reduce server-side inference costs.

Large Language Models (LLMs) in the Browser

Jason showcases the use of LLMs in the browser for tasks like email generation, highlighting the speed and privacy benefits. He demonstrates the ease of using the MediaPipe LLM Task API with a code example.

Creative Applications of LLMs

Jason explores various applications of browser-based LLMs, such as Chrome extensions for text conversion and tools for interacting with PDF documents. He emphasizes the potential for creative solutions using readily available JavaScript and AI tools.

Video Analysis and Generative AI in the Browser

Jason presents a demo of video analysis using Transformers.js, showcasing the ability to search for specific content within videos. He also touches on the exciting possibilities of generative AI in the browser, using technologies like WebGPU.

The Future of Generative AI and Web Development

Jason discusses the evolving landscape of generative AI in the browser and the potential for hybrid approaches combining client-side and cloud-based processing. He introduces a new website offering guidance for web developers using AI in JavaScript.

Web AI Community Showcase

Jason highlights examples from the Web AI community, showcasing diverse applications of machine learning in JavaScript across various fields, including animation, healthcare, and retail.

Types of AI Models for Web Applications

Jason presents a range of AI models suitable for web applications, including object recognition, text toxicity detection, depth estimation, face mesh, hand tracking, pose detection, and body segmentation. He provides examples of how these models can be applied in different scenarios.

Visual Blocks: A Framework for Prototyping AI-Powered Web Apps

Jason introduces Visual Blocks, a framework for rapidly prototyping AI-powered web applications. He explains how this framework simplifies the process of combining different AI models and building complex functionalities using JavaScript web components.

The Vision of AI-Compatible Websites

Jason concludes with a vision for the future of web development, where websites become AI-compatible, allowing natural interaction through text and voice. He encourages developers to embrace Web AI and shape the future of this rapidly evolving field.

All right.

So I know that I am standing between you and beers and things, so I'll be on point, but hello everyone.

I'm Jason Mayes, web AI lead at Google.

But today I've come to share a story as a fellow developer about why all of you should start investigating machine learning on the client side in JavaScript to gain superpowers in your next web application.

First, let's start by formally defining what I mean by Web AI, a term I coined back in 2022 to distinguish it from the cloud-based AI systems that were popular at the time.

Web AI is the art of using machine learning models client side in a web browser, running on your own device's processor, graphics card, or even NPU, using JavaScript and surrounding web technologies like WebAssembly and WebGPU, or emerging web standards like WebNN, for acceleration.

Now, in some cases, the model might even be built into the browser itself and exposed to JavaScript developers via emerging APIs for common tasks like prompting, summarization, and translation.

Now, all of this is different from Cloud AI, whereby the model executes on the server side and is accessed via some sort of API instead, which means you need an active internet connection to talk to the cloud API at all times to provide those advanced capabilities.

Now then, as web developers and designers, we have the privilege of working across industries when we work with our customers and in a similar manner, artificial intelligence is likely to be leveraged by all of those industries in the future to make them more efficient than ever before.

In fact, in a few years' time, your customers will expect AI features in their next web project to keep up with their competitors.

So now is the perfect time to upskill in this area.

Okay, so if you're not yet using client side AI, I want to illustrate how fast this is growing and why you should start taking a look before you're left behind.

Now, I've only got statistics for Google's web AI libraries, so worldwide usage is probably higher, but in the past two years alone, we averaged 600 million downloads per year of TensorFlow.js and MediaPipe web-based models and libraries, bringing us to over 1.2 billion downloads in that time.

And we're on track to continue growing in usage.

But why should you care about web AI?

Surely you can do all of this in the cloud, right?

Well, with Web AI you can gain some superpowers that are just not possible with cloud alternatives.

Now I know all of you come from different industries.

So think about how any of these could apply to yours as we go through them.

First up is privacy, as no data from things like the camera, microphone, or even text entry needs to be sent to an AI model on a server, which protects the user's personal data.

And a great example of this is shown here by Include Health, who use browser based pose estimation models to provide remote physiotherapy without sending any of the imagery to the cloud.

Instead, only the resulting range of motion and statistics from the session are sent, allowing the patient to perform the checkup from the comfort of their own home whilst also protecting their privacy.

You also have the ability to run offline on the device itself.

So you can even perform tasks in areas of no or low connectivity after the page load.

Now, you might be wondering, why would a web app need to work offline?

In this great example by Hugo Zanini, he performs product placement verification using a web app in supermarkets for a retail customer that he was actually working with.

And we all know how bad data connections are in supermarkets.

He leveraged TensorFlow.js to run a custom object detection model right in the browser that can work entirely offline and syncs data back when a connection is available later.

Now, you can imagine applying similar techniques to retail analytic solutions to monitor what objects are most interacted with, or maybe even within digital kiosks in the aisles of those supermarkets to understand how busy an area is at different times of day.

Or, what about healthcare solutions in pharmacies to count pills more efficiently for prescriptions, and much, much more.

All whilst working offline to conform with regulations.

Next is low latency, which can enable you to run many models in real time, as you don't have to wait for the data to go to the cloud and back again, which might add significant waiting time if you're on a mobile phone data connection, for example.

As such, many of our models, like these ones from MediaPipe for body pose and segmentation, can run at over 120 frames per second on a mid-range laptop's graphics card with great accuracy, as you can see on this slide.

You've also got lower cost, as you don't need to hire and keep running expensive cloud-based graphics cards and processors.

Which means you can now run generative AI directly in the web browser, like this large language model on the left, without having to break the bank.

And we're now seeing production ready web apps benefit from significant cost savings, like the one shown on the right, for advanced video conferencing features, such as background blurring right in the browser.

And even better, you can offer a frictionless experience for your end users, as no install is required to run a web page, just go to a link and it just works.

In fact, Adobe did exactly that here with Adobe Photoshop Web, enabling anyone, anywhere to use their favorite features on almost any device.

And now when it comes to the object selection tools shown on this slide, embracing client side machine learning for this feature can provide Adobe's users with a better user experience by eliminating the cloud server latency, resulting in faster predictions and a more responsive user interface.

And on that note, it means you can leverage the reach and scale of the web, which has around 6 billion browser-enabled devices potentially capable of viewing your creation.

Alright, let's take a video conferencing solution as an example.

But you can imagine this for other real time use cases that might also stream audio or video data or other things, maybe for healthcare.

Or security cameras and beyond.

Okay, so many of these services provide background blur or background replacement for privacy.

So let's crunch some hypothetical numbers to see the value of using client side AI here at scale.

First, a webcam typically produces video at 30 frames per second.

So assuming your average meeting is 30 minutes in length, that equates to 54,000 frames for every meeting that need to have their background blurred.

Now, assuming 1 million meetings per day for a popular service, that equates to 54 billion segmentations every single day.

Now, even if we assume an ultra low cost of just $0.0001 per segmentation, which is optimistic, that would still be $5.4 million per day, which is around $2 billion a year in server-side AI model inference costs.
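To make those numbers easy to check, here is the same back-of-the-envelope calculation written out in JavaScript; the per-inference price is the hypothetical figure from above, not a real cloud quote.

// Hypothetical cost model for server-side background segmentation.
const FPS = 30;                        // webcam frame rate
const MEETING_MINUTES = 30;            // average meeting length
const MEETINGS_PER_DAY = 1_000_000;    // a popular service
const COST_PER_INFERENCE = 0.0001;     // dollars; optimistic hypothetical price

const framesPerMeeting = FPS * 60 * MEETING_MINUTES;           // 54,000 frames
const inferencesPerDay = framesPerMeeting * MEETINGS_PER_DAY;  // 54 billion segmentations
const costPerDay = inferencesPerDay * COST_PER_INFERENCE;      // $5,400,000
const costPerYear = costPerDay * 365;                          // roughly $2 billion

console.log({framesPerMeeting, inferencesPerDay, costPerDay, costPerYear});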

By performing background blurring on the client side with Web AI, that cost can go away.

And don't forget, you can even port other advanced models to the browser, like noise cancellation, to further improve the meeting experience for your users while reducing costs even more by running all of these features on device.

Or what about bringing large language models to the browser?

In the example shown here, you can see how the user can generate an email to their friend for a given context and with certain requirements without any of the text being sent to the server.

And even better, this runs really fast.

The demo you see here is captured in real time.

Now, at this point, if you're not from an AI background, you might be wondering, how hard is it to actually use an LLM in the browser?

Our MediaPipe LLM Task API allows you to choose from four popular models, depending on your needs and use cases, and can be used in just a few lines of JavaScript, so there's no need to be an AI expert.

So let's take a look at that.

First, you import the MediaPipe LLM Inference API.

Next, you define where your large language model is actually going to be hosted.

You'd have downloaded and hosted the model from the link on the prior slides.

Now you define a new asynchronous function that will actually load and use the model.

And inside this, you can specify your filesetURL that defines the MediaPipe runtime to use.

This is using the default one provided and is safe for you to use as well.

You could host this file yourself on your own server if you prefer.

Now you can use the filesetURL from the prior line to initialize MediaPipe's FilesetResolver to actually download and use the runtime for the generative AI task that you're about to perform.

Next, you load the model by calling LlmTask.createFromModelPath, to which you pass the fileset and model URL that you defined above.

Now, as the model is a large file, you must await for that to finish loading, after which it will return the loaded model, which you assign to a variable called llm.

Now, you can use the loaded model to generate a response, given some input text as shown.

And here, you store the result in a variable called answer.

And with that, you can log the answer, display it on screen, or do something useful with it as you desire.

And that's pretty much it.

Now you just call the function above to kick off the loading process and wait for the results to be printed.

Now, the key takeaway here is that whilst there are some scary-sounding names like FilesetResolver, anyone here can take and run this code and then build around it with your existing JavaScript knowledge for your own creative ideas, even if you're not an AI expert yet.
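To make that walkthrough concrete, here is the same flow condensed into a few lines (the full version appears in the slide code later on this page); the model URL is a placeholder for wherever you host your downloaded model.

import {FilesetResolver, LlmInference} from 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai';

// Fetch the MediaPipe GenAI runtime, load your hosted model, then prompt it.
const fileset = await FilesetResolver.forGenAiTasks(
    'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm');
const llm = await LlmInference.createFromModelPath(fileset, 'https://YOUR_LLM_MODEL_URL_HERE');
console.log(await llm.generateResponse('Write a one line summary of Web AI.'));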

So do start playing with these things today.

You could imagine turning something like this into a Chrome extension, whereby you could highlight any text on a web page, right click, and convert it to a form suitable for social media.

Or maybe to define some word you don't understand, all in just a few clicks, for anything you come across, instead of needing to go to a third-party website to do it.

I did exactly that right here in this demo, which I made in just a few hours on the weekend, entirely in JavaScript, client side in the browser. There are just so many ideas waiting to be created here.
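For a sense of the moving parts, here is a minimal sketch of the Chrome extension wiring for that idea; chrome.contextMenus is the standard extensions API, while rewriteWithLLM() is a hypothetical helper you would implement around an on-device model like the one above.

// background.js (Manifest V3) - minimal sketch of a "rewrite selection" menu item.
chrome.runtime.onInstalled.addListener(() => {
  chrome.contextMenus.create({
    id: 'rewrite-for-social',
    title: 'Rewrite for social media',
    contexts: ['selection'],  // only shown when text is highlighted
  });
});

chrome.contextMenus.onClicked.addListener(async (info) => {
  if (info.menuItemId === 'rewrite-for-social') {
    // rewriteWithLLM() is hypothetical: it would prompt a client-side LLM.
    const result = await rewriteWithLLM(info.selectionText);
    console.log(result);
  }
});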

In fact, people are already using these models to do more advanced things, like talking to your PDF documents to ask questions about their contents without having to read it all yourself, as shown in this demo by Nico Martin.

This is a great time saver.

And a really neat use case of large language models when combined with surrounding retrieval-augmented generation (RAG) techniques, in this case to extract the sentences from the PDF that matter, and then use them as the context for the LLM to answer a given question, all working locally in the browser on your device.

And notice here how it even shows the sentences it generated its answer from, so you can reference the actual human written knowledge.

Which might be useful to ensure that no data was hallucinated in the answer.
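The general shape of that RAG flow is simple enough to sketch; embed() here is a hypothetical sentence-embedding helper (you could back it with a client-side embedding model), and llm is the loaded model from earlier. This is a sketch of the idea, not Nico's actual code.

// Rank the PDF's sentences by similarity to the question, then prompt with the best ones.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function askPdf(llm, sentences, question, topK = 5) {
  const questionVec = await embed(question);  // embed() is a hypothetical helper
  const scored = [];
  for (const sentence of sentences) {
    scored.push({sentence, score: cosineSimilarity(questionVec, await embed(sentence))});
  }
  const context = scored.sort((a, b) => b.score - a.score)
      .slice(0, topK).map((item) => item.sentence).join('\n');
  return llm.generateResponse(
      'Answer using only this context:\n' + context + '\n\nQuestion: ' + question);
}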

Or what if your videos could watch themselves to extract useful information all by themselves? Let me give the demo a second to load here. Okay.

Now I made this demo in just one day using Transformers.js, whereby I'm able to enter a sentence describing what I want to look for, such as my name in any video frame.

Or an image of a guy standing on a plane, which is one of my hobbies when I'm not nerding out on web AI.

And sure enough, it finds those for me.

Really cool.
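You could approximate that kind of demo with the Transformers.js zero-shot image classification pipeline; the sketch below scores one video frame against a search phrase. The CLIP checkpoint name is one published for Transformers.js, but treat the details as assumptions rather than the demo's actual code.

import {pipeline} from 'https://cdn.jsdelivr.net/npm/@xenova/transformers';

// A CLIP model compares an image against arbitrary text labels.
const classifier = await pipeline(
    'zero-shot-image-classification', 'Xenova/clip-vit-base-patch32');

async function frameMatches(video, query, threshold = 0.8) {
  // Draw the current video frame to a canvas so the model can read its pixels.
  const canvas = document.createElement('canvas');
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  canvas.getContext('2d').drawImage(video, 0, 0);
  const results = await classifier(canvas.toDataURL('image/jpeg'), [query, 'something else']);
  return results[0].label === query && results[0].score > threshold;
}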

From Chrome extensions that supercharge your productivity to features within the web app itself, we're at the start of a new era that can really enhance your web experience, and the time is now to start exploring all of those ideas.

In fact, right now, generative AI in the browser is in its early stages.

But as hardware continues to evolve, with more RAM available to both the CPU and the GPU, we shall continue to see models ported to the browser on your device, enabling businesses to reimagine what you can actually do with a web page, especially for industry or task specific situations.

In fact, we could see smaller large language models, in the 2 to 8 billion parameter range, be tuned for a specific purpose and run on consumer hardware.

Right now, one could envision a hybrid approach, whereby if the client machine is powerful enough, you download and run the model there, only falling back to cloud AI in times when the device is not able to run the model, which might be true for older devices.

But with time though, more and more compute will be performed on device.

So your return on investment should get better as time goes by when implementing an approach like this.
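A sketch of that progressive-enhancement decision might look like this; the capability heuristic, the loadOnDeviceModel() helper, and the /api/generate endpoint are all illustrative placeholders.

// Run on device when the browser looks capable; otherwise fall back to a cloud API.
async function getGenerator() {
  const hasWebGPU = 'gpu' in navigator && !!(await navigator.gpu.requestAdapter());
  const enoughMemory = (navigator.deviceMemory ?? 0) >= 8;  // Chrome-only hint, in GB

  if (hasWebGPU && enoughMemory) {
    const llm = await loadOnDeviceModel();  // hypothetical: e.g. the LLM task from earlier
    return (prompt) => llm.generateResponse(prompt);
  }
  return async (prompt) => {
    const response = await fetch('/api/generate', {  // your own server-side fallback
      method: 'POST',
      body: JSON.stringify({prompt}),
    });
    return (await response.json()).text;
  };
}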

And on that note, we're working on a brand new website providing guidance for web developers choosing to use AI in JavaScript.

With this site, we aim to help you understand key AI concepts so you can discover opportunities to use popular models and be more productive than ever.

So bookmark the site shown on this slide as we'll continue to publish more content here over the year.

All right, now let's head on over to see what the web AI community has been up to.

These are people just like you who are already using machine learning in JavaScript in their products and services, just to give you a taste of what's possible.

Fundamentally, what we're trying to do is build tools to allow anybody to be their true self.

There's a lot of people that feel uncomfortable on camera.

Here's a much more detailed side-by-side view of my face being captured alongside the face of the 3D character.

And so what's happening is it's tracking individual facial blend shapes and then relaying them back into that Unreal Engine to render them.

I work as a radiologist, and the last four years I've been really interested in using artificial intelligence for segmentation.

And this is a really nice thing with TensorFlow.js, I think: you can interact with the AI models. It doesn't have to be fully automatic predictions; you can actually guide the models to get the result that you want.

We're going to see an animatronic talking backpack that uses a face mesh.

It was a comedian-in-the-wild experiment that respected social distancing.

What it does is put all these tracking points on a face.

And it can track it with and without glasses.

It has an incredibly high frame rate, even in JavaScript and even in the browser.

So wouldn't it be great if you could use this technology to control animatronics?

So this is an example of how Cardinal Health is using Roboflow to improve the backrooms of pharmacies.

So they have a division in their company, that works with pharmacists where one pharmacist can manage several locations remotely over a video stream.

And so this runs in the browser on an iPad and previously it was just a video chat between the pharmacist and the technician.

And then they used Roboflow to add on a pill counting feature that helps the pharmacist be augmented by the computer vision model.

So they don't have to count from scratch; they can use the model to estimate and then adjust up and down where the model has not gotten an exact count.

The video that I'd love to show here is a video piece that I constructed which I call Mirror Exercise.

This is an AI generated duet with myself.

What you're seeing here, in the figure in black, is my real motion capture data from a couple of data-taking sessions in the studio.

And in blue, you see a dancing accompaniment that is generated by the model.

In each case, the model is seeing a certain segment of my movements and is creating a slight variation on it, in some sense.

It's a hand tracking engine called Yoha.

And so what it does is it processes the video feed of your webcam and it detects the location of your hand in this video feed.

And the goal of it is that it allows you to quickly sketch something, roughly similar to how you would do it on a whiteboard.

HEMO is an engine that allows you to build very complex solutions in just minutes.

And we do that through a drag and drop interface, which helps you define any kind of process without the need for coding.

This is the final version of the program.

The video is a bit sped up so things happen fast, but you can see that as water is taken out from the container, the graph reflects the changes in the level of the liquid, which is what we wanted.

We wanted to go beyond the machine and even beyond the clinic.

And so that's really when we decided to expand our platform into computer vision, delivering care directly into patients' homes on their own devices.

We're a medical device that can run on any device, fully web based.

And you can see these patients doing different exercises, upper body, lower body, with real time feedback.

Counting their reps. We track their range of motion.

We gather all that and send it back to the clinician for review to help modify their home exercise plans accordingly.

All right.

Pretty cool, right?

So one thing I want to point out there is that not one of those examples uses generative AI.

These are traditional AI models being used in the browser in those cases, even though we can run generative models too.

And notice there's even a talking backpack, which reminds us that, as JavaScript developers, we can talk to hardware too, via Web Bluetooth, WebUSB, or Web Serial protocols that are all part of the browser.

So don't forget, you can actually control things in the real world too, for your customers.

All right.

Yep, some amazing demos. No matter if you're recording motion capture data to transform users into a different persona to make an experience more fun, or using the latest in generative AI, where you can even run diffusion models in the web browser at incredible speeds with new browser technologies like WebGPU that are now enabled by default in Chrome, things are about to get really exciting with regards to what we can expect from a web app of the future.

So what types of AI models can you actually run in the browser?

Clearly not everyone here is working at video conferencing companies and gen AI companies, right?

There are thousands of models out there, many of which can be used in just a few lines of JavaScript and new models are coming out every single month.

Now I can't cover them all in this short talk, but let's go through a few of the popular ones and think about how we can apply these to any of the customers that you work with or your companies.

First up, you've got object recognition.

This allows you to understand where in the image the object is located, but also how many exist, which is more powerful than image recognition alone, as that would only tell you something exists, but not where or how many.

You could use a model like this to create a pet camera that could notify you when your dog's been naughty, eating the treats when you're away from home.

I made this example again in just one day using off the shelf models that are capable of recognizing 90 common objects in the house, like bowls and dogs.

And as you can see, it actually works pretty well.

The dog in this video is caught red handed as he tries to inspect the bowl of food.

And then you can send yourself an alert to do whatever you want to do once you know this has occurred.

All of which is done without having to stream the real-time video to the server 24/7.

Instead, you just send the clip of the dog when it intersects with the bowl in that moment, so you can see what happened.
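A sketch of that pet cam logic using the off-the-shelf COCO-SSD model for TensorFlow.js, whose 90 classes include both "dog" and "bowl"; it assumes an npm bundler setup, a playing video element, and a hypothetical sendAlert() helper.

import '@tensorflow/tfjs';
import * as cocoSsd from '@tensorflow-models/coco-ssd';

// Axis-aligned box overlap test; COCO-SSD boxes are [x, y, width, height].
function overlaps([ax, ay, aw, ah], [bx, by, bw, bh]) {
  return ax < bx + bw && bx < ax + aw && ay < by + bh && by < ay + ah;
}

const model = await cocoSsd.load();
setInterval(async () => {
  const detections = await model.detect(video);  // assumes a playing <video> element
  const dog = detections.find((d) => d.class === 'dog');
  const bowl = detections.find((d) => d.class === 'bowl');
  if (dog && bowl && overlaps(dog.bbox, bowl.bbox)) {
    sendAlert('Dog detected at the bowl!');  // hypothetical notification helper
  }
}, 1000);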

Or what about text toxicity detection?

Here, you can pass a sentence to the model and classify whether it's toxic or not, and even figure out what type of toxicity it might be: maybe an insult or a threat, for example.

Something like this could be used to pre-filter comments in your web chat or commenting system on your website, automatically holding potentially toxic ones for further moderation before you publish.
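With the TensorFlow.js toxicity model, that pre-filter can be a few lines; a minimal sketch, assuming an npm bundler setup and that you wire needsModeration() into your comment form.

import '@tensorflow/tfjs';
import * as toxicity from '@tensorflow-models/toxicity';

// 0.9 is the confidence threshold a label must exceed to count as a match.
const model = await toxicity.load(0.9);

async function needsModeration(comment) {
  const predictions = await model.classify([comment]);
  // Each prediction covers one label (insult, threat, etc.) for our single comment.
  return predictions.some((p) => p.results[0].match === true);
}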

Now we can even extract 3D data from a 2D image with a selfie depth estimation model.

You can pass it a portrait image of a person, and it will predict how far away each pixel is in the image, allowing you to understand the 3D makeup of the face and body.

Now, if you can do that, you can do some really cool things like image relighting.

Here, I can cast the light rays around my body for a more realistic effect than was possible before, which could be great for augmented reality or lighting effects on images to bring them to life.

Next, we have our face mesh model.

This is just three megabytes in size and can recognize 468 landmarks on the human face.

And as you can see, it's running at well over 130 frames per second on my laptop.

And we're seeing companies like ModiFace, part of the L'Oréal group, use this technology in production web apps.

Now, the thing to note here is that the lady on the right is not wearing any lipstick.

Instead, they're using our face mesh model combined with WebGL shaders to color in her lips in real time, based on the color swatch that she's chosen at the bottom of the app.

Really amazing stuff.
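For the curious, loading a face landmark tracker yourself looks much like the LLM example earlier, this time via the MediaPipe tasks-vision bundle; the model asset path is a placeholder for wherever you host the .task file.

import {FilesetResolver, FaceLandmarker} from 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision';

const vision = await FilesetResolver.forVisionTasks(
    'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision/wasm');
const faceLandmarker = await FaceLandmarker.createFromOptions(vision, {
  baseOptions: {modelAssetPath: 'https://YOUR_FACE_LANDMARKER_MODEL_URL'},
  runningMode: 'VIDEO',
});

// Call once per rendered frame; each detected face is an array of {x, y, z} points.
function onFrame(video) {
  const result = faceLandmarker.detectForVideo(video, performance.now());
  for (const face of result.faceLandmarks) {
    console.log('Tracked', face.length, 'landmarks on this face');
  }
}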

We also have models to track your hands, which can track multiple hands in 2D or 3D, again at over 120 frames per second on a mid-range laptop with a dedicated graphics card.

A model like this could be used for gesture recognition, or even touchless interfaces for human computer interaction.

And with a bit of extra training, you can even do simple sign language interpretation.

Pretty handy.
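And because the landmarks are just coordinates, simple gestures fall out of a little geometry. A sketch, assuming the same tasks-vision setup as above: detect a pinch by measuring the distance between the thumb tip (landmark 4) and index fingertip (landmark 8).

import {FilesetResolver, HandLandmarker} from 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision';

const vision = await FilesetResolver.forVisionTasks(
    'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision/wasm');
const handLandmarker = await HandLandmarker.createFromOptions(vision, {
  baseOptions: {modelAssetPath: 'https://YOUR_HAND_LANDMARKER_MODEL_URL'},
  runningMode: 'VIDEO',
  numHands: 2,
});

// True when any tracked hand has its thumb tip and index tip close together.
function isPinching(video) {
  const result = handLandmarker.detectForVideo(video, performance.now());
  return result.landmarks.some((hand) => {
    const dx = hand[4].x - hand[8].x;
    const dy = hand[4].y - hand[8].y;
    return Math.hypot(dx, dy) < 0.05;  // landmarks use normalized 0..1 coordinates
  });
}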

Continuing on the human body theme, you can also detect 2D or estimated 3D pose for 33 key points at over 150 frames per second, again on a machine with a dedicated graphics card.

This model has three forms: light, full, and heavy versions, which allow you to set the trade-off between accuracy and speed.

Or building on this, what about full body segmentation?

With a model like this, you can differentiate the human body from the background of the scene.

Now, if you can do that, you can blur the body for privacy, like we do in Street View.

Or you can apply effects to the background for a more stylistic effect.

There's also a specialized version of this model that focuses on the upper body, as you can see here.

This is better suited for typical video call situations.

All right, so you've got all these AI models that make you feel like a superhero, but how can they actually help you?

By selecting the right models for the right situation, you can then provide your customers with superpowers themselves when you apply these models to their industries.

In fact, at Google, we often need to explore what models to use for a given task.

So we created a system called Visual Blocks to do that in an efficient way.

It's a framework I worked on with my team that allows you to go from an AI-powered idea to a working prototype faster than ever, built around JavaScript web components, so it's super easy to extend by anyone who knows JavaScript.

And even better, once you make a building block to do a thing, you can chain them together with other powerful blocks to bring your idea to life without having to be lost in the code complexity.

And it's not just us using this.

The examples on this slide were actually contributed by Hugging Face.

We've been adding new client-side AI models via Transformers.js, and these are now available to explore in Visual Blocks in just a few clicks, which is great if you're new to this space.

You can try them for yourself at the link shown, just click on their names and they'll launch immediately.

And here you can see a glimpse of the future, whereby a research team at Google used a large language model to create a Visual Blocks pipeline for me to solve some problem I had, like creating an augmented reality effect, just by explaining what I want in plain English and having it make the start of a prototype for me within Visual Blocks.

Really incredible stuff, and this is where things are heading.

Now then, I'd just like to end by saying there are very few opportunities in one's life to be at the beginning of a new era like this.

I, for one, remember when the internet came out for the first time.

Yes, I am that old.

And Web AI has the same feel as that, only magnified 100x.

Now, I personally believe that in the future we as web developers will build AI-compatible websites, whereby you have the option to just talk to them naturally via text or voice to perform any of the tasks that they support, or to delegate them to an AI agent, maybe via an exposed AI.json file for the domain that LLMs can understand, instead of us manually figuring out some arbitrary workflow or user experience.
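No such AI.json standard exists today, so purely as a thought experiment, such a manifest might declare a site's supported tasks in a machine-readable way, something like:

{
  "site": "example-florist.com",
  "tasks": [
    {
      "name": "orderBouquet",
      "description": "Order a bouquet for delivery",
      "inputs": {"occasion": "string", "budget": "number", "deliveryDate": "date"}
    },
    {
      "name": "trackOrder",
      "description": "Check the status of an existing order",
      "inputs": {"orderId": "string"}
    }
  ]
}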

And sites that are not AI compatible may be out of fashion pretty fast, just like when we moved to having mobile versions of all of our web pages as smartphones grew in popularity.

Now, I hope in this talk I've been able to provide you a tiny glimpse into the future, and why you should start investigating Web AI today, because everyone watching here has the chance of a lifetime to shape the future of this fast-growing space.

And I hope that all of you will join me on this adventure.

Do tag me in anything you make, or if you've got any suggestions for any of the technologies that I've mentioned in this talk.

You can connect with me on social media, using the links on the slide.

And of course, do remember to use the WebAI hashtag, if you make anything worth sharing, so I can share it, and make you all popular and stuff.

Thank you, and I'll see you next time.

Web AI != Cloud AI

Client vs Server

Diagram depicting a web application structure with a front end and AI model, connected to a CPU, GPU, and NPU within a web browser. Logos for JavaScript, WebGL, WebAssembly, WebGPU, and WebNN are displayed below.

Built in JS APIs for common AI powered tasks

Logo of the Chrome browser.
Diagram showing a cloud server containing an AI model, CPU, and GPU. The server communicates with a web app front end in a web browser via a cloud API call.

AI will touch every industry.

Time to learn in JavaScript too!

Diagram depicting AI's impact spreading across various industries and sectors represented by icons, with AI at the center.

>1.2 billion

Cumulative CDN downloads of MediaPipe Web / TensorFlow.js libraries or models in 2022 and 2023.

Sources: 2023 + 2022 public jsDelivr CDN statistics: 234M TFJS Library, 80M tfjs-tflite Library, 285M MP Face Detection, 292M MP Face Mesh, 60M MP Pose, 12M MP vision, 141M MP Selfie Segmentation, 52M MP Hands, 54M TFJS BodyPix, 18M TFJS MobileNet

Why Client Side AI?

Privacy

Sensor data from camera, microphone, or other connected sensors stays on device. No cloud-based inference, protecting user privacy.

IncludeHealth™

Learn More: https://goo.gl/IncludeHealth

Stand back so we can see you

A person standing in a living room, with a visual overlay indicating body position points as part of a motion analysis system.
JavaScript on-device AI wins

Offline

After initial page load the model and logic can run entirely offline which is particularly suitable for folk working in remote locations or areas with poor connectivity.

Consumer Goods Detection

Created by Hugo Zanini

Learn More: https://goo.gl/WebAI-Goods

Image showing a shelf with multiple products enclosed in green boxes, indicating AI detection of consumer goods.

JavaScript on-device AI wins

Latency

As no cloud is involved, there is no round-trip time to the server and back again. Instead of 100ms+ waits, models can run in real time.

A split image showing a person performing a handstand. On the left, the person is overlaid with a skeletal pose estimation. On the right, a silhouette of the person is filled with blue against a red background.

JavaScript on-device AI wins

Cost

With no server-side hiring of expensive GPUs / CPUs and RAM, and no bandwidth costs for sending data to and from the server, costs can reduce significantly.

Chat with state of the art LLMs client side at incredible speeds without API usage caps

Web apps like video conferencing are even more cost effective to business when providing users with advanced features

Illustration showing a web interface for LLM inference on the left and a person video conferencing on the right.

JavaScript on-device AI wins

Zero install

Unlike native apps, no installation is required, so the web app isn't blocked by company policies.

Learn More:

https://goo.gl/AdobeWebAI
Screenshot of Adobe Photoshop showing the object selection tool being used on an image of two red apples with water droplets.

JavaScript on-device AI wins

Reach & Scale of Web

Anyone, anywhere, can go to the link the web app is hosted on and it just works. No need to install dependencies or set up Linux / CUDA etc. High shareability.

Image of a laptop screen displaying a video call with an individual wearing a white hoodie. On the right side of the screen, there are options for background effects and filters.

54K

Images per 30 minute meeting*

*Assuming 30 frames per second for video

$2B

Estimated yearly savings by using client side Web AI*

*Assuming $0.0001 per inference for segmentation when using server side compute
Image of a person smiling in a yellow sweater next to text indicating estimated yearly savings using Web AI.

Noise cancellation

Filters out sound from your mic that isn't speech

Image of a laptop screen displaying a video call with a highlighted text box showing noise cancellation settings.

LLMs in the browser

How could you leverage an LLM in your website?

Image of a laptop displaying a webpage at mediapipe-studio.webapps.google.com

LLMs in the browser

How could you leverage an LLM in your website?

Image of a laptop screen showing a web application interface with a text input box containing a humorous prompt about writing an email detailing why cake is better than pie.

LLM Task API

Load and run our 4 supported LLM architectures in the browser with ease.

Learn More: https://goo.gle/MPWebLLMDocs
Images of four LLM architectures: Gemma 2B, Phi 2, Falcon RW 1B, and Stable LM 3B, each represented visually with distinct designs and colors.
import {FilesetResolver, LlmInference as LlmTask} from 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai';

// URL of the LLM you downloaded and are hosting yourself.
const modelURL = 'https://YOUR_LLM_MODEL_URL_HERE';

async function initLLM() {
    // The fileset tells MediaPipe where to fetch its WebAssembly runtime from.
    const filesetURL = 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm';
    const genAIFileset = await FilesetResolver.forGenAiTasks(filesetURL);

    // Load the model (a large file, so await it) and keep a handle to it.
    const llm = await LlmTask.createFromModelPath(genAIFileset, modelURL);

    // Generate a response for some input text and log the answer.
    const answer = await llm.generateResponse("Define the word: Dog.");

    console.log(answer);
}

initLLM();

Endless possibilities

A prototype made in just a few hours to explore how LLMs could enhance your web browsing productivity.

Real-time capture taken in Chrome.

Model is accelerated via WebGPU running on an NVIDIA 1070 GPU entirely in the web browser.

Screenshot of a laptop displaying a demo of Google's Gemma 2B Web AI model with text highlighting and actions, such as defining a word and explaining a phrase.

Talk with PDFs

Created by Nico Martin

Try it yourself: https://pdf.nico.dev/

Screenshot of a desktop showing a file browser with a PDF document named "EU - Twitter Terms of Service.pdf" selected and a preview of its contents.
https://goo.gle/DoesVideoContain
Jason shows a video and describes what's on screen.

The start of a new era: AI enabled web apps

Illustration of a button with gradient fill labeled 'Get creative!'

Hardware will evolve over time

Allowing even more powerful models to run on device in the future...

Images of two smartphones with similar designs, showing an older model on the left and a newer one on the right, symbolizing technological evolution.
Large colored text of the numbers and letters "2-8B" with the caption "Smaller LLMs ❤️ Web AI" below.

Hybrid Web AI could be a good starting point today

Progressive enhancement
web.dev/explore/ai
A gradient button with the text "AI and the Web ecosystem".

Community show & tell demos

See them all in full at:

https://goo.gl/made-with-tfjs

Screenshot of a video playlist featuring demos made with TensorFlow.js. The playlist includes various topics like image segmentation, reinforcement learning, and AR motion capture.

Jason then shows a series of videos that highlight real-world use cases.

Jason Ma
Bryan Pra
Image of two people in a video call with colored microphones against an orange background with abstract line patterns.
MRI scan image with highlighted region in blue, accompanied by an interface for adjusting visualization settings, such as pointer size and mask opacity.
A video frame showing a person with facial recognition markers displayed on their face, overlaid on a background with an orange geometric pattern.

Brad Dwyer

Software Engineer

augmented-COCO-transferLearning

Screenshot of a webpage showcasing an object detection project on roboflow. The interface includes project details, performance metrics, and options to visualize or upload files.

Knee Extension

Image showing a person performing a seated knee extension exercise in a room with a plant and TV in the background.

3D Character Animation

Created by Richard Yee

Learn More: https://goo.gl/37FtC2g

Image of a 3D animated character with cat-like features in a virtual room setting. There is a small inset video of a person in the top-left corner.

WebGL

WebGPU

Comparison of two laptops displaying similar abstract graphics, titled 'WebGL' and 'WebGPU'.
  • object detection
  • image recognition
  • speech recognition
  • token extraction
  • natural language processing
  • large language models
  • diffusion models
  • generative ai
  • translation
  • pose estimation
  • face keypoint estimation
  • hand pose estimation
  • body segmentation
  • semantic segmentation
  • decision trees
  • decision forests
  • multi layer perceptron
  • depth estimation
  • text toxicity detection
  • spam classification
  • universal sentence encoder
  • gesture recognition
  • gans
An image with a large, central question mark overlaid on a background of various terms related to AI and machine learning.

Object Recognition

Image of two dogs detected by an object recognition system with confidence levels of 98% and 91%.
Four screenshots of a pet cam application in use, showing screen interfaces and detection features, including bowl and dog detection alerts.

Pet Cam

Client

Image of a dog with labeled detection boxes indicating "dog - with 87% confidence" and "bowl - with 98% confidence" from a pet camera interface.

Text Toxicity Detection

Is a comment toxic or not?

Automatically filter out comments before they are even posted.

Or maybe you could hide offensive things via a chrome extension if a paragraph on page is deemed toxic.

What would you make?

A table showing various comments with corresponding columns for identity attack, insult, obscene, severe toxicity, sexual explicit, threat, and overall toxicity ratings. Includes a text input area with a button labeled 'CLASSIFY'.

Selfie Depth Estimation

We have also produced a brand new model type for depth estimation of selfie photos.

Try it: https://goo.gle/selfiedepth

Image showing four pairs of selfies and their corresponding depth estimations in color-coded heatmaps.
TensorFlow.js

Image relighting

Try it: https://goo.gl/ImageRelight

Image of a man sitting at a desk with light effects applied, possibly demonstrating image relighting technology. The top left corner shows TensorFlow.js branding, and there is a performance metric overlay in the top right.

Face Mesh

Just 3MB in size

Recognize 468 facial landmarks

The image includes a series of visuals demonstrating facial recognition technology. The left shows a man's face with a digital wireframe overlay, highlighting facial landmarks. The top right features a woman with similar facial mapping, and the bottom right shows another woman using a makeup application interface with color selection options.

Hand Pose Estimation

Track multiple hands in 3D or 2D and with higher precision than before.

Try it: https://goo.gl/e/HandPose

Three images are shown: a 3D graph depicting hand movements, a close-up of a hand with tracking points, and a person with tracked hands in a video feed.

2D / 3D Pose

BlazePose GHUM 3D model is now available through TensorFlow.js recognizing 33 key points. Can run at 150 FPS for real time results.

Image of a person squatting in an outdoor gym setup with overlaid 2D and 3D pose estimation graphics highlighting key body points and a visual graph chart illustrating the pose.

Full Body Segmentation

Builds upon the pose model you just saw allowing you to get both pose and segmentation returned simultaneously.

Image showing a person performing a handstand on the left with overlaid pose lines, and a corresponding segmentation image on the right depicting the silhouette in blue against a red background. A button below reads "Pose + Segmentation".

Selfie Segmentation

Learn More: https://goo.gl/SelfieSeg
Image of a person in front of a modern building, likely green screened.
Photograph of an action figure wearing a red superhero suit with a lightning bolt emblem, standing on a rooftop with a cityscape in the background. The sky is clear and blue.
Try it yourself: https://goo.gl/VisualBlocks
A man and a woman sitting at a table with two laptops displaying visual content. A presentation screen is visible in the background.

Learn how to use:

https://goo.gl/LearnVisualBlocks
Screenshot showing a graphical interface with nodes for input and model configuration, featuring images including a dog and a strawberry.

Create advanced web apps to prototype faster

Bring AI ideas to life with this Hugging Face and Visual Blocks collaboration.

Make 2D images 3D, find key tokens in sentences to make smarter choices, or translate between languages locally!

Illustrations showing three examples of advanced web applications: depth estimation with an image input, token classification highlighting locations and organizations, and translation with text inputs in different languages.
Screenshot of a web application interface with a text field for describing a pipeline and a submit button.
A single small green plant growing among a bed of rocks.
Screenshot of a flight search interface showing prices and destinations in Hawaii, with routes from San Francisco to Honolulu and other Hawaiian islands.
Screenshot showing a list of returning flights between SFO and HNL, ranked by price and convenience, with details on time, airline, stopovers, CO2 emissions, and prices.

#WebAI

Abstract geometric pattern with a circular opening showing clouds in the sky.

Thank you

Jason Mayes
Web AI Lead, Google

linkedin.com/in/WebAI
@jason_mayes

Newsletter
https://goo.gl/WebAINews

Get Inspired:
https://goo.gl/WebAIVideos

Learn:
https://goo.gl/Learn-WebAI

Made something with Web AI or got thoughts for the future? Share using:

#WebAI
  • WebAssembly
  • WebGPU
  • MediaPipe LLM Task API
  • asynchronous function
  • WebUSB
  • Web Serial protocol
  • Transformers.js