Quantization from the ground up | ngrok blog
March 30, 2026

Qwen-3-Coder-Next is an 80-billion-parameter
model, 159.4GB in size. That’s roughly how much RAM you
would need to run it, and that’s before thinking about long context windows.
This is not considered a big model. Rumors have it that frontier models have
over 1 trillion parameters, which would require at least 2TB of RAM. The
last time I saw that much RAM in one machine was never. But what if I told you we can make LLMs 4x smaller and 2x faster, enough to run very capable models on your laptop, all while losing only 5-10% accuracy?
Sam Rose writes and develops incredible visual essays that explain complex concepts in very approachable ways, like this piece on prompt caching from late last year.
He returns with a long and detailed explanation of quantization: the seemingly magical property of LLMs that, by substantially reducing the precision of the floating point numbers which capture the weights of the model, we don’t actually reduce the capability of the model much at all. That in turn allows us to run models with less memory and greater performance.
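To give a feel for the core idea, here is a minimal sketch of symmetric int8 quantization, the simplest form of the technique: a single scale factor maps each 4-byte float weight to a 1-byte integer, giving the 4x size reduction at the cost of a small rounding error. The weight values below are made up for illustration; real schemes (per-channel scales, zero points, 4-bit packing) are more involved, and Sam's essay covers them properly.

```python
# Hypothetical example weights; a real tensor would have billions of these.
weights = [0.82, -1.47, 0.03, 2.10, -0.55]

# One scale factor maps the largest magnitude onto the int8 range [-127, 127].
scale = max(abs(w) for w in weights) / 127

# Quantize: each weight becomes a small integer (1 byte instead of 4).
quantized = [round(w / scale) for w in weights]

# Dequantize: multiply back by the scale to recover approximate floats.
restored = [q * scale for q in quantized]

# Rounding error is bounded by half a quantization step per weight.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
assert all(-127 <= q <= 127 for q in quantized)
assert max_error <= scale / 2
print(quantized, f"max error {max_error:.5f}")
```

The interesting part, and the subject of the essay, is why losing this much precision costs a large model so little capability in practice.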