Vision Language Models (Better, faster, stronger)
May 14, 2025

Vision Language Models (VLMs) are the talk of the town. In a previous blog post (from April 2024), we talked a lot about VLMs. A major chunk was about LLaVA, the first successful and easily reproducible open-source vision language model, along with tips on how to discover, evaluate, and fine-tune open models.
Since then, so much has changed. Models have become smaller yet more powerful. We’ve seen the rise of new architectures and capabilities (reasoning, agentic behavior, long-video understanding, and more). In parallel, entirely new paradigms, such as multimodal Retrieval-Augmented Generation (RAG) and multimodal agents, have taken shape.
In this blog post, we’ll take a look back and unpack everything that happened with vision language models over the past year. You’ll discover key changes, emerging trends, and notable developments.
Vision language models are a relatively new category of generative models that work with images as well as text. Even relatively small models that can run on consumer GPUs (and potentially even in the browser) can be surprisingly capable.
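To give a feel for how little code it takes to try one of these small models, here is a minimal sketch using the `transformers` image-text-to-text pipeline. The checkpoint name and image URL are illustrative assumptions; any small open VLM from the Hub can be swapped in.

```python
# Minimal sketch: querying a small open VLM with the transformers pipeline.
# Assumes the `transformers` library is installed and that the
# HuggingFaceTB/SmolVLM-Instruct checkpoint is used (swap in any small VLM).
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="HuggingFaceTB/SmolVLM-Instruct")

# Chat-style input: one user turn containing an image URL and a text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.jpg"},  # placeholder image URL
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

# return_full_text=False keeps only the model's reply, not the echoed prompt.
outputs = pipe(text=messages, max_new_tokens=64, return_full_text=False)
print(outputs[0]["generated_text"])
```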