Molmo2
A fully open Vision-Language Model with video grounding capabilities
Molmo2 (Multimodal Open Language Model 2) is a fully open Vision-Language Model (VLM) family developed by the Allen Institute for AI (AI2) and the University of Washington. Its key distinguishing feature is video grounding: the ability to precisely indicate "when and where" specific events or objects occur within a video.
Trained on nine new datasets, all constructed without relying on proprietary models, Molmo2 achieves state-of-the-art performance among open models; in video pointing and tracking it even surpasses proprietary models such as Gemini 3 Pro.
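To make the "when and where" idea concrete: a video-grounding result can be thought of as a track, a sequence of timestamped points locating an object across frames. The schema below is purely illustrative (Molmo2's actual output format is defined in the paper and repository); the field names and the helper function are assumptions for this sketch.

```python
# Hypothetical illustration only: Molmo2's real output schema lives in the
# paper/repo. This sketch shows one way "when and where" grounding data
# could be represented: a track of timestamped, normalized image points.

def points_in_window(track, t_start, t_end):
    """Return the points of a track whose timestamps fall in [t_start, t_end]."""
    return [p for p in track if t_start <= p["t"] <= t_end]

# A toy track for one object: time in seconds, x/y normalized to [0, 1].
track = [
    {"t": 0.0, "x": 0.10, "y": 0.50},
    {"t": 1.0, "x": 0.25, "y": 0.48},
    {"t": 2.0, "x": 0.40, "y": 0.46},
]

# "Where was the object between 0.5 s and 2 s?"
print(points_in_window(track, 0.5, 2.0))
```

A representation like this answers both halves of the grounding question at once: the timestamps say *when*, and the coordinates say *where* in the frame.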
Paper: arXiv:2601.10611
Code: github.com/allenai/molmo2
Demo: playground.allenai.org