Molmo2
A fully open Vision-Language Model with video grounding capabilities
Molmo2 (Multimodal Open Language Model 2) is a fully open Vision-Language Model (VLM) family developed by the Allen Institute for AI (AI2) and the University of Washington. Its key distinguishing feature is video grounding: the ability to precisely indicate "when and where" specific events or objects occur within a video.
Trained on nine new datasets, all constructed without relying on proprietary models, Molmo2 achieves state-of-the-art performance among open models; in video pointing and tracking it even surpasses proprietary models such as Gemini 3 Pro.
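To make the "when and where" idea concrete: a video-grounding result can be thought of as a track, a sequence of timestamped points locating an object across frames. The schema below is purely illustrative (Molmo2's actual output format is defined in the paper and repository); the field names and the helper function are assumptions for this sketch.

```python
# Hypothetical illustration only: Molmo2's real output schema lives in the
# paper/repo. This sketch shows one way "when and where" grounding data
# could be represented: a track of timestamped, normalized image points.

def points_in_window(track, t_start, t_end):
    """Return the points of a track whose timestamps fall in [t_start, t_end]."""
    return [p for p in track if t_start <= p["t"] <= t_end]

# A toy track for one object: time in seconds, x/y normalized to [0, 1].
track = [
    {"t": 0.0, "x": 0.10, "y": 0.50},
    {"t": 1.0, "x": 0.25, "y": 0.48},
    {"t": 2.0, "x": 0.40, "y": 0.46},
]

# "Where was the object between 0.5 s and 2 s?"
print(points_in_window(track, 0.5, 2.0))
```

A representation like this answers both halves of the grounding question at once: the timestamps say *when*, and the coordinates say *where* in the frame.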
Paper: arXiv:2601.10611
Code: github.com/allenai/molmo2
Demo: playground.allenai.org