Vision-Language Connector
Overview
The Vision-Language Connector is a critical module that transforms visual features extracted by the Vision Transformer (ViT) into a format that the Large Language Model (LLM) can process. Molmo2 follows the standard VLM architecture (Deitke et al. 2024) and adopts a design that can uniformly process both images and video. For the end-to-end pipeline that the Connector sits inside, see the Molmo2 architecture figure in the overview chapter.
Architecture Details
Multi-Layer Feature Usage
The Molmo2 Vision-Language Connector extracts features from multiple layers of the ViT, rather than a single layer.
- Third-to-last layer: High-level semantic features
- Ninth-from-last layer: Mid-level features
This design follows the prior work Molmo (Deitke et al. 2024), combining visual information at different levels of abstraction to achieve richer representations.
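As a rough illustration of multi-layer feature extraction, the sketch below gathers the two layers' outputs and concatenates them along the channel dimension before the connector's projection. The function and variable names are illustrative, not Molmo2's actual code.

```python
import torch

def fuse_vit_layers(hidden_states: list[torch.Tensor]) -> torch.Tensor:
    """Combine features from two ViT layers along the channel dimension.

    hidden_states: per-layer ViT outputs, each of shape [batch, patches, dim].
    Layer indices follow the text: third-to-last and ninth-from-last.
    """
    high_level = hidden_states[-3]  # high-level semantic features
    mid_level = hidden_states[-9]   # mid-level features
    # Concatenation is an assumption here (it follows Molmo's published design).
    return torch.cat([high_level, mid_level], dim=-1)  # [batch, patches, 2*dim]
```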
Attention Pooling
Attention Pooling is used to reduce the number of patch-level tokens. Within each pooling window, the mean of the window's patches serves as the query, and a single attention step aggregates the window into one vector (a code sketch follows the 3x3 example below).
Images: 2x2 Pooling
Input Patches (4x4 example):
| | Col 1 | Col 2 | Col 3 | Col 4 |
|---|---|---|---|---|
| Row 1 | P₁ | P₂ | P₃ | P₄ |
| Row 2 | P₅ | P₆ | P₇ | P₈ |
| Row 3 | P₉ | P₁₀ | P₁₁ | P₁₂ |
| Row 4 | P₁₃ | P₁₄ | P₁₅ | P₁₆ |
After 2x2 Attention Pooling:
| | Col 1 | Col 2 |
|---|---|---|
| Row 1 | T₁ (P₁, P₂, P₅, P₆) | T₂ (P₃, P₄, P₇, P₈) |
| Row 2 | T₃ (P₉, P₁₀, P₁₃, P₁₄) | T₄ (P₁₁, P₁₂, P₁₅, P₁₆) |
Video Frames: 3x3 Pooling
Since videos have many frames, a 3x3 window is used to further reduce the token count.
Input Patches (9x9 example):
| | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 |
|---|---|---|---|---|---|---|---|---|---|
| R1 | P₁ | P₂ | P₃ | P₄ | P₅ | P₆ | P₇ | P₈ | P₉ |
| R2 | … | … | … | … | … | … | … | … | … |
| … | … | … | … | … | … | … | … | … | … |
After 3x3 Attention Pooling:
| | Col 1 | Col 2 | Col 3 |
|---|---|---|---|
| Row 1 | T₁ (9 patches) | T₂ (9 patches) | T₃ (9 patches) |
| … | … | … | … |
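A minimal sketch of this pooling, assuming non-overlapping windows as in the tables above. The module name, head count, and feature dimension are illustrative; Molmo2's actual implementation may differ.

```python
import torch
import torch.nn as nn

class WindowAttentionPool(nn.Module):
    """Pool each (window x window) block of patches into a single token.

    The mean of the window's patches acts as the query; the patches act as
    keys/values in a single multi-head cross-attention step.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, window: int) -> torch.Tensor:
        # x: [batch, H, W, dim]; H and W must be divisible by `window`
        b, h, w, d = x.shape
        gh, gw = h // window, w // window
        # Group patches into non-overlapping windows: [b*gh*gw, window*window, d]
        x = x.view(b, gh, window, gw, window, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b * gh * gw, window * window, d)
        query = x.mean(dim=1, keepdim=True)  # mean patch serves as the query
        pooled, _ = self.attn(query, x, x)   # [b*gh*gw, 1, d]
        return pooled.view(b, gh, gw, d)     # one token per window

# 2x2 pooling for image patches, 3x3 for video-frame patches
pool = WindowAttentionPool(dim=1024)
image_tokens = pool(torch.randn(1, 4, 4, 1024), window=2)  # -> [1, 2, 2, 1024]
video_tokens = pool(torch.randn(1, 9, 9, 1024), window=3)  # -> [1, 3, 3, 1024]
```

The same module serves both cases; only the window size changes, which is what lets images and video share the connector.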
Cropping Strategy
Image Cropping
Molmo2 employs a multi-crop strategy.
- One downscaled full crop + up to K overlapping tile crops
- During training: K = 8
- During inference: K = 24 (high-resolution processing)
Images too large to be covered by K tiles at native resolution are first downscaled. That is, from a single source image, one downscaled full crop and up to K overlapping tile crops \(C_1, C_2, \dots, C_K\) are generated, and all of them are fed to the ViT.
For multi-crop images, column tokens are included in the LLM input to convey the image's aspect ratio.
Column tokens are not included for single-crop images, which are always square.
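The sketch below illustrates the full-crop-plus-tiles structure described above. The tile size, overlap, and downscaling policy are assumptions for illustration; only the one-full-crop + up-to-K-overlapping-tiles structure comes from the text.

```python
import math
from PIL import Image

def make_crops(img: Image.Image, k: int, tile: int = 336, overlap: int = 56):
    """One downscaled full crop plus up to k overlapping tile crops.

    tile/overlap sizes are illustrative, not Molmo2's actual settings.
    """
    stride = tile - overlap

    def grid(px: int) -> int:
        # Number of tiles needed to cover px pixels with the given overlap.
        return math.ceil(max(px - overlap, 1) / stride)

    # Downscale until the image can be covered by at most k tiles.
    while grid(img.width) * grid(img.height) > k and min(img.size) > tile:
        scale = math.sqrt(k / (grid(img.width) * grid(img.height)))
        img = img.resize((max(int(img.width * scale), tile),
                          max(int(img.height * scale), tile)))

    full_crop = img.resize((tile, tile))  # downscaled view of the whole image
    # PIL pads crops that extend past the image boundary.
    tiles = [img.crop((c * stride, r * stride, c * stride + tile, r * stride + tile))
             for r in range(grid(img.height)) for c in range(grid(img.width))]
    return full_crop, tiles
```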
Video Cropping
For video, the following strategy is adopted to reduce computational cost:
- Sampling rate: S = 2 fps (one frame every 0.5 seconds)
- Each frame is processed as a single crop (downscaled as needed)
- Maximum frame count: F = 128 (standard training) or F = 384 (long-context training)
If the video length exceeds F/S seconds, F frames are uniformly sampled instead.
In either case, the last frame is always included: many video players display the final frame after playback ends, making it potentially significant to the user.
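A minimal sketch of this sampling rule; the function name and signature are illustrative.

```python
def sample_frame_indices(total_frames: int, video_fps: float,
                         sample_fps: float = 2.0,
                         max_frames: int = 128) -> list[int]:
    """Sample frames at `sample_fps`; if that exceeds `max_frames`,
    fall back to uniform sampling. The last frame is always kept."""
    step = video_fps / sample_fps  # source frames between samples
    indices = [min(round(i * step), total_frames - 1)
               for i in range(max(int(total_frames / step), 1))]
    if len(indices) > max_frames:
        # Video longer than max_frames / sample_fps seconds:
        # sample max_frames frames uniformly instead.
        indices = [round(j * (total_frames - 1) / (max_frames - 1))
                   for j in range(max_frames)]
    indices[-1] = total_frames - 1  # always include the last frame
    return sorted(set(indices))
```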
Bi-directional Attention
In Molmo2, the LLM is designed so that visual tokens can attend to one another (Shen et al. 2024; Gao et al. 2025; Team et al. 2025).
In a standard LLM, the causal mask restricts each token to attending only to tokens that precede it. Molmo2 instead allows bi-directional attention among visual tokens.
Standard Causal Attention:
| | T₁ | T₂ | T₃ | T₄ |
|---|---|---|---|---|
| T₁ | ● | × | × | × |
| T₂ | ● | ● | × | × |
| T₃ | ● | ● | ● | × |
| T₄ | ● | ● | ● | ● |
Bi-directional Attention on Vision Tokens:
| | V₁ | V₂ | V₃ | T₁ | T₂ |
|---|---|---|---|---|---|
| V₁ | ● | ● | ● | × | × |
| V₂ | ● | ● | ● | × | × |
| V₃ | ● | ● | ● | × | × |
| T₁ | ● | ● | ● | ● | × |
| T₂ | ● | ● | ● | ● | ● |
This allows visual tokens from different frames or different images to exchange information, enabling learning of spatiotemporal relationships.
Ablation studies confirmed that bi-directional attention on visual tokens improves performance.
It is particularly effective for tasks that require capturing relationships between multiple frames/images, such as video tracking and multi-image understanding.
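A sketch of how such a mask can be constructed: start from a causal mask and open it up across the visual-token positions. The function name and the boolean convention (True = may attend) are assumptions; Molmo2's actual implementation may differ.

```python
import torch

def build_attention_mask(seq_len: int,
                         vision_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Causal mask, made bi-directional across all visual-token positions.

    vision_spans: (start, end) index pairs (end exclusive) of visual tokens.
    Returns a [seq_len, seq_len] bool mask where True = may attend.
    """
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    is_vision = torch.zeros(seq_len, dtype=torch.bool)
    for start, end in vision_spans:
        is_vision[start:end] = True
    # Visual tokens may attend to every other visual token, including later
    # ones and those from other frames/images, on top of the causal pattern.
    mask |= is_vision[:, None] & is_vision[None, :]
    return mask
```

Calling `build_attention_mask(5, [(0, 3)])` reproduces the 5x5 pattern shown above: the three vision positions attend to each other in both directions, while the text positions remain causal.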
Input Format to the LLM
Visual tokens generated by the Vision-Language Connector are fed to the LLM in the following formats.
Video
<image_start> [Visual Tokens for Frame1] <timestamp>0.5s</timestamp>
<image_start> [Visual Tokens for Frame2] <timestamp>1.0s</timestamp>
...
[Subtitle text] <timestamp>0.5s-2.0s</timestamp>
- Timestamps are appended to each frame’s visual tokens
- If subtitles are available, they are added as timestamped text
Multi-Image
<image_start> [Visual Tokens for Image1] <image>1</image>
<image_start> [Visual Tokens for Image2] <image>2</image>
...
- An image index is appended to each image
Multi-Crop Images
<image_start> [Column Tokens] [Visual Tokens for Full Crop]
[Visual Tokens for Crop1] [Visual Tokens for Crop2] ...
- Column tokens convey aspect ratio
- Tokens from the full crop and partial crops are concatenated
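As a sketch of how these sequences could be assembled, the helpers below follow the special-token spellings shown above; the function names and the bracketed token placeholders are illustrative.

```python
def format_video_input(num_frames: int, sample_fps: float = 2.0) -> str:
    """Prefix each frame's visual tokens with <image_start> and a timestamp."""
    parts = []
    for i in range(num_frames):
        t = (i + 1) / sample_fps  # timestamp of the i-th sampled frame
        parts.append(f"<image_start> [frame {i + 1} tokens] "
                     f"<timestamp>{t:.1f}s</timestamp>")
    return "\n".join(parts)

def format_multi_image_input(num_images: int) -> str:
    """Append a 1-based index to each image's visual tokens."""
    return "\n".join(
        f"<image_start> [image {i} tokens] <image>{i}</image>"
        for i in range(1, num_images + 1))
```

With `sample_fps=2.0`, the generated timestamps (0.5s, 1.0s, ...) match the video example above.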
Summary
The Molmo2 Vision-Language Connector has the following characteristics:
- Multi-layer features: Extracts features from multiple ViT layers (3rd-to-last, 9th-from-last)
- Adaptive pooling: 2x2 Attention Pooling for images, 3x3 for video frames
- Shared parameters: Unified MLP projection for images and video
- Multi-crop strategy: Uses up to 24 crops for high-resolution processing
- Efficient video processing: 2 fps sampling + last frame retention
- Bi-directional attention: Allows mutual interaction among visual tokens (improves performance)
- Column tokens: Conveys aspect ratio information for multi-crop images
This design enables Molmo2 to process images and video uniformly while balancing computational efficiency and representational power.