Vision-Language Connector

Overview

The Vision-Language Connector is a critical module that transforms visual features extracted by the Vision Transformer (ViT) into a format that the Large Language Model (LLM) can process. Molmo2 follows the standard VLM architecture (Deitke et al. 2024) and adopts a design that can uniformly process both images and video. For the end-to-end pipeline that the Connector sits inside, see the Molmo2 architecture figure in the overview chapter.

Architecture Details

Multi-Layer Feature Usage

The Molmo2 Vision-Language Connector extracts features from multiple layers of the ViT, rather than a single layer.

  • Third-from-last layer: High-level semantic features
  • Ninth-from-last layer: Mid-level features

This design follows the prior work Molmo (Deitke et al. 2024), combining visual information at different levels of abstraction to achieve richer representations.
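As an illustrative sketch, the layer selection can be written as follows. The layer choice comes from the text; combining the two layers by concatenation along the feature dimension is an assumption, since the text only states that both layers are used:

```python
import numpy as np

def select_vit_features(hidden_states: list) -> np.ndarray:
    """Combine the third-from-last and ninth-from-last ViT layer outputs.

    hidden_states: per-layer patch features, each of shape (num_patches, d).
    Returns (num_patches, 2 * d); concatenation along the feature
    dimension is an assumed combination method.
    """
    high = hidden_states[-3]  # high-level semantic features
    mid = hidden_states[-9]   # mid-level features
    return np.concatenate([high, mid], axis=-1)
```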

Attention Pooling

Attention Pooling is used to reduce the number of patch-level tokens. Within each pooling window, the mean of the window's patches serves as the attention query, and the window is aggregated into a single vector.
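A minimal sketch of this mean-query pooling, with the learned query/key/value projections and multiple heads omitted for brevity (a simplification of real attention pooling):

```python
import numpy as np

def attention_pool_window(patches: np.ndarray) -> np.ndarray:
    """Aggregate one pooling window of patch features into a single vector.

    The mean of the window's patches serves as the attention query;
    the patches themselves act as keys and values.
    patches: (n, d) -> (d,)
    """
    d = patches.shape[-1]
    query = patches.mean(axis=0)           # mean of the window = query
    scores = patches @ query / np.sqrt(d)  # scaled dot-product scores, (n,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over the window
    return weights @ patches               # attention-weighted average
```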

Images: 2x2 Pooling

Input Patches (4x4 example):

Table 1: 4x4 input patch layout
Col 1 Col 2 Col 3 Col 4
Row 1 P₁ P₂ P₃ P₄
Row 2 P₅ P₆ P₇ P₈
Row 3 P₉ P₁₀ P₁₁ P₁₂
Row 4 P₁₃ P₁₄ P₁₅ P₁₆

After 2x2 Attention Pooling:

Table 2: Token layout after 2x2 Attention Pooling (16 → 4 tokens, 1/4 reduction)
Col 1 Col 2
Row 1 T₁ (P₁, P₂, P₅, P₆) T₂ (P₃, P₄, P₇, P₈)
Row 2 T₃ (P₉, P₁₀, P₁₃, P₁₄) T₄ (P₁₁, P₁₂, P₁₅, P₁₆)

Video Frames: 3x3 Pooling

Since videos have many frames, a 3x3 window is used to further reduce the token count.

Input Patches (9x9 example):

Table 3: 9x9 input patch layout (abbreviated)
C1 C2 C3 … C9
R1 P₁ P₂ P₃ … P₉
R2 P₁₀ P₁₁ P₁₂ … P₁₈
⋮
R9 P₇₃ P₇₄ P₇₅ … P₈₁

After 3x3 Attention Pooling:

Table 4: Token layout after 3x3 Attention Pooling (81 → 9 tokens, 1/9 reduction)
Col 1 Col 2 Col 3
Row 1 T₁ (9 patches) T₂ (9 patches) T₃ (9 patches)
Row 2 T₄ (9 patches) T₅ (9 patches) T₆ (9 patches)
Row 3 T₇ (9 patches) T₈ (9 patches) T₉ (9 patches)
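The windowing behind both pooling modes can be sketched as follows; mean pooling stands in for the attention step here so the grid logic stays visible (a simplification):

```python
import numpy as np

def pool_grid(patches: np.ndarray, window: int) -> np.ndarray:
    """Reduce an (H, W, d) patch grid by collapsing each window x window
    block into one token: window=2 for image crops, window=3 for video
    frames."""
    H, W, d = patches.shape
    assert H % window == 0 and W % window == 0
    blocks = patches.reshape(H // window, window, W // window, window, d)
    return blocks.mean(axis=(1, 3))  # (H // window, W // window, d)
```

A 4x4 grid becomes 2x2 tokens (1/4 reduction) and a 9x9 grid becomes 3x3 tokens (1/9 reduction), matching the tables above.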

Shared MLP Projection

Finally, the pooled features are projected by a Shared MLP. This MLP shares parameters between images and video frames, learning a unified visual representation.

The overall data flow from raw ViT features to LLM input tokens – multi-layer feature concatenation, image- or video-specific attention pooling, and the shared MLP projection – can be seen on the left half of the architecture figure.
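The parameter sharing can be illustrated as below; the layer sizes and GELU activation are assumptions, and only the fact that one MLP serves both modalities comes from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: pooled feature dim, hidden dim, LLM embedding dim.
D_IN, D_HID, D_OUT = 32, 64, 48
W1 = rng.normal(size=(D_IN, D_HID)) * 0.02
W2 = rng.normal(size=(D_HID, D_OUT)) * 0.02

def shared_mlp(tokens: np.ndarray) -> np.ndarray:
    """Two-layer MLP projecting pooled visual tokens into the LLM space.
    Uses the tanh approximation of GELU; the activation is an assumption."""
    h = tokens @ W1
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ W2

# The very same weights process image tokens and video-frame tokens.
image_tokens = rng.normal(size=(4, D_IN))  # e.g. one crop after 2x2 pooling
video_tokens = rng.normal(size=(9, D_IN))  # e.g. one frame after 3x3 pooling
out_img = shared_mlp(image_tokens)
out_vid = shared_mlp(video_tokens)
```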

Cropping Strategy

Image Cropping

Molmo2 employs a multi-crop strategy.

  • One downscaled full crop + up to K overlapping tile crops
  • During training: K = 8
  • During inference: K = 24 (high-resolution processing)

Images that cannot be covered by K tile crops at their original resolution are downscaled first. That is, from a single source image, one downscaled full crop and up to K overlapping tile crops \(C_1, C_2, \dots, C_K\) are generated, and all of them are fed to the ViT.
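One way to generate the overlapping tile boxes is sketched below. Only the structure (one full crop plus up to K overlapping tiles, i.e. rows * cols <= K) comes from the text; the grid shape and the 10% overlap fraction are illustrative assumptions:

```python
def tile_boxes(width: float, height: float, rows: int, cols: int,
               overlap: float = 0.1):
    """Return (left, top, right, bottom) boxes for a rows x cols grid of
    overlapping tiles that exactly covers the image. Neighbouring tiles
    share an `overlap` fraction of their extent."""
    tile_w = width / (cols - (cols - 1) * overlap)
    tile_h = height / (rows - (rows - 1) * overlap)
    stride_x = tile_w * (1 - overlap)
    stride_y = tile_h * (1 - overlap)
    return [(c * stride_x, r * stride_y,
             c * stride_x + tile_w, r * stride_y + tile_h)
            for r in range(rows) for c in range(cols)]
```

For example, `tile_boxes(1024, 768, 2, 4)` produces 8 tiles, within the training budget K = 8; the downscaled full crop is generated separately.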

Note: Column Tokens

For multi-crop images, column tokens are included in the input to the LLM. This conveys aspect ratio information of the image to the LLM.

Column tokens are not included for single-crop images (which are always square).

Video Cropping

For video, the following strategy is adopted to reduce computational cost:

  • Sampling rate: S = 2 fps (one frame every 0.5 seconds)
  • Each frame is processed as a single crop (downscaled as needed)
  • Maximum frame count: F = 128 (standard training) or F = 384 (long-context training)

If the video length exceeds F/S seconds, F frames are uniformly sampled, and the last frame is always included.
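The frame-selection rule can be sketched like this; the exact endpoint handling and rounding are assumptions:

```python
import math

def sample_frame_times(duration: float, s_fps: float = 2.0,
                       max_frames: int = 128) -> list:
    """Timestamps (seconds) of the frames fed to the model.

    Sample at s_fps; once the video is longer than max_frames / s_fps
    seconds, fall back to max_frames uniformly spaced timestamps. The
    final timestamp is pinned to the video's end so the last frame is
    always included.
    """
    n = min(max_frames, max(2, math.floor(duration * s_fps)))
    step = duration / (n - 1)
    times = [i * step for i in range(n)]
    times[-1] = duration  # always keep the last frame
    return times
```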

Tip: Special Handling of the Last Frame

The last frame of a video is always included. This is because many video players display the last frame after playback ends, making it potentially significant to the user.

Bi-directional Attention

In Molmo2, the LLM's attention mask is modified so that visual tokens can attend to one another bidirectionally when visual input is processed (Shen et al. 2024; Gao et al. 2025; Team et al. 2025).

In a standard LLM, each token can only attend to tokens preceding it due to the causal mask. However, Molmo2 allows bi-directional attention for visual tokens.

Standard Causal Attention:

Table 5: Standard causal attention mask (rows attend to columns; ✓ = allowed, × = masked)
T₁ T₂ T₃ T₄
T₁ ✓ × × ×
T₂ ✓ ✓ × ×
T₃ ✓ ✓ ✓ ×
T₄ ✓ ✓ ✓ ✓

Bi-directional Attention on Vision Tokens:

Table 6: Bi-directional attention mask on vision tokens (V: vision tokens, T: text tokens; ✓ = allowed, × = masked)
V₁ V₂ V₃ T₁ T₂
V₁ ✓ ✓ ✓ × ×
V₂ ✓ ✓ ✓ × ×
V₃ ✓ ✓ ✓ × ×
T₁ ✓ ✓ ✓ ✓ ×
T₂ ✓ ✓ ✓ ✓ ✓

This allows visual tokens from different frames or different images to exchange information, enabling learning of spatiotemporal relationships.
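Under the token layout of Table 6 (vision tokens preceding text), such a mask can be built as a causal baseline plus a bidirectional vision block. This is a sketch of the masking idea, not the model's actual implementation:

```python
import numpy as np

def build_mask(is_vision: list) -> np.ndarray:
    """Attention mask (True = may attend). Text tokens remain causal;
    vision tokens may additionally attend to every other vision token,
    including ones at later positions."""
    n = len(is_vision)
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal baseline
    v = np.asarray(is_vision, dtype=bool)
    mask |= np.outer(v, v)                       # vision <-> vision, both ways
    return mask
```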

Important: Effect of Bi-directional Attention

Ablation studies confirmed that bi-directional attention on visual tokens improves performance.

It is particularly effective for tasks that require capturing relationships between multiple frames/images, such as video tracking and multi-image understanding.

Input Format to the LLM

Visual tokens generated by the Vision-Language Connector are fed to the LLM in the following formats.

Video

<image_start> [Visual Tokens for Frame1] <timestamp>0.5s</timestamp>
<image_start> [Visual Tokens for Frame2] <timestamp>1.0s</timestamp>
...
[Subtitle text] <timestamp>0.5s-2.0s</timestamp>
  • Timestamps are appended to each frame’s visual tokens
  • If subtitles are available, they are added as timestamped text
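Assembling that sequence as a string might look as follows; the helper name and the placeholder token strings are hypothetical, while the <image_start> and <timestamp> markers come from the format above:

```python
def format_video_input(frame_tokens, times, subtitles=()):
    """Build the LLM input text for a video.

    frame_tokens: one placeholder string of visual tokens per frame.
    times: timestamp (seconds) of each frame.
    subtitles: optional (text, start_s, end_s) triples.
    """
    parts = [f"<image_start> {tok} <timestamp>{t:.1f}s</timestamp>"
             for tok, t in zip(frame_tokens, times)]
    for text, t0, t1 in subtitles:
        parts.append(f"{text} <timestamp>{t0:.1f}s-{t1:.1f}s</timestamp>")
    return "\n".join(parts)
```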

Multi-Image

<image_start> [Visual Tokens for Image1] <image>1</image>
<image_start> [Visual Tokens for Image2] <image>2</image>
...
  • An image index is appended to each image

Multi-Crop Images

<image_start> [Column Tokens] [Visual Tokens for Full Crop]
[Visual Tokens for Crop1] [Visual Tokens for Crop2] ...
  • Column tokens convey aspect ratio
  • Tokens from the full crop and partial crops are concatenated

Summary

The Molmo2 Vision-Language Connector has the following characteristics:

  1. Multi-layer features: Extracts features from multiple ViT layers (3rd-from-last, 9th-from-last)
  2. Adaptive pooling: 2x2 Attention Pooling for images, 3x3 for video frames
  3. Shared parameters: Unified MLP projection for images and video
  4. Multi-crop strategy: Uses up to 24 crops for high-resolution processing
  5. Efficient video processing: 2 fps sampling + last frame retention
  6. Bi-directional attention: Allows mutual interaction among visual tokens (improves performance)
  7. Column tokens: Conveys aspect ratio information for multi-crop images

This design enables Molmo2 to process images and video uniformly while balancing computational efficiency and representational power.

References

Deitke, Matt, Christopher Clark, Sangho Lee, et al. 2024. “Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models.” arXiv Preprint arXiv:2409.17146. https://arxiv.org/abs/2409.17146.
Gao, Mingze, Jingyu Liu, Mingda Li, et al. 2025. “TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations.” Proceedings of the AAAI Conference on Artificial Intelligence. https://arxiv.org/abs/2409.03206.
Shen, Xiaoqian, Yunyang Xiong, Changsheng Zhao, et al. 2024. “LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding.” arXiv Preprint arXiv:2410.17434. https://arxiv.org/abs/2410.17434.
Team, Gemma, Aishwarya Kamath, Johan Ferret, et al. 2025. “Gemma 3 Technical Report.” arXiv Preprint arXiv:2503.19786. https://arxiv.org/abs/2503.19786.