Two Bridges, One Pathway

From VLMs to Generalizable VLAs with Embodied Trajectory-Coupled Data

Linqi Yin^1,2,* Shiduo Zhang^1,2,*,‡ Shenling Qiu^1,* Chenxin Li¹ Zhaoyang Fu¹ Lei Xiao²
Xiang Wang² Chenchen Yang^1,2 Zhe Xu^1,2 Pengfang Qian^1,2
Jingjing Gong² Xipeng Qiu^1,2 Xuanjing Huang¹ Yu-Gang Jiang^1,†

¹Fudan University ²Shanghai Innovation Institute

^*Equal Contribution ^‡Project Lead ^†Corresponding Author

Abstract

Vision-language models (VLMs) are powerful general-purpose reasoners, yet converting them into robot control policies (VLAs) is surprisingly difficult. The root cause is a twofold gap: VLMs are trained on internet-scale images with language-understanding objectives, while VLAs must perceive robot scenes and predict motor actions. Fine-tuning a VLM directly on robot action data forces the model to cross both gaps at once: the learning curve is steep, and the rich generalizations learned during pretraining tend to degrade rather than transfer.

We argue that this gap can be bridged gradually with the right intermediate data. We introduce embodied trajectory-coupled (ETC) data: vision-language supervision derived from the same robot scenes and trajectories used for action learning. Because ETC data shares the visual context of robot operation while retaining familiar language-understanding objectives, it provides a natural stepping stone between VLM pretraining and VLA fine-tuning.

Building on this, we design a three-stage training recipe. Distribution Bridging first adapts the VLM to embodied visual-language semantics. Objective Bridging then gradually shifts the model toward action prediction while preserving the acquired representations. Retentive Adaptation finally specializes the policy to the target deployment domain. We further show that mixing task-relevant out-of-distribution ETC data with a small amount of action data enables the model to generalize to novel visual-language conditions without requiring additional robot demonstrations. Simulation and real-robot experiments confirm that this gradual bridging strategy is the key to transferring VLM generalization into robust, deployable robot policies.

Method

Method Overview

ETC first moves the VLM onto the embodied visual-language manifold, then preserves this alignment while action prediction is introduced, and finally carries the retained embodied competence into deployment adaptation.

Dual Mismatch: Why VLMs Do Not Directly Become VLAs

VLM-to-VLA adaptation must cross two mismatches at once: the distribution gap between general web-scale vision-language data and embodied robot scenes, and the objective gap between language understanding and executable action prediction. Action-only training jumps directly across both gaps, while generic multimodal co-training remains distant from robot manipulation. ETC provides the missing intermediate supervision by sharing embodied scene semantics with robot action data while retaining the familiar VLM objective.

Figure 1(d): ETC as intermediate supervision between general VLM data and VLA data.

What is ETC?

Embodied Trajectory-Coupled data is vision-language supervision constructed from the same robot scenes and trajectories as action data. It is not action supervision, but it stays manipulation-relevant by sharing embodied scene and trajectory context. ETC forms a graded supervision spectrum: scene-grounding for task-agnostic perception, task-oriented ETC for affordance and planning, and action-aligned planning ETC for motion-coupled future gripper trajectories.
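As a rough illustration of the graded spectrum described above, the sketch below models an ETC sample as a question-answer record tied to a robot trajectory. The class and field names (`ETCCategory`, `ETCSample`, `trajectory_id`, and so on) and the example questions and answers are invented for illustration; the page does not specify the actual data schema.

```python
from dataclasses import dataclass
from enum import Enum


class ETCCategory(Enum):
    # The three categories named in the text, ordered from perception to motion.
    SCENE_GROUNDING = "scene_grounding"
    TASK_ORIENTED = "task_oriented"
    ACTION_ALIGNED_PLANNING = "action_aligned_planning"


@dataclass(frozen=True)
class ETCSample:
    # All field names are hypothetical, chosen for illustration only.
    trajectory_id: str      # robot trajectory the sample was derived from
    frame_index: int        # frame within that trajectory
    category: ETCCategory
    question: str           # language prompt
    answer: str             # text target, trained with next-token prediction


def group_by_category(samples):
    """Bucket samples by ETC category, keeping every category as a key."""
    buckets = {category: [] for category in ETCCategory}
    for sample in samples:
        buckets[sample.category].append(sample)
    return buckets


# Made-up samples: all three are coupled to the same trajectory.
samples = [
    ETCSample("traj_0001", 0, ETCCategory.SCENE_GROUNDING,
              "Where is the red mug?", "On the left side of the table."),
    ETCSample("traj_0001", 0, ETCCategory.TASK_ORIENTED,
              "Which object should be grasped first?", "The red mug."),
    ETCSample("traj_0001", 12, ETCCategory.ACTION_ALIGNED_PLANNING,
              "Where will the gripper move next?", "(0.42, 0.31), (0.45, 0.28)"),
]
buckets = group_by_category(samples)
```

Every sample keeps a text answer rather than an action vector, which is what lets ETC reuse the VLM objective while staying anchored to robot trajectories.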

Figure 1(b): ETC categories progress from scene grounding to task reasoning and motion-coupled planning.

Two Bridges: Distribution + Objective Bridging

Because ETC is objective-aligned with VLMs and distribution-aligned with robot data, it supports two explicit bridges. Distribution Bridging adapts the VLM to robot scenes before action learning while keeping the next-token objective unchanged. Objective Bridging then introduces action-token prediction on top of an already embodied representation, with continued ETC co-training preserving embodied alignment during action learning.

Distribution Bridging

ETC-only training moves the VLM from general visual-language data to embodied robot scenes.

Objective Bridging

Action + ETC co-training introduces action prediction while anchoring the representation to embodied scenes.
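One way to picture the difference between the two bridges is as a change in batch composition. The sketch below is a minimal illustration, not the authors' data loader: the function name and the `etc_fraction` knob are assumptions, and the page does not state what mixing ratio is used.

```python
import random


def sample_cotraining_batch(action_pool, etc_pool, batch_size, etc_fraction, rng):
    """Draw one mixed batch of (kind, sample) pairs.

    etc_fraction is a hypothetical mixing knob:
      1.0 -> ETC-only batches (Distribution Bridging)
      between 0 and 1 -> action + ETC co-training (Objective Bridging)
      0.0 -> action-only training (the baseline the text argues against)
    """
    if not 0.0 <= etc_fraction <= 1.0:
        raise ValueError("etc_fraction must lie in [0, 1]")
    n_etc = round(batch_size * etc_fraction)
    n_action = batch_size - n_etc
    # Sample with replacement from each pool, then interleave.
    batch = [("etc", rng.choice(etc_pool)) for _ in range(n_etc)]
    batch += [("action", rng.choice(action_pool)) for _ in range(n_action)]
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
action_pool = ["action_sample_a", "action_sample_b"]          # placeholder items
etc_pool = ["etc_sample_a", "etc_sample_b", "etc_sample_c"]   # placeholder items

stage1_batch = sample_cotraining_batch([], etc_pool, 4, 1.0, rng)
stage2_batch = sample_cotraining_batch(action_pool, etc_pool, 8, 0.5, rng)
```

In this sketch the ETC items would be trained with the usual next-token loss and the action items with action-token prediction, so both kinds can share one batch.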

One Pathway toward Generalizable VLAs

These two bridges form one continuous pathway from VLM initialization to deployable VLA policies. Retentive Adaptation continues pairing action data with target-scene ETC, preserving reusable embodied competence while expanding the policy capability boundary. The pathway does not force a VLM to become a robot policy in one jump; ETC guides the model through embodied alignment, action grounding, and deployment adaptation.
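The pathway can be summarized as a small stage schedule. The following is a configuration sketch under stated assumptions: the `Stage` class, its fields, and the `etc_source` labels are hypothetical names, and only the stage names and the pairing of action data with ETC come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    # Field names are illustrative, not taken from a released codebase.
    name: str
    uses_action_data: bool
    etc_source: str  # which ETC data accompanies the stage


PATHWAY = (
    Stage("Distribution Bridging", uses_action_data=False, etc_source="robot-scene ETC"),
    Stage("Objective Bridging", uses_action_data=True, etc_source="robot-scene ETC"),
    Stage("Retentive Adaptation", uses_action_data=True, etc_source="target-scene ETC"),
)


def objectives(stage):
    """List the training objectives active in a stage.

    ETC supervision stays on in every stage; action-token prediction is
    added only once the stage includes action data.
    """
    active = ["next-token prediction on " + stage.etc_source]
    if stage.uses_action_data:
        active.append("action-token prediction")
    return active


for stage in PATHWAY:
    print(stage.name, "->", objectives(stage))
```

The point of the schedule is that ETC supervision is never switched off: each stage adds or retargets an objective instead of replacing the previous one.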

Figure 1(c): ETC gives a smoother optimization route than direct action-only adaptation.

Experiments

Simulation and real-robot results

We evaluate the three-stage ETC pathway on LIBERO, SimplerEnv, VLABench, and real WidowX tasks. The experiments isolate how ETC helps at each adaptation stage: initialization, action learning, and downstream deployment.

Distribution Bridging strengthens the VLM initialization

ETC-based Distribution Bridging injects embodied task structure into the VLM representation, improves held-out embodied understanding, and gives stronger downstream action-learning results than raw PaliGemma. General VLM pretraining is still insufficient: the best initialization requires adapting the visual-language backbone to embodied manipulation scenes, including the visual tower.

Figure 2: Distribution Bridging aligns the VLM to the embodied manifold.

Table 1: Stage 2 supervision strategies across Distribution Bridging settings.

Objective Bridging preserves embodied alignment under action learning

Action-only Objective Bridging causes the backbone to drift from the embodied representation learned in Stage 1, while ETC co-training keeps the representation closer and improves policy success, especially on VLABench. The ETC categories are complementary: action-aligned planning gives the strongest execution signal, task-oriented ETC improves intention understanding, scene-grounding ETC provides broad perceptual support, and full ETC gives the strongest overall in-domain result.

Figure 3(a): CKA to the Distribution-Bridged VLM under action-only training and ETC co-training.

Figure 3(b): VLABench performance under ETC category ablations.
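For readers unfamiliar with CKA (centered kernel alignment), the sketch below computes the linear variant between two sets of features extracted on the same inputs, for example the Stage 1 backbone and a Stage 2 backbone. Linear CKA is an assumption here: the page does not say which CKA variant, layers, or inputs are used, and the feature values below are made up.

```python
import math


def center_columns(matrix):
    """Subtract each column's mean so every feature is zero-centered."""
    n = len(matrix)
    means = [sum(column) / n for column in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b, for a (n x p) and b (n x q) as row lists."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two feature matrices with one row per example.

    Returns a value in [0, 1]; 1 means the representations match up to
    rotation and isotropic scaling. Assumes neither matrix is constant
    across examples (otherwise the denominator is zero).
    """
    if len(x) != len(y):
        raise ValueError("x and y need the same number of rows (examples)")
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return numerator / denominator


# Made-up features for four inputs (rows) and two feature dimensions (columns).
reference = [[1.0, 2.0], [2.0, 0.5], [3.0, 4.0], [4.0, 1.5]]
rescaled = [[2.0 * value for value in row] for row in reference]
similarity = linear_cka(reference, rescaled)
```

A higher CKA to the Distribution-Bridged VLM indicates that the backbone has drifted less from the Stage 1 representation during action learning.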

Retentive Adaptation expands capability without knowledge erosion

Downstream adaptation extends capability without eroding previously learned embodied competence, but only when ETC is retained throughout the pathway. In the real-robot results, continuing ETC through Stage 2 and Stage 3 performs best, while adding ETC only at Stage 3 is less effective. For unseen visual-language conditions, OOD ETC further improves generalization by redirecting attention from distractors to the requested target, without requiring matching OOD action demonstrations.

Figure 4(a): Three-stage ETC adaptation ablation on WidowX.

Figure 4(b): Attention heatmaps under OOD targets.

Video Presentation