Robot Learning

Why retrieval belongs in the robot learning stack

Jun 12, 2026 · 6 min read

Robots act in an open world. The set of tasks a home robot might be asked to do is effectively unbounded, so no matter how large we make an offline dataset, it will never hold a demonstration that matches the next task exactly. For a while the field's answer was to collect more teleoperation and then fine-tune. That works, and fine-tuning on in-domain data still gives large gains, but it scales linearly with human effort and it treats every new task as a fresh start.

There is a cheaper idea hiding in plain sight. A corpus that lacks an exact match almost always contains something close. If I want a robot to cover a cup, a diverse manipulation dataset may already hold "put the block on the square", which is the same underlying motion of placing one object onto another. The question is not whether the right behavior exists in the data. It is whether we can find it and adapt it.

Language models already made this bet

Language modeling settled a version of this debate years ago. When a fixed set of weights cannot hold all the knowledge a task needs, the productive move is not only to grow the model but to attach a non-parametric memory and retrieve from it. Retrieval-augmented generation showed that conditioning a generator on retrieved documents improves factual accuracy[1], and the nearest-neighbor language model showed that simply interpolating a trained model's next-token distribution with one built from neighbors in a datastore lowers perplexity with no extra training[2]. The lesson carries over. A vision-language-action policy maps an instruction and an observation to an action chunk,

$$ \pi_\theta\bigl(a_t,\dots,a_{t+H} \mid \ell,\, o_t\bigr), $$

and it has finite capacity facing an open world, which is exactly the regime where retrieval earns its keep.
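To make the nearest-neighbor language model idea concrete, here is a minimal sketch of the interpolation step in plain Python. It is not the authors' implementation: the toy vocabulary, the distances, the `temperature` parameter, and the mixing weight `lam` are all illustrative assumptions. The only part taken from the idea itself is the structure, where neighbors vote for their token with weight decaying in distance and the result is mixed with the model's own distribution.

```python
import math

def knn_distribution(neighbors, temperature=1.0):
    """Turn (distance, next_token) pairs into a distribution over tokens.

    Each retrieved neighbor votes for its token with weight
    exp(-distance / temperature). The temperature is an assumption
    added here for illustration.
    """
    weights = {}
    for distance, token in neighbors:
        weights[token] = weights.get(token, 0.0) + math.exp(-distance / temperature)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}

def interpolate(p_model, p_knn, lam=0.25):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_model."""
    tokens = set(p_model) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_model.get(t, 0.0)
        for t in tokens
    }

# Toy example: the model slightly prefers "cup", the datastore neighbors mostly say "block".
p_model = {"cup": 0.5, "block": 0.3, "plate": 0.2}
neighbors = [(0.1, "block"), (0.2, "block"), (0.9, "cup")]  # hypothetical (distance, token) pairs
p_mixed = interpolate(p_model, knn_distribution(neighbors), lam=0.25)
```

No weights change anywhere in this sketch. The datastore shifts probability toward "block" at inference time, which is the property that makes the same move attractive for a fixed policy.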

The neighbors really are in the data now

This only pays off if good neighbors exist, and at current scale they do. Open X-Embodiment pooled more than a million trajectories across 22 embodiments[3], and DROID added 76,000 in-the-wild manipulation episodes across hundreds of scenes[4]. At that size, "no exact match" and "no useful neighbor" are very different statements, and the second one is usually false.

Retrieval is already quietly working

Several lines of work already retrieve experience rather than try to memorize it. Behavior Retrieval queries a large unlabeled dataset with a handful of target demonstrations[5]. STRAP retrieves at the sub-trajectory level using vision foundation model features and dynamic time warping[6]. RICL gives a pretrained policy in-context adaptability, so it improves on new tasks by pulling relevant snippets into context with no gradient updates[7]. R+X retrieves clips from unlabeled human video and executes them through in-context imitation[8]. The shape is consistent: retrieve relevant experience, then condition the policy on it.
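Dynamic time warping is the piece of this list that is easiest to show in a few lines. The sketch below is the textbook full-sequence version over short lists of feature vectors, with Euclidean distance as the per-step cost. It is not STRAP's code, and the segment names and toy 2D features are made up for illustration. A real system would use learned visual features, and its matching procedure may differ from this one.

```python
def dtw_distance(query, candidate):
    """Classic dynamic time warping cost between two sequences of feature vectors."""
    def dist(u, v):
        # Euclidean distance between two feature vectors of equal length.
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

    n, m = len(query), len(candidate)
    inf = float("inf")
    # cost[i][j] = best alignment cost of query[:i] against candidate[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(query[i - 1], candidate[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def retrieve(query, segments, k=1):
    """Return the names of the k segments with the lowest DTW cost to the query."""
    ranked = sorted(segments, key=lambda name: dtw_distance(query, segments[name]))
    return ranked[:k]

# Toy features: the query moves along x; one segment does the same motion more slowly.
query = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
segments = {
    "same_motion_slower": [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    "different_motion": [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)],
}
best = retrieve(query, segments, k=1)
```

The slower segment aligns with zero cost even though it has more timesteps. That tolerance to speed differences is why time warping suits demonstrations recorded by different operators.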

Retrieval is not enough, you have to adapt

Here is the catch that separates robotics from text. A retrieved document can be read as is, but a retrieved demonstration was executed in a different scene with objects in different poses, so its raw trajectory rarely transfers. The neighbor is a behavioral prior, not an answer. This is where adaptation matters, and the tools are mature. Object-centric and scene-graph representations capture which objects matter and how they relate, as in ORION's open-world object graphs[9]. Dense, 3D-grounded correspondence from methods like MASt3R can align a retrieved scene to the current one[10]. And a policy can consume a coarse geometric sketch instead of a perfect demonstration, which is what visual trace conditioning showed with RT-Trajectory[11] and TraceVLA[12].

The backbones are ready to be augmented

Modern policies are well suited to this. The models in the line from RT-1[13] and RT-2[14] through OpenVLA[15], the flow-based models from Physical Intelligence[16][17], and GR00T N1[18] all inherit internet-scale priors. The same semantic grounding those models stand on, built from features like DINOv2[19], CLIP[20], and SigLIP[21], is exactly what lets a retriever find the right neighbor in the first place.

What RA-VLA does

My recent work, RA-VLA, makes this concrete with a retrieve-then-warp recipe over DROID. It encodes the instruction and a scene graph of the current observation, retrieves semantically matching demonstrations, then uses a multimodal language model to warp the retrieved trajectory to keypoint correspondences in the live scene. The warped trajectory is overlaid on the robot's camera views as a scene-grounded reference, and we fine-tune a pi-0.5 backbone to act on these overlays. Candidates are ranked by a blend of language similarity and scene-graph structure,

$$ d_{\text{final}}(d) = \alpha\,\bigl(1 - \cos(\mathbf{q}_\ell, \mathbf{e}^d_\ell)\bigr) + (1-\alpha)\,\bigl(1 - s_{\text{sg}}(d)\bigr), $$

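where the first term is the cosine distance between the query instruction embedding and the candidate demonstration's instruction embedding, the second term is one minus the structural similarity between the two scene graphs, and $\alpha$ sets the balance between them. Both terms are distances, so a lower score means a better match. Across the RoboLab-120 simulator and real Franka tasks, conditioning on adapted neighbors improves success over a strong fine-tuned baseline with no new demonstrations collected. The retriever builds on our work on structured perception in ESCA[22].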
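The ranking score is simple enough to sketch directly. The snippet below mirrors the blended distance above in plain Python, but it is a toy under stated assumptions and not the RA-VLA code. The two-dimensional embeddings and the demonstration records are hypothetical, and the scene-graph similarity is a stand-in (Jaccard overlap of relation triples), since the actual definition of the structural term is not given here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def scene_graph_similarity(query_edges, demo_edges):
    """Hypothetical stand-in for s_sg: Jaccard overlap of (subject, relation, object) triples."""
    union = query_edges | demo_edges
    return len(query_edges & demo_edges) / len(union) if union else 0.0

def d_final(query_emb, query_edges, demo, alpha=0.5):
    """Blend of language distance and scene-graph distance. Lower is better."""
    lang_term = 1.0 - cosine(query_emb, demo["emb"])
    graph_term = 1.0 - scene_graph_similarity(query_edges, demo["edges"])
    return alpha * lang_term + (1.0 - alpha) * graph_term

# Toy query and corpus; every value here is made up for illustration.
query_emb = [1.0, 0.0]
query_edges = {("cup", "on", "table"), ("lid", "on", "table")}
demos = [
    {"name": "open the drawer", "emb": [0.1, 0.9],
     "edges": {("drawer", "in", "cabinet")}},
    {"name": "put the lid on the pot", "emb": [0.9, 0.1],
     "edges": {("pot", "on", "table"), ("lid", "on", "table")}},
]
ranked = sorted(demos, key=lambda d: d_final(query_emb, query_edges, d, alpha=0.5))
ranked_names = [d["name"] for d in ranked]
```

Setting `alpha` closer to 1 trusts the instruction match more, and closer to 0 trusts scene structure more. The value 0.5 is an arbitrary choice for the toy example.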

The thesis. Teleoperation scales with human hours. Retrieval scales with data you already have, and it improves every time the corpus grows. Retrieval belongs in the robot learning stack for the same reason it belongs in language modeling. It is the cheapest way to give a fixed-capacity policy access to an open-ended world.

References

  1. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Patrick Lewis et al., NeurIPS 2020. arXiv:2005.11401
  2. Generalization through Memorization: Nearest Neighbor Language Models. Urvashi Khandelwal et al., ICLR 2020. arXiv:1911.00172
  3. Open X-Embodiment: Robotic Learning Datasets and RT-X Models. Open X-Embodiment Collaboration, ICRA 2024. arXiv:2310.08864
  4. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. Alexander Khazatsky et al., RSS 2024. arXiv:2403.12945
  5. Behavior Retrieval: Few-Shot Imitation Learning by Querying Unlabeled Datasets. Maximilian Du et al., RSS 2023. arXiv:2304.08742
  6. STRAP: Robot Sub-Trajectory Retrieval for Augmented Policy Learning. Marius Memmel et al., ICLR 2025. arXiv:2412.15182
  7. RICL: Adding In-Context Adaptability to Pre-Trained Vision-Language-Action Models. Kaustubh Sridhar et al., CoRL 2025. arXiv:2508.02062
  8. R+X: Retrieval and Execution from Everyday Human Videos. Georgios Papagiannis et al., 2024. arXiv:2407.12957
  9. Vision-based Manipulation from Single Human Video with Open-World Object Graphs (ORION). Yifeng Zhu et al., 2024. arXiv:2405.20321
  10. Grounding Image Matching in 3D with MASt3R. Vincent Leroy et al., ECCV 2024. arXiv:2406.09756
  11. RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches. Jiayuan Gu et al., ICLR 2024. arXiv:2311.01977
  12. TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies. Ruijie Zheng et al., ICLR 2025. arXiv:2412.10345
  13. RT-1: Robotics Transformer for Real-World Control at Scale. Anthony Brohan et al., RSS 2023. arXiv:2212.06817
  14. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. Anthony Brohan et al., CoRL 2023. arXiv:2307.15818
  15. OpenVLA: An Open-Source Vision-Language-Action Model. Moo Jin Kim et al., CoRL 2024. arXiv:2406.09246
  16. pi-0: A Vision-Language-Action Flow Model for General Robot Control. Kevin Black et al., Physical Intelligence, 2024. arXiv:2410.24164
  17. pi-0.5: a Vision-Language-Action Model with Open-World Generalization. Physical Intelligence, 2025. arXiv:2504.16054
  18. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. NVIDIA, 2025. arXiv:2503.14734
  19. DINOv2: Learning Robust Visual Features without Supervision. Maxime Oquab et al., TMLR 2024. arXiv:2304.07193
  20. Learning Transferable Visual Models From Natural Language Supervision. Alec Radford et al., ICML 2021. arXiv:2103.00020
  21. Sigmoid Loss for Language Image Pre-Training (SigLIP). Xiaohua Zhai et al., ICCV 2023. arXiv:2303.15343
  22. ESCA: Contextualizing Embodied Agents via Scene-Graph Generation. Jiani Huang, Amish Sethi et al., NeurIPS 2025 Spotlight. arXiv:2510.15963
