Why retrieval belongs in the robot learning stack

Robots act in an open world. The set of tasks a home robot might be asked to do is effectively unbounded, so no matter how large we make an offline dataset, it will never hold a demonstration that matches the next task exactly. For a while the field's answer was to collect more teleoperation data and then fine-tune. That works, and fine-tuning on in-domain data still gives large gains, but its cost scales linearly with human effort and it treats every new task as a fresh start.

There is a cheaper idea hiding in plain sight. A corpus that lacks an exact match almost always contains something close. If I want a robot to cover a cup, a diverse manipulation dataset may already hold "put the block on the square", which is the same underlying motion of placing one object onto another. The question is not whether the right behavior exists in the data. It is whether we can find it and adapt it.

Language models already made this bet

Language modeling settled a version of this debate years ago. When a fixed set of weights cannot hold all the knowledge a task needs, the productive move is not only to grow the model but to attach a non-parametric memory and retrieve from it. Retrieval-augmented generation showed that conditioning a generator on retrieved documents improves factual accuracy^[1], and the nearest-neighbor language model showed that simply interpolating a trained model with neighbors from a datastore lowers perplexity with no extra training^[2]. The lesson carries over. A vision-language-action policy maps an instruction and an observation to an action chunk,

$$ \pi_\theta\bigl(a_t,\dots,a_{t+H} \mid \ell,\, o_t\bigr), $$

and it has finite capacity facing an open world, which is exactly the regime where retrieval earns its keep.
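
For reference, the interpolation used by the nearest-neighbor language model is commonly written as below. The symbols $\lambda$, $p_{\text{LM}}$, and $p_{k\text{NN}}$ are notation assumed here for illustration; they are not defined elsewhere in this post.

$$ p(y \mid x) = \lambda\, p_{k\text{NN}}(y \mid x) + (1-\lambda)\, p_{\text{LM}}(y \mid x), \qquad \lambda \in [0, 1]. $$

The trained model $p_{\text{LM}}$ is left untouched, and only the datastore behind $p_{k\text{NN}}$ changes as the corpus grows. That is the property a retrieval-augmented policy would hope to borrow.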

The neighbors really are in the data now

This only pays off if good neighbors exist, and at current scale they do. Open X-Embodiment pooled more than a million trajectories across twenty-two embodiments^[3], and DROID added seventy-six thousand in-the-wild manipulation episodes across hundreds of scenes^[4]. At that size, "no exact match" and "no useful neighbor" are very different statements, and the second one is usually false.

Retrieval is already quietly working

Several lines of work already retrieve experience rather than try to memorize it. Behavior Retrieval queries a large unlabeled dataset with a handful of target demonstrations^[5]. STRAP retrieves at the sub-trajectory level using vision foundation features and dynamic time warping^[6]. RICL gives a pretrained policy in-context adaptability, so it improves on new tasks by pulling relevant snippets into context with no gradient updates^[7]. R+X retrieves clips from unlabeled human video and executes them through in-context imitation^[8]. The shape is consistent. Retrieve relevant experience, then condition the policy on it.
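
To make the retrieval step concrete, here is a minimal sketch of ranking stored segments by dynamic time warping cost against a short query. It is a toy illustration, not the procedure of any of the systems cited above: the function names `dtw_cost` and `retrieve`, the 2D feature vectors, and the segment names are all hypothetical, and real systems would operate on learned features rather than hand-written points.

```python
import math


def dtw_cost(query, candidate):
    """Classic dynamic time warping cost between two sequences of feature vectors."""
    n, m = len(query), len(candidate)
    inf = float("inf")
    # cost[i][j] holds the best alignment cost of query[:i] with candidate[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(query[i - 1], candidate[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # query advances, candidate frame repeats
                cost[i][j - 1],      # candidate advances, query frame repeats
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


def retrieve(query, corpus, k):
    """Return the names of the k corpus segments with the lowest DTW cost."""
    ranked = sorted(corpus, key=lambda name: dtw_cost(query, corpus[name]))
    return ranked[:k]


# Hypothetical toy data: each segment is a short sequence of 2D features.
query = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
corpus = {
    "place_block": [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (2.0, 0.0)],
    "open_drawer": [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)],
    "wipe_table": [(2.0, 0.0), (1.0, 0.0), (0.0, 0.0)],
}
best = retrieve(query, corpus, k=1)
```

Because DTW tolerates differences in speed, the four-step `place_block` segment still matches the three-step query closely, which is the reason time warping is attractive when demonstrations of the same motion run at different paces.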

Retrieval is not enough; you have to adapt

Here is the catch that separates robotics from text. A retrieved document can be read as is, but a retrieved demonstration was executed in a different scene with objects in different poses, so its raw trajectory rarely transfers. The neighbor is a behavioral prior, not an answer. This is where adaptation matters, and the tools are mature. Object-centric and scene-graph representations capture which objects matter and how they relate, as in ORION's open-world object graphs^[9]. Dense 3D-grounded correspondence from methods like MASt3R can align a retrieved scene to the current one^[10]. And a policy can consume a coarse geometric sketch instead of a perfect demonstration, which is what visual trace conditioning showed with RT-Trajectory^[11] and TraceVLA^[12].
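
As a rough illustration of what "adapting" a neighbor can mean geometrically, the sketch below fits a 2D similarity transform (rotation, scale, translation) to matched keypoints and applies it to a retrieved trajectory. This is a deliberately simplified stand-in: it assumes the correspondences are already given, works in 2D rather than 3D, and is not how MASt3R or any cited method operates. The names `fit_similarity` and `warp` and all coordinates are hypothetical.

```python
def fit_similarity(src, dst):
    """Least-squares 2D similarity transform mapping src keypoints onto dst.

    Points are treated as complex numbers, so the transform is z -> a * z + b,
    where a encodes rotation and scale and b encodes translation.
    Assumes at least two distinct source points.
    """
    zs = [complex(x, y) for x, y in src]
    ws = [complex(x, y) for x, y in dst]
    mean_z = sum(zs) / len(zs)
    mean_w = sum(ws) / len(ws)
    num = sum((z - mean_z).conjugate() * (w - mean_w) for z, w in zip(zs, ws))
    den = sum(abs(z - mean_z) ** 2 for z in zs)
    a = num / den
    b = mean_w - a * mean_z
    return a, b


def warp(trajectory, a, b):
    """Apply the fitted transform to every waypoint of a retrieved trajectory."""
    warped = []
    for x, y in trajectory:
        p = a * complex(x, y) + b
        warped.append((p.real, p.imag))
    return warped


# Hypothetical keypoints: the live scene is the retrieved scene rotated by
# 90 degrees and shifted by (2, 3).
retrieved_keypoints = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
live_keypoints = [(2.0, 3.0), (2.0, 4.0), (1.0, 3.0)]
a, b = fit_similarity(retrieved_keypoints, live_keypoints)

retrieved_trajectory = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
adapted = warp(retrieved_trajectory, a, b)
```

The adapted waypoints follow the same shape as the retrieved motion but land where the objects actually are in the live scene, which is the sense in which the neighbor acts as a prior rather than an answer.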

The backbones are ready to be augmented

Modern policies are well suited to this. The policies in the line from RT-1^[13] and RT-2^[14] through OpenVLA^[15], the flow-based models from Physical Intelligence^[16][17], and GR00T N1^[18] all inherit internet-scale priors. The same semantic grounding those models stand on, built from features like DINOv2^[19], CLIP^[20], and SigLIP^[21], is exactly what lets a retriever find the right neighbor in the first place.

What RA-VLA does

My recent work, RA-VLA, makes this concrete with a retrieve-then-warp recipe over DROID. It encodes the instruction and a scene graph of the current observation, retrieves semantically matching demonstrations, then uses a multimodal language model to warp the retrieved trajectory to keypoint correspondences in the live scene. The warped trajectory is overlaid on the robot's camera views as a scene-grounded reference, and we fine-tune a pi-0.5 backbone to act on these overlays. Candidates are ranked by a blend of language similarity and scene-graph structure,

$$ d_{\text{final}}(d) = \alpha\,\bigl(1 - \cos(\mathbf{q}_\ell, \mathbf{e}^d_\ell)\bigr) + (1-\alpha)\,\bigl(1 - s_{\text{sg}}(d)\bigr), $$

where the first term is one minus the cosine similarity between the query and demonstration instruction embeddings, the second is one minus the structural similarity $s_{\text{sg}}$ between scene graphs, and $\alpha$ weights the two, so a smaller value indicates a closer match. Across the RoboLab-120 simulator and real Franka tasks, conditioning on adapted neighbors improves success over a strong fine-tuned baseline with no new demonstrations collected. The retriever builds on our work on structured perception in ESCA^[22].
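
The ranking formula above can be sketched in a few lines. This is an illustrative reading of the equation, not the RA-VLA implementation: the function names, the candidate records, the embedding values, and the choice of `alpha=0.5` are all hypothetical, and the scene-graph similarity `s_sg` is assumed to be precomputed and to lie in [0, 1].

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def final_distance(q_lang, e_lang, s_sg, alpha):
    """Blend of language distance and scene-graph distance, as in the formula."""
    return alpha * (1.0 - cosine(q_lang, e_lang)) + (1.0 - alpha) * (1.0 - s_sg)


def rank(q_lang, candidates, alpha=0.5):
    """Sort candidate demonstrations from closest to farthest."""
    return sorted(
        candidates,
        key=lambda c: final_distance(q_lang, c["e_lang"], c["s_sg"], alpha),
    )


# Hypothetical candidates with toy 2D language embeddings.
query_embedding = [1.0, 0.0]
candidates = [
    {"name": "same_words_other_scene", "e_lang": [1.0, 0.0], "s_sg": 0.2},
    {"name": "other_words_same_scene", "e_lang": [0.0, 1.0], "s_sg": 1.0},
    {"name": "close_on_both", "e_lang": [1.0, 1.0], "s_sg": 0.9},
]
ordered = [c["name"] for c in rank(query_embedding, candidates, alpha=0.5)]
```

With equal weighting, the candidate that is reasonably close on both terms ranks ahead of candidates that match perfectly on only one, which is the intended effect of blending the two signals. Setting `alpha=1.0` reduces the ranking to language similarity alone.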

The thesis is simple. Teleoperation scales with human hours. Retrieval scales with data you already have, and it improves every time the corpus grows. Retrieval belongs in the robot learning stack for the same reason it belongs in language modeling. It is the cheapest way to give a fixed-capacity policy access to an open-ended world.

References