Ask why an embodied agent failed and the answer is usually not that it made a bad plan. It is that it misread the scene. In our work on ESCA we found that up to 69 percent of agent failures came from perception errors rather than reasoning[1]. The pattern shows up well beyond our setting. On BLINK, strong multimodal models sit barely above chance on core visual perception that people find trivial[2], and on EmbodiedBench the best model reaches only about 29 percent average success while doing far better at high level reasoning than at the low level perception that grounds it[3]. The bottleneck is seeing, not thinking.
Flat features blur the things that matter
Most vision language and vision language action models consume an image as a grid of features. That representation is wonderful for semantics and weak for relations. Whether the cup is to the left of the bowl or behind it is exactly the fact that a global feature vector smears together, and benchmarks built to probe spatial reasoning confirm that current models struggle with it[4]. A policy that cannot reliably tell where things sit relative to each other will act on a guess.
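To make the smearing concrete, here is a deliberately extreme toy sketch in which the whole grid is averaged into one vector. Real encoders keep positional information and are not this lossy, so the one-hot features and the `mean_pool` helper are illustrative assumptions, not a model of any particular system.

```python
def mean_pool(grid):
    """Average every cell's feature vector into one global vector."""
    cells = [cell for row in grid for cell in row]
    dim = len(cells[0])
    return [sum(cell[k] for cell in cells) / len(cells) for k in range(dim)]


# Hypothetical one-hot "features": dimension 0 fires on a cup, dimension 1 on a bowl.
CUP, BOWL, EMPTY = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]

# Two 1x3 grids that differ only in where the objects sit.
cup_left_of_bowl = [[CUP, EMPTY, BOWL]]
cup_right_of_bowl = [[BOWL, EMPTY, CUP]]

pooled_a = mean_pool(cup_left_of_bowl)
pooled_b = mean_pool(cup_right_of_bowl)

# Both layouts collapse to the same vector: the relation is gone.
print(pooled_a == pooled_b)
```

The pooled vector still says that a cup and a bowl are present, which is the semantic part. It no longer says which one is on the left, which is the relational part.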
Scene graphs make structure explicit
There is an old and well grounded alternative. Represent the scene as a graph,
$$ G = (V, E), $$
where each node in $V$ is an object with its attributes and each edge in $E$ is a relation between two objects. The scene graph formalism began as a tool for image retrieval[5], became learnable end to end[6], and was grounded in the dense annotations of Visual Genome[7]. For robots the idea extends naturally into three dimensions. 3D scene graphs organize space into objects, rooms, and their relations[8], and systems like Hydra build and optimize them from sensor data in real time[9]. Structure is not a research toy here. It already runs on robots.
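As a minimal sketch of $G = (V, E)$ in code, the structure below stores objects as nodes and relations as directed edges. The class names, field names, and the example predicates are hypothetical choices for illustration, not the schema of any system cited here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObjectNode:
    """A node in V: one object and its attributes."""
    node_id: int
    label: str
    attributes: tuple = ()


@dataclass(frozen=True)
class RelationEdge:
    """A directed edge in E: subject --predicate--> object."""
    subject_id: int
    predicate: str
    object_id: int


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> ObjectNode
    edges: list = field(default_factory=list)   # list of RelationEdge

    def add_object(self, label, attributes=()):
        node_id = len(self.nodes)
        self.nodes[node_id] = ObjectNode(node_id, label, tuple(attributes))
        return node_id

    def relate(self, subject_id, predicate, object_id):
        if subject_id not in self.nodes or object_id not in self.nodes:
            raise KeyError("both endpoints must already be nodes in the graph")
        self.edges.append(RelationEdge(subject_id, predicate, object_id))

    def relations_of(self, subject_id):
        """Return (predicate, object label) pairs leaving the given node."""
        return [(e.predicate, self.nodes[e.object_id].label)
                for e in self.edges if e.subject_id == subject_id]


graph = SceneGraph()
cup = graph.add_object("cup", ["red"])
bowl = graph.add_object("bowl")
table = graph.add_object("table", ["wooden"])
graph.relate(cup, "left of", bowl)
graph.relate(cup, "on", table)
graph.relate(bowl, "on", table)

print(graph.relations_of(cup))
```

The point of the exercise is that "is the cup left of the bowl" becomes a lookup over edges rather than something to be inferred from pixels.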
Why a graph is the right interface
A graph is the representation a planner actually wants. It is compact, it is object centric in the spirit of learned slot representations[10], and it speaks in objects and relations, which is the level at which language conditioned planners operate. SayCan grounds instructions in affordances[11], Code as Policies and ProgPrompt write programs over detected objects[12][13], and VoxPoser and ReKep reason over 3D spatial structure and relational keypoints between perception and action[14][15]. Each of these wants a clean, structured view of the world. Handing such a planner a graph rather than a bag of pixels removes a guessing step.
What ESCA does
This is the bet behind ESCA. We built VINE, a foundation model that turns video into spatio-temporal scene graphs, and fed those graphs to multimodal models as explicit spatial context. The graphs are produced through a neurosymbolic pipeline in the tradition of Scallop[16] and DeepProbLog[17], which lets us train an open vocabulary scene graph generator without hand labeled graphs. Giving agents this structured context raised success and cut perception errors on EmbodiedBench without retraining the underlying models[1]. The gain came from fixing what the agent sees, not from making it think harder.
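One way to picture "explicit spatial context" is to render a graph as text and prepend it to the instruction. The sketch below is only that: the `scene_graph_to_context` function, its input schema, and the prompt layout are assumptions made for illustration and do not describe the format ESCA or VINE actually use.

```python
def scene_graph_to_context(objects, relations):
    """Render a scene graph as plain-text lines for a model prompt.

    objects:   dict mapping object id -> label (hypothetical schema)
    relations: list of (frame, subject_id, predicate, object_id) tuples,
               where the frame index carries the temporal part of the graph
    """
    lines = ["Objects:"]
    for obj_id in sorted(objects):
        lines.append(f"- {obj_id}: {objects[obj_id]}")
    lines.append("Relations:")
    for frame, subj, pred, obj in sorted(relations):
        lines.append(f"- frame {frame}: {objects[subj]} {pred} {objects[obj]}")
    return "\n".join(lines)


objects = {0: "cup", 1: "bowl", 2: "gripper"}
relations = [
    (0, 0, "left of", 1),
    (5, 2, "holding", 0),
]

context = scene_graph_to_context(objects, relations)
prompt = f"{context}\n\nInstruction: put the cup in the bowl."
print(prompt)
```

Because the context is just text placed ahead of the instruction, the underlying model is left untouched, which matches the no-retraining setup described above.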
Structure pays twice
The same structure that helps an agent act also helps it retrieve. In my work on RA-VLA the scene graph is the key we search with, since matching on objects and relations finds a more useful neighbor than matching on raw appearance. A simple relational score over matched edges,
$$ s_{\text{sg}}(d) = \frac{1}{|M|}\sum_{(i,j)\in M}\cos\!\bigl(\mathbf{r}^{q}_{i},\, \mathbf{r}^{d}_{j}\bigr), $$
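rewards demonstrations whose relational layout resembles the current scene. Here $M$ is the set of matched edge pairs between the query scene $q$ and a demonstration $d$, and $\mathbf{r}^{q}_{i}$ and $\mathbf{r}^{d}_{j}$ are the embeddings of the matched relations. Recent policies are starting to make the graph a first class part of control as well, orchestrating low level skills over a symbolic scene graph[18].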
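The relational score $s_{\text{sg}}(d)$ above can be sketched in a few lines. The sketch assumes relation embeddings are plain lists of floats and that the matching $M$ is supplied as a list of index pairs. How edges get matched is outside its scope, and the zero-score fallbacks are assumptions rather than part of the formula.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector contributes no similarity
    return dot / (norm_u * norm_v)


def relational_score(query_edges, demo_edges, matches):
    """Mean cosine similarity over matched edge pairs (i, j) in M."""
    if not matches:
        return 0.0  # assumption: no matched edges means no relational evidence
    total = sum(cosine(query_edges[i], demo_edges[j]) for i, j in matches)
    return total / len(matches)


# Toy 2-d relation embeddings, invented for illustration.
query = [[1.0, 0.0], [0.0, 1.0]]
demos = {
    "same_layout": [[1.0, 0.0], [0.0, 1.0]],
    "swapped_layout": [[0.0, 1.0], [1.0, 0.0]],
}
matches = [(0, 0), (1, 1)]  # hypothetical matching M

scores = {name: relational_score(query, edges, matches)
          for name, edges in demos.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

In this toy case the demonstration with the same relational layout scores 1.0 and the swapped one scores 0.0, so a retrieval step would pick the former.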