Perception

Scene graphs as the interface between perception and action

Apr 30, 2026

Ask why an embodied agent failed and the answer is usually not that it made a bad plan. It is that it misread the scene. In our work on ESCA we found that up to 69 percent of agent failures came from perception errors rather than reasoning[1]. The pattern shows up well beyond our setting. On BLINK, strong multimodal models sit barely above chance on core visual perception that people find trivial[2], and on EmbodiedBench the best model reaches only about 29 percent average success while doing far better at high-level reasoning than at the low-level perception that grounds it[3]. The bottleneck is seeing, not thinking.

Flat features blur the things that matter

Most vision-language and vision-language-action models consume an image as a grid of features. That representation is wonderful for semantics and weak for relations. Whether the cup is to the left of the bowl or behind it is exactly the fact that a global feature vector smears together, and benchmarks built to probe spatial reasoning confirm that current models struggle with it[4]. A policy that cannot reliably tell where things sit relative to each other will act on a guess.
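To make the smearing concrete, here is a deliberately extreme toy case: mean pooling over a feature grid. The two-cell grids and the feature values are invented for illustration, and real models keep more layout information than this through positional encodings and attention. The sketch only shows why a single pooled vector cannot, on its own, separate "cup left of bowl" from "bowl left of cup".

```python
# Toy illustration: mean pooling over a feature grid discards layout.
# The grids and feature vectors are made up for this example.

def mean_pool(grid):
    """Average the feature vectors of every cell in a 2D grid."""
    cells = [cell for row in grid for cell in row]
    dim = len(cells[0])
    return [sum(c[k] for c in cells) / len(cells) for k in range(dim)]

CUP = [1.0, 0.0]   # hypothetical feature for a cell showing the cup
BOWL = [0.0, 1.0]  # hypothetical feature for a cell showing the bowl

cup_left_of_bowl = [[CUP, BOWL]]
bowl_left_of_cup = [[BOWL, CUP]]

pooled_a = mean_pool(cup_left_of_bowl)
pooled_b = mean_pool(bowl_left_of_cup)

# Both layouts collapse to the same pooled vector.
print(pooled_a == pooled_b)
```

Any representation that is invariant to swapping cells has this property, which is why the relation has to be stored somewhere explicit if the policy is going to rely on it.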

Scene graphs make structure explicit

There is an old and well-grounded alternative. Represent the scene as a graph,

$$ G = (V, E), $$

where each node in $V$ is an object with its attributes and each edge in $E$ is a relation between two objects. The scene graph formalism began as a tool for image retrieval[5], became learnable end to end[6], and was grounded in the dense annotations of Visual Genome[7]. For robots the idea extends naturally into three dimensions. 3D scene graphs organize space into objects, rooms, and their relations[8], and systems like Hydra build and optimize them from sensor data in real time[9]. Structure is not a research toy here. It already runs on robots.
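As a minimal sketch of the definition above, the graph can be held in a few small Python data classes. The class and field names (`ObjectNode`, `Relation`, `SceneGraph`, `left_of`) are hypothetical choices for this post, not the data model of any of the cited systems.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObjectNode:
    """One element of V: an object and its attributes."""
    node_id: int
    label: str
    attributes: tuple = ()


@dataclass(frozen=True)
class Relation:
    """One element of E: a directed, labeled edge between two objects."""
    subject: int
    predicate: str
    obj: int


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> ObjectNode
    edges: list = field(default_factory=list)   # list of Relation

    def add_object(self, node_id, label, attributes=()):
        self.nodes[node_id] = ObjectNode(node_id, label, tuple(attributes))

    def add_relation(self, subject, predicate, obj):
        # An edge is only valid between objects already present in V.
        if subject not in self.nodes or obj not in self.nodes:
            raise KeyError("both endpoints must be objects already in the graph")
        self.edges.append(Relation(subject, predicate, obj))

    def triples(self):
        """Return edges as readable (subject, predicate, object) label triples."""
        return [
            (self.nodes[e.subject].label, e.predicate, self.nodes[e.obj].label)
            for e in self.edges
        ]


graph = SceneGraph()
graph.add_object(0, "cup", ["red"])
graph.add_object(1, "bowl")
graph.add_relation(0, "left_of", 1)
print(graph.triples())  # [('cup', 'left_of', 'bowl')]
```

A 3D variant would add pose or room membership to each node, but the shape of the structure stays the same.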

Why a graph is the right interface

A graph is the representation a planner actually wants. It is compact, it is object-centric in the spirit of learned slot representations[10], and it speaks in objects and relations, which is the level at which language-conditioned planners operate. SayCan grounds instructions in affordances[11], Code as Policies and ProgPrompt write programs over detected objects[12][13], and VoxPoser and ReKep reason over 3D spatial structure and relational keypoints between perception and action[14][15]. Each of these wants a clean, structured view of the world. Handing it a graph rather than a bag of pixels removes a guessing step.

What ESCA does

This is the bet behind ESCA. We built VINE, a foundation model that turns video into spatio-temporal scene graphs, and fed those graphs to multimodal models as explicit spatial context. The graphs are produced through a neurosymbolic pipeline in the tradition of Scallop[16] and DeepProbLog[17], which lets us train an open-vocabulary scene graph generator without hand-labeled graphs. Giving agents this structured context raised success and cut perception errors on EmbodiedBench without retraining the underlying models[1]. The gain came from fixing what the agent sees, not from making it think harder.
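One way to picture "explicit spatial context" is to render a spatio-temporal graph as plain text and prepend it to the instruction. The sketch below is an assumption made for illustration: the function name `format_scene_context`, the line format, and the example objects are hypothetical, and this is not the prompt format ESCA itself uses.

```python
def format_scene_context(objects, relations):
    """Render objects and timestamped relations as plain-text lines.

    objects:   list of (name, [attribute, ...])
    relations: list of (frame, subject, predicate, object)
    Both layouts are hypothetical conventions chosen for this sketch.
    """
    lines = ["Objects:"]
    for name, attrs in objects:
        desc = f"{name} ({', '.join(attrs)})" if attrs else name
        lines.append(f"- {desc}")
    lines.append("Relations:")
    # Sort by frame so the model reads relations in temporal order.
    for frame, subj, pred, obj in sorted(relations):
        lines.append(f"- frame {frame}: {subj} {pred} {obj}")
    return "\n".join(lines)


objects = [("cup", ["red"]), ("bowl", [])]
relations = [
    (12, "cup", "inside", "bowl"),
    (0, "cup", "left of", "bowl"),
]

context = format_scene_context(objects, relations)
prompt = context + "\n\nInstruction: put the cup in the bowl."
print(prompt)
```

The point of the design is that the underlying model is untouched. Only the text it receives changes, which is consistent with gains that arrive without retraining.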

Structure pays twice

The same structure that helps an agent act also helps it retrieve. In my work on RA-VLA the scene graph is the key we search with, since matching on objects and relations finds a more useful neighbor than matching on raw appearance. A simple relational score over matched edges,

$$ s_{\text{sg}}(d) = \frac{1}{|M|}\sum_{(i,j)\in M}\cos\!\bigl(\mathbf{r}^{q}_{i},\, \mathbf{r}^{d}_{j}\bigr), $$

rewards demonstrations whose relational layout resembles the current scene. Recent policies are starting to make the graph a first-class part of control as well, orchestrating low-level skills over a symbolic scene graph[18].
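The relational score $s_{\text{sg}}(d)$ above is short enough to write out directly. In this sketch the matched set $M$ is assumed to be given as a list of index pairs, since how edges get matched is outside the formula. The embeddings are toy two-dimensional vectors, and returning 0.0 for an empty match set or a zero vector is a convention chosen here rather than part of the definition.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for degenerate embeddings
    return dot / (norm_u * norm_v)


def relational_score(query_rel, demo_rel, matches):
    """Mean cosine similarity over matched edge pairs (i, j) in M.

    query_rel[i] is the relation embedding r^q_i of query edge i,
    demo_rel[j] is the relation embedding r^d_j of demonstration edge j.
    """
    if not matches:
        return 0.0  # assumed convention: no matched edges, no relational evidence
    total = sum(cosine(query_rel[i], demo_rel[j]) for i, j in matches)
    return total / len(matches)


# Toy embeddings: the first matched pair agrees, the second is orthogonal.
query_rel = [[1.0, 0.0], [0.0, 1.0]]
demo_rel = [[1.0, 0.0], [1.0, 0.0]]
matches = [(0, 0), (1, 1)]

score = relational_score(query_rel, demo_rel, matches)
print(score)  # 0.5
```

Ranking candidate demonstrations by this score and keeping the top few is one simple way to use it as a retrieval key.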

The thesis. Perception should hand the policy a clean object-level model of the world, not a wall of features. Scene graphs are a natural form for that handoff. They are compact, they are spatial, and they match how both planners and retrievers want to reason. If most embodied failures are perception failures, the fix is to give perception a structure worth trusting.

References

  1. ESCA: Contextualizing Embodied Agents via Scene-Graph Generation. Jiani Huang, Amish Sethi et al., NeurIPS 2025 Spotlight. arXiv:2510.15963
  2. BLINK: Multimodal Large Language Models Can See but Not Perceive. Xingyu Fu et al., ECCV 2024. arXiv:2404.12390
  3. EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents. Rui Yang et al., ICML 2025. arXiv:2502.09560
  4. SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities. Boyuan Chen et al., CVPR 2024. arXiv:2401.12168
  5. Image Retrieval using Scene Graphs. Justin Johnson et al., CVPR 2015. CVF open access
  6. Scene Graph Generation by Iterative Message Passing. Danfei Xu et al., CVPR 2017. arXiv:1701.02426
  7. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Ranjay Krishna et al., IJCV 2017. IJCV 123(1)
  8. 3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera. Iro Armeni et al., ICCV 2019. CVF open access
  9. Hydra: A Real-time Spatial Perception System for 3D Scene Graph Construction and Optimization. Nathan Hughes et al., RSS 2022. arXiv:2201.13360
  10. Object-Centric Learning with Slot Attention. Francesco Locatello et al., NeurIPS 2020. arXiv:2006.15055
  11. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan). Michael Ahn et al., CoRL 2022. arXiv:2204.01691
  12. Code as Policies: Language Model Programs for Embodied Control. Jacky Liang et al., ICRA 2023. arXiv:2209.07753
  13. ProgPrompt: Generating Situated Robot Task Plans using Large Language Models. Ishika Singh et al., ICRA 2023. arXiv:2209.11302
  14. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models. Wenlong Huang et al., CoRL 2023. arXiv:2307.05973
  15. ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation. Wenlong Huang et al., CoRL 2024. arXiv:2409.01652
  16. Scallop: A Language for Neurosymbolic Programming. Ziyang Li et al., PLDI 2023. arXiv:2304.04812
  17. DeepProbLog: Neural Probabilistic Logic Programming. Robin Manhaeve et al., NeurIPS 2018. NeurIPS proceedings
  18. GraSP-VLA: Graph-based Symbolic Action Representation for Long-Horizon Planning with VLA Policies. Maelic Neau et al., 2025. arXiv:2511.04357
