World models, and why inverse dynamics beats joint prediction

World models are having a moment. Systems like Genie 3 generate interactive worlds you can steer in real time, and large video generators are now openly described as simulators of the physical world. The excitement is justified, but it hides a design question that matters for anyone who wants to act and not only watch. If the point of a world model is control, what exactly should it predict?

There are two answers. One learns a forward generative model that predicts what the world will look like next. The other learns an inverse model that recovers the action that explains a change which already happened. My claim is that for control, the inverse view is usually the better use of video, and the field's own results keep pointing that way.

The generative forward model

The classic recipe learns to imagine. The original World Models work trained a latent forward model and evolved a small controller inside it^[1]. The Dreamer line scaled the idea, learning behaviors purely from rollouts in a learned latent model and eventually mastering many domains from scratch^[2][3][4]. Transformer world models such as IRIS made the loop sample-efficient^[5], and TD-MPC2 showed that a decoder-free latent model can drive strong continuous control^[6]. These are real achievements, and planning inside a learned model is genuinely powerful.
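To make "planning inside a model" concrete, here is a minimal sketch using random shooting, which is a deliberately simple stand-in and not the method of World Models, Dreamer, IRIS, or TD-MPC2. The function `forward_model`, the scalar state, and the quadratic cost are all hypothetical; a real system would put a learned latent dynamics model where the toy rule is.

```python
import random

def forward_model(state, action):
    """Stand-in for a learned dynamics model (assumption: a known toy rule)."""
    return state + action

def plan(state, goal, rng, horizon=3, candidates=1000):
    """Random shooting: imagine many action sequences, keep the best first action."""
    best_cost, best_first = float("inf"), 0.0
    for _ in range(candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        imagined, cost = state, 0.0
        for a in actions:
            imagined = forward_model(imagined, a)  # rollout happens in the model only
            cost += (imagined - goal) ** 2
        if cost < best_cost:
            best_cost, best_first = cost, actions[0]
    return best_first

rng = random.Random(0)
state, goal = 0.0, 3.0
for _ in range(15):
    # Replan at every step and execute only the first action.
    # The "real" environment here reuses the same toy rule as the model.
    state = forward_model(state, plan(state, goal, rng))

final_error = abs(state - goal)
print(f"distance to goal after 15 steps: {final_error:.3f}")
```

The controller never sees a reward from the environment while choosing an action; everything it evaluates is imagined, which is the property the forward-model recipe relies on.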

What pixel prediction spends its capacity on

The trouble appears when the forward model has to predict observations directly. Pixel observations are high-dimensional, and most of their entropy is appearance that has little to do with what the robot should do. A model that predicts every future frame spends most of its capacity rendering texture, lighting, and background that the policy will ignore. Write the one-step joint distribution over the next observation and the action as

$$ p(o_{t+1},\, a_t \mid o_t) = \underbrace{p(o_{t+1} \mid o_t, a_t)}_{\text{forward dynamics}}\; \underbrace{p(a_t \mid o_t)}_{\text{policy}}. $$

Learning the full forward term means modeling the entire observation distribution. For control we mostly care about the action, which is small and structured by comparison.
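One way to see why the action is the cheaper target is to factor the same joint in the other order. This is only the chain rule applied to the symbols already defined; the labels under the braces are my own naming.

$$ p(o_{t+1},\, a_t \mid o_t) = \underbrace{p(o_{t+1} \mid o_t)}_{\text{action-free video prediction}}\; \underbrace{p(a_t \mid o_t,\, o_{t+1})}_{\text{inverse dynamics}}. $$

Read this way, everything high-dimensional sits in the first factor, which needs no action labels at all, and the only factor that involves the action is a distribution over $a_t$ alone.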

The inverse view scales to the video we actually have

An inverse dynamics model flips the target. Rather than predicting the next frame, it predicts the action that connects two frames,

$$ q_\phi(a_t \mid o_t,\, o_{t+1}). $$

This is a low-dimensional, decision-relevant target, and it unlocks the largest resource we have, which is video with no action labels. VPT trained an inverse dynamics model on a small labeled set, used it to label a huge amount of unlabeled web video, then pretrained a strong behavior prior on the result^[7]. LAPO went further and recovered latent actions, a latent inverse model, and a latent action policy from action-free video alone^[8]. LAPA carried the idea into vision-language-action pretraining and, strikingly, matched or beat a model trained on ground-truth actions^[9]. Genie learned a latent action model from unlabeled video and turned it into a controllable world^[10]. The pattern is hard to miss. When the supervision target is the action rather than the pixels, unlabeled video becomes training data.
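The three-stage pipeline described for VPT (fit an inverse model on a little labeled data, pseudo-label a lot of unlabeled data, then clone behavior from the result) can be sketched on a toy problem. Everything here is an assumption made for illustration: the scalar observations, the additive dynamics in `env_step`, the linear models, and the expert gain of -0.5. It is not VPT's architecture or data.

```python
import random

def fit_line(xs, ys):
    """Closed-form least squares for y ~ w * x + b with scalar x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = cov_xy / var_x
    return w, mean_y - w * mean_x

def env_step(obs, action, rng, noise=0.05):
    """Hypothetical toy dynamics: the action shifts the observation, plus noise."""
    return obs + action + rng.gauss(0.0, noise)

rng = random.Random(0)

# Small labeled set: random actions, with the action recorded.
labeled = []
for _ in range(50):
    obs, action = rng.uniform(-1, 1), rng.uniform(-1, 1)
    labeled.append((obs, action, env_step(obs, action, rng)))

# Large unlabeled set: an expert acts with a = -0.5 * o, but only frames are kept.
unlabeled = []
for _ in range(5000):
    obs = rng.uniform(-1, 1)
    unlabeled.append((obs, env_step(obs, -0.5 * obs, rng)))

# Stage 1: inverse dynamics model, action from the change between two frames.
idm_w, idm_b = fit_line([nxt - obs for obs, _, nxt in labeled],
                        [action for _, action, _ in labeled])

# Stage 2: pseudo-label every unlabeled transition with the inverse model.
pseudo = [(obs, idm_w * (nxt - obs) + idm_b) for obs, nxt in unlabeled]

# Stage 3: behavior cloning on the pseudo-labels, action from the current frame.
policy_gain, policy_bias = fit_line([obs for obs, _ in pseudo],
                                    [action for _, action in pseudo])
print(f"recovered policy gain: {policy_gain:.3f} (expert used -0.5)")
```

The policy is fit on 5000 transitions that never carried an action label; the 50 labeled transitions are used only to train the inverse model.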

Why this is not only an efficiency trick

There is a representational reason as well. An inverse model is naturally invariant to much of the appearance variation that a forward model must reproduce. Two scenes that look different yet require the same motion map to the same action, so the learned latent organizes experience by what to do rather than by how things look. That is the property a policy actually needs.
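A hand-built toy makes the invariance visible. Assume a one-dimensional "image" made of a static random background plus a single bright object, and an action that shifts the object. The functions `render` and `infer_shift` are hypothetical, and `infer_shift` is written by hand rather than learned, so this illustrates the claim under a static-background assumption and is not evidence for it.

```python
import random

def render(pos, background):
    """Draw a unit-intensity object at index `pos` on top of a fixed background."""
    frame = list(background)
    frame[pos] += 1.0
    return frame

def infer_shift(frame, next_frame):
    """Hand-coded inverse model: the frame difference cancels the static
    background, leaving a dip where the object was and a peak where it is now."""
    diff = [b - a for a, b in zip(frame, next_frame)]
    new_pos = max(range(len(diff)), key=lambda i: diff[i])
    old_pos = min(range(len(diff)), key=lambda i: diff[i])
    return new_pos - old_pos

rng = random.Random(0)
width = 16
scene_a = [rng.random() for _ in range(width)]  # two scenes that look different
scene_b = [rng.random() for _ in range(width)]

pos, shift = 5, 3  # same motion in both scenes
a0, a1 = render(pos, scene_a), render(pos + shift, scene_a)
b0, b1 = render(pos, scene_b), render(pos + shift, scene_b)

action_a = infer_shift(a0, a1)
action_b = infer_shift(b0, b1)

# What a forward model would have to reproduce differs between the scenes.
pixel_gap = sum((x - y) ** 2 for x, y in zip(a1, b1))
print(action_a, action_b, round(pixel_gap, 3))
```

Both scenes map to the same action even though their next frames differ pixel by pixel. A pixel-level forward model has to account for that difference, while the inverse target does not.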

Where forward models still earn their place

None of this retires generative models. They remain the right tool for planning and for simulation. Dreamer learns its behavior from imagined rollouts^[4], and a line of work from Yilun Du treats video generation itself as a universal policy and as a learned dynamics model for planning, from UniPi^[11] to Video Language Planning^[12], with Diffusion Forcing giving stable long-horizon rollouts^[13]. The broader position that video can serve as a shared interface for perception, planning, and simulation is worth taking seriously^[14], and driving world models like GAIA-1^[15] together with the Genie 2 and Genie 3 systems^[16][17] show how far the generative side has come. Even here the trend is to stop predicting raw pixels. Sora frames video diffusion as a world simulator^[18], yet V-JEPA argues for predicting in a learned latent space rather than reconstructing frames^[19], which is the same instinct that makes inverse and latent action models work.

Where I am headed

This is the thread I am most excited to pull on in my PhD at Harvard University. I do not think the real question is forward versus inverse in the abstract. The win comes from learning an action-relevant latent, whether we get it from an inverse model on real video, from latent actions distilled out of a generative model, or from both. Pixels are a means. The action is the target.

The thesis. If the goal is control, predict the thing you will act on. Inverse dynamics and latent action models get there with the abundant unlabeled video we already have, while a forward model that must render the future pays for detail the policy throws away. Generative world models still shine for planning and simulation, yet even they are drifting away from pixels and toward latents. That convergence is the real signal.

References