AstraNav-World ✨
World Model for Foresight Control and Consistency
Abstract
Embodied navigation in open, dynamic environments demands accurate foresight of how the world will evolve and how actions will unfold over time. We introduce a unified generative world model that jointly reasons about future visual states and action sequences within a single probabilistic framework. Our approach integrates a diffusion-based video generator with a vision-language policy, enabling synchronized rollouts in which predicted scenes and planned actions are updated together. Training optimizes two complementary objectives: generating action-conditioned multi-step visual predictions and deriving trajectories conditioned on those predicted visuals. At inference, the model alternates between forecasting plausible future frames and refining the action plan given both language instructions and the evolving visual rollouts. This bidirectional constraint makes visual predictions executable and keeps decisions grounded in physically consistent, task-relevant futures, mitigating the cumulative errors common in decoupled “predict-then-plan” pipelines. Experiments across diverse embodied navigation benchmarks show improved trajectory accuracy and higher success rates. Ablations confirm the necessity of tight vision–action coupling and unified training: removing either branch degrades both prediction quality and policy reliability. The model further provides interpretable, step-by-step future visualizations that expose planning rationale and uncertainties, facilitating diagnosis and robust deployment. Overall, by unifying foresight and control within a single generative model, we move closer to reliable, interpretable, and general-purpose embodied agents that operate robustly in open-ended real-world settings.
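The alternating inference procedure described above can be sketched as a simple loop that interleaves the two branches. This is a minimal illustration only: the function names, action dimensions, and stand-in models below are assumptions for exposition, not the authors' actual architecture or API (the real system uses a diffusion video generator and a vision-language policy).

```python
# Hedged sketch of the alternating "forecast frames -> refine actions" loop.
# Both branch models are replaced with toy stand-ins; only the control flow
# mirrors the procedure described in the abstract.
import numpy as np

def forecast_frames(obs, actions, rng):
    """Stand-in for the diffusion-based video generator: produces one
    predicted future frame per planned action (here, noisy copies of obs)."""
    return [obs + rng.normal(scale=0.01, size=obs.shape) for _ in actions]

def refine_actions(instruction, obs, frames, actions, rng):
    """Stand-in for the vision-language policy: updates the action plan
    conditioned on the instruction and the predicted visual rollout."""
    return [a + rng.normal(scale=0.01, size=a.shape) for a in actions]

def alternating_rollout(instruction, obs, horizon=8, n_rounds=3, seed=0):
    """Alternate between the vision branch and the action branch so that
    predicted scenes and planned actions are updated together."""
    rng = np.random.default_rng(seed)
    actions = [np.zeros(2) for _ in range(horizon)]  # trivial initial plan
    frames = []
    for _ in range(n_rounds):
        frames = forecast_frames(obs, actions, rng)                       # foresight
        actions = refine_actions(instruction, obs, frames, actions, rng)  # control
    return frames, actions

frames, actions = alternating_rollout("go to the kitchen", np.zeros((4, 4)))
print(len(frames), len(actions))  # one predicted frame per planned action
```

Because each round conditions the plan on the latest visual forecast and vice versa, errors from a single one-shot prediction do not silently propagate, which is the coupling the ablations test by removing one branch.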
Consistency Visualization (Zero-Shot)
Approach
Experiment
Real-World Visualization (Zero-Shot)
Habitat Visualization
BibTeX Citation
@misc{hu2025astranavworldworldmodelforesight,
  title={AstraNav-World: World Model for Foresight Control and Consistency},
  author={Junjun Hu and Jintao Chen and Haochen Bai and Minghua Luo and Shichao Xie and Ziyi Chen and Fei Liu and Zedong Chu and Xinda Xue and Botao Ren and Xiaolong Wu and Mu Xu and Shanghang Zhang},
  year={2025},
  eprint={2512.21714},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.21714},
}