AstraNav-World ✨

World Model for Foresight Control and Consistency

📄 Research Paper

Junjun Hu1,*, Jintao Chen1,2,*, Haochen Bai1,*, Minghua Luo1, Shichao Xie1, Ziyi Chen1, Fei Liu1, Zedong Chu1, Xinda Xue1,2, Botao Ren1,3, Xiaolong Wu1, Mu Xu1, Shanghang Zhang2

1Amap, Alibaba Group, 2Peking University, 3Tsinghua University

Contact us: {hujunjun.hjj, anyi.cjt, baihaochen.bhc}@alibaba-inc.com, cjt@stu.pku.edu.cn

Abstract

Embodied navigation in open, dynamic environments demands accurate foresight of how the world will evolve and how actions will unfold over time. We introduce a generative world model that jointly reasons about future visual states and action sequences within a unified probabilistic framework. Our approach integrates a diffusion-based video generator with a vision-language policy, enabling synchronized rollouts in which predicted scenes and planned actions are updated together. Training optimizes two complementary objectives: generating action-conditioned multi-step visual predictions and deriving trajectories conditioned on those predicted visuals. At inference, the model alternates between forecasting plausible future frames and refining the action plan given both the language instruction and the evolving visual rollout. This bidirectional constraint makes visual predictions executable and keeps decisions grounded in physically consistent, task-relevant futures, mitigating the cumulative errors common in decoupled “predict-then-plan” pipelines. Experiments across diverse embodied navigation benchmarks show improved trajectory accuracy and higher success rates. Ablations confirm the necessity of tight vision–action coupling and unified training: removing either branch degrades both prediction quality and policy reliability. The model further provides interpretable, step-by-step future visualizations that expose planning rationale and uncertainty, facilitating diagnosis and robust deployment. Overall, by unifying foresight and control within a single generative model, we move closer to reliable, interpretable, and general-purpose embodied agents that operate robustly in open-ended real-world settings.
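To make the alternating inference procedure concrete, the sketch below illustrates one possible rollout loop. It is our own illustration under assumed interfaces: `planner`, `video_gen`, and `policy` are hypothetical stand-ins for the VLM planner, video generator, and diffusion policy, not the released API.

```python
def alternating_rollout(planner, video_gen, policy, instruction, obs_history, horizon=8):
    """Illustrative inference loop: alternate between forecasting future
    frames and refining the action plan, so the visual rollout and the
    plan stay synchronized. All interfaces here are assumptions."""
    actions = []
    for _ in range(horizon):
        # High-level conditioning tokens from the instruction and visual history.
        cond = planner(instruction, obs_history)
        # Forecast plausible future frames consistent with the current plan.
        future_frames = video_gen(cond, obs_history)
        # Refine the next action given the instruction and the predicted visuals.
        action = policy(cond, future_frames)
        actions.append(action)
        # Feed the predicted frame back as visual context for the next step.
        obs_history = obs_history + [future_frames[0]]
    return actions
```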

Consistency Visualization

(Zero-Shot)

Approach

An overview of our AstraNav-World architecture. Our model is composed of three jointly trained modules: (a) a VLM planner (τθ) that processes instructions and visual history to generate high-level conditioning tokens; (b) a Diffusion Policy (ϕθ) that generates future actions based on the VLM’s guidance; and (c) a VLM-conditioned Video Generator (υθ) that predicts future visual scenes consistent with the VLM’s plan.
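Conceptually, the three modules are trained jointly under the two complementary objectives from the abstract. The sketch below is a minimal PyTorch-style illustration under assumed interfaces; the diffusion denoising losses are collapsed into plain MSE terms for brevity, and all module and batch-key names are hypothetical.

```python
import torch.nn.functional as F

def joint_training_step(planner, policy, video_gen, batch, optimizer):
    """One illustrative joint update coupling the two objectives:
    (1) action-conditioned multi-step visual prediction, and
    (2) trajectories derived from those predicted visuals.
    Module and batch-key names are assumptions for exposition."""
    # Conditioning tokens from the instruction and visual history (VLM planner).
    cond = planner(batch["instruction"], batch["obs_history"])

    # Objective 1: predict future frames conditioned on the plan and actions.
    pred_frames = video_gen(cond, batch["obs_history"], batch["actions"])
    video_loss = F.mse_loss(pred_frames, batch["future_frames"])

    # Objective 2: derive actions conditioned on the predicted visuals.
    pred_actions = policy(cond, pred_frames)
    action_loss = F.mse_loss(pred_actions, batch["actions"])

    # A single joint update keeps foresight and control tightly coupled.
    loss = video_loss + action_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```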

Experiment

[Figures: quantitative results on embodied navigation benchmarks]

Real-World Visualization

(Zero-Shot)

Habitat Visualization

BibTeX Citation

@misc{hu2025astranavworldworldmodelforesight,
  title={AstraNav-World: World Model for Foresight Control and Consistency},
  author={Junjun Hu and Jintao Chen and Haochen Bai and Minghua Luo and Shichao Xie and Ziyi Chen and Fei Liu and Zedong Chu and Xinda Xue and Botao Ren and Xiaolong Wu and Mu Xu and Shanghang Zhang},
  year={2025},
  eprint={2512.21714},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.21714},
}