OmniNav ✨

A Unified Framework for Prospective Exploration and Visual-Language Navigation

📄 Research Paper

Xinda Xue1,2, Junjun Hu1, Minghua Luo1, Shichao Xie1, Jintao Chen1,2, Zixun Xie1, Kuichen Quan1, Wei Guo1, Mu Xu1, Zedong Chu1

1Amap, Alibaba Group, 2Peking University

Abstract

Embodied navigation is a foundational challenge for intelligent robots, demanding the ability to comprehend visual environments, follow natural language instructions, and explore autonomously. However, existing models struggle to provide a unified solution across heterogeneous navigation paradigms, often yielding low success rates and limited generalization. We present OmniNav, a unified framework that handles instruct-goal, object-goal, and point-goal navigation, as well as frontier-based exploration, within a single architecture. First, we introduce a lightweight, low-latency policy that predicts continuous-space waypoints (coordinates and orientations) with high accuracy, outperforming action-chunk methods in precision and supporting real-world deployment with control frequencies up to 5 Hz. Second, at the architectural level, OmniNav adopts a fast-slow system design: a fast module generates waypoints from relatively short-horizon visual context and subtasks, while a slow module conducts deliberative planning over long-horizon observations and candidate frontiers to select the next subgoal and subtask. This collaboration improves path efficiency and maintains trajectory coherence in exploration and memory-intensive settings. Notably, we find that the primary bottleneck lies not in navigation policy learning per se, but in robust understanding of general instructions and objects. To enhance generalization, we incorporate large-scale general-purpose training datasets, including those used for image captioning and referring/grounding, into a joint multi-task regimen, which substantially boosts success rates and robustness. Extensive experiments demonstrate state-of-the-art performance across diverse navigation benchmarks, and real-world deployment further validates the approach. OmniNav offers practical insights for embodied navigation and points to a scalable path toward versatile, highly generalizable robotic intelligence.

Real-World Visualization

Approach

OmniNav Framework Approach
An overview of the OmniNav architecture: the fast system can independently handle multi-task navigation, using the VLM backbone and a flow-matching policy to rapidly generate waypoints. On top of this, a slow-thinking module is integrated to enable long-term memory and planning: it constructs long-range spatial and semantic memory from frontiers and images, and provides subgoal cues. The slow and fast systems collaborate as follows: the slow system uses frontiers or memory to generate high-level subgoals; once a subgoal is determined, the fast system takes over and progressively produces low-level waypoint sequences until the target is reached.
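To make the fast-slow collaboration concrete, the sketch below shows one way such a loop could be wired up. It is purely illustrative: the class names, the frontier-ranking heuristic, and the waypoint stand-in are placeholders of ours, not the OmniNav implementation or its released API.

```python
# Minimal sketch of a fast-slow navigation loop, assuming hypothetical
# SlowPlanner / FastPolicy interfaces. The real system uses a VLM backbone
# with a flow-matching head; here trivial stand-ins keep the example runnable.
from dataclasses import dataclass
from typing import List, Optional, Tuple
import math

@dataclass
class Waypoint:
    x: float        # target position in the robot frame (meters)
    y: float
    heading: float  # target orientation (radians)

@dataclass
class Subgoal:
    position: Tuple[float, float]  # frontier/memory location chosen by the slow system
    subtask: str                   # short natural-language cue handed to the fast system

class SlowPlanner:
    """Deliberative module: keeps long-horizon frontier/image memory and
    proposes the next subgoal when the previous one is finished."""
    def __init__(self):
        self.memory: List[Tuple[float, float]] = []

    def propose_subgoal(self, frontiers: List[Tuple[float, float]],
                        instruction: str) -> Optional[Subgoal]:
        self.memory.extend(frontiers)
        if not frontiers:
            return None
        # Stand-in ranking: nearest frontier. The actual system scores
        # frontiers against the instruction and semantic memory.
        best = min(frontiers, key=lambda f: math.hypot(f[0], f[1]))
        return Subgoal(position=best, subtask=f"move toward {best} ({instruction})")

class FastPolicy:
    """Reactive module: maps recent observations plus the current subtask
    to a short sequence of continuous waypoints (~5 Hz in deployment)."""
    def predict_waypoints(self, observation: dict, subgoal: Subgoal) -> List[Waypoint]:
        gx, gy = subgoal.position
        # Stand-in for the flow-matching head: step part of the way
        # toward the subgoal while turning to face it.
        step = 0.25
        return [Waypoint(x=step * gx, y=step * gy, heading=math.atan2(gy, gx))]

def navigation_step(slow: SlowPlanner, fast: FastPolicy, observation: dict,
                    frontiers: List[Tuple[float, float]], instruction: str,
                    current: Optional[Subgoal]):
    """One tick: the slow system (re)selects a subgoal only when needed;
    the fast system produces waypoints every tick."""
    if current is None or observation.get("subgoal_reached", False):
        current = slow.propose_subgoal(frontiers, instruction)
    waypoints = fast.predict_waypoints(observation, current) if current else []
    return current, waypoints

# Example tick with dummy inputs.
subgoal, wps = navigation_step(SlowPlanner(), FastPolicy(),
                               observation={"rgb": None},
                               frontiers=[(2.0, 1.0), (4.0, -3.0)],
                               instruction="find the kitchen", current=None)
print(subgoal.subtask, wps)
```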

Experiment

results
results

Habitat Visualization

BibTeX Citation

@article{xue2025omninav,
  title={OmniNav: A Unified Framework for Prospective Exploration and Visual-Language Navigation},
  author={Xue, Xinda and Hu, Junjun and Luo, Minghua and Xie, Shichao and Chen, Jintao and Xie, Zixun and Quan, Kuichen and Guo, Wei and Xu, Mu and Chu, Zedong},
  journal={arXiv preprint arXiv:2509.25687},
  year={2025}
}