AstraNav-Memory ✨

Context Compression for Long Memory

📄 Research Paper

Botao Ren1,2*, Junjun Hu1*, Xinda Xue1,3*, Minghua Luo1, Jintao Chen1,3, Haochen Bai1, Liangliang You1, Mu Xu1

1Amap, Alibaba Group, 2Tsinghua University, 3Peking University

Abstract

Lifelong embodied navigation requires agents to accumulate, retain, and exploit spatial–semantic experience across tasks, enabling efficient exploration in novel environments and rapid goal reaching in familiar ones. While object-centric memory is interpretable, it depends on detection and reconstruction pipelines that limit robustness and scalability. We propose an image-centric memory framework that achieves long-term implicit memory via an efficient visual context compression module coupled end-to-end with a Qwen2.5-VL–based navigation policy. Built atop a ViT backbone with frozen DINOv3 features and lightweight PixelUnshuffle+Conv blocks, our visual tokenizer supports configurable compression rates; for example, under a representative 16× compression setting, each image is encoded with about 30 tokens, expanding the effective context capacity from tens to hundreds of images. Experimental results on GOAT-Bench and HM3D-OVON show that our method achieves state-of-the-art navigation performance, improving exploration in unfamiliar environments and shortening paths in familiar ones. Ablation studies further reveal that moderate compression provides the best balance between efficiency and accuracy. These findings position compressed image-centric memory as a practical and scalable interface for lifelong embodied agents, enabling them to reason over long visual histories and navigate with human-like efficiency.

Motivation

Our agent operates in a lifelong learning setting. For the initial task in an unseen environment, it uses frontier-based exploration to locate the target. Critically, the environment and agent state are preserved across tasks. For subsequent instructions, the agent first consults its memory. If the target object has been previously observed, the agent plans a direct path to its location, bypassing the need for re-exploration.
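
The task-level loop implied by this setting fits in a few lines. The sketch below is purely illustrative: the memory class, the `navigate` helper, and the `explore`/`goto` callbacks are assumptions made for exposition, not the released AstraNav-Memory interface.

```python
from typing import Callable, Iterable, Optional

Position = tuple[float, float]


class SpatialMemory:
    """Maps object categories to the position where each was last observed."""

    def __init__(self) -> None:
        self._seen: dict[str, Position] = {}

    def remember(self, category: str, position: Position) -> None:
        self._seen[category] = position

    def recall(self, category: str) -> Optional[Position]:
        return self._seen.get(category)


def navigate(
    target: str,
    memory: SpatialMemory,
    explore: Callable[[], Iterable[tuple[str, Position]]],
    goto: Callable[[Position], None],
) -> None:
    """Plan straight to a remembered target; otherwise explore and record what is seen."""
    known = memory.recall(target)
    if known is not None:
        goto(known)                          # familiar target: direct path, no re-exploration
        return
    for category, position in explore():     # unfamiliar target: frontier-based exploration
        memory.remember(category, position)  # memory persists across tasks
        if category == target:
            goto(position)
            return
```

Because the memory is preserved across tasks, an object observed while exploring for one instruction directly shortens the path for every later instruction that refers to it.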

Approach

Overview of AstraNav-Memory with the proposed compressed vision encoder. During navigation, up to 300 images are first encoded by a DINOv3 ViT into 598 visual tokens per image; several lightweight compression heads then compress each image to 30 tokens, so the compact tokens can stand in for those produced by the original Qwen2.5-VL ViT. The compact visual tokens and the language command are fed into Qwen2.5-VL-3B, enabling long-horizon navigation reasoning over large visual memories at low computational cost.
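
A minimal PyTorch sketch of one compression head in the spirit of the PixelUnshuffle+Conv design, assuming the ViT tokens form a square grid; the layer sizes, the 4×4 spatial folding (16× token compression), and the `TokenCompressor` name are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn


class TokenCompressor(nn.Module):
    """Compress a grid of visual tokens by folding r x r neighborhoods into channels."""

    def __init__(self, in_dim: int, out_dim: int, ratio: int = 4) -> None:
        super().__init__()
        self.ratio = ratio
        self.unshuffle = nn.PixelUnshuffle(ratio)        # (C, H, W) -> (C*r*r, H/r, W/r)
        self.proj = nn.Conv2d(in_dim * ratio * ratio, out_dim, kernel_size=1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, n, c = tokens.shape
        side = int(n ** 0.5)                             # assumes a square token grid
        x = tokens.transpose(1, 2).reshape(b, c, side, side)
        x = self.proj(self.unshuffle(x))                 # fold space into channels, then project
        return x.flatten(2).transpose(1, 2)              # back to (B, N / r^2, out_dim)


# Example: 1024 ViT tokens per image -> 64 compact tokens (16x fewer).
feats = torch.randn(2, 1024, 768)
print(TokenCompressor(768, 2048)(feats).shape)           # torch.Size([2, 64, 2048])
```

At the figures quoted in the caption, a 300-image history then costs roughly 300 × 30 = 9,000 compressed visual tokens instead of 300 × 598 ≈ 179,400 uncompressed ones.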

Experiments

Results figures: navigation performance on GOAT-Bench and HM3D-OVON.