Exo → Ego Cross-View Segmentation
VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
We propose VGGT-Segmentor, a geometry-aware framework for segmenting the same physical object across egocentric and exocentric views. Built on VGGT [1] and its powerful multi-view geometric representations, VGGT-Segmentor combines mask prompt fusion, point-guided prediction, and iterative refinement to achieve robust cross-view segmentation under extreme changes in viewpoint, scale, and occlusion.
Overview presentation
We hope you enjoy our overview video, a fun and intuitive introduction to VGGT-Segmentor and our geometry-enhanced cross-view segmentation framework.
The Model
VGGT-Segmentor consists of a VGGT Encoder and a lightweight Union Segmentation Head. The Union Segmentation Head has three stages: Mask Prompt Fusion, Point-Guided Prediction, and Mask Refinement. In designing the Union Segmentation Head, we drew significant inspiration from Segment Anything Model 2 [2].
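To make the stage ordering concrete, here is a minimal structural sketch of how a three-stage head could be wired together. Only the order of the stages and the fact that refinement is iterative come from the description above. Everything else is an assumption made for illustration: the class and function names, the inputs and outputs of each stage, the `refine_steps` parameter, and the toy stand-in functions, which carry no geometric meaning and are not the actual VGGT-Segmentor implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

Grid = List[List[float]]  # hypothetical stand-in for a feature map or a mask


@dataclass
class UnionSegmentationHeadSketch:
    """Hypothetical wiring of the three stages; not the authors' code."""
    fuse_mask_prompt: Callable[[Grid, Grid], Grid]   # stage 1: Mask Prompt Fusion
    predict_from_points: Callable[[Grid], Grid]      # stage 2: Point-Guided Prediction
    refine_mask: Callable[[Grid, Grid], Grid]        # stage 3: Mask Refinement
    refine_steps: int = 2                            # assumed number of refinement passes

    def __call__(self, features: Grid, mask_prompt: Grid) -> Grid:
        fused = self.fuse_mask_prompt(features, mask_prompt)
        mask = self.predict_from_points(fused)
        for _ in range(self.refine_steps):
            mask = self.refine_mask(fused, mask)
        return mask


# Toy placeholder stages so the sketch runs end to end.
def toy_fuse(features: Grid, mask_prompt: Grid) -> Grid:
    # Weight each feature value by the prompt value at the same position.
    return [[f * m for f, m in zip(fr, mr)] for fr, mr in zip(features, mask_prompt)]


def toy_predict(fused: Grid) -> Grid:
    # Threshold the fused values at 0.5 to get a binary mask.
    return [[1.0 if v > 0.5 else 0.0 for v in row] for row in fused]


def toy_refine(fused: Grid, mask: Grid) -> Grid:
    # Keep a mask pixel only where the fused value is positive.
    return [[m if v > 0 else 0.0 for v, m in zip(fr, mr)] for fr, mr in zip(fused, mask)]


head = UnionSegmentationHeadSketch(toy_fuse, toy_predict, toy_refine)
features = [[0.9, 0.2], [0.8, 0.7]]
mask_prompt = [[1.0, 1.0], [0.0, 1.0]]
result = head(features, mask_prompt)
```

Passing the stages in as callables keeps the sketch independent of any particular network layers; in a real model each stage would be a learned module operating on encoder features.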
The Results
We evaluate our method on the Ego-Exo4D benchmark. Our approach achieves 67.7% IoU on Ego→Exo and 68.0% IoU on Exo→Ego, surpassing the previous state-of-the-art method, DOMR, by 18.0% and 12.8%, respectively. It also outperforms the LLM-based ObjectRelator by 22.3% and 17.1% in the two directions. In the zero-shot learning (ZSL) setting, our model achieves 54.1% IoU on Ego→Exo and 58.4% IoU on Exo→Ego.
Results on Ego-Exo4D
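For readers unfamiliar with the metric, the IoU (intersection over union) reported above compares a predicted mask with a ground-truth mask. The snippet below is a minimal sketch of the per-mask computation. The function name `mask_iou`, the nested-list mask format, and the handling of two empty masks are assumptions for illustration; how the benchmark aggregates scores across frames and objects is not specified here.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of the same shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    if union == 0:
        # Both masks are empty; treating this as a perfect match is an
        # assumed convention, not something stated by the benchmark.
        return 1.0
    return intersection / union


pred = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
target = [
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
iou = mask_iou(pred, target)  # 4 shared pixels out of 8 covered pixels
```

In this toy example the two masks share 4 pixels and together cover 8, so the IoU is 0.5, which would be reported as 50.0%.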
Citation
If you use VGGT-Segmentor in your research, please use the following BibTeX entry.
@article{gao2026vggt,
title={VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation},
author={Gao, Yulu and Zhang, Bohao and Tang, Zongheng and Liao, Jitong and Wu, Wenjun and Liu, Si},
journal={arXiv preprint arXiv:2604.13596},
year={2026}
}