VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation

CVPR 2026

Seoul National University
† Corresponding author

Abstract

We propose VIRST (Video-Instructed Reasoning Assistant for Spatio-Temporal Segmentation), an end-to-end framework that unifies global video reasoning and pixel-level mask prediction for Referring Video Object Segmentation (RVOS). VIRST bridges semantic reasoning and segmentation through Spatio-Temporal Fusion (STF), which injects segmentation-aware video features into a vision-language backbone. It further introduces a Temporal Dynamic Anchor Updater (TDAU) to maintain temporally adjacent anchor frames, providing stable temporal cues under large motion, occlusion, and object reappearance. VIRST achieves state-of-the-art results across diverse RVOS benchmarks under realistic and challenging conditions.
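The abstract describes TDAU as maintaining temporally adjacent anchor frames so segmentation stays stable through occlusion and reappearance. As a rough illustration of that idea only, the sketch below keeps a small buffer of recent high-confidence frames as anchors; the class name, the confidence-threshold rule, and the fixed-capacity buffer are all illustrative assumptions, not the paper's actual TDAU design.

```python
class DynamicAnchorUpdater:
    """Illustrative sketch of a temporally adjacent anchor buffer.

    Keeps the most recent frames whose mask confidence exceeds a
    threshold, so downstream modules can condition on stable,
    temporally close reference frames. The update rule here is an
    assumption for illustration, not the paper's TDAU.
    """

    def __init__(self, capacity=3, conf_threshold=0.5):
        self.capacity = capacity
        self.conf_threshold = conf_threshold
        self.anchors = []  # list of (frame_idx, feature) pairs

    def update(self, frame_idx, feature, confidence):
        # Skip low-confidence frames (e.g. full occlusion), so the
        # last reliable anchors survive until the object reappears.
        if confidence < self.conf_threshold:
            return self.anchors
        self.anchors.append((frame_idx, feature))
        # Keep only the `capacity` most recent anchors, so cues
        # remain temporally adjacent to the current frame.
        if len(self.anchors) > self.capacity:
            self.anchors.pop(0)
        return self.anchors
```

For example, with `capacity=2`, feeding frames 0–3 where frame 1 is occluded leaves frames 2 and 3 as anchors: the occluded frame never displaces a reliable one.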

Method

Results

Table 1. Performance comparison on the ReVOS benchmark. Best results are in bold; second-best are underlined. J, F, and J&F denote region similarity, contour accuracy, and their mean; R is the robustness score.

| Model | Venue | Referring (J / F / J&F) | Reasoning (J / F / J&F) | Overall (J / F / J&F) | R |
|---|---|---|---|---|---|
| *Segmentation Expert* | | | | | |
| MTTR | ECCV'22 | 29.8 / 30.2 / 30.0 | 20.4 / 21.5 / 21.0 | 25.1 / 25.9 / 25.5 | 5.6 |
| ReferFormer | CVPR'22 | 31.2 / 34.3 / 32.7 | 21.3 / 25.6 / 23.4 | 26.2 / 29.9 / 28.1 | 8.8 |
| LMPM | ICCV'23 | 29.0 / 39.1 / 34.1 | 13.3 / 24.8 / 19.0 | 21.2 / 27.1 / 26.8 | 3.8 |
| *MLLM-based Segmentation Method* | | | | | |
| LISA-7B | CVPR'24 | 44.3 / 47.1 / 45.7 | 33.8 / 38.4 / 36.1 | 39.1 / 42.7 / 40.9 | 9.3 |
| VISA-7B | ECCV'24 | 49.2 / 52.6 / 50.9 | 40.6 / 45.4 / 43.0 | 44.9 / 49.0 / 46.9 | 15.5 |
| VISA-13B | ECCV'24 | 55.6 / 59.1 / 57.4 | 42.0 / 46.7 / 44.3 | 48.8 / 52.9 / 50.9 | 15.5 |
| HyperSeg | CVPR'25 | 56.0 / 60.9 / 58.5 | 50.2 / 55.8 / 53.0 | 53.1 / 58.4 / 55.7 | -- |
| VRS-HQ-7B | CVPR'25 | 59.8 / 64.5 / 62.1 | 53.5 / 58.7 / 56.1 | 56.6 / 61.6 / 59.1 | 19.7 |
| VRS-HQ-13B | CVPR'25 | 61.1 / 65.5 / 63.3 | 54.1 / 59.4 / 56.8 | 57.6 / 62.5 / 60.0 | 18.9 |
| InstructSeg | ICCV'25 | 54.8 / 59.2 / 57.0 | 49.2 / 54.7 / 51.9 | 52.0 / 56.9 / 54.5 | -- |
| RGA3-7B | ICCV'25 | 58.7 / 62.3 / 60.5 | 53.1 / 57.7 / 55.4 | 55.9 / 60.0 / 58.0 | 28.6 |
| ViLLa-6B | ICCV'25 | -- | -- | 54.9 / 59.1 / 57.0 | -- |
| **Ours (VIRST)** | CVPR'26 | 68.8 / 72.8 / 70.8 | 63.9 / 68.3 / 66.1 | 66.3 / 70.6 / 68.4 | 21.8 |
Table 2. Performance comparison with previous methods on the validation sets of RVOS datasets. Best results are in bold; second-best are underlined.

| Model | Venue | MeViS (J / F / J&F) | Ref-YT-VOS (J / F / J&F) | Ref-DAVIS17 (J / F / J&F) |
|---|---|---|---|---|
| *Segmentation Expert* | | | | |
| ReferFormer | CVPR'22 | 29.8 / 32.2 / 31.0 | 61.3 / 64.6 / 62.9 | 58.1 / 64.1 / 61.1 |
| OnlineRefer | ICCV'23 | -- | 61.6 / 65.5 / 63.5 | 61.6 / 67.7 / 64.8 |
| SAMWISE | CVPR'25 | 46.6 / 52.4 / 49.5 | 67.8 / 70.6 / 69.2 | 67.4 / 74.5 / 70.6 |
| MPG-SAM2 | ICCV'25 | 50.7 / 56.7 / 53.7 | 71.7 / 76.1 / 73.9 | 68.8 / 76.0 / 72.4 |
| ReferDINO | ICCV'25 | 44.7 / 53.9 / 49.3 | 67.0 / 71.5 / 69.3 | 65.1 / 72.9 / 68.9 |
| *MLLM-based Segmentation Method* | | | | |
| LISA-7B | CVPR'24 | 35.1 / 39.4 / 37.2 | 53.4 / 54.3 / 53.9 | 62.2 / 67.3 / 64.8 |
| VISA-7B | ECCV'24 | 40.7 / 46.3 / 43.5 | 59.8 / 63.2 / 61.5 | 66.3 / 72.5 / 69.4 |
| VISA-13B | ECCV'24 | 41.8 / 47.1 / 44.5 | 61.4 / 64.7 / 63.0 | 67.0 / 73.8 / 70.4 |
| VideoLISA | NeurIPS'24 | 41.3 / 47.6 / 44.4 | 61.7 / 65.7 / 63.7 | 64.9 / 72.7 / 68.8 |
| VideoGLaMM | CVPR'25 | 42.1 / 48.2 / 45.2 | 65.4 / 68.2 / 66.8 | 65.6 / 73.3 / 69.5 |
| HyperSeg | CVPR'25 | -- | -- / -- / 68.5 | -- / -- / 71.2 |
| VRS-HQ-7B | CVPR'25 | 47.6 / 53.7 / 50.6 | 68.3 / 72.5 / 70.4 | 72.6 / 79.4 / 76.0 |
| VRS-HQ-13B | CVPR'25 | 48.0 / 53.7 / 50.9 | 69.0 / 73.1 / 71.0 | 71.0 / 77.9 / 74.4 |
| InstructSeg | ICCV'25 | -- | 65.4 / 69.5 / 67.5 | 67.3 / 74.9 / 71.1 |
| ViLLa-6B | ICCV'25 | 46.5 / 52.3 / 49.4 | 64.6 / 70.4 / 67.5 | 70.6 / 78.0 / 74.3 |
| **Ours (VIRST)** | -- | 60.4 / 65.4 / 62.9 | 72.2 / 76.1 / 74.2 | 75.9 / 83.1 / 79.5 |

BibTeX

@inproceedings{virst2026,
  author    = {Jihwan Hong and Jaeyoung Do},
  title     = {VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation},
  booktitle = {CVPR},
  year      = {2026},
}