VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation

CVPR 2026

Seoul National University
† Corresponding author

Abstract

We propose VIRST (Video-Instructed Reasoning Assistant for Spatio-Temporal Segmentation), an end-to-end framework that unifies global video reasoning and pixel-level mask prediction for Referring Video Object Segmentation (RVOS). VIRST bridges semantic reasoning and segmentation through Spatio-Temporal Fusion (STF), which injects segmentation-aware video features into a vision-language backbone. It further introduces a Temporal Dynamic Anchor Updater (TDAU) to maintain temporally adjacent anchor frames, providing stable temporal cues under large motion, occlusion, and object reappearance. VIRST achieves state-of-the-art results across diverse RVOS benchmarks under realistic and challenging conditions, including 68.4 overall J&F on ReVOS and 62.9 J&F on MeViS.

Method

VIRST unifies video-level reasoning and pixel-level mask prediction in a single end-to-end RVOS framework. Instead of relying on a fixed keyframe or an external propagation pipeline, VIRST lets the vision-language model reason over the video while segmentation-aware features provide the spatial detail needed for accurate masks.

Spatio-Temporal Fusion

STF bridges semantic video tokens and dense segmentation features. Learnable [ST] tokens are fused with segmentation-aware video features before and after VLM reasoning, producing frame-specific prompts for the mask decoder.
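
The text above does not say which operator performs the fusion, so the following is only a minimal sketch under one assumption: that an [ST] token embedding attends over each frame's segmentation-aware features with single-head dot-product attention, yielding one prompt vector per frame. The names `cross_attend` and `frame_prompts`, the residual addition, and the toy values are all hypothetical and are not taken from VIRST.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, features):
    """Single-head dot-product attention of one query over feature vectors.

    Returns the attention-weighted average of `features`.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * feat[d] for w, feat in zip(weights, features)) for d in range(dim)]

def frame_prompts(st_token, video_features):
    """Hypothetical fusion step: one prompt vector per frame.

    st_token:       embedding of one [ST] token (list of floats)
    video_features: per-frame lists of segmentation-aware feature vectors
    The residual add of the token is an illustrative choice, not from the source.
    """
    prompts = []
    for feats in video_features:
        attended = cross_attend(st_token, feats)
        prompts.append([t + a for t, a in zip(st_token, attended)])
    return prompts

# Two frames, two spatial locations each, feature dimension 2 (toy values).
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
prompts = frame_prompts([1.0, 0.0], video)
```

Because the same token is reused for every frame while the attended features differ, each frame receives its own prompt, which is the property the "frame-specific prompts" description relies on.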

Dynamic Anchors

TDAU maintains multiple temporally nearby anchor frames and updates them as the video progresses. This gives the decoder stable cues under large motion, occlusion, and object reappearance.
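
The anchor-update idea can be illustrated with a small sliding buffer. This is a sketch, not the TDAU selection rule, which is not specified here: it assumes that a frame qualifies as an anchor when its predicted mask is non-empty, and that the oldest anchor is evicted first. The class name `AnchorBuffer` and the parameters `max_anchors` and `min_mask_area` are hypothetical.

```python
from collections import deque

class AnchorBuffer:
    """Sliding window of the most recent frames judged usable as anchors."""

    def __init__(self, max_anchors=3, min_mask_area=1):
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self.anchors = deque(maxlen=max_anchors)
        self.min_mask_area = min_mask_area

    def update(self, frame_idx, mask):
        """Add the frame as an anchor if its predicted mask is non-trivial.

        mask: 2D list of 0/1 values. Frames where the object is absent
        (for example, occluded) are skipped, so earlier anchors persist
        until the object reappears.
        """
        area = sum(sum(row) for row in mask)
        if area >= self.min_mask_area:
            self.anchors.append((frame_idx, mask))

    def indices(self):
        """Frame indices of the current anchors, oldest first."""
        return [idx for idx, _ in self.anchors]

buffer = AnchorBuffer(max_anchors=2)
visible = [[0, 1], [1, 1]]
hidden = [[0, 0], [0, 0]]
for t, mask in enumerate([visible, visible, hidden, hidden, visible]):
    buffer.update(t, mask)

print(buffer.indices())  # [1, 4]
```

In this toy run the object disappears at frames 2 and 3, so the buffer keeps frame 1 as a cue and then pairs it with frame 4 once the object is visible again.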

Progressive Training

Training gradually aligns language reasoning, spatial grounding, and temporal consistency so the unified model can handle both referring and reasoning-oriented video segmentation.

Figure: VIRST architecture and the two-stage spatio-temporal fusion module.
Figure: TDAU dynamically selects temporally local anchor frames to support robust mask prediction.

Results

Figure: VIRST benchmark results summary.
Table 1. Performance comparison on the ReVOS benchmark. Best results are in bold; second-best are underlined.
                          Referring           Reasoning           Overall
Model         Venue       J     F     J&F     J     F     J&F     J     F     J&F     R
Segmentation Expert
MTTR          ECCV'22     29.8  30.2  30.0    20.4  21.5  21.0    25.1  25.9  25.5    5.6
ReferFormer   CVPR'22     31.2  34.3  32.7    21.3  25.6  23.4    26.2  29.9  28.1    8.8
LMPM          ICCV'23     29.0  39.1  34.1    13.3  24.8  19.0    21.2  27.1  26.8    3.8
MLLM-based Segmentation Method
LISA-7B       CVPR'24     44.3  47.1  45.7    33.8  38.4  36.1    39.1  42.7  40.9    9.3
VISA-7B       ECCV'24     49.2  52.6  50.9    40.6  45.4  43.0    44.9  49.0  46.9    15.5
VISA-13B      ECCV'24     55.6  59.1  57.4    42.0  46.7  44.3    48.8  52.9  50.9    15.5
HyperSeg      CVPR'25     56.0  60.9  58.5    50.2  55.8  53.0    53.1  58.4  55.7    -
VRS-HQ-7B     CVPR'25     59.8  64.5  62.1    53.5  58.7  56.1    56.6  61.6  59.1    19.7
VRS-HQ-13B    CVPR'25     61.1  65.5  63.3    54.1  59.4  56.8    57.6  62.5  60.0    18.9
InstructSeg   ICCV'25     54.8  59.2  57.0    49.2  54.7  51.9    52.0  56.9  54.5    -
RGA3-7B       ICCV'25     58.7  62.3  60.5    53.1  57.7  55.4    55.9  60.0  58.0    28.6
ViLLa-6B      ICCV'25     -     -     -       -     -     -       54.9  59.1  57.0    -
Ours (VIRST)  CVPR'26     68.8  72.8  70.8    63.9  68.3  66.1    66.3  70.6  68.4    21.8
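
The J, F, and J&F columns are not defined on this page. Assuming they follow the usual video object segmentation convention (J is region similarity, i.e. mask intersection-over-union; F is contour accuracy; J&F is their mean), the sketch below shows how J and J&F are computed. The function names are hypothetical, contour accuracy F is taken as an input rather than implemented, and treating two empty masks as a perfect match is a convention chosen here for illustration.

```python
def region_similarity(pred, gt):
    """J: intersection-over-union of two binary masks (2D lists of 0/1)."""
    inter = sum(p & g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p | g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union

def j_and_f(j, f):
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2

pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
print(region_similarity(pred, gt))      # 0.5
print(round(j_and_f(68.8, 72.8), 1))    # 70.8
```

The second call reproduces the Referring J&F entry for VIRST in Table 1 from its J and F values.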

Table 2. Performance comparison with previous methods on the validation sets of RVOS datasets. The best results are shown in bold, and the second-best results are underlined.
                          MeViS               Ref-YT-VOS          Ref-DAVIS17
Model         Venue       J     F     J&F     J     F     J&F     J     F     J&F
Segmentation Expert
ReferFormer   CVPR'22     29.8  32.2  31.0    61.3  64.6  62.9    58.1  64.1  61.1
OnlineRefer   ICCV'23     -     -     -       61.6  65.5  63.5    61.6  67.7  64.8
SAMWISE       CVPR'25     49.5  46.6  52.4    69.2  67.8  70.6    70.6  67.4  74.5
MPG-SAM2      ICCV'25     50.7  56.7  53.7    71.7  76.1  73.9    68.8  76.0  72.4
ReferDINO     ICCV'25     44.7  53.9  49.3    67.0  71.5  69.3    65.1  72.9  68.9
MLLM-based Segmentation Method
LISA-7B       CVPR'24     35.1  39.4  37.2    53.4  54.3  53.9    62.2  67.3  64.8
VISA-7B       ECCV'24     40.7  46.3  43.5    59.8  63.2  61.5    66.3  72.5  69.4
VISA-13B      ECCV'24     41.8  47.1  44.5    61.4  64.7  63.0    67.0  73.8  70.4
VideoLISA     NeurIPS'24  41.3  47.6  44.4    61.7  65.7  63.7    64.9  72.7  68.8
VideoGLaMM    CVPR'25     42.1  48.2  45.2    65.4  68.2  66.8    65.6  73.3  69.5
HyperSeg      CVPR'25     -     -     -       -     -     68.5    -     -     71.2
VRS-HQ-7B     CVPR'25     47.6  53.7  50.6    68.3  72.5  70.4    72.6  79.4  76.0
VRS-HQ-13B    CVPR'25     48.0  53.7  50.9    69.0  73.1  71.0    71.0  77.9  74.4
InstructSeg   ICCV'25     -     -     -       65.4  69.5  67.5    67.3  74.9  71.1
ViLLa-6B      ICCV'25     46.5  52.3  49.4    64.6  70.4  67.5    70.6  78.0  74.3
Ours (VIRST)  CVPR'26     60.4  65.4  62.9    72.2  76.1  74.2    75.9  83.1  79.5

Qualitative Examples

We show representative Referring Video Object Segmentation examples. Each row compares the original video with VIRST's predicted mask overlay for the corresponding language expression.

ReVOS

Expression: The aircraft that is most likely to be low on fuel.

Original
Masked

Expression: The wineglass in which the wine may be finished first.

Original
Masked

Expression: Grey parrot soaking its feathers in liquid.

Original
Masked

Expression: The panda(s) that remains seated without moving throughout.

Original
Masked

MeViS

Expression: The three little bears following the big bear.

Original
Masked

Expression: The big bear is leading three small bear cubs across the road.

Original
Masked

Expression: The one with a deeper shade of fur among the two dogs roughhousing and having fun.

Original
Masked

Expression: A bear standing tall and amusingly swinging a hula hoop with its neck.

Original
Masked

BibTeX

@inproceedings{hong2026virst,
  author    = {Jihwan Hong and Jaeyoung Do},
  title     = {VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
  year      = {2026},
}