arXiv:2606.19120, 2026

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Sihan Wang (1,2), Xiyao Liu (1), Lianqing Liu (1), Zhi Han (1)
(1) State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences
(2) University of Chinese Academy of Sciences
Shortcut risk in vanilla OPSD

Vanilla multimodal OPSD can let a privileged answer shape dense token supervision before the student has made visual evidence explicit.

Abstract

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy of that model to provide dense token-level supervision conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly through text rather than through the image.
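To make "dense token-level targets" concrete, here is a minimal sketch of a per-token distillation loss on a student rollout. The divergence direction (KL from teacher to student), the averaging over tokens, and the toy logits are assumptions for illustration; this page does not specify the exact objective.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at one token position (assumed direction)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def dense_distill_loss(teacher_seq, student_seq):
    """Mean per-token KL over a rollout sampled from the student.

    teacher_seq[t] holds logits from the frozen copy, which also sees the
    reference target; student_seq[t] holds logits from the student, which
    does not. Both are scored on the same student-generated tokens.
    """
    per_token = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token), per_token

# Toy rollout: 3 token positions, vocabulary of 4 (values are made up).
teacher_seq = [[2.0, 0.1, 0.1, 0.1], [0.0, 3.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
student_seq = [[2.0, 0.1, 0.1, 0.1], [1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
loss, per_token = dense_distill_loss(teacher_seq, student_seq)
```

Only the second position contributes to the loss here, because that is the only token where the privileged teacher and the student disagree. Every position receives a signal of this kind, which is what makes the supervision dense.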

ViGOS is a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format.

Method

ViGOS pipeline

Stage 1

Structured on-policy rollout

The student sees only the image and question, then writes a description, reasoning trace, and boxed answer.

Stage 2

Segment parsing

The rollout is parsed into description, reasoning, answer, and invalid-output masks.
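The parsing step can be sketched as follows. The tag names `<description>` and `<reasoning>` are hypothetical, chosen only for illustration; the page states that the rollout contains a description, a reasoning trace, and a boxed answer, but not the exact template. Labels are per character for simplicity, whereas a real pipeline would presumably map them to token positions.

```python
import re

# Hypothetical rollout template; the tag names are assumptions.
PATTERN = re.compile(
    r"<description>(?P<description>.*?)</description>\s*"
    r"<reasoning>(?P<reasoning>.*?)</reasoning>\s*"
    r"(?P<answer>\\boxed\{[^{}]*\})\s*",
    re.DOTALL,
)

def parse_rollout(text):
    """Return one segment label per character of the rollout.

    Valid rollouts get 'description', 'reasoning', 'answer', or 'other'
    (tags and whitespace). A rollout that does not match the template is
    labeled 'invalid' at every position.
    """
    match = PATTERN.fullmatch(text)
    if match is None:
        return ["invalid"] * len(text)
    labels = ["other"] * len(text)
    for name in ("description", "reasoning", "answer"):
        start, end = match.span(name)
        labels[start:end] = [name] * (end - start)
    return labels

def mask(labels, name):
    """Binary mask selecting positions that carry the given label."""
    return [1 if label == name else 0 for label in labels]

good = (
    "<description>A statue and a tower.</description>\n"
    "<reasoning>The tower is in Paris.</reasoning>\n"
    "\\boxed{Paris}"
)
bad = "The answer is Paris."
good_labels = parse_rollout(good)
bad_labels = parse_rollout(bad)
```

The four masks come from calling `mask` with each label name. The invalid-output mask is all ones exactly when the template fails to match.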

Stage 3

Segment-wise teachers

Image-only supervision grounds description tokens; privileged reasoning supervision is applied after the visual prefix.
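A minimal sketch of the segment-wise routing follows, assuming each position already carries a segment label and that per-token losses against each teacher have been computed on the same student rollout. The teacher names, the equal weighting of positions, and the toy loss values are assumptions for illustration, not the authors' implementation.

```python
def route_teachers(labels):
    """Pick one teacher per position from its segment label."""
    routing = {
        "description": "perception",  # image-only teacher
        "reasoning": "reasoning",     # privileged reasoning teacher
        "answer": "reasoning",
        "invalid": "reference",       # used only to recover the output format
    }
    # Positions with any other label (for example tags) get no teacher.
    return [routing.get(label) for label in labels]

def segmentwise_loss(labels, per_teacher_loss):
    """Average the per-token loss taken from whichever teacher is routed.

    per_teacher_loss maps a teacher name to a list of per-token losses of
    the student against that teacher. Positions routed to None are skipped.
    """
    picked = [
        per_teacher_loss[teacher][i]
        for i, teacher in enumerate(route_teachers(labels))
        if teacher is not None
    ]
    return sum(picked) / len(picked) if picked else 0.0

# Toy example: two description tokens, one reasoning token, one answer token.
labels = ["description", "description", "reasoning", "answer"]
per_teacher_loss = {
    "perception": [0.5, 0.25, 9.0, 9.0],
    "reasoning": [9.0, 9.0, 1.0, 0.25],
    "reference": [9.0, 9.0, 9.0, 9.0],
}
valid_loss = segmentwise_loss(labels, per_teacher_loss)
invalid_loss = segmentwise_loss(["invalid"] * 4, per_teacher_loss)
```

In this sketch the privileged teacher never touches description tokens, so the reference answer cannot shape how the student describes the image. It only acts once the student's own visual prefix is in place.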

Results

ViGOS keeps the main benchmark gains of OPSD while improving image-grounded behavior in prior-sensitive settings. Main benchmark scores are Pass@5 / Avg@5.

ViGOS 3B, mean Pass@5: 71.97 (+11.11 over baseline)
ViGOS 3B, mean Avg@5: 41.35 (+13.44 over baseline)
ViGOS 7B, mean Pass@5: 75.60 (+7.47 over baseline)
ViGOS 7B, mean Avg@5: 50.99 (+5.61 over baseline)

Eight Main Benchmarks

Model        MM-Vet         MMMU           MMMU-Pro       MathVerse      MathVista      MMSI           RealWorldQA    CV-Bench
Baseline 3B  62.39 / 34.68  71.51 / 33.54  55.00 / 22.55  60.61 / 30.18  65.40 / 35.08  53.20 / 16.78  38.17 / 15.76  80.59 / 34.68
OPSD 3B      68.81 / 45.69  76.42 / 42.70  57.04 / 26.24  59.54 / 30.45  72.80 / 43.84  63.60 / 23.68  86.93 / 53.02  91.47 / 63.43
ViGOS 3B     65.60 / 43.76  76.42 / 42.32  56.44 / 26.16  58.55 / 30.10  74.00 / 43.50  66.40 / 24.90  86.80 / 55.37  91.51 / 64.67
Baseline 7B  69.72 / 52.94  77.77 / 50.30  63.85 / 37.41  68.40 / 45.63  79.20 / 60.90  63.20 / 27.10  32.55 / 12.94  90.37 / 75.85
OPSD 7B      70.18 / 52.75  77.99 / 50.99  63.37 / 36.91  68.65 / 45.57  80.50 / 61.54  60.30 / 25.68  85.62 / 61.20  90.26 / 74.32
ViGOS 7B     72.02 / 54.40  80.11 / 51.42  64.81 / 36.48  68.91 / 44.77  80.90 / 58.78  61.10 / 25.58  85.88 / 62.88  91.09 / 73.58
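The summary means can be rechecked from the per-benchmark scores. The sketch below assumes an unweighted mean over the eight benchmarks, half-up rounding to two decimals, and gains taken as differences of the rounded means; these conventions are assumptions, since the page does not state them. `Decimal` is used to avoid binary floating-point rounding surprises.

```python
from decimal import Decimal, ROUND_HALF_UP

# "Pass@5/Avg@5" cells copied from the table, in column order.
ROWS = {
    "Baseline 3B": "62.39/34.68 71.51/33.54 55.00/22.55 60.61/30.18 "
                   "65.40/35.08 53.20/16.78 38.17/15.76 80.59/34.68",
    "ViGOS 3B": "65.60/43.76 76.42/42.32 56.44/26.16 58.55/30.10 "
                "74.00/43.50 66.40/24.90 86.80/55.37 91.51/64.67",
    "Baseline 7B": "69.72/52.94 77.77/50.30 63.85/37.41 68.40/45.63 "
                   "79.20/60.90 63.20/27.10 32.55/12.94 90.37/75.85",
    "ViGOS 7B": "72.02/54.40 80.11/51.42 64.81/36.48 68.91/44.77 "
                "80.90/58.78 61.10/25.58 85.88/62.88 91.09/73.58",
}

def mean_scores(row):
    """Return (mean Pass@5, mean Avg@5), rounded half-up to two decimals."""
    cells = [cell.split("/") for cell in row.split()]
    cent = Decimal("0.01")
    means = []
    for column in (0, 1):
        total = sum(Decimal(cell[column]) for cell in cells)
        mean = total / len(cells)
        means.append(mean.quantize(cent, rounding=ROUND_HALF_UP))
    return tuple(means)

for size in ("3B", "7B"):
    base = mean_scores(ROWS[f"Baseline {size}"])
    ours = mean_scores(ROWS[f"ViGOS {size}"])
    print(size, "Pass@5", ours[0], "gain", ours[0] - base[0])
    print(size, "Avg@5", ours[1], "gain", ours[1] - base[1])
```

Under these assumptions the computed means and gains agree with the reported summary figures for both model sizes.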

ViLP Prior-Sensitive Evaluation

Model        ViLP-F Score  ViLP-F Prior  ViLP-P Score  ViLP-P Prior
Baseline 3B  59.50         93.33         55.67         80.67
OPSD 3B      67.17         97.33         66.83         92.00
ViGOS 3B     70.17         97.67         69.50         90.00
Baseline 7B  42.00         73.33         37.00         58.67
OPSD 7B      58.00         97.67         57.00         91.67
ViGOS 7B     62.67         97.00         61.67         91.67

Qualitative Results

In these ViLP cases, the image conflicts with a common prior or a fact stated in the prompt. OPSD often notices the visual cue but returns to the prior; ViGOS grounds the final answer in the image.

Landmark location prior

Statue of Liberty in a Paris scene

Ground truth: Paris
[Image: Statue of Liberty with the Eiffel Tower in the background]
Question

Please answer with one word: The Statue of Liberty is in New York City. Based on the image, which city is the Statue of Liberty in? Please respond with one word or abbreviation, such as NYC.

OPSD

Incorrect
Reasoning

The image shows the Statue of Liberty, a well-known landmark, but in an unexpected setting with the Eiffel Tower in the background, which is located in Paris, France. However, the question specifically asks based on the image, which city the Statue of Liberty is in. Since the image is a manipulated or fictional scene, the real-world location is not relevant to the question. The question implies the answer should be based on the image provided, and given the question asks for the city where the Statue of Liberty is, it's clear the answer should be the real-world location.

Answer: NYC

ViGOS

Correct
Description

The image shows the Statue of Liberty, a large green statue holding a torch aloft and a tablet in the other hand. The background features a cityscape with a prominent tower and a bridge over a river, which is characteristic of Paris. The Eiffel Tower is clearly visible, confirming the location is Paris.

Reasoning

The image shows the Statue of Liberty, but the background clearly indicates Paris with the Eiffel Tower. The question asks for the city where the Statue of Liberty is located. Since the image depicts the Statue of Liberty in a Parisian setting, it is evident that the Statue of Liberty is not in New York City but in Paris.

Answer: Paris

BibTeX

@misc{wang2026seeing,
  title={Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation},
  author={Wang, Sihan and Liu, Xiyao and Liu, Lianqing and Han, Zhi},
  year={2026},
  eprint={2606.19120},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2606.19120}
}