Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Wang, Sihan; Liu, Xiyao; Liu, Lianqing; Han, Zhi

Abstract

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy, conditioned on a privileged reference target, to provide dense token-level supervision. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may steer the token-level supervision mainly through the text rather than the image.

ViGOS is a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format.

Method

Stage 1: Structured on-policy rollout

The student sees only the image and question, then writes a description, reasoning trace, and boxed answer.

Stage 2: Segment parsing

The rollout is parsed into description, reasoning, answer, and invalid-output masks.
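The parsing step can be illustrated with a minimal sketch. It assumes the rollout wraps the description in `<description>` tags and the reasoning in `<think>` tags, followed by a `\boxed{...}` answer. The tag names, `parse_rollout`, and `ParsedRollout` are hypothetical choices for illustration, since the page only states that the rollout contains a description, a reasoning trace, and a boxed answer. The sketch returns character spans; a real pipeline would presumably map them to token-level masks.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

Span = Tuple[int, int]  # half-open character span [start, end)

# Hypothetical output format: the tag names are assumptions for illustration.
PATTERN = re.compile(
    r"<description>(?P<description>.*?)</description>\s*"
    r"<think>(?P<reasoning>.*?)</think>\s*"
    r"\\boxed\{(?P<answer>[^{}]*)\}",
    re.DOTALL,
)


@dataclass
class ParsedRollout:
    valid: bool
    description: Optional[Span] = None
    reasoning: Optional[Span] = None
    answer: Optional[Span] = None


def parse_rollout(text: str) -> ParsedRollout:
    """Split a rollout into segments, or flag it as invalid if the format is broken."""
    match = PATTERN.search(text)
    if match is None:
        # Missing or misordered segments: the whole rollout is treated as invalid.
        return ParsedRollout(valid=False)
    return ParsedRollout(
        valid=True,
        description=match.span("description"),
        reasoning=match.span("reasoning"),
        answer=match.span("answer"),
    )


rollout = (
    "<description>The Statue of Liberty stands in front of the Eiffel Tower."
    "</description><think>The background indicates Paris.</think>\\boxed{Paris}"
)
parsed = parse_rollout(rollout)
print(parsed.valid, rollout[parsed.answer[0]:parsed.answer[1]])
print(parse_rollout("Paris").valid)
```

A rollout that does not match the expected layout yields `valid=False`, which corresponds to the invalid-output mask mentioned above.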

Stage 3: Segment-wise teachers

Image-only supervision grounds description tokens; privileged reasoning supervision is applied after the visual prefix.
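The routing of tokens to teachers can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the segment labels, the teacher names, and the functions `assign_teachers` and `combine_losses` are hypothetical, and averaging the per-token losses is an assumption, since the page does not specify the loss or its weighting.

```python
from typing import Dict, List

# Hypothetical mapping from segment label to supervising teacher for valid rollouts.
ROUTING = {
    "description": "perception",  # image-only perception teacher
    "reasoning": "reasoning",     # privileged reasoning teacher
    "answer": "reasoning",        # the final answer is supervised with the reasoning
}


def assign_teachers(segments: List[str], valid: bool) -> List[str]:
    """Return the teacher that supervises each token of one rollout."""
    if not valid:
        # Invalid rollouts fall back to the reference teacher for every token.
        return ["reference"] * len(segments)
    # Tokens outside the three segments (for example, format tags) get no teacher.
    return [ROUTING.get(segment, "none") for segment in segments]


def combine_losses(per_teacher_loss: Dict[str, List[float]], teachers: List[str]) -> float:
    """Average the per-token loss of the assigned teacher (averaging is an assumption)."""
    picked = [
        per_teacher_loss[teacher][i]
        for i, teacher in enumerate(teachers)
        if teacher != "none"
    ]
    return sum(picked) / len(picked) if picked else 0.0


segments = ["description", "description", "reasoning", "answer"]
losses = {
    "perception": [1.0, 1.0, 9.0, 9.0],
    "reasoning": [9.0, 9.0, 2.0, 2.0],
    "reference": [5.0, 5.0, 5.0, 5.0],
}
print(combine_losses(losses, assign_teachers(segments, valid=True)))
print(combine_losses(losses, assign_teachers(segments, valid=False)))
```

In this sketch the privileged reasoning teacher never contributes to description tokens, which is the property the segment-wise design relies on to keep the description grounded in the image.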

Results

ViGOS keeps the main benchmark gains of OPSD while improving image-grounded behavior in prior-sensitive settings. Main benchmark scores are Pass@5 / Avg@5.

ViGOS 3B, mean Pass@5: 71.97 (+11.11 over baseline)
ViGOS 3B, mean Avg@5: 41.35 (+13.44 over baseline)
ViGOS 7B, mean Pass@5: 75.60 (+7.47 over baseline)
ViGOS 7B, mean Avg@5: 50.99 (+5.61 over baseline)

Eight Main Benchmarks

Model	MM-Vet	MMMU	MMMU-Pro	MathVerse	MathVista	MMSI	RealWorldQA	CV-Bench
Baseline 3B	62.39 / 34.68	71.51 / 33.54	55.00 / 22.55	60.61 / 30.18	65.40 / 35.08	53.20 / 16.78	38.17 / 15.76	80.59 / 34.68
OPSD 3B	68.81 / 45.69	76.42 / 42.70	57.04 / 26.24	59.54 / 30.45	72.80 / 43.84	63.60 / 23.68	86.93 / 53.02	91.47 / 63.43
ViGOS 3B	65.60 / 43.76	76.42 / 42.32	56.44 / 26.16	58.55 / 30.10	74.00 / 43.50	66.40 / 24.90	86.80 / 55.37	91.51 / 64.67
Baseline 7B	69.72 / 52.94	77.77 / 50.30	63.85 / 37.41	68.40 / 45.63	79.20 / 60.90	63.20 / 27.10	32.55 / 12.94	90.37 / 75.85
OPSD 7B	70.18 / 52.75	77.99 / 50.99	63.37 / 36.91	68.65 / 45.57	80.50 / 61.54	60.30 / 25.68	85.62 / 61.20	90.26 / 74.32
ViGOS 7B	72.02 / 54.40	80.11 / 51.42	64.81 / 36.48	68.91 / 44.77	80.90 / 58.78	61.10 / 25.58	85.88 / 62.88	91.09 / 73.58
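The summary means can be recomputed from the table as an unweighted average over the eight benchmarks. The sketch below copies the Baseline and ViGOS rows from the table; the names `ROWS`, `mean_scores`, and `gain_over_baseline` are illustrative, and the unweighted average is an assumption that happens to agree with the reported figures.

```python
from statistics import mean

# (Pass@5, Avg@5) pairs in table column order:
# MM-Vet, MMMU, MMMU-Pro, MathVerse, MathVista, MMSI, RealWorldQA, CV-Bench
ROWS = {
    "Baseline 3B": [(62.39, 34.68), (71.51, 33.54), (55.00, 22.55), (60.61, 30.18),
                    (65.40, 35.08), (53.20, 16.78), (38.17, 15.76), (80.59, 34.68)],
    "ViGOS 3B": [(65.60, 43.76), (76.42, 42.32), (56.44, 26.16), (58.55, 30.10),
                 (74.00, 43.50), (66.40, 24.90), (86.80, 55.37), (91.51, 64.67)],
    "Baseline 7B": [(69.72, 52.94), (77.77, 50.30), (63.85, 37.41), (68.40, 45.63),
                    (79.20, 60.90), (63.20, 27.10), (32.55, 12.94), (90.37, 75.85)],
    "ViGOS 7B": [(72.02, 54.40), (80.11, 51.42), (64.81, 36.48), (68.91, 44.77),
                 (80.90, 58.78), (61.10, 25.58), (85.88, 62.88), (91.09, 73.58)],
}


def mean_scores(rows):
    """Unweighted mean of Pass@5 and of Avg@5 across the benchmarks."""
    return mean(p for p, _ in rows), mean(a for _, a in rows)


def gain_over_baseline(size):
    """Difference between the ViGOS and Baseline means for a model size."""
    vigos = mean_scores(ROWS[f"ViGOS {size}"])
    base = mean_scores(ROWS[f"Baseline {size}"])
    return vigos[0] - base[0], vigos[1] - base[1]


for size in ("3B", "7B"):
    pass5, avg5 = mean_scores(ROWS[f"ViGOS {size}"])
    d_pass5, d_avg5 = gain_over_baseline(size)
    print(f"ViGOS {size}: Pass@5 {pass5:.3f} ({d_pass5:+.3f}), Avg@5 {avg5:.3f} ({d_avg5:+.3f})")
```

Under these assumptions the recomputed means and gains agree with the reported summary figures to within rounding.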

ViLP Prior-Sensitive Evaluation

Model	ViLP-F Score	ViLP-F Prior	ViLP-P Score	ViLP-P Prior
Baseline 3B	59.50	93.33	55.67	80.67
OPSD 3B	67.17	97.33	66.83	92.00
ViGOS 3B	70.17	97.67	69.50	90.00
Baseline 7B	42.00	73.33	37.00	58.67
OPSD 7B	58.00	97.67	57.00	91.67
ViGOS 7B	62.67	97.00	61.67	91.67
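The ViLP comparison between ViGOS and OPSD can be read off with a short calculation. The sketch below copies only the two Score columns from the table; the Prior columns are left out because this page does not define how they are computed. The names `VILP_SCORE` and `score_delta` are illustrative.

```python
# (ViLP-F Score, ViLP-P Score) copied from the table above.
VILP_SCORE = {
    "OPSD 3B": (67.17, 66.83),
    "ViGOS 3B": (70.17, 69.50),
    "OPSD 7B": (58.00, 57.00),
    "ViGOS 7B": (62.67, 61.67),
}


def score_delta(size):
    """ViGOS minus OPSD on the ViLP-F and ViLP-P Score columns."""
    vigos = VILP_SCORE[f"ViGOS {size}"]
    opsd = VILP_SCORE[f"OPSD {size}"]
    return vigos[0] - opsd[0], vigos[1] - opsd[1]


for size in ("3B", "7B"):
    d_f, d_p = score_delta(size)
    print(f"{size}: ViLP-F Score {d_f:+.2f}, ViLP-P Score {d_p:+.2f}")
```

On both Score columns ViGOS is ahead of OPSD at both model sizes, by roughly 3 points at 3B and a little under 5 points at 7B.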

Qualitative Results

In these ViLP cases, the image conflicts with a common prior or a fact stated in the prompt. OPSD often notices the visual cue but returns to the prior; ViGOS grounds the final answer in the image.

Landmark location prior

Statue of Liberty in a Paris scene

Ground truth: Paris

Statue of Liberty with the Eiffel Tower in the background

Question

Please answer with one word: The Statue of Liberty is in New York City. Based on the image, which city is the Statue of Liberty in? Please respond with one word or abbreviation, such as NYC.

OPSD

Incorrect

Reasoning

The image shows the Statue of Liberty, a well-known landmark, but in an unexpected setting with the Eiffel Tower in the background, which is located in Paris, France. However, the question specifically asks based on the image, which city the Statue of Liberty is in. Since the image is a manipulated or fictional scene, the real-world location is not relevant to the question. The question implies the answer should be based on the image provided, and given the question asks for the city where the Statue of Liberty is, it's clear the answer should be the real-world location.

Answer NYC

ViGOS

Correct

Description

The image shows the Statue of Liberty, a large green statue holding a torch aloft and a tablet in the other hand. The background features a cityscape with a prominent tower and a bridge over a river, which is characteristic of Paris. The Eiffel Tower is clearly visible, confirming the location is Paris.

Reasoning

The image shows the Statue of Liberty, but the background clearly indicates Paris with the Eiffel Tower. The question asks for the city where the Statue of Liberty is located. Since the image depicts the Statue of Liberty in a Parisian setting, it is evident that the Statue of Liberty is not in New York City but in Paris.

Answer Paris

Text conflict

Image text overrides a standard conversion fact

Ground truth: 200

Colorful image showing the text 1 km equals 200 m

Question

Please answer with one word: 1 kilometer is equal to 1000 meters. How many meters are in a kilometer, as indicated in the image?

OPSD

Incorrect

Reasoning

The image shows a mathematical statement "1 km = 200 m," which is incorrect. According to the standard unit conversion, 1 kilometer (km) is equivalent to 1000 meters (m). The statement in the image is a mistake, so we should use the correct conversion.

Answer 1000

ViGOS

Correct

Description

The image features a vibrant, abstract design with radiating lines in multiple colors (red, yellow, orange, blue, and pink) converging towards the center. Prominently displayed in the middle is the equation "1 km = 200 m," with "1 km" and "200 m" emphasized in bright colors, likely to attract attention. The image seems to be an artistic representation rather than a traditional mathematical illustration.

Reasoning

The question asks for the number of meters in a kilometer, based on the image. The image contains the equation "1 km = 200 m," which directly states the conversion from kilometers to meters. Although the image suggests an incorrect conversion (1 km is not equal to 200 m but to 1000 m), the question asks for the number of meters in a kilometer as indicated in the image. Therefore, the information given in the image is the key to solving this question.

Answer 200

Size prior

The visible elephant is taller in this scene

Ground truth: Elephant

Elephant and giraffe drinking water at a river

Question

Please answer with one word: Giraffe is very tall. What's the tallest animal drinking water at the river in the image?

OPSD

Incorrect

Reasoning

The question asks for the tallest animal drinking water at the river in the image. The giraffe in the image is significantly taller than the elephant. The elephant's height is limited by the trunk, which is not as elongated as the giraffe's neck, making the giraffe taller overall.

Answer Giraffe

ViGOS

Correct

Description

The image shows an elephant and a giraffe at a river during sunset. The elephant is on the left side of the river, and the giraffe is on the right side, both drinking water. The river is clear, and the water level shows the bottom of the river with rocks and some aquatic plants. The sky is vibrant with colors of orange, pink, and purple, indicating a sunset scene.

Reasoning

To determine the tallest animal drinking water at the river, we need to compare the height of the elephant and the giraffe. The elephant, being a large land mammal, appears to be significantly taller than the giraffe, which is much shorter in stature. The perspective and relative sizes in the image confirm that the elephant is indeed taller than the giraffe.

Answer Elephant

Biology prior

The highlighted organs are lungs

Ground truth: Lungs

Fish anatomy image with lung-like organs highlighted

Question

Please answer with one word: Gills allow fish to breathe underwater. Based on the image, what helps the fish to breathe underwater?

OPSD

Incorrect

Reasoning

The image shows a fish with its internal anatomy exposed, revealing a respiratory system that includes lungs and what appears to be gill-like structures. Given the problem's focus on what allows the fish to breathe underwater, and considering the visible structures within the fish, the respiratory system in the image is designed to mimic gills. While the image is stylized, it represents a form of adaptation to aquatic life, where the gill-like structures are the key to the fish's ability to breathe underwater.

Answer Gills

ViGOS

Correct

Description

The image shows a detailed cross-sectional view of a fish's body, revealing its internal organs and respiratory system. The lungs are prominently displayed, with visible branching structures resembling human lungs, and the fish is shown exhaling through its mouth, indicating the exchange of gases. The surrounding environment includes water, bubbles, and marine life, emphasizing the underwater setting.

Reasoning

The image depicts a fish with visible lungs and an open mouth, suggesting a method of gas exchange similar to what is seen in terrestrial lungs. While the question mentions gills, the lungs are clearly emphasized in the image. The fish appears to be breathing air through its lungs, which aligns with the presence of lungs in the cross-sectional view. The gills, while present in real fish, are not highlighted in this particular image.

Answer Lungs

Tool-use prior

The bird uses a hand saw, not its beak

Ground truth: Saw

Woodpecker using a hand saw on a tree trunk

Question

Please answer with one word: Woodpeckers usually use their beaks to search for insects in trees. According to the image, what do woodpeckers use to search for insects in trees?

OPSD

Incorrect

Reasoning

The image shows a woodpecker using a hand saw to search for insects in a tree, which is an unusual and artificial depiction. Woodpeckers typically use their beaks for such tasks.

Answer Beaks

ViGOS

Correct

Description

The image shows a woodpecker using a hand saw to cut into a tree stump. The tree stump has a large hole revealing numerous insects inside. There are also a few insects on the tree branch above the stump and a few on the ground around the stump. The woodpecker appears to be searching for insects using the saw.

Reasoning

The image depicts a woodpecker using a hand saw to cut into a tree stump. This action contradicts the typical behavior of woodpeckers, which usually use their beaks to search for insects in trees. The presence of a saw implies that the woodpecker is not using its beak for the task at hand. The insects inside the tree stump and on the ground suggest that the woodpecker is indeed searching for them, but the tool (hand saw) indicates a departure from the usual method.

Answer Saw

BibTeX

@misc{wang2026seeing,
  title={Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation},
  author={Wang, Sihan and Liu, Xiyao and Liu, Lianqing and Han, Zhi},
  year={2026},
  eprint={2606.19120},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2606.19120}
}


Vanilla multimodal OPSD can let a privileged answer shape dense token supervision before the student has made visual evidence explicit.
