
SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding

ICML 2025

Seoul National University

Architecture of SECOND

Abstract

Despite significant advancements in Vision-Language Models (VLMs), their performance remains hindered by object hallucination, a critical obstacle to accurate visual understanding. To address this issue, we propose SECOND: Selective and Contrastive Decoding, a novel approach that enables VLMs to effectively leverage multi-scale visual information in an object-centric manner, closely aligning with human visual perception.

SECOND progressively selects and integrates multi-scale visual information, facilitating a more precise interpretation of images. By iteratively contrasting this visual information, SECOND significantly reduces perceptual hallucination and improves performance across a wide range of benchmarks. Our theoretical analysis and experiments highlight the largely unexplored potential of multi-scale processing in VLMs, showing that prioritizing and contrasting across scales outperforms existing methods.

Motivation

Comparison of patch usage between baseline VLM and SECOND
Figure 1. Unlike uniform patch usage in baselines, SECOND selectively accumulates object-relevant patches while suppressing background noise.

VLMs often suffer from perceptual hallucination, mainly because they uniformly integrate multi-scale patches, mixing object signals with background noise. Inspired by human coarse-to-fine perception, SECOND tackles this by selectively keeping salient patches and enforcing contrastive consistency between coarse and fine stages.

Method

SECOND (Selective and Contrastive Decoding) is a training-free multi-stage framework designed to mitigate perceptual hallucinations in Vision-Language Models (VLMs). SECOND combines selective multi-scale feature integration with multi-stage contrastive decoding to progressively refine object-centric representations and suppress hallucinated outputs.

1. Selective Multi-Scale Feature Integration

SECOND constructs a multi-stage visual hierarchy by progressively expanding resolution from coarse to fine. At stage \(s\), the set of patches \( \mathcal{P}^{(s)} \) is selected based on an entropy-guided rule:

\( p_{\text{select}} = \frac{\exp(\lambda \cdot H(V)) - 1}{\exp(\lambda) - 1}, \)

where \(H(V)\) is the entropy of the visual attention distribution and \(\lambda\) is a scaling hyperparameter. The top \(p_{\text{select}}\) fraction of patches, ranked by attention score, is retained for the next stage, ensuring that object-relevant regions are progressively emphasized while background noise is suppressed.
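
A minimal PyTorch sketch of this selection rule is shown below. It is an illustration only: the function name select_patches, the normalization of \(H(V)\) to \([0, 1]\), and the default \(\lambda\) are our assumptions, not the released implementation.

import torch

def select_patches(attn: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Entropy-guided patch selection (illustrative sketch)."""
    probs = attn / attn.sum()                                         # attention scores -> distribution
    entropy = -(probs * (probs + 1e-12).log()).sum()                  # H(V)
    entropy = entropy / torch.log(torch.tensor(float(attn.numel())))  # assumed: normalize H(V) to [0, 1]
    # p_select = (exp(lambda * H(V)) - 1) / (exp(lambda) - 1)
    p_select = (torch.exp(lam * entropy) - 1) / (torch.exp(torch.tensor(lam)) - 1)
    k = max(1, int(p_select.item() * attn.numel()))                   # number of patches to keep
    return torch.topk(attn, k).indices                                # indices passed to the next stage

# Example: attention scores over a 24x24 patch grid (576 patches)
attn = torch.rand(576)
kept = select_patches(attn, lam=1.0)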

2. Multi-Stage Contrastive Decoding

Building on the hierarchy of outputs, SECOND introduces multi-stage contrastive decoding. Standard Contrastive Decoding contrasts an expert output with a single amateur:

\( \text{logit}_{\text{single}} = \text{logit}_{\text{expert}} + \alpha(\text{logit}_{\text{expert}} - \text{logit}_{\text{amateur}}). \)

SECOND generalizes this into a multi-stage setting, leveraging all intermediate “amateur” outputs. For a 4-stage setup:

$$ \text{logit}_{\text{SECOND}} = \text{logit}_{\text{expert}} + \alpha\big(\text{logit}_{\text{expert}}-\text{logit}_{\text{amateur3}}\big) + \beta\big(\text{logit}_{\text{amateur3}}-\text{logit}_{\text{amateur2}}\big) + \gamma\big(\text{logit}_{\text{amateur2}}-\text{logit}_{\text{amateur1}}\big). $$

This hierarchical contrast exploits the progressive refinement of patch selection, amplifying consistent object evidence while canceling out hallucinated signals from earlier stages.
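
The 4-stage combination above reduces to a weighted sum of stage-wise logits, as in the sketch below. The argument names and the default contrast weights are illustrative assumptions, not values from the paper.

import torch

def second_logits(expert: torch.Tensor, amateur3: torch.Tensor,
                  amateur2: torch.Tensor, amateur1: torch.Tensor,
                  alpha: float = 1.0, beta: float = 0.5, gamma: float = 0.25) -> torch.Tensor:
    """Hierarchical contrast of next-token logits across four stages (coarse -> fine)."""
    return (expert
            + alpha * (expert - amateur3)     # expert vs. finest amateur
            + beta  * (amateur3 - amateur2)   # adjacent intermediate stages
            + gamma * (amateur2 - amateur1))  # coarsest pair

# Each stage produces next-token logits over the same vocabulary
vocab = 32000
stages = [torch.randn(vocab) for _ in range(4)]  # amateur1, amateur2, amateur3, expert
logits = second_logits(stages[3], stages[2], stages[1], stages[0])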

SECOND pipeline: stage-wise patch selection (coarse→fine) and multi-stage contrastive decoding
Figure 2. The SECOND pipeline — entropy-guided patch selection across stages (coarse→fine), followed by hierarchical contrastive decoding of amateur and expert outputs.

Main Results

We report main results on the POPE hallucination benchmark, comparing SECOND with baselines and VCD across multiple VLM backbones (LLaVA-Next, LLaVA-OneVision, Yi-VL) and LLMs (Vicuna-7B, Mistral-7B, Qwen2-0.5B, Yi-6B). SECOND consistently outperforms prior methods, achieving 11 out of 12 wins, with substantial gains in recall, accuracy, and F1. These improvements demonstrate SECOND’s effectiveness in mitigating perceptual hallucination while preserving reasoning ability.

Model                        | LLM        | Method   | CD | MSCOCO (Recall / Acc. / F1) | OKVQA (Recall / Acc. / F1) | GQA (Recall / Acc. / F1)
LLaVA-Next (CLIP-336)        | Vicuna-7B  | baseline | –  | 78.8 / 87.7 / 86.5          | 86.7 / 89.1 / 88.8         | 84.8 / 86.6 / 86.3
LLaVA-Next (CLIP-336)        | Vicuna-7B  | VCD      | ✓  | 81.1 / 88.2 / 87.3          | 88.3 / 88.0 / 88.1         | 86.3 / 84.6 / 84.9
LLaVA-Next (CLIP-336)        | Vicuna-7B  | SECOND   | –  | 80.1 / 88.6 / 87.5          | 87.6 / 89.9 / 89.6         | 84.9 / 86.5 / 86.3
LLaVA-Next (CLIP-336)        | Vicuna-7B  | SECOND   | ✓  | 85.1 / 89.7 / 89.2          | 90.5 / 90.3 / 90.4         | 85.5 / 89.4 / 87.4
LLaVA-Next (CLIP-336)        | Mistral-7B | baseline | –  | 80.2 / 88.3 / 87.3          | 88.2 / 88.7 / 88.7         | 88.2 / 84.2 / 84.8
LLaVA-Next (CLIP-336)        | Mistral-7B | VCD      | ✓  | 80.8 / 87.4 / 86.6          | 88.0 / 88.2 / 88.3         | 88.0 / 84.5 / 85.1
LLaVA-Next (CLIP-336)        | Mistral-7B | SECOND   | –  | 79.5 / 88.1 / 86.9          | 86.8 / 88.4 / 88.2         | 87.2 / 84.8 / 85.2
LLaVA-Next (CLIP-336)        | Mistral-7B | SECOND   | ✓  | 84.8 / 89.3 / 88.8          | 92.5 / 89.9 / 90.7         | 92.1 / 85.3 / 87.5
LLaVA-OneVision (SigLIP-384) | Qwen2-0.5B | baseline | –  | 80.0 / 88.4 / 87.4          | 85.6 / 89.3 / 88.9         | 83.1 / 86.8 / 86.3
LLaVA-OneVision (SigLIP-384) | Qwen2-0.5B | VCD      | ✓  | 79.7 / 87.4 / 86.4          | 86.7 / 89.0 / 88.7         | 84.3 / 86.7 / 86.4
LLaVA-OneVision (SigLIP-384) | Qwen2-0.5B | SECOND   | –  | 78.1 / 87.6 / 86.3          | 83.8 / 88.7 / 88.1         | 82.2 / 87.4 / 86.7
LLaVA-OneVision (SigLIP-384) | Qwen2-0.5B | SECOND   | ✓  | 79.7 / 87.9 / 86.9          | 85.4 / 89.3 / 89.1         | 83.3 / 87.8 / 87.2
Yi-VL (CLIP-448)             | Yi-6B      | baseline | –  | 70.3 / 82.0 / 79.6          | 77.0 / 84.0 / 82.8         | 74.5 / 81.0 / 79.7
Yi-VL (CLIP-448)             | Yi-6B      | VCD      | ✓  | 73.0 / 80.1 / 78.6          | 79.1 / 82.1 / 81.6         | 78.2 / 79.9 / 79.5
Yi-VL (CLIP-448)             | Yi-6B      | SECOND   | –  | 73.5 / 83.5 / 82.0          | 80.1 / 85.6 / 84.8         | 76.6 / 82.5 / 81.4
Yi-VL (CLIP-448)             | Yi-6B      | SECOND   | ✓  | 83.4 / 84.5 / 84.3          | 87.7 / 86.3 / 86.5         | 83.3 / 82.8 / 82.9
A ✓ in the CD column marks rows decoded with contrastive decoding.
Higher is better for Recall / Accuracy / F1. SECOND achieved the best results in 11 of 12 cases.

Table 1. Results of POPE benchmark. SECOND consistently outperforms baselines and VCD across multiple backbones.

Beyond POPE. On general VQA benchmarks including VQAv2 (lite), MMStar, and MMBench (lite), SECOND (+CD) consistently achieves strong performance across diverse backbones and LLMs, further demonstrating its effectiveness beyond hallucination-specific evaluation.

Model                        | LLM        | Method   | CD | VQAv2 (lite) | MMStar | MMBench (lite)
LLaVA-Next (CLIP-336)        | Vicuna-7B  | baseline | –  | 76.4         | 37.3   | 75.8
LLaVA-Next (CLIP-336)        | Vicuna-7B  | VCD      | ✓  | 72.9         | 38.1   | 74.2
LLaVA-Next (CLIP-336)        | Vicuna-7B  | SECOND   | –  | 76.5         | 37.5   | 78.0
LLaVA-Next (CLIP-336)        | Vicuna-7B  | SECOND   | ✓  | 77.5         | 38.6   | 80.0
LLaVA-Next (CLIP-336)        | Mistral-7B | baseline | –  | 72.0         | 32.4   | 74.2
LLaVA-Next (CLIP-336)        | Mistral-7B | VCD      | ✓  | 70.1         | 34.0   | 71.2
LLaVA-Next (CLIP-336)        | Mistral-7B | SECOND   | –  | 73.6         | 34.1   | 71.2
LLaVA-Next (CLIP-336)        | Mistral-7B | SECOND   | ✓  | 74.5         | 36.2   | 70.5
LLaVA-OneVision (SigLIP-384) | Qwen2-0.5B | baseline | –  | 74.6         | 38.9   | 73.5
LLaVA-OneVision (SigLIP-384) | Qwen2-0.5B | VCD      | ✓  | 55.0         | 36.2   | 70.5
LLaVA-OneVision (SigLIP-384) | Qwen2-0.5B | SECOND   | –  | 73.6         | 39.6   | 72.7
LLaVA-OneVision (SigLIP-384) | Qwen2-0.5B | SECOND   | ✓  | 75.1         | 39.9   | 73.5
Yi-VL (CLIP-448)             | Yi-6B      | baseline | –  | 64.3         | 34.8   | 77.3
Yi-VL (CLIP-448)             | Yi-6B      | VCD      | ✓  | 61.9         | 34.4   | 79.5
Yi-VL (CLIP-448)             | Yi-6B      | SECOND   | –  | 63.6         | 37.4   | 82.6
Yi-VL (CLIP-448)             | Yi-6B      | SECOND   | ✓  | 65.3         | 39.8   | 84.8
Higher is better for all three benchmarks.

Table 2. Results on VQAv2(lite), MMStar, and MMBench(lite).

BibTeX

@inproceedings{park2025second,
  title     = {SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding},
  author    = {Park, Woohyeon and Kim, Woojin and Kim, Jaeik and Do, Jaeyoung},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning (ICML)},
  year      = {2025},
  series    = {Proceedings of Machine Learning Research},
  publisher = {PMLR}
}