MMPB: It's Time for Multi-Modal Personalization

NeurIPS 2025
Seoul National University (IPAI, ECE)
111 concepts · 10k+ image–query pairs · 15 tasks (3 × 5) · 23 VLMs evaluated

🔥 What’s New

  • [2025.09.19] 🎉🎉 MMPB is accepted by NeurIPS 2025!
  • [2025.05.14] 🚀 Hugging Face dataset and evaluation code are available!
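
For readers who want to inspect the data, the sketch below shows one way to pull the benchmark with the Hugging Face `datasets` library. The dataset ID, split, and field names are placeholders chosen for illustration, not the released identifiers; check the official repository for the exact ones.

```python
# Minimal sketch of loading MMPB from the Hugging Face Hub.
# NOTE: the dataset ID, split, and field names are assumptions for
# illustration only -- use the identifiers from the official release.
from datasets import load_dataset

dataset = load_dataset("snu-ipai/MMPB", split="test")  # hypothetical ID

for example in dataset.select(range(3)):
    # Assumed fields: concept identifier, task type, and the personalized query.
    print(example["concept_id"], example["task_type"], example["query"])
```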

TL;DR We build the first benchmark focused on personalized VLM reasoning, spanning 10k image–query pairs and 111 concepts, and deliver a systematic analysis across 23 models that exposes how safety policies, dialogue drift, and weak visual grounding jointly block real personalization.

Overview of MMPB and representative failures

Examples of personalized queries across task types and representative failure cases of recent VLMs.
(a) Some closed-source models give evasive responses. (b) Most VLMs fail to personalize.

Abstract

Visual personalization is essential for user-facing AI systems, from smart homes to healthcare, where models must adapt to human-centric concepts and individual preferences. Yet, despite their broad capabilities, recent Vision–Language Models (VLMs) remain underexplored in this regard. We present MMPB, the first large-scale benchmark for VLM personalization, comprising 10k image–query pairs and 111 concepts across four categories (humans, animals, objects, characters). The human category further includes preference-grounded queries to test alignment with user-specific needs. MMPB evaluates personalization across three key task types—Awareness, Appropriateness, and Coherency—under a three-stage protocol: concept injection, multi-turn dialogue, and personalized querying. Using 23 widely adopted VLMs (both open- and closed-source), we show that most models struggle: they under-personalize, fail to maintain consistency across dialogue, and rarely exploit visual cues effectively. By surfacing these systematic limitations—from refusal behaviors to long-context forgetting—MMPB provides a scalable and rigorous foundation for advancing truly personalized multi-modal AI.
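To make the three-stage protocol concrete, here is a hedged sketch of a single evaluation episode. `vlm_chat` stands in for any chat-style VLM client; the message schema, filler turns, and the substring-matching score are our assumptions, not the official implementation.

```python
# Hedged sketch of the three-stage MMPB protocol described above:
# (1) concept injection, (2) multi-turn dialogue, (3) personalized querying.
from typing import Callable, Dict, List

Message = Dict[str, object]  # e.g. {"role": ..., "content": ..., "images": [...]}

def run_episode(vlm_chat: Callable[[List[Message]], str],
                injection: Message,
                filler_turns: List[Message],
                personalized_query: Message,
                expected_answer: str) -> bool:
    """Return True if the model answers the personalized query correctly."""
    history: List[Message] = []

    # Stage 1: inject the concept (text description and/or reference images).
    history.append(injection)
    history.append({"role": "assistant", "content": vlm_chat(history)})

    # Stage 2: unrelated multi-turn dialogue that pushes the concept further
    # back in the context window (this is where "Turn 10" comes from).
    for turn in filler_turns:
        history.append(turn)
        history.append({"role": "assistant", "content": vlm_chat(history)})

    # Stage 3: ask the personalized question and score the final answer
    # (simple substring match here; the benchmark's scoring may differ).
    history.append(personalized_query)
    answer = vlm_chat(history)
    return expected_answer.lower() in answer.lower()
```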

Key Takeaways

1️⃣ Personalization Gap

Preference following lags far behind recognition, and a high rank on general VQA benchmarks does not imply strong personalization.

Personalization gap across vision-language models
2️⃣ Under-Personalization

Models favor refusing over affirming a valid personalized match, and their accuracy degrades as the dialogue grows longer.

Examples of biased refusals during personalization
3️⃣ Visual Cues Underutilized

A simple text description beats multiple reference images for concept injection; current VLMs under-utilize visual cues (see the sketch after these takeaways).

Comparison between text-only and multi-image concept injection
4️⃣ Safety vs Utility

Closed models often evade human-centric queries; alignment policies can suppress personalization.

Refusal behavior comparison highlighting safety versus utility
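
As referenced in takeaway 3, the contrast is between injecting a concept with a short text description and injecting it with several reference images. The sketch below illustrates the two styles; the prompt wording and message schema are illustrative assumptions, not the benchmark's exact templates.

```python
# Hedged sketch of the two concept-injection styles compared in takeaway 3.
# Prompt wording and message fields are illustrative, not MMPB's templates.
from typing import Dict, List

def text_injection(name: str, description: str) -> Dict[str, object]:
    """Introduce a concept with a short text description only."""
    return {
        "role": "user",
        "content": f"This is {name}. {description} Please remember {name} "
                   f"for the rest of our conversation.",
        "images": [],
    }

def image_injection(name: str, image_paths: List[str]) -> Dict[str, object]:
    """Introduce the same concept with several reference images instead."""
    return {
        "role": "user",
        "content": f"These photos all show {name}. Please remember {name} "
                   f"for the rest of our conversation.",
        "images": image_paths,
    }
```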

Results

| Model | Text (Turn 0) | Text (Turn 10) | Image (Turn 0) | Image (Turn 10) |
|---|---|---|---|---|
| Ovis2-34B | 76.2 | 66.1 | 72.4 | 62.0 |
| Qwen2-VL-72B | 73.0 | 68.5 | 71.9 | 60.0 |
| InternVL2.5-38B-MPO | 72.2 | 63.7 | 66.3 | 46.4 |
| Ovis2-16B | 71.5 | 64.6 | 71.7 | 64.9 |
| Qwen2.5-VL-72B | 70.4 | 63.9 | 68.1 | 57.9 |
| Claude-3.5-Sonnet | 68.8 | 54.3 | 40.4 | 41.6 |
| DeepSeek-VL-V2 | 68.2 | 58.5 | 56.0 | 60.9 |
| LLaVA-OV-72B | 67.4 | 61.4 | 58.7 | 56.5 |
| Qwen2-VL-7B | 66.6 | 62.9 | 60.6 | 59.1 |
| Gemini-2.0-Flash | 66.5 | 58.4 | 66.4 | 52.2 |
| Gemini-1.5-Flash | 66.4 | 61.4 | 64.1 | 56.2 |
| GPT-4o | 66.1 | 64.7 | 49.1 | 50.0 |
| InternVL2.5-26B-MPO | 65.0 | 58.1 | 58.3 | 53.3 |
| Ovis2-8B | 64.5 | 60.2 | 62.6 | 58.4 |
| Qwen2.5-VL-7B | 62.7 | 57.6 | 59.1 | 55.0 |
| InternVL2.5-8B-MPO | 60.6 | 56.3 | 61.4 | 55.2 |
| Llama-3.2-11B | 60.2 | 56.9 | 57.2 | 56.7 |
| InternVL2.5-7B-MPO | 60.0 | 47.2 | 51.6 | 40.9 |
| LLaVA-NeXT-34B | 57.8 | 52.4 | 59.4 | 52.9 |
| LLaVA-NeXT-32B | 57.5 | 58.6 | 54.9 | 55.4 |
| LLaVA-OV-7B | 56.8 | 52.7 | 49.8 | 49.1 |
| LLaVA-1.5-13B | 53.0 | 50.3 | 54.5 | 50.4 |
| Claude-3.7-Sonnet | 37.0 | 33.6 | 15.8 | 14.6 |

Higher is better. Numbers are accuracy scores at Turn 0 and Turn 10 under text-based and image-based concept injection, as reported on the MMPB benchmark.
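
As a quick way to read the table, the snippet below computes the Turn 0 to Turn 10 drop for a few representative rows. The values are copied from the table above; the column names are ours, not part of the benchmark's release.

```python
# Compute the Turn 0 -> Turn 10 accuracy drop per injection modality
# for a few rows of the results table above.
import pandas as pd

rows = [
    # model, text_t0, text_t10, image_t0, image_t10
    ("Ovis2-34B",         76.2, 66.1, 72.4, 62.0),
    ("GPT-4o",            66.1, 64.7, 49.1, 50.0),
    ("Claude-3.7-Sonnet", 37.0, 33.6, 15.8, 14.6),
]
df = pd.DataFrame(rows, columns=["model", "text_t0", "text_t10", "image_t0", "image_t10"])
df["text_drop"] = df["text_t0"] - df["text_t10"]
df["image_drop"] = df["image_t0"] - df["image_t10"]
print(df[["model", "text_drop", "image_drop"]])
```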
