**TL;DR** We introduce MMPB, the first benchmark focused on personalized VLM reasoning, spanning 10k image–query pairs and 111 concepts, together with a systematic analysis of 23 models showing how safety policies, dialogue drift, and weak visual grounding jointly block real personalization.
- Preference following lags far behind recognition, and a high rank on general VQA does not imply strong personalization.
- Models favor refusing over affirming a valid personalized match, and accuracy degrades over long conversations.
- Simple text descriptions beat multiple reference images for concept injection; current VLMs under-utilize visual cues.
- Closed models often evade human-centric queries; alignment policies can suppress personalization.
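To make the two injection modes concrete, here is a minimal sketch of how a text-based versus image-based personalized query might be assembled. The concept name, description, query text, and `build_*` helpers are all hypothetical illustrations, not MMPB's actual prompt format.

```python
# Sketch of text-based vs. image-based concept injection for a personalized
# VLM query. All names, tags, and strings are illustrative only.

def build_text_injection(concept_name: str, description: str, query: str) -> str:
    """Inject the personalized concept as a plain-text description."""
    return (
        f"<concept>{concept_name}</concept> is {description}\n"
        f"Question: {query}"
    )

def build_image_injection(concept_name: str, ref_images: list[str], query: str) -> dict:
    """Inject the concept as reference images plus a short text label."""
    return {
        "images": ref_images,  # e.g. paths to reference photos of the concept
        "text": (
            f"The images above show <concept>{concept_name}</concept>.\n"
            f"Question: {query}"
        ),
    }

prompt = build_text_injection(
    "Bo",
    "my golden retriever with a red collar",
    "Would Bo enjoy the park shown in this photo?",
)
print(prompt)
```

The benchmark's finding is that the cheaper text route (first helper) tends to outperform supplying multiple reference images (second helper).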
| Model | Text (Turn 0) | Text (Turn 10) | Image (Turn 0) | Image (Turn 10) |
|---|---|---|---|---|
| Ovis2-34B | 76.2 | 66.1 | 72.4 | 62.0 |
| Qwen2-VL-72B | 73.0 | 68.5 | 71.9 | 60.0 |
| InternVL2.5-38B-MPO | 72.2 | 63.7 | 66.3 | 46.4 |
| Ovis2-16B | 71.5 | 64.6 | 71.7 | 64.9 |
| Qwen2.5-VL-72B | 70.4 | 63.9 | 68.1 | 57.9 |
| Claude-3.5-Sonnet | 68.8 | 54.3 | 40.4 | 41.6 |
| DeepSeek-VL-V2 | 68.2 | 58.5 | 56.0 | 60.9 |
| LLaVA-OV-72B | 67.4 | 61.4 | 58.7 | 56.5 |
| Qwen2-VL-7B | 66.6 | 62.9 | 60.6 | 59.1 |
| Gemini-2.0-Flash | 66.5 | 58.4 | 66.4 | 52.2 |
| Gemini-1.5-Flash | 66.4 | 61.4 | 64.1 | 56.2 |
| GPT-4o | 66.1 | 64.7 | 49.1 | 50.0 |
| InternVL2.5-26B-MPO | 65.0 | 58.1 | 58.3 | 53.3 |
| Ovis2-8B | 64.5 | 60.2 | 62.6 | 58.4 |
| Qwen2.5-VL-7B | 62.7 | 57.6 | 59.1 | 55.0 |
| InternVL2.5-8B-MPO | 60.6 | 56.3 | 61.4 | 55.2 |
| Llama-3.2-11B | 60.2 | 56.9 | 57.2 | 56.7 |
| InternVL2.5-7B-MPO | 60.0 | 47.2 | 51.6 | 40.9 |
| LLaVA-NeXT-34B | 57.8 | 52.4 | 59.4 | 52.9 |
| LLaVA-NeXT-32B | 57.5 | 58.6 | 54.9 | 55.4 |
| LLaVA-OV-7B | 56.8 | 52.7 | 49.8 | 49.1 |
| LLaVA-1.5-13B | 53.0 | 50.3 | 54.5 | 50.4 |
| Claude-3.7-Sonnet | 37.0 | 33.6 | 15.8 | 14.6 |
Higher is better (accuracy, %). Text/Image denote the concept-injection modality; Turn 0/Turn 10 denote conversation depth. Numbers are from the MMPB benchmark.
```bibtex
@inproceedings{kim2025mmpb,
  title     = {{MMPB}: It's Time for Multi-Modal Personalization},
  author    = {Jaeik Kim and Woojin Kim and Woohyeon Park and Jaeyoung Do},
  booktitle = {The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year      = {2025},
}
```