**TL;DR** We introduce MMPB, the first benchmark focused on personalized VLM reasoning, spanning 10k image–query pairs and 111 concepts, together with a systematic analysis of 23 models showing how safety policies, dialogue drift, and weak visual grounding jointly block real personalization.
- Preference following lags far behind recognition, and a high rank on general VQA does not imply strong personalization.
- Models favor refusing over affirming a valid personalized match, and accuracy degrades over long conversations.
- Simple text descriptions beat multiple reference images for concept injection; current VLMs under-utilize visual cues.
- Closed models often evade human-centric queries; alignment policies can suppress personalization.
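To make the two injection modes concrete, here is a minimal sketch of how a text-based versus image-based personalized query might be assembled. The concept name, description, query text, and `build_*` helpers are all hypothetical illustrations, not MMPB's actual prompt format.

```python
# Sketch of text-based vs. image-based concept injection for a personalized
# VLM query. All names, tags, and strings are illustrative only.

def build_text_injection(concept_name: str, description: str, query: str) -> str:
    """Inject the personalized concept as a plain-text description."""
    return (
        f"<concept>{concept_name}</concept> is {description}\n"
        f"Question: {query}"
    )

def build_image_injection(concept_name: str, ref_images: list[str], query: str) -> dict:
    """Inject the concept as reference images plus a short text label."""
    return {
        "images": ref_images,  # e.g. paths to reference photos of the concept
        "text": (
            f"The images above show <concept>{concept_name}</concept>.\n"
            f"Question: {query}"
        ),
    }

prompt = build_text_injection(
    "Bo",
    "my golden retriever with a red collar",
    "Would Bo enjoy the park shown in this photo?",
)
print(prompt)
```

The benchmark's finding is that the cheaper text route (first helper) tends to outperform supplying multiple reference images (second helper).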
| Model | Text (Turn 0) | Text (Turn 10) | Image (Turn 0) | Image (Turn 10) |
|---|---|---|---|---|
| Ovis2-34B | 76.2 | 66.1 | 72.4 | 62.0 |
| Qwen2-VL-72B | 73.0 | 68.5 | 71.9 | 60.0 |
| InternVL2.5-38B-MPO | 72.2 | 63.7 | 66.3 | 46.4 |
| Ovis2-16B | 71.5 | 64.6 | 71.7 | 64.9 |
| Qwen2.5-VL-72B | 70.4 | 63.9 | 68.1 | 57.9 |
| Claude-3.5-Sonnet | 68.8 | 54.3 | 40.4 | 41.6 |
| DeepSeek-VL-V2 | 68.2 | 58.5 | 56.0 | 60.9 |
| LLaVA-OV-72B | 67.4 | 61.4 | 58.7 | 56.5 |
| Qwen2-VL-7B | 66.6 | 62.9 | 60.6 | 59.1 |
| Gemini-2.0-Flash | 66.5 | 58.4 | 66.4 | 52.2 |
| Gemini-1.5-Flash | 66.4 | 61.4 | 64.1 | 56.2 |
| GPT-4o | 66.1 | 64.7 | 49.1 | 50.0 |
| InternVL2.5-26B-MPO | 65.0 | 58.1 | 58.3 | 53.3 |
| Ovis2-8B | 64.5 | 60.2 | 62.6 | 58.4 |
| Qwen2.5-VL-7B | 62.7 | 57.6 | 59.1 | 55.0 |
| InternVL2.5-8B-MPO | 60.6 | 56.3 | 61.4 | 55.2 |
| Llama-3.2-11B | 60.2 | 56.9 | 57.2 | 56.7 |
| InternVL2.5-7B-MPO | 60.0 | 47.2 | 51.6 | 40.9 |
| LLaVA-NeXT-34B | 57.8 | 52.4 | 59.4 | 52.9 |
| LLaVA-NeXT-32B | 57.5 | 58.6 | 54.9 | 55.4 |
| LLaVA-OV-7B | 56.8 | 52.7 | 49.8 | 49.1 |
| LLaVA-1.5-13B | 53.0 | 50.3 | 54.5 | 50.4 |
| Claude-3.7-Sonnet | 37.0 | 33.6 | 15.8 | 14.6 |
Higher is better (accuracy, %). Text/Image denote the concept-injection modality; Turn 0/Turn 10 denote conversation depth. Numbers are from the MMPB benchmark.
```bibtex
@inproceedings{kim2025mmpb,
  title     = {{MMPB}: It's Time for Multi-Modal Personalization},
  author    = {Jaeik Kim and Woojin Kim and Woohyeon Park and Jaeyoung Do},
  booktitle = {The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year      = {2025},
}
```