TL;DR: We build MMPB, the first benchmark focused on personalized VLM reasoning, spanning 10k image–query pairs and 111 concepts, and deliver a systematic analysis across 23 models showing how safety policies, dialogue drift, and weak visual grounding jointly block real personalization.
- Preference following lags far behind recognition, and a high rank on general VQA does not imply strong personalization.
- Models favor saying no over affirming a valid personalized match, and performance degrades over long dialogues.
- Simple text descriptions beat multiple reference images for concept injection; current VLMs under-utilize visual cues.
- Closed models often evade human-centric queries; alignment policies can suppress personalization.
Model | Text (Turn 0) | Text (Turn 10) | Image (Turn 0) | Image (Turn 10)
---|---|---|---|---
Ovis2-34B | 76.2 | 66.1 | 72.4 | 62.0 |
Qwen2-VL-72B | 73.0 | 68.5 | 71.9 | 60.0 |
InternVL2.5-38B-MPO | 72.2 | 63.7 | 66.3 | 46.4 |
Ovis2-16B | 71.5 | 64.6 | 71.7 | 64.9 |
Qwen2.5-VL-72B | 70.4 | 63.9 | 68.1 | 57.9 |
Claude-3.5-Sonnet | 68.8 | 54.3 | 40.4 | 41.6 |
DeepSeek-VL-V2 | 68.2 | 58.5 | 56.0 | 60.9 |
LLaVA-OV-72B | 67.4 | 61.4 | 58.7 | 56.5 |
Qwen2-VL-7B | 66.6 | 62.9 | 60.6 | 59.1 |
Gemini-2.0-Flash | 66.5 | 58.4 | 66.4 | 52.2 |
Gemini-1.5-Flash | 66.4 | 61.4 | 64.1 | 56.2 |
GPT-4o | 66.1 | 64.7 | 49.1 | 50.0 |
InternVL2.5-26B-MPO | 65.0 | 58.1 | 58.3 | 53.3 |
Ovis2-8B | 64.5 | 60.2 | 62.6 | 58.4 |
Qwen2.5-VL-7B | 62.7 | 57.6 | 59.1 | 55.0 |
InternVL2.5-8B-MPO | 60.6 | 56.3 | 61.4 | 55.2 |
Llama-3.2-11B | 60.2 | 56.9 | 57.2 | 56.7 |
InternVL2.5-7B-MPO | 60.0 | 47.2 | 51.6 | 40.9 |
LLaVA-NeXT-34B | 57.8 | 52.4 | 59.4 | 52.9 |
LLaVA-NeXT-32B | 57.5 | 58.6 | 54.9 | 55.4 |
LLaVA-OV-7B | 56.8 | 52.7 | 49.8 | 49.1 |
LLaVA-1.5-13B | 53.0 | 50.3 | 54.5 | 50.4 |
Claude-3.7-Sonnet | 37.0 | 33.6 | 15.8 | 14.6 |
Higher is better for accuracy metrics. Numbers are from the MMPB benchmark.
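The multi-turn degradation the findings describe can be read directly off the table as the accuracy lost between Turn 0 and Turn 10. Below is a minimal sketch that computes this drift for a few rows copied verbatim from the table above; the `drift` helper and the subset of models chosen are illustrative, not part of the benchmark.

```python
# Quantify multi-turn accuracy drift (Turn 0 minus Turn 10) from the table above.
# Tuples are (text_t0, text_t10, image_t0, image_t10); values copied verbatim.
scores = {
    "Ovis2-34B":           (76.2, 66.1, 72.4, 62.0),
    "Qwen2-VL-72B":        (73.0, 68.5, 71.9, 60.0),
    "InternVL2.5-38B-MPO": (72.2, 63.7, 66.3, 46.4),
    "GPT-4o":              (66.1, 64.7, 49.1, 50.0),
    "Claude-3.7-Sonnet":   (37.0, 33.6, 15.8, 14.6),
}

def drift(t0: float, t10: float) -> float:
    """Accuracy lost between turn 0 and turn 10 (positive = degradation)."""
    return round(t0 - t10, 1)

for model, (tt0, tt10, it0, it10) in scores.items():
    print(f"{model:22s} text drift: {drift(tt0, tt10):+5.1f}  "
          f"image drift: {drift(it0, it10):+5.1f}")
```

Note that the drift can be negative (e.g. GPT-4o's image score rises slightly by Turn 10), which is why the sign is printed explicitly.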