How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing

Huanyu Zhang*†1,2, Xuehai Bai*3, Chengzu Li*4, Chen Liang2, Haochen Tian1,2, Haodong Li5, Ruichuan An6, Yifan Zhang1,2, Anna Korhonen4, Zhang Zhang1,2, Liang Wang1,2, Tieniu Tan7
1UCAS 2CASIA 3HDU 4Cambridge 5SCUT 6PKU 7NJU
*Equal Contribution †Corresponding Author
Teaser figures: fan and overview.

VIBE organizes visual instruction-guided image editing into a three-level interaction hierarchy with increasing task complexity. The Deictic Level treats visual instructions as selectors that specify localized regions or objects for basic spatial operations. The Morphological Level interprets visual instructions as blueprints that define abstract structural constraints. The Causal Level views visual instructions as catalysts that encode underlying physical or logical dynamics.
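For concreteness, a test case at any level can be thought of as a (source image, visual instruction, optional text instruction) triple tagged with its level. The sketch below is a hypothetical Python representation; the field names and file paths are our own illustrative choices, not the benchmark's released data format.

    from dataclasses import dataclass
    from enum import Enum

    class Level(Enum):
        # Role played by the visual instruction at each level, as described above.
        DEICTIC = "selector"          # points at a region or object for a basic spatial edit
        MORPHOLOGICAL = "blueprint"   # imposes an abstract structural constraint
        CAUSAL = "catalyst"           # encodes physical or logical dynamics to reason over

    @dataclass
    class TestCase:
        # Hypothetical schema; the released benchmark may organize cases differently.
        source_image: str         # image to be edited
        visual_instruction: str   # sketch/overlay that guides the edit
        text_instruction: str     # accompanying textual instruction, possibly empty
        level: Level              # which rung of the hierarchy the case probes

    example = TestCase(
        source_image="inputs/kitchen.png",
        visual_instruction="instructions/circle_on_kettle.png",
        text_instruction="Remove the circled object.",
        level=Level.DEICTIC,
    )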

VIBE Leaderboard

Model | AD RM RP TR Avg (Deictic Level) | PC RO DI Avg (Morphological Level) | LC FS BI Avg (Causal Level) | Overall
Nano Banana Pro 82.17 94.07 88.26 74.80 84.83 72.33 36.04 88.02 65.46 60.34 59.25 15.92 45.17 65.15
Nano Banana 81.34 93.50 79.05 46.53 75.11 67.71 33.45 85.60 62.25 34.75 52.64 1.87 29.75 55.70
GPT-image-1 55.61 69.00 62.63 47.00 58.56 64.39 11.09 77.32 50.93 25.18 39.48 4.73 23.13 44.21
Seedream 4.5 81.24 95.82 81.82 48.93 76.95 66.79 20.11 82.33 56.41 50.50 45.55 2.99 33.01 55.46
Seedream 4.0 74.02 93.04 79.35 33.29 69.93 58.27 30.37 72.09 53.58 47.00 43.59 4.11 31.57 51.69
Wan 2.6 66.01 92.90 68.15 40.95 67.00 59.66 34.89 80.23 58.26 44.46 50.79 9.08 34.78 53.35
Wan 2.5 73.59 96.90 76.99 36.80 71.07 55.76 25.77 78.78 53.44 33.33 51.98 7.84 31.05 51.85
FLUX2-dev 64.57 8.00 54.40 5.58 33.14 28.76 22.77 60.68 37.40 33.74 30.81 2.36 22.30 30.95
Qwen-Image-Edit-2509 55.28 14.38 30.13 14.48 28.57 15.14 17.67 21.38 18.06 28.00 44.40 2.36 24.92 23.85
Qwen-Image-Edit 44.20 24.88 30.48 11.11 27.67 - 21.33 54.65 - 22.00 32.38 3.11 19.16 23.42
Edit-R1-Qwen-Image-Edit 56.77 4.86 29.47 11.33 25.61 16.23 20.27 15.42 17.31 25.67 40.22 2.24 22.71 21.87
BAGEL-think 40.44 14.59 35.38 14.46 26.22 8.50 23.60 50.34 27.48 21.17 28.04 5.22 18.14 23.95
BAGEL 33.87 11.33 29.26 14.05 18.21 7.61 28.05 48.23 27.96 21.57 33.63 5.97 20.39 22.19
Step1X-Edit-v1p2 33.92 12.59 28.17 13.52 22.05 - 25.17 71.48 - 25.00 29.67 0.37 18.35 20.20
OmniGen2 26.29 26.20 20.84 4.51 19.46 11.40 17.44 30.51 19.78 17.33 17.61 2.74 12.56 17.27
UniWorld-V1 15.18 14.52 22.03 3.59 13.83 11.95 16.57 34.43 20.98 15.50 9.28 0.00 8.26 14.36
OmniGen 2.63 7.48 5.26 1.23 4.15 5.79 13.88 3.93 7.87 2.33 2.00 0.00 1.44 4.49
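The aggregate columns appear to follow a simple recipe: each level's Avg is the unweighted mean of its sub-metrics, and Overall is the mean of the three level averages. The sketch below reproduces the Nano Banana Pro row under that assumption; the official scoring script may weight sub-metrics differently.

    # Reproducing the aggregate columns of the Nano Banana Pro row, assuming
    # unweighted means (an assumption; the official scoring may differ).

    def mean(values):
        values = list(values)
        return sum(values) / len(values)

    deictic = {"AD": 82.17, "RM": 94.07, "RP": 88.26, "TR": 74.80}
    morphological = {"PC": 72.33, "RO": 36.04, "DI": 88.02}
    causal = {"LC": 60.34, "FS": 59.25, "BI": 15.92}

    deictic_avg = mean(deictic.values())               # ≈ 84.83
    morphological_avg = mean(morphological.values())   # ≈ 65.46
    causal_avg = mean(causal.values())                 # ≈ 45.17
    overall = mean([deictic_avg, morphological_avg, causal_avg])  # ≈ 65.15

    print(f"Deictic {deictic_avg:.2f} | Morphological {morphological_avg:.2f} | "
          f"Causal {causal_avg:.2f} | Overall {overall:.2f}")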

Synergy Between Textual and Visual Instructions


Two representative cases illustrating how textual and visual instructions interact. The first case shows that visual instructions can resolve target ambiguity that detailed text alone fails to address. The second case demonstrates that complex semantic constraints require the joint use of detailed textual and visual instructions. Together, these examples highlight that textual and visual instructions play distinct yet complementary roles in image editing.

Performance across image styles on the Deictic Level.


Left: Average Deictic Level scores across real-world, animation, and sketch images for four proprietary models.
Right: Metric-level heatmaps for Seedream 4.5 and GPT-image-1, illustrating style-dependent variations in Instruction Adherence, Contextual Preservation, and Visual Coherence.

Qualitative examples of multi-task visual instruction following.


Abstract

Recent generative models have achieved remarkable progress in image editing, yet existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal: visual instructions such as sketches efficiently convey spatial and structural intent. To address this gap, we introduce VIBE, the Visual Instruction Benchmark for Image Editing, organized as a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. Across these levels, we curate high-quality, diverse test cases of progressively increasing complexity in visual instruction following. We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable, fine-grained assessment. Through a comprehensive evaluation of 17 representative open-source and proprietary image editing models, we find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform their open-source counterparts. Even for the strongest systems, however, performance degrades markedly as task difficulty increases, highlighting promising directions for future research.
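To make the LMM-as-a-judge protocol concrete, here is a minimal scoring-loop sketch under our own assumptions: the rubric reuses the three metric names shown in the style-analysis heatmaps (Instruction Adherence, Contextual Preservation, Visual Coherence), and call_judge is a placeholder to be wired to any vision-capable chat model. VIBE's released task-specific metrics and prompts will differ in detail.

    import json

    # Hypothetical judging rubric; the benchmark's task-specific prompts differ.
    JUDGE_PROMPT = """You are grading an image edit.
    You are given the source image, the visual instruction, and the edited result.
    Score each criterion from 0 to 10 and answer with JSON only:
    {"instruction_adherence": ..., "contextual_preservation": ..., "visual_coherence": ...}"""

    def call_judge(prompt: str, image_paths: list[str]) -> str:
        """Placeholder: send the prompt plus images to a vision-capable LMM and
        return its text reply. Wire this to whichever judge model you use."""
        raise NotImplementedError

    def score_edit(source_image: str, visual_instruction: str, edited_image: str) -> float:
        reply = call_judge(JUDGE_PROMPT, [source_image, visual_instruction, edited_image])
        scores = json.loads(reply)                                  # parse the rubric scores
        return 100.0 * sum(scores.values()) / (10 * len(scores))   # normalize to 0-100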

BibTeX


        @misc{zhang2026vibe-benchmark,
              title={How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing},
              author={Huanyu Zhang and Xuehai Bai and Chengzu Li and Chen Liang and Haochen Tian and Haodong Li and Ruichuan An and Yifan Zhang and Anna Korhonen and Zhang Zhang and Liang Wang and Tieniu Tan},
              year={2026},
              eprint={2602.01851},
              archivePrefix={arXiv},
              primaryClass={cs.CV},
              url={https://arxiv.org/abs/2602.01851}
        }