Apply a wind effect to the entire scene. The wind direction must align precisely with the vector indicated by the red arrow.
Nano Banana Pro
Add a green dashed line trajectory to the image, starting from the white ball and moving in the direction of the black arrow. Draw a red bounding box around the first blue ball that is hit by the trajectory.
Nano Banana Pro
Relocate the pencil highlighted by the red outline to the spot pointed to by the red arrow.
Nano Banana Pro
Remove all the objects in the marked region.
Nano Banana Pro
Apply a wind effect to the entire scene. The wind direction must align precisely with the vector indicated by the red arrow.
Nano Banana Pro
Based on the red sketch annotations in the image, make this woman's hair short.
Nano Banana Pro
Relight the entire scene, changing the primary light source direction to align precisely with the red arrow indicator.
Nano Banana Pro
Apply a wind effect to the entire scene. The wind direction must align precisely with the vector indicated by the red arrow.
Nano Banana Pro
Add a mountain picture hanging into the area delineated by the red bounding box.
Nano Banana Pro
Add some apples into the area delineated by the red bounding box.
Nano Banana Pro
Add a statue in the background holding a shield into the area delineated by the red bounding box.
Nano Banana Pro
Relocate the sticky notes highlighted by the red outline to the spot pointed to by the red arrow.
Nano Banana Pro
Relight the entire scene, changing the primary light source direction to align precisely with the red arrow indicator.
Nano Banana Pro
Remove all the objects in the marked region.
Nano Banana Pro
Change this woman's clothing to a strapless top shown in the red sketch annotations in the picture.
Nano Banana Pro
Relocate the stone highlighted by the red outline to the spot pointed to by the red arrow.
Nano Banana Pro
Remove all the objects in the marked region.
Nano Banana Pro
Remove all the objects in the marked region.
Nano Banana Pro
Add a cat into the area delineated by the red bounding box.
Nano Banana Pro
Relight the entire scene, changing the primary light source direction to align precisely with the red arrow indicator.
Nano Banana Pro
Relocate the pillow highlighted by the red outline to the spot pointed to by the red arrow.
Nano Banana Pro
Change this man's hairstyle to the style shown in the red sketch annotations in the picture.
Nano Banana Pro
| Model | Multi-Img | AD | RM | RP | TR | Deictic Avg | PC | RO | DI | Morph. Avg | LC | FS | BI | Causal Avg | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nano Banana Pro | ✓ | 82.17 | 94.07 | 88.26 | 74.80 | 84.83 | 72.33 | 36.04 | 88.02 | 65.46 | 60.34 | 59.25 | 15.92 | 45.17 | 65.15 |
| Nano Banana | ✓ | 81.34 | 93.50 | 79.05 | 46.53 | 75.11 | 67.71 | 33.45 | 85.60 | 62.25 | 34.75 | 52.64 | 1.87 | 29.75 | 55.70 |
| GPT-image-1 | ✓ | 55.61 | 69.00 | 62.63 | 47.00 | 58.56 | 64.39 | 11.09 | 77.32 | 50.93 | 25.18 | 39.48 | 4.73 | 23.13 | 44.21 |
| Seedream 4.5 | ✓ | 81.24 | 95.82 | 81.82 | 48.93 | 76.95 | 66.79 | 20.11 | 82.33 | 56.41 | 50.50 | 45.55 | 2.99 | 33.01 | 55.46 |
| Seedream 4.0 | ✓ | 74.02 | 93.04 | 79.35 | 33.29 | 69.93 | 58.27 | 30.37 | 72.09 | 53.58 | 47.00 | 43.59 | 4.11 | 31.57 | 51.69 |
| Wan 2.6 | ✓ | 66.01 | 92.90 | 68.15 | 40.95 | 67.00 | 59.66 | 34.89 | 80.23 | 58.26 | 44.46 | 50.79 | 9.08 | 34.78 | 53.35 |
| Wan 2.5 | ✓ | 73.59 | 96.90 | 76.99 | 36.80 | 71.07 | 55.76 | 25.77 | 78.78 | 53.44 | 33.33 | 51.98 | 7.84 | 31.05 | 51.85 |
| FLUX2-dev | ✓ | 64.57 | 8.00 | 54.40 | 5.58 | 33.14 | 28.76 | 22.77 | 60.68 | 37.40 | 33.74 | 30.81 | 2.36 | 22.30 | 30.95 |
| Qwen-Image-Edit-2509 | ✓ | 55.28 | 14.38 | 30.13 | 14.48 | 28.57 | 15.14 | 17.67 | 21.38 | 18.06 | 28.00 | 44.40 | 2.36 | 24.92 | 23.85 |
| Qwen-Image-Edit | ✗ | 44.20 | 24.88 | 30.48 | 11.11 | 27.67 | - | 21.33 | 54.65 | - | 22.00 | 32.38 | 3.11 | 19.16 | 23.42 |
| Edit-R1-Qwen-Image-Edit | ✓ | 56.77 | 4.86 | 29.47 | 11.33 | 25.61 | 16.23 | 20.27 | 15.42 | 17.31 | 25.67 | 40.22 | 2.24 | 22.71 | 21.87 |
| BAGEL-think | ✓ | 40.44 | 14.59 | 35.38 | 14.46 | 26.22 | 8.50 | 23.60 | 50.34 | 27.48 | 21.17 | 28.04 | 5.22 | 18.14 | 23.95 |
| BAGEL | ✓ | 33.87 | 11.33 | 29.26 | 14.05 | 18.21 | 7.61 | 28.05 | 48.23 | 27.96 | 21.57 | 33.63 | 5.97 | 20.39 | 22.19 |
| Step1X-Edit-v1p2 | ✗ | 33.92 | 12.59 | 28.17 | 13.52 | 22.05 | - | 25.17 | 71.48 | - | 25.00 | 29.67 | 0.37 | 18.35 | 20.20 |
| OmniGen2 | ✓ | 26.29 | 26.20 | 20.84 | 4.51 | 19.46 | 11.40 | 17.44 | 30.51 | 19.78 | 17.33 | 17.61 | 2.74 | 12.56 | 17.27 |
| UniWorld-V1 | ✓ | 15.18 | 14.52 | 22.03 | 3.59 | 13.83 | 11.95 | 16.57 | 34.43 | 20.98 | 15.50 | 9.28 | 0.00 | 8.26 | 14.36 |
| OmniGen | ✓ | 2.63 | 7.48 | 5.26 | 1.23 | 4.15 | 5.79 | 13.88 | 3.93 | 7.87 | 2.33 | 2.00 | 0.00 | 1.44 | 4.49 |
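The Avg and Overall columns in the table above appear consistent with unweighted means: each level's Avg is the mean of its sub-metrics, and Overall is the mean of the three level averages. A minimal sketch verifying this for the Nano Banana Pro row (the variable names are ours, not from the benchmark):

```python
from statistics import mean

# Sub-metric scores from the Nano Banana Pro row of the results table.
deictic = mean([82.17, 94.07, 88.26, 74.80])   # AD, RM, RP, TR
morphological = mean([72.33, 36.04, 88.02])    # PC, RO, DI
causal = mean([60.34, 59.25, 15.92])           # LC, FS, BI

# Overall appears to be the unweighted mean of the three level averages.
overall = mean([deictic, morphological, causal])
```

Each computed value matches the table's reported Avg (84.83, 65.46, 45.17) and Overall (65.15) to within rounding.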
Two representative cases illustrating how textual and visual instructions interact. The first case shows that visual instructions can resolve target ambiguity that detailed text alone fails to address. The second case demonstrates that complex semantic constraints require the joint use of detailed textual and visual instructions. Together, these examples highlight that textual and visual instructions play distinct yet complementary roles in image editing.
Left: Average Deictic Level scores across real-world, animation, and sketch images for four proprietary models.
Right: Metric-level heatmaps for Seedream 4.5 and GPT-Image-1, illustrating style-dependent variations in Instruction Adherence, Contextual Preservation, and Visual Coherence.
Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal: visual instructions such as sketches efficiently convey spatial and structural intent. To address this gap, we introduce VIBE, the Visual Instruction Benchmark for Image Editing, organized around a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. Across these levels, we curate high-quality and diverse test cases that reflect progressively increasing complexity in visual instruction following. We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. Through a comprehensive evaluation of 17 representative open-source and proprietary image editing models, we find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models. However, performance degrades markedly with increasing task difficulty, even for the strongest systems, highlighting promising directions for future research.
@misc{zhang2026vibe-benchmark,
title={How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing},
author={Huanyu Zhang and Xuehai Bai and Chengzu Li and Chen Liang and Haochen Tian and Haodong Li and Ruichuan An and Yifan Zhang and Anna Korhonen and Zhang Zhang and Liang Wang and Tieniu Tan},
year={2026},
eprint={2602.01851},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.01851},
}