BLINK: Multimodal Large Language Models Can See but Not Perceive

1University of Pennsylvania  2University of Washington  3Allen Institute for AI
4University of California, Davis  5Columbia University  *Equal Contribution  Equal Advising

ECCV 2024

What is BLINK?

BLINK is a benchmark of 14 visual perception tasks that humans can solve “within a blink” but that pose significant challenges for current multimodal large language models (LLMs).


Example tasks in BLINK. The answers to the examples: relative depth: B; jigsaw: A; multi-view reasoning: right; visual correspondence: A; semantic correspondence: C; forensics detection: the final image; IQ test: D; visual similarity: the upper one; functional correspondence: A; relative reflectance: they are about the same.

Leaderboard

We provide a leaderboard on the validation set for the community to track progress. All models are evaluated in a zero-shot setting using the prompts provided in the dataset.
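Concretely, a per-task evaluation can follow the minimal sketch below. It assumes the data is hosted on the Hugging Face Hub under the ID BLINK-Benchmark/BLINK and that each validation example exposes prompt, image_1, and answer fields; predict_fn stands in for whatever multimodal LLM is being evaluated. These names are assumptions for illustration, not the official evaluation script.

    # Minimal zero-shot evaluation loop for one BLINK task (sketch).
    # Assumptions: the benchmark is hosted on the Hugging Face Hub as
    # "BLINK-Benchmark/BLINK", each validation example exposes "prompt",
    # "image_1", and "answer" fields, and `predict_fn` wraps whatever
    # multimodal LLM is being evaluated.
    from datasets import load_dataset

    def evaluate_task(task_name, predict_fn):
        """Return zero-shot accuracy of predict_fn on one task's val split."""
        val = load_dataset("BLINK-Benchmark/BLINK", task_name, split="val")
        correct = 0
        for example in val:
            # The dataset's own prompt is used verbatim -- no few-shot examples.
            prediction = predict_fn(example["prompt"], example.get("image_1"))
            correct += int(prediction.strip() == example["answer"].strip())
        return correct / len(val)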

BLINK -- Characteristics and Statistics

BLINK has several novel features that distinguish it from previous benchmarks.

  • BLINK covers 14 perception-demanding tasks inspired by classical computer vision problems. While these problems take humans only a "blink" to solve, they exceed the capabilities of current multimodal large language models.
  • BLINK incorporates diverse visual prompts, such as circles, boxes, and image masks, whereas previous benchmarks contain only text questions and answers.
  • BLINK evaluates a more comprehensive range of visual perception abilities, such as multi-view reasoning, depth estimation, and reflectance estimation. Prior benchmarks generally focus on recognition-based VQA.
  • BLINK contains "visual" commonsense problems that humans can answer within seconds, while prior benchmarks like MMMU require domain knowledge.

Abstract

We introduce BLINK, a new benchmark for multimodal large language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the BLINK tasks can be solved by humans “within a blink” (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks pose significant challenges for current multimodal LLMs because they resist mediation through natural language. BLINK reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans achieve 95.70% accuracy on average, BLINK is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not “emerged” yet in recent multimodal LLMs. Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvements. We believe BLINK will stimulate the community to help multimodal LLMs catch up with human-level visual perception.

Qualitative Results

For each task, we show the choices of LLaVA-v1.6-34B, Qwen-VL-Max, Gemini Pro, GPT-4V, and humans. The choice in red indicates the ground truth. Note that the markers are intentionally enlarged for visualization, and some images are shown as insets to save space. For the IQ test, the third image is constructed by overlaying the first and second images.

Quantitative Results

Results of different models on the BLINK test set. The first row shows the task names and the number of test examples.

The mean accuracy of 7B and 13B open-source multimodal LLMs hovers around 35–42%, close to random guessing (38.09%). The best open-source model, LLaVA-v1.6-34B, achieves an accuracy of 45.05%. Even the most advanced models, GPT-4V, Gemini Pro, and Claude 3 Opus, achieve accuracies of only 51.26%, 45.72%, and 44.11%, respectively. Their performance is merely 13.17%, 7.63%, and 6.02% better than random guessing, and lags behind human performance by 44.44%, 49.98%, and 51.59%. Notably, on certain tasks such as jigsaw, semantic correspondence, multi-view reasoning, object localization, and relative reflectance, some multimodal LLMs even underperform random guessing.
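For reference, a task-mixture random-guess baseline such as the 38.09% figure above is just the question-weighted average of 1/k over tasks with k answer choices. The sketch below illustrates the computation; the task names and counts in it are placeholders rather than the actual BLINK statistics.

    # The expected accuracy of uniform guessing on a task with k answer choices
    # is 1/k; the benchmark-level baseline is the question-weighted average.
    # The task names and counts in `example` are placeholders, not the actual
    # BLINK statistics.
    def random_baseline(tasks):
        """tasks maps task name -> (num_questions, num_choices)."""
        total_questions = sum(n for n, _ in tasks.values())
        expected_correct = sum(n / k for n, k in tasks.values())
        return expected_correct / total_questions

    example = {"Relative Depth": (124, 2), "IQ Test": (150, 4)}  # placeholder counts
    print(f"{100 * random_baseline(example):.2f}%")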

Experiment Analysis


1. Is dense captioning all you need for a multimodal LLM benchmark?

To answer this question, we convert images into task-agnostic dense image captions with GPT-4V and then answer the questions with a text-only LLM. The dense caption describes, in language, detailed information about the image and the visual prompts (e.g., where each circle is). We experiment with BLINK, MMBench, and MMMU. Surprisingly, we find that the Caption + LLM setting achieves much better results on MMBench and MMMU than on BLINK. These results indicate that image captions carry a large portion of the visual information needed to answer those benchmarks, whereas BLINK requires perceptual abilities beyond what is currently attainable with general captions.

Caption + LLM achieves good results on MMBench and MMMU, but fails on BLINK.
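A minimal sketch of this two-stage Caption + LLM setting is shown below, using the OpenAI Chat Completions API. The captioning prompt, model names, and helper functions are illustrative assumptions, not the exact pipeline used in the paper.

    # Sketch of the Caption + LLM setting: stage 1 turns the image into a
    # task-agnostic dense caption (including visual prompts such as circles),
    # stage 2 answers the multiple-choice question from the caption alone.
    # Model names and prompt wording are illustrative assumptions.
    import base64
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def dense_caption(image_path: str) -> str:
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        resp = client.chat.completions.create(
            model="gpt-4o",  # assumed vision-capable captioner
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Describe this image in detail, including the "
                             "location of any circles, boxes, or other markers."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }],
        )
        return resp.choices[0].message.content

    def caption_then_answer(image_path: str, question: str) -> str:
        caption = dense_caption(image_path)  # the answerer never sees the image
        resp = client.chat.completions.create(
            model="gpt-4",  # assumed text-only answerer
            messages=[{"role": "user",
                       "content": f"Image description: {caption}\n\n{question}\n"
                                  "Answer with the letter of your choice."}],
        )
        return resp.choices[0].message.content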


2. Visual prompting can have a large effect on multimodal LLMs.

We analyze the effect of circle size and color on multiple BLINK tasks. The experiments suggest that visual prompting can have a large impact on multimodal LLM performance, and that improving visual prompts, or making models more robust to prompt variations, is a promising direction for future research.

We find that the optimal circle size is task-dependent and on average 10px circles work the best. Also, red is better than gray for all tasks.
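For readers who want to run a similar ablation, the sketch below overlays a circle visual prompt with Pillow, exposing the radius and color as parameters; the function name and defaults are illustrative.

    # Sketch of adding a circle visual prompt to an image with Pillow. Radius
    # and color mirror the variations studied above (e.g., ~10 px circles,
    # red vs. gray outlines); the function name and defaults are illustrative.
    from PIL import Image, ImageDraw

    def draw_circle_prompt(image_path, center, radius=10, color="red", width=3):
        """Return a copy of the image with a circle drawn around the (x, y) center."""
        img = Image.open(image_path).convert("RGB")
        draw = ImageDraw.Draw(img)
        x, y = center
        draw.ellipse([x - radius, y - radius, x + radius, y + radius],
                     outline=color, width=width)
        return img

    # Example: mark a keypoint at (320, 240) with a 10 px red circle.
    prompted = draw_circle_prompt("example.jpg", (320, 240), radius=10, color="red")
    prompted.save("example_prompted.jpg")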

3. Can specialist models solve BLINK tasks?

Specialists can serve as a proxy upper bound on how well multimodal LLMs could perform. This suggests that multimodal LLMs may make progress on these tasks given the right data and training strategy.

The specialists perform much better than GPT-4V and Gemini Pro, outperforming multimodal LLMs by 18% to 57% on these tasks.
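As an illustration of how a specialist could tackle a BLINK-style question, the sketch below uses a monocular depth model (MiDaS, loaded via torch.hub) to decide which of two marked points is closer to the camera, as in the relative depth task. MiDaS is used here only as an example specialist; the paper's actual specialist models may differ.

    # Sketch: answering a relative-depth question with a specialist monocular
    # depth model. MiDaS (via torch.hub) is used purely as an illustrative
    # specialist; the benchmark's own specialist models may differ.
    import numpy as np
    import torch
    from PIL import Image

    midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
    midas.eval()
    transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

    def closer_point(image_path, point_a, point_b):
        """Return "A" or "B" for whichever (x, y) point the model deems closer."""
        img = np.array(Image.open(image_path).convert("RGB"))
        with torch.no_grad():
            pred = midas(transform(img))  # relative inverse depth, shape (1, h, w)
            pred = torch.nn.functional.interpolate(
                pred.unsqueeze(1), size=img.shape[:2],
                mode="bicubic", align_corners=False).squeeze()
        # MiDaS predicts inverse depth: larger values mean closer to the camera.
        a_val = pred[point_a[1], point_a[0]].item()
        b_val = pred[point_b[1], point_b[0]].item()
        return "A" if a_val > b_val else "B"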

BLINK Examples with GPT-4V Outputs


We show randomly selected, actual-size examples for the 14 BLINK tasks, with GPT-4V predictions attached.

BibTeX


        @article{fu2024blink,
          title={BLINK: Multimodal Large Language Models Can See but Not Perceive},
          author={Fu, Xingyu and Hu, Yushi and Li, Bangzheng and Feng, Yu and Wang, Haoyu and Lin, Xudong and Roth, Dan and Smith, Noah A and Ma, Wei-Chiu and Krishna, Ranjay},
          journal={arXiv preprint arXiv:2404.12390},
          year={2024}
        }