Abstract

Large multimodal models (LMMs) have evolved from large language models (LLMs) to integrate multiple input modalities, such as visual inputs. This integration augments the capacity of LLMs in tasks requiring visual comprehension and reasoning. However, the extent and limitations of their enhanced abilities are not fully understood. To address this gap, we introduce GlitchBench, a novel benchmark designed to test and evaluate the common-sense reasoning and visual recognition capabilities of large multimodal models. Our dataset is curated from a variety of unusual, infrequent, and glitched scenarios from video game content and aims to challenge both the visual and linguistic reasoning powers of LMMs in detecting and interpreting out-of-the-ordinary events and scene composition.

GPT-4V fails to focus on the rain inside the room

Raining inside the room

Ground truth: It is raining inside the building

Question: What is wrong with this image?

GPT-4V: There's nothing inherently "wrong" with this image; however, if you're referring to inconsistencies or peculiarities that might not align with the established Batman canon, the red Batman suit is unusual since Batman is traditionally known in black or very dark colors. The use of red could be an artistic choice for the game, an alternate suit available within the game's universe, or indicative of a different character who takes on the Batman mantle.

The screenshot shows rain falling inside a room. The indoor rain is what is wrong with the image, yet GPT-4V fails to reason correctly and instead focuses on the color of Batman's costume. Note that the ground truth is never included in the prompt in our study.


Benchmark Results

Performance of various LMMs on GlitchBench. The Average row reports the mean of the Q1 and Q2 results, which is the main result of the benchmark. The Q3 results serve as a visual perception test that measures whether models report glitches under a relaxed prompt that only asks for a description. The Maximum Agreement row shows the maximum agreement achievable with the ground truth, as judged by Llama-2 (%).

| Question | GPT-4V [1] | LLaVA-1.5 7B [2] | LLaVA-1.5 13B [3] | SPHINX 7B [4] | SPHINX 13B [5] | InstructBLIP 7B [6] | InstructBLIP 13B [7] | OtterHD 8B [8] | Qwen-VL 10B [9] | MiniGPT-v2 7B [10] | Fuyu 8B [11] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Q1. What is unusual about this image? | 57.2 | 35.2 | 36.3 | 19.2 | 25.3 | 25.3 | 21.9 | 24.8 | 21.2 | 19.1 | 8.6 |
| Q2. What is wrong with this image? | 29.5 | 23.9 | 34.7 | 30.9 | 30.5 | 13.8 | 8.9 | 23.3 | 9.3 | 17.9 | 8.4 |
| Average (Q1, Q2) | 43.4 | 29.6 | 35.5 | 25.0 | 27.9 | 19.6 | 15.4 | 24.0 | 15.2 | 18.5 | 8.5 |
| Q3. Describe the image in detail. | 64.9 | 28.0 | 30.5 | 17.5 | 21.9 | 16.0 | 11.8 | 21.6 | 14.0 | 16.0 | 7.6 |
| Maximum Agreement | 64.9 | 35.2 | 36.3 | 30.9 | 30.5 | 25.3 | 21.9 | 24.8 | 21.2 | 19.1 | 8.6 |
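
The two summary rows can be recomputed from the per-question scores. The sketch below copies the Q1 to Q3 values from the table and derives the Average as the mean of Q1 and Q2. It also derives Maximum Agreement as the per-model maximum over the three prompts; that reading is an inference from the reported numbers, not something the caption states. The names `MODELS` and `summarize` are illustrative only.

```python
# Scores (%) copied from the results table, in the table's model order.
MODELS = [
    "GPT-4V", "LLaVA-1.5 7B", "LLaVA-1.5 13B", "SPHINX 7B", "SPHINX 13B",
    "InstructBLIP 7B", "InstructBLIP 13B", "OtterHD 8B", "Qwen-VL 10B",
    "MiniGPT-v2 7B", "Fuyu 8B",
]
Q1 = [57.2, 35.2, 36.3, 19.2, 25.3, 25.3, 21.9, 24.8, 21.2, 19.1, 8.6]
Q2 = [29.5, 23.9, 34.7, 30.9, 30.5, 13.8, 8.9, 23.3, 9.3, 17.9, 8.4]
Q3 = [64.9, 28.0, 30.5, 17.5, 21.9, 16.0, 11.8, 21.6, 14.0, 16.0, 7.6]


def summarize(q1, q2, q3):
    """Return (mean of Q1 and Q2, maximum over Q1..Q3) for one model.

    Treating Maximum Agreement as max(Q1, Q2, Q3) is an assumption that
    happens to match every value in the reported row.
    """
    return (q1 + q2) / 2, max(q1, q2, q3)


summary = {
    model: summarize(q1, q2, q3)
    for model, q1, q2, q3 in zip(MODELS, Q1, Q2, Q3)
}

for model, (average, max_agreement) in summary.items():
    print(f"{model:<18} average={average:.2f}  max_agreement={max_agreement:.1f}")
```

The unrounded means (for example 43.35 for GPT-4V) agree with the reported Average row after rounding to one decimal place.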

Qualitative Observations

GPT-4V fails to recognize the absence of facial skin

Character with incorrectly rendered face mesh

Ground truth: The character's face mesh is not rendered correctly.

Question: What is wrong with this image?

GPT-4V: The image appears to be from a video game or some form of CGI (computer-generated imagery) with very high-quality rendering. The characters depicted are very lifelike, suggesting a significant level of detail and investment in graphics. However, there's nothing inherently "wrong" with the image in terms of content; it seems to portray a close-up moment between two characters.

Llama-2 (Judge): No, the generated description does not accurately convey the key information and themes of the ground truth. While it acknowledges the high quality of the graphics and the lifelike nature of the characters, it does not address the issue of the character's face mesh not being rendered correctly, which is the main focus of the ground truth.

One of several cases in which GPT-4V fails to detect a problem with facial features.

GPT-4V fails to detect unnatural body configuration

Basketball player with unusual hand position

Ground truth: The left hand of the basketball player is in an unusual position and is stretched backward.

Question: What is unusual about this image?

GPT-4V: The image appears to be a screenshot from a basketball video game, possibly from the NBA 2K series. The unusual aspects of this image include: One of the players has a bright orange beard that stands out from the rest. Another player seems to be kissing or blowing at the basketball, which is not a typical basketball move or gesture. The characters and graphics look computer-generated, which indicates it's from a video game rather than a real-life game. The interface at the bottom of the image shows various controls and buttons, which further suggests that it's from a video game.

Llama-2 (Judge): No, the generated description does not accurately convey the key information and themes of the ground truth. The ground truth focuses on the unusual position of the basketball player's left hand, while the generated description highlights three unusual aspects of the image, none of which relate to the player's hand position.

The image shows a basketball player with an unnatural hand position. GPT-4V fails to focus on small details such as body configuration and is unable to report this issue.
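
Both judge responses shown above open with an explicit "No". The sketch below shows one way such verdicts could be turned into an agreement percentage, assuming the judge always begins its answer with "Yes" or "No". This page does not describe how the verdicts are actually parsed, so the functions `parse_verdict` and `agreement_percentage` are hypothetical, and counting unparseable responses as disagreement is a design choice made here for illustration.

```python
import re


def parse_verdict(judge_response):
    """Return True for a leading 'Yes', False for a leading 'No', else None.

    Assumes the judge starts its answer with the verdict word; this is an
    illustrative convention, not the benchmark's documented procedure.
    """
    match = re.match(r"\s*(yes|no)\b", judge_response, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"


def agreement_percentage(judge_responses):
    """Percentage of responses judged 'Yes'; unparseable ones count as 'No'."""
    verdicts = [parse_verdict(response) for response in judge_responses]
    if not verdicts:
        return 0.0
    return 100.0 * sum(v is True for v in verdicts) / len(verdicts)


# Usage with made-up judge outputs (not taken from the benchmark).
sample_responses = [
    "Yes, the description mentions the rain inside the room.",
    "No, the generated description does not accurately convey the key information.",
    "No, it focuses on the costume color instead.",
    "The response is unclear.",
]
print(agreement_percentage(sample_responses))
```

Matching on a word boundary keeps a response that starts with "Nothing" or "Note" from being read as "No".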

