Large multimodal models (LMMs) have evolved from large language models (LLMs) to integrate additional input modalities, such as visual inputs. This integration augments the capacity of LLMs in tasks requiring visual comprehension and reasoning. However, the extent and limitations of these enhanced abilities are not fully understood. To address this gap, we introduce GlitchBench, a novel benchmark designed to evaluate the common-sense reasoning and visual recognition capabilities of large multimodal models. Our dataset is curated from a variety of unusual, infrequent, and glitched scenarios in video game content and challenges both the visual and linguistic reasoning capabilities of LMMs in detecting and interpreting out-of-the-ordinary events and scene composition.
Example glitches from GlitchBench: a person stuck in a piece of furniture; two people driving an invisible car; a rifle floating in the air; a person floating in the air; the gun in the hand is missing; the table has a placeholder texture.
Ground truth: It is raining inside the building
Question: What is wrong with this image?
GPT-4V: There's nothing inherently "wrong" with this image; however, if you're referring to inconsistencies or peculiarities that might not align with the established Batman canon, the red Batman suit is unusual since Batman is traditionally known in black or very dark colors. The use of red could be an artistic choice for the game, an alternate suit available within the game's universe, or indicative of a different character who takes on the Batman mantle.
The screenshot shows rain falling inside a room. Although the indoor rain is the glitch, GPT-4V fails to reason about it and instead focuses on the color of Batman's costume. Note that the ground truth is never included in the prompt in our study.
Performance of various LMMs on GlitchBench. The Average row reports the mean of Q1 and Q2, which constitute the main results of the benchmark. Q3 serves as a visual perception test that measures the ability of models to report glitches in a more relaxed setting. The Maximum Agreement row shows, for each model, the highest agreement with the ground truth achievable across the three questions, as judged by Llama-2 (%).
Question | GPT-4V [1] | LLaVA-1.5 7B [2] | LLaVA-1.5 13B [3] | SPHINX 7B [4] | SPHINX 13B [5] | InstructBLIP 7B [6] | InstructBLIP 13B [7] | OtterHD 8B [8] | Qwen-VL 10B [9] | MiniGPT-v2 7B [10] | Fuyu 8B [11] |
---|---|---|---|---|---|---|---|---|---|---|---|
Q1. What is unusual about this image? | 57.2 | 35.2 | 36.3 | 19.2 | 25.3 | 25.3 | 21.9 | 24.8 | 21.2 | 19.1 | 8.6 |
Q2. What is wrong with this image? | 29.5 | 23.9 | 34.7 | 30.9 | 30.5 | 13.8 | 8.9 | 23.3 | 9.3 | 17.9 | 8.4 |
Average | 43.4 | 29.6 | 35.5 | 25.0 | 27.9 | 19.6 | 15.4 | 24.0 | 15.2 | 18.5 | 8.5 |
Q3. Describe the image in detail. | 64.9 | 28.0 | 30.5 | 17.5 | 21.9 | 16.0 | 11.8 | 21.6 | 14.0 | 16.0 | 7.6 |
Maximum Agreement | 64.9 | 35.2 | 36.3 | 30.9 | 30.5 | 25.3 | 21.9 | 24.8 | 21.2 | 19.1 | 8.6 |
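Concretely, each model answer is compared with the ground-truth glitch description by a Llama-2 judge, and the resulting per-question accuracies feed the Average and Maximum Agreement rows above. The sketch below illustrates one way such judge-based scoring could be wired up; the `query_judge` callable and the judge prompt wording are assumptions for illustration, not the exact implementation or prompts used in the paper.

```python
# Minimal sketch of judge-based scoring, assuming `query_judge` wraps a
# Llama-2 chat endpoint and returns raw text. Prompt wording is illustrative.

JUDGE_TEMPLATE = (
    "Ground truth: {ground_truth}\n"
    "Generated description: {response}\n"
    "Does the generated description accurately convey the key information "
    "and themes of the ground truth? Answer yes or no, then explain."
)


def judge_agrees(query_judge, ground_truth: str, response: str) -> bool:
    """Return True if the judge says the model's response matches the glitch."""
    verdict = query_judge(
        JUDGE_TEMPLATE.format(ground_truth=ground_truth, response=response)
    )
    return verdict.strip().lower().startswith("yes")


def accuracy_per_question(query_judge, samples, responses_by_question):
    """Compute per-question accuracy (%) for one model.

    samples: list of dicts, each with a 'ground_truth' glitch description.
    responses_by_question: {"Q1": [...], "Q2": [...], "Q3": [...]},
        each list aligned with `samples`.
    """
    scores = {}
    for qid, responses in responses_by_question.items():
        hits = sum(
            judge_agrees(query_judge, s["ground_truth"], r)
            for s, r in zip(samples, responses)
        )
        scores[qid] = 100.0 * hits / len(samples)
    return scores


def summarize(scores):
    """Main benchmark number (mean of Q1 and Q2) and maximum agreement."""
    main_result = (scores["Q1"] + scores["Q2"]) / 2
    max_agreement = max(scores.values())
    return main_result, max_agreement
```

Under this scheme, a model's Maximum Agreement is simply its best score among the three questions, which is why it upper-bounds the Q1, Q2, and Q3 columns for each model in the table.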
Ground truth: The character's face mesh is not rendered correctly.
Question: What is wrong with this image?
GPT-4V: The image appears to be from a video game or some form of CGI (computer-generated imagery) with very high-quality rendering. The characters depicted are very lifelike, suggesting a significant level of detail and investment in graphics. However, there's nothing inherently "wrong" with the image in terms of content; it seems to portray a close-up moment between two characters.
Llama-2 (Judge): No, the generated description does not accurately convey the key information and themes of the ground truth. While it acknowledges the high quality of the graphics and the lifelike nature of the characters, it does not address the issue of the character's face mesh not being rendered correctly, which is the main focus of the ground truth.
One of several cases in which GPT-4V fails to detect a problem with facial features.
Ground truth: The left hand of the basketball player is in an unusual position and is stretched backward.
Question: What is unusual about this image?
GPT-4V: The image appears to be a screenshot from a basketball video game, possibly from the NBA 2K series. The unusual aspects of this image include: One of the players has a bright orange beard that stands out from the rest. Another player seems to be kissing or blowing at the basketball, which is not a typical basketball move or gesture. The characters and graphics look computer-generated, which indicates it's from a video game rather than a real-life game. The interface at the bottom of the image shows various controls and buttons, which further suggests that it's from a video game.
Llama-2 (Judge): No, the generated description does not accurately convey the key information and themes of the ground truth. The ground truth focuses on the unusual position of the basketball player's left hand, while the generated description highlights three unusual aspects of the image, none of which relate to the player's hand position.
The image shows a basketball player with an unnatural hand position. GPT-4V overlooks fine-grained details such as body configuration and fails to report the issue.