Large multimodal models (LMMs) have evolved from large language models (LLMs) to integrate additional input modalities, such as visual inputs. This integration extends the capabilities of LLMs to tasks requiring visual comprehension and reasoning. However, the extent and limitations of these enhanced abilities are not fully understood. To address this gap, we introduce GlitchBench, a novel benchmark designed to test and evaluate the common-sense reasoning and visual recognition capabilities of large multimodal models. Our dataset is curated from a variety of unusual, infrequent, and glitched scenarios from video game content and aims to challenge both the visual and linguistic reasoning capabilities of LMMs in detecting and interpreting out-of-the-ordinary events and scene compositions.
Ground truth: It is raining inside the building
Question: What is wrong with this image?
GPT-4V: There's nothing inherently "wrong" with this image; however, if you're referring to inconsistencies or peculiarities that might not align with the established Batman canon, the red Batman suit is unusual since Batman is traditionally known in black or very dark colors. The use of red could be an artistic choice for the game, an alternate suit available within the game's universe, or indicative of a different character who takes on the Batman mantle.
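As a concrete illustration of how such a query can be posed to one of the open-source baselines as well, the sketch below feeds a glitched screenshot and the Q2 prompt to LLaVA-1.5 7B through the Hugging Face transformers API. The image path is a placeholder, and the chat template follows the llava-hf model card rather than anything specified in this paper.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Illustrative query to an open-source baseline (LLaVA-1.5 7B); the screenshot
# path is a placeholder and the prompt format follows the llava-hf model card.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("glitch_screenshot.png")  # placeholder path to a benchmark image
prompt = "USER: <image>\nWhat is wrong with this image?\nASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```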
| Question | GPT-4V [1] | LLaVA-1.5 7B [2] | LLaVA-1.5 13B [3] | SPHINX 7B [4] | SPHINX 13B [5] | InstructBLIP 7B [6] | InstructBLIP 13B [7] | OtterHD 8B [8] | Qwen-VL 10B [9] | MiniGPT-v2 7B [10] | Fuyu 8B [11] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Q1. What is unusual about this image? | 57.2 | 35.2 | 36.3 | 19.2 | 25.3 | 25.3 | 21.9 | 24.8 | 21.2 | 19.1 | 8.6 |
| Q2. What is wrong with this image? | 29.5 | 23.9 | 34.7 | 30.9 | 30.5 | 13.8 | 8.9 | 23.3 | 9.3 | 17.9 | 8.4 |
| Average (Q1–Q2) | 43.4 | 29.6 | 35.5 | 25.0 | 27.9 | 19.6 | 15.4 | 24.0 | 15.2 | 18.5 | 8.5 |
| Q3. Describe the image in detail. | 64.9 | 28.0 | 30.5 | 17.5 | 21.9 | 16.0 | 11.8 | 21.6 | 14.0 | 16.0 | 7.6 |
| Maximum Agreement (Q1–Q3) | 64.9 | 35.2 | 36.3 | 30.9 | 30.5 | 25.3 | 21.9 | 24.8 | 21.2 | 19.1 | 8.6 |
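To make the two aggregate rows explicit: Average is the mean of the Q1 and Q2 scores, and Maximum Agreement is the best score across Q1–Q3. Below is a minimal sketch (not the paper's released evaluation code) that reproduces both rows from the per-question numbers; the dictionary simply mirrors the table above, with only two models shown.

```python
# Minimal sketch: derive the "Average" and "Maximum Agreement" rows of the
# table above from the per-question scores (two models shown; the rest follow
# the same pattern). This mirrors the table, not the paper's released code.
scores = {
    "GPT-4V":       {"Q1": 57.2, "Q2": 29.5, "Q3": 64.9},
    "LLaVA-1.5 7B": {"Q1": 35.2, "Q2": 23.9, "Q3": 28.0},
}

for model, s in scores.items():
    average = (s["Q1"] + s["Q2"]) / 2   # "Average" row: mean of Q1 and Q2 only
    max_agreement = max(s.values())     # "Maximum Agreement" row: best of Q1-Q3
    # Matches the table above up to rounding.
    print(f"{model}: average={average:.1f}, max agreement={max_agreement:.1f}")
```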
Ground truth: The character's face mesh is not rendered correctly.
Question: What is wrong with this image?
GPT-4V: The image appears to be from a video game or some form of CGI (computer-generated imagery) with very high-quality rendering. The characters depicted are very lifelike, suggesting a significant level of detail and investment in graphics. However, there's nothing inherently "wrong" with the image in terms of content; it seems to portray a close-up moment between two characters.
Llama-2 (Judge): No, the generated description does not accurately convey the key information and themes of the ground truth. While it acknowledges the high quality of the graphics and the lifelike nature of the characters, it does not address the issue of the character's face mesh not being rendered correctly, which is the main focus of the ground truth.
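The Llama-2 verdict above illustrates the LLM-as-judge protocol: the judge sees the ground-truth glitch description and the LMM's answer, and decides whether the answer conveys the same key information. The sketch below shows one way such a judge could be implemented with Hugging Face transformers; the checkpoint name and the prompt wording are illustrative assumptions and may differ from the exact setup used in GlitchBench.

```python
from transformers import pipeline

# Hypothetical judge setup: the checkpoint and prompt wording are illustrative
# assumptions, not necessarily those used by GlitchBench.
judge_llm = pipeline("text-generation", model="meta-llama/Llama-2-70b-chat-hf")

JUDGE_TEMPLATE = (
    "Ground truth: {ground_truth}\n"
    "Generated description: {answer}\n\n"
    "Does the generated description accurately convey the key information "
    "and themes of the ground truth? Answer Yes or No, then explain briefly."
)

def judge(ground_truth: str, answer: str) -> bool:
    """Return True if the judge LLM accepts the model's answer."""
    prompt = JUDGE_TEMPLATE.format(ground_truth=ground_truth, answer=answer)
    completion = judge_llm(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"]
    verdict = completion[len(prompt):]  # keep only the newly generated text
    return verdict.strip().lower().startswith("yes")
```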
Ground truth: The left hand of the basketball player is in an unusual position and is stretched backward.
Question: What is unusual about this image?
GPT-4V: The image appears to be a screenshot from a basketball video game, possibly from the NBA 2K series. The unusual aspects of this image include: One of the players has a bright orange beard that stands out from the rest. Another player seems to be kissing or blowing at the basketball, which is not a typical basketball move or gesture. The characters and graphics look computer-generated, which indicates it's from a video game rather than a real-life game. The interface at the bottom of the image shows various controls and buttons, which further suggests that it's from a video game.
Llama-2 (Judge): No, the generated description does not accurately convey the key information and themes of the ground truth. The ground truth focuses on the unusual position of the basketball player's left hand, while the generated description highlights three unusual aspects of the image, none of which relate to the player's hand position.