An example prompt in Commonsense-T2I and failure cases from DALL-E 3 (Betker et al., 2023), Stable Diffusion XL (Rombach et al., 2022), Openjourney v4, and Playground v2.5 (Li et al., 2024). The expected output for the prompt is “The lightbulb is unlit”.

What is Commonsense-T2I?

We present a novel task and benchmark for evaluating the ability of text-to-image (T2I) generation models to produce images that align with real-life commonsense, which we call Commonsense-T2I. Commonsense-T2I presents an adversarial challenge, providing pairwise text prompts along with expected outputs.
  • Given two adversarial text prompts containing an identical set of action words with minor differences, such as "a lightbulb without electricity" vs. "a lightbulb with electricity", we evaluate whether T2I models can conduct visual commonsense reasoning, e.g., produce images that fit "The lightbulb is unlit" vs. "The lightbulb is lit", respectively.

  • The dataset is carefully hand-curated by experts and annotated with fine-grained labels, such as the commonsense type and the likelihood of the expected outputs, to assist in analyzing model behavior (a hypothetical entry format is sketched after this list). We benchmark a variety of state-of-the-art (sota) T2I models and surprisingly find that there is still a large gap between image synthesis and real-life photos: even the DALL-E 3 model only achieves 48.92% accuracy on Commonsense-T2I, and the Stable Diffusion XL model only achieves 24.92%.

  • Our experiments show that GPT-enriched prompts cannot solve this challenge, and we include a detailed analysis of possible reasons for this deficiency.

One example from Commonsense-T2I, where P1, P2 are pairwise prompts; D1, D2 are descriptions for expected output images.

  • We aim for Commonsense-T2I to serve as a high-quality evaluation benchmark for T2I commonsense checking, fostering advancements in real-life image generation.
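For concreteness, a single Commonsense-T2I entry can be thought of as a record bundling the pairwise prompts, the descriptions of the expected outputs, and the fine-grained labels mentioned above. The field names, category string, and likelihood value below are hypothetical illustrations built around the lightbulb example on this page, not the released schema.

    # Hypothetical sketch of one Commonsense-T2I entry. Field names, the
    # category string, and the likelihood value are illustrative placeholders;
    # the released dataset may use a different schema.
    example_entry = {
        "P1": "a lightbulb without electricity",
        "P2": "a lightbulb with electricity",
        "D1": "The lightbulb is unlit",      # expected output for P1
        "D2": "The lightbulb is lit",        # expected output for P2
        "commonsense_type": "physical law",  # placeholder category name
        "likelihood": "high",                # placeholder likelihood label
    }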

Evaluation Pipeline

The evaluation pipeline for Commonsense-T2I. P1, P2 (text prompts) and D1, D2 (descriptions of the expected outputs) are provided by Commonsense-T2I, while I1, I2 are the generated images. As shown in the figure, a data example counts as correct only if both of the pairwise prompts are generated correctly.
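To make this scoring rule concrete, below is a minimal sketch of pairwise accuracy under a generic evaluator. The names generate and image_matches are placeholders for the T2I model under test and the chosen evaluator (e.g., a multimodal LLM or CLIP-based check); they are assumptions for illustration, not APIs from the paper.

    # Minimal sketch of the pairwise scoring rule: an example counts as correct
    # only if BOTH prompts in the pair are rendered consistently with their
    # expected descriptions. `generate` and `image_matches` are assumed callables.
    def pairwise_accuracy(examples, generate, image_matches):
        correct = 0
        for ex in examples:
            img1 = generate(ex["P1"])  # image I1 for prompt P1
            img2 = generate(ex["P2"])  # image I2 for prompt P2
            if image_matches(img1, ex["D1"]) and image_matches(img2, ex["D2"]):
                correct += 1
        return correct / len(examples) if examples else 0.0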

Qualitative Results

We show error cases of DALL-E 3 and of DALL-E 3 w/o revision, which turns off the GPT revision of prompts. Input prompts and expected outputs are shown in the green box. DALL-E 3 images are generated with the revised prompts returned by DALL-E 3 by default, and DALL-E 3 w/o revision images are generated with the original prompt. The highlighted sentences are (partially) correct expected-output descriptions that appear in the revised prompts but are not illustrated in the output images.

Quantitative Results

Main results on the Commonsense-T2I challenge set. The first column shows the T2I models that we evaluate, and the first row shows the evaluator choices. The best-performing model under each evaluator is DALL-E 3.

Experiment Analysis


Are T2I models limited by text embedding?

Since all the Stable Diffusion (based) T2I models score under 35% accuracy on Commonsense-T2I, we investigate a possible reason behind this phenomenon: these models might be biased by the text embeddings of the prompts. The motivation is as follows: if the embeddings of P1 and P2, which are the inputs to the T2I models, are very similar, then they could lead the T2I models to generate similar images for P1 and P2, while the expected outputs should differ. We deploy the CLIP ViT-L/14 encoder (Radford et al., 2021), which is the default text encoder for Stable Diffusion (based) models, to encode the pairwise prompts P1 and P2 in Commonsense-T2I. We compare the similarity between the CLIP embeddings of P1 and P2 against the performance score, as shown in Figure 5. Notice that we adopt min-max normalization to project the embedding similarity values into [0,1].

As the CLIP embedding similarity of prompts P1 and P2 goes up, the human-evaluated performance scores go down. This suggests that T2I models perform badly when their text encoders fail to differentiate between P1 and P2.
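As a rough illustration of this analysis, the pairwise prompt similarity can be computed with the CLIP ViT-L/14 text encoder from Hugging Face Transformers. The checkpoint name, the use of the pooled embedding, and the normalization helper below are assumptions made for the sketch, not the paper's released code.

    # Sketch: cosine similarity between CLIP ViT-L/14 text embeddings of P1 and P2,
    # followed by min-max normalization of the similarities into [0, 1].
    # Checkpoint name and pooled-embedding choice are assumptions.
    import torch
    from transformers import CLIPTextModel, CLIPTokenizer

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

    def prompt_similarity(p1: str, p2: str) -> float:
        """Cosine similarity between pooled CLIP text embeddings of two prompts."""
        inputs = tokenizer([p1, p2], padding=True, return_tensors="pt")
        with torch.no_grad():
            emb = text_encoder(**inputs).pooler_output  # shape (2, hidden_dim)
        emb = torch.nn.functional.normalize(emb, dim=-1)
        return float(emb[0] @ emb[1])

    def min_max_normalize(values):
        """Project a list of similarity values into [0, 1]."""
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values] if hi > lo else [0.0] * len(values)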


More Examples


We show randomly selected error examples for the models tested on Commonsense-T2I; find more in the visualizations!

Error cases of Stable Diffusion XL and Playground v2.5. The prompt and expected output description are provided in the green box for each example.

BibTeX


        @article{fu2024commonsenseT2I,
          title   = {Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?},
          author  = {Xingyu Fu and Muyu He and Yujie Lu and William Yang Wang and Dan Roth},
          journal = {arXiv preprint arXiv:2406.07546},
          year    = {2024}
        }