Dynamic Clue Bottlenecks: Interpretable by Design Visual Question Answering

University of Pennsylvania

Most VQA methods are end-to-end black-box models whose decisions humans struggle to understand. We propose DCLUB (Dynamic Clue Bottlenecks), an interpretable-by-design VQA model that provides faithful, human-readable explanations for visual question answering.


DCLUB first produces visual clues from the image that could hint at an answer, and then decides the answer based solely on those clues.




DCLUB on a VQA example: with explicit steps of visual clue generation (gDCLUB) and entailment-based scoring (fDCLUB) for the final prediction, DCLUB is interpretable by design.

Abstract

→Motivation: The end-to-end design of multimodal models prevents them from being interpretable to humans, undermining trust and applicability in critical domains. While post-hoc rationales offer some insight into model behavior, these explanations are not guaranteed to be faithful to the model. In this paper, we address these shortcomings by introducing an interpretable-by-design model that factors model decisions into intermediate human-legible explanations, allowing people to easily understand why a model fails or succeeds.

→Model: We propose the Dynamic Clue Bottleneck Model (DCLUB), designed as a step toward an inherently interpretable VQA system. DCLUB provides an explainable intermediate space before the VQA decision and is faithful by construction, while maintaining performance comparable to black-box systems. Given a question and an image, DCLUB first returns a set of visual clues: natural-language statements of visually salient evidence from the image. It then generates the output based solely on those clues.
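The two-stage pipeline can be sketched in a few lines of Python. This is an illustrative stub, not the authors' actual implementation: the function names (`generate_clues`, `score_answer`, `dclub_answer`) are hypothetical, the clue generator is stubbed out, and a simple keyword-overlap proxy stands in for the entailment model.

```python
# Hypothetical sketch of DCLUB's two-stage pipeline.
# Names and interfaces are illustrative, not the authors' actual API.

from typing import List


def generate_clues(image, question: str) -> List[str]:
    # Stage 1 (clue generation): a vision-language model returns
    # natural-language statements of visually salient evidence.
    # Stubbed here with a fixed clue for demonstration.
    return ["A red traffic light hangs above the intersection."]


def score_answer(clues: List[str], candidate: str) -> float:
    # Stage 2 (entailment-based scoring): score a candidate answer
    # against the clues alone, with no direct access to the image.
    # A keyword-overlap proxy stands in for a real NLI entailment score.
    text = " ".join(clues).lower()
    return float(candidate.lower() in text)


def dclub_answer(image, question: str, candidates: List[str]) -> str:
    clues = generate_clues(image, question)
    # The final answer depends only on the clues, so the intermediate
    # explanation is faithful by construction.
    return max(candidates, key=lambda a: score_answer(clues, a))
```

Because the answerer never sees the image, the clues are a true bottleneck: inspecting them tells you exactly what evidence the decision rested on.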

→Dataset: To supervise and evaluate the generation of VQA explanations within DCLUB, we collect a dataset of 1.7k questions with visual clues. Evaluations show that our inherently interpretable system can improve by 4.64% over a comparable black-box system on reasoning-focused questions while preserving 99.43% of performance on VQA-v2.

Collected Data


We collect visual clues through crowdsourcing and use them as our training data.

Qualitative Examples


We show randomly selected examples of DCLUB outputs. Human-annotated visual clues are in the grey boxes under the human icon, and DCLUB-generated visual clues are in the colored boxes.

Experiment Results

VQA performance comparison between the black-box (BB) baseline model, BLIP-2, and DCLUB, which uses a BLIP-2 fine-tuned visual clue generator. The right column reports the performance coverage percentage (ours / black-box), showing that DCLUB achieves results comparable to its black-box counterpart on both VQA-v2 and GQA, covering 99.43% and 95.24% of the black-box model's performance respectively, and even reaches a higher accuracy on our collected test set, achieving 104.64% of the black-box model's performance.

Error Analysis

DCLUB helps humans understand when and where the model makes mistakes. We show three common sources of error in DCLUB: fine-grained object attributes, object status recognition, and small-region recognition.


BibTeX

BibTex Code Here