Explainable AI: Making Visual Question Answering Systems More Transparent


Presentation Transcript


  1. Explainable AI: Making Visual Question Answering Systems More Transparent Raymond Mooney, Department of Computer Science, University of Texas at Austin, with Nazneen Rajani and Jialin Wu

  2. Explainable AI (XAI) • AI systems are complex and their decisions are based on weighting and combining many sources of evidence. • People, understandably, don’t trust decisions from opaque “black boxes.” • AI systems should be able to “explain” their reasoning process to human users to engender trust.

  3. History of XAI (personalized)

  4. Explainable Rule-Based Expert Systems • MYCIN was one of the early rule-based “expert systems” developed at Stanford in the early 70’s that diagnosed blood infections. • Experiments showed that MYCIN was as accurate as medical experts, but, initially, doctors wouldn’t trust its decisions. • MYCIN was augmented to explain its reasoning by providing a trace of its rule-based inferences that supported its decision.

  5. Sample MYCIN Explanation

  6. Explaining Probabilistic Machine Learning • In the late 90’s I developed a content-based book-recommending system that used a naïve-Bayes bag-of-words text classifier to recommend books (Mooney & Roy, 2000). • It used training examples of rated books supplied by the user and textual information about the book extracted from Amazon. • It could explain its recommendations based on content words that most influenced its probabilistic conclusions.
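As a rough illustration of how such keyword explanations can be derived, the sketch below ranks a book's content words by their naive-Bayes log-likelihood ratio between the "liked" and "disliked" classes; the function names, Laplace smoothing, and toy counts are illustrative assumptions, not the original system's code.

```python
# Illustrative sketch (not the original recommender's code): ranking the
# content words that most influence a naive-Bayes book recommendation.
import math
from collections import Counter

def word_influences(book_words, liked_counts, disliked_counts, vocab_size, alpha=1.0):
    """Score each word in the book by its smoothed log-likelihood ratio
    P(word | liked) / P(word | disliked)."""
    liked_total = sum(liked_counts.values())
    disliked_total = sum(disliked_counts.values())
    scores = {}
    for w in set(book_words):
        p_liked = (liked_counts.get(w, 0) + alpha) / (liked_total + alpha * vocab_size)
        p_disliked = (disliked_counts.get(w, 0) + alpha) / (disliked_total + alpha * vocab_size)
        scores[w] = math.log(p_liked / p_disliked)
    # Words with the largest positive scores are the strongest evidence for
    # recommending the book and can be shown as a keyword explanation.
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy usage with hypothetical word counts from the user's rated books.
liked = Counter({"wizard": 12, "magic": 9, "dragon": 7, "romance": 1})
disliked = Counter({"romance": 10, "vampire": 6, "magic": 2})
print(word_influences(["wizard", "magic", "romance"], liked, disliked, vocab_size=5000)[:3])
```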

  7. Sample Book Recommendation Explanation

  8. Sample Keyword Explanation

  9. Evaluating Recommender Explanations • Herlocker et al. (2000) provided the first experimental evaluation of recommender system explanations. • They showed that some methods for explaining recommendations for a collaborative filtering system were “better” than others. • Explanations were evaluated based on how much they increased the likelihood that a user would agree to adopt a recommendation.

  10. Satisfaction vs. Promotion • In our own evaluation of recommender system explanations (Bilgic & Mooney, 2005), we argued that user satisfaction with an adopted recommendation is more important than just convincing them to adopt it (promotion). • We experimentally evaluated how explanations impacted users' satisfaction with the recommendations they adopted.

  11. Evaluating Satisfaction • We asked users to predict their rating of a recommended book twice. • First, they predicted their rating of a book after just seeing the system’s explanation for why it was recommended. • Second, they predicted their rating again after they had a chance to examine detailed information about the book, i.e. all the info on Amazon about it. • An explanation was deemed “better” if the difference in these two ratings was less, i.e. the explanation allowed users to more accurately predict their final opinion.
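A minimal sketch of this evaluation criterion, with hypothetical rating data: an explanation is better when the rating predicted from the explanation alone is closer to the rating given after examining the full book details.

```python
# Minimal sketch of the satisfaction-style evaluation described above.
# Function names and data are hypothetical.

def mean_rating_gap(explanation_ratings, informed_ratings):
    """Mean absolute gap between explanation-only and fully-informed ratings
    (lower is better: the explanation was less misleading)."""
    gaps = [abs(e, ) if False else abs(e - f) for e, f in zip(explanation_ratings, informed_ratings)]
    return sum(gaps) / len(gaps)

def mean_signed_gap(explanation_ratings, informed_ratings):
    """Signed version: a positive mean indicates systematic over-rating (over-promotion)."""
    return sum(e - f for e, f in zip(explanation_ratings, informed_ratings)) / len(informed_ratings)

print(mean_rating_gap([4, 5, 3], [3, 4, 3]))   # e.g. ~0.67 on a 5-point scale
print(mean_signed_gap([4, 5, 3], [3, 4, 3]))   # e.g. ~0.67: explanation over-promoted
```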

  12. Explanation Evaluation Experiment • We compared three styles of explanation: • NSE: Neighborhood Style, which shows the most similar training examples and was the best at promotion according to Herlocker et al. • KSE: Keyword Style, which was used by our original content-based recommender. • ISE: Influence Style, which presented the training examples that most influenced the recommendation.

  13. Experimental Results • We found that in terms of satisfaction: KSE > ISE > NSE • NSE caused people to "over-rate" recommendations (on a 5-point rating scale).

  14. XAI for Deep Learning

  15. Deep Learning • Recent neural networks have made remarkable progress on challenging AI problems such as object recognition, machine translation, image and video captioning, and game playing (e.g. Go). • They have hundreds of layers and millions of parameters. • Their decisions are based on complex non-linear combinations of many "distributed" input representations (called "embeddings").

  16. Types of Deep Neural Networks (DNNs) • Convolutional neural nets (CNNs) for vision. • Recurrent neural nets (RNNs) for machine translation, speech recognition, and image/video captioning. • Long Short Term Memory (LSTM) • Gated Recurrent Unit (GRU) • Deep reinforcement learning for game playing.

  17. Recent Interest in XAI • The desire to make deep neural networks (and other modern AI systems) more transparent has renewed interest in XAI. • DARPA has a new XAI program that funds 12 diverse teams of researchers to develop more explainable AI systems. • The EU’s new General Data Protection Regulation (GDPR) gives consumers the “Right to Explanation” (Recital 71) when any automated decision is made about them.

  18. XAI for VQA • I am part of a DARPA XAI team focused on making explainable deep-learning systems for Visual Question Answering (VQA) • BBN (prime): Bill Ferguson • GaTech: Dhruv Batra and Devi Parikh • MIT: Antonio Torralba • UT: Ray Mooney

  19. Visual Question Answering (Agrawal et al., 2016) • Answer natural language questions about an image.

  20. VQA Architectures • Most systems are DNNs using both CNNs and RNNs.

  21. Visual Explanation • Recent VQA research shows that deep learning models attend to relevant parts of the image while answering the question (Goyal et al., 2016). • The parts of the image that the model focuses on can be viewed as a visual explanation. • Heat-maps are used to visualize these explanations over the image. On the left is an image from the VQA dataset and on the right is the heat-map overlaid on the image for the question 'What is the man eating?'
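A minimal sketch of producing such an overlay, assuming matplotlib and PIL; the file name and the attention map here are placeholders, not real model output.

```python
# Sketch: overlaying an attention heat-map on the input image.
# "image.jpg" and the heat-map values are placeholders.
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

image = np.array(Image.open("image.jpg"))
heatmap = np.random.rand(14, 14)                      # stand-in for real attention weights
heatmap = np.array(Image.fromarray((heatmap * 255).astype(np.uint8))
                   .resize((image.shape[1], image.shape[0])))  # upsample to image size

plt.imshow(image)
plt.imshow(heatmap, cmap="jet", alpha=0.5)            # semi-transparent overlay
plt.axis("off")
plt.savefig("explanation_overlay.png", bbox_inches="tight")
```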

  22. Generating Visual Explanations • Grad-CAM (Selvaraju et al., 2017) is used to generate heat-map explanations.
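A minimal PyTorch-style sketch of the Grad-CAM computation (the model, target layer, and hook wiring are illustrative): the gradient of the chosen answer's score with respect to a convolutional feature map is global-average-pooled into channel weights, the feature maps are combined with those weights, and a ReLU keeps only positive evidence.

```python
# Minimal PyTorch sketch of Grad-CAM (Selvaraju et al., 2017).
# The model, target layer, and inputs are placeholders.
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, answer_index):
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    scores = model(image)                      # [1, num_answers]
    scores[0, answer_index].backward()         # gradient of the chosen answer's score
    h1.remove(); h2.remove()

    A = activations[0]                         # [1, C, H, W] feature maps
    weights = gradients[0].mean(dim=(2, 3), keepdim=True)   # GAP of gradients -> channel weights
    cam = F.relu((weights * A).sum(dim=1))     # weighted combination, keep positive evidence
    cam = cam / (cam.max() + 1e-8)             # normalize to [0, 1] for visualization
    return cam[0]                              # [H, W] heat-map to upsample over the image
```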

  23. Ensembles • Combining multiple learned models has been a popular and successful approach since the 90’s. • Ensembling VQA systems produces better results (Fukui et al. 2016) but further complicates explaining their results. • Visual explanations can also be ensembled, and it improves explanation quality over those of the individual component models.

  24. Ensembling Visual Explanations • Explain a complex VQA ensemble by ensembling the visual explanations of its component systems. • Average the explanatory heat-maps of systems that agree with the ensemble, weighted by their performance on validation data (Weighted Average, WA). • Can also subtract the explanatory heat-maps of systems that disagree with the ensemble (Penalized Weighted Average, PWA).
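A minimal sketch of the WA and PWA schemes described above, assuming per-model heat-maps of the same size and validation accuracies as weights; names and normalization details are illustrative.

```python
# Sketch of the heat-map ensembling described above (names are illustrative).
import numpy as np

def ensemble_heatmap(heatmaps, val_accuracies, agrees_with_ensemble, penalize=False):
    """Weighted Average (WA): average heat-maps of systems that agree with the
    ensemble answer, weighted by validation accuracy.  Penalized Weighted
    Average (PWA): additionally subtract heat-maps of disagreeing systems."""
    combined = np.zeros_like(heatmaps[0])
    for hm, acc, agrees in zip(heatmaps, val_accuracies, agrees_with_ensemble):
        if agrees:
            combined += acc * hm
        elif penalize:
            combined -= acc * hm
    combined = np.clip(combined, 0.0, None)           # keep only positive evidence
    return combined / (combined.max() + 1e-8)         # normalize for visualization

# Toy usage: three 14x14 heat-maps, two agreeing systems, one disagreeing.
maps = [np.random.rand(14, 14) for _ in range(3)]
print(ensemble_heatmap(maps, [0.62, 0.60, 0.55], [True, True, False], penalize=True).shape)
```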

  25. Sample Ensembling of Visual Explanations Combine heat-maps of MCB and HieCoAtt systems (whose correct answers agreed with the ensemble) with LSTM (whose answer disagreed) to get an improved heat-map for the ensemble.

  26. Evaluating Ensembled Explanations • Crowd-sourced human judges were shown two visual explanations and asked: “Which picture highlights the part of the image that best supports the answer to the question?” • Our ensemble explanation was judged better 63% of the time compared to any individual system’s explanation.

  27. Mechanical Turk Interface for Explanation Comparison

  28. Explanation Comparison Results

  29. Textual Explanations • Generate a natural-language sentence that justifies the answer (Hendricks et al. 2016). • Use an LSTM to generate an explanatory sentence given embeddings of: • image • question • answer • Train this LSTM on human-provided explanatory sentences.
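A minimal PyTorch sketch of such a conditioned decoder, with illustrative feature dimensions rather than the published architecture; at training time it is fed the human explanation tokens (teacher forcing) and trained with cross-entropy against them.

```python
# Illustrative PyTorch sketch (not the authors' exact model): an LSTM that
# generates an explanation conditioned on image, question, and answer embeddings.
import torch
import torch.nn as nn

class ExplanationLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, cond_dim=2048 + 512 + 300, hidden_dim=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(cond_dim, hidden_dim)   # condition the initial LSTM state
        self.init_c = nn.Linear(cond_dim, hidden_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, question_feat, answer_feat, explanation_tokens):
        cond = torch.cat([image_feat, question_feat, answer_feat], dim=-1)
        h0 = torch.tanh(self.init_h(cond)).unsqueeze(0)     # [1, B, H]
        c0 = torch.tanh(self.init_c(cond)).unsqueeze(0)
        emb = self.word_embed(explanation_tokens)           # teacher forcing at train time
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)                             # per-step vocabulary logits
```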

  30. Sample Textual Explanations (Park et al., 2018)

  31. Multimodal Explanations • Combine both a visual and textual explanation to justify an answer to a question (Park et al., 2018).

  32. VQA-X • Using crowdsourcing, Park et al. collected human multimodal explanations for training and testing VQA explanations. Explanation: “Because he is on a snowy hill wearing skis”

  33. Post Hoc Rationalizations • Previous textual and multimodal explanations for VQA (Park et al., 2018) are not “faithful” or “introspective.” • They do not reflect any details of the internal processing of the network or how it actually computed the answer. • They are just trained to mimic human explanations in an attempt to “justify” the answer and get humans to trust it (analogous to “promoting” a recommendation).

  34. Faithful Multimodal Explanations • We are attempting to produce more faithful explanations that actually reflect important aspects of the VQA system’s internal processing. • Focus explanation on including detected objects that are highly attended to during the VQA network’s generation of the answer. • Trained to generate human explanations, but explicitly biased to include references to these objects.

  35. Sample Faithful Multimodal Explanation

  36. VQA with BUTD • To make it more explainable, we use a recent state-of-the-art VQA system, BUTD (Bottom-Up Top-Down) (Anderson et al., 2018). • BUTD first detects a wide range of objects and attributes using detectors trained on Visual Genome data, and attends to them when computing an answer.
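As a rough sketch of the attention step this relies on (illustrative dimensions, not the exact BUTD code): question-guided attention weights are computed over the detected object features, and those weights are what the explanation module later inspects.

```python
# Sketch of question-guided attention over detected object features,
# in the spirit of BUTD (Anderson et al., 2018); dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectAttention(nn.Module):
    def __init__(self, obj_dim=2048, q_dim=512, hidden=512):
        super().__init__()
        self.proj = nn.Linear(obj_dim + q_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, object_feats, question_feat):
        # object_feats: [B, K, obj_dim] for K detected objects; question_feat: [B, q_dim]
        K = object_feats.size(1)
        q = question_feat.unsqueeze(1).expand(-1, K, -1)
        joint = torch.tanh(self.proj(torch.cat([object_feats, q], dim=-1)))
        attn = F.softmax(self.score(joint).squeeze(-1), dim=-1)    # [B, K] attention weights
        attended = (attn.unsqueeze(-1) * object_feats).sum(dim=1)  # [B, obj_dim]
        # The attention weights themselves are what the explanation focuses on.
        return attended, attn
```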

  37. Using Visual Segmentations • We use recent methods that provide detailed image segmentations for VQA (VQS; Gan et al., 2017). • These provide more precise visual information than BUTD’s bounding boxes.

  38. High-Level VQA Architecture

  39. Detecting Explanatory Phrases • We first extract frequent phrases about objects (2-5 words ending in a common noun) appearing in human explanations. • We train visual detectors to identify relevant explanatory phrases from the detected image segmentations for an image.
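A minimal sketch of the phrase-mining step, assuming NLTK tokenization and POS tagging; the length bounds, tag set, and frequency threshold are illustrative assumptions.

```python
# Sketch: mining frequent 2-5 word phrases that end in a common noun from
# human explanation sentences (thresholds and details are assumptions).
from collections import Counter
import nltk  # assumes the 'punkt' and POS-tagger models are downloaded

def mine_explanatory_phrases(explanations, min_count=5):
    counts = Counter()
    for sentence in explanations:
        tokens = nltk.word_tokenize(sentence.lower())
        tags = nltk.pos_tag(tokens)
        for i in range(len(tokens)):
            for n in range(2, 6):                        # phrase lengths 2..5
                if i + n > len(tokens):
                    break
                if tags[i + n - 1][1] in ("NN", "NNS"):  # must end in a common noun
                    counts[" ".join(tokens[i:i + n])] += 1
    return [p for p, c in counts.items() if c >= min_count]

phrases = mine_explanatory_phrases(["because he is on a snowy hill wearing skis"] * 6)
print(phrases)  # e.g. includes "a snowy hill" and "wearing skis"
```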

  40. High-Level System Architecture

  41. Textual Explanation Generator • We finally train an LSTM to generate an explanatory sentence from embeddings of the segmented objects and the detected phrases. • Trained on VQA-X data to produce human-like textual explanations. • Trained to encourage the explanation to cover the segments highly attended to by VQA to make it faithfully reflect the focus of the network that computed the answer.
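A minimal sketch of such a training objective, under the assumption that it combines the usual cross-entropy on human explanations with a term penalizing explanations whose attention never covers the segments the VQA module attended to most; the exact form and weighting here are assumptions, not the published loss.

```python
# Sketch (assumptions, not the exact published loss): cross-entropy on human
# explanations plus a term encouraging the explanation's attention to cover
# the image segments most attended to by the VQA answering module.
import torch
import torch.nn.functional as F

def explanation_loss(word_logits, target_tokens, expl_attention, vqa_attention, lam=1.0):
    # word_logits: [B, T, V]; target_tokens: [B, T]
    ce = F.cross_entropy(word_logits.flatten(0, 1), target_tokens.flatten())
    # expl_attention: [B, T, K] attention of the explanation LSTM over K segments;
    # vqa_attention:  [B, K]    attention of the VQA module over the same segments.
    coverage = expl_attention.max(dim=1).values              # [B, K] max use of each segment
    cover_loss = (vqa_attention * (1.0 - coverage)).sum(dim=-1).mean()
    return ce + lam * cover_loss
```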

  42. Multimodal Explanation Generator • Words generated while attending to a particular visual segment are highlighted and linked to the corresponding segmentation in the visual explanation by depicting them both in the same color.
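A minimal sketch of the word-to-segment linking, assuming each generated word is assigned the segment with the highest attention weight at that time step and left uncolored below a threshold; the threshold and data are illustrative.

```python
# Sketch: linking each generated word to its most-attended image segment so
# word and segment can be drawn in the same color (illustrative only).
def link_words_to_segments(words, expl_attention, threshold=0.3):
    """words: list of T generated words; expl_attention: [T][K] attention over K segments.
    Returns (word, segment_index or None) pairs; low-attention words stay uncolored."""
    links = []
    for t, word in enumerate(words):
        weights = expl_attention[t]
        k = max(range(len(weights)), key=lambda i: weights[i])
        links.append((word, k if weights[k] >= threshold else None))
    return links

print(link_words_to_segments(["a", "snowy", "hill"],
                             [[0.1, 0.1], [0.8, 0.1], [0.7, 0.2]]))
# -> [('a', None), ('snowy', 0), ('hill', 0)]
```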

  43. High-Level System Architecture

  44. Sample Explanation

  45. Sample Explanation

  46. Sample Explanation

  47. Evaluating Textual Explanations • Compare system explanations to “gold standard” human explanations using standard machine translation metrics for judging sentence similarity. • Ask human judges on Mechanical Turk to compare the system explanation to the human explanation and judge which is better (allowing for ties). • Report the percentage of the time the algorithm beats or ties the human.
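For the automated comparison, a minimal sketch using NLTK's sentence-level BLEU, one of the standard machine-translation metrics; other caption metrics such as METEOR or CIDEr are often reported as well. The example sentences are placeholders.

```python
# Sketch: scoring a generated explanation against the human "gold" explanation
# with sentence-level BLEU (one of the standard MT-style metrics).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["he is on a snowy hill wearing skis".split()]   # human explanation(s)
candidate = "he is wearing skis on a snowy hill".split()     # system explanation

smooth = SmoothingFunction().method1
score = sentence_bleu(reference, candidate, smoothing_function=smooth)
print(f"BLEU-4: {score:.3f}")
```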

  48. Textual Evaluation Results (automated metrics and human evaluation)

  49. Evaluating Multimodal Explanations • Ask human judges on Mechanical Turk to qualitatively evaluate the final multimodal explanations by answering two questions: • “How well do the highlighted image regions support the answer to the question?” • “How well do the colored image segments highlight the appropriate regions for the corresponding colored words in the explanation?”

  50. Multimodal Explanation Results
