Using Visual Question Answering to Generate Issue-Sensitive Captions

May 6, 2020


In the past two years, controlling generative models (such as GANs and VAEs) has been widely studied in the Computer Vision literature. The idea is that once one of these large-capacity neural network models learns the data manifold of millions of images, it has internalized some knowledge about the world. The “knowledge” learned is often in the form of abstract concepts such as “skin tone” and “hair style” for faces, or “brightness” and “rotation angle” for general images. We can then explicitly control images generated from a GAN or VAE with respect to these abstract concepts through post-training manipulations.

Figure 1: (a) from OpenAI GLOW demo; (b) from ICLR 2020 paper: On the "steerability" of generative adversarial networks.

Despite such impressive results, controlling generative models for text has been far more difficult. Limited success has been achieved for properties like sentiment and tense (Hu et al.1). Further work from Lample et al.2 leveraged the domain where the data were collected to add additional attributes such as gender, race, age, and interests (music, books, movies, etc.), and trained a neural network from scratch conditioned on these attributes. However, for vagueness, politeness, genericness, sarcasm, humor, and formality, we have made little progress in developing neural-network-based decoding methods to control these aspects of text.

In this post, we will focus on one: the Question-under-Discussion (QuD). When we communicate with each other, there is often an implicit question under discussion (hence the name). For example, when two strangers meet, the QuD is often “tell me something about yourself”, so people often start with a self-introduction. However, instead of listing all the facts about themselves (starting from age and height), people usually choose to communicate the facts most relevant to the situation. On a blind date, people talk about their interests, income, attitudes towards starting a family, dating preferences, etc. In a business meeting, people talk about their professional history.

The QuD model was formalized in 1984 as a model of discourse. It emphasizes the fact that we communicate the state of the world only up to a partition. The partition often refers to the context (or, in this case, the QuD that specifies the context).

As a general discourse model, the QuD model goes beyond simple conversation. It suggests that for every sentence we say or write, there is a corresponding underlying question, even for paragraphs of text that we write. The QuD model turns any paragraph of text into a self-conversation in the writer’s head.

Here is an example. Suppose I start with the sentence “A few of us arrive at the classroom and the door is locked.” What should I write next? As you can see below, the QuD dictates how I write my next sentence: it decides what’s relevant in this situation and what’s not.

| QuD | Paragraph (next sentence in bold) |
| --- | --- |
| Why is the room locked? | A few of us arrive at the classroom and the door is locked. **We realize the teacher has not arrived.** |
| Who has the key? | A few of us arrive at the classroom and the door is locked. **None of us has the key.** |
| Does Stefanie have the key? | A few of us arrive at the classroom and the door is locked. **Stefanie does not have the key.** |
| Is there another class in the room? | A few of us arrive at the classroom and the door is locked. **Through the window, the room appears to be empty.** |

Even though all of the next sentences are direct follow-ups to the previous sentence (often referred to as the “context” or “scenario”), they are all quite different, because each addresses a different QuD. In other words, if our generative model for text is QuD-sensitive, we can control text generation by swapping QuDs.

Maybe you can already see what we want to do: an image can serve as the scenario/context, questions from VQA (a Visual Question Answering dataset) can serve as QuDs, and the sentences generated by the image captioner are the “next sentence” we want to generate. By changing the question, our method changes the generated caption. Cool, huh? There is no attribute classifier and no additional annotated training data: just a question and an image.

If we are able to achieve this (using a QuD, i.e., a natural language question, to ask the model to focus on different features of the image), we open the door to truly flexible, controllable text generation. For the same image, I can ask “Can the man run far?”, and the caption can be “This is a runner who jogs through the entire park.” Or I can ask “Does this man look tired?”, and the caption can be “The man is energetically running.”

So the idea seems fun, but is this technologically possible? Remember, there is no additional training involved. How can an image captioner that is only trained on an image captioning dataset be sensitive to questions from a VQA dataset? To answer this question, we need to rethink what a question is and introduce the Rational Speech Act (RSA) framework. A more technical overview is given in my new paper with Reuben Cohn-Gordon and Chris Potts3.

QuD in Image Captioning

Figure 2: Examples from applying our method to a SoTA image captioner (a 6-layer encoder-decoder Transformer with 55M parameters) trained on the MS COCO dataset. The last column (base caption) shows what the Transformer originally would have output. The penultimate column (issue-sensitive caption) shows what our method produces: captions that try to address a question.

VQA-style questions focus on three things: objects, attributes of objects, and relationships between objects. If we think more carefully about a useful representation of an image (say, if we want to describe an image to someone who can’t see it), we would describe the first picture in Figure 2 as: “There’s grass on the ground. The photo is captured through a net. There is a guy wearing an orange jersey and white pants, and he’s throwing a baseball. This guy is a pitcher. He’s standing on a field.” We can write this more structurally as a list, where we list every object, every attribute of each object (color, weight, texture, etc.), and its relationships to every other object in the picture. Needless to say, this list will be incredibly long, and I personally doubt it can ever be “complete”.

This is also how we write captions for an image: we write down the small number of features in the image that catch our attention. That is to say, human captions attempt to address an indeterminate question (or “all” questions). The captioning system learns to map features in an image to a grammatical sequence of words. To change the caption with respect to a question, we want the caption to reflect the particular feature the question is seeking. If the question is “What position is this man playing?”, then the caption needs to mention the feature that is this guy’s position (i.e., “pitcher”). We call this Issue-Sensitive Image Captioning (ISIC) because we want the generated caption to resolve an issue (question).

We show some preliminary success in Figure 2. By asking a question such as “What color is the sky?”, we control the image captioner to focus on a particular aspect of the photo (that it is a black-and-white photo). Instead of describing the scene in the picture (“airplane taking off”), the caption model describes meta-level information about the picture (“a black and white photo of an airplane”).

Examining controllable text generation in a multi-modal domain (image, text) allows us to clearly define the question (through a pre-collected VQA dataset). Since image datasets are often densely annotated (with object information, attributes of objects, etc.), we can automatically evaluate whether our generated caption addresses the question (from here on, we refer to this as “resolving the issue”). However, even with dense annotations, evaluating issue resolution is still a major challenge. We evaluate our proposed method on the restricted domain of bird images (CUB-2011), but we also show some promising examples on a more general dataset, MS COCO (Figure 2).

Controlling via RSA: A Simple Tour

In the previous section, we established that in order to be question-aware, a caption needs to describe the feature in the image that answers the question. We can then think of making our model generate question-aware captions as the process of picking out the right features to answer the question. This is where RSA comes in.

The Rational Speech Act (RSA) framework is a family of Bayesian models of cognition and language. It turns out that a simple model in this family is highly relevant for machine learning tasks, since it captures the idea of referential language generation (which I describe below). Computing this model requires only simple matrix operations, which makes it a modular component that can be applied to almost any large neural network model.

Source: New Yorker; University of Michigan, Law School.

Here’s the intuition for the human thought process RSA tries to capture. If you were asked to say a word to the police so that they can pick out number 5 (but you are not allowed to say numbers), you probably would not say “a man,” nor would you say “khaki pants.” The best word you can pick (assuming you are rational) is “baseball cap,” because it uniquely identifies number 5. RSA was created to emulate this thought process: given a group of images, you want to pick the word that best represents the target, i.e., that distinguishes the target image from the rest.

The computational process for the simplest RSA model can be carried out with Bayes’ rule: probability matrix normalization. In this formulation, a simplified version of our pre-trained image captioner is $S_0(\mathbf{w} \vert \mathbf{i})$, a row-stochastic probability matrix (probabilities in each row sum to 1) where the rows are images, represented as lists of features, and the columns are the features we can pick to represent an image. How is this a captioning model? Well, you can imagine a captioning model that only produces one word (represented as an emoji in our schema).

Figure 3: A simple RSA computing process.

Without considering the goal of contrasting with the other two images, for the first image we would choose either baseball cap or mountain to describe it. However, if we want to uniquely identify this image (to a police officer, or to anyone who’s “listening”) against the other two images, we realize that the other two rows also have mountains. What’s unique about the first row is the baseball cap.

The process described in Figure 3 is extremely simple, but it captures the process of reasoning about plausible alternatives and then deciding what to select to achieve the most unambiguous outcome. We can see from the bold numbers that $S_1$ will choose baseball cap to describe the first image, contrasting it with the other two.

More formally, these two computations can be described as:

$$
\begin{aligned}
L_1(\mathbf{i}\vert\mathbf{w}) &= \frac{S_0(\mathbf{w}\vert\mathbf{i})\, P(\mathbf{i})}{P(\mathbf{w})} = \frac{S_0(\mathbf{w}\vert\mathbf{i})\, P(\mathbf{i})}{\sum_{\mathbf{i}' \in \mathcal{I}} S_0(\mathbf{w}\vert\mathbf{i}')\, P(\mathbf{i}')} \\
S_1(\mathbf{w}\vert\mathbf{i}) &= \frac{L_1(\mathbf{i}\vert\mathbf{w})\, P(\mathbf{w})}{P(\mathbf{i})} = \frac{L_1(\mathbf{i}\vert\mathbf{w})\, P(\mathbf{w})}{\sum_{\mathbf{w}' \in \mathcal{V}} L_1(\mathbf{i}\vert\mathbf{w}')\, P(\mathbf{w}')}
\end{aligned}
$$

In RSA books/papers, you often see the simplified version (skipping the normalization constant):

$$
\begin{aligned}
L_1(\mathbf{i}\vert\mathbf{w}) &\propto S_0(\mathbf{w}\vert\mathbf{i})\, P(\mathbf{i}) \\
S_1(\mathbf{w}\vert\mathbf{i}) &\propto L_1(\mathbf{i}\vert\mathbf{w})\, P(\mathbf{w})
\end{aligned}
$$
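Computationally, these two steps are just alternating column and row normalizations of the $S_0$ matrix. Here is a minimal NumPy sketch, with made-up numbers mirroring the Figure 3 setup and uniform priors over images and words:

```python
import numpy as np

# Hypothetical S0: rows are images, columns are one-word "captions".
# Columns: [baseball cap, mountain]. Row 0 is the target image.
S0 = np.array([
    [0.5, 0.5],  # target: wears a baseball cap, mountain in background
    [0.0, 1.0],  # distractor: mountain only
    [0.0, 1.0],  # distractor: mountain only
])

# L1: Bayes rule with a uniform prior over images -> normalize each column.
L1 = S0 / S0.sum(axis=0, keepdims=True)

# S1: uniform prior over words -> normalize each row of L1.
S1 = L1 / L1.sum(axis=1, keepdims=True)

print(S1[0])  # [0.8333..., 0.1666...]: S1 now prefers "baseball cap"
```

With these numbers, "mountain" loses probability for the target image precisely because the distractor rows also contain it, which is the contrastive reasoning described above.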

A more detailed tutorial on RSA can be found here.

RSA + VQA: Is Our Caption Question-Aware?

Equipped with this knowledge of RSA, all we need to do to achieve our goal is pick out a set of images such that, for the target image, the caption we produce contains the feature that answers the question. We now describe a procedure for selecting a set of images so that our caption model, combined with RSA re-weighting, puts high probability on features that answer the question.

Given a target image and a question $(\mathbf{i}, \mathbf{q})$, we can directly apply any pre-trained VQA model to get an answer. Given many images and the same question, we get many answers. Some of these answers differ from the answer for the target image; some are the same. For example, with our target image {blue baseball cap, mountain}, we can ask the question “Does the person wear a baseball cap?”, where the answer is Yes or No.

Given this answer, we can partition a list of images into two groups: images where the person is wearing a baseball cap, and images where the person is NOT wearing a baseball cap. Note that the question can be extremely general and ask about various aspects of the image.
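This partitioning step is straightforward to sketch in code, assuming a hypothetical `vqa_model(image, question)` callable that returns an answer string (any pre-trained VQA model would do):

```python
from collections import defaultdict

def partition_by_answer(images, question, vqa_model):
    """Group images into cells keyed by the VQA model's answer."""
    cells = defaultdict(list)
    for img in images:
        cells[vqa_model(img, question)].append(img)
    return cells

# Usage sketch: the target's cell is cells[vqa_model(target_image, question)],
# and every other cell contains the distractors.
```

Note that the cell structure depends only on the question and the answers, so the same pool of images yields a different partition for every question we ask.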

Suppose we happen to select these 6 images from a larger pool. The first three images have a baseball cap; the next three do not. Can we just naively apply RSA and hope the item we pick will be about the baseball cap? Unfortunately, no. Before RSA, the caption model randomly chooses between baseball cap and mountain. After RSA, it chooses mountain, which is worse.

Figure 5: Directly applying RSA is NOT question-aware (or issue-sensitive, as defined in our paper).

OK, we ran into a problem: even though our VQA model partitioned the 6 images into two cells (top 3 rows and bottom 3 rows), the RSA computation is unaware of this cell structure. What it does is treat all 5 other images (rows) as distractors and find what’s unique about the target image (first row) against all the rest, which is the mountain.

Luckily, a solution has already been worked out by Kao et al.4 The idea is pretty simple: why not just add up the probabilities within the cell (down each column) after computing the $L_1$ probability matrix? More formally, this corresponds to a different $S_1$:

$$
\begin{aligned}
U_1^{\mathbf{C}}(\mathbf{i}, \mathbf{w}, \mathbf{C}) &= \log \Big( \sum_{\mathbf{i}' \in \mathcal{I}} \delta_{\mathbf{C}(\mathbf{i})=\mathbf{C}(\mathbf{i}')}\, L_1(\mathbf{i}'\vert\mathbf{w}) \Big) \\
S_1^{\mathbf{C}}(\mathbf{w} \vert \mathbf{i}, \mathbf{C}) &\propto \exp \big(\alpha\, U_1^{\mathbf{C}}(\mathbf{i}, \mathbf{w}, \mathbf{C}) - \text{cost}(\mathbf{w}) \big)
\end{aligned}
$$

This formula replaces the pragmatic listener matrix $L_1$ with an informativity utility $U_1^{\mathbf{C}}$, and computes the $S_1$ probability matrix proportional to it. In the RSA literature, this is often referred to as QuD-RSA (QuD: Question-under-Discussion). If we visualize this process with actual probability numbers, here’s the result:

Figure 6: We show the computational process of QuD-RSA.

As you can see, what we really did is just sum within the cell down each column of the original $L_1$ matrix, and then normalize over the row for the target image. This makes our RSA output issue-sensitive (question-aware). But wait, wait, wait, something is not right here! If you look closely at the $S_1^\mathbf{C}$ table, you’ll realize that what the RSA picked out is still wrong: it would randomly choose between skiing and mountain. What the heck is going on?

So, what QuD-RSA actually does is create an “equivalence class” among all images in the same cell. It converts the original objective from “pick an item that best describes the target image” to “pick an item that best describes the target cell.” QuD-RSA is designed to ignore the differences between images within a cell. Since picking mountain or skiing (neither of which appears in the distractor images) already best identifies the target cell, there is no additional incentive to pick baseball cap. Suffice it to say, this is not what we want.
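We can reproduce this failure mode with a toy QuD-RSA computation in NumPy. All numbers here are made up, and the feature set is purely illustrative:

```python
import numpy as np

# Hypothetical S0: 6 images x 4 one-word captions
# (columns: [baseball cap, mountain, skiing, beach]; row 0 is the target).
S0 = np.array([
    [1/3, 1/3, 1/3, 0.0],  # target: cap, mountain, skiing
    [1.0, 0.0, 0.0, 0.0],  # same cell (also has a cap)
    [1.0, 0.0, 0.0, 0.0],  # same cell (also has a cap)
    [0.0, 0.0, 0.0, 1.0],  # other cell (no cap)
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])
C = np.array([0, 0, 0, 1, 1, 1])  # cell ids from the VQA answer

L1 = S0 / S0.sum(axis=0, keepdims=True)               # column-normalize
same_cell = (C[:, None] == C[None, :]).astype(float)  # 6x6 cell indicator
U1 = np.log(same_cell @ L1 + 1e-12)                   # sum L1 within cells
alpha = 1.0
S1C = np.exp(alpha * U1)                              # cost(w) = 0 here
S1C = S1C / S1C.sum(axis=1, keepdims=True)
print(S1C[0])  # cap, mountain, and skiing tie: no pressure to prefer cap
```

In this toy setup the tie is three-way, but the point is the same: any feature that identifies the target cell scores equally well, so nothing pushes $S_1^{\mathbf{C}}$ towards the feature the question is actually about.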

The last ingredient adds a pressure on $S_1$ to select items that are shared among all images within the cell. Intuitively, since all images within the target cell share the same answer to the VQA question, whatever attribute/object in the image licenses that answer should have a higher chance of appearing in the resulting caption. The ingredient we choose to add is information entropy: a flatter (more uniform) distribution has higher entropy, and a peakier distribution has lower entropy. Since baseball cap is shared among all three images in the target cell, its within-cell distribution is the flattest and therefore has the highest entropy.

More formally, we combine both $U_1$ and $U_2$, with a balancing hyper-parameter $\beta \in [0, 1]$ that decides how much weight we put on each utility:

$$
\begin{aligned}
U_2(\mathbf{i}, \mathbf{w}, \mathbf{C}) &= H\big(L_1(\mathbf{i}'\vert\mathbf{w}) \cdot \delta_{\mathbf{C}(\mathbf{i})=\mathbf{C}(\mathbf{i}')}\big) \\
S_1^{\mathbf{C}+H}(\mathbf{w} \vert \mathbf{i}, \mathbf{C}) &\propto \exp \big( \alpha ((1-\beta)U_1 + \beta U_2) - \text{cost}(\mathbf{w}) \big)
\end{aligned}
$$

And computationally it can be visualized as:

Figure 6: We show the computational process of QuD-Entropy-RSA.

Now the story for generating issue-sensitive (question-aware) captions is complete. With the added entropy reward, the $S_1$ matrix finally picks baseball cap as the answer to “What is the person wearing?”.
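Continuing the same made-up toy setup, the entropy-augmented scoring can be sketched as follows. Here $U_2$ is the entropy of the $L_1$ column masked to the target's cell, and all numbers remain illustrative:

```python
import numpy as np

# Same hypothetical setup: 6 images, columns =
# [baseball cap, mountain, skiing, beach]; row 0 is the target.
S0 = np.array([
    [1/3, 1/3, 1/3, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])
C = np.array([0, 0, 0, 1, 1, 1])
target = 0

L1 = S0 / S0.sum(axis=0, keepdims=True)
masked = L1 * (C == C[target]).astype(float)[:, None]  # zero out other cells

U1 = np.log(masked.sum(axis=0) + 1e-12)                # QuD utility
logp = np.where(masked > 0, np.log(np.where(masked > 0, masked, 1.0)), 0.0)
U2 = -(masked * logp).sum(axis=0)                      # entropy utility

alpha, beta = 1.0, 0.5
score = np.exp(alpha * ((1 - beta) * U1 + beta * U2))  # cost(w) = 0 here
S1 = score / score.sum()
print(S1.argmax())  # 0 -> "baseball cap", the feature shared across the cell
```

Mountain and skiing each live on a single image within the cell, so their masked columns are peaked (entropy 0), while baseball cap is spread across all three cell members and earns the entropy bonus.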

We apply this RSA re-weighting process at every decoder step, a technique called Incremental RSA, described in 5. In a separate blog post, I will share how to write the RSA computation as a modular component that can be added to any image captioning algorithm.

Evaluating on Birds (CUB)

Verifying whether we are successful at “controlling” the caption is difficult. We do have a question and an answer in VQA, but how do we determine whether the answer is mentioned in the caption? Even for a simple attribute like color (for example, “What color is the flower?” with the answer “red”), what captions would satisfy our definition of “addressing the question”?

  1. There is a scarlet-colored flower.
  2. There is a red humming bird flying over a flower.

The first caption addresses the question (resolves the issue), but the second doesn’t (because “red” modifies the bird, not the flower). This highlights the type of challenge we need to address for general-domain image datasets such as MSCOCO and VQA. So, can we try automated evaluation in a restricted domain? It turns out we can.

We use the Caltech-UCSD Birds dataset (CUB-2011), which has 312 features annotated for each image (11,788 images in total); since it’s a restricted domain, these 312 features are exhaustive for each bird. The features cover 26 body parts, with attributes such as “belly color”, “bill length”, etc. We can imagine that each body part can be the focus of one question: “What is the belly color of this bird?” or “What is the bill length of this bird?”

We built a simple keyword-based classifier that identifies body-part mentions in a caption, as well as the attributes (modifiers) of those body parts. We use this classifier to evaluate how well our method controls caption generation. We refer to it as the “feature-in-text” classifier.
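Here is a minimal sketch of what such a keyword-based classifier might look like. The vocabularies and the adjacency-matching rule are illustrative stand-ins, not the actual implementation:

```python
import re

# Illustrative vocabularies; the real classifier covers CUB's 26 body parts
# and a much richer modifier list.
BODY_PARTS = {"belly", "bill", "wing", "crown", "breast"}
MODIFIERS = {"red", "yellow", "black", "white", "long", "short"}

def extract_features(caption):
    """Return (body_part, modifier) pairs where a known modifier
    directly precedes a body part, e.g. 'yellow belly' -> ('belly', 'yellow')."""
    tokens = re.findall(r"[a-z]+", caption.lower())
    pairs = []
    for prev, tok in zip(tokens, tokens[1:]):
        if tok in BODY_PARTS and prev in MODIFIERS:
            pairs.append((tok, prev))
    return pairs

print(extract_features("this bird has a yellow belly and a long bill"))
# [('belly', 'yellow'), ('bill', 'long')]
```

A rule this simple would miss constructions like “a belly that is yellow”, which is exactly the kind of modifier-attachment ambiguity the flower/bird example above illustrates.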

Instead of using VQA, we use the feature-matrix annotations as a guide to select birds that share similar features (describing the body part under discussion) with the target image, and birds that don’t. We show some generated captions below:

Figure 7: Generated captions for CUB birds. Left bracket contains images that share the same feature (under discussion) as the target image. Right bracket contains images that don't.

We begin by assessing the extent to which our issue-sensitive pragmatic models produce captions that are more richly descriptive of the target image than a base neural captioner. For CUB, we can simply count how many attributes the caption specifies according to our feature-in-text classifier. More precisely, for each image and each model, we generate captions under all resolvable issues, concatenate those captions, and then use the feature-in-text classifier to obtain a list of attributes, which we compare to the ground truth for the image as given by the CUB dataset.
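The attribute-level precision and recall described here can be computed set-wise. A small sketch, assuming attributes are represented as (body part, modifier) pairs:

```python
def attribute_precision_recall(predicted, gold):
    """Set-based precision/recall of extracted attributes vs. ground truth."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# e.g. a caption mentioning only the yellow belly, against two gold attributes:
# attribute_precision_recall([("belly", "yellow")],
#                            [("belly", "yellow"), ("bill", "long")])
# -> (1.0, 0.5): everything mentioned is correct, but half the gold is missed
```

This is why a terse but accurate captioner can score high precision while varying widely in recall, matching the pattern reported in Table 1.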

Table 1

Table 1 reports on this evaluation. Precision for all models is very high; the underlying attributes in CUB are very comprehensive, so all high-quality captioners are likely to do well by this metric. In contrast, the recall scores vary substantially, and they clearly favor the issue-sensitive models, revealing them to be substantially more descriptive than $S_0$.

In Table 2, we provide a breakdown of these scores by body part. The issue-sensitive models are clear winners in all categories. It is noteworthy that the entropy term seems to help for some categories but not others, suggesting underlying variation in the categories themselves. It’s safe to say that some categories are rarely discussed in the ground-truth captions, making it difficult for the captioning model to generate them.

Table 2

Our previous evaluation shows that varying the issue has a positive effect on the captions generated by our issue-sensitive models, but it does not assess whether these captions resolve individual issues in an intuitive way. We now report on an assessment that quantifies issue-sensitivity in this sense.

The question posed by this method is: for a given issue, does the produced caption precisely resolve the issue? We can divide this into two sub-questions. First, does the caption resolve the issue? This is a notion of recall. Second, does the caption avoid addressing issues different from the one we want to address? This is a notion of precision. The recall pressure is arguably more important, but the precision one can be seen as assessing how often the caption avoids irrelevant and potentially distracting information.

Overall, the scores reveal that this is a very challenging problem, which we trace to the fine-grained issues that CUB supports. The entropy term proves to be incredibly important for resolving the issue here.

Table 3


Obviously, there is a lot left to be done. We defined the task of Issue-Sensitive Image Captioning (ISIC) and developed a Bayesian pragmatic model that addresses this task successfully using existing datasets and pretrained image captioning systems. We see two natural extensions of this approach that might be explored.

First, one might collect a dataset that exactly matches the structure of ISIC. This would allow for more free-form, naturalistic issues to arise, and would facilitate end-to-end training of models for ISIC. Such models could complement and extend the ones we can create using existing datasets and our issue-sensitive pragmatic captioning agents.

Second, one could extend our notion of issue-sensitivity to other domains. As we saw in Figure 2, questions (as text) naturally give rise to issues in our sense when the domain is sufficiently structured, so these ideas might find applicability in question answering and other areas of controllable natural language generation.

  1. Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., & Xing, E. P. (2017). Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning (pp. 1587-1596). JMLR.org. 

  2. Lample, G., Subramanian, S., Smith, E., Denoyer, L., Ranzato, M. A., & Boureau, Y. L. (2018). Multiple-attribute text rewriting. 

  3. Nie, A., Cohn-Gordon, R., & Potts, C. (2020). Pragmatic issue-sensitive image captioning. arXiv preprint arXiv:2004.14451. 

  4. Kao, J. T., Wu, J. Y., Bergen, L., & Goodman, N. D. (2014). Nonliteral understanding of number words. Proceedings of the National Academy of Sciences, 111(33), 12002-12007. 

  5. Cohn-Gordon, R., Goodman, N., & Potts, C. (2018). Pragmatically informative image captioning with character-level inference. arXiv preprint arXiv:1804.05417.