A Generative Model of Discourse

January 21, 2019

Short Preface

There has been a lot of interest in sentence representation learning, similar to the explosion of work on word embeddings.

Based on my limited education in linguistics, I do not believe linguists have an agreed-upon and exhaustive set of “properties” that a sentence representation1 needs to capture. As this is still an ongoing area of research, I am very excited to see more linguistic theories on discourse and better ways to understand sentential relations.

This post takes a closer look at the principled work produced by Sanjeev Arora and his students, Linear Algebraic Structure of Word Senses, with Applications to Polysemy, which takes a generative approach to computationally inferring the distribution of a sentence representation as a random vector.

Inspired by their theorem, I examined whether a statistical test that asks the question “can a word be recovered from its context?” can provide a quantitative measure of sentence embedding quality. The experiments are described in the later sections.

A Generative Model of Sentences


Figure 1

Arora et al. assume a discourse random vector2 $c \sim \mathcal{N}(0, \Sigma)$ that represents the meaning of a sentence, and every word of the sentence is generated from this random variable $c$. The probabilistic graphical representation is illustrated above. We can then write the overall model, with the words in a window of size $n$ generated by $c$, as:

$$
P(w_1, w_2, ..., w_n \vert c) = \prod_{i=1}^n P(w_i \vert c)
$$

The probability of a word $w_i$ occurring in the sequence can be described by a log-linear model:

$$
P(w_i \vert c) = \frac{\exp(c \cdot v_{w_i})}{Z_c}
$$

Note that $v_{w_i}$ denotes the vector representation of $w_i$, and this probability model uses the dot product to capture the probability of word $w_i$ appearing in the sequence, normalized by a partition function $Z_c = \sum_w \exp(v_w \cdot c)$. This dictates that the discourse vector $c$ will be close in vector space to the words it produces.
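As a minimal NumPy sketch of this model (the vocabulary size, dimensionality, and vectors below are random placeholders, not the paper's actual setup):

```python
import numpy as np

d, vocab_size = 300, 50_000
rng = np.random.default_rng(0)

V = rng.normal(size=(vocab_size, d))     # rows are word vectors v_w (placeholder values)
c = rng.normal(size=d)                   # a sampled discourse vector

# log-linear model: P(w | c) = exp(c . v_w) / Z_c
logits = V @ c
logits -= logits.max()                   # shift for numerical stability (cancels in the ratio)
p_w_given_c = np.exp(logits) / np.exp(logits).sum()

# conditional independence: P(w_1, ..., w_n | c) is a product of per-word probabilities
window = [12, 7, 4031, 998, 25]          # hypothetical word indices in a window of size n = 5
log_p_window = np.log(p_w_given_c[window]).sum()
```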

Based on this very simple model, a very interesting theorem can be proved. Here, I write out the actual proof, with more details than the one provided in the paper, as well as a simple illustration of what is going on.

Theorem 1 of the Arora et al. (2018)3 paper can be understood by introducing a vector $u$: for a given word $w$ with its corresponding word vector $v_w$, we can compute such a vector $u$.


Figure 2

The word $w$ appears in different spans of words across the entire document. We introduce a random variable $s$, a span (window) of $n$ words containing $w$. Computationally, the vector $u$ for the word $w$ is computed as follows:

$$
u = \frac{1}{k} \sum_{s \in \{s_1, ..., s_k\}} \frac{1}{n} \sum_{w_i \in s} v_{w_i}
$$

To make this statement even simpler, assume the above figure represents a tensor $S \in \mathbb{R}^{n \times k \times d}$; then we can obtain $u$ with a single NumPy operation: u = np.mean(S, axis=(0, 1)), averaging over the window and span axes and leaving a $d$-dimensional vector. Knowing how $u$ is computed, we can now understand Theorem 1:

$$
v_w \approx A u
$$

For any word, if we compute the corresponding vector $u$, the word embedding of this word can be recovered through a linear transformation (matrix multiplication) by a fixed matrix $A$. Below, I walk through the algebra of the proof of Theorem 1 in the paper. Readers who find it too elementary or too advanced can skip this block and go straight to the next section.

(Optional)

The proof stands as long as the generative model in Figure 1 holds. We set out to show that $v_w \approx A\, \mathbb{E}(\frac{1}{n} \sum_{w_i \in s} v_{w_i} \vert w \in s)$. By “iterated expectation”, or the “law of total expectation”, we can expand $\mathbb{E}[c_s \vert w \in s]$ as:

\begin{equation}
\mathbb{E}[c_s \vert w \in s] = \mathbb{E}\big[\, \mathbb{E}[c_s \vert s = w_1 \dots w \dots w_n] \,\big\vert\, w \in s \,\big]
\label{first-eq}
\tag{1}
\end{equation}

The next step is to find the probability density function (pdf) of $c \vert w$: $p(c \vert w)$. From the earlier portion of the paper, we have the following facts that we can substitute: $Z_c \approx Z \exp(\| c \|^2)$4; the pdf of the multivariate normal prior $c \sim \mathcal{N}(0, \Sigma)$ is $p(c) \propto \exp(-\frac{1}{2} c^T \Sigma^{-1}c)$; $\| c \|^2 = c^Tc = c^T I c$; and the log-linear model we assumed, $p(w \vert c) = \exp(c \cdot v_w) / Z_c$. We can expand $p(c\vert w)$ using Bayes' rule, substitute these terms in, and obtain:

\begin{align*}
p(c \vert w) &\propto p(w \vert c)p(c) \\
&\propto \frac{1}{Z} \exp(v_w \cdot c - c^T(\frac{1}{2} \Sigma^{-1} + I)c) \\
\end{align*}

After obtaining the probability density function of $c \vert w$, we can think about what kind of random variable this pdf suggests, because eventually we want to know $\mathbb{E}(c \vert w)$, which appears on the left-hand side of Equation (1). Since the covariance matrix inverse $\Sigma^{-1}$ is involved, we can try to rearrange the terms to make it look more like a multivariate Gaussian distribution, whose mean would then give us $\mathbb{E}(c \vert w)$.

First, we ignore the covariance determinant term since it is a constant, and in Arora's setting the covariance matrix is invertible: “if the word vectors as random variables are isotropic to some extent, it will make the covariance matrix identifiable” (identifiable = invertible). The “isotropic word embedding” assumption here means that word embedding dimensions should not be correlated with each other.

Then, all we need to do is to rearrange the terms in $p(c \vert w)$ to appear in the form of $\exp(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu))$. By doing so, we will be able to find our $\mu$, the expectation of this pdf. Since the form $\frac{1}{2} c^T( \Sigma^{-1} + 2I)c$ looks very similar to the quadratic form that we need, we can let $B^{-1} = \Sigma^{-1} + 2I$ and let $B$ be our new covariance matrix for $c \vert w$. We can work out the equations from both sides. We let $\mu$ be the mean we want and solve for it:

\begin{align*}
p(c \vert w) &\propto \exp(-\frac{1}{2} (c-\mu)^T B^{-1} (c-\mu)) \\
&= \exp(-\frac{1}{2}(c^T B^{-1} c - c^TB^{-1}\mu - \mu^TB^{-1}c + \mu^TB^{-1}\mu))\\
p(c \vert w) &\propto \frac{1}{Z} \exp(v_w \cdot c - c^T(\frac{1}{2} \Sigma^{-1} + I)c) \\
&= \frac{1}{Z} \exp(-\frac{1}{2}(-2 v_w \cdot c + c^TB^{-1}c))
\end{align*}

Now we have two expressions for $p(c \vert w)$. We can match the terms between the two equations: the term $c^TB^{-1}c$ already appears in both, but $-2 v_w \cdot c$ does not. However, there are two terms with negative signs in the top expansion. A trick that applies here is to simply set them equal and hope things work out; we solve for $\mu$:

$$
-2 v_w \cdot c = - c^TB^{-1}\mu - \mu^TB^{-1}c \\
-2 v_w \cdot c = -2 \mu^T B^{-1}c
$$

It is somewhat transparent that on the RHS (right-hand side), $B$ needs to disappear since the LHS (left-hand side) does not contain any $B$. For that to happen, $\mu$ should contain $B$ so that it cancels with $B^{-1}$. Also, the LHS has $v_w$ while the RHS has none. The answer should then be apparent: $\mu = Bv_w$. If you plug this in, the above equality holds, which shows that this is our $\mu$.

My stats PhD friend told me that if I see a pdf proportional to $\exp(w^Tx - \frac{1}{2} x^TB^{-1}x)$, I can skip the above algebra and directly “read off” the distribution of $x$ as having mean $Bw$ and covariance $B$.

So now we know that $c \vert w \sim \mathcal{N}(Bv_w, B)$ where $B = (\Sigma^{-1} + 2I)^{-1}$: the posterior distribution of $c$ after conditioning on a single word in the sequence. Thus $\mathbb{E}(c \vert w) = (\Sigma^{-1} + 2I)^{-1} v_w$.

Next, we want the pdf that describes $c \vert w_1, ..., w_n$. This part is relatively straightforward; no algebra trick or insight is required. The work mostly hinges on the following expression:

$$
p(c \vert w_1, ..., w_n) \propto p(w_1,...,w_n \vert c)\, p(c) \propto p(c) \prod_{i=1}^n p(w_i \vert c) \\
\approx \frac{1}{Z^n} \exp\Big(\sum_{i=1}^n v_{w_i}^Tc - \frac{1}{2} c^T(\Sigma^{-1} + 2nI)c\Big)
$$

The words are generated independently of each other conditioned on $c$, and we already know the expression for $p(w \vert c)$. So the above equation evaluates to a form we have already worked out, and we can skip the algebra to get $\mathbb{E}[c \vert w_1, ..., w_n] \approx (\Sigma^{-1} + 2nI)^{-1} \sum_{i=1}^n v_{w_i}$.

If you still recall the LHS and RHS of Equation (1), all that is left to conclude the proof is to plug in what we have derived on both sides. Feel free to refer to the paper, which offers a cleaner and shorter presentation.

$$
\mathbb{E}[c_s \vert w \in s] = \mathbb{E}\big[\, \mathbb{E}[c_s \vert s = w_1 \dots w \dots w_n] \,\big\vert\, w \in s \,\big] \\
(\Sigma^{-1} + 2I)^{-1} v_w \approx (\Sigma^{-1} + 2nI)^{-1}\, \mathbb{E}\Big[ \sum_{i=1}^n v_{w_i} \,\Big\vert\, w \in s \Big]
$$

Therefore, the matrix $A$ that we set out to find can be obtained by rearranging the terms in the above equation: $A = n(\Sigma^{-1} + 2I) (\Sigma^{-1} + 2nI)^{-1}$.
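To make the closed form concrete, here is a small NumPy sketch that computes $B$, the posterior mean $\mathbb{E}(c \vert w) = Bv_w$, and the matrix $A$ from a hypothetical covariance $\Sigma$ (all values below are placeholders):

```python
import numpy as np

d, n = 300, 5                                # embedding dimension, window size
rng = np.random.default_rng(0)

# a hypothetical, nearly isotropic covariance for the discourse prior
M = np.eye(d) + 0.01 * rng.normal(size=(d, d))
Sigma = M @ M.T                              # symmetric positive definite
Sigma_inv = np.linalg.inv(Sigma)

B = np.linalg.inv(Sigma_inv + 2 * np.eye(d))                  # covariance of c | w
A = n * (Sigma_inv + 2 * np.eye(d)) @ np.linalg.inv(Sigma_inv + 2 * n * np.eye(d))

v_w = rng.normal(size=d)                     # a placeholder word vector
posterior_mean = B @ v_w                     # E[c | w] = (Sigma^-1 + 2I)^-1 v_w
```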

(Optional End)

Estimation of the Linear Transformation

Now that we have an analytic form of $A$, there seem to be a few ways of finding what $A$ is. The first is to directly estimate the prior distribution of $c$ in the hope of getting $\Sigma$. The problem is that $p(c) = \sum_w p(c \vert w)p(w)$: we can easily compute $p(c \vert w)$, but it is not easy to compute $p(c)$.

We also know that $c \vert w \sim \mathcal{N}(Bv_w, B)$ with $B = (\Sigma^{-1} + 2I)^{-1}$; if we can somehow estimate $B$, we can recover $\Sigma$ and therefore compute $A$. Let's take a closer look at $c \vert w$. The mean of the distribution, $Bv_w$, is word-dependent, but the covariance is not. This means that to characterize the distribution $c \vert w$ for any word, we would need to fit, for every word $w$, the posterior $c \vert w$ with a multivariate Gaussian while sharing the same covariance matrix across all these pdfs. This seems possible, but estimating and inverting the shared covariance is expensive ($B$ is a 300 x 300 matrix for 300-dimensional word vectors).

The last choice is to do Monte Carlo estimation of Theorem 1's original equation, $v_w \approx A\, \mathbb{E}(\frac{1}{n} \sum_{w_i \in s} v_{w_i} \vert w \in s)$: replace the expectation with sampling over windows $s$ and words $w$, and find $A$ as a linear regression problem. This is also the method the original paper chose.

Letting $u_w$ be the averaged context vector for word $w$, and iterating through the vocabulary, we can find the matrix $A$ by solving the following optimization:

$$
\arg\min_A \sum_w \| A u_w - v_w \|_2^2
$$
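A minimal sketch of this least-squares fit, assuming the pairs $(u_w, v_w)$ have already been collected into two matrices (the `U` and `V` below are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(1000, 300))   # rows are averaged context vectors u_w (placeholder)
V = rng.normal(size=(1000, 300))   # rows are the target word vectors v_w (placeholder)

# min_A sum_w ||A u_w - v_w||^2 is the least-squares problem U A^T ~= V
A_T, *_ = np.linalg.lstsq(U, V, rcond=None)
A = A_T.T

# recover one word vector from its averaged context and measure the fit
v_hat = A @ U[0]
cos_sim = v_hat @ V[0] / (np.linalg.norm(v_hat) * np.linalg.norm(V[0]))
```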

The paper fits GloVe embeddings with this linear transformation and uses SIF5 to generate the context vectors. As we can see in Table 2 of the paper (shown below), the practical fit is good.


Figure 3

Application to Word Senses

Intuitively, Theorem 1 dictates that a word has a linear relationship (realized by the matrix $A$) to the average of the context vectors of all the contexts this word appears in: $v_w \approx A \cdot \frac{1}{m}\sum_s c_s$. This relationship is fully specified by $\Sigma$, the covariance of the discourse random vector $c$. Theorem 2 of the paper can be interpreted as $v_w \approx \alpha v_{s_1} + \beta v_{s_2}$: a linear combination of 2 senses (each sense represented as a vector). We can see the parallel between this linear decomposition and the transformed average of all contexts that word $w$ appears in. The proof of Theorem 2 essentially says that if there are 2 senses for a given word, then with high probability roughly an $\alpha$ fraction of the context vectors will be similar to each other, since they all express this word in sense 1, and a $\beta$ fraction of the context vectors will be grouped together, since they all express it in sense 2.

Since we do not observe the frequencies ($\alpha$, $\beta$), nor do we know how many senses there are, Arora proposed to discover senses using sparse coding6: find a set of unit vectors $A_1, \dots, A_m$ such that any word vector $v_w$ can be expressed by a small number of these unit vectors. These unit vectors are referred to as the Atoms of Discourse.

The sparse coding objective and its description can be found in the paper. Overall, given a set of word vectors and two integers $k$, $m$ with $k \ll m$, find a set of unit vectors $A_1, \dots, A_m$ such that:

$$
v_w = \sum_{j=1}^m \alpha_{w,j}A_j + \eta_w
$$

where at most $k$ of the coefficients $\alpha_{w,j}$ are nonzero and the noise vectors $\eta_w$ are small. The goal is to minimize the total reconstruction error:

$$
\sum_w \Big\| v_w - \sum_{j=1}^m \alpha_{w,j} A_j \Big\|_2^2
$$
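As a sketch, this objective can be approximated with off-the-shelf dictionary learning; the word-vector matrix `V` and the choices of $k$ and $m$ below are placeholders, not the paper's exact solver:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
V = rng.normal(size=(5000, 300))       # rows are word vectors v_w (placeholder)
k, m = 5, 2000                         # sparsity per word and number of atoms

dico = MiniBatchDictionaryLearning(
    n_components=m,
    transform_algorithm="omp",         # orthogonal matching pursuit for the sparse codes
    transform_n_nonzero_coefs=k,
    random_state=0,
)
alphas = dico.fit(V).transform(V)      # (num_words, m), at most k nonzeros per row
atoms = dico.components_               # (m, d) unit-norm atoms A_1, ..., A_m

reconstruction_error = np.linalg.norm(V - alphas @ atoms) ** 2
```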

Do Non-linear Sentence Embeddings Conform to this?

Arora et al.’s proof above assumed SIF-style sentence embeddings. The SIF algorithm uses a linear combination of word embeddings to create sentence embeddings. One clear downside of SIF is that since GloVe/Word2Vec word embeddings do not contain order information, SIF sentence embeddings do not have order information either.

We can list out the assumptions used in Arora et al.’s model:

  1. $p(w \vert c)$ is a log-linear model
  2. Words $w$ in a window are generated independently by $c$.
  3. $p(c)$, the prior of discourse vector $c$ is Gaussian with mean 0 and invertible covariance matrix $\Sigma$.

Each assumption has some flaws. Assumption (1) is a very simplistic model of how words are generated from meaning. By extending it to a more complex model, such as a bilinear one, $p(w \vert c) \propto \exp(v_w^T H c)$, we gain more expressivity and still have an analytical expression for $c\vert w$; however, we might lose the concentration of the partition function $Z_c$.

Assumption (2) is a very serious offense against the syntax of language. Words definitely depend on one another; grammatical structure naturally emerges from word-to-word dependencies. In “The drink is cold”, the choice of “is” is clearly influenced by the number of the subject. However, if we drop this assumption, we will not be able to get the following factorization:

$$
 p(w_1,...,w_n \vert c) \propto \prod_{i=1}^n p(w_i \vert c)
$$

Instead, what we get is:

$$
p(w_1,...,w_n \vert c) \propto \prod_{i=1}^n p(w_i \vert c, w_{<i})
$$

This prevents us from reusing the $p(w\vert c)$ expression we derived, breaking the derivation that follows Equation (1).

Assumption (3) dictates a Gaussian prior on the discourse vector (a Gaussian random walk in Arora et al.'s earlier model), which seems the least objectionable. The discussion of this assumption is left for future posts.

So what does Theorem 1 really provide for us? Well, it obviously leads to Theorem 2: word embeddings under these assumptions are additive combinations of senses. Also, the learned matrix $A$ gives a semantic meaning to $u$, the averaged context vector. Arora et al. described an algorithm adapted from Reisinger and Mooney (2010)7: for a word $w$ that appears in $m$ contexts, compute the context vectors $c_1, \dots, c_m$, cluster these vectors, and average each cluster. The cluster centers are originally not near meaningful words that suggest the sense the cluster represents, but by applying $A$ to a cluster center, we obtain meaningful nearest words again:


Figure 4
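Below is a rough sketch of this clustering-then-transform procedure as I understand it; the context vectors, the matrix `A`, and the word-vector matrix are placeholders, and this is my reading of the algorithm rather than the authors' code:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
d, num_contexts, vocab_size = 300, 500, 50_000

contexts = rng.normal(size=(num_contexts, d))   # c_1, ..., c_m for one word (placeholder)
A = rng.normal(size=(d, d))                     # the learned linear map (placeholder)
V = rng.normal(size=(vocab_size, d))            # rows are word vectors v_w (placeholder)

# cluster the contexts this word appears in; each cluster hopefully captures one sense
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(contexts)

for center in kmeans.cluster_centers_:
    sense_vec = A @ center                      # map the cluster center back to word space
    sims = (V @ sense_vec) / (np.linalg.norm(V, axis=1) * np.linalg.norm(sense_vec))
    nearest_words = np.argsort(-sims)[:5]       # indices of the 5 nearest words
```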

What about Sentence Embedding Training Objectives?

In many cases, we want the theory to have suggestive power for applications. Many sentence embedding models have been proposed, and they have quite different objectives: InferSent and DisSent8 are trained on a specific semantic task (predicting inference or discourse relations); OpenAI GPT9 and ELMo10 train with language modeling; BERT11 trains with a word cloze task (using the context to predict a held-out word).

From a very high-level view, Theorem 1 proposes a relationship between the averaged context and a word that appears in those contexts: a word can be “recovered” by applying a linear transformation to the average of the context vectors of the contexts that contain it. This is very different from the language-modeling objective:

$$
p(x_1,...,x_n) = \prod_{t=1}^n p(x_t | x_{<t})
$$

A cursory look at the graphical models of Arora et al.'s model and of a language model reminded me of the confounding-variable diagrams below:


Figure 5

When there is a hidden (confounding) variable $c$ that governs the behavior of $a$ and $b$, without knowing of the existence of $c$, we (the algorithm) can often make the mistake of inferring a direct relationship between variables $a$ and $b$. Does the left figure look quite similar to a language model, and the right figure similar to what Arora et al. have proposed? This is not to suggest that there is no dependence between words; there clearly is. But modeling language well by conditioning on previous words should not stop us from hypothesizing a different kind of generative model.

The language-modeling objective is a decomposition of the joint probability over words $p(x_1,...,x_n)$, and the decomposition itself does not include any structural bias. What I suggest instead is to model the joint probability $p(c, x_1, ..., x_n)$, with the inclusion of an additional latent random variable.

The BERT objective, using context to predict the presence of a word, seems quite similar to what Theorem 1 suggests: a word can be recovered through its context. However, BERT masks the word itself (which makes sense), and the prediction happens per sentence, instead of on an averaged context.

DisSent is very similar to BERT in that it predicts a word from context, even though it is narrower in scope (it only predicts a small set of words).

There is one last type of sentence embedding objective that has not been thoroughly investigated: sequence auto-encoding. The sequence auto-encoding objective compresses the whole sequence into a context representation, and the model is required to reconstruct each word from this context representation12. I have not seen many papers on this objective, but maybe it is worth a shot given its proximity to Arora et al.'s proposed model.

Recovering Words Through Context

At a high level, Theorem 1 inspires a statistical test: “can a word be recovered from its context?”. Obviously sentence embeddings, unlike word embeddings, should capture both semantic and syntactic information, and the word recovery test can only measure the former. However, it is a task-agnostic, global measure of the quality of a sentence embedding.

We consider 4 models: BERT, InferSent8, DisSent13, and SIF (Arora et al.'s original model). We evaluate on the News Crawl Shuffled 2011 dataset, which contains roughly 2M sentences from news sources and is part of the LM1B dataset. BERT uses WordPiece tokenization and produces its own subword units for infrequent words, while InferSent, DisSent, and SIF rely purely on GloVe embeddings. Thus, we choose a vocabulary that overlaps both GloVe's and BERT's.

Since our models are large, we subsample the overlapping vocabulary (12.5%) and construct a training/testing set (1000 words and their corresponding sentence embedding pairs for training, 100 words and their sentence embedding pairs for testing) to learn the transformation matrix $A$. These words jointly appear in >99% of the sentences in the corpus.

For BERT, we select the [CLS] token position as the embedding of the sentence because it is trained on the next-sentence prediction task, albeit not fine-tuned on other tasks.
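For reference, here is a sketch of the test loop I have in mind. The `embed` function is a stand-in for whichever sentence encoder is being evaluated (BERT [CLS], InferSent, DisSent, or SIF), and `word_vec` for the GloVe lookup; neither is a real API:

```python
import numpy as np
from collections import defaultdict

def word_recovery_score(sentences, train_words, test_words, embed, word_vec):
    """Fit A on (averaged context, word vector) pairs, return mean test cosine similarity."""
    contexts = defaultdict(list)
    for sent in sentences:
        emb = embed(sent)                       # hypothetical sentence encoder call
        for word in set(sent.split()):
            if word in train_words or word in test_words:
                contexts[word].append(emb)

    def stack(words):
        words = [w for w in words if contexts[w]]
        U = np.stack([np.mean(contexts[w], axis=0) for w in words])   # averaged contexts
        V = np.stack([word_vec[w] for w in words])                    # target GloVe vectors
        return U, V

    U_tr, V_tr = stack(train_words)
    A_T, *_ = np.linalg.lstsq(U_tr, V_tr, rcond=None)                 # fit the linear map A

    U_te, V_te = stack(test_words)
    pred = U_te @ A_T
    cos = np.sum(pred * V_te, axis=1) / (
        np.linalg.norm(pred, axis=1) * np.linalg.norm(V_te, axis=1))
    return cos.mean()
```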

The y-axis plots the cosine distance between the transformed context vector $A u_w$ and the word vector $v_w$ on the test set, and the x-axis shows the number of epochs during training. Unfortunately, this is not the result I was expecting, at least not according to the theorem and the experiment reported in the paper.

There are a few key differences between my experiment here and the paper:

  1. I only sampled 1000 high-frequency words. The paper probably fit the entire vocabulary (a global fit).
  2. The data source is very different. Wikipedia allows words with different senses to be discussed in very different contexts (“crane” will appear both as an animal and as a heavy lifting machine). This is a special artifact of Wikipedia.
  3. I use the entire sentence to construct the context embedding (throwing away sentences longer than 30 words), while the paper used a simple window around the word.

The BERT embedding has a much higher dimension (768) than the SIF embedding (300), but it performed much worse. The idea that BERT is simply averaging word vectors does not hold up under this test, but since BERT also mixes in positional embeddings, any simple analysis like this is complicated.

InferSent and DisSent (4096-dim) both generalized very well to the test-set words. This test might still be flawed, and I would appreciate different perspectives on it.

Closing Thoughts

Unwittingly at first, Word2Vec was quickly shown to be an implicit solution to a non-convex co-occurrence matrix factorization. GloVe and other word embedding methods followed this lead and grounded these methods in theory. Are sentence embedding models, be it InferSent, DisSent, ELMo, or BERT, implicit solutions to a more principled discourse model? So far, we are still waiting for great work and interesting answers.

Acknowledgement

Special thanks to Jaime Roquero, who discussed this paper with me in a very detailed manner and pointed out an algebra error I made in an earlier version of this draft.

  1. In the scope of this post, we can assume it’s an embedding. This is a very narrow interpretation that ignores decades of linguistic work on sentence representations. Interested readers can take a look at Kamp’s Discourse Representation Theory framework. 

  2. In most of Arora et al.’s work, “sentence meaning”, “discourse”, and “context” are used almost interchangeably. They all refer to a vector representation of a span of words, usually within a fixed window. 

  3. Linear Algebraic Structure of Word Senses, with Applications to Polysemy. 

  4. This is proven in A Latent Variable Model Approach to PMI-based Word Embeddings, Lemma 2.1. They proved a concentration bound of this partition function under the Bayesian priors specified in the model of Figure 1. It seems to be a general bound linked to the self-normalizing property of log-linear models. 

  5. SIF refers to Arora’s other paper: A Simple but Tough-to-Beat Baseline for Sentence Embeddings. Basically, it is an additive summation of the word embeddings in a sentence, re-weighted by each word’s frequency in the corpus. 

  6. http://ufldl.stanford.edu/wiki/index.php/Sparse_Coding 

  7. Multi-Prototype Vector-Space Models of Word Meaning 

  8. InferSent: Supervised Learning of Universal Sentence Representations from Natural Language Inference Data; DisSent: Sentence Representation Learning from Explicit Discourse Relations. 

  9. Improving Language Understanding by Generative Pre-Training 

  10. Deep contextualized word representations 

  11. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding 

  12. https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html 

  13. DisSent: Sentence Representation Learning from Explicit Discourse Relations. https://arxiv.org/abs/1710.04334