A Brief History of the RNN Encoder-Decoder

March 19, 2016

The RNN encoder-decoder has become one of the most active research areas in Natural Language Processing; here is a history of its development, with papers organized chronologically.

Ian Goodfellow’s book has a good overall summary: http://www.deeplearningbook.org/contents/rnn.html but, being a book, it sadly misses the most recent developments and the finer technical details.

Another interesting, more accessible introduction is WildML’s recent post: http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/

The RNN encoder-decoder was introduced by Kyunghyun Cho (Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation) and Ilya Sutskever (Sequence to Sequence Learning with Neural Networks), both in 2014.

Sutskever’s model is trained end-to-end (sequence-to-sequence) with a pure RNN, while Cho’s model is used to score proposals generated by another machine translation system.

A major drawback of these two initial models is that they both use a fixed-length vector to represent the meaning of the source sentence. A vector of fixed dimension can become too small a bottleneck to summarize a long sequence.

Bahdanau observed this limitation in his 2015 paper, which became the second milestone paper for the RNN encoder-decoder (Neural Machine Translation by Jointly Learning to Align and Translate). Instead of a single fixed vector, he proposed a context vector that is recomputed for every target position to summarize the relevant parts of the source sentence, along with the now-famous attention mechanism. It’s interesting to notice that “attention” is used interchangeably with “alignment” in Bahdanau’s paper, because the attention mechanism is analogous to alignment in the traditional machine translation literature.
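
To make the mechanism concrete, here is a minimal numpy sketch of additive (Bahdanau-style) attention for a single decoder step; the matrices W_enc, W_dec and the vector v stand in for learned parameters and are random here, purely for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_attention(enc_states, dec_state, W_enc, W_dec, v):
    """Compute one context vector with additive (Bahdanau-style) attention.

    enc_states: (T_src, H) encoder hidden states h_1..h_T
    dec_state:  (H,) previous decoder hidden state s_{i-1}
    """
    # Alignment scores: e_ij = v^T tanh(W_dec s_{i-1} + W_enc h_j)
    scores = np.tanh(enc_states @ W_enc.T + dec_state @ W_dec.T) @ v
    alpha = softmax(scores)        # attention (alignment) weights over source steps
    context = alpha @ enc_states   # weighted sum of encoder states: the context vector c_i
    return context, alpha

# Toy usage with random parameters, purely for illustration.
np.random.seed(0)
T_src, H, A = 5, 8, 8
enc = np.random.randn(T_src, H)
dec = np.random.randn(H)
W_enc, W_dec, v = np.random.randn(A, H), np.random.randn(A, H), np.random.randn(A)
context, alpha = additive_attention(enc, dec, W_enc, W_dec, v)
print(alpha)   # sums to 1: one weight per source time step
```

Because the context vector is rebuilt at every decoding step, the decoder is no longer forced to squeeze the whole source sentence into one fixed vector.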

This brings us to another paper, also published in 2015, Luong’s Effective Approaches to Attention-based Neural Machine Translation, which sheds light on global vs. local attention (roughly the soft vs. hard attention of Computer Vision). In local attention, the model predicts a center position in the source sentence and weights the alignment scores with a Gaussian distribution around it.
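
Here is a rough numpy sketch of the local-attention idea; W_p, v_p and the half-width D stand in for learned parameters, and for simplicity the Gaussian weighting is applied over all source positions rather than only the 2D+1 window used in the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_attention(enc_states, dec_state, W_p, v_p, D=2):
    """Sketch of Luong-style local-p attention.

    A window center p_t is predicted from the decoder state, and the
    alignment scores are damped by a Gaussian centered at p_t with
    sigma = D / 2.
    """
    S, H = enc_states.shape
    # Predicted center position p_t in [0, S].
    p_t = S * sigmoid(v_p @ np.tanh(W_p @ dec_state))
    sigma = D / 2.0
    scores = enc_states @ dec_state                            # dot-product alignment scores
    gauss = np.exp(-((np.arange(S) - p_t) ** 2) / (2 * sigma ** 2))
    alpha = softmax(scores) * gauss                            # Gaussian-weighted attention
    alpha /= alpha.sum()                                       # renormalize (a simplification)
    context = alpha @ enc_states
    return context, alpha, p_t

# Toy usage with random parameters, purely for illustration.
np.random.seed(0)
S, H, A = 7, 8, 8
enc = np.random.randn(S, H)
dec = np.random.randn(H)
context, alpha, p_t = local_attention(enc, dec, np.random.randn(A, H), np.random.randn(A))
```

The Gaussian keeps the weights concentrated around the predicted position, so the decoder only "looks at" a small region of the source instead of the whole sentence.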

Finally, let’s mention NLI: Natural Language Inference. With the SNLI corpus, NLI shares many features with MT systems, as pointed out in Bill MacCartney’s paper A Phrase-Based Alignment Model for Natural Language Inference, and one of them is alignment.

The RNN encoder-decoder scheme was first applied to NLI by Rocktaschel in his 2015 paper Reasoning about Entailment with Neural Attention. Cheng then proposed a different model in Long Short-Term Memory-Networks for Machine Reading, distinguishing shallow attention fusion from deep attention fusion. The main difference is that shallow attention fusion only summarizes attention information at the last time step of the decoder (obviously the wrong way), whereas deep fusion summarizes attention information at every step: the encoder hidden states are merged into the decoder at each time step as a weighted sum, with the weights reflecting how much attention a given decoder time step pays to each encoder time step.
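
As a rough illustration of what per-step fusion looks like, here is a minimal numpy sketch; the dot-product scoring and the W_merge projection are assumptions made for illustration, not the exact parameterization of Cheng’s LSTMN.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def per_step_attention_fusion(enc_states, dec_states, W_merge):
    """Sketch of per-step ("deep"-style) attention fusion.

    At every decoder time step a distribution over the encoder time steps
    is computed, the encoder states are summed under that distribution,
    and the result is merged back into the decoder hidden state.
    """
    merged = []
    for s_t in dec_states:                 # one decoder time step at a time
        scores = enc_states @ s_t          # how relevant each encoder step is to s_t
        alpha = softmax(scores)            # attention weights over encoder steps
        context = alpha @ enc_states       # weighted sum of encoder hidden states
        merged.append(np.tanh(W_merge @ np.concatenate([s_t, context])))
    return np.stack(merged)

# Toy usage with random states, purely for illustration.
np.random.seed(0)
T_src, T_tgt, H = 6, 4, 8
enc = np.random.randn(T_src, H)
dec = np.random.randn(T_tgt, H)
out = per_step_attention_fusion(enc, dec, np.random.randn(H, 2 * H))
```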

We expect to see more developments of this scheme in 2016, and this article will be updated accordingly :)