Attention for translation • Learn to encode multiple pieces of information and use them selectively for the output. • Encode the input sentence into a sequence of vectors. • Choose a subset of these adaptively while decoding (translating) – choose those vectors most relevant for the current output. • I.e., learn to jointly align and translate. • Question: How can we learn and use a vector to decide where to focus attention? How can we make that differentiable so it works with gradient descent? Bahdanau et al., 2015: https://arxiv.org/pdf/1409.0473.pdf
Soft attention • Use a probability distribution over all inputs. • Classification assigns a probability to every possible output. • Attention uses a probability distribution to weight all possible inputs – learn to weight the more relevant parts more heavily. https://distill.pub/2016/augmented-rnns/
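A minimal sketch of this weighting in NumPy (names and sizes are illustrative, not from the slides): scores over the inputs go through a softmax to give a probability distribution, and the output is the corresponding weighted average of the input vectors.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

inputs = np.random.randn(5, 8)   # 5 input vectors of dimension 8
scores = np.random.randn(5)      # relevance score per input (learned in a real model)

weights = softmax(scores)        # probability distribution over the inputs
attended = weights @ inputs      # weighted average: the "attended" summary vector

print(weights.sum())             # sums to 1 -- a proper distribution
print(attended.shape)            # (8,)
```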
Attention for translation • Input, x, and output, y, are sequences. • The encoder has a hidden state h_i associated with each input x_i. These states come from a bidirectional RNN, so each encoding carries information from both directions. https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
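A hedged NumPy sketch of a bidirectional encoder, using a plain tanh RNN rather than the gated units a real system would use (weight names W_f, U_f, etc. are placeholders): the sequence is processed left-to-right and right-to-left, and the two states for each position are concatenated to form h_i.

```python
import numpy as np

def rnn_pass(xs, W, U, b):
    # Simple tanh RNN over a list of input vectors; returns one state per position.
    h = np.zeros(U.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
        states.append(h)
    return states

d_in, d_h, T = 6, 4, 5
xs = [np.random.randn(d_in) for _ in range(T)]            # input sentence as vectors x_1..x_T
W_f, U_f, b_f = np.random.randn(d_h, d_in), np.random.randn(d_h, d_h), np.zeros(d_h)
W_b, U_b, b_b = np.random.randn(d_h, d_in), np.random.randn(d_h, d_h), np.zeros(d_h)

fwd = rnn_pass(xs, W_f, U_f, b_f)                          # forward states
bwd = rnn_pass(xs[::-1], W_b, U_b, b_b)[::-1]              # backward states, re-aligned
h = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]     # h_i sees both directions
print(len(h), h[0].shape)                                  # 5 states of dimension 2*d_h
```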
Attention for translation • Decoder: each output y_i is predicted using • Previous output y_{i-1} • Decoder hidden state s_i • Context vector c_i • Decoder hidden state depends on • Previous output y_{i-1} • Previous state s_{i-1} • Context vector c_i • Attention is embedded in the context vector via learned weights https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
Context vector • The encoder hidden states are combined in a weighted average to form a context vector c_t for the t-th output. • This can capture features from each part of the input. https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
Attention for translation • The alignment scores are produced by a small feed-forward network whose weights v_a and W_a are trained jointly with the rest of the network. https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
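A sketch of this additive (Bahdanau-style) scoring and the resulting context vector in NumPy. The score for encoder state h_i at output step t is v_a · tanh(W_a [s_{t-1}; h_i]); a softmax over the scores gives the attention weights, and c_t is the weighted average of the encoder states. The layer sizes and random weights below are stand-ins for what the network would learn.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

d_h, d_s, d_a, T = 8, 8, 10, 5
h = np.random.randn(T, d_h)                  # encoder hidden states h_1..h_T
s_prev = np.random.randn(d_s)                # previous decoder state s_{t-1}
W_a = np.random.randn(d_a, d_s + d_h)        # placeholder for the learned W_a
v_a = np.random.randn(d_a)                   # placeholder for the learned v_a

scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([s_prev, h_i])) for h_i in h])
alpha = softmax(scores)                      # attention weights over input positions
c_t = alpha @ h                              # context vector for output t
print(alpha.round(2), c_t.shape)
```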
Attention for translation • Key ideas: • Implement attention as a probability distribution over inputs/features. • Extend encoder/decoder pair to include context information relevant to the current decoding task. https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
Attention with images • Can combine a CNN with an RNN using attention. • CNN extracts high-level features. • RNN generates a description, using attention to focus on relevant parts of the image. https://distill.pub/2016/augmented-rnns/
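A rough NumPy sketch of attention over CNN features for captioning, assuming the CNN output has been flattened to L spatial locations with D features each; the weight matrices and sizes are illustrative placeholders, not from a specific implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

L, D, d_s = 49, 512, 256                     # e.g. a 7x7 feature map with 512 channels
features = np.random.randn(L, D)             # CNN output, one vector per image region
s = np.random.randn(d_s)                     # current decoder (RNN) state
W_img, W_dec, v = np.random.randn(64, D), np.random.randn(64, d_s), np.random.randn(64)

scores = np.tanh(features @ W_img.T + W_dec @ s) @ v   # one score per region
alpha = softmax(scores)                      # where to look in the image
z = alpha @ features                         # attended image summary fed to the RNN step
print(z.shape)                               # (512,)
```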
Self-attention • Previously: focus attention on the input while producing the output • Self-attention: focus on other parts of the input while processing the input itself http://jalammar.github.io/illustrated-transformer/
Self-attention • Use each input vector to produce a query, key, and value for that input: each is obtained by multiplying the input embedding by one of the learned matrices W_Q, W_K, W_V http://jalammar.github.io/illustrated-transformer/
Self-attention • Similarity is determined by the dot product of the Query of one input with the Keys of all the inputs • E.g., for input 1, get a vector of dot products q_1·k_1, q_1·k_2, … • Scale (divide by the square root of the key dimension) and apply a softmax to get a distribution over the input vectors. This gives a distribution p_11, p_12, … that is the attention of input 1 on all inputs. • Use the attention vector to do a weighted sum over the Value vectors for the inputs: z_1 = p_11 v_1 + p_12 v_2 + … • This is the output of the self-attention for input 1 http://jalammar.github.io/illustrated-transformer/
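The steps above as a short NumPy sketch of scaled dot-product self-attention; W_Q, W_K, W_V are random placeholders for the learned projection matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T, d_model, d_k = 4, 16, 8
X = np.random.randn(T, d_model)              # T input embeddings
W_Q, W_K, W_V = [np.random.randn(d_model, d_k) for _ in range(3)]

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # query, key, value per input
scores = Q @ K.T / np.sqrt(d_k)              # q_i . k_j for all pairs, scaled
P = softmax(scores, axis=-1)                 # row i = attention of input i over all inputs
Z = P @ V                                    # z_i = sum_j p_ij * v_j
print(Z.shape)                               # (4, 8): one output per input position
```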
Self-attention • Multi-headed attention: run several copies in parallel and concatenate the outputs for the next layer.
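A hedged sketch of the multi-headed version: several independent heads, each with its own W_Q/W_K/W_V, run in parallel; their outputs are concatenated and, as in the Transformer, mixed by one more learned matrix (called W_O here). All weights below are random stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Single-head scaled dot-product self-attention (as on the previous slide).
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    P = softmax(Q @ K.T / np.sqrt(W_K.shape[1]), axis=-1)
    return P @ V

T, d_model, d_k, n_heads = 4, 16, 4, 4
X = np.random.randn(T, d_model)
heads = []
for _ in range(n_heads):
    W_Q, W_K, W_V = [np.random.randn(d_model, d_k) for _ in range(3)]
    heads.append(self_attention(X, W_Q, W_K, W_V))   # each head: (T, d_k)

W_O = np.random.randn(n_heads * d_k, d_model)
out = np.concatenate(heads, axis=-1) @ W_O           # (T, d_model), passed to the next layer
print(out.shape)
```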
Related work • Neural Turing Machines – combine an RNN with an external memory. https://distill.pub/2016/augmented-rnns/
Neural Turing Machines • Use attention to do weighted reads/writes over every memory location. • Can combine content-based attention with location-based attention to take advantage of both. https://distill.pub/2016/augmented-rnns/
Related work • Adaptive computation time for RNNs • Learn a probability distribution over the number of computation steps for a single input • The final output is a weighted sum of the per-step outputs for that input https://distill.pub/2016/augmented-rnns/
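A much-simplified illustration of this idea, not the exact ACT halting mechanism: the network takes several internal steps on one input, learns a score for each step, turns the scores into a probability distribution, and outputs the weighted sum of the per-step states. All weights below are random placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

d_in, d_h, max_steps = 6, 8, 4
x = np.random.randn(d_in)                    # a single input token
W, U, b = np.random.randn(d_h, d_in), np.random.randn(d_h, d_h), np.zeros(d_h)
w_halt = np.random.randn(d_h)                # scores each step's state

h = np.zeros(d_h)
states, halt_scores = [], []
for _ in range(max_steps):                   # several internal steps on the same input
    h = np.tanh(W @ x + U @ h + b)
    states.append(h)
    halt_scores.append(w_halt @ h)

p = softmax(np.array(halt_scores))           # distribution over the number of steps
output = p @ np.vstack(states)               # weighted sum of per-step states
print(p.round(2), output.shape)
```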
Related work • Neural programmer • Determine a sequence of operations to solve some problem. • Use a probability distribution to combine multiple possible sequences. https://distill.pub/2016/augmented-rnns/
Attention: summary • Attention uses a probability distribution over inputs so an RNN can learn which parts of its input are relevant to each output • This can be used in multiple ways to augment RNNs: • Better use of the input to the encoder • External memory • Program control (adaptive computation) • Neural programming