Attention for translation

Presentation Transcript


  1. Attention for translation • Learn to encode multiple pieces of information and use them selectively for the output. • Encode the input sentence into a sequence of vectors. • Choose a subset of these adaptively while decoding (translating) – choose the vectors most relevant to the current output. • In other words, learn to jointly align and translate. • Question: How can we learn and use a vector to decide where to focus attention, and how can we make that decision differentiable so it works with gradient descent? Bahdanau et al., 2015: https://arxiv.org/pdf/1409.0473.pdf

  2. Soft attention • Use a probability distribution over all inputs. • Classification assigns a probability to every possible output; attention uses a probability distribution to weight every possible input – the network learns to weight the more relevant parts more heavily. https://distill.pub/2016/augmented-rnns/
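
A minimal NumPy sketch of this idea (the scores and dimensions are placeholder assumptions, not from the slides): a softmax turns relevance scores into a probability distribution, and that distribution weights an average over all inputs, so the whole operation stays differentiable.

```python
import numpy as np

def softmax(x):
    """Turn arbitrary relevance scores into a probability distribution."""
    e = np.exp(x - np.max(x))          # subtract the max for numerical stability
    return e / e.sum()

# Placeholder scores for how relevant each of 4 input vectors is right now.
scores = np.array([0.1, 2.0, -1.0, 0.5])
weights = softmax(scores)              # sums to 1; larger score -> more attention

inputs = np.random.randn(4, 8)         # 4 input vectors of dimension 8
attended = weights @ inputs            # soft selection: a weighted average of all inputs
print(weights, attended.shape)         # 4 weights summing to 1, and a vector of shape (8,)
```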

  3. Attention for translation • Input, x, and output, y, are sequences. • The encoder has a hidden state hi associated with each xi. These states come from a bidirectional RNN, so each hi carries information from both sides of the input sentence. https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
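
A rough sketch of what these encoder states could look like, using a plain tanh RNN with toy dimensions as a stand-in (the paper uses gated units and separate weights per direction): run a forward and a backward pass and concatenate the states, so each hi sees both sides of xi.

```python
import numpy as np

def rnn_pass(xs, Wx, Wh, reverse=False):
    """One simple tanh-RNN pass; returns the hidden state at every position."""
    h = np.zeros(Wh.shape[0])
    states = []
    seq = reversed(xs) if reverse else xs
    for x in seq:
        h = np.tanh(Wx @ x + Wh @ h)
        states.append(h)
    return states[::-1] if reverse else states     # keep states aligned with positions

# Toy dimensions (assumptions): 5 input tokens, embedding size 8, hidden size 6.
xs = [np.random.randn(8) for _ in range(5)]
Wx, Wh = np.random.randn(6, 8), np.random.randn(6, 6)

forward  = rnn_pass(xs, Wx, Wh)                    # a real encoder would use
backward = rnn_pass(xs, Wx, Wh, reverse=True)      # separate weights per direction

# h_i concatenates both directions, so it carries context from both sides of x_i.
h = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
print(len(h), h[0].shape)                          # 5 states, each of dimension 12
```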

  4. Attention for translation • Decoder: each output yi is predicted using • Previous output yi-1 • Decoder hidden state si • Context vector ci • Decoder hidden state depends on • Previous output yi-1 • Previous state si-1 • Context vector ci • Attention is embedded in the context vector via learned weights https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
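
A schematic of one decoder step under these dependencies, assuming a plain tanh cell in place of the gated unit used in the paper and random placeholder weight matrices:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def decoder_step(y_prev, s_prev, c_i, Wy, Ws, Wc, Wout):
    """One decoding step: new state s_i from (y_{i-1}, s_{i-1}, c_i), then an output distribution."""
    s_i = np.tanh(Wy @ y_prev + Ws @ s_prev + Wc @ c_i)        # state depends on all three
    p_y = softmax(Wout @ np.concatenate([y_prev, s_i, c_i]))   # the prediction also sees the context
    return s_i, p_y

# Toy dimensions (assumptions): output embedding 8, decoder state 6, context 12, vocabulary 20.
y_prev, s_prev, c_i = np.random.randn(8), np.random.randn(6), np.random.randn(12)
Wy, Ws, Wc = np.random.randn(6, 8), np.random.randn(6, 6), np.random.randn(6, 12)
Wout = np.random.randn(20, 26)                                  # 26 = 8 + 6 + 12
s_i, p_y = decoder_step(y_prev, s_prev, c_i, Wy, Ws, Wc, Wout)
print(s_i.shape, p_y.sum())                                     # (6,) and 1.0
```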

  5. Context vector • The encoder hidden states are combined in a weighted average to form a context vector ct for the t-th output. • This can capture features from each part of the input. https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
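
A small sketch of that weighted average, with random encoder states and placeholder alignment scores standing in for the learned ones:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def context_vector(scores_t, encoder_states):
    """c_t: softmax the alignment scores, then take a weighted average of the encoder states."""
    alpha_t = softmax(scores_t)                 # one weight per input position, sums to 1
    return alpha_t @ encoder_states, alpha_t

H = np.random.randn(5, 12)                      # 5 encoder states h_1..h_5, dimension 12
scores = np.random.randn(5)                     # placeholder alignment scores for output step t
c_t, alpha_t = context_vector(scores, H)
print(c_t.shape, alpha_t.sum())                 # (12,) and 1.0
```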

  6. Attention for translation • The weights va and Wa parametrize a small feed-forward alignment (scoring) network; they are trained jointly with the rest of the network. https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
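
One common way to write this additive (Bahdanau-style) score as a tiny feed-forward net, sketched with toy dimensions; the matrices below are random placeholders for the learned va and Wa.

```python
import numpy as np

def alignment_score(s_prev, h_i, W_a, v_a):
    """Additive score: a one-hidden-layer net on the decoder state and one encoder state."""
    return v_a @ np.tanh(W_a @ np.concatenate([s_prev, h_i]))

# Toy dimensions (assumptions): decoder state 6, encoder state 12, alignment dimension 10.
s_prev = np.random.randn(6)
H = np.random.randn(5, 12)                      # 5 encoder states
W_a = np.random.randn(10, 18)                   # 18 = 6 + 12
v_a = np.random.randn(10)

scores_t = np.array([alignment_score(s_prev, h_i, W_a, v_a) for h_i in H])
print(scores_t.shape)   # (5,) -- these scores feed the softmax that gives the attention weights
```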

  7. Attention for translation • Key ideas: • Implement attention as a probability distribution over inputs/features. • Extend encoder/decoder pair to include context information relevant to the current decoding task. https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html

  8. Attention with images • Can combine a CNN with an RNN using attention. • CNN extracts high-level features. • RNN generates a description, using attention to focus on relevant parts of the image. https://distill.pub/2016/augmented-rnns/

  9. Self-attention • Previous: focus attention on the input while working on the output. • Self-attention: focus on other parts of the input while processing the input. http://jalammar.github.io/illustrated-transformer/

  10. Self-attention • Each input vector produces a query, a key, and a value for that input: each is obtained by multiplying the input embedding by one of the learned matrices WQ, WK, WV. http://jalammar.github.io/illustrated-transformer/
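
A minimal sketch of the three projections, with assumed toy dimensions (3 tokens, embedding size 8, projection size 4) and random matrices standing in for the learned WQ, WK, WV:

```python
import numpy as np

X = np.random.randn(3, 8)                     # one row per input token embedding
W_Q, W_K, W_V = (np.random.randn(8, 4) for _ in range(3))

Q = X @ W_Q   # queries: what each token is looking for
K = X @ W_K   # keys: what each token offers to be matched against
V = X @ W_V   # values: what each token contributes to the output
print(Q.shape, K.shape, V.shape)              # (3, 4) each
```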

  11. Self-attention • Similarity is determined by the dot product of the query of one input with the keys of all the inputs. • E.g., for input 1, compute the dot products q1·k1, q1·k2, … • Scale these and apply a softmax to get a distribution over the input vectors. This gives a distribution p11, p12, … that is the attention of input 1 over all inputs. • Use this attention distribution to take a weighted sum over the value vectors of the inputs: z1 = p11v1 + p12v2 + … • This is the output of the self-attention layer for input 1. http://jalammar.github.io/illustrated-transformer/
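
The same steps written as a small NumPy sketch (dimensions are toy assumptions); doing it for all inputs at once just stacks the per-input vectors as matrix rows.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Scaled dot-product self-attention over all inputs at once."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # row i holds q_i . k_j for every j, scaled
    P = softmax(scores, axis=-1)         # row i is the attention distribution p_i1, p_i2, ...
    return P @ V, P                      # z_i = p_i1 v_1 + p_i2 v_2 + ...

Q, K, V = (np.random.randn(3, 4) for _ in range(3))
Z, P = self_attention(Q, K, V)
print(Z.shape, P[0].sum())               # (3, 4) and 1.0
```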

  12. Self-attention • Multi-headed attention: run several copies of self-attention in parallel, each with its own WQ, WK, WV, and concatenate the outputs for the next layer.
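
A sketch of the multi-head wiring, assuming 2 heads and a final output projection W_O to return to the model dimension; the attention helper repeats the previous sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    # Scaled dot-product attention, as in the previous sketch.
    P = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return P @ V

def multi_head(X, heads, W_O):
    """Each head projects and attends independently; outputs are concatenated and mixed by W_O."""
    Z = [attend(X @ W_Q, X @ W_K, X @ W_V) for W_Q, W_K, W_V in heads]
    return np.concatenate(Z, axis=-1) @ W_O

# Toy setup (assumptions): 3 tokens, model dimension 8, 2 heads of size 4.
X = np.random.randn(3, 8)
heads = [tuple(np.random.randn(8, 4) for _ in range(3)) for _ in range(2)]
W_O = np.random.randn(8, 8)
print(multi_head(X, heads, W_O).shape)   # (3, 8)
```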

  13. Related work • Neural Turing Machines – combine an RNN with an external memory. https://distill.pub/2016/augmented-rnns/

  14. Neural Turing Machines • Use attention to do weighted reads/writes over every memory location. • Content-based attention can be combined with location-based attention to take advantage of both. https://distill.pub/2016/augmented-rnns/
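
A hedged sketch of the content-based half of this addressing: cosine similarity between a query key and every memory row, sharpened by a strength parameter and normalized with a softmax. The location-based shift step is omitted here, and the values are random placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def content_addressing(memory, key, beta):
    """Content-based attention over memory rows: cosine similarity, sharpened by beta."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    return softmax(beta * sims)          # one weight per memory location, sums to 1

M = np.random.randn(10, 16)              # 10 memory slots of width 16
w = content_addressing(M, key=np.random.randn(16), beta=5.0)
blurry_read = w @ M                      # a weighted ("blurry") read over every location
print(w.shape, blurry_read.shape)        # (10,) and (16,)
```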

  15. Related work • Adaptive computation time for RNNs: include a probability distribution over the number of computation steps for a single input. • The final output is a weighted sum of the outputs of those steps. https://distill.pub/2016/augmented-rnns/
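
A toy illustration of that weighted sum, with assumed halting probabilities and per-step states; weighting by the halting distribution is what keeps the "how many steps" decision differentiable.

```python
import numpy as np

# Hypothetical halting probabilities and per-step states for one input (assumed values).
halt_p = np.array([0.1, 0.3, 0.6])       # distribution over "how many steps to think"
states = np.random.randn(3, 8)           # the state produced after each pondering step

# The final output blends the per-step states according to the halting distribution.
output = halt_p @ states
print(output.shape)                      # (8,)
```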

  16. Related work • Neural programmer • Determine a sequence of operations to solve some problem. • Use a probability distribution to combine multiple possible sequences. https://distill.pub/2016/augmented-rnns/

  17. Attention: summary • Attention uses a probability distribution to let the network learn which inputs are relevant and use them when producing each RNN output. • This can be used in multiple ways to augment RNNs: • Better use of the encoder input • External memory • Program control (adaptive computation) • Neural programming
