Introduction
Motivation
In training process, we can see the whole output before making prediction. However, when we are in test process, the former time step cannot see latter time step. Hence, we want to create a “mask” which will hide latter words from the current word
More specifically, when we are computing , we don’t want the model make prediction based on words after , such as , , …
Designed for training process
Explanation
The way we create mask is just making corresponding alignment scores to minus infinity. This way, we’ll get 0 in attention weight in these entries since we pass the scores through softmax function
