Introduction

Motivation

In training process, we can see the whole output before making prediction. However, when we are in test process, the former time step cannot see latter time step. Hence, we want to create a “mask” which will hide latter words from the current word

More specifically, when we are computing , we don’t want the model make prediction based on words after , such as , , …

Designed for training process

Explanation

The way we create mask is just making corresponding alignment scores to minus infinity. This way, we’ll get 0 in attention weight in these entries since we pass the scores through softmax function