Motivation

As we discussed earlier (note), RNNs and RCNs are not parallelizable, so they can't make the most of the GPU

Our solution was self-attention, and now we want to bring it to video classification


Self-Attention for Video Classification

Non-Local Block

A non-local block is a block in which every output element depends on every input element

Self-Attention

  • Recall: self-attention. The non-local block is almost entirely the same as the original self-attention layer:
  1. We compute queries, keys, and values from the input feature map
  2. Use the queries and keys to compute attention weights, which tell us how strongly each output element depends on each input element
  3. Next, we multiply the attention weights with the values to get the output vectors
  4. Since we want the input and output to have the same dimensions, we use a final convolution to resize the channels
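The steps above can be sketched in NumPy, with plain matrix multiplies standing in for the 1×1×1 convolutions; the function name, argument names, and the residual connection at the end are illustrative assumptions, not taken verbatim from the original:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_block(x, Wq, Wk, Wv, Wout):
    """Self-attention over a video feature map x of shape (C, T, H, W).

    Wq, Wk, Wv: (C', C) projections playing the role of 1x1x1 convolutions;
    Wout: (C, C') final projection that resizes channels back to C.
    """
    C, T, H, W = x.shape
    flat = x.reshape(C, -1)              # (C, THW): one column per position
    q = Wq @ flat                        # step 1: queries (C', THW)
    k = Wk @ flat                        #         keys    (C', THW)
    v = Wv @ flat                        #         values  (C', THW)
    # Step 2: attention weights, row i = how strongly output i attends to
    # every input position.
    attn = softmax(q.T @ k / np.sqrt(Wq.shape[0]), axis=-1)  # (THW, THW)
    out = v @ attn.T                     # step 3: weighted sum of values
    out = Wout @ out                     # step 4: resize channels back to C
    return (out + flat).reshape(C, T, H, W)  # residual connection
```

Note that with this shape convention the output matches the input exactly, so the block can replace any feature map in a network without touching the surrounding layers.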

Insert Self-Attention Layer into 3D CNN

If we initialize the last convolution to zero, then the entire block computes the identity function by default: the attention branch outputs zero, and the residual connection passes the input through unchanged. Hence, we can insert self-attention layers between existing layers of a 3D CNN without disturbing what the network already computes
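A minimal self-contained check of this zero-initialization trick, again using matrices in place of the 1×1×1 convolutions (all names and shapes here are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
C, Cp, N = 8, 4, 12                      # channels, bottleneck channels, positions
x = rng.standard_normal((C, N))          # flattened (C, T*H*W) feature map
Wq, Wk, Wv = (rng.standard_normal((Cp, C)) for _ in range(3))
Wout = np.zeros((C, Cp))                 # last conv initialized to zero

attn = softmax((Wq @ x).T @ (Wk @ x) / np.sqrt(Cp), axis=-1)
y = Wout @ ((Wv @ x) @ attn.T) + x       # attention branch + residual

assert np.allclose(y, x)                 # the whole block is the identity
```

Because the block starts as the identity, training can gradually learn to use the attention branch while the pretrained 3D CNN behavior is preserved at insertion time.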