Early Fusion

Idea

We collapse the original 4D input video into a 3D tensor by stacking its frames along the channel dimension, then run an ordinary 2D CNN on the result.

Implementation

Input: a video of shape T × 3 × H × W (T frames, 3 color channels, height H, width W)

Steps:

  1. Collapse the input into three dimensions by stacking the frames along the channel axis, i.e., reshape the T × 3 × H × W video into 3T × H × W
  2. Feed the tensor from Step 1 into an ordinary 2D CNN, which outputs the final class scores
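
The collapse in Step 1 can be sketched as follows. This is a minimal, dependency-free illustration using nested Python lists; the helper names (`make_video`, `shape`, `collapse_time`) and the toy sizes are assumptions made for this example, and in practice a tensor library's reshape would do the same job.

```python
def make_video(T, C, H, W):
    """Build a dummy video as nested lists indexed [t][c][y][x].

    Each pixel stores its own (t, c, y, x) index so we can track where it ends up.
    """
    return [[[[(t, c, y, x) for x in range(W)] for y in range(H)]
             for c in range(C)] for t in range(T)]


def shape(tensor):
    """Return the shape of a rectangular nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)


def collapse_time(video):
    """Stack the channels of every frame: [T][C][H][W] -> [T*C][H][W]."""
    return [channel for frame in video for channel in frame]


video = make_video(T=4, C=3, H=8, W=8)   # toy sizes, chosen arbitrarily
fused = collapse_time(video)

print(shape(video))  # (4, 3, 8, 8)
print(shape(fused))  # (12, 8, 8)
```

Channel `c` of frame `t` lands at index `t * C + c` of the fused tensor, so time is still present in the data but is no longer a separate axis that a 2D CNN can slide over.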

Problem

The temporal dimension collapses too early.

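After we feed the feature map into the CNN, the first Conv layer mixes all 3T input channels (T frames × 3 color channels) into a D × H × W feature map, where D is the number of filters, so the temporal dimension disappears in the very first layer. However, we want the temporal information to be processed over more layers before it disappears.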
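
A small shape calculation makes the problem concrete. The sketch below uses the standard 2D convolution output-size formula; the function name `conv2d_output_shape`, the clip size, and the filter count are hypothetical values chosen only for illustration.

```python
def conv2d_output_shape(in_shape, out_channels, kernel, stride=1, padding=0):
    """Output shape of a 2D convolution applied to an input of shape (C_in, H, W).

    Every filter spans all C_in input channels, so C_in does not appear in the output.
    """
    _, h, w = in_shape
    h_out = (h + 2 * padding - kernel) // stride + 1
    w_out = (w + 2 * padding - kernel) // stride + 1
    return (out_channels, h_out, w_out)


T, H, W = 16, 112, 112        # hypothetical clip: 16 frames of 112 x 112
fused_shape = (3 * T, H, W)   # early-fusion input: 48 x 112 x 112
D = 64                        # hypothetical number of filters in the first layer

out_shape = conv2d_output_shape(fused_shape, out_channels=D, kernel=3, padding=1)
print(out_shape)  # (64, 112, 112)
```

The output shape depends on D, H, and W only. Whatever T was, it has been absorbed into the filter weights after a single layer, and no later layer sees a time axis.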