Early Fusion
Idea
We collapse the original 4D input video into a 3D tensor, then apply a standard 2D CNN to it.
Implementation
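Input: a video clip of shape $T \times 3 \times H \times W$ ($T$ frames, 3 color channels, height $H$, width $W$)
Steps:
- Collapse the input into three dimensions by stacking the frames along the channel axis, i.e., reshape it to $3T \times H \times W$
- Feed the resulting $3T \times H \times W$ tensor into an ordinary 2D CNN, which outputs the final class scores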
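The two steps above can be sketched in NumPy. This is only an illustration under assumptions that are not in the notes: the clip layout is taken to be (T, 3, H, W), the "2D CNN" is reduced to one naive 3x3 convolution, a ReLU, global average pooling, and a linear classifier, and the helper names `early_fusion_reshape` and `conv2d` as well as all sizes are hypothetical.

```python
import numpy as np

def early_fusion_reshape(video):
    """Stack the frames of a (T, 3, H, W) clip along the channel axis -> (3T, H, W)."""
    T, C, H, W = video.shape
    return video.reshape(T * C, H, W)

def conv2d(x, weight, bias):
    """Naive 2D convolution, stride 1, zero padding that preserves H and W.

    x: (C_in, H, W), weight: (C_out, C_in, k, k) with odd k, bias: (C_out,)
    """
    C_out, C_in, k, _ = weight.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    H, W = x.shape[1:]
    out = np.empty((C_out, H, W))
    for i in range(H):
        for j in range(W):
            patch = xp[:, i:i + k, j:j + k]  # (C_in, k, k)
            # Contract over (C_in, k, k): one value per output channel.
            out[:, i, j] = np.tensordot(weight, patch, axes=3) + bias
    return out

# Toy sizes, chosen arbitrarily for the illustration.
rng = np.random.default_rng(0)
T, H, W, D, num_classes = 8, 16, 16, 4, 10

video = rng.standard_normal((T, 3, H, W))
fused = early_fusion_reshape(video)              # Step 1: (3T, H, W)

w1 = rng.standard_normal((D, 3 * T, 3, 3)) * 0.1
b1 = np.zeros(D)
features = np.maximum(conv2d(fused, w1, b1), 0)  # Step 2: conv + ReLU -> (D, H, W)

pooled = features.mean(axis=(1, 2))              # global average pool -> (D,)
w_fc = rng.standard_normal((num_classes, D)) * 0.1
scores = w_fc @ pooled                           # class scores -> (num_classes,)

print(fused.shape, features.shape, scores.shape)
```

In practice the stand-in network would be replaced by any ordinary image CNN whose first layer accepts 3T input channels instead of 3.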
Problem
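The temporal dimension collapses too quickly.
Once the fused tensor enters the CNN, the first Conv layer maps it from $3T \times H \times W$ ($T$ frames of 3 channels each) to $D \times H \times W$, where $D$ is the number of filters, so the temporal dimension disappears in the very first layer. However, we would like temporal features to be processed across more layers before they disappear.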
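A small NumPy check can make this collapse concrete. It is a sketch under assumptions not stated in the notes: a 1x1 convolution stands in for the first Conv layer, the fused channels are ordered frame by frame (channel index 3t + c), and the sizes are arbitrary. The output of the layer equals the sum of per-frame contributions, so no axis of the result indexes time any more.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W, D = 8, 16, 16, 4  # arbitrary toy sizes

# Early-fused input: T frames of 3 channels stacked into 3T channels.
fused = rng.standard_normal((3 * T, H, W))

# First Conv layer, simplified to a 1x1 convolution with D filters.
# Every filter spans all 3T channels, i.e. all frames at once.
weight = rng.standard_normal((D, 3 * T))
out = np.einsum("dc,chw->dhw", weight, fused)  # (D, H, W)

# The same computation with time made explicit: one contribution per frame.
per_frame = np.einsum(
    "dtc,tchw->tdhw",
    weight.reshape(D, T, 3),
    fused.reshape(T, 3, H, W),
)                                              # (T, D, H, W)

# The layer sums over the frame axis, so time is gone after one layer.
collapsed = per_frame.sum(axis=0)
print(out.shape, np.allclose(out, collapsed))
```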