Introduction
Background
After batch normalization was invented, researchers could finally train neural networks with more than 10 layers. Because a deeper network can clearly emulate a shallower one, researchers at the time believed that the deeper the network, the more accurate its predictions would be
However, after building models with 50+ layers, they found that not only the test error but also the training error was larger than that of models with around 10 layers
Problem
Since both the training and test errors are bad, researchers concluded that this is an "optimization" problem: deeper models are harder to optimize. In particular, they do not learn the identity functions needed to emulate a shallow model
Solution
Residual networks introduce the "residual block" to solve this problem
Residual Block
Introduction
A common network takes an input x, applies some transformation F via Conv layers, and outputs F(x). A residual block, on the other hand, outputs F(x) + x
This is because a network can easily learn F(x) = 0 (which makes the whole block an identity, since the output is then just x), but it cannot easily learn F(x) = x directly
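The idea can be sketched in a few lines of numpy. Here F is a hypothetical one-layer transformation (a linear map plus ReLU, standing in for the Conv layers); with near-zero weights, the residual block already behaves like an identity, while the plain layer does not:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transformation F(x): ReLU(W x), with small random weights
# so that F(x) is close to zero (the "easy to learn" regime).
W = rng.standard_normal((4, 4)) * 0.01

def F(x):
    return np.maximum(W @ x, 0.0)

def plain_layer(x):
    return F(x)        # output: F(x) only

def residual_block(x):
    return F(x) + x    # output: F(x) + x (skip connection)

x = rng.standard_normal(4)
# With F near zero, the residual block passes x through almost unchanged,
# i.e. it implements an approximate identity "for free".
print(np.allclose(residual_block(x), x, atol=0.1))
```

The plain layer would need to learn W so that ReLU(Wx) = x to achieve the same identity behavior, which is the hard part.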
Basic Block and Bottleneck Block

These two residual blocks have similar FLOPs. Here are some guidelines for choosing which kind to use
Basic Blocks:
- Building a smaller model
- Working with a small dataset, where a deep network may overfit
- Computational resources are limited
Bottleneck Block:
- Building a deep model (50+ layers)
- Training on large datasets
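The "similar FLOPs" claim above can be checked with a quick count of multiply-accumulates per spatial position. This sketch assumes ResNet's standard channel widths: 64 channels for the basic block, and 256 → 64 → 256 for the bottleneck:

```python
# Multiply-accumulates per output pixel for one conv layer:
# kernel_size^2 * in_channels * out_channels.
def conv_macs(k, c_in, c_out):
    return k * k * c_in * c_out

# Basic block at 64 channels: two 3x3 convs.
basic = conv_macs(3, 64, 64) + conv_macs(3, 64, 64)

# Bottleneck at 256 channels: 1x1 reduce to 64, 3x3 at 64, 1x1 expand to 256.
bottleneck = (conv_macs(1, 256, 64)
              + conv_macs(3, 64, 64)
              + conv_macs(1, 64, 256))

print(basic, bottleneck)  # 73728 69632
```

So the bottleneck processes 4x wider features (256 vs 64 channels) at roughly the same cost, which is why it is preferred for deep models.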
Researchers later discovered that applying the activation function (e.g. ReLU) before the Conv (the "pre-activation" ordering) performs better than placing the ReLU after it
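The difference between the two orderings can be sketched as follows (a simple linear map stands in for the Conv, and batch norm is omitted for brevity; this is an illustrative sketch, not the exact published architecture). The key property of pre-activation is that the skip path never passes through a ReLU, so the identity mapping stays exact:

```python
import numpy as np

def conv(x, w):  # stand-in for a conv layer: a plain linear map
    return w @ x

def post_activation_block(x, w1, w2):
    # original ordering: Conv -> ReLU -> Conv, then ReLU after the addition
    out = np.maximum(conv(x, w1), 0.0)
    out = conv(out, w2)
    return np.maximum(out + x, 0.0)   # ReLU clips the summed output

def pre_activation_block(x, w1, w2):
    # pre-activation ordering: ReLU -> Conv, ReLU -> Conv
    out = conv(np.maximum(x, 0.0), w1)
    out = conv(np.maximum(out, 0.0), w2)
    return out + x                    # identity path is never clipped

# With zero weights, pre-activation returns x exactly (a true identity),
# while post-activation clips any negative components of x.
x = np.array([-1.0, 2.0])
w = np.zeros((2, 2))
print(pre_activation_block(x, w, w))   # [-1.  2.]
print(post_activation_block(x, w, w))  # [0. 2.]
```

This clean identity path is one intuition for why the pre-activation ordering trains better in very deep networks.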