Deep Residual Learning for Image Recognition
When deeper networks are able to start converging, a degradation problem is exposed: as network depth increases, accuracy saturates and then degrades rapidly.
But such degradation is not caused by overfitting: adding more layers to a suitably deep model leads to higher training error → underfitting!
A deeper model can emulate a shallower one: copy the layers from the shallower model and set the extra layers to identity. The degradation problem therefore suggests that multiple nonlinear layers have difficulty approximating identity mappings.
→ Residual Networks
Plain block: the stacked layers directly learn the target mapping H(X).
$$ \underbrace{\frac{\partial L}{\partial X}}_{\text{downstream}}=\underbrace{\frac{\partial L}{\partial H(X)}}_{\text{upstream}}\times\underbrace{\frac{\partial H(X)}{\partial X}}_{\text{local}} $$
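For concreteness, here is a minimal sketch of a plain block, assuming PyTorch (the notes do not name a framework); `PlainBlock` and its channel count are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PlainBlock(nn.Module):
    """Two stacked 3x3 convs that learn H(X) directly (no skip connection)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The gradient dL/dX must pass through every conv: only the "local" Jacobian path exists.
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        return out
```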
Residual block: the stacked layers learn the residual F(X), and a skip connection adds the input back, so H(X) = F(X) + X.
$$ \underbrace{\frac{\partial L}{\partial X}}_{\text{downstream}}=\underbrace{\frac{\partial L}{\partial H(X)}}_{\text{upstream}}\times\underbrace{\frac{\partial H(X)}{\partial X}}_{\text{local}}=\frac{\partial L}{\partial H(X)}\left(\frac{\partial F(X)}{\partial X}+1\right) $$
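A matching sketch of a residual block under the same PyTorch assumption; `ResidualBlock` is an illustrative name. The only change from the plain block is that X is added back before the final ReLU, which is where the +1 term in the gradient comes from.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convs compute F(X); the identity shortcut gives H(X) = F(X) + X."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # F(X): the residual branch.
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # H(X) = F(X) + X: the identity path lets gradients bypass the convs.
        return self.relu(out + x)
```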
A residual network is a stack of many residual blocks
Regular design, like VGG: each residual block has two 3x3 conv layers
Network is divided into stages: the first block of each stage halves the resolution (with stride-2 conv) and doubles the number of channels
$H\times W\times C \rightarrow H/2 \times W/2 \times 2C$
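A sketch of a stage-opening block under the same PyTorch assumption: a stride-2 conv halves the resolution and the channel count doubles, and a 1x1 stride-2 projection on the shortcut (the paper's projection option) makes X match F(X) in shape. The class name and example sizes below are illustrative.

```python
import torch
import torch.nn as nn

class DownsampleResidualBlock(nn.Module):
    """First block of a stage: H x W x C -> H/2 x W/2 x 2C."""
    def __init__(self, in_channels):
        super().__init__()
        out_channels = 2 * in_channels  # double the channels
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Projection shortcut: 1x1 conv with stride 2 so the shortcut matches F(X) in shape.
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, stride=2, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))

x = torch.randn(1, 64, 56, 56)
print(DownsampleResidualBlock(64)(x).shape)  # torch.Size([1, 128, 28, 28])
```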
Like GoogLeNet, no big fully-connected layers at the end: instead use Global Average Pooling (GAP) followed by a single linear layer
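A sketch of this head under the same PyTorch assumption; the class count and channel count are illustrative (1000 classes matches ImageNet in the paper).

```python
import torch
import torch.nn as nn

num_classes = 1000    # illustrative: ImageNet classification
final_channels = 512  # illustrative channel count of the last stage

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),  # Global Average Pooling: H x W x C -> 1 x 1 x C
    nn.Flatten(),             # -> C-dimensional feature vector
    nn.Linear(final_channels, num_classes),  # single linear classifier
)

features = torch.randn(1, final_channels, 7, 7)
print(head(features).shape)  # torch.Size([1, 1000])
```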