Deep Residual Learning for Image Recognition
When deeper networks are able to start converging, a degradation problem is exposed: as network depth increases, accuracy saturates and then degrades rapidly.
But such degradation is not caused by overfitting: adding more layers to a suitably deep model leads to higher training error → underfitting!
A deeper model can emulate a shallower one: copy the layers from the shallower model and set the extra layers to identity. The degradation problem therefore suggests that multiple nonlinear layers have difficulty approximating identity mappings.
→ Residual Networks
Plain block: the stacked layers directly learn the target mapping H(X).
$$ \underbrace{\frac{\partial L}{\partial X}}_{\text{downstream}}=\underbrace{\frac{\partial L}{\partial H(X)}}_{\text{upstream}}\times\underbrace{\frac{\partial H(X)}{\partial X}}_{\text{local}} $$
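For concreteness, here is a minimal sketch of a plain block, assuming PyTorch (the notes do not name a framework); `PlainBlock` and its channel count are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PlainBlock(nn.Module):
    """Two stacked 3x3 convs that learn H(X) directly (no skip connection)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The gradient dL/dX must pass through every conv: only the "local" Jacobian path exists.
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        return out
```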
Residual block: the stacked layers learn the residual F(X), and a skip connection adds the input back, so H(X) = F(X) + X.
$$ \underbrace{\frac{\partial L}{\partial X}}_{\text{downstream}}=\underbrace{\frac{\partial L}{\partial H(X)}}_{\text{upstream}}\times\underbrace{\frac{\partial H(X)}{\partial X}}_{\text{local}}=\frac{\partial L}{\partial H(X)}\left(\frac{\partial F(X)}{\partial X}+1\right) $$
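A matching sketch of a residual block under the same PyTorch assumption; `ResidualBlock` is an illustrative name. The only change from the plain block is that X is added back before the final ReLU, which is where the +1 term in the gradient comes from.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convs compute F(X); the identity shortcut gives H(X) = F(X) + X."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # F(X): the residual branch.
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # H(X) = F(X) + X: the identity path lets gradients bypass the convs.
        return self.relu(out + x)
```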
A residual network is a stack of many residual blocks
Regular design, like VGG: each residual block has two 3x3 conv layers
Network is divided into stages: the first block of each stage halves the resolution (with stride-2 conv) and doubles the number of channels
$H\times W\times C \rightarrow H/2 \times W/2 \times 2C$
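A sketch of a stage-opening block under the same PyTorch assumption: a stride-2 conv halves the resolution and the channel count doubles, and a 1x1 stride-2 projection on the shortcut (the paper's projection option) makes X match F(X) in shape. The class name and example sizes below are illustrative.

```python
import torch
import torch.nn as nn

class DownsampleResidualBlock(nn.Module):
    """First block of a stage: H x W x C -> H/2 x W/2 x 2C."""
    def __init__(self, in_channels):
        super().__init__()
        out_channels = 2 * in_channels  # double the channels
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Projection shortcut: 1x1 conv with stride 2 so the shortcut matches F(X) in shape.
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, stride=2, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))

x = torch.randn(1, 64, 56, 56)
print(DownsampleResidualBlock(64)(x).shape)  # torch.Size([1, 128, 28, 28])
```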
Like GoogLeNet, no big fully-connected layers at the end: instead use Global Average Pooling (GAP) followed by a single linear layer
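A sketch of this head under the same PyTorch assumption; the class count and channel count are illustrative (1000 classes matches ImageNet in the paper).

```python
import torch
import torch.nn as nn

num_classes = 1000    # illustrative: ImageNet classification
final_channels = 512  # illustrative channel count of the last stage

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),  # Global Average Pooling: H x W x C -> 1 x 1 x C
    nn.Flatten(),             # -> C-dimensional feature vector
    nn.Linear(final_channels, num_classes),  # single linear classifier
)

features = torch.randn(1, final_channels, 7, 7)
print(head(features).shape)  # torch.Size([1, 1000])
```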