ImageNet Classification with Deep Convolutional Neural Networks (NeurIPS 2012)
In terms of training time with gradient descent, these saturating nonlinearities ($f(x)=\tanh(x)$ or $f(x)=(1+e^{-x})^{-1}$) are much slower than the non-saturating nonlinearity $f(x)=\max(0,x)$.
ReLU yields a shorter training time than tanh or sigmoid.
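As a quick illustration (not from the paper), a minimal NumPy sketch of the three nonlinearities shows why tanh and the sigmoid saturate for large |x| while ReLU does not:

```python
import numpy as np

def tanh(x):
    return np.tanh(x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

# tanh and sigmoid flatten out (saturate) for large |x|, so their
# gradients vanish there; ReLU keeps a constant slope of 1 for x > 0.
print(tanh(x))     # roughly [-1.0, -0.76, 0.0, 0.76, 1.0]
print(sigmoid(x))  # roughly [ 0.0,  0.27, 0.5, 0.73, 1.0]
print(relu(x))     # [ 0.,  0.,  0.,  1., 10.]
```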
A single GTX 580 GPU has only 3GB of memory, which limits the maximum size of the networks that can be trained on it. Therefore we spread the net across two GPUs.
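The paper's cross-GPU parallelization is baked into its specific architecture (each layer's kernels are split between the GPUs, which communicate only at certain layers). As a loose, hypothetical sketch of the general idea in modern PyTorch (assuming two visible CUDA devices, cuda:0 and cuda:1), one can place part of a network on each GPU and copy activations between them:

```python
import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    """Toy model-parallel split: lower layers on cuda:0, upper layers on cuda:1.
    This only illustrates spreading a net across two GPUs; it is not the
    paper's exact scheme, which partitions the kernels of each layer."""
    def __init__(self):
        super().__init__()
        self.lower = nn.Sequential(
            nn.Conv2d(3, 48, kernel_size=11, stride=4), nn.ReLU()
        ).to("cuda:0")
        self.upper = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(48, 1000)
        ).to("cuda:1")

    def forward(self, x):
        x = self.lower(x.to("cuda:0"))
        # Copy activations across devices before the second half.
        return self.upper(x.to("cuda:1"))
```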
ReLUs do not require input normalization to prevent them from saturating. If at least some training examples produce a positive input to a ReLU, learning will happen in that neuron. However, the researchers find that the following local normalization scheme aids generalization:
$$ b^i_{x,y}=a^i_{x,y}\Big/\left(k+\alpha\sum_{j=\max(0,\,i-n/2)}^{\min(N-1,\,i+n/2)}\left(a^j_{x,y}\right)^2 \right)^{\beta} $$
where the sum runs over $n$ adjacent kernel maps at the same spatial position, and $N$ is the total number of kernels in the layer.
This normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities among neuron outputs computed using different kernels.
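A minimal NumPy sketch of this local response normalization, assuming an activation tensor `a` of shape (channels, height, width) and the paper's hyperparameters $k=2$, $n=5$, $\alpha=10^{-4}$, $\beta=0.75$:

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Across-channel LRN: a[i, x, y] is divided by
    (k + alpha * sum of squares over n adjacent kernel maps)^beta."""
    N = a.shape[0]                      # total number of kernel maps
    b = np.empty_like(a)
    for i in range(N):
        lo = max(0, i - n // 2)         # j = max(0, i - n/2)
        hi = min(N - 1, i + n // 2)     # j = min(N-1, i + n/2)
        denom = k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)
        b[i] = a[i] / denom ** beta
    return b

# Example: 8 kernel maps over a 4x4 spatial grid.
a = np.random.randn(8, 4, 4).astype(np.float32)
print(local_response_norm(a).shape)  # (8, 4, 4)
```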