ImageNet Classification with Deep Convolutional Neural Networks (NeurIPS 2012)
In terms of training time with gradient descent, these saturating nonlinearities ($f(x)=\tanh(x)$ or $f(x)=(1+e^{-x})^{-1}$) are much slower than the non-saturating nonlinearity $f(x)=\max(0,x)$.
ReLU yields a shorter training time than tanh or sigmoid.
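As a quick illustration (not from the paper), a minimal NumPy sketch of the three nonlinearities shows why tanh and the sigmoid saturate for large |x| while ReLU does not:

```python
import numpy as np

def tanh(x):
    return np.tanh(x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

# tanh and sigmoid flatten out (saturate) for large |x|, so their
# gradients vanish there; ReLU keeps a constant slope of 1 for x > 0.
print(tanh(x))     # roughly [-1.0, -0.76, 0.0, 0.76, 1.0]
print(sigmoid(x))  # roughly [ 0.0,  0.27, 0.5, 0.73, 1.0]
print(relu(x))     # [ 0.,  0.,  0.,  1., 10.]
```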
A single GTX 580 GPU has only 3GB of memory, which limits the maximum size of the networks that can be trained on it. Therefore we spread the net across two GPUs.
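The paper's cross-GPU parallelization is baked into its specific architecture (each layer's kernels are split between the GPUs, which communicate only at certain layers). As a loose, hypothetical sketch of the general idea in modern PyTorch (assuming two visible CUDA devices, cuda:0 and cuda:1), one can place part of a network on each GPU and copy activations between them:

```python
import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    """Toy model-parallel split: lower layers on cuda:0, upper layers on cuda:1.
    This only illustrates spreading a net across two GPUs; it is not the
    paper's exact scheme, which partitions the kernels of each layer."""
    def __init__(self):
        super().__init__()
        self.lower = nn.Sequential(
            nn.Conv2d(3, 48, kernel_size=11, stride=4), nn.ReLU()
        ).to("cuda:0")
        self.upper = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(48, 1000)
        ).to("cuda:1")

    def forward(self, x):
        x = self.lower(x.to("cuda:0"))
        # Copy activations across devices before the second half.
        return self.upper(x.to("cuda:1"))
```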
ReLUs do not require input normalization to prevent them from saturating. If at least some training examples produce a positive input to a ReLU, learning will happen in that neuron. However, the researchers find that the following local normalization scheme aids generalization:
$$ b^i_{x,y}=a^i_{x,y}\Big/\left(k+\alpha\sum_{j=\max(0,\,i-n/2)}^{\min(N-1,\,i+n/2)}\left(a^j_{x,y}\right)^2 \right)^{\beta} $$
where the sum runs over $n$ adjacent kernel maps at the same spatial position, and $N$ is the total number of kernels in the layer.
This normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities among neuron outputs computed using different kernels.
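A minimal NumPy sketch of this local response normalization, assuming an activation tensor `a` of shape (channels, height, width) and the paper's hyperparameters $k=2$, $n=5$, $\alpha=10^{-4}$, $\beta=0.75$:

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Across-channel LRN: a[i, x, y] is divided by
    (k + alpha * sum of squares over n adjacent kernel maps)^beta."""
    N = a.shape[0]                      # total number of kernel maps
    b = np.empty_like(a)
    for i in range(N):
        lo = max(0, i - n // 2)         # j = max(0, i - n/2)
        hi = min(N - 1, i + n // 2)     # j = min(N-1, i + n/2)
        denom = k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)
        b[i] = a[i] / denom ** beta
    return b

# Example: 8 kernel maps over a 4x4 spatial grid.
a = np.random.randn(8, 4, 4).astype(np.float32)
print(local_response_norm(a).shape)  # (8, 4, 4)
```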