Momentum: At each update, also add a fraction of the exponentially decaying average of past gradients
Adaptive Learning Rate: RMSProp (scale each parameter's step by a running average of squared gradients)
Adam: Adaptive Learning Rate w/ Momentum
Since s0 is initialized to 0, the first update s1 = 0.1 * grad reflects only 10% of the gradient (with decay 0.9); this early underestimate is why Adam divides the running averages by (1 - beta^t) as a bias correction.
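A minimal sketch of one Adam step combining the pieces above; the parameter vector `w`, gradient `g`, state buffers `m`/`s`, and the hyperparameter defaults are assumptions for illustration:

```python
import numpy as np

def adam_step(w, g, m, s, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (m) + RMSProp scaling (s) + bias correction."""
    m = beta1 * m + (1 - beta1) * g        # running average of gradients (momentum)
    s = beta2 * s + (1 - beta2) * g ** 2   # running average of squared gradients (RMSProp)
    m_hat = m / (1 - beta1 ** t)           # bias correction: m and s start at 0,
    s_hat = s / (1 - beta2 ** t)           # so early averages are too small
    w = w - lr * m_hat / (np.sqrt(s_hat) + eps)
    return w, m, s

# Usage: minimize f(w) = w^2, whose gradient is 2w.
w, m, s = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 501):
    w, m, s = adam_step(w, 2 * w, m, s, t)
```

At t = 1 the correction divides m by exactly (1 - beta1) = 0.1, which undoes the 10% effect described above.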
Batch Normalization: Z-normalize hidden unit values (m: mean, s: std over the current mini-batch)
$\tilde{a}_j=\frac{a_j-m_j}{s_j}$
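The normalization above can be sketched for one mini-batch of activations `a` (shape: batch x units); the epsilon term and the learned scale/shift (`gamma`, `beta`) of full Batch Normalization are assumptions added here:

```python
import numpy as np

def batch_norm(a, gamma=1.0, beta=0.0, eps=1e-5):
    m = a.mean(axis=0)              # per-unit mean m_j over the mini-batch
    s = a.std(axis=0)               # per-unit std s_j over the mini-batch
    a_tilde = (a - m) / (s + eps)   # z-normalized activations
    return gamma * a_tilde + beta   # optional learned scale and shift

# Usage: each unit's column comes out with roughly zero mean and unit std.
a = np.random.default_rng(0).normal(5.0, 3.0, size=(64, 4))
out = batch_norm(a)
```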