Momentum: At each update, also add a fraction of the exponentially decaying average of past gradients
Adaptive Learning Rate: RMSProp (scale each parameter's step by a running average of squared gradients)
Adam: Adaptive Learning Rate w/ Momentum
Since s0 is initialized to 0, the first update s1 = 0.1 * grad reflects only 10% of the gradient (with decay 0.9); this early underestimate is why Adam divides the running averages by (1 - beta^t) as a bias correction.
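A minimal sketch of one Adam step combining the pieces above; the parameter vector `w`, gradient `g`, state buffers `m`/`s`, and the hyperparameter defaults are assumptions for illustration:

```python
import numpy as np

def adam_step(w, g, m, s, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (m) + RMSProp scaling (s) + bias correction."""
    m = beta1 * m + (1 - beta1) * g        # running average of gradients (momentum)
    s = beta2 * s + (1 - beta2) * g ** 2   # running average of squared gradients (RMSProp)
    m_hat = m / (1 - beta1 ** t)           # bias correction: m and s start at 0,
    s_hat = s / (1 - beta2 ** t)           # so early averages are too small
    w = w - lr * m_hat / (np.sqrt(s_hat) + eps)
    return w, m, s

# Usage: minimize f(w) = w^2, whose gradient is 2w.
w, m, s = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 501):
    w, m, s = adam_step(w, 2 * w, m, s, t)
```

At t = 1 the correction divides m by exactly (1 - beta1) = 0.1, which undoes the 10% effect described above.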
Batch Normalization: Z-normalize hidden unit values (m: mean, s: std over the current mini-batch)
$\tilde{a}_j=\frac{a_j-m_j}{s_j}$
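The normalization above can be sketched for one mini-batch of activations `a` (shape: batch x units); the epsilon term and the learned scale/shift (`gamma`, `beta`) of full Batch Normalization are assumptions added here:

```python
import numpy as np

def batch_norm(a, gamma=1.0, beta=0.0, eps=1e-5):
    m = a.mean(axis=0)              # per-unit mean m_j over the mini-batch
    s = a.std(axis=0)               # per-unit std s_j over the mini-batch
    a_tilde = (a - m) / (s + eps)   # z-normalized activations
    return gamma * a_tilde + beta   # optional learned scale and shift

# Usage: each unit's column comes out with roughly zero mean and unit std.
a = np.random.default_rng(0).normal(5.0, 3.0, size=(64, 4))
out = batch_norm(a)
```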