DL 2 - Improving DNN: Tuning, Regularization and Optimization

The last modifications of this post were around 3 years ago, some information may be outdated!

This is my note for the course (Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization). The code in this note has been rewritten to be clearer and more concise.

👉 Course 1 -- Neural Networks and Deep Learning.
👉 Course 2 -- Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization.
👉 Course 3 -- Structuring Machine Learning Projects.
👉 Course 4 -- Convolutional Neural Networks.
👉 Course 5 -- Sequence Models.

This course will teach you the "magic" of getting deep learning to work well. Rather than the deep learning process being a black box, you will understand what drives performance, and be able to more systematically get good results. You will also learn TensorFlow.

Initialization step

layers_dims contains the size of each layer from $0$ to $L$.

Zero initialization

parameters['W'+str(l)] = np.zeros((layers_dims[l], layers_dims[l-1]))
parameters['b'+str(l)] = np.zeros((layers_dims[l], 1))
  • The performance is really bad, and the cost does not really decrease.
  • Initializing all the weights to zero ⇒ fails to break symmetry ⇒ every neuron in each layer learns the same thing ⇒ it is as if $n^{[l]}=1$ for every layer ⇒ the network is no more powerful than a linear classifier such as logistic regression.

Random initialization

To break symmetry, let's initialize the weights randomly.

parameters['W'+str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10 # <- too LARGE (an example of what NOT to do)
parameters['b'+str(l)] = np.zeros((layers_dims[l], 1))
  • Large initial weights ⇒ the cost starts very high: the sigmoid output is very close to 0 or 1 for some examples, and when such a prediction is wrong the loss for that example goes toward infinity.
  • Poor initialization ⇒ vanishing/exploding gradients ⇒ slows down the optimization algorithm.
  • Training this network longer gives better results, BUT initializing with overly large random numbers slows down the optimization.

He initialization

Multiply the randomly initialized $W$ by $\sqrt{\frac{2}{n^{[l-1]}}}$. It's similar to Xavier initialization, in which the multiplier is $\sqrt{\frac{1}{n^{[l-1]}}}$.

parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2./layers_dims[l-1])
parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
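
Putting the two lines above into a loop over all layers could look like the sketch below. The function name `initialize_parameters_he` and the `seed` argument are assumptions made for this illustration; only the scaling factor comes from the note itself.

```python
import numpy as np

def initialize_parameters_he(layers_dims, seed=0):
    """He initialization for every layer; layers_dims[0] is the input size."""
    rng = np.random.default_rng(seed)  # seed is only here for reproducibility
    parameters = {}
    L = len(layers_dims) - 1  # number of layers, not counting the input
    for l in range(1, L + 1):
        # Standard normal weights scaled by sqrt(2 / n^{[l-1]})
        parameters['W' + str(l)] = rng.standard_normal(
            (layers_dims[l], layers_dims[l - 1])) * np.sqrt(2.0 / layers_dims[l - 1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters

params = initialize_parameters_he([1000, 500, 1])
print(params['W1'].shape, params['W1'].std())  # std should be close to sqrt(2/1000)
```

The printed standard deviation is a quick sanity check that the scaling was applied.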

Regularization step

Regularization is used to reduce overfitting.

L2 regularization

L2_regularization_cost = 0
for l in range(1, L+1):
    # add (lambd / (2m)) * sum of squared weights of layer l
    L2_regularization_cost += 1/m * lambd/2 * np.sum(np.square(parameters['W' + str(l)]))
  • The standard way. Modify the cost function from

    $$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \right)$$

    to

    $$J_{\text{regularized}} = \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \right)}_{\text{cross-entropy cost}} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2}}_{\text{L2 regularization cost}}$$
  • The value of $\lambda$ is a hyperparameter that you can tune using a dev set.

  • L2 regularization makes your decision boundary smoother. If $\lambda$ is too large, it is also possible to "oversmooth", resulting in a model with high bias.
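
A minimal sketch of the penalty and of its effect on backpropagation is given below. The function names and the toy `parameters` dictionary are assumptions for illustration; the gradient term $\frac{\lambda}{m} W^{[l]}$ is simply the derivative of the L2 penalty above with respect to $W^{[l]}$.

```python
import numpy as np

def l2_regularization_cost(parameters, lambd, m):
    """Return (lambd / (2 m)) * sum of squared weights over all layers."""
    L = len(parameters) // 2  # parameters holds W1, b1, ..., WL, bL
    total = 0.0
    for l in range(1, L + 1):
        total += np.sum(np.square(parameters['W' + str(l)]))
    return lambd / (2 * m) * total

def l2_gradient_term(W, lambd, m):
    """Derivative of the penalty w.r.t. W, to be added to dW from backprop."""
    return lambd / m * W

# Toy one-layer example (hypothetical values)
parameters = {'W1': np.array([[1.0, 2.0], [3.0, 4.0]]), 'b1': np.zeros((2, 1))}
penalty = l2_regularization_cost(parameters, lambd=0.5, m=10)
print(penalty)  # 0.5 / (2 * 10) * (1 + 4 + 9 + 16)
```

Biases are left out of the penalty, matching the formula above, which sums over the weights only.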


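Dropout

# [Forward] An example at layer 3
D3 = np.random.rand(A3.shape[0], A3.shape[1]) < keep_prob  # mask: keep each unit with probability keep_prob
A3 *= D3          # shut down the dropped units
A3 /= keep_prob   # inverted dropout: keep the expected value of A3 unchanged
# [Backprop]
dA3 *= D3         # apply the same mask
dA3 /= keep_prob  # and the same scaling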
  • Dropout is a widely used regularization technique that is specific to deep learning.
  • Randomly shuts down some neurons in each iteration.
  • When you shut some neurons down, you actually modify your model. The idea behind dropout is that at each iteration, you train a different model that uses only a subset of your neurons.
  • With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time.
  • Don't apply dropout to the input layer or output layer.
  • Use dropout during training, not during test time.
  • Apply dropout both during forward and backward propagation.
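
The points above can be tried out with a small sketch of inverted dropout. The helper names `dropout_forward` and `dropout_backward` and the toy activations are assumptions for illustration, not code from the course.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Inverted dropout: zero out units, then rescale the survivors."""
    D = rng.random(A.shape) < keep_prob  # True = keep this unit
    return A * D / keep_prob, D

def dropout_backward(dA, D, keep_prob):
    """Apply the same mask and the same scaling to the gradient."""
    return dA * D / keep_prob

rng = np.random.default_rng(1)
A = np.ones((50, 1000))  # hypothetical activations
A_drop, D = dropout_forward(A, keep_prob=0.8, rng=rng)
print(A_drop.mean())  # close to 1.0: the expected activation is preserved
```

At test time none of this is applied: the activations are used as they are.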

Gradient checking

  • To answer "Give me a proof that your backpropagation is actually working!"

  • We are confident in our computation of $J$ but not of $\frac{\partial J}{\partial\theta}$.

  • Use $J$ to compute a numerical approximation of $\frac{\partial J}{\partial\theta}$ and compare it with the $\frac{\partial J}{\partial\theta}$ computed by backpropagation.

    $$\frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon}$$
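
A minimal sketch of this check on a toy cost follows. The function name `gradient_check`, the toy cost, and the use of a relative difference between the two gradient vectors are assumptions for illustration; a real network would first flatten all its parameters into one vector `theta`.

```python
import numpy as np

def gradient_check(J, grad, theta, epsilon=1e-7):
    """Compare an analytic gradient with the two-sided numerical approximation."""
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        grad_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    # Relative difference between the two gradient vectors
    numerator = np.linalg.norm(grad - grad_approx)
    denominator = np.linalg.norm(grad) + np.linalg.norm(grad_approx)
    return numerator / denominator

# Toy cost J(theta) = sum(theta^2), whose exact gradient is 2 * theta
theta = np.array([1.0, -2.0, 3.0])
J = lambda th: np.sum(th ** 2)
difference = gradient_check(J, 2 * theta, theta)
print(difference)  # a tiny value means the analytic gradient agrees
```

Passing a deliberately wrong gradient, for instance `3 * theta`, gives a large difference, which is how a backpropagation bug would show up.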

Optimization algorithms


  • Gradient Descent: go down the hill.
  • Momentum / RMSprop / Adam: which direction?

Mini-batch gradient descent

  • Problem: NNs work great on big data, but a lot of data makes training slow ⇒ we need to optimize!
  • Solution: divide the training set into smaller "mini-batches" (for example, 5M examples into 5K mini-batches of 1K examples each).
$$\begin{aligned} X_{(n_X, m=5M)} &= [\underbrace{X^{(1)},\ldots,X^{(1K)}}_{X^{\{1\}}_{(n_X,1K)}}, \underbrace{X^{(1K+1)},\ldots,X^{(2K)}}_{X^{\{2\}}_{(n_X,1K)}}, \ldots, \underbrace{X^{(m-1K+1)},\ldots,X^{(m)}}_{X^{\{5K\}}_{(n_X,1K)}}], \\ Y_{(1, m=5M)} &= [\underbrace{y^{(1)},\ldots,y^{(1K)}}_{Y^{\{1\}}_{(1,1K)}}, \underbrace{y^{(1K+1)},\ldots,y^{(2K)}}_{Y^{\{2\}}_{(1,1K)}}, \ldots, \underbrace{y^{(m-1K+1)},\ldots,y^{(m)}}_{Y^{\{5K\}}_{(1,1K)}}] \end{aligned}$$

Difference between mini-batch and batch gradient descent in the cost function. The cost oscillates for mini-batch because it may be large for one mini-batch but small for the others. Image from the course.


Notations:

  • $X^{(i)}$: the $i$-th training example.
  • $z^{[l]}$: the $z$ value in the $l$-th layer.
  • $X^{\{t\}}, Y^{\{t\}}$: the $t$-th mini-batch.


X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):  # loop through epochs until convergence
    cost = 0
    for t in range(0, num_batches):  # loop through the mini-batches
        # Columns of the t-th mini-batch (batch_size = number of examples per mini-batch)
        X_t = X[:, t*batch_size:(t+1)*batch_size]
        Y_t = Y[:, t*batch_size:(t+1)*batch_size]
        # Forward propagation
        a, caches = forward_propagation(X_t, parameters)
        # Compute cost
        cost += compute_cost(a, Y_t)
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters
        parameters = update_parameters(parameters, grads)

How to build mini-batches?

We need 2 steps:

  1. Shuffle: shuffle the columns (training examples) of $X$ and $Y$ in the same way. The shuffling step ensures that examples will be split randomly into different mini-batches.
  2. Partition: choose a batch size and take mini-batches. Note that the last mini-batch may be smaller than the others.
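
The two steps can be sketched as follows. The function name `random_mini_batches`, its `seed` argument and the toy data are assumptions for this illustration.

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle the columns of X and Y together, then cut them into mini-batches."""
    m = X.shape[1]  # number of training examples
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(m)
    # Step 1: the same permutation keeps each example paired with its label
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]
    # Step 2: partition; the last mini-batch may be smaller
    mini_batches = []
    for start in range(0, m, mini_batch_size):
        end = start + mini_batch_size  # slicing past m is safe in NumPy
        mini_batches.append((X_shuffled[:, start:end], Y_shuffled[:, start:end]))
    return mini_batches

X = np.arange(2 * 10, dtype=float).reshape(2, 10)  # 10 toy examples
Y = np.arange(10).reshape(1, 10)
batches = random_mini_batches(X, Y, mini_batch_size=4)
print([b[0].shape[1] for b in batches])  # [4, 4, 2]
```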

Types of mini-batch

There are 3 types based on the mini-batch size $n_t$:

  1. Batch Gradient Descent ($n_t = m$): the entire training set is one batch, i.e. $(X^{\{1\}}, Y^{\{1\}}) = (X,Y)$.
  2. Stochastic Gradient Descent ($n_t = 1$): every training example is its own mini-batch ($m$ mini-batches).
  3. Mini-batch Gradient Descent ($1<n_t<m$).

Difference between the 3 types of mini-batch. Image from the course.


  • If the training set is small ($m \le 2000$): use batch gradient descent.
  • Typical mini-batch sizes: $64, 128, 256, 512, \ldots$
  • Make sure a mini-batch fits in CPU/GPU memory!

Exponentially weighted averages

  • It is the key building block of several optimization algorithms that are faster than Gradient Descent!

  • Example (temperature in London):

    • $\theta_t$: the temperature on day $t$.

    • $v_t$: the exponentially weighted average of the temperature, roughly an average over the last $\frac{1}{1-\beta}$ days.

      $$v_t = \beta v_{t-1} + (1-\beta)\theta_t$$
    • E.g. $\beta=0.9 \Rightarrow v_t \simeq$ average over 10 days; $\beta=0.98 \Rightarrow v_t \simeq$ average over 50 days.

  • Larger $\beta$ ⇒ smoother average line because we consider more days. However, the curve is now shifted further to the right.

    Exponentially weighted average curves: red line ($\beta=0.9$), green line ($\beta=0.98$). Image from the course.

  • When $\beta$ is very large ⇒ $v_t$ adapts slowly to changes in temperature (more latency).

  • Why do we call it "exponentially"? With $\beta = 0.9$, the weight of each older day decays exponentially:

    $$\begin{aligned}v_{100} &= 0.9\times v_{99} + 0.1\times \theta_{100}\\&= 0.1\times \theta_{100} + 0.1\times 0.9\times\theta_{99} + 0.1\times 0.9^2 \times\theta_{98} + \ldots\end{aligned}$$
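
The recurrence for $v_t$ takes only a few lines of code. The sketch below uses plain Python; the function name and the constant temperature series are assumptions chosen so that the effect of $\beta$ is easy to see.

```python
def exponentially_weighted_average(thetas, beta):
    """Return the list v_1, ..., v_T, starting from v_0 = 0."""
    v, averages = 0.0, []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return averages

temperatures = [10.0] * 200  # hypothetical constant temperature
v_fast = exponentially_weighted_average(temperatures, beta=0.9)
v_slow = exponentially_weighted_average(temperatures, beta=0.98)
print(v_fast[9], v_slow[9])  # the larger beta is still far below 10 after 10 days
```

Only the single number `v` has to be kept in memory, which is why this average is so cheap to maintain during training.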

Bias correction

  • Problem: the value of $v_t$ at the beginning of the exponentially weighted average curve may be lower than what we expect. For example, with $\beta=0.98$ and $v_0=0$, we have $v_1 = 0.98 v_0 + 0.02\theta_1 = 0.02\theta_1$, which is far below $\theta_1$.

  • Solution: instead of using $v_t$, we take

    $$\dfrac{v_t}{1-\beta^t}$$

  • When $t$ is large ⇒ $\beta^t \simeq 0 \Rightarrow \dfrac{v_t}{1-\beta^t} \simeq v_t$

    Bias correction for the green line: it is effective at the beginning of the line; with bigger $t$, the green and violet lines overlap. Image from the course.

  • In practice, we don't really see people bothering with bias correction!
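
To see the effect of the correction, here is a small sketch in plain Python. The function name and the three temperature readings are assumptions for illustration.

```python
def bias_corrected_average(thetas, beta):
    """Exponentially weighted average divided by the 1 - beta^t correction."""
    v, corrected = 0.0, []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        corrected.append(v / (1 - beta ** t))
    return corrected

temperatures = [10.0, 10.0, 10.0]  # hypothetical readings
print(bias_corrected_average(temperatures, beta=0.98))  # each value is about 10
```

Without the division, the first value would be $0.02 \times 10 = 0.2$ instead of about 10.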

Gradient Descent with Momentum

  • It's faster than Gradient Descent!

  • Why: when we use mini-batches, there are oscillations; momentum helps us reduce them.

  • One sentence: compute the exponentially weighted average of your gradients ⇒ use that average to update your weights instead.

  • Idea: Momentum takes into account the past gradients to smooth out the update. We will store the 'direction' of the previous gradients in the variable $v$. Formally, this will be the exponentially weighted average of the gradient on previous steps.

  • Intuition: You can also think of $v$ as the "velocity" of a ball rolling downhill, building up speed (and momentum) according to the direction of the gradient/slope of the hill.

    • $dW, db$ are like "acceleration".
    • $VdW, Vdb$ are like "velocity".
    • $\beta$ is like "friction".

    We want slower learning in the vertical direction and faster learning in the horizontal direction. Image from the course.

  • Algorithm: on iteration tt:

    1. Compute $dW, db$ on the current mini-batch.
    2. $VdW = \beta VdW + (1-\beta)dW$.
    3. $Vdb = \beta Vdb + (1-\beta)db$.
    4. $W:=W-\alpha VdW$.
    5. $b:=b-\alpha Vdb$.
  • Implementation:

    • Try to tune $\beta$ in $[0.8, 0.999]$; $\beta=0.9$ is commonly used.
    • Don't bother with bias correction; it is not needed here.
    • The $(1-\beta)$ factor is not strictly needed in the formulas, but Andrew prefers to keep it!
    • Bigger $\beta$ ⇒ smaller oscillations in the vertical direction.
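
The algorithm above can be sketched for a single parameter array as follows. The function name `momentum_step`, the toy quadratic cost and the chosen `alpha` are assumptions for illustration only.

```python
import numpy as np

def momentum_step(W, dW, VdW, alpha=0.1, beta=0.9):
    """One momentum update; returns the new W and the new velocity."""
    VdW = beta * VdW + (1 - beta) * dW
    W = W - alpha * VdW
    return W, VdW

# Toy cost J(W) = 0.5 * (W1^2 + 25 * W2^2), much steeper in the second direction
scale = np.array([1.0, 25.0])
W, VdW = np.array([5.0, 1.0]), np.zeros(2)
for _ in range(300):
    dW = scale * W  # gradient of the toy cost
    W, VdW = momentum_step(W, dW, VdW)
print(W)  # approaches the minimum at [0, 0]
```

The same update is applied to $b$ with its own velocity `Vdb`.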


RMSprop

  • It's "Root Mean Square propagation".
  • Algorithm: on iteration $t$,
    1. Compute $dW, db$ on the current mini-batch.
    2. $SdW = \beta SdW + (1-\beta)dW^2$ (the square is element-wise).
    3. $Sdb = \beta Sdb + (1-\beta)db^2$.
    4. $W:=W -\alpha \frac{dW}{\sqrt{SdW}+\epsilon}$.
    5. $b:=b-\alpha \frac{db}{\sqrt{Sdb} + \epsilon}$.
  • Add a small $\epsilon$ (e.g. $10^{-8}$) to the denominator so that the update does not blow up when $\sqrt{SdW}$ is very close to zero.
  • In practice: $dW, db$ are very high dimensional vectors.
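
A minimal sketch of one RMSprop update is shown below. The function name `rmsprop_step` and the toy gradient are assumptions for illustration; the point is that dividing by $\sqrt{SdW}$ evens out the step size across directions with very different gradient magnitudes.

```python
import numpy as np

def rmsprop_step(W, dW, SdW, alpha=0.01, beta=0.9, epsilon=1e-8):
    """One RMSprop update; squares and square roots are element-wise."""
    SdW = beta * SdW + (1 - beta) * dW ** 2
    W = W - alpha * dW / (np.sqrt(SdW) + epsilon)
    return W, SdW

# Toy cost J(W) = 0.5 * (W1^2 + 25 * W2^2): the two gradients differ by a factor of 5 here
scale = np.array([1.0, 25.0])
W, SdW = np.array([5.0, 1.0]), np.zeros(2)
W_next, SdW_next = rmsprop_step(W, scale * W, SdW)
step = W - W_next
print(step)  # both coordinates move by a similar amount
```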

Adam Optimization

  • It's "Adaptive Moment Estimation".
  • One of the most effective optimization algorithms for training NNs. It is commonly used and has been shown to work well across a wide variety of architectures.
  • Adam = Momentum + RMSprop.
  • Implementation: on iteration $t$,
    1. Compute $dW, db$ using the current mini-batch.
    2. (Momentum) $VdW = \beta_1 VdW + (1-\beta_1)dW$; $Vdb = \beta_1 Vdb+(1-\beta_1)db$.
    3. (RMSprop) $SdW = \beta_2 SdW + (1-\beta_2)dW^2$; $Sdb = \beta_2Sdb +(1-\beta_2)db^2$.
    4. $V_{dW}^{\text{corrected}} = \dfrac{VdW}{1-\beta_1^t}$; $V_{db}^{\text{corrected}} = \dfrac{Vdb}{1-\beta_1^t}$.
    5. $S_{dW}^{\text{corrected}} = \dfrac{SdW}{1-\beta_2^t}$; $S_{db}^{\text{corrected}} = \dfrac{Sdb}{1-\beta_2^t}$.
    6. $W:=W-\alpha \dfrac{V_{dW}^{\text{corrected}}}{\sqrt{S_{dW}^{\text{corrected}}} + \epsilon}$; $b:=b-\alpha \dfrac{V_{db}^{\text{corrected}}}{\sqrt{S_{db}^{\text{corrected}}} + \epsilon}$.
  • The moving averages are initialized to zero, i.e. $VdW=SdW=Vdb=Sdb=0$.
  • In Momentum, if $\beta=0$, it's standard gradient descent without momentum.
  • Hyperparameter choices:
    • $\alpha$: needs to be tuned, very important!
    • $\beta_1 = 0.9$ (for $dW$), first moment.
    • $\beta_2 = 0.999$ (for $dW^2$), second moment.
    • $\epsilon = 10^{-8}$.
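
The six steps can be sketched for a single parameter array as below. The function name `adam_step`, the toy cost and the value `alpha=0.01` are assumptions for illustration; the default $\beta_1$, $\beta_2$ and $\epsilon$ are the ones listed above.

```python
import numpy as np

def adam_step(W, dW, VdW, SdW, t, alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for a single parameter array; t starts at 1."""
    VdW = beta1 * VdW + (1 - beta1) * dW        # momentum part
    SdW = beta2 * SdW + (1 - beta2) * dW ** 2   # RMSprop part
    V_corrected = VdW / (1 - beta1 ** t)        # bias correction
    S_corrected = SdW / (1 - beta2 ** t)
    W = W - alpha * V_corrected / (np.sqrt(S_corrected) + epsilon)
    return W, VdW, SdW

# Toy cost J(W) = sum(W^2) with gradient 2 * W
W0 = np.array([1.0, -3.0])
W, VdW, SdW = W0.copy(), np.zeros(2), np.zeros(2)
for t in range(1, 1001):
    W, VdW, SdW = adam_step(W, 2 * W, VdW, SdW, t, alpha=0.01)
print(W)  # has moved from W0 toward the minimum at [0, 0]
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so every coordinate moves by about `alpha`, whatever the gradient scale.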

Learning rate decay

  • Idea: slowly reduce the learning rate over time.

  • Why? The figure below shows that we need a slower rate $\alpha$ (smaller steps) in the area near the center (the minimum).

    Example of learning rate decay. Image from the course.

  • Recall that 1 epoch = 1 pass through the data.

  • The learning rate decay can be chosen as one of the following ($\alpha_0$ is the initial learning rate, $k$ is a constant, $t$ is the mini-batch number):

$$\begin{aligned} \alpha &= \dfrac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}} \times \alpha_0, \\ \alpha &= 0.95^{\text{epoch\_num}} \times \alpha_0 \quad \text{(exponential decay)}, \\ \alpha &= \dfrac{k}{\sqrt{\text{epoch\_num}}} \times \alpha_0, \\ \alpha &= \dfrac{k}{\sqrt{t}} \times \alpha_0. \end{aligned}$$
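
The schedules above are one-liners in code. The function names and the example values of `alpha0` and `decay_rate` below are assumptions for illustration.

```python
import math

def inverse_decay(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

def exponential_decay(alpha0, epoch_num, base=0.95):
    """alpha = base^epoch_num * alpha0."""
    return base ** epoch_num * alpha0

def sqrt_decay(alpha0, k, epoch_num):
    """alpha = k / sqrt(epoch_num) * alpha0; epoch_num must be >= 1."""
    return k / math.sqrt(epoch_num) * alpha0

alpha0 = 0.2
for epoch_num in range(1, 5):
    print(epoch_num, inverse_decay(alpha0, decay_rate=1.0, epoch_num=epoch_num))
# learning rates: 0.1, then about 0.0667, 0.05 and 0.04
```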

Problem of local optima

Local optima problem: local and global optima (left) and saddle point (right). Image from the course.

  • In high dimensions, you are more likely to see saddle points than local optima.
  • Problem of plateaus: a plateau is a region where the derivative is close to zero for a long time.
    • You are unlikely to get stuck in a bad local optimum.
    • Plateaus can make learning slow: use Momentum, RMSprop, Adam.

Batch GD makes learning too long?

  • Try better random initialization for weights.
  • Try mini-batch GD.
  • Try using Adam.
  • Try tuning the learning rate $\alpha$.

Hyperparameter tuning

Tuning process

  • There are many hyperparameters but some are more important than others!

  • Learning rate $\alpha$ (most important); #hidden units, $\beta$, mini-batch size (2nd most important); #layers, learning rate decay, ...

  • Don't use a grid, use random sampling!

    Tuning process. Don't use grid (left), use random (right). Image from the course.

  • Coarse to fine: find an area containing effective values ⇒ zoom in and take more points in that area.

    Coarse to fine: first try on a big square, then focus on the smaller one (blue). Image from the course.

  • Sampling at random does NOT mean sampling on a uniform scale! We can sample uniformly for #hidden units and #layers, but not for the others (e.g. $\alpha$).

  • For $\alpha$, for example, we sample on a log scale: split the range into one interval per power of 10 (e.g. $[10^{-4}, 10^{-3}], [10^{-3}, 10^{-2}], \ldots$) so that each interval gets the same share of the samples.

    Appropriate scale for hyperparameters. Image from the course.

  • Hyperparameters for exponentially weighted averages:

    • We should not sample $\beta$ uniformly in $[0.9, 0.999]$ because its effect is very non-uniform:

      • $\beta: 0.9000 \to 0.9005$: not much change,
      • $\beta: 0.999 \to 0.995$: huge impact (averaging over about 1000 values vs. about 200 values)!
    • Sample $1-\beta \in [10^{-3}, 10^{-1}]$ on a log scale instead, with $r$ drawn uniformly:

      $$\begin{aligned}r &\in [-3, -1] \\1-\beta = 10^r &\Leftrightarrow \beta = 1-10^r\end{aligned}$$
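
Sampling on a log scale can be sketched with the standard library only. The function names and the range $[10^{-4}, 1]$ used for $\alpha$ are assumptions for illustration; the range for $\beta$ is the one given above.

```python
import random

def sample_learning_rate(rng, low_exp=-4, high_exp=0):
    """Sample alpha uniformly on a log scale in [10^low_exp, 10^high_exp]."""
    r = rng.uniform(low_exp, high_exp)
    return 10 ** r

def sample_beta(rng, low_exp=-3, high_exp=-1):
    """Sample beta so that 1 - beta is uniform on a log scale."""
    r = rng.uniform(low_exp, high_exp)
    return 1 - 10 ** r

rng = random.Random(0)  # fixed seed, only for reproducibility
alphas = [sample_learning_rate(rng) for _ in range(5)]
betas = [sample_beta(rng) for _ in range(5)]
print(alphas, betas)
```

With this scheme, about as many samples of $\beta$ fall in $[0.9, 0.99]$ as in $[0.99, 0.999]$.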

In practice: Panda vs Caviar

  • How to organize your hyperparameter search?
  • Advice: re-test/re-evaluate your hyperparameters at least once every several months.
  • 2 ways:
    1. Babysitting one model (Panda): when we have huge data but weak CPU/GPU ⇒ try a very small number of models at a time. Check the performance step by step (is the cost function decreasing? ...).
      • In some domains like advertising, computer vision apps,...
      • We call it "Panda" because pandas have very few babies at a time (and in their life) ⇒ they try hard to keep each one alive.
    2. Training many models in parallel (Caviar): when the data is not huge and we have strong CPU/GPU ⇒ try many models in parallel and choose the one with the best performance!
      • We call it "Caviar" because fish lay a huge number of eggs and do not pay much attention to any single one.

Batch Normalization

  • Makes the NN much more robust to the choice of hyperparameters. It doesn't work for all NNs, but when it does, it makes training faster!
  • One of the most important ideas in the rise of Deep Learning.
  • Just as we normalize the input to speed up learning, here we normalize $Z$ (in the hidden layers).

Given some intermediate values in the NN, $Z^{[l](1)},\ldots, Z^{[l](m)}$,

  1. $\mu = \dfrac{1}{m} \sum_i Z^{[l](i)}$
  2. $\sigma^2 = \dfrac{1}{m}\sum_i (Z^{[l](i)} - \mu)^2$
  3. $Z^{[l](i)}_{\text{norm}} = \dfrac{Z^{[l](i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$, so that the normalized values have mean $0$ and standard deviation $1$.
  4. $\tilde{Z}^{[l](i)} = \gamma Z^{[l](i)}_{\text{norm}} + \beta$, so that the values can take a different mean and variance.

Now, $\gamma, \beta$ are learnable parameters of the model.
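
The four steps can be sketched for one layer as below. The function name `batch_norm_forward`, the toy data and the choice of putting $\epsilon$ inside the square root are assumptions for illustration; rows are hidden units and columns are the examples of one mini-batch.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, epsilon=1e-8):
    """Normalize each hidden unit (row of Z) over the mini-batch (columns)."""
    mu = np.mean(Z, axis=1, keepdims=True)
    sigma2 = np.var(Z, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(sigma2 + epsilon)
    Z_tilde = gamma * Z_norm + beta  # learnable scale and shift
    return Z_tilde, mu, sigma2

rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=3.0, size=(4, 256))  # 4 hidden units, 256 examples
gamma, beta = np.full((4, 1), 2.0), np.full((4, 1), -1.0)
Z_tilde, mu, sigma2 = batch_norm_forward(Z, gamma, beta)
print(Z_tilde.mean(axis=1), Z_tilde.std(axis=1))  # about -1 and 2 for every unit
```

The output mean and standard deviation are set by `beta` and `gamma`, not by the statistics of the incoming `Z`.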

  • If we choose different $\beta, \gamma$ ⇒ hidden units have other means & variances.
  • Instead of using $Z^{[l](1)}, \ldots, Z^{[l](m)}$, we use $\tilde{Z}^{[l](i)}$.
  • Difference between normalizing the input $X$ and normalizing in hidden units:
    • $X$: after normalizing, $\mu=0, \sigma=1$.
    • $Z$: after normalizing, various $\mu, \sigma$ (set by $\beta, \gamma$).
$$X \xrightarrow[]{W^{[1]}, b^{[1]}} Z^{[1]} \xrightarrow[\text{Batch Norm}]{\beta^{[1]}, \gamma^{[1]}} \tilde{Z}^{[1]} \to a^{[1]} = g^{[1]}(\tilde{Z}^{[1]}) \xrightarrow[]{W^{[2]}, b^{[2]}} Z^{[2]} \xrightarrow[\text{Batch Norm}]{\beta^{[2]}, \gamma^{[2]}} \tilde{Z}^{[2]} \to a^{[2]} \to \ldots$$
  • Note that $\beta$ in this case is different from $\beta$ in Adam optimization!

  • We can update $\gamma, \beta$ with plain gradient descent, or with Adam/RMSprop/Momentum.

  • In practice, we won't have to implement Batch Norm step by step ourselves; a programming framework (like TensorFlow) will do it!

  • In practice, Batch Norm is usually applied on mini-batches of your training set.

  • Parameters: $W^{[l]}, \beta^{[l]}, \gamma^{[l]}$. We don't need $b^{[l]}$ because it is subtracted out in the normalization step!

  • Fitting Batch Norm into a NN: for $t$ going through the mini-batches,

    1. Compute forward prop on $X^{\{t\}}$.
    2. In each hidden layer, use Batch Norm to replace $Z^{[l]}$ with $\tilde{Z}^{[l]}$.
    3. Use backprop to compute $dW^{[l]}, d\beta^{[l]}, d\gamma^{[l]}$.
    4. Update params (we can also use Momentum / RMSprop / Adam):
    $$\begin{aligned}W^{[l]} &:= W^{[l]} - \alpha dW^{[l]}, \\\beta^{[l]} &:= \beta^{[l]} - \alpha d\beta^{[l]}, \\\gamma^{[l]} &:= \gamma^{[l]} - \alpha d\gamma^{[l]}.\end{aligned}$$
  • Sometimes, BN has a 2nd effect as a regularization technique, but it's unintended! We don't use it for the purpose of regularization; use L1, L2 or dropout instead.

(Recall) Regularization: techniques that lower the complexity of a NN during training, thus preventing overfitting.

Why BN works?

  • It makes the weights in later / deeper layers more robust to changes in the weights of earlier layers.
  • Covariate shift problem: suppose we have learned a mapping $X \to Y$. If the distribution of $X$ changes, the results in $Y$ change a lot and we have to re-train our model.
    • Example: "cat vs non-cat" problem. If we apply params from the model of "black cat vs non-cat" to the problem of "colored-cat vs non-cat", it won't work because the distribution in "black cat" is different from "colored cat".

Covariate shift problem. Image from the course.

Why BN works? Image from the course.

From the perspective of layer 3, it depends only on layer 2 ⇒ if the layers before layer 2 change ⇒ the distribution of layer 2 changes ⇒ covariate shift problem for layer 3 ⇒ Batch Norm makes sure that the mean and variance in layer 2 stay stable before going to layer 3!

Batch Norm in test time

  • BN processes our data one mini-batch at a time. However, at test time, you may need to process the examples one at a time ⇒ you need to adapt your network to do that.
  • Idea: estimate $\mu, \sigma^2$ using an exponentially weighted average (across mini-batches). In other words,
    • At training time, we also calculate (and store) $\mu^{\{t\}[l]}, \sigma^{2\{t\}[l]}$ for each mini-batch.
    • Find $\mu, \sigma^2$ as the exponentially weighted average over all mini-batches.
    • Use this $\mu, \sigma^2$ to find $Z_{\text{norm}}$ and $\tilde{Z}$ (for each example $i$).
  • Don't worry, it's easy to use with Deep Learning Frameworks.
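
To make the idea concrete, here is a sketch of how the running statistics could be tracked by hand. The class name `RunningBatchNormStats`, its `momentum` argument and the simulated mini-batches are assumptions for illustration; frameworks do this bookkeeping internally.

```python
import numpy as np

class RunningBatchNormStats:
    """Exponentially weighted averages of the per-mini-batch mean and variance."""
    def __init__(self, n_units, momentum=0.9):
        self.momentum = momentum
        self.mu = np.zeros((n_units, 1))
        self.sigma2 = np.ones((n_units, 1))

    def update(self, Z):
        """Call once per training mini-batch Z of shape (n_units, batch_size)."""
        mu_t = Z.mean(axis=1, keepdims=True)
        sigma2_t = Z.var(axis=1, keepdims=True)
        self.mu = self.momentum * self.mu + (1 - self.momentum) * mu_t
        self.sigma2 = self.momentum * self.sigma2 + (1 - self.momentum) * sigma2_t

    def normalize(self, z, gamma, beta, epsilon=1e-8):
        """Test time: normalize with the stored statistics, even for one example."""
        z_norm = (z - self.mu) / np.sqrt(self.sigma2 + epsilon)
        return gamma * z_norm + beta

rng = np.random.default_rng(0)
stats = RunningBatchNormStats(n_units=3)
for _ in range(200):  # simulated training mini-batches
    stats.update(rng.normal(loc=2.0, scale=0.5, size=(3, 64)))
print(stats.mu.ravel(), stats.sigma2.ravel())  # near 2.0 and 0.25
```

At test time, `normalize` works on a single column because it no longer needs the statistics of the current batch.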

Tensorflow introduction

Writing and running programs in TensorFlow has the following steps:

  1. Create Tensors (variables) that are not yet executed/evaluated.
  2. Write operations between those Tensors.
  3. Initialize your Tensors.
  4. Create a Session.
  5. Run the Session. This will run the operations you'd written above.
import tensorflow as tf  # the code below uses the TensorFlow 1.x graph API

# create placeholders
x = tf.placeholder(tf.int64, name = 'x')
X = tf.placeholder(tf.float32, [n_x, None], name="X")  # n_x: number of input features

# initialize
W1 = tf.get_variable("W1", [25,12288], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
b1 = tf.get_variable("b1", [25,1], initializer = tf.zeros_initializer())

There are two typical ways to create and use sessions in tensorflow:

  1. Method 1:
sess = tf.Session()
# Run the variables initialization (if needed), run the operations
result = sess.run(..., feed_dict = {...})
sess.close() # Close the session
  2. Method 2:
with tf.Session() as sess:
    # run the variables initialization (if needed), run the operations
    result = sess.run(..., feed_dict = {...})
    # This takes care of closing the session for you :)

What you should remember:

  • Tensorflow is a programming framework used in deep learning
  • The two main object classes in tensorflow are Tensors and Operators.
  • When you code in tensorflow you have to take the following steps:
    • Create a graph containing Tensors (Variables, Placeholders ...) and Operations (tf.matmul, tf.add, ...)
    • Create a session
    • Initialize the session
    • Run the session to execute the graph
  • You can execute the graph multiple times as you've seen in model()
  • The backpropagation and optimization are done automatically when running the session on the "optimizer" object.

👉 Check more details about the code in the notebook.
