Note for course DL 2: Improving DNN: Tuning, Regularization and Optimization

Anh-Thi Dinh
This is my note for the course (Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization). The code in this note has been rewritten to be clearer and more concise.
This course will teach you the "magic" of getting deep learning to work well. Rather than the deep learning process being a black box, you will understand what drives performance, and be able to more systematically get good results. You will also learn TensorFlow.

Initialization step

layers_dims contains the size of each layer, from the input layer ($l = 0$) to the output layer ($l = L$).

zero initialization

parameters['W'+str(l)] = np.zeros((layers_dims[l], layers_dims[l-1]))
parameters['b'+str(l)] = np.zeros((layers_dims[l], 1))
• The performance is really bad, and the cost does not really decrease.
• Initializing all the weights to zero ⇒ fails to break symmetry → every neuron in each layer learns the same thing → it is as if $n^{[l]} = 1$ for every layer → the network is no more powerful than a linear classifier such as logistic regression.

Random initialization

To break symmetry, let's initialize the weights randomly.
parameters['W'+str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10
# 👆 the factor 10 is too LARGE (just an example of what you SHOULDN'T do)
parameters['b'+str(l)] = np.zeros((layers_dims[l], 1))
• High initial weights ⇒ the sigmoid output is pushed very close to 0 or 1, so the cost starts very high (a confidently wrong prediction makes the loss go to infinity).
• Poor initialization ⇒ vanishing/exploding gradients ⇒ slows down the optimization algorithm.
• Training this network longer ⇒ better results, BUT initializing with overly large random numbers still slows down the optimization.

He initialization

Multiply the random initial weights by $\sqrt{\frac{2}{n^{[l-1]}}}$. It's similar to Xavier initialization, in which the multiplier factor is $\sqrt{\frac{1}{n^{[l-1]}}}$.
parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2./layers_dims[l-1])
parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
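The He initialization lines above can be wrapped into a complete function. This is a minimal sketch, not the course's reference code: the function name `initialize_parameters_he`, the `seed` argument and the example layer sizes are assumptions made for illustration.

```python
import numpy as np

def initialize_parameters_he(layers_dims, seed=0):
    """Return He-initialized weights and zero biases for every layer."""
    rng = np.random.default_rng(seed)
    parameters = {}
    L = len(layers_dims) - 1  # number of layers, not counting the input layer
    for l in range(1, L + 1):
        fan_in = layers_dims[l - 1]
        # Standard normal draws scaled by sqrt(2 / fan_in)
        parameters['W' + str(l)] = rng.standard_normal((layers_dims[l], fan_in)) * np.sqrt(2.0 / fan_in)
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters

# Hypothetical network: 1000 inputs, 500 hidden units, 1 output
params = initialize_parameters_he([1000, 500, 1])
print(params['W1'].shape, params['W1'].std())
```

The printed standard deviation of `W1` should be close to `sqrt(2/1000)`, which is the point of the scaling: the weights shrink as the number of incoming units grows.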

Regularization step

To reduce the overfitting problem.

L2 regularization

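L2_regularization_cost = 0
for l in range(1, L+1):
    # W[l] is the weight matrix of layer l
    L2_regularization_cost += 1/m * lambd/2 * np.sum(np.square(W[l]))
• The standard way. Modify the cost function from
$J = -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log a^{[L](i)} + (1-y^{(i)})\log\left(1-a^{[L](i)}\right)\right)$
• to
$J_{regularized} = J + \frac{1}{m}\frac{\lambda}{2}\sum_{l}\sum_{k}\sum_{j}\left(W^{[l]}_{k,j}\right)^2$
• The value of $\lambda$ is a hyperparameter that you can tune using a dev set.
• L2 regularization makes your decision boundary smoother. If $\lambda$ is too large, it is also possible to "oversmooth", resulting in a model with high bias.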
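To make the regularized cost concrete, here is a small sketch that adds the L2 penalty to a cross-entropy cost. The function name `compute_cost_with_l2`, the layout of the `parameters` dictionary (`W1`, `b1`, ...) and the toy numbers are assumptions for illustration only.

```python
import numpy as np

def compute_cost_with_l2(AL, Y, parameters, lambd):
    """Cross-entropy cost plus the L2 penalty (1/m) * (lambd/2) * sum of squared weights."""
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    L = len(parameters) // 2  # parameters holds W1, b1, ..., WL, bL
    squared_weights = sum(np.sum(np.square(parameters['W' + str(l)])) for l in range(1, L + 1))
    return cross_entropy + lambd / (2 * m) * squared_weights

# Toy values: two examples, one layer with two weights
AL = np.array([[0.8, 0.4]])   # predicted probabilities
Y = np.array([[1, 0]])        # true labels
parameters = {'W1': np.array([[1.0, -2.0]]), 'b1': np.zeros((1, 1))}

cost_plain = compute_cost_with_l2(AL, Y, parameters, lambd=0.0)
cost_reg = compute_cost_with_l2(AL, Y, parameters, lambd=0.1)
print(cost_plain, cost_reg)
```

Only the weights are penalized, not the biases. During backpropagation the penalty contributes an extra `(lambd / m) * W` to each `dW`, which is the derivative of the added term.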

Dropout

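# [Forward] An example at layer 3
D3 = np.random.rand(A3.shape[0], A3.shape[1]) < keep_prob  # mask: True = keep the neuron
A3 *= D3          # shut down the dropped neurons
A3 /= keep_prob   # rescale so that the expected activation is unchanged
# [Backprop]
dA3 *= D3         # apply the same mask to the gradient
dA3 /= keep_prob  # and the same scaling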
• Dropout is a widely used regularization technique that is specific to deep learning.
• Randomly shuts down some neurons in each iteration.
• When you shut some neurons down, you actually modify your model. The idea behind drop-out is that at each iteration, you train a different model that uses only a subset of your neurons.
• With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time.
• Don't apply dropout to the input layer or output layer.
• Use dropout during training, not during test time.
• Apply dropout both during forward and backward propagation.
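The forward and backward steps of dropout can be sketched as two small functions. This is an illustrative version under stated assumptions: the names `dropout_forward` and `dropout_backward`, the argument `keep_prob` (the probability of keeping a neuron) and the toy activations are not from the original note.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Inverted dropout: drop each unit with probability 1 - keep_prob, then rescale."""
    D = rng.random(A.shape) < keep_prob  # boolean mask, True = keep the unit
    A_dropped = A * D / keep_prob        # rescaling keeps the expected activation unchanged
    return A_dropped, D

def dropout_backward(dA, D, keep_prob):
    """Apply the same mask and the same scaling to the gradient."""
    return dA * D / keep_prob

rng = np.random.default_rng(1)
A3 = np.ones((4, 1000))  # toy activations of a hidden layer
A3_drop, D3 = dropout_forward(A3, keep_prob=0.8, rng=rng)
dA3 = dropout_backward(np.ones_like(A3), D3, keep_prob=0.8)
print(A3_drop.mean())  # stays close to 1 thanks to the rescaling
```

The mask `D3` is stored during the forward pass because backpropagation must shut down exactly the same neurons. At test time neither function is called.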

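Gradient checking

• To answer "Give me a proof that your backpropagation is actually working!"
• We are confident in computing the cost $J$ (forward propagation) but not the gradient $\frac{\partial J}{\partial \theta}$ (backpropagation).
• Use $J$ to compute an approximation of $\frac{\partial J}{\partial \theta}$, namely $\frac{J(\theta+\varepsilon) - J(\theta-\varepsilon)}{2\varepsilon}$, and compare it with the $\frac{\partial J}{\partial \theta}$ computed by backpropagation.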
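A gradient check can be sketched with a two-sided finite difference. The function name `gradient_check`, the relative-difference formula and the toy cost below are assumptions for illustration; in a real network `theta` would be all parameters flattened into one vector.

```python
import numpy as np

def gradient_check(J, grad, theta, epsilon=1e-7):
    """Compare an analytic gradient with a two-sided finite-difference estimate.

    J maps theta to a scalar cost; grad is the analytic gradient at theta.
    Returns ||grad - approx|| / (||grad|| + ||approx||).
    """
    theta = theta.astype(float)
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus.flat[i] += epsilon
        minus.flat[i] -= epsilon
        approx.flat[i] = (J(plus) - J(minus)) / (2 * epsilon)
    return np.linalg.norm(grad - approx) / (np.linalg.norm(grad) + np.linalg.norm(approx))

# Toy cost J(theta) = sum(theta^2), whose exact gradient is 2 * theta
theta = np.array([1.0, -2.0, 3.0])
J = lambda th: np.sum(th ** 2)
diff_ok = gradient_check(J, 2 * theta, theta)   # correct gradient
diff_bad = gradient_check(J, 3 * theta, theta)  # deliberately wrong gradient
print(diff_ok, diff_bad)
```

A tiny relative difference suggests that the analytic gradient is right, while a large one (0.2 for the wrong gradient here) points to a bug. Because it needs two cost evaluations per parameter, this check is only meant for debugging, not for training.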

Optimization algorithms

Intuition:
• Gradient Descent: go down the hill.
• Momentum / RMSprop / Adam: which direction?

Mini-batch gradient descent

• Problem: NNs work great on big data, but a lot of data makes training slow ⇒ We need to optimize!
• Solution: Divide the training set into smaller "mini-batches" (for example, split 5M examples into 5K mini-batches of 1K examples each).

Notations

• $x^{(i)}$: $i$th training example.
• $z^{[l]}$: $z$ value in the $l$th layer.
• $X^{\{t\}}, Y^{\{t\}}$: index of different mini-batches.

Algorithm

X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):  # loop through the epochs until convergence
  cost = 0
  for t in range(0, num_batches):  # loop through the mini-batches
    # Select the columns of mini-batch t
    X_t = X[:, t*batch_size:(t+1)*batch_size]
    Y_t = Y[:, t*batch_size:(t+1)*batch_size]
    # Forward propagation
    a, caches = forward_propagation(X_t, parameters)
    # Compute cost
    cost += compute_cost(a, Y_t)
    # Backward propagation
    grads = backward_propagation(a, caches, parameters)
    # Update parameters
    parameters = update_parameters(parameters, grads)

How to build mini-batches?

We need 2 steps:
1. Shuffle: shuffle the columns (training examples) of $X$ and $Y$ with the same permutation. The shuffling step ensures that examples will be split randomly into different mini-batches.
2. Partition: choose a batch size and cut the shuffled data into mini-batches. Note that the last batch may be smaller than the others.
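The two steps above can be sketched as follows. The function name `random_mini_batches`, the `seed` argument and the toy data are assumptions; examples are stored as columns, as in the rest of this note.

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle the columns of X and Y together, then cut them into mini-batches."""
    m = X.shape[1]
    permutation = np.random.default_rng(seed).permutation(m)
    # Step 1 (shuffle): the same permutation keeps every example paired with its label
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]
    # Step 2 (partition): the last slice is shorter when m is not a multiple of the size
    return [
        (X_shuffled[:, k:k + mini_batch_size], Y_shuffled[:, k:k + mini_batch_size])
        for k in range(0, m, mini_batch_size)
    ]

X = np.arange(20).reshape(2, 10)  # 10 examples with 2 features each
Y = np.arange(10).reshape(1, 10)  # label i belongs to column i of X
batches = random_mini_batches(X, Y, mini_batch_size=4)
print([X_batch.shape[1] for X_batch, _ in batches])
```

With 10 examples and a batch size of 4, the last mini-batch holds only the 2 remaining examples.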

Type of mini-batch

There are 3 types based on the size of batches:
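1. Batch Gradient Descent (mini-batch size $= m$): the entire training set is one batch, i.e. $(X^{\{1\}}, Y^{\{1\}}) = (X, Y)$.
2. Stochastic Gradient Descent (mini-batch size $= 1$): every training example is its own mini-batch ($m$ mini-batches).
3. Mini-batch Gradient Descent ($1 <$ mini-batch size $< m$).
Guideline:
• If the training set is small ($m \le 2000$): use batch gradient descent.
• Typical mini-batch sizes: $64, 128, 256, 512$ (powers of 2).
• Make sure a mini-batch fits in CPU/GPU memory!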

Exponentially weighted averages

• It's faster than Gradient Descent!
• Example (temperature in London):
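• $\theta_t$: the temperature on day $t$.
• $v_t = \beta v_{t-1} + (1-\beta)\theta_t$: the running average on day $t$. It is approximately an exponentially weighted average over the last $\frac{1}{1-\beta}$ days of temperature.
• E.g. $\beta = 0.9 \Rightarrow$ about 10 days of temperature; $\beta = 0.98 \Rightarrow$ about 50 days of temperature.
• $\beta$ larger ⇒ smoother average line because we consider more days. However, the curve is now shifted further to the right.
• When $\beta$ is very large ⇒ $v_t$ adapts slowly to the changes of temperature (more latency).
• Why do we call it "exponentially"? Unrolling the recursion with $v_0 = 0$ gives $v_t = (1-\beta)\sum_{k=0}^{t-1}\beta^{k}\theta_{t-k}$, so the weight of a past value decays exponentially with its age.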
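A short sketch of the exponentially weighted average, assuming the usual recursion $v_t = \beta v_{t-1} + (1-\beta)\theta_t$ with $v_0 = 0$. The function name and the constant "temperature" series are made up for illustration.

```python
import numpy as np

def exponentially_weighted_average(theta, beta):
    """Return all v_t with v_t = beta * v_{t-1} + (1 - beta) * theta_t and v_0 = 0."""
    v, averages = 0.0, []
    for theta_t in theta:
        v = beta * v + (1 - beta) * theta_t
        averages.append(v)
    return np.array(averages)

theta = np.full(200, 10.0)  # hypothetical temperatures: 10 degrees every day
v_fast = exponentially_weighted_average(theta, beta=0.9)   # roughly a 10-day average
v_slow = exponentially_weighted_average(theta, beta=0.98)  # roughly a 50-day average
print(v_fast[49], v_slow[49])
```

On day 50 the average with $\beta = 0.9$ has already reached the true value of 10, while the one with $\beta = 0.98$ is still catching up, which illustrates the latency of a large $\beta$.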

Bias correction

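• Problem: the value of $v_t$ at the beginning of the exponentially weighted average curve may be lower than what we expect. For example, with $v_0 = 0$, we have $v_1 = (1-\beta)\theta_1$ instead of $v_1 = \theta_1$.
• Solution: Instead of using $v_t$, we take $\frac{v_t}{1-\beta^t}$.
• When $t$ is large ⇒ $\beta^t \approx 0 \Rightarrow \frac{v_t}{1-\beta^t} \approx v_t$.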
• In practice, we don't really see people bothering with bias correction!
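The effect of bias correction is easy to see on a constant series. This is a sketch under the assumption that the corrected value is $\frac{v_t}{1-\beta^t}$ with $v_0 = 0$; the function name and data are illustrative.

```python
import numpy as np

def ewa_bias_corrected(theta, beta):
    """Exponentially weighted average with bias correction v_t / (1 - beta**t)."""
    v, corrected = 0.0, []
    for t, theta_t in enumerate(theta, start=1):
        v = beta * v + (1 - beta) * theta_t
        corrected.append(v / (1 - beta ** t))
    return np.array(corrected)

theta = np.full(5, 10.0)  # hypothetical constant series
v_corrected = ewa_bias_corrected(theta, beta=0.98)
print(v_corrected)
```

Without the correction, the first value would be $0.02 \times 10 = 0.2$. With it, every value equals 10 from the first day on.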

Gradient descent with momentum

• It's faster than Gradient Descent!
• Why: with mini-batches the updates oscillate; momentum helps us reduce these oscillations.
• Idea: Momentum takes into account the past gradients to smooth out the update. We will store the 'direction' of the previous gradients in the variable $v$. Formally, this will be the exponentially weighted average of the gradient on previous steps.
• Intuition: You can also think of $v$ as the "velocity" of a ball rolling downhill, building up speed (and momentum) according to the direction of the gradient/slope of the hill.
• $dW, db$: like "acceleration".
• $v_{dW}, v_{db}$: like "velocity".
• $\beta$: like "friction".
• Algorithm: on iteration $t$:
1. Compute $dW, db$ on the current mini-batch.
2. $v_{dW} = \beta v_{dW} + (1-\beta)\,dW$.
3. $v_{db} = \beta v_{db} + (1-\beta)\,db$.
4. $W := W - \alpha v_{dW}$.
5. $b := b - \alpha v_{db}$.
• Implementation:
• Try to tune $\beta$ in $[0.8, 0.999]$; $\beta = 0.9$ is commonly used.
• Don't bother with bias correction, NO NEED.
• We don't need the factor $(1-\beta)$ in the formulas, but Andrew prefers to keep it!
• Bigger $\beta$ ⇒ smaller oscillations in the vertical direction.
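A minimal sketch of the momentum update for a single parameter array, assuming the form $v_{dW} = \beta v_{dW} + (1-\beta)\,dW$, $W := W - \alpha v_{dW}$. The function name `update_with_momentum` and the toy quadratic cost are assumptions for illustration.

```python
import numpy as np

def update_with_momentum(W, dW, v_dW, beta=0.9, learning_rate=0.01):
    """One momentum step: update the velocity, then move W along it."""
    v_dW = beta * v_dW + (1 - beta) * dW
    W = W - learning_rate * v_dW
    return W, v_dW

# Toy problem: minimize 0.5 * ||W||^2, whose gradient is W itself
W0 = np.array([1.0, -2.0])
W_first, v_first = update_with_momentum(W0, dW=W0, v_dW=np.zeros_like(W0), learning_rate=0.1)

W, v_dW = W0.copy(), np.zeros_like(W0)
for _ in range(500):
    W, v_dW = update_with_momentum(W, dW=W, v_dW=v_dW, learning_rate=0.1)
print(W_first, W)
```

The velocity starts at zero, so the first step is small. After a few hundred steps on this toy cost, `W` ends up very close to the minimum at the origin.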

RMSprop

• It's "Root Mean Square propagation".
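• Algorithm: on iteration $t$:
1. Compute $dW, db$ on the current mini-batch (the squares below are element-wise).
2. $s_{dW} = \beta s_{dW} + (1-\beta)\,dW^2$.
3. $s_{db} = \beta s_{db} + (1-\beta)\,db^2$.
4. $W := W - \alpha \frac{dW}{\sqrt{s_{dW}} + \epsilon}$.
5. $b := b - \alpha \frac{db}{\sqrt{s_{db}} + \epsilon}$.
• We choose $\epsilon = 10^{-8}$ to avoid dividing by a value close to zero when $\sqrt{s_{dW}}$ is too small; otherwise $\epsilon = 0$ would do.
• In practice: $dW, db$ are very high dimensional vectors.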
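The RMSprop step can be sketched for one parameter array as below. The function name `update_with_rmsprop`, the default values and the toy gradient are assumptions for illustration.

```python
import numpy as np

def update_with_rmsprop(W, dW, s_dW, beta=0.9, learning_rate=0.01, epsilon=1e-8):
    """One RMSprop step: divide the gradient by the root of its running mean square."""
    s_dW = beta * s_dW + (1 - beta) * np.square(dW)           # element-wise square
    W = W - learning_rate * dW / (np.sqrt(s_dW) + epsilon)    # epsilon avoids division by zero
    return W, s_dW

W = np.array([1.0, -2.0])
dW = np.array([0.001, 100.0])  # hypothetical gradient: tiny in one direction, huge in the other
W_new, s_dW = update_with_rmsprop(W, dW, s_dW=np.zeros_like(W))
step = W - W_new
print(step)
```

Although the two gradient components differ by five orders of magnitude, both steps have nearly the same size. This is how RMSprop damps the directions that oscillate strongly and speeds up the flat ones.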

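Adam

• One of the most effective optimization algorithms for training NNs. It's commonly used and has proven to be very effective for many different NNs across a very wide variety of architectures.
• Adam = Momentum + RMSprop.
• Implementation: on iteration $t$:
1. Compute $dW, db$ using the current mini-batch.
2. (Momentum) $v_{dW} = \beta_1 v_{dW} + (1-\beta_1)\,dW$; $v_{db} = \beta_1 v_{db} + (1-\beta_1)\,db$.
3. (RMSprop) $s_{dW} = \beta_2 s_{dW} + (1-\beta_2)\,dW^2$; $s_{db} = \beta_2 s_{db} + (1-\beta_2)\,db^2$.
4. (Bias correction, for both $dW$ and $db$) $v^{corrected} = \frac{v}{1-\beta_1^t}$; $s^{corrected} = \frac{s}{1-\beta_2^t}$.
5. $W := W - \alpha \frac{v^{corrected}_{dW}}{\sqrt{s^{corrected}_{dW}} + \epsilon}$; $b := b - \alpha \frac{v^{corrected}_{db}}{\sqrt{s^{corrected}_{db}} + \epsilon}$.
• Initialization of the velocity is zero, i.e. $v_{dW} = 0,\ v_{db} = 0$.
• In the momentum update, if $\beta = 0$, it's standard gradient descent without momentum.
• Hyperparameter choices:
• $\alpha$: needs to be tuned, very important!
• $\beta_1$ ($0.9$), first moment.
• $\beta_2$ ($0.999$), second moment.
• $\epsilon = 10^{-8}$.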
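Putting momentum, RMSprop and bias correction together gives the following sketch of one Adam step for a single parameter array. The function name `update_with_adam` and the toy values are assumptions; the default hyperparameters are the commonly quoted ones ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$).

```python
import numpy as np

def update_with_adam(W, dW, v_dW, s_dW, t, learning_rate=0.001,
                     beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step at iteration t (t starts at 1)."""
    v_dW = beta1 * v_dW + (1 - beta1) * dW             # momentum part (first moment)
    s_dW = beta2 * s_dW + (1 - beta2) * np.square(dW)  # RMSprop part (second moment)
    v_corrected = v_dW / (1 - beta1 ** t)              # bias correction
    s_corrected = s_dW / (1 - beta2 ** t)
    W = W - learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return W, v_dW, s_dW

W = np.array([1.0, -2.0])
v_dW, s_dW = np.zeros_like(W), np.zeros_like(W)
# Toy gradient: for the cost 0.5 * ||W||^2 the gradient is W itself
W_first, v_dW, s_dW = update_with_adam(W, W, v_dW, s_dW, t=1)
print(W_first)
```

On the first iteration the bias-corrected ratio is almost exactly the sign of the gradient, so every coordinate moves by about `learning_rate`, whatever the gradient's magnitude.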

Learning rate decay

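• Idea: slowly reduce the learning rate over time; this is called learning rate decay.
• Why? The figure below shows that we need a slower rate (smaller steps) in the area near the center.
• Recall that 1 epoch = 1 pass through the data.
• Learning rate decay can be chosen as one of the following:
• $\alpha = \frac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}}\,\alpha_0$
• $\alpha = 0.95^{\text{epoch\_num}}\,\alpha_0$ (exponential decay)
• $\alpha = \frac{k}{\sqrt{\text{epoch\_num}}}\,\alpha_0$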
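As an illustration, one common schedule is $\alpha = \frac{\alpha_0}{1 + \text{decay\_rate} \times \text{epoch\_num}}$. The sketch below assumes that schedule; the function name `decayed_learning_rate` and the values $\alpha_0 = 0.2$, decay rate $1$ are example choices.

```python
def decayed_learning_rate(alpha0, epoch_num, decay_rate):
    """Inverse-time decay: alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

# Learning rate for epochs 1 to 4, starting from alpha0 = 0.2
alphas = [decayed_learning_rate(0.2, epoch, decay_rate=1.0) for epoch in range(1, 5)]
print(alphas)
```

The rate goes from 0.1 at epoch 1 down to 0.04 at epoch 4, so the steps get smaller as training approaches the minimum.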

Problem of local optima

• In high dimensions, you are more likely to see saddle points than local optima.
• Problem of plateau: a region where the derivative is close to zero for a long time.
• You are unlikely to get stuck in a bad local optimum.
• Plateau can make learning slow: use Momentum, RMSprop, Adam.

Batch GD makes learning too long?

• Try better random initialization for weights.
• Try mini-batch GD.
• Try tuning the learning rate $\alpha$.

Hyperparameter tuning

Tuning process

• There are many hyperparameters but some are more important than others!
• Learning rate $\alpha$ (most important); #hidden units, momentum $\beta$, mini-batch size (2nd most important); #layers, learning rate decay, ...
• Don't use grid, use random!
• Coarse to fine: find an area containing effective values ⇒ zoom in and take more points in that area.
• Choosing randomly does NOT mean sampling on a uniform scale! We can sample uniformly for #hidden units and #layers, but not for the others (e.g. $\alpha$).
• For $\alpha$, for example, we sample on a log scale: divide the range into equal "large" spaces (decades such as $10^{-4}, 10^{-3}, \ldots, 1$) and then sample uniformly over them.
• Hyperparameters for exponentially weighted averages:
• We cannot sample $\beta$ uniformly in $[0.9, 0.999]$ because,
• $0.9000 \to 0.9005$: not much change,
• $0.999 \to 0.9995$: huge impact (the average goes from about 1000 to about 2000 values)!
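Sampling on a log scale can be sketched like this. The function names and the ranges ($\alpha \in [10^{-4}, 1]$, $\beta \in [0.9, 0.999]$) are assumptions chosen for illustration.

```python
import numpy as np

def sample_learning_rate(rng, low_exp=-4, high_exp=0):
    """Sample alpha = 10**r with r uniform in [low_exp, high_exp)."""
    return 10 ** rng.uniform(low_exp, high_exp)

def sample_beta(rng, low_exp=-3, high_exp=-1):
    """Sample beta = 1 - 10**r, i.e. 1 - beta is sampled on a log scale."""
    return 1 - 10 ** rng.uniform(low_exp, high_exp)

rng = np.random.default_rng(0)
alphas = np.array([sample_learning_rate(rng) for _ in range(1000)])
betas = np.array([sample_beta(rng) for _ in range(1000)])
print(np.mean(alphas < 1e-2), betas.min(), betas.max())
```

About half of the sampled learning rates fall below $10^{-2}$, whereas uniform sampling in $[10^{-4}, 1]$ would put about 99% of them above it. Each decade gets the same share of the search budget.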

In practice: Panda vs Caviar

• How to organize your hyperparameter search?
• 2 ways:
1. Babysitting one model (Panda): when we have huge data but weak CPU/GPU → try a very small number of models at a time and check the performance step by step (is the cost function decreasing? ...).
• This happens in some domains like advertising, computer vision apps, ...
• We call it "panda" because a panda has very few babies at a time (and in its life) → it tries hard to keep each of them alive, one at a time.
2. Training many models in parallel (Caviar): when the data is not huge and we have strong CPU/GPU → try many models in parallel and choose the one with the best performance!
• We call it "Caviar" because fish lay a great many eggs at once and do not look after any single one.

Batch Normalization

• Makes a NN much more robust to the choice of hyperparameters. → It doesn't work for all NNs, but when it does, it makes training faster!
• One of the most important ideas in the rise of Deep Learning.
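• Just as we normalize the input to speed up learning, in this case we want to normalize $z^{[l]}$ (in the hidden layers).
Given some intermediate values in the NN, $z^{(1)}, \ldots, z^{(m)}$:
1. $z^{(i)}_{norm} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$ to get mean $0$ and STD $1$.
2. $\tilde{z}^{(i)} = \gamma z^{(i)}_{norm} + \beta$ to get a different distribution (other mean and variance).
Now, $\gamma, \beta$ are learnable parameters of the model.
• If we choose different $\gamma, \beta$ → hidden units have other means & variances.
• Instead of using $z^{[l](i)}$, we use $\tilde{z}^{[l](i)}$.
• Difference between normalizing the input and normalizing in hidden units:
• $X$: after normalizing, $\mu = 0, \sigma = 1$.
• $Z$: after normalizing, various $\mu, \sigma$.
• We can use gradient descent to update $\beta, \gamma$, and we can also use Adam/RMSprop/Momentum to update these params, not just plain Gradient Descent.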
• In practice, we won't have to implement Batch Norm step by step ourselves; a programming framework (like Tensorflow) will do it!
• In practice, Batch Norm is usually applied with mini-batches of your training set.
• Parameters: $W^{[l]}, \beta^{[l]}, \gamma^{[l]}$. We don't need to consider $b^{[l]}$ because it will be subtracted out in the process of normalization!
• Fitting Batch Norm into a NN: for $t$ going through the mini-batches,
1. Compute forward prop on $X^{\{t\}}$.
2. In each hidden layer, use Batch Norm to replace $z^{[l]}$ with $\tilde{z}^{[l]}$.
3. Use backprop to compute $dW^{[l]}, d\beta^{[l]}, d\gamma^{[l]}$.
4. Update params (we can use Momentum / RMSprop / Adam): $W^{[l]} := W^{[l]} - \alpha\,dW^{[l]}$, $\beta^{[l]} := \beta^{[l]} - \alpha\,d\beta^{[l]}$, $\gamma^{[l]} := \gamma^{[l]} - \alpha\,d\gamma^{[l]}$.
• Sometimes, BN has a 2nd effect as a regularization technique, but it's unintended! We don't use it for the purpose of regularization; use L1, L2 or dropout instead.
(Recall) Regularization: techniques that lower the complexity of a NN during training and thus prevent overfitting.
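The Batch Norm forward step for one layer and one mini-batch can be sketched as follows. The function name `batchnorm_forward`, the shapes (units as rows, examples as columns) and the toy values of `gamma` and `beta` are assumptions for illustration.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, epsilon=1e-8):
    """Normalize each hidden unit over the mini-batch, then scale and shift.

    Z has shape (n_units, m); gamma and beta have shape (n_units, 1).
    """
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + epsilon)  # mean 0, variance 1 per unit
    Z_tilde = gamma * Z_norm + beta             # learnable mean and scale
    return Z_tilde, mu, var

rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=3.0, size=(4, 256))  # toy pre-activations
gamma = np.full((4, 1), 2.0)
beta = np.full((4, 1), -1.0)
Z_tilde, mu, var = batchnorm_forward(Z, gamma, beta)

# Adding a constant bias to Z changes nothing: it is removed with the mean
Z_tilde_shifted, _, _ = batchnorm_forward(Z + 7.0, gamma, beta)
print(Z_tilde.mean(axis=1), Z_tilde.std(axis=1))
```

Each unit ends up with mean `beta` and standard deviation `gamma`, whatever the statistics of `Z` were. The last two lines show why the bias of the layer can be dropped when Batch Norm is used.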

Why BN works?

• Makes the weights in later / deeper layers more robust to changes in the weights of the earlier layers.
• Covariate shift problem: suppose we have learned a mapping $X \to Y$. If $X$'s distribution changes, the result can change a lot, and we have to re-train our model.
• Example: "cat vs non-cat" problem. If we apply params from the model of "black cat vs non-cat" to the problem of "colored-cat vs non-cat", it won't work because the distribution in "black cat" is different from "colored cat".
From the perspective of layer 3, it depends only on layer 2 → if the layers before layer 2 change → the distribution of layer 2 changes → covariate shift problem for layer 3 → Batch Norm makes sure that the mean and variance of layer 2 stay stable before going to layer 3!

Batch Norm in test time

• BN processes our data one mini-batch at a time. However, at test time, you may need to process the examples one at a time. → Need to adapt your network to do that.
• Idea: calculate $\mu, \sigma^2$ using an exponentially weighted average (across mini-batches). In other words,
• During training, we also calculate (and store) $\mu^{\{t\}[l]}, \sigma^{2\{t\}[l]}$ for each mini-batch.
• Find $\mu, \sigma^2$ (exponentially weighted averages) over all mini-batches.
• Use these to find $z_{norm}$ and $\tilde{z}$ (for each test example).
• Don't worry, it's easy to use with Deep Learning Frameworks.
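To see what a framework does behind the scenes, here is a sketch of running averages kept during training and reused at test time. The function names, the `momentum` parameter of the running average and the simulated mini-batches are all assumptions for illustration.

```python
import numpy as np

def update_running_stats(running_mu, running_var, mu_batch, var_batch, momentum=0.9):
    """Exponentially weighted averages of the mini-batch means and variances."""
    running_mu = momentum * running_mu + (1 - momentum) * mu_batch
    running_var = momentum * running_var + (1 - momentum) * var_batch
    return running_mu, running_var

def batchnorm_inference(z, running_mu, running_var, gamma, beta, epsilon=1e-8):
    """Normalize test examples with the stored statistics instead of batch statistics."""
    z_norm = (z - running_mu) / np.sqrt(running_var + epsilon)
    return gamma * z_norm + beta

rng = np.random.default_rng(0)
running_mu, running_var = np.zeros((3, 1)), np.ones((3, 1))
for _ in range(200):  # simulated training mini-batches with mean 5 and STD 2
    Z = rng.normal(loc=5.0, scale=2.0, size=(3, 64))
    running_mu, running_var = update_running_stats(
        running_mu, running_var,
        Z.mean(axis=1, keepdims=True), Z.var(axis=1, keepdims=True))

z_test = np.full((3, 1), 5.0)  # a single test example
out = batchnorm_inference(z_test, running_mu, running_var,
                          gamma=np.ones((3, 1)), beta=np.zeros((3, 1)))
print(running_mu.ravel(), running_var.ravel(), out.ravel())
```

The stored mean and variance settle near 5 and 4, so a single test example can be normalized without needing a mini-batch around it.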

Tensorflow introduction

Writing and running programs in TensorFlow has the following steps:
1. Create Tensors (variables) that are not yet executed/evaluated.
2. Write operations between those Tensors.
3. Create a Session.
4. Run the Session. This will run the operations you'd written above.
# create placeholders (TensorFlow 1.x API)
x = tf.placeholder(tf.int64, name='x')
X = tf.placeholder(tf.float32, [n_x, None], name='X')

# initialize
W1 = tf.get_variable("W1", [25, 12288], initializer=tf.contrib.layers.xavier_initializer(seed=1))
b1 = tf.get_variable("b1", [25, 1], initializer=tf.zeros_initializer())
There are two typical ways to create and use sessions in tensorflow:
1. Method 1:
sess = tf.Session()
# Run the variables initialization (if needed), run the operations
result = sess.run(..., feed_dict = {...})
sess.close() # Close the session
2. Method 2:
with tf.Session() as sess:
    # run the variables initialization (if needed), run the operations
    result = sess.run(..., feed_dict = {...})
    # This takes care of closing the session for you :)
What you should remember:
• Tensorflow is a programming framework used in deep learning
• The two main object classes in tensorflow are Tensors and Operators.
• When you code in tensorflow you have to take the following steps:
• Create a graph containing Tensors (Variables, Placeholders ...) and Operations (tf.matmul, tf.add, ...)
• Create a session
• Initialize the session
• Run the session to execute the graph
• You can execute the graph multiple times as you've seen in model()
• The backpropagation and optimization are automatically done when running the session on the "optimizer" object.
👉 Check more details about the code in the notebook.