Last modified on 27 Jul 2020.

This is my note for the course (Neural Networks and Deep Learning). The code in this note has been rewritten to be clearer and more concise.

🎯 Overview of all 5 courses.

👉 Course 2 – Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization.

If you want to break into cutting-edge AI, this course will help you do so.

Activation functions

See the comparison of activation functions on Wikipedia.

Why non-linear activation functions in NN Model?

Suppose $g(z)=z$ (linear). Then

$$
\begin{aligned}
a^{[1]} &= g(z^{[1]}) = z^{[1]} = w^{[1]}x + b^{[1]} \quad \text{(linear)} \\
a^{[2]} &= g(z^{[2]}) = z^{[2]} = w^{[2]}a^{[1]} + b^{[2]} \\
&= (w^{[2]}w^{[1]})x + (w^{[2]}b^{[1]} + b^{[2]}) \quad \text{(linear again)}.
\end{aligned}
$$

With linear activations the hidden layer adds nothing: the whole network is still a linear function of $x$, so it behaves as if there were no hidden layer at all. With a sigmoid output, the model is just Logistic Regression with no hidden units! Use non-linear activations for the hidden layers!
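The collapse can be checked numerically. The sketch below stacks two layers with the identity activation and compares them with one linear layer built from the combined weights; the sizes, seed, and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 input features, 4 hidden units, 1 output, 5 examples.
x = rng.standard_normal((3, 5))
w1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
w2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

# Two layers with the linear activation g(z) = z.
a1 = w1 @ x + b1
a2 = w2 @ a1 + b2

# One linear layer with the combined parameters (w2 w1) and (w2 b1 + b2).
w_combined = w2 @ w1
b_combined = w2 @ b1 + b2
a_single = w_combined @ x + b_combined

print(np.allclose(a2, a_single))
```

Both computations agree up to floating-point error, so the extra layer buys no extra expressive power.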

Sigmoid function

  • Usually used in the output layer for binary classification.
  • Don't use sigmoid in the hidden layers!

$$
\begin{aligned}
\sigma(z) &= \dfrac{1}{1+e^{-z}} \\
\sigma(z) &\xrightarrow{z\to \infty} 1 \\
\sigma(z) &\xrightarrow{z\to -\infty} 0 \\
\sigma'(z) &= \sigma(z) (1 - \sigma(z))
\end{aligned}
$$

Figure: sigmoid function graph (from Wikipedia).

import numpy as np

def sigmoid(z):
    # element-wise logistic function
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1 - s)

Softmax function

The output of the softmax function can be used to represent a categorical distribution – that is, a probability distribution over $K$ different possible outcomes.

Figure: softmax function (Udacity Deep Learning slide on Softmax).

$$
\sigma(\mathbf{z})_{i} = \frac{e^{z_{i}}}{\sum_{j=1}^{K} e^{z_{j}}} \quad \text{for } i = 1, \dotsc, K \text{ and } \mathbf{z} \in \mathbb{R}^{K}
$$

def softmax(z):
    # z has shape (num_examples, K): each row holds the K scores of one example
    z_exp = np.exp(z)
    z_sum = np.sum(z_exp, axis=1, keepdims=True)
    return z_exp / z_sum
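For large scores, np.exp can overflow. A common variant, shown here as an assumption rather than the course's code, subtracts the row-wise maximum before exponentiating; the result is unchanged because the common factor cancels in the ratio. The name `softmax_stable` is hypothetical, and rows are assumed to be examples (matching `axis=1` above).

```python
import numpy as np

def softmax_stable(z):
    # z has shape (num_examples, K); each row is one example's scores.
    shifted = z - np.max(z, axis=1, keepdims=True)  # largest entry per row becomes 0
    z_exp = np.exp(shifted)
    return z_exp / np.sum(z_exp, axis=1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])  # np.exp(1000.0) alone would overflow
probs = softmax_stable(scores)

print(probs)
print(probs.sum(axis=1))  # each row sums to 1
```

Each row of `probs` is a categorical distribution over the K = 3 outcomes.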

tanh function (Hyperbolic tangent)

  • tanh usually works better than sigmoid in hidden layers because its outputs have a mean close to $0$, which centers the data better for the next layer.
  • Don't use sigmoid on hidden units. Keep it for the output layer: when $0 \le \hat{y} \le 1$ is required (binary classification), sigmoid is a better fit than tanh.

$$
\sigma(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}
$$

def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

Figure: graph of tanh (from analyticsindiamag).
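One practical caveat: the direct formula overflows for inputs of large magnitude, while NumPy's built-in np.tanh saturates cleanly. The sketch below compares the two; `tanh_naive` is a hypothetical name for the direct translation of the formula.

```python
import numpy as np

def tanh_naive(z):
    # Direct translation of the formula; np.exp overflows for large |z|.
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

z = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])

# Silence the overflow/invalid warnings so the comparison prints cleanly.
with np.errstate(over="ignore", invalid="ignore"):
    naive = tanh_naive(z)   # inf / inf produces nan at both extremes

builtin = np.tanh(z)        # stays finite, saturating at -1 and 1

print(naive)
print(builtin)
```

For moderate inputs both versions agree, so the direct formula is fine for study purposes; np.tanh is the safer choice inside a training loop.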


ReLU

  • ReLU (Rectified Linear Unit).
  • Its derivative is 1 for every positive input, so it stays far from 0 where sigmoid/tanh saturate $\to$ learns faster!
  • If you aren't sure which activation to use, use ReLU!
  • Weakness: the derivative is 0 on the negative side. Leaky ReLU addresses this with a small negative-side slope. However, Leaky ReLU isn't used much in practice!

$$
\sigma(z) = \max(0,z)
$$

def relu(z):
    return np.maximum(0, z)

Figure: ReLU (left) and Leaky ReLU (right).
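A minimal sketch of Leaky ReLU and of both derivatives follows. The slope `alpha=0.01` is a commonly used value assumed here, and the function names are illustrative; the derivative at exactly $z = 0$ is set by convention.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # alpha is the (assumed) small slope on the negative side.
    return np.where(z > 0, z, alpha * z)

def relu_derivative(z):
    # 1 for z > 0, 0 otherwise (the value at z = 0 is a convention).
    return (z > 0).astype(float)

def leaky_relu_derivative(z, alpha=0.01):
    # 1 for z > 0, alpha otherwise, so the gradient never vanishes completely.
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(z))
print(relu_derivative(z))
print(leaky_relu_derivative(z))
```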

Logistic Regression

  • Usually used for binary classification (there are only 2 possible outputs). For multiclass classification, we can use one-vs-all (combine multiple logistic regression classifiers, one per class).
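A minimal sketch of the one-vs-all prediction step, assuming one logistic regression classifier per class has already been trained. The parameters below are hand-picked for illustration, not learned, and `predict_one_vs_all` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict_one_vs_all(W, b, X):
    # W: (K, n_x), row k holds the weights of the "class k vs the rest" classifier.
    # b: (K, 1); X: (n_x, m) with one example per column.
    scores = sigmoid(np.dot(W, X) + b)   # (K, m)
    return np.argmax(scores, axis=0)     # most confident classifier per example

# Hand-picked (not trained) parameters for K = 3 classes and n_x = 2 features.
W = np.array([[ 2.0,  0.0],
              [ 0.0,  2.0],
              [-2.0, -2.0]])
b = np.zeros((3, 1))
X = np.array([[3.0, 0.0, -1.0],
              [0.0, 3.0, -1.0]])

print(predict_one_vs_all(W, b, X))
```

Each example is assigned to the class whose classifier reports the highest probability.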

Gradient Descent

Gradient Descent is an algorithm for minimizing the cost function $J$. Each iteration has 2 steps: Forward Propagation (from $X$, compute the cost $J$) and Backward Propagation (compute the derivatives and update the parameters $w, b$).

Initialize $w, b$ and then repeat until convergence ($m$: number of training examples, $\alpha$: learning rate, $J$: cost function, $A$: activations, i.e. the output of the activation function):

  1. $A = \sigma(w^TX + b)$
  2. $J(w,b) = -\frac{1}{m} \left( Y \log A^T + (1-Y)\log(1-A^T) \right)$
  3. $\partial_{w}J = \frac{1}{m}X(A-Y)^T$
  4. $\partial_{b}J = \frac{1}{m} \sum (A-Y)$
  5. $w := w - \alpha \partial_{w}J$
  6. $b := b - \alpha \partial_{b}J$

The dimensions of the variables: $X\in \mathbb{R}^{n_x \times m}$, $Y\in \mathbb{R}^{1\times m}$, $b\in \mathbb{R}$ (broadcast to $1\times m$), $w\in \mathbb{R}^{n_x \times 1}$, $A\in \mathbb{R}^{1\times m}$, $J\in \mathbb{R}$, $\partial_wJ \in \mathbb{R}^{n_x \times 1}$, $\partial_bJ \in \mathbb{R}$.
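The gradient formulas in steps 3 and 4 can be sanity-checked against central finite differences of the cost in step 2. This is only a sketch on random toy data; `cost_and_grads`, the sizes, and the seed are assumptions made for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_and_grads(w, b, X, Y):
    # Steps 1-4: forward pass, cost, and analytic gradients.
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)                       # (1, m)
    J = -1/m * (np.dot(Y, np.log(A).T) + np.dot(1 - Y, np.log(1 - A).T))
    dw = 1/m * np.dot(X, (A - Y).T)                       # (n_x, 1)
    db = 1/m * np.sum(A - Y)                              # scalar
    return J.item(), dw, db

rng = np.random.default_rng(1)
n_x, m = 3, 20                                  # arbitrary toy sizes
X = rng.standard_normal((n_x, m))
Y = (rng.random((1, m)) > 0.5).astype(float)
w = rng.standard_normal((n_x, 1)) * 0.1
b = 0.2

J, dw, db = cost_and_grads(w, b, X, Y)

# Central finite differences, one parameter at a time.
eps = 1e-6
dw_num = np.zeros_like(w)
for k in range(n_x):
    step = np.zeros_like(w)
    step[k] = eps
    dw_num[k] = (cost_and_grads(w + step, b, X, Y)[0]
                 - cost_and_grads(w - step, b, X, Y)[0]) / (2 * eps)
db_num = (cost_and_grads(w, b + eps, X, Y)[0]
          - cost_and_grads(w, b - eps, X, Y)[0]) / (2 * eps)

print(np.max(np.abs(dw - dw_num)), abs(db - db_num))
```

Both differences should be tiny; a large gap would point to a sign or transpose error in the analytic gradient.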


def logistic_regression_model(X_train, Y_train, X_test, Y_test,
                              num_iterations=2000, learning_rate=0.5):
    m = X_train.shape[1]  # number of training examples

    # INITIALIZE w, b
    w = np.zeros((X_train.shape[0], 1))
    b = 0

    for i in range(num_iterations):
        # FORWARD PROPAGATION (from x to cost)
        A = sigmoid(np.dot(w.T, X_train) + b)
        cost = -1/m * (np.dot(Y_train, np.log(A.T))
                       + np.dot(1 - Y_train, np.log(1 - A.T)))
        cost = np.squeeze(cost)

        # BACKWARD PROPAGATION (find grad)
        dw = 1/m * np.dot(X_train, (A - Y_train).T)
        db = 1/m * np.sum(A - Y_train)

        # OPTIMIZE
        w = w - learning_rate * dw
        b = b - learning_rate * db

    # PREDICT (with optimized w, b)
    A_test = sigmoid(np.dot(w.T, X_test) + b)
    Y_pred_test = (A_test > 0.5).astype(int)

    return w, b, Y_pred_test

Neural Network overview


  • $X^{(i)}$ : $i$th training example.
  • $m$ : number of examples.
  • $L$ : number of layers.
  • $n^{[0]} = n_X$ : number of features (# nodes in the input).
  • $n^{[L]}$ : number of nodes in the output layer.
  • $n^{[l]}$ : number of nodes in the hidden layers.
  • $w^{[l]}$ : weights for $z^{[l]}$.
  • $a^{[0]} = X$ : activation in the input layer.
  • $a^{[2]}_i$ : activation in layer $2$, node $i$.
  • $a^{[2](i)}$ : activation in layer $2$, example $i$.
  • $a^{[L]} = \hat{y}$.


  • $A^{[0]} = X \in \mathbb{R}^{n^{[0]} \times m}$.
  • $Z^{[l]}, A^{[l]} \in \mathbb{R}^{n^{[l]}\times m}$.
  • $dZ^{[l]}, dA^{[l]} \in \mathbb{R}^{n^{[l]}\times m}$.
  • $dW^{[l]}, W^{[l]} \in \mathbb{R}^{n^{[l]} \times n^{[l-1]}}$.
  • $db^{[l]}, b^{[l]} \in \mathbb{R}^{n^{[l]} \times 1}$.
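These dimensions are easy to verify in code. The sketch below walks a forward pass through a hypothetical network and asserts the shapes at every layer; `layer_dims` and `m` are made-up values, and ReLU stands in for any element-wise activation.

```python
import numpy as np

layer_dims = [5, 4, 3, 1]   # hypothetical n^[0..L]: 5 features, two hidden layers, 1 output
m = 7                       # hypothetical number of examples
L = len(layer_dims) - 1     # the input layer is not counted

rng = np.random.default_rng(0)
A = rng.standard_normal((layer_dims[0], m))   # A^[0] = X
for l in range(1, L + 1):
    W = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
    b = np.zeros((layer_dims[l], 1))
    Z = W @ A + b           # b is broadcast across the m columns
    assert W.shape == (layer_dims[l], layer_dims[l - 1])
    assert Z.shape == (layer_dims[l], m)
    A = np.maximum(0, Z)    # an element-wise activation keeps the shape
    print(l, W.shape, b.shape, Z.shape)
```

Asserting shapes like this while writing a network catches most transpose mistakes early.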

L-layer deep neural network

Figure: L-layer deep neural network. Image from the course.

  1. Initialize parameters / Define hyperparameters
  2. Loop for num_iterations:
    1. Forward propagation
    2. Compute cost function
    3. Backward propagation
    4. Update parameters (using parameters, and grads from backprop)
  3. Use trained parameters to predict labels.

Initialize parameters

  • In Logistic Regression, we initialize $w, b$ with $0$ (it's OK because LR doesn't have hidden layers), but we can't do that in the NN model!
  • If we use $0$, we run into the symmetry problem: no matter how long you train your NN, all hidden units compute exactly the same function $\Rightarrow$ there is no point in having more than 1 hidden unit!
  • Instead, we initialize $W$ with small random values and keep $0$ for $b$.
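The symmetry problem can be observed on a tiny 2-layer network. In this sketch (tanh hidden layer, sigmoid output, toy data, all names hypothetical), zero-initialized weights leave every hidden unit with identical weights after training, while small random weights let the units diverge.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_two_layer(W1, W2, X, Y, learning_rate=0.5, num_iterations=200):
    # 2-layer network: tanh hidden layer, sigmoid output; biases start at 0.
    m = X.shape[1]
    b1 = np.zeros((W1.shape[0], 1))
    b2 = np.zeros((1, 1))
    for _ in range(num_iterations):
        # forward propagation
        A1 = np.tanh(W1 @ X + b1)
        A2 = sigmoid(W2 @ A1 + b2)
        # backward propagation
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
        dW1 = dZ1 @ X.T / m
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m
        # update
        W1 = W1 - learning_rate * dW1
        b1 = b1 - learning_rate * db1
        W2 = W2 - learning_rate * dW2
        b2 = b2 - learning_rate * db2
    return W1, W2

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 50))
Y = (X[0:1] * X[1:2] > 0).astype(float)   # toy labels

n_h = 3   # number of hidden units
W1_zero, _ = train_two_layer(np.zeros((n_h, 2)), np.zeros((1, n_h)), X, Y)
W1_rand, _ = train_two_layer(rng.standard_normal((n_h, 2)) * 0.01,
                             rng.standard_normal((1, n_h)) * 0.01, X, Y)

print(W1_zero)   # every row (hidden unit) is identical
print(W1_rand)   # rows differ
```

With tanh and all-zero weights the hidden activations are exactly 0, so the weight gradients vanish and only the output bias is ever updated.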

Forward & Backward Propagation

Figure: blocks of forward and backward propagation in a deep NN. Unknown source.

Figure: blocks of forward and backward propagation in a deep NN. Image from the course.

Forward Propagation: Loop through number of layers

  1. $A^{[0]} = X$
  2. $Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]}$ (linear)
  3. $A^{[l]} = \sigma^{[l]}(Z^{[l]})$ (for $l=1 \ldots L-1$, non-linear activations)
  4. $A^{[L]} = \sigma^{[L]}(Z^{[L]})$ (sigmoid function)

Cost function: $J(w,b) = -\frac{1}{m} \left( Y \log A^T + (1-Y)\log(1-A^T) \right)$

Backward Propagation: Loop through number of layers

  1. $dA^{[L]} = -\frac{Y}{A^{[L]}} + \frac{1-Y}{1-A^{[L]}}$.
  2. For $l = L \ldots 1$:
    1. $dZ^{[l]} = dA^{[l]} * (\sigma^{[l]})'(Z^{[l]})$ (element-wise product).
    2. $dW^{[l]} = \frac{\partial J}{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} (A^{[l-1]})^T$.
    3. $db^{[l]} = \frac{\partial J}{\partial b^{[l]}} = \frac{1}{m}\sum_{i=1}^{m} dZ^{[l](i)}$.
    4. $dA^{[l-1]} = (W^{[l]})^T dZ^{[l]}$.

Update parameters: loop through number of layers (for l=1…Ll=1\ldots L)

  1. $W^{[l]} = W^{[l]} - \alpha \, dW^{[l]}$.
  2. $b^{[l]} = b^{[l]} - \alpha \, db^{[l]}$.


def L_Layer_NN(X, Y, layers_dims, learning_rate=0.0075,
               num_iterations=3000, print_cost=False):
    costs = []
    m = X.shape[1]             # number of training examples
    L = len(layers_dims) - 1   # number of layers (the input layer is not counted)

    # INITIALIZE W, b (index 0 is unused so that index l matches layer l)
    params = {'W': [None], 'b': [None]}
    for l in range(1, L + 1):
        params['W'].append(np.random.randn(layers_dims[l], layers_dims[l-1]) * 0.01)
        params['b'].append(np.zeros((layers_dims[l], 1)))

    for i in range(0, num_iterations):
        # FORWARD PROPAGATION (Linear -> ReLU x (L-1) -> Linear -> Sigmoid (L))
        A = X
        caches = {'A_prev': [None], 'Z': [None]}
        for l in range(1, L + 1):
            W = params['W'][l]
            b = params['b'][l]
            caches['A_prev'].append(A)   # input of layer l, needed for dW
            Z = np.dot(W, A) + b
            caches['Z'].append(Z)
            if l != L:  # hidden layers
                A = relu(Z)
            else:       # output layer
                A = sigmoid(Z)

        # COST
        cost = -1/m * np.dot(np.log(A), Y.T) - 1/m * np.dot(np.log(1 - A), 1 - Y.T)
        cost = np.squeeze(cost)

        # BACKWARD PROPAGATION (Sigmoid (L) -> Linear -> ReLU x (L-1) -> Linear)
        dA = - (np.divide(Y, A) - np.divide(1 - Y, 1 - A))
        grads = {'dW': [None] * (L + 1), 'db': [None] * (L + 1)}
        for l in reversed(range(1, L + 1)):
            cache_Z = caches['Z'][l]
            if l != L:  # hidden layers (ReLU derivative)
                dZ = np.array(dA, copy=True)
                dZ[cache_Z <= 0] = 0
            else:       # output layer (sigmoid derivative)
                dZ = dA * sigmoid(cache_Z) * (1 - sigmoid(cache_Z))
            cache_A_prev = caches['A_prev'][l]
            grads['dW'][l] = 1/m * np.dot(dZ, cache_A_prev.T)
            grads['db'][l] = 1/m * np.sum(dZ, axis=1, keepdims=True)
            dA = np.dot(params['W'][l].T, dZ)

        # UPDATE PARAMETERS
        for l in range(1, L + 1):
            params['W'][l] = params['W'][l] - learning_rate * grads['dW'][l]
            params['b'][l] = params['b'][l] - learning_rate * grads['db'][l]

        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))
        if i % 100 == 0:
            costs.append(cost)

    return params

Parameters vs Hyperparameters

  • Parameters: $W, b$.
  • Hyperparameters:
    • Learning rate ($\alpha$).
    • Number of iterations (in the gradient descent algorithm) (num_iterations).
    • Number of layers ($L$).
    • Number of nodes in each layer ($n^{[i]}$).
    • Choice of activation functions (their form, not their values).


  • Always vectorize when possible, especially over the training examples!
  • We can't vectorize over the layers; we need an explicit for loop there.
  • Some functions are computed more efficiently by a deep NN (more layers, fewer nodes in each layer) than by a shallow one (fewer layers, more nodes), e.g. the XOR function.
  • The deeper a layer is in the network, the more complex the features it detects!
  • Applied deep learning is a very empirical process! The best values depend heavily on the data, algorithms, hyperparameters, CPU, GPU, …
  • A learning algorithm's success sometimes comes from the data, not from your thousands of lines of code (surprise!!!)
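To make the XOR remark concrete, here is a sketch with hand-set (not learned) weights: a tiny block with two ReLU hidden units computes XOR of two bits, and stacking such blocks in a tree computes the XOR (parity) of $n$ bits with depth $\log_2 n$ and $n - 1$ blocks. The construction and the names `xor_block` and `parity_deep` are assumptions made for illustration, not code from the course.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def xor_block(a, b):
    # Tiny hand-weighted network: 2 inputs -> 2 ReLU hidden units -> 1 linear output.
    h1 = relu(a + b)
    h2 = relu(a + b - 1)
    return h1 - 2 * h2

def parity_deep(bits):
    # Pair up neighbours layer by layer: depth log2(n), n - 1 XOR blocks in total.
    layer = np.asarray(bits, dtype=float)
    while layer.size > 1:
        layer = xor_block(layer[0::2], layer[1::2])
    return int(layer[0])

bits = [1, 0, 1, 1, 0, 0, 1, 0]      # length assumed to be a power of two
print(parity_deep(bits), sum(bits) % 2)
```

The usual depth argument is that a network restricted to a single hidden layer needs far more units to represent the same parity function.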

Application: recognize a cat

This section contains an idea, not a complete task!

Figure: image to vector conversion. Image from the course.

Figure: L-layer deep neural network. Image from the course.

Python tips

○ Reshape quickly from (10,9,9,3) to (9*9*3,10):

X = np.random.rand(10, 9, 9, 3)
X = X.reshape(10,-1).T
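A quick check of why the transpose matters: reshaping directly to (-1, 10) yields the same shape but interleaves pixels from different images. The sketch below compares both against the flattened first image; `X_flat` and `X_wrong` are illustrative names.

```python
import numpy as np

X = np.random.rand(10, 9, 9, 3)       # 10 images of shape (9, 9, 3)

X_flat = X.reshape(10, -1).T          # shape (243, 10): one column per image
X_wrong = X.reshape(-1, 10)           # same shape, but pixels of different images are mixed

print(X_flat.shape, X_wrong.shape)
print(np.array_equal(X_flat[:, 0], X[0].ravel()))    # column 0 is image 0
print(np.array_equal(X_wrong[:, 0], X[0].ravel()))   # column 0 is not image 0
```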

○ Don't use loops, use vectorization!
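A small timing sketch of the difference, comparing an explicit Python loop with np.dot on the same dot product. The array size is arbitrary and the measured times depend on the machine, so only the equality of the results is meaningful here.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(200_000)
b = rng.random(200_000)

start = time.perf_counter()
total = 0.0
for i in range(a.size):               # explicit Python loop
    total += a[i] * b[i]
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
total_vec = np.dot(a, b)              # vectorized
vec_seconds = time.perf_counter() - start

print(np.isclose(total, total_vec))
print(f"loop: {loop_seconds:.4f}s, np.dot: {vec_seconds:.6f}s")
```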
