# Note for course DL 1: NN and DL

Anh-Thi Dinh
If you want to break into cutting-edge AI, this course will help you do so.

## Activation functions

👉 Check "Comparison of activation functions" on Wikipedia.

### Why non-linear activation functions in NN Model?

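Suppose the activation is linear, $g(z) = z$. Then

$$a^{[1]} = z^{[1]} = W^{[1]}x + b^{[1]}, \qquad a^{[2]} = z^{[2]} = W^{[2]}a^{[1]} + b^{[2]} = (W^{[2]}W^{[1]})x + (W^{[2]}b^{[1]} + b^{[2]}),$$

which is again linear in $x$. You might as well not have any hidden layer! With a sigmoid output, your model is just Logistic Regression, with no hidden units. So use non-linear activations for the hidden layers!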
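To see the collapse concretely, here is a minimal NumPy sketch. The layer sizes and random weights are arbitrary choices made for illustration; the point is only that two stacked layers with $g(z) = z$ give the same output as one linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 input features, 4 hidden units, 1 output, 5 examples.
X = rng.normal(size=(3, 5))
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=(1, 1))

# Two layers with the linear "activation" g(z) = z.
A1 = W1 @ X + b1
A2 = W2 @ A1 + b2

# The same map written as a single linear layer.
W_single = W2 @ W1
b_single = W2 @ b1 + b2
A_single = W_single @ X + b_single

collapsed = np.allclose(A2, A_single)
print(collapsed)
```

Whatever the weights are, the two outputs agree, so the extra layer adds no expressive power.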

### Sigmoid function

• Usually used in the output layer in the binary classification.
• Don't use sigmoid in the hidden layers!

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    return sigmoid(z) * (1 - sigmoid(z))
```
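One commonly cited reason for avoiding sigmoid in hidden layers is saturation: its derivative is at most 0.25 (at $z = 0$) and gets very close to 0 when $|z|$ is large. A quick check, with sample inputs chosen only for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

# Sample points: far negative, moderate, zero, moderate, far positive.
z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
d = sigmoid_derivative(z)
print(d)  # largest value is at z = 0; the two ends are nearly 0
```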

### Softmax function

The output of the softmax function can be used to represent a categorical distribution – that is, a probability distribution over K different possible outcomes.

```python
def softmax(z):
    # each row of z holds the scores of one example
    z_exp = np.exp(z)
    z_sum = np.sum(z_exp, axis=1, keepdims=True)
    return z_exp / z_sum
```
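One practical caveat that the note does not cover: `np.exp` overflows for large scores. A common workaround, sketched here under the same one-row-per-example assumption (the name `softmax_stable` is made up for this example), is to subtract each row's maximum before exponentiating. The result is unchanged because the common factor cancels in the ratio.

```python
import numpy as np

def softmax_stable(z):
    # Subtracting the row-wise max leaves the softmax value unchanged
    # but keeps np.exp from overflowing.
    z_shifted = z - np.max(z, axis=1, keepdims=True)
    z_exp = np.exp(z_shifted)
    return z_exp / np.sum(z_exp, axis=1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])  # second row would overflow a naive np.exp
probs = softmax_stable(scores)
print(probs.sum(axis=1))  # each row sums to 1
```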

### tanh function (Hyperbolic tangent)

• tanh is usually better than sigmoid for hidden units because its outputs have mean $\to$ 0, which centers the data better for the next layer.
• Don't use sigmoid on hidden units. The exception is the output layer: in the case $0 \le \hat{y} \le 1$ (binary classification), sigmoid is better than tanh.

```python
def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))
```
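A small sketch of the "mean $\to$ 0" remark, using NumPy's built-in `np.tanh` and a symmetric set of inputs chosen for illustration. The helper `tanh_derivative` is not part of the original note; it uses the standard identity $\tanh'(z) = 1 - \tanh^2(z)$.

```python
import numpy as np

def tanh_derivative(z):
    # d/dz tanh(z) = 1 - tanh(z)^2
    return 1 - np.tanh(z) ** 2

z = np.linspace(-3, 3, 7)            # inputs symmetric around 0
sigmoid_out = 1 / (1 + np.exp(-z))
tanh_out = np.tanh(z)

print(tanh_out.mean())      # centered around 0
print(sigmoid_out.mean())   # centered around 0.5
```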

### ReLU

• ReLU (Rectified Linear Unit).
• Its derivative stays much further from 0 than that of sigmoid/tanh (it is 1 for every $z > 0$, whereas sigmoid/tanh flatten out for large $|z|$) $\to$ learn faster!
• If you aren't sure which activation to use, use ReLU!
• Weakness: the derivative is 0 on the negative side. Leaky ReLU fixes this with a small non-zero slope there. However, Leaky ReLU isn't used much in practice!

```python
def relu(z):
    return np.maximum(0, z)
```
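For comparison, here is a sketch of the ReLU derivative and of Leaky ReLU. The function names and the slope `alpha=0.01` are assumptions made for this example (0.01 is a commonly used value, not something fixed by the note).

```python
import numpy as np

def relu_derivative(z):
    # 1 where z > 0, else 0 (the value at exactly z = 0 is a convention)
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    # alpha is the small slope on the negative side (assumed value)
    return np.where(z > 0, z, alpha * z)

def leaky_relu_derivative(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu_derivative(z))        # [0. 0. 0. 1. 1.]
print(leaky_relu_derivative(z))  # never exactly 0 on the negative side
```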

## Logistic Regression

Usually used for binary classification (there are only 2 possible outputs). For multiclass classification, we can use one-vs-all (combine multiple logistic regression classifiers, one per class).

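Gradient Descent is an algorithm for minimizing the cost function $J(w, b)$. Each iteration contains 2 steps: Forward Propagation (from $X$, compute the cost $J$) and Backward Propagation (compute the derivatives and optimize the parameters $w, b$).

Initialize $w, b$ and then repeat until convergence ($m$: number of training examples, $\alpha$: learning rate, $J$: cost function, $A$: activation, i.e. the output of the sigmoid $\sigma$):

$$
\begin{aligned}
A &= \sigma(w^T X + b) \\
J(w, b) &= -\frac{1}{m}\left(Y (\log A)^T + (1 - Y) (\log(1 - A))^T\right) \\
dw &= \frac{1}{m} X (A - Y)^T \\
db &= \frac{1}{m} \sum (A - Y) \\
w &:= w - \alpha \, dw \\
b &:= b - \alpha \, db
\end{aligned}
$$

The dimension of variables: $X \in \mathbb{R}^{n_x \times m}$, $Y \in \mathbb{R}^{1 \times m}$, $w \in \mathbb{R}^{n_x \times 1}$.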

### Code

```python
# Uses numpy (np) and sigmoid() from the "Sigmoid function" section.
def logistic_regression_model(X_train, Y_train, X_test, Y_test,
                              num_iterations=2000, learning_rate=0.5):
    m = X_train.shape[1]  # number of training examples

    # INITIALIZE w, b
    w = np.zeros((X_train.shape[0], 1))
    b = 0

    for i in range(num_iterations):
        # FORWARD PROPAGATION (from x to cost)
        A = sigmoid(np.dot(w.T, X_train) + b)
        cost = -1/m * (np.dot(Y_train, np.log(A.T))
                       + np.dot((1 - Y_train), np.log(1 - A.T)))
        cost = np.squeeze(cost)

        # BACKWARD PROPAGATION (find grad)
        dw = 1/m * np.dot(X_train, (A - Y_train).T)
        db = 1/m * np.sum(A - Y_train)

        # OPTIMIZE
        w = w - learning_rate * dw
        b = b - learning_rate * db

    # PREDICT (with optimized w, b)
    A = sigmoid(np.dot(w.T, X_test) + b)
    Y_pred_test = A > 0.5

    # Y_test is not used here; it is only needed to score Y_pred_test.
    return w, b, Y_pred_test
```
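The `dw` and `db` lines in the function above can be sanity-checked numerically. The sketch below is not part of the course code: the helpers `cost` and `grads` and the random toy data are made up for the check. It compares the analytic gradients with central finite differences of the cost.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(w, b, X, Y):
    # Cross-entropy cost of logistic regression, as a plain float
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)
    return float(np.squeeze(-1/m * (Y @ np.log(A).T + (1 - Y) @ np.log(1 - A).T)))

def grads(w, b, X, Y):
    # Same expressions as the dw / db lines of the model
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)
    dw = 1/m * X @ (A - Y).T
    db = 1/m * np.sum(A - Y)
    return dw, db

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 20))                        # n_x = 3, m = 20 (toy data)
Y = (rng.random(size=(1, 20)) > 0.5).astype(float)
w = rng.normal(size=(3, 1)) * 0.1
b = 0.0

dw, db = grads(w, b, X, Y)

# Central finite differences, one parameter at a time
eps = 1e-6
dw_numeric = np.zeros_like(w)
for j in range(w.shape[0]):
    step = np.zeros_like(w)
    step[j] = eps
    dw_numeric[j] = (cost(w + step, b, X, Y) - cost(w - step, b, X, Y)) / (2 * eps)
db_numeric = (cost(w, b + eps, X, Y) - cost(w, b - eps, X, Y)) / (2 * eps)

max_diff = np.max(np.abs(dw - dw_numeric))
print(max_diff)  # should be tiny if the formulas are right
```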

## Neural Network overview

### Notations

• $x^{(i)}$: $i$th training example.
• $m$: number of examples.
• $L$: number of layers.
• $n^{[0]} = n_x$: number of features (# nodes in the input).
• $n^{[L]}$: number of nodes in the output layer.
• $n^{[l]}$: number of nodes in the hidden layers.
• $W^{[l]}$: weights for $Z^{[l]}$.
• $a^{[0]} = X$: activation in the input layer.
• $a^{[2]}_i$: activation in layer 2, node $i$.
• $a^{[2](i)}$: activation in layer 2, example $i$.
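• $a^{[L]} = \hat{y}$.

• $A^{[0]} = X \in \mathbb{R}^{n^{[0]} \times m}$.
• $Z^{[l]}, A^{[l]} \in \mathbb{R}^{n^{[l]} \times m}$.
• $W^{[l]}, dW^{[l]} \in \mathbb{R}^{n^{[l]} \times n^{[l-1]}}$.
• $b^{[l]}, db^{[l]} \in \mathbb{R}^{n^{[l]} \times 1}$.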
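Shapes are the easiest thing to get wrong with this notation, so here is a small shape check. The layer sizes in `layer_dims` and the use of tanh are arbitrary assumptions for illustration; only the shapes matter.

```python
import numpy as np

layer_dims = [5, 4, 3, 1]   # hypothetical: 5 input features, two hidden layers, 1 output
m = 10                      # number of examples
L = len(layer_dims) - 1     # the input layer is not counted

rng = np.random.default_rng(1)
A = rng.normal(size=(layer_dims[0], m))   # input activations X, one column per example
shapes = []
for l in range(1, L + 1):
    W = rng.normal(size=(layer_dims[l], layer_dims[l - 1]))  # (nodes in l, nodes in l-1)
    b = np.zeros((layer_dims[l], 1))                         # broadcast over the m columns
    Z = W @ A + b                                            # (nodes in l, m)
    A = np.tanh(Z)                                           # same shape as Z
    shapes.append((W.shape, b.shape, A.shape))

for l, s in enumerate(shapes, start=1):
    print(l, s)
```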

### L-layer deep neural network

1. Initialize parameters / Define hyperparameters
2. Loop for num_iterations:
    1. Forward propagation
    2. Compute cost function
    3. Backward propagation
    4. Update parameters (using parameters, and grads from backprop)
3. Use trained parameters to predict labels.

### Initialize parameters

• In Logistic Regression, we use $0$ for $w, b$ (it's OK because LogR doesn't have hidden layers), but we can't in the NN model!
• If we use $0$, we'll run into the complete symmetry problem. No matter how long you train your NN, the hidden units compute exactly the same function → no point in having more than 1 hidden unit!
• We add a little bit (small random values) in $W$ and keep $0$ in $b$.
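The symmetry problem can be observed directly. The following is a rough sketch, not the course's code: a tiny 2-layer network with sigmoid activations, a made-up helper `train_two_layer`, and random toy data. After training from all-zero weights, every hidden unit still has the same weights; with small random weights they differ.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_two_layer(W1, b1, W2, b2, X, Y, learning_rate=0.5, num_iterations=200):
    m = X.shape[1]
    for _ in range(num_iterations):
        # forward
        A1 = sigmoid(W1 @ X + b1)
        A2 = sigmoid(W2 @ A1 + b2)
        # backward (cross-entropy cost, sigmoid activations)
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)
        dW1 = dZ1 @ X.T / m
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m
        # update
        W1 = W1 - learning_rate * dW1
        b1 = b1 - learning_rate * db1
        W2 = W2 - learning_rate * dW2
        b2 = b2 - learning_rate * db2
    return W1, W2

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 50))
Y = (X[0:1] * X[1:2] > 0).astype(float)   # toy labels

n_h = 3  # number of hidden units (arbitrary)
W1_zero, _ = train_two_layer(np.zeros((n_h, 2)), np.zeros((n_h, 1)),
                             np.zeros((1, n_h)), np.zeros((1, 1)), X, Y)
W1_rand, _ = train_two_layer(rng.normal(size=(n_h, 2)) * 0.01, np.zeros((n_h, 1)),
                             rng.normal(size=(1, n_h)) * 0.01, np.zeros((1, 1)), X, Y)

# Each row of W1 is the weight vector of one hidden unit
rows_identical_zero = np.allclose(W1_zero, W1_zero[0])
rows_identical_rand = np.allclose(W1_rand, W1_rand[0])
print(rows_identical_zero, rows_identical_rand)
```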

### Forward & Backward Propagation

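Forward Propagation: loop through the layers, $l = 1, \dots, L$ (with $A^{[0]} = X$):
1. $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ (linear)
2. $A^{[l]} = g^{[l]}(Z^{[l]})$ (for $l = 1, \dots, L-1$, non-linear activations)
3. $A^{[L]} = \sigma(Z^{[L]})$ (sigmoid function)

Cost function: $J(W, b) = -\frac{1}{m}\left(Y (\log A^{[L]})^T + (1 - Y) (\log(1 - A^{[L]}))^T\right)$

Backward Propagation: loop through the layers in reverse
1. $dA^{[L]} = -\frac{Y}{A^{[L]}} + \frac{1 - Y}{1 - A^{[L]}}$.
2. for $l = L, \dots, 1$, with $g^{[l]}$ the non-linear activation of layer $l$ ($g^{[L]} = \sigma$):
    1. $dZ^{[l]} = dA^{[l]} * g^{[l]\prime}(Z^{[l]})$ (element-wise product).
    2. $dW^{[l]} = \frac{1}{m} dZ^{[l]} A^{[l-1]T}$.
    3. $db^{[l]} = \frac{1}{m} \sum dZ^{[l]}$ (sum over the examples, i.e. `np.sum(dZ, axis=1, keepdims=True)`).
    4. $dA^{[l-1]} = W^{[l]T} dZ^{[l]}$.

Update parameters: loop through the layers (for $l = 1, \dots, L$)
1. $W^{[l]} = W^{[l]} - \alpha \, dW^{[l]}$.
2. $b^{[l]} = b^{[l]} - \alpha \, db^{[l]}$.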
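A short worked derivation, assuming the cross-entropy cost and a sigmoid output layer as in this section: for the last layer, the derivative of the cost with respect to $A^{[L]}$ and the derivative of the sigmoid combine into a very simple expression for $dZ^{[L]}$ (all products and divisions are element-wise).

$$
\begin{aligned}
dA^{[L]} &= -\frac{Y}{A^{[L]}} + \frac{1 - Y}{1 - A^{[L]}}, \qquad \sigma'(Z^{[L]}) = A^{[L]}\left(1 - A^{[L]}\right) \\
dZ^{[L]} &= dA^{[L]} * \sigma'(Z^{[L]}) = -Y\left(1 - A^{[L]}\right) + (1 - Y)A^{[L]} = A^{[L]} - Y
\end{aligned}
$$

This is the same `A - Y` term that appears in the Logistic Regression gradients `dw` and `db`.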

### Code

```python
# Uses numpy (np), relu() and sigmoid() from the sections above.
def L_Layer_NN(X, Y, layer_dims, learning_rate=0.0075,
               num_iterations=3000, print_cost=False):
    costs = []
    m = X.shape[1]            # number of training examples
    L = len(layer_dims) - 1   # number of layers (the input layer is not counted)

    # INITIALIZE W, b (index 0 is an unused placeholder, so index l means layer l)
    params = {'W': [None], 'b': [None]}
    for l in range(1, L + 1):
        params['W'].append(np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01)
        params['b'].append(np.zeros((layer_dims[l], 1)))

    for i in range(num_iterations):
        # FORWARD PROPAGATION ((Linear -> ReLU) x (L-1) -> Linear -> Sigmoid)
        A = X
        caches = {'A_prev': [None], 'Z': [None]}
        for l in range(1, L + 1):
            caches['A_prev'].append(A)
            Z = np.dot(params['W'][l], A) + params['b'][l]
            if l != L:  # hidden layers
                A = relu(Z)
            else:       # output layer
                A = sigmoid(Z)
            caches['Z'].append(Z)

        # COST
        cost = -1/m * (np.dot(np.log(A), Y.T) + np.dot(np.log(1 - A), (1 - Y).T))
        cost = float(np.squeeze(cost))

        # BACKWARD PROPAGATION (Sigmoid -> Linear -> (ReLU -> Linear) x (L-1))
        grads = {'dW': [None] * (L + 1), 'db': [None] * (L + 1)}
        dA = -(np.divide(Y, A) - np.divide(1 - Y, 1 - A))
        for l in range(L, 0, -1):
            cache_Z = caches['Z'][l]
            if l != L:  # hidden layers (ReLU)
                dZ = np.array(dA, copy=True)
                dZ[cache_Z <= 0] = 0
            else:       # output layer (sigmoid)
                dZ = dA * sigmoid(cache_Z) * (1 - sigmoid(cache_Z))
            cache_A_prev = caches['A_prev'][l]
            grads['dW'][l] = 1/m * np.dot(dZ, cache_A_prev.T)
            grads['db'][l] = 1/m * np.sum(dZ, axis=1, keepdims=True)
            dA = np.dot(params['W'][l].T, dZ)  # gradient for the previous layer

        # UPDATE PARAMETERS
        for l in range(1, L + 1):
            params['W'][l] = params['W'][l] - learning_rate * grads['dW'][l]
            params['b'][l] = params['b'][l] - learning_rate * grads['db'][l]

        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))
            costs.append(cost)

    return params
```

## Parameters vs Hyperparameters

• Parameters: $W^{[l]}, b^{[l]}$.
• Hyperparameters:
    • Learning rate ($\alpha$).
    • Number of iterations (in gradient descent algorithm) (num_iterations).
    • Number of layers ($L$).
    • Number of nodes in each layer ($n^{[l]}$).
    • Choice of activation functions (their form, not their values).

• Always vectorize if possible! Especially over the number of examples!
• We can't vectorize over the number of layers; we need a for loop.
• Sometimes, functions computed with a Deep NN (more layers, fewer nodes in each layer) are better than with a Shallow one (fewer layers, more nodes). E.g. the XOR function.
• The deeper the layer in the network, the more complex the features it determines!
• Applied deep learning is a very empirical process! The best values depend a lot on data, algorithms, hyperparameters, CPU, GPU,...
• A learning algorithm sometimes works thanks to the data, not thanks to your thousands of lines of code (surprise!!!)
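The XOR remark above can be illustrated with a rough sketch. It is an illustration with hand-picked weights, not a trained network, and the names `xor_unit` and `deep_parity` are made up. The XOR of $n$ bits is computed by a tree of small 2-input XOR blocks, each built from two ReLU units, so the depth grows like $\log_2 n$ and only $n - 1$ blocks are needed. The usual argument is that a network with a single hidden layer would need far more units for the same function.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def xor_unit(a, b):
    # XOR of two 0/1 inputs using two ReLU hidden units (hand-picked weights)
    return relu(a + b) - 2 * relu(a + b - 1)

def deep_parity(bits):
    # Pairwise XORs arranged as a binary tree; len(bits) must be a power of two
    layer = np.asarray(bits, dtype=float)
    depth = 0
    while layer.size > 1:
        layer = xor_unit(layer[0::2], layer[1::2])
        depth += 1
    return int(layer[0]), depth

bits = [1, 0, 1, 1, 0, 0, 1, 0]     # n = 8
parity, depth = deep_parity(bits)
print(parity, depth)                # 0 3
```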

## Application: recognize a cat

This section contains an idea, not a complete task!

## Python tips in this course

✳️ Reshape quickly from (10,9,9,3) to (9*9*3,10):

```python
X = np.random.rand(10, 9, 9, 3)
X = X.reshape(10, -1).T
```
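A quick check of why the reshape is written as `reshape(10, -1).T` rather than reshaping straight to the target shape. The array below is filled with consecutive integers purely so the result is easy to inspect; treating it as 10 images of 9x9x3 is an assumption for the example.

```python
import numpy as np

X = np.arange(10 * 9 * 9 * 3).reshape(10, 9, 9, 3)   # 10 hypothetical 9x9x3 images
X_flat = X.reshape(X.shape[0], -1).T                  # shape (243, 10)

# Column i holds image i, flattened
same = np.array_equal(X_flat[:, 0], X[0].ravel())

# A tempting shortcut: the shape is right, but pixels of different images get mixed
X_wrong = X.reshape(-1, X.shape[0])
mixed = not np.array_equal(X_wrong[:, 0], X[0].ravel())

print(X_flat.shape, same, mixed)   # (243, 10) True True
```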
✳️ Don't use loops, use vectorization!
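A minimal sketch of what this means for a dot product (array size and seed are arbitrary): both versions compute the same number, but the vectorized one is a single NumPy call and is typically much faster on large arrays.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(1000,))
x = rng.normal(size=(1000,))

# Explicit loop
z_loop = 0.0
for i in range(w.shape[0]):
    z_loop += w[i] * x[i]

# Vectorized equivalent
z_vec = np.dot(w, x)

print(np.isclose(z_loop, z_vec, atol=1e-6))
```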