If you want to break into cutting-edge AI, this course will help you do so.

👉 Check the article "Comparison of activation functions" on Wikipedia.

Suppose $g(z) = z$ (a linear activation).

Then you might as well not have any hidden layer: stacked linear layers collapse into a single linear map, so your model is just Logistic Regression with no hidden units! Always use non-linear activations for the hidden layers.
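As a quick numerical check (the layer shapes below are made up just for the demo), composing two linear layers is exactly one linear layer:

```
import numpy as np

# Hypothetical shapes: 3 inputs -> 4 "hidden" units -> 2 outputs, 5 examples
np.random.seed(0)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(2, 4), np.random.randn(2, 1)
x = np.random.randn(3, 5)

two_linear_layers = W2 @ (W1 @ x + b1) + b2         # "deep" linear network
one_linear_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)   # equivalent single layer
print(np.allclose(two_linear_layers, one_linear_layer))  # True
```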

- Usually used in the output layer for binary classification.

- Don't use sigmoid in the hidden layers!

```
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
```
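Its derivative has the closed form $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, which is what backpropagation uses: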

```
def sigmoid_derivative(z):
    return sigmoid(z) * (1 - sigmoid(z))
```

The output of the softmax function can be used to represent a categorical distribution – that is, a probability distribution over K different possible outcomes.

```
def softmax(z):
    z_exp = np.exp(z)
    z_sum = np.sum(z_exp, axis=1, keepdims=True)
    return z_exp / z_sum
```
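One caveat not handled above: `np.exp` overflows for large inputs. A common fix (a sketch, not part of the original snippet) is to subtract the row-wise max first; this leaves the output unchanged:

```
def softmax_stable(z):
    # exp(z - max) avoids overflow; the shift cancels in the ratio
    z_shift = z - np.max(z, axis=1, keepdims=True)
    z_exp = np.exp(z_shift)
    return z_exp / np.sum(z_exp, axis=1, keepdims=True)
```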

- tanh is usually better than sigmoid for hidden units because its outputs have mean close to 0, which centers the data better for the next layer.

- Don't use sigmoid on hidden units; reserve it for the output layer, because in the case $y \in \{0, 1\}$, sigmoid is better than tanh.

```
def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))
```
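(NumPy's built-in `np.tanh` computes the same thing.) For backprop, the derivative is $1 - \tanh^2(z)$; a minimal companion sketch:

```
def tanh_derivative(z):
    # d/dz tanh(z) = 1 - tanh(z)^2
    return 1 - np.tanh(z) ** 2
```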

- ReLU (**R**ectified **L**inear **U**nit).

- Its derivative stays much further from 0 than sigmoid's or tanh's (which saturate) $\to$ the network learns faster!

- If you aren't sure which activation to use, use ReLU!

- Weakness: the derivative is ~0 on the negative side, so we can use **Leaky ReLU** instead (sketched after the ReLU code below). However, Leaky ReLU isn't used much in practice!

```
def relu(z):
    return np.maximum(0, z)
```
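For reference, a minimal Leaky ReLU sketch, as mentioned above (the 0.01 slope for negative inputs is a common default, not something these notes prescribe):

```
def leaky_relu(z, slope=0.01):
    # A small positive slope on the negative side keeps the gradient from dying.
    return np.where(z > 0, z, slope * z)
```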

Usually used for binary classification (there are only **2 outputs**). For multiclass classification, we can use *one vs all* (couple multiple logistic regression classifiers). Gradient Descent is an algorithm for minimizing the cost function $J$. It contains 2 steps: **Forward Propagation** (from $X$, compute the cost $J$) and **Backward Propagation** (compute the derivatives $dw, db$ and optimize the parameters $w, b$). Initialize $w = 0$, $b = 0$ and then repeat until convergence ($m$: number of training examples, $\alpha$: learning rate, $J$: cost function, $\sigma$: sigmoid activation function):

- $A = \sigma(w^T X + b)$, $\quad J = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log a^{(i)} + (1 - y^{(i)}) \log (1 - a^{(i)}) \right]$.

- $dw = \frac{1}{m} X (A - Y)^T$, $\quad db = \frac{1}{m} \sum_{i=1}^{m} (a^{(i)} - y^{(i)})$.

- $w := w - \alpha\, dw$, $\quad b := b - \alpha\, db$.

The dimensions of the variables: $X \in \mathbb{R}^{n_x \times m}$, $w \in \mathbb{R}^{n_x \times 1}$, $b \in \mathbb{R}$.

```
def logistic_regression_model(X_train, Y_train, X_test, Y_test,
                              num_iterations=2000, learning_rate=0.5):
    m = X_train.shape[1]  # number of training examples

    # INITIALIZE w, b
    w = np.zeros((X_train.shape[0], 1))
    b = 0

    # GRADIENT DESCENT
    for i in range(num_iterations):
        # FORWARD PROPAGATION (from x to cost)
        A = sigmoid(np.dot(w.T, X_train) + b)
        cost = -1/m * (np.dot(Y_train, np.log(A.T))
                       + np.dot((1 - Y_train), np.log(1 - A.T)))
        cost = np.squeeze(cost)

        # BACKWARD PROPAGATION (find grad)
        dw = 1/m * np.dot(X_train, (A - Y_train).T)
        db = 1/m * np.sum(A - Y_train)

        # OPTIMIZE (gradient-descent update)
        w = w - learning_rate * dw
        b = b - learning_rate * db

    # PREDICT (with optimized w, b)
    A = sigmoid(np.dot(w.T, X_test) + b)
    Y_pred_test = (A > 0.5).astype(int)

    return w, b, Y_pred_test
```
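A quick smoke test on random data (all shapes and sizes below are invented for illustration; columns are examples, rows are features):

```
np.random.seed(42)
X_train = np.random.randn(4, 100)                     # 4 features, 100 examples
Y_train = (np.random.rand(1, 100) > 0.5).astype(int)  # random binary labels
X_test = np.random.randn(4, 20)
Y_test = (np.random.rand(1, 20) > 0.5).astype(int)

w, b, Y_pred_test = logistic_regression_model(X_train, Y_train, X_test, Y_test,
                                              num_iterations=100, learning_rate=0.5)
print(Y_pred_test.shape)  # (1, 20)
```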

- $x^{(i)}$: $i$th training example.

- $m$: number of examples.

- $L$: number of layers.

- $n_x = n^{[0]}$: number of features (# nodes in the input layer).

- $n^{[L]}$: number of nodes in the output layer.

- $n^{[l]}$: number of nodes in hidden layer $l$.

- $W^{[l]}, b^{[l]}$: weights for computing $Z^{[l]}$.

- $a^{[0]} = x$: activation in the input layer.

- $a^{[2]}_i$: activation in layer 2, node $i$.

- $a^{[2](i)}$: activation in layer 2, example $i$.

- $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$, $\quad A^{[l]} = g^{[l]}(Z^{[l]})$.

- $W^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$.

- $b^{[l]}$ has shape $(n^{[l]}, 1)$.

- $Z^{[l]}$ and $A^{[l]}$ have shape $(n^{[l]}, m)$.

- $dW^{[l]}$ and $db^{[l]}$ have the same shapes as $W^{[l]}$ and $b^{[l]}$.
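To make the dimensions concrete, a tiny check with a made-up architecture (`layers_dims` here is hypothetical):

```
layers_dims = [5, 4, 3, 1]  # hypothetical: n_x = 5, hidden layers of 4 and 3 nodes, 1 output
for l in range(1, len(layers_dims)):
    print(f"W[{l}]: {(layers_dims[l], layers_dims[l-1])},  b[{l}]: {(layers_dims[l], 1)}")
# W[1]: (4, 5),  b[1]: (4, 1)
# W[2]: (3, 4),  b[2]: (3, 1)
# W[3]: (1, 3),  b[3]: (1, 1)
```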

- Initialize parameters / Define hyperparameters.

- Loop for `num_iterations`:
  - Forward propagation.
  - Compute the cost function.
  - Backward propagation.
  - Update parameters (using the parameters and the grads from backprop).

- Use the trained parameters to predict labels.

- In Logistic Regression, we initialize $w = 0$, $b = 0$ (it's OK because LogR doesn't have hidden layers), but we can't initialize the weights to 0 in the NN model!

- If we use 0, we'll run into the
**completely symmetric problem**: no matter how long you train your NN, all hidden units compute exactly the same function → there is no point in having more than 1 hidden unit!

- We add a little bit of randomness in $W$ and keep 0 in $b$, as in the sketch below.
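A minimal sketch of this symmetry-breaking initialization (layer sizes assumed just for illustration):

```
n_x, n_h = 3, 4  # assumed sizes: 3 input features, 4 hidden units
W1 = np.random.randn(n_h, n_x) * 0.01  # small random values break the symmetry
b1 = np.zeros((n_h, 1))                # zeros are fine for the biases
```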

**Forward Propagation**: Loop through the layers $l = 1, \dots, L$:

- $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ (linear).

- $A^{[l]} = g^{[l]}(Z^{[l]})$ (for $l = 1, \dots, L-1$, non-linear activations such as ReLU).

- $A^{[L]} = \sigma(Z^{[L]})$ (sigmoid function).

**Cost function**: $J = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log a^{[L](i)} + (1 - y^{(i)}) \log (1 - a^{[L](i)}) \right]$.

**Backward Propagation**: Loop through the layers:

- $dA^{[L]} = -\frac{Y}{A^{[L]}} + \frac{1 - Y}{1 - A^{[L]}}$.

- for $l = L, \dots, 1$ (non-linear activations):
  - $dZ^{[l]} = dA^{[l]} * g^{[l]\prime}(Z^{[l]})$.
  - $dW^{[l]} = \frac{1}{m}\, dZ^{[l]} A^{[l-1]T}$.
  - $db^{[l]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)}$.
  - $dA^{[l-1]} = W^{[l]T} dZ^{[l]}$.

**Update parameters**: loop through the layers (for $l = 1, \dots, L$):

- $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}$.

- $b^{[l]} := b^{[l]} - \alpha\, db^{[l]}$.

```
def L_Layer_NN(X, Y, layers_dims, learning_rate=0.0075,
               num_iterations=3000, print_cost=False):
    costs = []
    m = X.shape[1]             # number of training examples
    L = len(layers_dims) - 1   # number of layers (excluding the input layer)

    # INITIALIZE W, b (small random W to break symmetry, zero b)
    params = {'W': [], 'b': []}
    for l in range(L):
        params['W'].append(np.random.randn(layers_dims[l+1], layers_dims[l]) * 0.01)
        params['b'].append(np.zeros((layers_dims[l+1], 1)))

    # GRADIENT DESCENT
    for i in range(num_iterations):
        # FORWARD PROPAGATION (Linear -> ReLU x (L-1) -> Linear -> Sigmoid)
        A = X
        caches = {'A_prev': [], 'W': [], 'b': [], 'Z': []}
        for l in range(L):
            caches['A_prev'].append(A)
            W = params['W'][l]
            b = params['b'][l]
            caches['W'].append(W)
            caches['b'].append(b)
            Z = np.dot(W, A) + b
            if l != L - 1:  # hidden layers
                A = relu(Z)
            else:           # output layer
                A = sigmoid(Z)
            caches['Z'].append(Z)

        # COST
        cost = -1/m * np.dot(np.log(A), Y.T) - 1/m * np.dot(np.log(1 - A), (1 - Y).T)

        # BACKWARD PROPAGATION (Sigmoid (L) -> Linear -> ReLU x (L-1) -> Linear)
        dA = -(np.divide(Y, A) - np.divide(1 - Y, 1 - A))
        grads = {'dW': [None] * L, 'db': [None] * L}
        for l in reversed(range(L)):
            cache_Z = caches['Z'][l]
            if l != L - 1:  # hidden layers (ReLU backward)
                dZ = np.array(dA, copy=True)
                dZ[cache_Z <= 0] = 0
            else:           # output layer (sigmoid backward)
                dZ = dA * sigmoid(cache_Z) * (1 - sigmoid(cache_Z))
            cache_A_prev = caches['A_prev'][l]
            grads['dW'][l] = 1/m * np.dot(dZ, cache_A_prev.T)
            grads['db'][l] = 1/m * np.sum(dZ, axis=1, keepdims=True)
            dA = np.dot(caches['W'][l].T, dZ)  # becomes dA_prev for the next layer down

        # UPDATE PARAMETERS
        for l in range(L):
            params['W'][l] = params['W'][l] - learning_rate * grads['dW'][l]
            params['b'][l] = params['b'][l] - learning_rate * grads['db'][l]

        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, np.squeeze(cost)))
            costs.append(cost)

    return params
```
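A quick sanity run on random data (architecture and sizes invented for the demo):

```
np.random.seed(1)
X = np.random.randn(20, 150)                    # 20 features, 150 examples
Y = (np.random.rand(1, 150) > 0.5).astype(int)  # random binary labels
params = L_Layer_NN(X, Y, layers_dims=[20, 7, 5, 1],
                    learning_rate=0.0075, num_iterations=300, print_cost=True)
```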

**Parameters**: $W^{[l]}$ and $b^{[l]}$ for $l = 1, \dots, L$.

**Hyperparameters**:

- Learning rate ($\alpha$).
- Number of iterations of gradient descent (`num_iterations`).
- Number of layers ($L$).
- Number of nodes in each layer ($n^{[l]}$).
- Choice of activation functions (their form, not their values).

- Always use vectorized code if possible, especially across the number of examples!

- We can't vectorize across the number of layers; we need a `for` loop there.

- Sometimes, functions computed with a **deep** NN (more layers, fewer nodes in each layer) are better than those computed with a **shallow** one (fewer layers, more nodes), e.g. the `XOR` function.

- The deeper a layer is in the network, the more complex the features it can detect!

- Applied deep learning is a very empirical process! The best values depend heavily on the data, algorithms, hyperparameters, CPU, GPU, ...

- Sometimes the learning comes from the data, not from your thousands of lines of code (surprise!!!)

This section contains quick ideas, not complete tasks!

✳️ Reshape quickly from `(10,9,9,3)` to `(9*9*3,10)`:

```
X = np.random.rand(10, 9, 9, 3)
X = X.reshape(10, -1).T  # new shape: (9*9*3, 10)
```

✳️ Don't use loops, use **vectorization**!
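For instance, a dot product over 1000 terms with an explicit loop vs. one vectorized call (sizes arbitrary):

```
import numpy as np

w = np.random.randn(1000)
x = np.random.randn(1000)

# Loop version: slow, one Python-level multiply-add per element
slow = 0.0
for i in range(len(w)):
    slow += w[i] * x[i]

# Vectorized version: a single optimized call
fast = np.dot(w, x)
print(np.isclose(slow, fast))  # True
```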