If you want to break into cutting-edge AI, this course will help you do so.
👉 See "Comparison of activation functions" on Wikipedia.
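Suppose $g(z) = z$ (linear). Then $a^{[1]} = W^{[1]}x + b^{[1]}$ and $a^{[2]} = W^{[2]}a^{[1]} + b^{[2]} = W'x + b'$, which is still linear in $x$.
With only linear activations, the hidden layers add nothing: you might as well not have any hidden layer! Your model is just Logistic Regression, no hidden unit! Always use non-linear activations for hidden layers!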
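A quick numerical check of the argument above, as a sketch only: the layer sizes, the random data and the names `W_merged` and `b_merged` are made up for illustration. It stacks two layers with the identity activation and compares the result with a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 input features, 4 hidden units, 1 output, 5 examples.
X = rng.standard_normal((3, 5))
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

# Two layers with the identity activation g(z) = z.
A1 = W1 @ X + b1
A2 = W2 @ A1 + b2

# The same map written as a single linear layer.
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
A2_single = W_merged @ X + b_merged

print(np.allclose(A2, A2_single))  # True
```

However many linear layers are stacked, they can always be merged this way, so the extra layers add no expressive power.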
- Usually used in the output layer in the binary classification.
- Don't use sigmoid in the hidden layers!
The output of the softmax function can be used to represent a categorical distribution – that is, a probability distribution over K different possible outcomes.
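As an illustration, here is a minimal NumPy sketch of softmax. The function name `softmax`, the column-per-example layout and the sample scores are assumptions made for this example, and subtracting the column maximum is a common stability trick rather than part of the definition.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax for an array of shape (K, m)."""
    # Subtracting the column max does not change the result but avoids overflow.
    shifted = z - np.max(z, axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=0, keepdims=True)

# K = 3 classes (rows), m = 2 examples (columns); the scores are made up.
scores = np.array([[2.0, 0.0],
                   [1.0, 0.0],
                   [0.1, 0.0]])
probs = softmax(scores)
```

Each column of `probs` is non-negative and sums to 1, so it can be read as a probability distribution over the K classes.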
- tanh is better than sigmoid because its outputs have mean $\to$ 0, which centers the data better for the next layer.
- Don't use sigmoid on hidden units. The exception is the output layer: in the case $0 \le \hat{y} \le 1$, sigmoid is better than tanh.
- ReLU (Rectified Linear Unit).
- Its derivative stays far from 0 over a much wider range than sigmoid/tanh $\to$ learn faster!
- If you aren't sure which one to use in the activation, use ReLU!
- Weakness: derivative = 0 on the negative side; Leaky ReLU fixes this with a small non-zero slope there. However, Leaky ReLU isn't used much in practice!
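The activations discussed above and their derivatives can be sketched in NumPy as follows. The function names and the Leaky ReLU slope of `0.01` are assumptions chosen for illustration, not values fixed by these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # The derivative at exactly 0 is undefined; 0 is used here by convention.
    return (z > 0).astype(float)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def leaky_relu_grad(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
```

Evaluating the gradient functions on `z` shows the pattern from the list: the sigmoid and tanh gradients shrink towards 0 at both ends, the ReLU gradient is exactly 1 for every positive input, and Leaky ReLU keeps a small non-zero gradient on the negative side.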
Usually used for binary classification (there are only 2 possible outputs). In the case of multiclass classification, we can use one-vs-all (combine multiple logistic regression steps).
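Gradient Descent is an algorithm for minimizing the cost function $J$. It contains 2 steps: Forward Propagation (from $X$, compute the cost $J$) and Backward Propagation (compute derivatives and optimize the parameters $w, b$).
Initialize $w, b$ and then repeat until convergence ($m$: number of training examples, $\alpha$: learning rate, $J$: cost function, $A$: activation):

$$
\begin{aligned}
A &= \sigma(w^T X + b) \\
J(w, b) &= -\frac{1}{m} \left( Y \log A^T + (1 - Y) \log (1 - A)^T \right) \\
\partial_w J &= \frac{1}{m} X (A - Y)^T \\
\partial_b J &= \frac{1}{m} \sum (A - Y) \\
w &:= w - \alpha \, \partial_w J \\
b &:= b - \alpha \, \partial_b J
\end{aligned}
$$

The dimension of variables: $X \in \mathbb{R}^{n_x \times m}$, $Y \in \mathbb{R}^{1 \times m}$, $w \in \mathbb{R}^{n_x \times 1}$.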
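A minimal sketch of this loop for logistic regression, assuming `X` has shape `(n_x, m)` and `Y` has shape `(1, m)`. The function name `logistic_regression_gd`, the default learning rate, the iteration count and the toy data are all made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, Y, alpha=0.1, num_iterations=1000):
    """X: (n_x, m), Y: (1, m). Returns w: (n_x, 1), b, and the cost history."""
    n_x, m = X.shape
    w, b = np.zeros((n_x, 1)), 0.0   # zero init is fine without hidden layers
    costs = []
    for _ in range(num_iterations):
        # Forward propagation: from X to the cost J.
        A = sigmoid(w.T @ X + b)                               # (1, m)
        J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
        costs.append(float(J))
        # Backward propagation: derivatives of J.
        dw = X @ (A - Y).T / m                                 # (n_x, 1)
        db = np.sum(A - Y) / m
        # Update step.
        w -= alpha * dw
        b -= alpha * db
    return w, b, costs

# Toy data (made up): the label is 1 when the single feature is positive.
X = np.array([[-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]])
Y = np.array([[0, 0, 0, 1, 1, 1]])
w, b, costs = logistic_regression_gd(X, Y)
predictions = (sigmoid(w.T @ X + b) > 0.5).astype(int)
```

The cost history starts at $\log 2$ (every prediction is 0.5 when $w = 0, b = 0$) and goes down from there.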
- $x^{(i)}$: $i$th training example.
- $m$: number of examples.
- $L$: number of layers.
- $n^{[0]} = n_x$: number of features (# nodes in the input).
- $n^{[L]}$: number of nodes in the output layer.
- $n^{[l]}$: number of nodes in the hidden layers.
- $W^{[l]}$: weights for $z^{[l]}$.
- $a^{[0]} = x$: activation in the input layer.
- $a^{[2]}_i$: activation in layer 2, node $i$.
- $a^{[2](i)}$: activation in layer 2, example $i$.
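- $a^{[L]} = \hat{y}$.
- $A^{[0]} = X \in \mathbb{R}^{n^{[0]} \times m}$.
- $Z^{[l]}, A^{[l]} \in \mathbb{R}^{n^{[l]} \times m}$.
- $W^{[l]}, dW^{[l]} \in \mathbb{R}^{n^{[l]} \times n^{[l-1]}}$.
- $b^{[l]}, db^{[l]} \in \mathbb{R}^{n^{[l]} \times 1}$.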
- Initialize parameters / Define hyperparameters
- Loop for `num_iterations`:
  - Forward propagation
  - Compute cost function
  - Backward propagation
  - Update parameters (using parameters, and grads from backprop)
- Use trained parameters to predict labels.
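The steps above can be sketched end to end for a network with one hidden layer. This is only an illustration under stated assumptions: the names `train` and `predict`, the tanh hidden layer, the 4 hidden units, the learning rate, the `0.01` weight scale and the toy data are all choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, n_h=4, alpha=0.1, num_iterations=2000, seed=0):
    """One hidden layer (tanh) + sigmoid output. X: (n_x, m), Y: (1, m)."""
    rng = np.random.default_rng(seed)
    n_x, m = X.shape
    # 1. Initialize parameters (hyperparameters arrive as arguments).
    W1, b1 = rng.standard_normal((n_h, n_x)) * 0.01, np.zeros((n_h, 1))
    W2, b2 = rng.standard_normal((1, n_h)) * 0.01, np.zeros((1, 1))
    costs = []
    for _ in range(num_iterations):
        # 2a. Forward propagation.
        A1 = np.tanh(W1 @ X + b1)
        A2 = sigmoid(W2 @ A1 + b2)
        # 2b. Cost (cross-entropy).
        costs.append(float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))))
        # 2c. Backward propagation.
        dZ2 = A2 - Y
        dW2, db2 = dZ2 @ A1.T / m, np.sum(dZ2, axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
        dW1, db1 = dZ1 @ X.T / m, np.sum(dZ1, axis=1, keepdims=True) / m
        # 2d. Update parameters.
        W1, b1 = W1 - alpha * dW1, b1 - alpha * db1
        W2, b2 = W2 - alpha * dW2, b2 - alpha * db2
    return (W1, b1, W2, b2), costs

def predict(params, X):
    # 3. Use the trained parameters to predict labels.
    W1, b1, W2, b2 = params
    return (sigmoid(W2 @ np.tanh(W1 @ X + b1) + b2) > 0.5).astype(int)

# Toy data (made up): the label is 1 when the two features sum to a positive number.
X = np.array([[-2.0, -1.0, -1.5, 1.0, 2.0, 1.5],
              [-1.0, -2.0, -0.5, 2.0, 1.0, 0.5]])
Y = np.array([[0, 0, 0, 1, 1, 1]])
params, costs = train(X, Y)
```

Calling `predict(params, X)` returns an array of 0/1 labels with the same shape as `Y`.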
- In Logistic Regression, we use $0$ for $w, b$ (it's OK because LogR doesn't have hidden layers), but we can't in the NN model!
- If we use $0$, we'll run into the complete symmetry problem. No matter how long you train your NN, the hidden units compute exactly the same function → no point in having more than 1 hidden unit!
- We add small random values to $W$ and keep $0$ in $b$.
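One possible way to write this initialization, as a sketch: the function name `initialize_parameters`, the dictionary layout and the `0.01` scale are assumptions made for this example.

```python
import numpy as np

def initialize_parameters(layer_dims, scale=0.01, seed=0):
    """layer_dims = [n_x, n_1, ..., n_L]. Returns {'W1': ..., 'b1': ..., ...}."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        # Small random weights break the symmetry between the units of a layer.
        parameters[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * scale
        # Biases can start at 0: the random weights already differ per unit.
        parameters[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return parameters

# Hypothetical network: 3 inputs, 4 hidden units, 1 output.
params = initialize_parameters([3, 4, 1])
```

Keeping the weights small also keeps the inputs to sigmoid or tanh near 0, where their gradients are largest.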
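Forward Propagation: Loop through number of layers:
- $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ (linear)
- $A^{[l]} = g^{[l]}(Z^{[l]})$ (for $l = 1, \ldots, L-1$, non-linear activations)
- $A^{[L]} = \sigma(Z^{[L]})$ (sigmoid function)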
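A sketch of this forward pass for an arbitrary number of layers. The name `forward_propagation`, the parameter dictionary keys, the choice of ReLU for the hidden layers and the sample sizes are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def forward_propagation(X, parameters):
    """parameters holds W1, b1, ..., WL, bL. Returns A^[L] and (A_prev, Z) per layer."""
    L = len(parameters) // 2
    A = X
    caches = []
    for l in range(1, L + 1):
        W, b = parameters[f"W{l}"], parameters[f"b{l}"]
        Z = W @ A + b                  # linear step; b broadcasts over the m columns
        caches.append((A, Z))          # kept for backward propagation
        A = sigmoid(Z) if l == L else relu(Z)
    return A, caches

rng = np.random.default_rng(1)
parameters = {
    "W1": rng.standard_normal((4, 3)) * 0.01, "b1": np.zeros((4, 1)),
    "W2": rng.standard_normal((1, 4)) * 0.01, "b2": np.zeros((1, 1)),
}
X = rng.standard_normal((3, 5))        # 3 features, 5 examples
AL, caches = forward_propagation(X, parameters)
```

The loop runs over layers only; all `m` examples are handled at once by the matrix product.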
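Cost function: $J(W, b) = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log a^{[L](i)} + (1 - y^{(i)}) \log (1 - a^{[L](i)}) \right)$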
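A small sketch of the cross-entropy cost for a sigmoid output. The name `compute_cost` and the sample values are made up, and the clipping by `eps` is an added safeguard against `log(0)` that is not part of the formula.

```python
import numpy as np

def compute_cost(AL, Y):
    """Cross-entropy cost. AL, Y: shape (1, m)."""
    m = Y.shape[1]
    # eps is an added safeguard (not part of the formula) against log(0).
    eps = 1e-12
    AL = np.clip(AL, eps, 1 - eps)
    return float(-np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m)

Y = np.array([[1, 0, 1]])
confident = compute_cost(np.array([[0.9, 0.1, 0.8]]), Y)    # mostly right
uninformed = compute_cost(np.array([[0.5, 0.5, 0.5]]), Y)   # always 0.5
```

Here `uninformed` equals $\log 2$, and `confident` is lower because its predictions are closer to the labels.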
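Backward Propagation: Loop through number of layers
- $dA^{[L]} = -\frac{Y}{A^{[L]}} + \frac{1-Y}{1-A^{[L]}}$.
- for $l = L, \ldots, 1$, non-linear activations:
  - $dZ^{[l]} = dA^{[l]} * g'^{[l]}(Z^{[l]})$ (element-wise product).
  - $dW^{[l]} = \frac{1}{m} dZ^{[l]} A^{[l-1]T}$.
  - $db^{[l]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)}$.
  - $dA^{[l-1]} = W^{[l]T} dZ^{[l]}$.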
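The backward loop can be sketched as below. To keep the block self-contained it carries its own small forward pass. The names `forward`, `backward` and `cost`, the tanh hidden layers and the random test data are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, parameters):
    """tanh hidden layers, sigmoid output. Returns the list [A0, A1, ..., AL]."""
    L = len(parameters) // 2
    activations = [X]
    for l in range(1, L + 1):
        Z = parameters[f"W{l}"] @ activations[-1] + parameters[f"b{l}"]
        activations.append(sigmoid(Z) if l == L else np.tanh(Z))
    return activations

def backward(Y, parameters, activations):
    """Returns {'dW1': ..., 'db1': ..., ...} for the cross-entropy cost."""
    L = len(parameters) // 2
    m = Y.shape[1]
    AL = activations[L]
    grads = {}
    dA = -(Y / AL) + (1 - Y) / (1 - AL)          # dA^[L]
    for l in range(L, 0, -1):
        A, A_prev = activations[l], activations[l - 1]
        # g'(Z) written in terms of A = g(Z): sigmoid' = A(1 - A), tanh' = 1 - A^2.
        g_prime = A * (1 - A) if l == L else 1 - A ** 2
        dZ = dA * g_prime                         # element-wise product
        grads[f"dW{l}"] = dZ @ A_prev.T / m
        grads[f"db{l}"] = np.sum(dZ, axis=1, keepdims=True) / m
        dA = parameters[f"W{l}"].T @ dZ           # dA^[l-1]
    return grads

def cost(X, Y, parameters):
    AL = forward(X, parameters)[-1]
    return float(-np.mean(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)))

rng = np.random.default_rng(2)
parameters = {
    "W1": rng.standard_normal((4, 3)), "b1": np.zeros((4, 1)),
    "W2": rng.standard_normal((1, 4)), "b2": np.zeros((1, 1)),
}
X = rng.standard_normal((3, 6))
Y = np.array([[0, 1, 1, 0, 1, 0]])
grads = backward(Y, parameters, forward(X, parameters))
```

A useful sanity check for code like this is to nudge one parameter by a tiny amount, recompute `cost`, and compare the finite-difference slope with the matching entry of `grads`.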
Update parameters: loop through number of layers (for $l = 1, \ldots, L$)
- $W^{[l]} = W^{[l]} - \alpha \, dW^{[l]}$.
- $b^{[l]} = b^{[l]} - \alpha \, db^{[l]}$.
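A sketch of the update step, assuming parameters and gradients are stored in dictionaries keyed `W1`, `b1`, `dW1`, `db1`, and so on. That layout, the name `update_parameters` and the numbers are illustrative only.

```python
import numpy as np

def update_parameters(parameters, grads, alpha):
    """One gradient descent step; returns a new dict and leaves the input untouched."""
    L = len(parameters) // 2
    updated = {}
    for l in range(1, L + 1):
        updated[f"W{l}"] = parameters[f"W{l}"] - alpha * grads[f"dW{l}"]
        updated[f"b{l}"] = parameters[f"b{l}"] - alpha * grads[f"db{l}"]
    return updated

parameters = {"W1": np.array([[1.0, 2.0]]), "b1": np.array([[0.5]])}
grads = {"dW1": np.array([[0.1, -0.2]]), "db1": np.array([[1.0]])}
new_parameters = update_parameters(parameters, grads, alpha=0.1)
```

With these numbers, `W1` moves from `[1.0, 2.0]` to `[0.99, 2.02]` and `b1` from `0.5` to `0.4`: each parameter steps against its gradient, scaled by the learning rate.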
- Parameters: $W^{[l]}, b^{[l]}$.
- Hyperparameters:
  - Learning rate ($\alpha$).
  - Number of iterations (in gradient descent algorithm) (`num_iterations`).
  - Number of layers ($L$).
  - Number of nodes in each layer ($n^{[l]}$).
  - Choice of activation functions (their form, not their values).
- Always use vectorization if possible! Especially over the number of examples!
- We can't vectorize over the number of layers; we need a `for` loop.
- Some functions can be computed with far fewer nodes by a Deep NN (more layers, fewer nodes in each layer) than by a Shallow one (fewer layers, more nodes). E.g. the `XOR` function.
- The deeper the layer in the network, the more complex the features it can detect!
- Applied deep learning is a very empirical process! Best values depend heavily on data, algorithms, hyperparameters, CPU, GPU, ...
- Learning algorithms sometimes work because of the data, not because of your thousands of lines of code (surprise!!!)
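To illustrate the advice on vectorizing over the examples, the sketch below computes the same linear step twice: once with an explicit loop over the examples and once with a single matrix product. The sizes and variable names are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
w = rng.standard_normal((3, 1))
b = 0.5
X = rng.standard_normal((3, 1000))   # 1000 examples stored as columns

# Explicit loop over the examples.
Z_loop = np.zeros((1, X.shape[1]))
for i in range(X.shape[1]):
    Z_loop[0, i] = w[:, 0] @ X[:, i] + b

# Vectorized: one matrix product covers every example at once.
Z_vec = w.T @ X + b
```

Both versions give the same values; the vectorized one hands the whole computation to NumPy's compiled routines instead of the Python interpreter, which is typically much faster.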
This section contains an idea, not a complete task!
✳️ Reshape quickly from `(10,9,9,3)` to `(9*9*3,10)` (flatten each example, then transpose).

✳️ Don't use loops, use vectorization!
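One common NumPy idiom for the reshape mentioned above, shown as a sketch: the array contents are made up, and the first axis is assumed to index the examples.

```python
import numpy as np

# Hypothetical batch: 10 images of 9 x 9 pixels with 3 colour channels.
X = np.arange(10 * 9 * 9 * 3).reshape(10, 9, 9, 3)

# Flatten each example into a row, then transpose so examples become columns.
X_flat = X.reshape(X.shape[0], -1).T

print(X_flat.shape)  # (243, 10)
```

Column `j` of `X_flat` holds all the pixels of image `j`, so no loop over the images is needed.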