- HxWxC → too large → too many params → need conv
- edge detection by a filter (3x3 for example) (or kernel) → convolve with the original image using the convolution operator (*)
- Sobel filter, Scharr filter (not only 0, 1, -1)
- Can use backprop to learn the value of filter.
- Not only fixed edge filters → the network can learn filters for edges at any angle (degree).
- Padding
- we don't want the image to shrink every time
- 6x6 * 3x3 (filter) → 4x4
- (nxn)*(fxf) → (n-f+1 x n-f+1)
- we don't want pixels on the corners or edges to be used much less in the outputs
- 8x8 * 3x3 → 6x6 (instead of 4x4)
- → (n+2p-f+1) x (n+2p-f+1)
- 2 common choices
- valid conv → no padding (nxn * fxf → n-f+1 x n-f+1)
- same conv → output size = input size → choose p = (f-1)/2
- f is usually odd → just convention
- 3x3, 5x5 are very common
→ we can pad the images (extend them all around): 6x6 → (pad 1) → 8x8
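The padding arithmetic above can be sketched as a quick helper (a minimal sketch; the function name is my own):

```python
# Output side length of an n x n image convolved with an f x f filter, padding p.
def conv_output_size(n, f, p=0):
    return n + 2 * p - f + 1

# "valid" convolution: no padding
assert conv_output_size(6, 3, p=0) == 4      # 6x6 * 3x3 -> 4x4

# "same" convolution: choose p = (f - 1) / 2 so the size is preserved
p_same = (3 - 1) // 2
assert conv_output_size(6, 3, p=p_same) == 6  # 6x6 padded to 8x8 -> 6x6
```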
- Strided convolutions
- filter moves more than 1 step
- ex: 7x7 * 3x3 (with stride=2) → 3x3
- nxn * fxf (stride s, padding p) → ⌊(n+2p-f)/s + 1⌋ x ⌊(n+2p-f)/s + 1⌋
- if the fraction is not an integer → round down → we take the floor()
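The strided rule can be sketched the same way (helper name is my own):

```python
import math

# floor((n + 2p - f) / s) + 1: output side length with stride s and padding p.
def strided_output_size(n, f, s, p=0):
    return math.floor((n + 2 * p - f) / s) + 1

assert strided_output_size(7, 3, s=2) == 3   # 7x7 * 3x3, stride 2 -> 3x3
```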
- Conv over volumes (not just on 2d images) → ex on RGB images (3 channels)
- 6x6x3 * 3x3x3 (3 layers of filters) → 4x4x1
- We multiply elementwise in each layer and then sum over all 3 layers → gives only 1 number in the resulting matrix → that's why we only have 4x4x1
- if we wanna detect vertical edges only in the red channel → the 1st layer (of the 3x3x3 filter) can detect it, the other 2 are all zeros.
- multiple filters at the same time? → 1st filter (vertical), 2nd (horizontal) → 4x4x2 (2 here is 2 filters)
- we can use 100+ filters → 100+ output channels (detect 100+ different features)
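The volume convolution above can be sketched directly in numpy (random data, loop-based for clarity — not how frameworks implement it):

```python
import numpy as np

# 6x6x3 volume convolved with two 3x3x3 filters -> 4x4x2 output.
rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6, 3))       # H x W x C
filters = rng.standard_normal((2, 3, 3, 3))  # num_filters x f x f x C

out = np.zeros((4, 4, 2))
for k in range(2):                           # one output channel per filter
    for i in range(4):
        for j in range(4):
            # elementwise product over all 3 channels, summed to one number
            out[i, j, k] = np.sum(image[i:i+3, j:j+3, :] * filters[k])

print(out.shape)  # (4, 4, 2)
```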
- 1 layer of ConvNet
- if 10 filters (3x3x3) → how many parameters?
- each filter = 3x3x3=27 + 1 bias ⇒ 28 params
- 10 filters → 280 params
- Notations:
- The number of filters used will be the number of channels in the output
→ no matter what size of image, we only have 280 params with 10 filters (3x3x3)
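Quick check of the parameter count above:

```python
# (f*f*c_in weights + 1 bias) per filter, times the number of filters.
f, c_in, num_filters = 3, 3, 10
params = (f * f * c_in + 1) * num_filters
assert params == 280   # independent of the input image size
```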
- Simple example of ConvNet
- Type of layer in ConvNet
- Convolution (conv)
- Pooling (pool)
- Fully connected (FC)
- Pooling layer
- Purpose?
- to reduce the size of representation → speed up
- to make some of the detected features a bit more robust
- Max pooling → take the max of each region
- Idea? If the feature is detected anywhere in this filter region → keep the high number (upper left); if this feature is not detected (doesn't exist) (upper right), the max is still quite small.
- People use it a lot because it works well in ConvNets, but often people don't really know the meaning behind it.
- Max pooling has no parameters to learn → gradient descent doesn't change anything.
- usually we don't need any padding! (p=0)
- Average pooling → like max pool, we take average
→ max is used much more than avg
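A minimal sketch of 2x2 max pooling with stride 2 (the `reshape` trick assumes the input side is divisible by the pool size):

```python
import numpy as np

# 4x4 input -> 2x2 output; each output is the max of one 2x2 block.
x = np.array([[1, 3, 2, 1],
              [4, 6, 1, 0],
              [2, 5, 7, 8],
              [3, 1, 4, 2]], dtype=float)

pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6. 2.]
                #  [5. 8.]]
```

Swapping `.max` for `.mean` gives average pooling.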
- LeNet-5 (NN example) ← inspired by it, introduced by Yann LeCun
- Convention: 1 layer contains (conv + pool) → when people talk about a layer, they mean a layer with params to learn → pool doesn't have any params to learn.
- nH, nW decrease, nC increases (as we go deeper)
- A typical NN looks something like that: conv → pool → conv → pool → ... → FC → FC → FC → softmax
- activation size goes down
- Why convolution?
- 2 main advantages of conv: (if we use fully connected → too many parameters!!)
- parameter sharing: a feature detector useful in one part of the image is likely useful in another part → no need to learn a separate detector per position.
- Sparsity of connections: each layer, output value depends only on a small number of inputs.
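The parameter gap can be made concrete; the sizes below (32x32x3 input, six 5x5 filters → 28x28x6 output) are my own illustration in the spirit of the lecture example:

```python
# Fully connected: flatten 32x32x3 input, connect to a flattened 28x28x6 output.
fc_params = (32 * 32 * 3) * (28 * 28 * 6)
# Conv: six 5x5x3 filters, each with 1 bias.
conv_params = (5 * 5 * 3 + 1) * 6

print(fc_params)    # 14450688
print(conv_params)  # 456
```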
- Case studies of effective ConvNets
- ConvNet works well on 1 computer vision task → (usually) works well on other tasks.
- Classic networks: LeNet-5, AlexNet, VGG
- (deeper) ResNet (152 layers)
- Inception NN (GoogLeNet)
- Classic networks
- LeNet-5 → focus on section II, III in the article, they're interesting! → use sigmoid/tanh
- 1x1 convolution (network in network)
- just multiply by some number (1 filter)
- With more filters → it makes more sense → applies a set of weights across the channels at each position
- It's basically a fully connected network applied to each of the 36 (6x6) positions: it takes the 32 input numbers and outputs #filters numbers
- Useful for:
- Shrink #channel: Eg. 28x28x192 → (32 filters of 1x1x192) → 28x28x32
- it allows the network to learn a more complex function by adding non-linearity (e.g. in the case 28x28x192 → (1x1x192, 192 filters) → 28x28x192)
- Shrink? (eg. 28x28x192)
- 28x28 → lower? ⇒ use pooling layer
- 192 channels → lower? ⇒ use 1x1 conv
⇒ very useful in building Inception Network!
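A 1x1 conv over channels is just a matrix multiply at every position; a minimal numpy sketch of the shrink example:

```python
import numpy as np

# 28x28x192 input; 32 filters of 1x1x192 stacked as a 192x32 weight matrix.
x = np.random.default_rng(0).standard_normal((28, 28, 192))
w = np.random.default_rng(1).standard_normal((192, 32))

out = x @ w            # applied independently at each of the 28x28 positions
assert out.shape == (28, 28, 32)   # channels shrunk 192 -> 32
```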
- Inception Network (GoogLeNet)
- Basic building block (motivation)
- When designing a ConvNet, we have to choose: 1x1, 3x3, 5x5, pooling or no pooling? → the Inception net says "use them all" → more complicated but better!
- Problem: computation cost? → consider the 5x5 layer → ~120M multiplications (28x28x32 * 5x5x192) → use a 1x1 conv first (only ~12.4M multiplications remain)
- 1x1 conv like a "bottleneck" layer
- When shrinking → it usually doesn't hurt performance! (within reason)
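Counting the multiplications for the example above (my own arithmetic, assuming a 16-channel bottleneck as in the lecture):

```python
# Direct 5x5 conv: 28x28x192 -> 28x28x32.
direct = 28 * 28 * 32 * 5 * 5 * 192                      # ~120M
# With a 1x1 "bottleneck" down to 16 channels, then the 5x5 conv.
bottleneck = 28 * 28 * 16 * 1 * 1 * 192 + 28 * 28 * 32 * 5 * 5 * 16
print(direct)      # 120422400
print(bottleneck)  # 12443648  (~10x fewer)
```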
- Inception module
- Inception → combine inception modules together
- Where does the name "inception" come from? → from the movie "Inception" → "We need to go deeper" (Leo's line in the movie) → we need to go deeper in the inception network!
👉 Read more: https://dlapplications.github.io/2018-07-06-CNN/#5-googlenet2014 (backup saved in raindrop)
- Practical advices for using ConvNets
- Use open source implementations! → e.g. GitHub
- Transfer learning → using already-trained model → download weights
- ImageNet, MS COCO, PASCAL,...
- Download not only the code but also the weights
- Forget the last (softmax) layer → create your own
- If you don't have much labeled data (training) → freeze all layers except the last one(s)
- If you have a "larger" labeled dataset (training) → freeze fewer layers and train the later layers
- If you have a lot of data → use the whole set of trained weights as initialization + retrain the whole network!
- Data augmentation
- In practice, we run in parallel:
- 1 thread → load + augmentation task
- other threads → training
- State of Computer Vision
- Ensembling:
- Train several networks independently and average their outputs. (3-15 networks)
- Multi-crop at test time
- Run classifier on multiple versions of test images and average results.
- Object detection: Object Localization:
- draw bounding box around object → classification with localization.
- bounding box → based on the image's coordinate system → output the coordinates of the bounding box.
- How to define the label y? → object or not? If yes → we need to specify the rest; if not, we don't care about the rest
- Landmark detection
- Just want the NN to output the x,y coords of important points on images (called "landmarks") → e.g. corners of someone's eyes, mouth shape,...
- Eg: snapchat apps, computer graphic effects, AR,...
- Person's pose?
- The labels must be consistent across images,...
- Object detection algorithms
- Use ConvNets to build object detection using sliding window detection algorithm.
- Start with images cropped closely around the cars.
- Sliding windows detection: start with a small window and slide it (running a ConvNet on each crop) over the entire image → then increase the window size → hopefully some window size fits the car. → high computational cost! → there is a solution!
- Convolutional Implementation of Sliding Windows
- Turning FC layers into convolutional layers. ← use this idea to implement sliding windows with much less computation!
- What we want: example using a stride of 2 (yellow in the figure) → we move the window around (4 times) → 4 separate runs of the ConvNet → costly!
→ Idea of the conv implementation of sliding windows: share the computation of the above 4 runs! - How? → run the ConvNet once over the whole (padded) region → the final output gives all 4 cases at once (the common regions share computation)
- Weakness? → position of bounding box is not accurate → read next to fix!
- Bounding box predictions → Using YOLO (You Only Look Once) → the points after this one are the main notes on YOLO.
- Divide the image into grid cells, let's say a 3x3 grid → apply the NN on each grid cell to output a label y → the total target output is 3x3x8 (3x3 = grid, 8 = dim of y).
In each cell, we detect the midpoint of the object and check which cell it belongs to. - Even if an object spans multiple grid cells, its midpoint is located in exactly 1.
- Instead of 3x3, we could use 19x19 to reduce the chance of multiple objects in the same grid cell.
- CNNs run fast in this case → can be used in real-time object detection!
- How to encode coordinates of boxes?
- bx, by → coords relative to the grid cell's coordinate system
- bh, bw → fractions of length w.r.t. the side of the cell → could be > 1 (box extends out of the cell)
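The encoding above can be sketched as a decoder (a hedged sketch — the helper and its exact conventions are my own, not the paper's code): bx, by are fractions within the cell; bh, bw are fractions of the cell side and may exceed 1.

```python
# Decode a YOLO-style box relative to its grid cell into absolute coordinates
# (image coordinates normalized to [0, 1] here; all names are hypothetical).
def decode_box(cell_row, cell_col, bx, by, bw, bh, grid=3, img_size=1.0):
    cell = img_size / grid
    cx = (cell_col + bx) * cell          # absolute midpoint x
    cy = (cell_row + by) * cell          # absolute midpoint y
    return cx, cy, bw * cell, bh * cell  # absolute width / height

# Midpoint at the center of cell (1, 1) in a 3x3 grid; box 1.5 cells wide:
cx, cy, w, h = decode_box(1, 1, bx=0.5, by=0.5, bw=1.5, bh=0.8)
# cx, cy ≈ 0.5, 0.5 (image center); w ≈ 0.5 even though bw > 1 within the cell
```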
- YOLO paper is very hard to understand (even for Andrew)
- Intersection over union (IoU) ← Evaluating method!!!
- Evaluating object localization: if your algo outputs the violet bound → good or not? → compute size of the intersection / size of the union ⇒ "correct" if IoU ≥ 0.5 (convention)
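A minimal IoU sketch, assuming corner-format boxes (x1, y1, x2, y2):

```python
# Intersection over union of two axis-aligned boxes.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union

# Two unit squares overlapping in a 1 x 0.5 strip: 0.5 / 1.5
print(iou((0, 0, 1, 1), (0, 0.5, 1, 1.5)))   # ≈ 0.333
```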
- Non-max suppression ← clean up multiple detections and keep only 1!
- Problem: 1 object detected multiple times (many midpoints accepted)
- Make sure your algo detects each object only once!!!
- Non-max → take the max probability (light blue) and then suppress the remaining rectangles that have high IoU with the light blue one.
- Algo:
If there are multiple object classes → apply non-max once per class.
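The procedure above can be sketched as follows (a minimal version; the corner box format and the 0.5 threshold are the usual conventions, not code from the course):

```python
# Non-max suppression over (probability, box) detections for one class.
def nms(detections, iou_threshold=0.5):
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
        return inter / union

    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # highest-probability detection
        kept.append(best)
        # suppress everything that overlaps the kept box too much
        remaining = [d for d in remaining if iou(d[1], best[1]) < iou_threshold]
    return kept

dets = [(0.9, (0, 0, 2, 2)), (0.6, (0.1, 0.1, 2, 2)), (0.8, (5, 5, 7, 7))]
print([p for p, _ in nms(dets)])   # [0.9, 0.8]  (0.6 overlaps the 0.9 box)
```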
- Anchor boxes → what if the algo needs to detect multiple objects in one grid cell? → lets your algorithm specialize better
- Without anchor: each object assigned to grid cell containing the midpoint.
- With anchor: each object assigned to grid cell containing the midpoint + anchor box for the grid cell with highest IoU.
- How to choose anchor box? → by hand / using K-Means algo (more advanced)...
- YOLO algorithm
- How to construct the training set?
- Making predictions?
- Output the non-max suppressed outputs:
- Each cell has 2 anchors → discard all the cells with low probability
- On the remaining ones (as in the figure below) → use non-max suppression to generate the final prediction.
- Region proposals → very influential in computer vision, but used less often
- Check & try YOLO / DarkNet project!
- What's Neural Style Transfer?
- 1 of the most fun and exciting applications of ConvNets
- Re-create a picture in the style of another image.
- In order to do this → look at the features extracted by ConvNets at various layers
- What are deep ConvNets Learning? (at each layer)
- Cost Function → J(G) contains 2 different cost functions: J_content (looks like the content image) and J_style (looks like the style image)
- Content Cost Function
- Style Cost Function
- Meaning of "style" of an image? = correlation between activations across channels
- Example of "correlated":
- the red channel corresponds to neurons detecting vertical texture
- the yellow channel corresponds to neurons detecting orange-colored patches
- "highly correlated" → whenever there is vertical texture, there is orange-ish tint!
- "uncorrelated" → whenever there is vertical texture, there is NO orange-ish tint!
- We measure the same "degree" of correlation between these channels in the generated image.
- The detailed formulas can be found in this ppt file (slide 37, 38)
→ high-level texture components tend to occur or not occur together in parts of the image.
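The channel-correlation idea is usually captured with a Gram matrix; a minimal sketch (unnormalized, names my own — see the slides for the exact formulas):

```python
import numpy as np

# "Style" of a layer: G[k, k'] = sum over positions of a_k * a_k'.
def gram_matrix(activations):            # activations: H x W x C
    h, w, c = activations.shape
    flat = activations.reshape(h * w, c) # one row per spatial position
    return flat.T @ flat                 # C x C channel correlations

a = np.random.default_rng(0).standard_normal((4, 4, 3))
G = gram_matrix(a)
assert G.shape == (3, 3)
assert np.allclose(G, G.T)               # correlations are symmetric
```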
- 1D and 3D Generalizations
- example in 1D: EKG signal (Electrocardiogram)
- ConvNets can be used on 1D data, although for 1D data people usually use RNNs
- Example of 3D: CT scan
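A toy 1D example of the same n - f + 1 rule on a signal (using numpy's cross-correlation, which is what ConvNets actually compute):

```python
import numpy as np

# Length-7 pulse signal, length-3 edge filter -> 5 outputs (7 - 3 + 1).
signal = np.array([0., 0., 1., 1., 1., 0., 0.])
edge_filter = np.array([1., 0., -1.])
out = np.correlate(signal, edge_filter, mode='valid')
print(out)                 # edges of the pulse show up as -1 / +1
assert out.shape == (5,)
```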