# Notes for course DL 4: CNN

Anh-Thi Dinh
draft
⚠️
This is a quick & dirty draft, for me only!

## Week 1 - CNN

• HxWxC → too large → too many params → need conv
• edge detection by a filter (3x3 for example) (or kernel) → multiply with the original image by a convolution operator (*)
• Sobel filter, Scharr filter (not only 0, 1, -1)
• Can use backprop to learn the value of filter.
• Not only vertical/horizontal edges — the network can learn edges at any angle (degree) of the image.
• we don't want the image to shrink every time
• 6x6 * 3x3 (filter) → 4x4
• (nxn)*(fxf) → (n-f+1 x n-f+1)
• we don't want pixels on the corners or edges to be used much less in the outputs
• → we can pad the image (extend it all around the image): 6x6 → (pad 1) → 8x8
• 8x8 * 3x3 → 6x6 (instead of 4x4)
• → n+2p-f+1 x n+2p-f+1
• 2 common choices
• valid conv → no padding (nxn * fxf → n-f+1 x n-f+1)
• same conv → output size = input size → choose p = (f-1)/2
• f is usually odd → just convention
• 3x3, 5x5 are very common
• Strided convolutions
• filter moves more than 1 step
• ex: 7x7 * 3x3 (with stride=2) → 3x3
• nxn * fxf (stride s, padding p) → ⌊(n+2p-f)/s + 1⌋ x ⌊(n+2p-f)/s + 1⌋
• if the fraction is not an integer → round down → we take the floor()
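A minimal sketch of the output-size formula above (the function name is mine):

```python
from math import floor

def conv_output_size(n, f, p=0, s=1):
    """Side length of the output when an n x n input is convolved with an
    f x f filter, padding p, stride s: floor((n + 2p - f)/s + 1)."""
    return floor((n + 2 * p - f) / s + 1)

print(conv_output_size(6, 3))        # 4 -- valid conv: 6x6 * 3x3 -> 4x4
print(conv_output_size(6, 3, p=1))   # 6 -- same conv: output = input
print(conv_output_size(7, 3, s=2))   # 3 -- 7x7 * 3x3, stride 2 -> 3x3
```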
• Conv over volumes (not just on 2d images) → ex on RGB images (3 channels)
• 6x6x3 * 3x3x3 (3 layers of filters) → 4x4x1
• We multiply each layer together and then sum up all 3 layers → give only 1 number on the resulting matrix → that's why we only have 4x4x1
• if we wanna detect vertical edges only on the red channel → the 1st layer (in 3x3x3) can detect it, the other 2 are all 0s.
• multiple filters at the same time? → 1st filter (vertical), 2nd (horizontal) → 4x4x2 (2 here is 2 filters)
• we can use 100+ filters → output has 100+ channels
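A naive sketch of convolution over volumes (the function name and loop structure are mine, not from the course):

```python
import numpy as np

def conv_volume(image, filters):
    """'Valid' convolution of an H x W x C image with n_f filters of shape
    f x f x C (stacked as n_f x f x f x C) -> (H-f+1) x (W-f+1) x n_f."""
    H, W, C = image.shape
    n_f, f, _, _ = filters.shape
    out = np.zeros((H - f + 1, W - f + 1, n_f))
    for k in range(n_f):                      # one output channel per filter
        for i in range(H - f + 1):
            for j in range(W - f + 1):
                # elementwise product over all C channels, summed to ONE number
                out[i, j, k] = np.sum(image[i:i+f, j:j+f, :] * filters[k])
    return out

out = conv_volume(np.random.rand(6, 6, 3), np.random.rand(2, 3, 3, 3))
print(out.shape)  # (4, 4, 2) -- 2 filters -> 2 output channels
```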
• 1 layer of ConvNet
• if 10 filters (3x3x3) → how many parameters?
• each filter = 3x3x3=27 + 1 bias ⇒ 28 params
• 10 filters → 280 params
• → no matter what size of image, we only have 280 params with 10 filters (3x3x3)
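The parameter count above as a quick check:

```python
f, c, n_filters = 3, 3, 10            # ten 3x3 filters over 3 input channels
params = n_filters * (f * f * c + 1)  # 27 weights + 1 bias per filter
print(params)  # 280 -- independent of the input image size
```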
• Notations:
• The number of filters used will be the number of channels in the output
• Simple example of ConvNet
• Type of layer in ConvNet
• Convolution (conv)
• Pooling (pool)
• Fully connected (FC)
• Pooling layer
• Purpose?
• to reduce the size of representation → speed up
• to make some of the features it detects a bit more robust
• Max pooling → take the max of each region
• Idea? If a feature is detected anywhere in this region → keep the high number (upper left); if the feature is not detected (doesn't exist) (upper right), the max is still quite small.
• People use it a lot because it works well in ConvNets, but often no one really knows the meaning behind it.
• Max pooling has no parameters to learn → gradient descent doesn't change anything.
• usually we don't need any padding! (p=0)
• Average pooling → like max pool, we take average
• → max is used much more than avg
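A minimal max-pooling sketch on a single channel (the function name and the toy input are mine):

```python
import numpy as np

def max_pool(x, f=2, s=2):
    """Max pooling on an H x W array with an f x f window and stride s.
    No parameters to learn -- just takes the max of each region."""
    H, W = x.shape
    out_h, out_w = (H - f) // s + 1, (W - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*s:i*s+f, j*s:j*s+f].max()
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]])
print(max_pool(x))  # [[9. 2.]
                    #  [6. 3.]]
```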
• LeNet-5 (NN example) ← inspired by it, introduced by Yann LeCun
• Convention: 1 layer contains (conv + pool) → when people talk about layers, they mean layers with params to learn → pool doesn't have any params to learn.
• as you go deeper: height/width decrease, #channels increase
• A typical NN looks something like this: conv → pool → conv → pool → ... → FC → FC → FC → softmax
• activation size goes down through the layers
• Why convolution?
• 2 main advantages of conv (if we used fully connected → too many parameters!!):
• parameter sharing: a feature detector useful in one part of the image is probably useful in another part → no need to learn a different feature detector for each position.
• Sparsity of connections: each layer, output value depends only on a small number of inputs.

## Week 2 - Case studies & practical advices for using ConvNets

• Case studies of effective ConvNets
• ConvNet works well on 1 computer vision task → (usually) works well on other tasks.
• Classic networks: LeNet-5, AlexNet, VGG
• (deeper) ResNet (152 layers)
• Classic networks
• LeNet-5 → focus on section II, III in the article, they're interesting! → use sigmoid/tanh
• 1x1 convolution (network in network)
• just multiply by some number (1 filter)
• If more than 1 filter → it makes more sense → a set of weights applied to the channel values at each position
• It's basically having a fully connected neural network applied at each of the 36 (= 6x6) different positions → it takes the 32 channel numbers as input + outputs #filters
• Useful for:
• Shrink #channel: Eg. 28x28x192 → (32 filters of 1x1x192) → 28x28x32
• it allows the network to learn a more complex function by adding non-linearity (in the case 28x28x192 → (1x1x192) → 28x28x192).
• Shrink? (eg. 28x28x192)
• 28x28 → lower? ⇒ use pooling layer
• 192 channels → lower? ⇒ use 1x1 conv
• ⇒ very useful in building Inception Network!
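A sketch of the channel-shrinking use of 1x1 conv — just a matmul over the channel axis (bias and non-linearity omitted; variable names are mine):

```python
import numpy as np

def conv_1x1(x, W):
    """1x1 convolution: at each spatial position, a fully connected map
    from C_in channels to C_out channels. x: H x W x C_in, W: C_in x C_out."""
    return x @ W  # matmul over the channel axis (no bias/ReLU here)

x = np.random.rand(28, 28, 192)
W = np.random.rand(192, 32)      # 32 filters of size 1x1x192
print(conv_1x1(x, W).shape)      # (28, 28, 32) -- channels shrunk 192 -> 32
```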
• 👉 Read more: https://dlapplications.github.io/2018-07-06-CNN/#5-googlenet2014 (backup saved in Raindrop)
• Basic building block (motivation)
• When designing a ConvNet, we have to choose: use 1x1, 3x3, 5x5, or whether or not to use pooling layers? → the inception net says "use them all" → more complicated but better!
• Problem: computational cost → consider the 5x5 layer: ~120M multiplications (28x28x32 outputs × 5x5x192 each) → use a 1x1 conv first (down to ~12.4M multiplications)
• 1x1 conv like a "bottleneck" layer
• When you shrink (within reason) → it usually doesn't hurt your performance!
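The cost arithmetic, counted as multiplications (the 16-channel bottleneck width is an assumption, chosen to be consistent with the ~12.4M figure):

```python
# Direct: 28x28x192 -> 5x5 conv ("same") -> 28x28x32
direct = 28 * 28 * 32 * 5 * 5 * 192
print(direct)  # 120422400  (~120M multiplications)

# Bottleneck: 1x1 conv down to 16 channels first, then the 5x5 conv
step1 = 28 * 28 * 16 * 1 * 1 * 192   # 1x1 conv: 192 -> 16 channels
step2 = 28 * 28 * 32 * 5 * 5 * 16    # 5x5 conv: 16 -> 32 channels
print(step1 + step2)  # 12443648  (~12.4M multiplications)
```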
• Inception module
• Inception → combine inception modules together
• Where does the name come from? → from the movie "Inception" → "We need to go deeper" (Leo's line in the movie) → we need to go deeper in the inception network!
• Practical advices for using ConvNets
• Use open source implementations! → e.g. GitHub
• ImageNet, MS COCO, Pascal,...
• Forget the last (softmax) layer → create your own
• If you don't have much labeled training data → freeze all layers except the last one(s)
• If you have a "larger" labeled training dataset → freeze fewer layers and train the later layers
• If you have a lot of data → use the whole set of trained weights as initialization + re-train the whole network!
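A toy sketch of the freezing idea in PyTorch (the tiny "pretrained" net here is made up; in practice you'd load real pretrained weights from a model zoo):

```python
import torch.nn as nn

# Stand-in for a pretrained net: conv feature layers + a final FC head
model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 30 * 30, 10),   # original head (e.g. 10 classes)
)

# Small dataset: freeze all the transferred layers...
for p in model.parameters():
    p.requires_grad = False

# ...then replace the last (softmax) layer with our own, left trainable
model[-1] = nn.Linear(8 * 30 * 30, 5)   # our task has 5 classes

trainable = [p for p in model.parameters() if p.requires_grad]
print(len(trainable))  # 2 -- only the new head's weight and bias
```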
• Data augmentation
• In practice, we run in parallel:
• State of Computer Vision
• Ensembling:
• Train several networks independently and average their outputs. (3-15 networks)
• Multi-crop at test time
• Run classifier on multiple versions of test images and average results.

## Week 3 - Detection algorithms

• Object detection: Object Localization:
• draw bounding box around object → classification with localization.
• bounding box → based on the coordinate system of the image → output the coords of the bounding box.
• How to define label y? → object or not? If yes → we need to specify the others (box coords, class); if not → we don't care about the others
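The label layout from the lecture, sketched (the concrete values are made-up examples):

```python
# y = [pc, bx, by, bh, bw, c1, c2, c3]: pc = "is there an object?",
# (bx, by, bh, bw) = bounding box, c1..c3 = one-hot class
y_car = [1, 0.5, 0.7, 0.3, 0.4, 0, 1, 0]  # object present, class 2 (car)
y_bg  = [0] + [None] * 7                  # pc = 0 -> the rest is "don't care"
print(len(y_car))  # 8 = dim of y
```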
• Landmark detection
• Just want the NN to output x,y coords of important points on images (called "landmarks") → e.g. corners of someone's eyes, mouth's shape,...
• Eg: snapchat apps, computer graphic effects, AR,...
• Person's pose?
• The labels must be consistent across images,...
• Object detection's algo
• Use ConvNets to build object detection using sliding window detection algorithm.
• Sliding windows detection: start with a small window and go through (using a ConvNet) the entire image → increase the window size → hopefully there is a window size that contains the car → high computational cost! → there is a solution!
• Convolutional Implementation of Sliding Windows
• Turning FC layers into convolutional layers ← use this idea to implement sliding windows with less computation!
• What we want: example using a stride of 2 (yellow in the figure) → we move the window around (4 times) → 4 separate ConvNet runs → costly!
→ Idea of conv imple of sliding windows: share computation of above 4 computations!
• How? → keep the padded region → run the ConvNet once → the final output gives all 4 cases (sharing the common region computations)
• Weakness? → position of bounding box is not accurate → read next to fix!
• Bounding box predictions → using YOLO (You Only Look Once) → the points after this one are the main YOLO notes.
• Divide the image into grid cells, say a 3x3 grid → apply the NN on each grid cell to output a label y → there are in total 3x3x8 target outputs (3x3 = grid, 8 = dim of y). In each cell, we detect the midpoint of the object and check which cell it belongs to.
• Even if an object spans multiple grid cells, its midpoint is located in only 1 cell.
• Instead of 3x3, we could use 19x19 to prevent multiple objects in the same grid cell.
• CNN run fast in this case → can be used in real time object detection!
• How to encode coordinates of boxes?
• bx, by → coords relative to the grid cell
• bh, bw → fraction of the length w.r.t. the side of the cell → could be > 1 (box extends out of the cell)
• YOLO paper is very hard to understand (even for Andrew)
• Intersection over union (IoU) ← Evaluating method!!!
• Evaluating object localization: if your algo outputs the violet box → good or not? → compute size of the intersection / size of the union ⇒ "correct" if IoU ≥ 0.5 (convention)
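A minimal IoU sketch for axis-aligned boxes (the corner format (x1, y1, x2, y2) is my choice):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ~ 0.14 -> not "correct" (< 0.5)
```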
• Non-max suppression ← clean up multiple detections and keep only 1!
• Problem: 1 object detected multiple times (many midpoints accepted)
• Make sure your algo detect one object only once!!!
• Non-max → take the max probability (light blue) and then suppress the remaining rectangles that have high IoU with the light blue one.
• Algo:
• If there are multiple object classes → apply non-max once per class.
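A sketch of the whole procedure (the IoU helper is repeated so the block stands alone; the 0.6 prob cutoff and 0.5 IoU threshold follow the lecture's conventions):

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def non_max_suppression(boxes, prob_cutoff=0.6, iou_threshold=0.5):
    """boxes: list of (prob, x1, y1, x2, y2). Discard low-prob boxes, then
    repeatedly keep the highest-prob box and drop others overlapping it."""
    boxes = sorted((b for b in boxes if b[0] > prob_cutoff), reverse=True)
    kept = []
    while boxes:
        best = boxes.pop(0)            # highest remaining probability
        kept.append(best)
        boxes = [b for b in boxes if iou(best[1:], b[1:]) < iou_threshold]
    return kept

dets = [(0.9, 0, 0, 2, 2), (0.8, 0.1, 0.1, 2.1, 2.1), (0.7, 5, 5, 7, 7)]
print([d[0] for d in non_max_suppression(dets)])  # [0.9, 0.7]
```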
• Anchor boxes → what if a grid cell contains multiple objects? → allow your algorithm to specialize better
• Without anchor: each object assigned to grid cell containing the midpoint.
• With anchor: each object assigned to grid cell containing the midpoint + anchor box for the grid cell with highest IoU.
• How to choose anchor box? → by hand / using K-Means algo (more advanced)...
• YOLO algorithm
• How to construct the training set?
• Making predictions?
• Output the non-max suppressed outputs:
- Each cell has 2 anchors → discard all the cells with low prob
- On the remaining ones (as in the figure below) → use non-max suppression to generate the final prediction.
• Region proposals → very influential in computer vision but used less often

## Week 4 - Neural Style Transfer

• What's Neural Style Transfer?
• 1 of the most fun and exciting applications of ConvNets
• Re-create some picture in the style of another image.
• In order to do this → look at features extracted by ConvNets at various layers
• What are deep ConvNets Learning? (at each layer)
• Cost Function → J(G) contains 2 different cost functions: J_content (match the content image) and J_style (match the style image)
• Content Cost Function
• Style Cost Function
• Meaning of "style" of an image? = correlation between activations across channels
• Example of "correlated":
• red channel corresponds to neurons detecting vertical textures
• yellow channel corresponds to neurons detecting orange color patches
• "highly correlated" → whenever there is vertical texture, there is an orange-ish tint!
• "uncorrelated" → whenever there is vertical texture, there is NO orange-ish tint!
• → high-level texture components tend to occur or not occur together in parts of the image.
• We calculate the same "degree" of these channels in the generated image.
• The detailed formulas can be found in this ppt file (slide 37, 38)
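The channel-correlation ("style") measure sketched as a Gram matrix (unnormalized; variable names are mine):

```python
import numpy as np

def gram_matrix(a):
    """a: one layer's activations, H x W x C. Returns a C x C matrix G with
    G[k, k'] = sum over all positions of a[:, :, k] * a[:, :, k'] --
    i.e. how correlated channel k is with channel k'."""
    H, W, C = a.shape
    flat = a.reshape(H * W, C)   # each row = one spatial position
    return flat.T @ flat

G = gram_matrix(np.random.rand(4, 4, 8))
print(G.shape)  # (8, 8) -- symmetric
```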
• 1D and 3D Generalizations
• example in 1D: EKG signal (Electrocardiogram)
• ConvNets can be used on 1D data, although with 1D data people usually use RNNs
• Example of 3D: CT scan