
DL by DL.AI - Course 4: CNN

Anh-Thi Dinh
draft
Tags: DeepLearning.AI, Deep Learning, MOOC, mooc-dl
⚠️
This is a quick & dirty draft, for me only!

Week 1 - CNN

  • HxWxC → too large → too many params → need conv
  • edge detection by a filter (3x3 for example) (also called a kernel) → convolve it with the original image using the convolution operator (*)
    • Sobel filter, Scharr filter (not only 0, 1, -1)
    • Can use backprop to learn the values of the filter.
    • Not only fixed edge detectors: a learned filter can detect edges at any angle.
  • Padding
    • we don't want the image to shrink every time we apply a conv
      • 6x6 * 3x3 (filter) → 4x4
      • (nxn)*(fxf) → (n-f+1 x n-f+1)
    • we don't want pixels on the corners or edges to be used much less in the outputs
    • → we can pad the image (extend it all around): 6x6 → (pad 1) → 8x8
      • 8x8 * 3x3 → 6x6 (instead of 4x4)
      • → (n+2p-f+1) x (n+2p-f+1)
    • 2 common choices
      • valid conv → no padding (nxn * fxf → n-f+1 x n-f+1)
      • same conv → output size = input size → choose p = (f-1)/2
    • f is usually odd → just a convention
      • 3x3, 5x5 are very common
  • Strided convolutions
    • the filter moves more than 1 step at a time
    • ex: 7x7 * 3x3 (with stride=2) → 3x3
    • nxn * fxf (padding p, stride s) → floor((n+2p-f)/s + 1) x floor((n+2p-f)/s + 1)
    • if the fraction is not an integer → round down → we take the floor()
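The output-size formulas above can be wrapped in a tiny helper (a sketch; the function name is mine, the numbers are the examples from these notes):

```python
import math

def conv_output_size(n, f, p=0, s=1):
    """Side length after convolving an n x n input with an f x f filter,
    padding p and stride s: floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(6, 3))        # 4: valid conv, 6x6 * 3x3 -> 4x4
print(conv_output_size(6, 3, p=1))   # 6: "same" conv with p = (f-1)/2
print(conv_output_size(7, 3, s=2))   # 3: the strided 7x7 * 3x3 example
```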
  • Conv over volumes (not just on 2d images) → ex on RGB images (3 channels)
    • 6x6x3 * 3x3x3 (3 layers of filters) → 4x4x1
      • We multiply each layer element-wise and then sum up all 3 layers → gives only 1 number in the resulting matrix → that's why we only have 4x4x1
      • if we wanna detect vertical edges only in the red channel → the 1st layer (of the 3x3x3 filter) can detect it, the other 2 are all zeros.
      • multiple filters at the same time? → 1st filter (vertical), 2nd (horizontal) → 4x4x2 (2 here is the number of filters)
      • we can use 100+ filters → the output has 100+ channels
  • 1 layer of ConvNet
    • if 10 filters (3x3x3) → how many parameters?
      • each filter = 3x3x3 = 27 weights + 1 bias ⇒ 28 params
      • 10 filters → 280 params
      • → no matter the size of the image, we only have 280 params with 10 filters (3x3x3)
    • Notations:
    • The number of filters used will be the number of channels in the output
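The 280-params example generalizes to a one-liner (a sketch; the helper name is mine):

```python
def conv_layer_params(f, c_in, n_filters):
    """Params of one conv layer: each filter has f*f*c_in weights + 1 bias."""
    return n_filters * (f * f * c_in + 1)

# 10 filters of 3x3x3 -> 280 params, regardless of the input image size:
print(conv_layer_params(3, 3, 10))  # 280
```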
  • Simple example of ConvNet
  • Type of layer in ConvNet
    • Convolution (conv)
    • Pooling (pool)
    • Fully connected (FC)
  • Pooling layer
    • Purpose?
      • to reduce the size of the representation → speed up
      • to make some of the detected features a bit more robust
    • Max pooling → take the max of each region
      • Idea? if the feature is detected anywhere in this region → keep the high number (upper left); if the feature is not detected (doesn't exist) (upper right), the max is still quite small.
      • Max pooling is used a lot because it works well in ConvNets, but often people don't really know the meaning behind it.
      • Max pooling has no parameters to learn → gradient descent doesn't change anything.
      • usually we don't need any padding! (p=0)
    • Average pooling → like max pool, but we take the average
    • → max is used much more than avg
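Max pooling as described (window f, stride s, p=0, nothing to learn) can be sketched in NumPy; the helper and the example matrix are mine:

```python
import numpy as np

def max_pool(x, f=2, s=2):
    """Max pooling over a 2-D array: slide an f x f window with stride s
    and keep the max of each region (no padding, no learned parameters)."""
    n_h = (x.shape[0] - f) // s + 1
    n_w = (x.shape[1] - f) // s + 1
    out = np.empty((n_h, n_w))
    for i in range(n_h):
        for j in range(n_w):
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

x = np.array([[1., 3., 2., 1.],
              [2., 9., 1., 1.],
              [1., 3., 2., 3.],
              [5., 6., 1., 2.]])
print(max_pool(x))  # [[9. 2.]
                    #  [6. 3.]]
```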
  • LeNet-5 (NN example) ← inspired by it, introduced by Yann LeCun
    • Convention: 1 layer contains (conv + pool) → when people talk about layers, they mean layers with params to learn → pool doesn't have any params to learn.
    • H, W decrease while the number of channels increases as we go deeper.
    • A typical NN looks something like this: conv → pool → conv → pool → ... → FC → FC → FC → softmax
    • the activation size gradually goes down
  • Why convolution?
    • 2 main advantages of conv (if we used fully connected layers → way too many parameters!):
      • parameter sharing: a feature detector useful in one part of the image is probably useful in another part → no need to learn a separate detector for each position.
      • sparsity of connections: in each layer, each output value depends only on a small number of inputs.

Week 2 - Case studies & practical advice for using ConvNets

  • Case studies of effective ConvNets
    • A ConvNet that works well on 1 computer vision task → (usually) works well on other tasks.
    • Classic networks: LeNet-5, AlexNet, VGG
    • (deeper) ResNet (152 layers)
    • Inception NN (GoogLeNet)
  • Classic networks
    • LeNet-5 → focus on sections II, III in the article, they're interesting! → uses sigmoid/tanh
  • 1x1 convolution (network in network)
    • with only 1 filter, it just multiplies each position by some number
    • with more filters → it makes more sense → applies a set of weights across the channels at each position
    • It's basically a fully connected network applied to each of the 6x6 = 36 positions: it takes the 32 channel values as input and outputs #filters values
    • Useful for:
      • Shrinking #channels: e.g. 28x28x192 → (32 filters of 1x1x192) → 28x28x32
      • it allows the network to learn a more complex function by adding non-linearity (e.g. in the case 28x28x192 → (1x1x192) → 28x28x192)
  • Shrink? (eg. 28x28x192)
    • 28x28 → lower? ⇒ use a pooling layer
    • 192 channels → lower? ⇒ use a 1x1 conv
    • ⇒ very useful in building the Inception Network!
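A 1x1 conv is just a per-pixel fully connected layer over the channels, so the channel-shrinking example reduces to a matrix product; a NumPy sketch with random values (variable names are mine):

```python
import numpy as np

# Sizes from the example above: 28x28x192 input, 32 filters of 1x1x192.
H, W, C_in, C_out = 28, 28, 192, 32
x = np.random.randn(H, W, C_in)
w = np.random.randn(C_in, C_out)  # one 192-dim weight vector per filter
b = np.zeros(C_out)

# At every (h, w) position, mix the 192 input channels into 32 outputs:
y = np.einsum('hwc,cf->hwf', x, w) + b
print(y.shape)  # (28, 28, 32)
```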
  • Inception Network (GoogLeNet)
    • 👉 Read more: https://dlapplications.github.io/2018-07-06-CNN/#5-googlenet2014 (a backup is saved in raindrop)
    • Basic building block (motivation)
    • When designing a ConvNet, we have to choose: use 1x1, 3x3, 5x5, or use/not use pooling layers? → the inception net says "use them all" → more complicated, but better!
    • Problem: computation cost → consider the 5x5 layer → ~120M multiplications (28x28x32 * 5x5x192) → use a 1x1 conv first (down to ~12.4M)
      • Without 1x1 conv → ~120M multiplications
        Using 1x1 conv → only ~12.4M multiplications
    • the 1x1 conv acts like a "bottleneck" layer
    • Shrinking the representation (within reason) usually doesn't hurt performance!
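The ~120M vs ~12.4M numbers can be reproduced by counting multiplications (assuming the lecture's 16-channel bottleneck; the helper name is mine):

```python
def conv_multiplies(h, w, c_out, f, c_in):
    """Multiplications of a conv producing an h x w x c_out volume:
    one f*f*c_in dot product per output value."""
    return h * w * c_out * f * f * c_in

naive = conv_multiplies(28, 28, 32, 5, 192)
bottleneck = (conv_multiplies(28, 28, 16, 1, 192)    # 1x1 conv down to 16
              + conv_multiplies(28, 28, 32, 5, 16))  # then the 5x5 conv
print(naive)       # 120422400  (~120M)
print(bottleneck)  # 12443648   (~12.4M)
```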
    • Inception module
    • Inception network → combine inception modules together
      • There are some "max pooling" layers inside to change the H and W. There are also some additional side branches using softmax to make predictions (green)
    • Where does the name inception come from? → from the movie "Inception" → "We need to go deeper" (Leo's line in the movie) → we need to go deeper in the inception network!
  • Practical advice for using ConvNets
    • Use open source implementations! → e.g. GitHub
    • Transfer learning → use an already-trained model → download the weights
      • ImageNet, MS COCO, Pascal,...
      • Download not only the code but also the weights
      • Drop the last (softmax) layer → create your own
      • If you don't have much labeled data (training) → freeze all layers except the last one(s)
      • If you have a "larger" labeled dataset (training) → freeze fewer layers and train the later layers
      • If you have a lot of data → use the whole set of trained weights as initialization + re-train the whole network!
    • Data augmentation
      • In practice, we run in parallel:
        • 1 thread → load + augmentation tasks
        • other threads → training
    • State of Computer Vision
    • Ensembling:
      • Train several networks independently and average their outputs. (3-15 networks)
    • Multi-crop at test time
      • Run classifier on multiple versions of test images and average results.

Week 3 - Detection algorithms

  • Object detection: Object Localization:
    • draw a bounding box around the object → classification with localization.
    • bounding box → based on the coordinates of the image → output the coords of the bounding box.
    • How to define the label y? → is there an object or not? if yes → we need to specify the other entries (box coords, class); if not, we don't care about the others
  • Landmark detection
    • Just want the NN to output x,y coords of important points on the image (called "landmarks") → e.g. corners of someone's eyes, mouth's shape,...
    • Eg: snapchat apps, computer graphic effects, AR,...
    • Person's pose?
    • The labels must be consistent across images,...
  • Object detection algo
    • Use ConvNets to build object detection using the sliding windows detection algorithm.
    • Start with images cropped closely around the cars.
    • Sliding windows detection: start with a small window and go through (using a ConvNet) the entire image → increase the window size → hopefully there is a window size that contains the car → high computational cost! → there is a solution!
    • Convolutional Implementation of Sliding Windows
      • Turning FC layers into convolutional layers ← use this idea to implement sliding windows less computationally!
      • What we want: example using a stride of 2 (yellow in the figure) → we move the window around (4 times) → 4 separate ConvNet passes → costly!
        → Idea of the conv implementation of sliding windows: share the computation of the above 4 passes!
      • How? → keep the padded region → run the ConvNet once → the final output gives the 4 different cases (they share the common region computations)
        • Instead of computing the sliding windows separately, we compute everything at once and read off the result for each region at the final step!
      • Weakness? → the position of the bounding box is not accurate → read on for the fix!
    • Bounding box predictions → Using YOLO (You Only Look Once) → after this point, the following items are the main notes on YOLO.
      • Divide the image into grid cells, say a 3x3 grid → apply the NN on each grid cell to output a label y → there are in total 3x3x8 target outputs (3x3 = grid, 8 = dim of y). In each cell, we detect the midpoint of the object and check which cell it belongs to.
        • Even if an object spans multiple grid cells, its midpoint is located in only 1 of them.
      • Instead of 3x3, we could use 19x19 to reduce the chance of multiple objects in the same grid cell.
      • The CNN runs fast in this case → can be used for real-time object detection!
      • How to encode coordinates of boxes?
        • bx, by → coords relative to the grid cell
        • bh, bw → fraction of the height/width w.r.t. the side of the cell → could be > 1 (box extends out of the cell)
      • YOLO paper is very hard to understand (even for Andrew)
  • Intersection over union (IoU) ← evaluation method!!!
    • Evaluating object localization: if your algo outputs the violet bound → good or not? → compute size of the intersection / size of the union ⇒ "correct" if IoU ≥ 0.5 (convention)
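IoU as defined above, for axis-aligned boxes; the (x1, y1, x2, y2) corner format is my assumption:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ~ 0.143 -> not "correct"
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 -> perfect overlap
```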
  • Non-max suppression ← clean up multiple detections and keep only 1!
    • Problem: 1 object detected multiple times (many midpoints accepted)
    • Make sure your algo detects each object only once!!!
    • Non-max → take the max probability (light blue) and then suppress the remaining rectangles that have a high IoU with the light blue one.
      • Algo:
        • If there are multiple object classes → apply non-max suppression multiple times (once per class).
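The non-max suppression loop described above, sketched for one class (the (x1, y1, x2, y2) box format and the example boxes are my assumptions):

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Repeatedly keep the highest-probability box, then suppress the
    remaining boxes that overlap it above the IoU threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

# Three detections of the same car plus one separate object:
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (0, 1, 10, 11), (50, 50, 60, 60)]
scores = [0.9, 0.75, 0.6, 0.8]
print(non_max_suppression(boxes, scores))  # [0, 3] -> one box per object
```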
  • Anchor boxes → what if a grid cell contains multiple objects? → allows your algo to specialize better
    • Without anchors: each object is assigned to the grid cell containing its midpoint.
    • With anchors: each object is assigned to the grid cell containing its midpoint + the anchor box with the highest IoU with the object.
    • How to choose anchor boxes? → by hand / using the K-Means algo (more advanced)...
  • YOLO algorithm
    • How to construct the training set?
      • We go from an image of 100x100x3 → ConvNet → obtain 3x3x16 (3x3 = grid cells, 16 = 2x8 ← 2 = #anchors, 8 = dim of y)
    • Making predictions?
    • Output the non-max suppressed outputs:
      - Each cell has 2 anchors → discard all the boxes with low probability
      - On the remaining ones (as in the figure below) → use non-max suppression to generate the final prediction.
  • Region proposals → very influential in computer vision, but used less often
    • Check & try YOLO / DarkNet project!

Week 4 - Neural Style Transfer

  • What's Neural Style Transfer?
    • 1 of the most fun and exciting applications of ConvNets
    • Re-create some picture in the style of another image.
    • In order to do that → look at the features extracted by ConvNets at various layers
  • What are deep ConvNets learning? (at each layer)
  • Cost function → J(G) contains 2 different cost functions: J_content (similar content) and J_style (similar style)
  • Content cost function
  • Style cost function
    • Meaning of the "style" of an image? = correlation between activations across channels
    • Example of "correlated":
      • the red channel corresponds to the red neurons (vertical texture)
      • the yellow channel corresponds to the yellow neurons (orange color patches)
      • "highly correlated" → whenever there is vertical texture, there is an orange-ish tint!
      • "uncorrelated" → whenever there is vertical texture, there is NO orange-ish tint!
      • → high-level texture components tend to occur or not occur together in parts of the image.
      • We measure the same "degree" of correlation between these channels in the generated image.
    • The detailed formulas can be found in this ppt file (slides 37, 38)
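The cross-channel correlations behind the style cost are usually collected in a Gram matrix; a minimal NumPy sketch (the slides' version also divides by a normalization constant, omitted here, and the 4x4x3 activation volume is a made-up example):

```python
import numpy as np

# Hypothetical activation volume at some layer: 4 x 4 spatial, 3 channels.
a = np.random.randn(4, 4, 3)

# Unroll H x W into one axis -> F has one row per channel:
F = a.reshape(-1, a.shape[-1]).T   # shape (channels, H*W)
G = F @ F.T                        # Gram/style matrix, (channels, channels)
print(G.shape)  # (3, 3): entry (k, k') measures how channels k, k' co-occur
```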
  • 1D and 3D generalizations
    • Example in 1D: EKG signal (electrocardiogram)
      • 16 → 16 filters
      • ConvNets can be used on 1D data, although for 1D data people usually use RNNs
    • Example of 3D: CT scan