Note for course DL 4: CNN

Anh-Thi Dinh
draft
⚠️
This is a quick & dirty draft, for me only!

Week 1 - CNN

  • HxWxC → too large → too many params → need conv
  • edge detection by a filter (or kernel), 3x3 for example → combine it with the original image using the convolution operator (*)
    • Sobel filter, Scharr filter (not only 0, 1, -1)
    • Can use backprop to learn the value of filter.
    • Not only vertical/horizontal edges: learned filters can also detect edges at other angles (degrees) in the image.
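A minimal NumPy sketch of the edge-detection idea above, assuming a 6x6 grayscale image that is bright on the left and dark on the right. `conv2d_valid` is a hypothetical helper written for this note, not a library call; like most deep-learning frameworks it computes a cross-correlation (the kernel is not flipped).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with no padding and stride 1."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # element-wise product of the window and the kernel, summed to one number
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

# Assumed toy image: left half bright (10), right half dark (0)
image = np.hstack([np.full((6, 3), 10.0), np.zeros((6, 3))])

# Plain vertical-edge filter, and the Sobel variant with a heavier middle row
vertical = np.array([[1, 0, -1],
                     [1, 0, -1],
                     [1, 0, -1]], dtype=float)
sobel = np.array([[1, 0, -1],
                  [2, 0, -2],
                  [1, 0, -1]], dtype=float)

edges = conv2d_valid(image, vertical)
print(edges.shape)                               # (4, 4)
print(edges[0].tolist())                         # [0.0, 30.0, 30.0, 0.0]
print(conv2d_valid(image, sobel)[0].tolist())    # [0.0, 40.0, 40.0, 0.0]
```

The large values in the middle columns mark the vertical edge; in a ConvNet the 9 filter values would be learned by backprop instead of being fixed by hand.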
  • Padding
    • we don't want the image to shrink every time we apply a conv
      • 6x6 * 3x3 (filter) → 4x4
      • (nxn)*(fxf) → (n-f+1 x n-f+1)
    • we don't want pixels on the corners or edges to be used much less in the outputs
    • → we can pad the image (extend it all around): 6x6 → (pad 1) → 8x8
      • 8x8 * 3x3 → 6x6 (instead of 4x4)
      • → n+2p-f+1 x n+2p-f+1
    • 2 common choices
      • valid conv → no padding (nxn * fxf → n-f+1 x n-f+1)
      • same conv → out's size = in's size → choose p = (f-1)/2 (from n+2p-f+1 = n)
    • f is usually odd → just convention
      • 3x3, 5x5 are very common
  • Strided convolutions
    • filter moves more than 1 step
    • ex: 7x7 * 3x3 (with stride=2) → 3x3
    • nxn * fxf → ⌊(n+2p-f)/s + 1⌋ x ⌊(n+2p-f)/s + 1⌋ (padding p, stride s)
    • if fraction is not integer → round down → we take the floor()
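The size rules from the padding and stride notes can be collected in one small helper. This is a sketch using the standard formula floor((n + 2p - f) / s) + 1; the function names `conv_output_size` and `same_padding` are hypothetical.

```python
def conv_output_size(n, f, p=0, s=1):
    """Side length of the output for an n x n input and an f x f filter."""
    # integer floor division implements the "round down" rule
    return (n + 2 * p - f) // s + 1

def same_padding(f):
    """Padding that keeps output size = input size (stride 1, odd f)."""
    return (f - 1) // 2

print(conv_output_size(6, 3))        # 4  (valid conv)
print(conv_output_size(6, 3, p=1))   # 6  (same conv)
print(conv_output_size(7, 3, s=2))   # 3  (stride 2)
print(conv_output_size(6, 3, s=2))   # 2  (3/2 is not an integer, so floor)
print(same_padding(5))               # 2
```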
  • Conv over volumes (not just on 2d images) → ex on RGB images (3 channels)
    • 6x6x3 * 3x3x3 (3 layers of filter) → 4x4x1
      • We multiply each filter layer with the matching channel and then sum over all 3 layers → gives only 1 number in the resulting matrix → that's why we only have 4x4x1
      • if we want to detect vertical edges only on the red channel → the 1st layer (in 3x3x3) detects them, the other 2 layers are all 0s.
      • multiple filters at the same time? → 1st filter (vertical), 2nd (horizontal) → 4x4x2 (2 here is 2 filters)
      • we can use 100+ filters → the output then has as many channels as filters
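A small sketch of convolution over a volume, assuming a random 6x6x3 image and two hand-made filters (vertical edges on the red channel only, horizontal edges on all channels). `conv_volume` is a hypothetical helper, kept as explicit loops for readability rather than speed.

```python
import numpy as np

def conv_volume(image, filters):
    """image: (n, n, c); filters: (k, f, f, c) -> output (n-f+1, n-f+1, k)."""
    n = image.shape[0]
    k, f = filters.shape[0], filters.shape[1]
    out = np.zeros((n - f + 1, n - f + 1, k))
    for idx in range(k):
        for i in range(n - f + 1):
            for j in range(n - f + 1):
                patch = image[i:i + f, j:j + f, :]
                # multiply across all channels, then sum everything to one number
                out[i, j, idx] = np.sum(patch * filters[idx])
    return out

rng = np.random.default_rng(0)
image = rng.random((6, 6, 3))  # assumed toy RGB image

vertical = np.array([[1, 0, -1]] * 3, dtype=float)
horizontal = vertical.T

# Filter 0: vertical edges on the red channel only (other two layers are zeros)
red_vertical = np.zeros((3, 3, 3))
red_vertical[:, :, 0] = vertical
# Filter 1: horizontal edges on every channel
all_horizontal = np.stack([horizontal] * 3, axis=-1)

filters = np.stack([red_vertical, all_horizontal])  # shape (2, 3, 3, 3)
out = conv_volume(image, filters)
print(out.shape)  # (4, 4, 2)
```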
  • 1 layer of ConvNet
    • if 10 filters (3x3x3) → how many parameters?
      • each filter = 3x3x3=27 + 1 bias ⇒ 28 params
      • 10 filters → 280 params
      • → no matter what size of image, we only have 280 params with 10 filters (3x3x3)
    • Notations:
    • The number of filters used will be the number of channels in the output
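The parameter count above can be checked with a few lines. The fully connected comparison assumes a hypothetical 64x64x3 input, which is not from the notes; it is only there to show that the conv count does not depend on image size.

```python
def conv_layer_params(f, n_c_in, n_filters, bias=True):
    """Parameters of one conv layer: (f*f*n_c_in weights + 1 bias) per filter."""
    per_filter = f * f * n_c_in + (1 if bias else 0)
    return per_filter * n_filters

def dense_layer_params(n_in, n_out):
    """Parameters of a fully connected layer: weights plus one bias per output."""
    return n_in * n_out + n_out

print(conv_layer_params(3, 3, 10))          # 280, whatever the image size
print(dense_layer_params(64 * 64 * 3, 10))  # 122890 for an assumed 64x64x3 input
```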
  • Simple example of ConvNet
  • Type of layer in ConvNet
    • Convolution (conv)
    • Pooling (pool)
    • Fully connected (FC)
  • Pooling layer
    • Purpose?
      • to reduce the size of representation → speed up
      • to make some of the detected features a bit more robust
    • Max pooling → take the max of each region
      • Idea? if the feature is detected anywhere in the region → keep the high number (upper left); if the feature is not detected (doesn't exist) (upper right), the max is still quite small.
      • People use it a lot because it works well in ConvNets, but often nobody really knows the meaning behind it.
      • Max pooling has no parameters to learn → gradient descent doesn't change anything.
      • usually we don't need any padding! (p=0)
    • Average pooling → like max pool, but we take the average
    • → max is used much more than avg
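A sketch of max and average pooling on a single channel, assuming a 2x2 window with stride 2 and a made-up 4x4 input. `pool2d` is a hypothetical helper; note that it has no weights at all.

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    """Pool a 2D array with an f x f window and stride s (no padding)."""
    n_h, n_w = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    reduce = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = reduce(x[i * s:i * s + f, j * s:j * s + f])
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)

print(pool2d(x, mode="max").tolist())  # [[9.0, 2.0], [6.0, 3.0]]
print(pool2d(x, mode="avg").tolist())  # [[3.75, 1.25], [3.75, 2.0]]
```

The strong activation (9) in the upper-left region survives max pooling, while the upper-right region, where the feature is absent, stays small.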
  • LeNet-5 (NN example) ← inspired by it; introduced by Yann LeCun
    • Convention: 1 layer contains (conv + pool) → when people talk about a layer, they mean a layer with params to learn → pool doesn't have any params to learn.
    • height & width decrease, number of channels increases (as we go deeper)
    • A typical NN looks something like this: conv → pool → conv → pool → ... → FC → FC → FC → softmax
    • activation size goes down gradually
  • Why convolution?
    • 2 main advantages of conv (if we use fully connected → too many parameters!!):
      • Parameter sharing: a feature detector that is useful in one part of the image is probably useful in another part → no need to learn separate detectors for each position.
      • Sparsity of connections: in each layer, each output value depends only on a small number of inputs.
 

Week 2 - Case studies & practical advice for using ConvNets

  • Case studies of effective ConvNets
    • ConvNet works well on 1 computer vision task → (usually) works well on other tasks.
    • Classic networks: LeNet-5, AlexNet, VGG
    • (deeper) ResNet (152 layers)
    • Inception NN (GoogLeNet)
  • Classic networks
    • LeNet-5 → focus on section II, III in the article, they're interesting! → use sigmoid/tanh
  • 1x1 convolution (network in network)
    • just multiply by some number (1 filter)
    • With more filters it makes more sense → each filter applies weights across the channels at one position
    • It's basically a fully connected neural network applied to each of the 6² = 36 different positions (6x6x32 example) → it takes the 32 numbers as input + outputs #filters values
    • Useful for:
      • Shrink #channel: Eg. 28x28x192 → (32 filters of 1x1x192) → 28x28x32
      • it lets the network learn a more complex function by adding non-linearity (e.g. 28x28x192 → (192 filters of 1x1x192) → 28x28x192).
  • Shrink? (eg. 28x28x192)
    • 28x28 → lower? ⇒ use pooling layer
    • 192 channels → lower? ⇒ use 1x1 conv
    • ⇒ very useful in building Inception Network!
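A sketch of the 1x1 conv used to shrink channels, with random weights standing in for learned ones (an assumption). It shows that a 1x1 conv is the same small dense layer applied at every spatial position.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28, 192))   # input volume
w = rng.standard_normal((192, 32))       # 32 filters, each 1x1x192
b = np.zeros(32)

# Contract the channel axis at every (h, w) position, then apply ReLU
z = np.einsum("hwc,ck->hwk", x, w) + b
a = np.maximum(z, 0)

print(a.shape)  # (28, 28, 32): H and W unchanged, channels 192 -> 32
```

Height and width are untouched; reducing those is the job of a pooling layer.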
  • Inception Network (GoogLeNet)
    • 👉 Read more: https://dlapplications.github.io/2018-07-06-CNN/#5-googlenet2014 (backup saved in Raindrop)
    • Basic building block (motivation)
    • When designing a ConvNet, we have to choose between 1x1, 3x3, 5x5 filters, and whether or not to use pooling layers → the inception net says "use them all" → more complicated but better!
    • Problem: computation cost? → consider the 5x5 layer → ~120M multiplications (28x28x32 * 5x5x192) → use a 1x1 conv first to cut this to ~12.4M
      • Without 1x1 conv → ~120M multiplications
        Using 1x1 conv → only ~12.4M multiplications
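The two cost figures can be reproduced by counting multiplications. This sketch assumes a "same" 5x5 conv from 28x28x192 to 28x28x32 and, for the bottleneck version, a 1x1 conv that first reduces to 16 channels; the value 16 is an assumption that is not stated in the notes, chosen because it yields the ~12.4M figure.

```python
def conv_mults(out_h, out_w, n_filters, f, n_c_in):
    """Multiplications for a conv: one f*f*n_c_in dot product per output value."""
    return out_h * out_w * n_filters * f * f * n_c_in

# Direct 5x5 conv: 28x28x192 -> 28x28x32
direct = conv_mults(28, 28, 32, 5, 192)

# Bottleneck: 1x1 conv down to an assumed 16 channels, then the 5x5 conv
bottleneck_channels = 16
reduce_cost = conv_mults(28, 28, bottleneck_channels, 1, 192)
expand_cost = conv_mults(28, 28, 32, 5, bottleneck_channels)

print(direct)                      # 120422400  (~120M)
print(reduce_cost + expand_cost)   # 12443648   (~12.4M)
```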
    • 1x1 conv like a "bottleneck" layer
    • Shrinking the channels (within reason) usually doesn't hurt performance!
    • Inception module
    • Inception → combine inception modules together
      • There are some "max pooling" layers inside to change H and W. There are also some additional branches that use softmax to make predictions (green)
    • Where does the name "Inception" come from? → from the movie "Inception" → "We need to go deeper" (Leo said in the movie) → we need to go deeper in the inception network!
  • Practical advice for using ConvNets
    • Use open source implementations! → e.g. GitHub
    • Transfer learning → using already-trained model → download weights
      • Image Net, MS COCO, Pascal,...
      • Download not only the codes but also the weights
      • Drop the last (softmax) layer → create your own
      • If you have a small labeled training set → freeze all layers except the last one(s)
      • If you have a "larger" labeled training set → freeze fewer layers and train the later layers
      • If you have a lot of data → use the pretrained weights as initialization + retrain the whole network!
    • Data augmentation
      • In practice, we run in parallel:
        • 1 thread → load + augmentation task
        • other threads → training
    • State of Computer Vision
    • Ensembling:
      • Train several networks independently and average their outputs. (3-15 networks)
    • Multi-crop at test time
      • Run classifier on multiple versions of test images and average results.

Week 3 - Detection algorithms

  • Object detection: Object Localization:
    • draw bounding box around object → classification with localization.
    • bounding box → based on image coordinates → output the coordinates of the bounding box.
    • How to define label y? → is there an object or not? If yes → we need to specify the other components; if not → we don't care about the others
  • Landmark detection
    • We just want the NN to output the x,y coordinates of important points in the image (called "landmarks") → e.g. corners of someone's eyes, mouth's shape,...
    • Eg: snapchat apps, computer graphic effects, AR,...
    • Person's pose?
    • The labels must be consistent across images,...
  • Object detection's algo
    • Use ConvNets to build object detection using sliding window detection algorithm.
    • Start with images cropped closely around the cars.
    • Sliding windows detection: start with a small window and slide it (running the ConvNet each time) across the entire image → increase the window size and repeat → hopefully some window size contains the car → high computational cost! → there is a solution!
    • Convolutional Implementation of Sliding Windows
      • Turning FC layers into convolutional layers ← use this idea to implement sliding windows with less computation!
      • What we want: example using a stride of 2 (yellow in the figure) → we move the window around (4 times) → 4 separate runs of the ConvNet → costly!
        → Idea of the conv implementation of sliding windows: share computation between the 4 runs above!
      • How? → keep the padded region → run the ConvNet once on the whole thing → the final output gives the 4 different cases (sharing the computation on the common regions)
        • Instead of computing each sliding window separately, we compute everything at once and read the result for each region from the final output!
      • Weakness? → the position of the bounding box is not accurate → read next to fix!
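A sketch of why an FC layer can be turned into a conv layer, under assumed toy sizes (5x5x16 window, 4 output units, random weights). Reshaping the FC weights into 5x5x16 filters and convolving a larger 6x6x16 volume gives a 2x2 grid whose entries equal the FC layer applied to each window, all in one pass. The helper names are hypothetical, and the stride here is 1 on the feature map rather than the stride-2 image example from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
f, c, units = 5, 16, 4                       # assumed toy sizes
w_fc = rng.standard_normal((f * f * c, units))

def fc(window):
    """Fully connected layer on one flattened f x f x c window."""
    return window.reshape(-1) @ w_fc

# Same weights viewed as `units` filters of size f x f x c
w_conv = w_fc.reshape(f, f, c, units)

def conv_valid(x, w):
    """Valid convolution of x (H, W, C) with filters w (f, f, C, K)."""
    n_h, n_w = x.shape[:2]
    f_h, f_w, _, k = w.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1, k))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # contract height, width and channel axes in one step
            out[i, j] = np.tensordot(x[i:i + f_h, j:j + f_w], w, axes=3)
    return out

x = rng.standard_normal((6, 6, c))           # slightly larger input volume
out = conv_valid(x, w_conv)
print(out.shape)                                  # (2, 2, 4): one output per window
print(np.allclose(out[1, 0], fc(x[1:6, 0:5])))    # True
```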
    • Bounding box predictions → Using YOLO (You Only Look Once) → the points after this are the main YOLO notes.
      • Divide the image into grid cells, let's say a 3x3 grid → apply the NN to each grid cell to output a label y → the target output is 3x3x8 in total (3x3=grid, 8=dim of y).
        In each cell, we detect the midpoint of the object and check which cell it belongs to.
        • Even if an object spans multiple grid cells, its midpoint lies in only one of them.
      • Instead of 3x3, we could use 19x19 to make multiple objects in the same grid cell less likely.
      • The CNN runs fast in this case → can be used for real-time object detection!
      • How to encode coordinates of boxes?
        • bx, by → midpoint coordinates relative to the grid cell
        • bh, bw → height and width as a fraction of the cell's side → could be > 1 (box extends outside the cell)
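A sketch of the box encoding described above, assuming boxes are given in pixels as (center x, center y, width, height) and a 3x3 grid. `encode_box` and the example numbers are hypothetical.

```python
def encode_box(x_center, y_center, width, height, img_w, img_h, grid=3):
    """Return the (row, col) of the cell owning the midpoint and (bx, by, bh, bw)."""
    cell_w, cell_h = img_w / grid, img_h / grid
    # The cell containing the midpoint is responsible for the object
    col = min(int(x_center // cell_w), grid - 1)
    row = min(int(y_center // cell_h), grid - 1)
    # Midpoint relative to the cell's top-left corner, in units of cell size
    bx = x_center / cell_w - col
    by = y_center / cell_h - row
    # Size as a fraction of the cell's side; can exceed 1
    bw = width / cell_w
    bh = height / cell_h
    return row, col, (bx, by, bh, bw)

row, col, (bx, by, bh, bw) = encode_box(250, 125, 150, 50, img_w=300, img_h=300)
print((row, col))        # (1, 2)
print((bx, by, bh, bw))  # (0.5, 0.25, 0.5, 1.5)
```

Here bw = 1.5 because the box is wider than the 100-pixel cell that contains its midpoint.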
      • YOLO paper is very hard to understand (even for Andrew)
  • Intersection over union (IoU) ← Evaluating method!!!
    • Evaluating object localization: if your algo outputs the violet box → good or not? → compute size of the intersection / size of the union ⇒ "correct" if IoU ≥ 0.5 (convention)
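A minimal IoU sketch, assuming boxes are given by their corners as (x1, y1, x2, y2) with x1 < x2 and y1 < y2. That corner format is an assumption; it differs from the midpoint/size encoding used for YOLO labels.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at 0 so non-overlapping boxes give an empty intersection
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 0, 3, 2)))  # 1/3: intersection 2, union 6
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0
```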
  • Non-max suppression ← clean up multiple detections and keep only 1!
    • Problem: 1 object detected multiple times (many midpoints accepted)
    • Make sure your algo detects each object only once!!!
    • Non-max → take the box with the max probability (light blue) and then suppress the remaining rectangles that have a high IoU with the light blue one.
      • Algo:
        • If there are multiple objects → apply non-max multiple times.
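A sketch of non-max suppression for a single class. The thresholds (discard scores below 0.6, suppress boxes with IoU ≥ 0.5) are commonly used values and are assumptions here, as are the function names and the (x1, y1, x2, y2) box format.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0)
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return indices of the boxes kept, highest score first."""
    # 1. Discard low-probability boxes
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    while candidates:
        # 2. Pick the remaining box with the highest score
        best = candidates.pop(0)
        keep.append(best)
        # 3. Suppress boxes that overlap it too much
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11),      # two detections of object A
         (20, 20, 30, 30), (21, 20, 31, 30),  # two detections of object B
         (50, 50, 60, 60)]                    # low-confidence detection
scores = [0.9, 0.8, 0.7, 0.75, 0.3]
print(non_max_suppression(boxes, scores))  # [0, 3]
```

With several classes, one common approach is to run this once per class.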
  • Anchor boxes → what if a grid cell needs to detect multiple objects? → also lets your algo specialize better
    • Without anchor: each object assigned to grid cell containing the midpoint.
    • With anchor: each object assigned to grid cell containing the midpoint + anchor box for the grid cell with highest IoU.
    • How to choose anchor box? → by hand / using K-Means algo (more advanced)...
  • YOLO algorithm
    • How to construct the training set?
      • We go from an image of 100x100x3 → ConvNet → obtain 3x3x16 (3x3=grid cells, 16=2x8 ← 2=#anchors, 8 is dim of y)
    • Making predictions?
    • Output the non-max suppressed outputs:
      - Each cell has 2 anchors → discard all the predictions with low probability
      - On the remaining ones (as in the figure below) → use non-max suppression to generate the final prediction.
  • Region proposals → very influential in computer vision but used less often

Week 4 - Neural Style Transfer

    • What's Neural Style Transfer?
      • 1 of the most fun and exciting applications of ConvNets
      • Re-create some picture in the style of another image.
      • In order to do that → look at features extracted by ConvNets at various layers
    • What are deep ConvNets learning? (at each layer)
    • Cost Function → J(G) contains 2 different cost functions: J_content (G resembles the content image) and J_style (G resembles the style image)
    • Content Cost Function
    • Style Cost Function
      • Meaning of "style" of an image? = correlation between activations across channels
      • Example of "correlated":
        • the red channel corresponds to the red neurons (vertical texture)
        • the yellow channel corresponds to the yellow neurons (orange-colored patches)
        • "highly correlated" → whenever there is vertical texture, there is also an orange-ish tint!
        • "uncorrelated" → whenever there is vertical texture, there is NO orange-ish tint!
        • → high-level texture components tend to occur or not occur together in parts of the image.
        • We measure the same "degree" of correlation between these channels in the generated image.
      • The detailed formulas can be found in this ppt file (slide 37, 38)
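A sketch of how "correlation between activations across channels" is commonly computed: a Gram matrix over the channels of one layer's activations. The notes defer the exact formulas to the slides, so the normalization 1 / (2·n_H·n_W·n_C)² used here is an assumption (it is the one usually quoted for neural style transfer), and the random activations stand in for real ConvNet features.

```python
import numpy as np

def gram_matrix(activations):
    """activations: (n_h, n_w, n_c) -> (n_c, n_c) channel co-activation matrix."""
    n_h, n_w, n_c = activations.shape
    a = activations.reshape(n_h * n_w, n_c)  # one row per spatial position
    # Entry (k, k') sums the product of channels k and k' over all positions
    return a.T @ a

def style_cost_layer(a_style, a_generated):
    """Squared difference of the two Gram matrices, with assumed normalization."""
    n_h, n_w, n_c = a_style.shape
    diff = gram_matrix(a_style) - gram_matrix(a_generated)
    return np.sum(diff ** 2) / (2 * n_h * n_w * n_c) ** 2

rng = np.random.default_rng(0)
a_s = rng.standard_normal((4, 4, 3))  # stand-in for style image activations
a_g = rng.standard_normal((4, 4, 3))  # stand-in for generated image activations

print(gram_matrix(a_s).shape)           # (3, 3)
print(style_cost_layer(a_s, a_s))       # 0.0 when the styles match exactly
print(style_cost_layer(a_s, a_g) > 0)   # True
```

A large off-diagonal entry means two channels tend to fire together, which is the "highly correlated" case in the notes.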
    • 1D and 3D Generalizations
      • example in 1D: EKG signal (Electrocardiogram)
        • 16 → 16 filters
        • ConvNets can be used on 1D data, although for 1D data people usually use RNNs
      • Example of 3D: CT scan