Notes for course DL 4: CNN

Anh-Thi Dinh
draft

Week 1 - CNN

  • HxWxC → too large → too many params → need conv
  • edge detection by a filter (or kernel), 3x3 for example → combine it with the original image using the convolution operator (*)
    • Sobel filter, Scharr filter (not only 0, 1, -1)
    • Can use backprop to learn the values of the filter.
    • Not only vertical/horizontal edges: a learned filter can detect edges at any angle (degree) in the image.
  • Padding
    • we don't want the image to shrink every time a conv is applied
      • 6x6 * 3x3 (filter) → 4x4
      • (nxn)*(fxf) → (n-f+1 x n-f+1)
    • we don't want pixels at the corners or edges to be used much less in the output than pixels in the middle
    • → we can pad the image (add extra pixels all around it): 6x6 → (pad 1) → 8x8
      • 8x8 * 3x3 → 6x6 (instead of 4x4)
      • → (n+2p-f+1) x (n+2p-f+1)
    • 2 common choices
      • valid conv → no padding (nxn * fxf → n-f+1 x n-f+1)
      • same conv → out's size = in's size → choose p = (f-1)/2 (from n+2p-f+1 = n)
    • f is usually odd → just convention
      • 3x3, 5x5 are very common
  • Strided convolutions
    • filter moves more than 1 step
    • ex: 7x7 * 3x3 (with stride=2) → 3x3
    • nxn * fxf (padding p, stride s) → (floor((n+2p-f)/s) + 1) x (floor((n+2p-f)/s) + 1)
    • if the fraction is not an integer → round down → we take the floor()
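A minimal sketch to check the size arithmetic from the padding and stride notes, using the usual formula floor((n+2p-f)/s) + 1. The function names `conv_output_size` and `same_padding` are made up for this note, they do not come from any library.

```python
def conv_output_size(n: int, f: int, p: int = 0, s: int = 1) -> int:
    """Side length of the output for an n x n input, f x f filter, padding p, stride s."""
    if n + 2 * p < f:
        raise ValueError("filter is larger than the padded input")
    # Integer division implements the floor() from the notes.
    return (n + 2 * p - f) // s + 1


def same_padding(f: int) -> int:
    """Padding that keeps the output size equal to the input size (stride 1, odd f)."""
    if f % 2 == 0:
        raise ValueError("this simple formula assumes an odd filter size")
    return (f - 1) // 2


print(conv_output_size(6, 3))                     # valid conv: 6x6 * 3x3 -> 4
print(conv_output_size(6, 3, p=same_padding(3)))  # same conv: stays 6
print(conv_output_size(7, 3, s=2))                # stride 2: 7x7 * 3x3 -> 3
```

The three printed cases are the same examples as in the bullets above (6x6 → 4x4, 6x6 with pad 1 → 6x6, 7x7 with stride 2 → 3x3).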
  • Conv over volumes (not just on 2d images) → ex on RGB images (3 channels)
    • 6x6x3 * 3x3x3 (a filter with 3 layers, one per channel) → 4x4x1
      • We multiply each filter layer element-wise with the matching channel and then sum over all 3 layers → this gives only 1 number per position in the resulting matrix → that's why the output is only 4x4x1
      • if we want to detect vertical edges only on the red channel → the 1st layer (of the 3x3x3 filter) detects them, the other 2 layers are all 0s.
      • multiple filters at the same time? → 1st filter (vertical), 2nd (horizontal) → 4x4x2 (2 here is the number of filters)
      • we can use 100+ filters → the output depth equals the number of filters
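To make the "multiply each layer, then sum everything into 1 number" step concrete, here is a naive NumPy sketch of a valid, stride-1 convolution over a volume with several filters. It is written for readability, not speed. The function name `conv_volume`, the array layout (height, width, channels) and the toy image are assumptions of this sketch.

```python
import numpy as np


def conv_volume(image: np.ndarray, filters: np.ndarray) -> np.ndarray:
    """Valid, stride-1 convolution (cross-correlation, as used in deep learning).

    image:   (n_h, n_w, c)
    filters: (k, f, f, c)  -> k filters, each f x f x c
    returns: (n_h - f + 1, n_w - f + 1, k)
    """
    n_h, n_w, c = image.shape
    k, f, _, c_f = filters.shape
    if c != c_f:
        raise ValueError("filter depth must match the number of image channels")
    out_h, out_w = n_h - f + 1, n_w - f + 1
    out = np.zeros((out_h, out_w, k))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + f, j:j + f, :]  # (f, f, c)
            # Element-wise product with every filter, summed over f, f and c:
            # one number per filter at this position.
            out[i, j, :] = np.sum(patch * filters, axis=(1, 2, 3))
    return out


# Toy 6x6x3 image: the red channel is bright on the left half, dark on the right.
image = np.zeros((6, 6, 3))
image[:, :3, 0] = 10.0

filters = np.zeros((2, 3, 3, 3))
filters[0, :, :, 0] = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]  # vertical edges, red only
filters[1, :, :, 0] = [[1, 1, 1], [0, 0, 0], [-1, -1, -1]]  # horizontal edges, red only

out = conv_volume(image, filters)
print(out.shape)     # (4, 4, 2): 2 filters -> 2 output channels
print(out[0, :, 0])  # response of the vertical-edge filter along the first row
```

The other two layers of each filter are left at 0, which is the "detect edges only on the red channel" case from the notes.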
  • 1 layer of ConvNet
    • if 10 filters (3x3x3) → how many parameters?
      • each filter = 3x3x3=27 + 1 bias ⇒ 28 params
      • 10 filters → 280 params
      • → no matter what size of image, we only have 280 params with 10 filters (3x3x3)
    • Notations:
    • The number of filters used will be the number of channels in the output
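A quick check of the parameter count above. The helper name `conv_layer_params` is just for this note; it assumes one bias per filter, as in the 27 + 1 computation.

```python
def conv_layer_params(f: int, c_in: int, n_filters: int, bias: bool = True) -> int:
    """Number of learnable parameters in one conv layer with f x f x c_in filters."""
    per_filter = f * f * c_in + (1 if bias else 0)
    return per_filter * n_filters


print(conv_layer_params(3, 3, 1))   # one 3x3x3 filter + bias -> 28
print(conv_layer_params(3, 3, 10))  # ten filters -> 280
```

The image size never appears in the function, which is the point made above: the count depends only on the filters.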
  • Simple example of ConvNet
  • Type of layer in ConvNet
    • Convolution (conv)
    • Pooling (pool)
    • Fully connected (FC)
  • Pooling layer
    • Purpose?
      • to reduce the size of representation → speed up
      • to make some of the detected features a bit more robust
    • Max pooling → take the max of each region
      • Idea? if the feature is detected anywhere in the region covered by the filter → keep the high number (upper left); if the feature is not detected (doesn't exist) (upper right), the max is still quite small.
      • People use it a lot because it works well in ConvNets, but often they don't really know the meaning behind it.
      • Max pooling has no parameters to learn → gradient descent doesn't change anything.
      • usually we don't need any padding! (p=0)
    • Average pooling → like max pooling, but we take the average of each region
    • → max is used much more than avg
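A small NumPy sketch of max and average pooling on a single channel, assuming no padding (p=0) as noted above. The function name `pool2d`, the default f=2, s=2 and the 4x4 example values are choices made for this sketch.

```python
import numpy as np


def pool2d(x: np.ndarray, f: int = 2, s: int = 2, mode: str = "max") -> np.ndarray:
    """Pool a 2D array with an f x f window and stride s (no padding)."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    reduce_fn = np.max if mode == "max" else np.mean
    n_h, n_w = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * s:i * s + f, j * s:j * s + f]
            out[i, j] = reduce_fn(window)  # nothing to learn here, just a fixed rule
    return out


x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)

print(pool2d(x, mode="max"))  # [[9. 2.] [6. 3.]]
print(pool2d(x, mode="avg"))  # [[3.75 1.25] [3.75 2.  ]]
```

For a multi-channel input, the same operation would be applied to each channel independently, so the number of channels does not change.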
  • LeNet-5 (NN example) ← the example is inspired by LeNet-5, introduced by Yann LeCun
    • Convention: 1 layer contains (conv + pool) → when people talk about a layer, they mean a layer with params to learn → pool doesn't have any params to learn.
    • As we go deeper: height and width decrease, the number of channels increases
    • A typical NN looks something like that: conv → pool → conv → pool → ... → FC → FC → FC → softmax
    • activation size goes down gradually through the network
  • Why convolution?
    • 2 main advantages of conv (if we used fully connected layers → too many parameters!!):
      • Parameter sharing: a feature detector that is useful in one part of the image is probably useful in another part → no need to learn a separate detector for each position.
      • Sparsity of connections: in each layer, each output value depends only on a small number of inputs.
 

Week 2 - Case studies & practical advice for using ConvNets

  • Case studies of effective ConvNets
    • ConvNet works well on 1 computer vision task → (usually) works well on other tasks.
    • Classic networks: LeNet-5, AlexNet, VGG
    • (deeper) ResNet (152 layers)
    • Inception NN (GoogLeNet)
  • Classic networks
    • LeNet-5 → focus on section II, III in the article, they're interesting! → use sigmoid/tanh
  • ResNets (Residual Networks) → researchers were able to build deep neural nets with higher number of layers
    • My understanding (from various sources) ← read this first (or later)
      • Motivation:
        • VGG suggested that the deeper the NN, the better, but is that really true? → No! (as the ResNet paper shows): both training error and test error increase → this is not overfitting but is caused by the "vanishing gradient problem"
        • Vanishing gradient problem: when training a very deep NN → during backprop the gradients are easily distorted and shrink toward 0 → the weights are not updated to new, better values → no learning is performed → bad!
        • ResNets were created to solve this vanishing gradient problem!
      • Idea:
        • If a network is good enough, it should at least be able to learn the identity function (f(x)=x)
        • If a NN can predict a function g → then that network + id (layers) can also predict g ("can" = same accuracy) → because the later layers simply have to learn the id function.
        • However, such a plain network fails to learn id, why?
          • we usually initialize the weights around 0
          • learning the id function is as difficult as learning any other function
        • Instead of X → DNN → (id) → X, why don't we pass X → X first and then add X → DNN → X (what we actually want to learn)?
      • With a deeper network (but fewer params) → without res blocks the results are worse than with res blocks!
    • The bottleneck of VGG → they cannot go as deep as wanted!
    • Skip connection: take the activations from 1 layer → feed them to another layer (even one much deeper in the NN)
    • Used to train very very deep networks (>100 layers)
    • Residual block
      • Instead of going along the "main path" (takes longer), we can take the "shortcut" ("skip connection")
      • Using this res block → allows training much deeper networks!
    • From this article (more understandable)
      • Motivation: if the network is good, it should at least be able to solve the identity function (f(x)=x). Equivalently, if the network outputs f(x) + x, it only has to predict f(x)=0. Let h(x)=f(x)+x, i.e. f(x)=h(x)-x (the residual) → we train the network so that h(x)-x=0 (easier for the network).
      • ResNets solve the vanishing gradient problem.
      • Vanishing gradient problem: when the network is too deep, the gradients coming from the loss function easily shrink to zero after several applications of the chain rule (backprop). This results in the weights never updating their values and, therefore, no learning being performed.
        • → With ResNets, the gradients can flow directly through the skip connections backwards from later layers to initial filters.
    • 2 ResNet Architecture - YouTube → explains the idea quickly but not in enough depth
    • [Classic] Deep Residual Learning for Image Recognition (Paper Explained) - YouTube → explanation based on the original paper → very good explanation!!!
    • Networks with ResNet blocks perform better
    • Getting a ResNet to work well on the training set is a good 1st step!
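The residual block idea (main path plus shortcut) can be sketched with two small fully connected layers in NumPy. This is only an illustration under assumptions: real ResNets use conv layers and batch norm, and the names `residual_block`, `plain_block`, `W1`, `b1`, `W2`, `b2` are invented for this sketch.

```python
import numpy as np


def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(z, 0.0)


def plain_block(a, W1, b1, W2, b2):
    """Two layers on the main path only."""
    a1 = relu(W1 @ a + b1)
    return relu(W2 @ a1 + b2)


def residual_block(a, W1, b1, W2, b2):
    """Same two layers, but the input a is added back before the last ReLU."""
    a1 = relu(W1 @ a + b1)
    z2 = W2 @ a1 + b2
    return relu(z2 + a)  # skip connection: the "shortcut" from the notes


a = np.array([1.0, 2.0, 0.5])  # non-negative, e.g. the output of a previous ReLU
W1 = np.zeros((3, 3)); b1 = np.zeros(3)
W2 = np.zeros((3, 3)); b2 = np.zeros(3)

print(plain_block(a, W1, b1, W2, b2))     # all zeros: the signal is lost
print(residual_block(a, W1, b1, W2, b2))  # equals a: identity comes for free
```

With weights near 0 (the usual initialization mentioned above), the residual block already behaves like the identity, while the plain block would have to learn it.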
  • 1x1 convolution (network in network)
    • just multiply by some number (1 filter)
    • With more filters it makes more sense → each filter applies weights across the channels at one position
    • It's basically a fully connected neural network applied at each of the 6x6 = 36 positions (for a 6x6x32 input) → it takes 32 numbers as input and outputs #filters numbers
    • Useful for:
      • Shrink #channel: Eg. 28x28x192 → (32 filters of 1x1x192) → 28x28x32
      • it allows the network to learn a more complex function by adding non-linearity (e.g. 28x28x192 → (192 filters of 1x1x192) → 28x28x192).
  • Shrink? (eg. 28x28x192)
    • 28x28 → lower? ⇒ use pooling layer
    • 192 channels → lower? ⇒ use 1x1 conv
    • ⇒ very useful in building Inception Network!
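A 1x1 convolution is just a matrix multiplication over the channel axis at every position, followed by a non-linearity. A minimal NumPy sketch of the 28x28x192 → 28x28x32 example follows; the name `conv1x1`, the (C_in, C_out) weight layout, the ReLU and the random inputs are assumptions of this sketch.

```python
import numpy as np


def conv1x1(x: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """1x1 convolution followed by ReLU.

    x: (H, W, C_in), w: (C_in, C_out), b: (C_out,)
    returns: (H, W, C_out)
    """
    z = x @ w + b  # the same small fully connected layer applied at every (h, w) position
    return np.maximum(z, 0.0)


rng = np.random.default_rng(0)
x = rng.normal(size=(28, 28, 192))
w = rng.normal(size=(192, 32))  # 32 filters of size 1x1x192
b = np.zeros(32)

y = conv1x1(x, w, b)
print(y.shape)  # (28, 28, 32): H and W unchanged, channels shrunk from 192 to 32
```

Height and width are untouched here; shrinking them is the job of the pooling layer, as in the bullets above.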
  • Inception Network (GoogLeNet)
    • 👉 Read more: https://dlapplications.github.io/2018-07-06-CNN/#5-googlenet2014 (backup saved in Raindrop)
    • Basic building block (motivation)
    • When designing a ConvNet, we have to choose between 1x1, 3x3, 5x5 filters, and whether or not to use pooling layers → the inception net says "use them all" → more complicated but better!
    • Problem: computation cost? → consider the 5x5 conv layer → ~120M multiplications (28x28x32 outputs, each needing 5x5x192) → use a 1x1 conv first (only ~12.4M multiplications remain)
      • Without 1x1 conv → ~120M multiplications
      • Using 1x1 conv → only ~12.4M multiplications
    • 1x1 conv like a "bottleneck" layer
    • When shrink → it usually doesn't hurt your performance! (within reason to use)
    • Inception module
    • Inception → combine inception modules together
      • There are some "max pooling" layers inside to change H and W. There are also some additional branches using softmax to make predictions (green)
    • Where does the name "Inception" come from? → from the movie "Inception" → "We need to go deeper" (Leo said in the movie) → we need to go deeper in the inception network!
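The ~120M vs ~12.4M comparison for the 5x5 branch can be reproduced by counting multiplications. One assumption here: the 1x1 bottleneck reduces 192 channels to 16, which is not written in the notes but reproduces the 12.4M figure. The helper name `conv_mults` is made up for this sketch.

```python
def conv_mults(out_h: int, out_w: int, n_filters: int, f: int, c_in: int) -> int:
    """Multiplications for a conv layer: one f*f*c_in dot product per output value."""
    return out_h * out_w * n_filters * f * f * c_in


# Direct 5x5 conv: 28x28x192 -> 28x28x32
direct = conv_mults(28, 28, 32, 5, 192)

# With a bottleneck (assumed 16 channels): 1x1 conv to 28x28x16, then 5x5 conv to 28x28x32
reduce_step = conv_mults(28, 28, 16, 1, 192)
expand_step = conv_mults(28, 28, 32, 5, 16)
bottleneck = reduce_step + expand_step

print(f"{direct:,}")      # 120,422,400
print(f"{bottleneck:,}")  # 12,443,648
```

So the bottleneck version needs roughly a tenth of the multiplications for the same output shape.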
  • Practical advice for using ConvNets
    • Use open source implementations! → e.g. GitHub
    • Transfer learning → using already-trained model → download weights
      • ImageNet, MS COCO, Pascal,...
      • Download not only the codes but also the weights
      • Forget the last (softmax) layer → create your own
      • If you don't have much labeled (training) data → freeze all layers except the last one(s)
      • If you have a "larger" labeled (training) dataset → freeze fewer layers and train the later layers
      • If you have a lot of data → use all the trained weights as initialization + retrain the whole network!
    • Data augmentation
      • In practice, we run in parallel:
        • 1 thread → load + augmentation task
        • other threads → training
    • State of Computer Vision
    • Ensembling:
      • Train several networks independently and average their outputs. (3-15 networks)
    • Multi-crop at test time
      • Run classifier on multiple versions of test images and average results.

Week 3 - Detection algorithms

  • Object detection: Object Localization:
    • draw bounding box around object → classification with localization.
    • bounding box → expressed in image coordinates → the NN outputs the coordinates of the bounding box.
    • How to define the label y? → is there an object or not? If yes → we also need to specify the other components; if not → we don't care about the others
  • Landmark detection
    • We just want the NN to output the x,y coordinates of important points in the image (called "landmarks") → e.g. the corners of someone's eyes, the shape of the mouth,...
    • Eg: snapchat apps, computer graphic effects, AR,...
    • Person's pose?
    • The labels must be consistent across images,...
  • Object detection's algo
    • Use ConvNets to build object detection using sliding window detection algorithm.
    • Start with closely cropped images of cars (the car fills most of the image).
    • Sliding windows detection: start with a small window and slide it across the entire image (running the ConvNet on each crop) → increase the window size and repeat → hopefully some window size contains the car → high computational cost! → there is a solution!
    • Convolutional Implementation of Sliding Windows
      • Turn the FC layers into convolutional layers. ← use this idea to implement sliding windows at lower computational cost!
      • What we want: example using a stride of 2 (yellow in the figure) → we move the window around (4 positions) → 4 separate runs of the ConvNet → costly!
        → Idea of the conv implementation of sliding windows: share the computation between these 4 runs!
      • How? → keep the padded region → run the ConvNet once on the whole image → the final output gives the 4 different cases (computations on the common regions are shared)
        • Instead of computing each sliding window separately, we compute them all at once and read the result for each region at the final step!
      • Weakness? → the position of the bounding box is not accurate → read on for the fix!
    • Bounding box predictions → using YOLO (You Only Look Once) → the points after this one are the main notes on YOLO.
      • Divide the image into grid cells, let's say a 3x3 grid → apply the NN to each grid cell to output a label y → the target output is 3x3x8 in total (3x3=grid, 8=dim of y).
        For each object, we find its midpoint and assign the object to the cell that contains the midpoint.
        • Even if an object spans multiple grid cells, its midpoint lies in exactly 1 cell.
      • Instead of 3x3, we could use 19x19 to make it less likely that multiple objects fall in the same grid cell.
      • The CNN runs fast in this case → can be used for real-time object detection!
      • CNN run fast in this case → can be used in real time object detection!
      • How to encode coordinates of boxes?
        • bx, by → coordinates of the midpoint relative to the grid cell
        • bh, bw → height and width as a fraction of the cell's side length → could be > 1 (the box extends outside the cell)
      • YOLO paper is very hard to understand (even for Andrew)
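A rough sketch of the box encoding described above: find the grid cell that contains the midpoint, then express the midpoint relative to that cell and the size in units of the cell's side. The function name `encode_box` and the input convention (all values normalized to [0, 1] relative to the whole image) are assumptions for illustration, not taken from the YOLO paper.

```python
def encode_box(x_mid: float, y_mid: float, width: float, height: float, grid: int = 3):
    """Encode a box given in image-relative coordinates for a grid x grid YOLO target.

    Returns (row, col, bx, by, bh, bw), where bx and by are offsets inside the cell
    and bh and bw are measured in cell side lengths (so they can exceed 1).
    """
    # Cell that contains the midpoint (clamped so that x_mid = 1.0 stays in the last cell).
    col = min(int(x_mid * grid), grid - 1)
    row = min(int(y_mid * grid), grid - 1)
    bx = x_mid * grid - col
    by = y_mid * grid - row
    bw = width * grid
    bh = height * grid
    return row, col, bx, by, bh, bw


# A box centered in the image, half the image wide and a quarter of it tall.
print(encode_box(0.5, 0.5, 0.5, 0.25))  # (1, 1, 0.5, 0.5, 0.75, 1.5)
```

Here bw = 1.5 > 1: the box is wider than its grid cell, which is allowed, while the midpoint still belongs to exactly one cell.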
  • Intersection over union (IoU) ← Evaluating method!!!
    • Evaluating object localization: if your algo outputs the violet bounding box → good or not? → compute (size of the intersection) / (size of the union) ⇒ "correct" if IoU ≥ 0.5 (convention)
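A minimal IoU sketch. It assumes boxes are given as corner coordinates (x1, y1, x2, y2) with x1 < x2 and y1 < y2; that format and the function name `iou` are choices made for this sketch.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width/height are clamped at 0 when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes -> 1.0
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7 -> about 0.143
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint -> 0.0
```

With the 0.5 convention above, only the first pair would count as a "correct" localization.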
  • Non-max suppression ← clean up multiple detections and keep only 1!
    • Problem: 1 object detected multiple times (many midpoints accepted)
    • Make sure your algo detect one object only once!!!
    • Non-max → take the box with the max probability (light blue) and then suppress the remaining rectangles that have a high IoU with the light blue one.
      • Algo:
        • If there are multiple objects → apply non-max multiple times.
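One common way to write non-max suppression for a single class, as a sketch: drop low-probability boxes, then repeatedly keep the most confident box and suppress the boxes that overlap it too much. The thresholds 0.6 and 0.5, the (x1, y1, x2, y2) box format and the function names are assumptions of this sketch.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return the indices of the boxes kept, most confident first."""
    # Step 1: discard boxes with low probability.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    while candidates:
        # Step 2: keep the most confident remaining box.
        best = candidates.pop(0)
        kept.append(best)
        # Step 3: suppress the boxes that overlap it too much.
        candidates = [i for i in candidates if iou(boxes[best], boxes[i]) < iou_threshold]
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (21, 21, 31, 31), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.95, 0.3]
print(non_max_suppression(boxes, scores))  # [3, 0]
```

For several classes, the same procedure would be run once per class, which matches the "apply non-max multiple times" note.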
  • Anchor boxes → what if the algo has to detect multiple objects (in the same grid cell)? → also lets your algo specialize better
    • Without anchor: each object assigned to grid cell containing the midpoint.
    • With anchor: each object assigned to grid cell containing the midpoint + anchor box for the grid cell with highest IoU.
    • How to choose anchor box? → by hand / using K-Means algo (more advanced)...
  • YOLO algorithm
    • How construct training set?
      • We go from image of 100x100x3 → ConvNet → obtain 3x3x16 (3x3=grid cells, 16=2x8 ← 2=#anchor, 8 is dim of y)
    • Making predictions?
    • Output the non-max suppressed outputs:
      - Each cell has 2 anchors → discard all the boxes with low probability
      - On the remaining ones (as in the figure) → use non-max suppression to generate the final predictions.
  • Region proposals → very influential in computer vision but used less often
      • R-CNN = Region-CNN → only consider a few regions (RoI = region of interest) to run the ConvNet on → run a segmentation algo → find blocks → run on these blocks only.
      • R-CNN's output: label + bounding box as well
      • R-CNN is quite slow ⇒ use Fast R-CNN or Faster R-CNN

Week 4 - Neural Style Transfer

  • What's Neural Style Transfer?
    • 1 of the most fun and exciting applications of ConvNets
    • Re-create a picture in the style of another image.
    • In order to do → look at features extracted by ConvNets at various layers
  • What are deep ConvNets Learning? (at each layer)
  • Cost Function → J(G) contains 2 different cost functions: J_content (how similar the generated image is to the content image) and J_style (how similar it is to the style image)
  • Content Cost Function
  • Style Cost Function
    • Meaning of "style" of an image? = correlation between activations across channels
    • Example of "correlated":
      • the red channel corresponds to the red neurons (vertical texture)
      • the yellow channel corresponds to the yellow neurons (orange-colored patches)
      • "highly correlated" → whenever there is vertical texture, there is also an orange-ish tint!
      • "uncorrelated" → whenever there is vertical texture, there is NO orange-ish tint!
      • → correlation tells us which high-level texture components tend to occur or not occur together in parts of the image.
      • We then measure the same degree of correlation between these channels in the generated image.
    • The detailed formulas can be found in this ppt file (slide 37, 38)
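The "correlation between activations across channels" is usually computed as a Gram (style) matrix. A minimal NumPy sketch for one layer follows. The normalization 1 / (2 n_H n_W n_C)^2 is the one I remember from the course and should be checked against the slides; the function names and the random activations are just for illustration.

```python
import numpy as np


def gram_matrix(a: np.ndarray) -> np.ndarray:
    """Style matrix of one layer's activations a with shape (n_H, n_W, n_C).

    Entry (k, k') sums a[:, :, k] * a[:, :, k'] over all positions:
    large when channels k and k' tend to be active together.
    """
    n_h, n_w, n_c = a.shape
    flat = a.reshape(n_h * n_w, n_c)  # one row per position, one column per channel
    return flat.T @ flat              # (n_C, n_C)


def layer_style_cost(a_style: np.ndarray, a_generated: np.ndarray) -> float:
    """Squared difference between the two style matrices, with the assumed normalization."""
    n_h, n_w, n_c = a_style.shape
    diff = gram_matrix(a_style) - gram_matrix(a_generated)
    return float(np.sum(diff ** 2) / (2 * n_h * n_w * n_c) ** 2)


rng = np.random.default_rng(0)
a_s = rng.normal(size=(4, 4, 3))  # stand-in for activations of the style image
a_g = rng.normal(size=(4, 4, 3))  # stand-in for activations of the generated image

print(gram_matrix(a_s).shape)      # (3, 3)
print(layer_style_cost(a_s, a_s))  # 0.0: identical style matrices
print(layer_style_cost(a_s, a_g) > 0)
```

Note that this is an unnormalized correlation (no mean subtraction), and the total style cost would typically combine this quantity over several layers.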
  • 1D and 3D Generalizations
    • example in 1D: EKG signal (Electrocardiogram)
      • 16 → 16 filters
      • ConvNets can be used on 1D data, although people usually use RNNs for 1D data
    • Example of 3D: CT scan