- HxWxC → too large → too many params → need conv
- edge detection by a filter (3x3 for example) (or kernel) → convolve with the original image using the convolution operator (*)
- Sobel filter, Scharr filter (not only 0, 1, -1)
- Can use backprop to learn the value of filter.
- Not only fixed edge filters → the network can learn filters for edges at any angle (degree).
- Padding
- we don't want the image to shrink every time
- 6x6 * 3x3 (filter) → 4x4
- (nxn)*(fxf) → (n-f+1 x n-f+1)
- we don't want pixels on the corners or edges to be used much less in the outputs
- 8x8 * 3x3 → 6x6 (instead of 4x4)
- → (n+2p-f+1) x (n+2p-f+1)
- 2 common choices
- valid conv → no padding (nxn * fxf → n-f+1 x n-f+1)
- same conv → output size = input size → choose p = (f-1)/2
- f is usually odd → just convention
- 3x3, 5x5 are very common
→ we can pad the images (extend them all around): 6x6 → (pad 1) → 8x8
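The padding arithmetic above can be sketched as a quick helper (a minimal sketch; the function name is my own):

```python
# Output side length of an n x n image convolved with an f x f filter, padding p.
def conv_output_size(n, f, p=0):
    return n + 2 * p - f + 1

# "valid" convolution: no padding
assert conv_output_size(6, 3, p=0) == 4      # 6x6 * 3x3 -> 4x4

# "same" convolution: choose p = (f - 1) / 2 so the size is preserved
p_same = (3 - 1) // 2
assert conv_output_size(6, 3, p=p_same) == 6  # 6x6 padded to 8x8 -> 6x6
```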
- Strided convolutions
- filter moves more than 1 step
- ex: 7x7 * 3x3 (with stride=2) → 3x3
- nxn * fxf (stride s, padding p) → ⌊(n+2p-f)/s + 1⌋ x ⌊(n+2p-f)/s + 1⌋
- if the fraction is not an integer → round down → we take the floor()
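The strided rule can be sketched the same way (helper name is my own):

```python
import math

# floor((n + 2p - f) / s) + 1: output side length with stride s and padding p.
def strided_output_size(n, f, s, p=0):
    return math.floor((n + 2 * p - f) / s) + 1

assert strided_output_size(7, 3, s=2) == 3   # 7x7 * 3x3, stride 2 -> 3x3
```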
- Conv over volumes (not just on 2d images) → ex on RGB images (3 channels)
- 6x6x3 * 3x3x3 (3 layers of filters) → 4x4x1
- We multiply elementwise in each layer and then sum over all 3 layers → gives only 1 number in the resulting matrix → that's why we only have 4x4x1
- if we wanna detect vertical edges only in the red channel → the 1st layer (of the 3x3x3 filter) can detect it, the other 2 are all zeros.
- multiple filters at the same time? → 1st filter (vertical), 2nd (horizontal) → 4x4x2 (2 here is 2 filters)
- we can use 100+ filters → 100+ output channels (detect 100+ different features)
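The volume convolution above can be sketched directly in numpy (random data, loop-based for clarity — not how frameworks implement it):

```python
import numpy as np

# 6x6x3 volume convolved with two 3x3x3 filters -> 4x4x2 output.
rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6, 3))       # H x W x C
filters = rng.standard_normal((2, 3, 3, 3))  # num_filters x f x f x C

out = np.zeros((4, 4, 2))
for k in range(2):                           # one output channel per filter
    for i in range(4):
        for j in range(4):
            # elementwise product over all 3 channels, summed to one number
            out[i, j, k] = np.sum(image[i:i+3, j:j+3, :] * filters[k])

print(out.shape)  # (4, 4, 2)
```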
- 1 layer of ConvNet
- if 10 filters (3x3x3) → how many parameters?
- each filter = 3x3x3=27 + 1 bias ⇒ 28 params
- 10 filters → 280 params
- Notations:
- The number of filters used will be the number of channels in the output
→ no matter what size of image, we only have 280 params with 10 filters (3x3x3)
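Quick check of the parameter count above:

```python
# (f*f*c_in weights + 1 bias) per filter, times the number of filters.
f, c_in, num_filters = 3, 3, 10
params = (f * f * c_in + 1) * num_filters
assert params == 280   # independent of the input image size
```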
- Simple example of ConvNet
- Type of layer in ConvNet
- Convolution (conv)
- Pooling (pool)
- Fully connected (FC)
- Pooling layer
- Purpose?
- to reduce the size of representation → speed up
- to make some of the detected features a bit more robust
- Max pooling → take the max of each region
- Idea? If the feature is detected anywhere in this filter region → keep the high number (upper left); if this feature is not detected (doesn't exist) (upper right), the max is still quite small.
- People use it a lot because it works well in ConvNets, but often people don't really know the meaning behind it.
- Max pooling has no parameters to learn → gradient descent doesn't change anything.
- usually we don't need any padding! (p=0)
- Average pooling → like max pool, we take average
→ max is used much more than avg
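A minimal sketch of 2x2 max pooling with stride 2 (the `reshape` trick assumes the input side is divisible by the pool size):

```python
import numpy as np

# 4x4 input -> 2x2 output; each output is the max of one 2x2 block.
x = np.array([[1, 3, 2, 1],
              [4, 6, 1, 0],
              [2, 5, 7, 8],
              [3, 1, 4, 2]], dtype=float)

pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6. 2.]
                #  [5. 8.]]
```

Swapping `.max` for `.mean` gives average pooling.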
- LeNet-5 (NN example) ← inspired by it, introduced by Yann LeCun
- Convention: 1 layer contains (conv + pool) → when people talk about a layer, they mean a layer with params to learn → pool doesn't have any params to learn.
- nH, nW decrease, nC increases (as we go deeper)
- A typical NN looks something like that: conv → pool → conv → pool → ... → FC → FC → FC → softmax
- activation size goes down
- Why convolution?
- 2 main advantages of conv: (if we use fully connected → too many parameters!!)
- parameter sharing: a feature detector useful in one part of the image is likely useful in another part → no need to learn a separate detector per position.
- Sparsity of connections: each layer, output value depends only on a small number of inputs.
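The parameter gap can be made concrete; the sizes below (32x32x3 input, six 5x5 filters → 28x28x6 output) are my own illustration in the spirit of the lecture example:

```python
# Fully connected: flatten 32x32x3 input, connect to a flattened 28x28x6 output.
fc_params = (32 * 32 * 3) * (28 * 28 * 6)
# Conv: six 5x5x3 filters, each with 1 bias.
conv_params = (5 * 5 * 3 + 1) * 6

print(fc_params)    # 14450688
print(conv_params)  # 456
```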
- Case studies of effective ConvNets
- ConvNet works well on 1 computer vision task → (usually) works well on other tasks.
- Classic networks: LeNet-5, AlexNet, VGG
- (deeper) ResNet (152 layers)
- Inception NN (GoogLeNet)
- Classic networks
- LeNet-5 → focus on section II, III in the article, they're interesting! → use sigmoid/tanh
- 1x1 convolution (network in network)
- just multiply by some number (1 filter)
- With more filters → it makes more sense → applies a set of weights across the channels at each position
- It's basically a fully connected network applied to each of the 36 (6x6) positions: it takes the 32 input numbers and outputs #filters numbers
- Useful for:
- Shrink #channel: Eg. 28x28x192 → (32 filters of 1x1x192) → 28x28x32
- it allows the network to learn a more complex function by adding non-linearity (e.g. in the case 28x28x192 → (1x1x192, 192 filters) → 28x28x192)
- Shrink? (eg. 28x28x192)
- 28x28 → lower? ⇒ use pooling layer
- 192 channels → lower? ⇒ use 1x1 conv
⇒ very useful in building Inception Network!
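A 1x1 conv over channels is just a matrix multiply at every position; a minimal numpy sketch of the shrink example:

```python
import numpy as np

# 28x28x192 input; 32 filters of 1x1x192 stacked as a 192x32 weight matrix.
x = np.random.default_rng(0).standard_normal((28, 28, 192))
w = np.random.default_rng(1).standard_normal((192, 32))

out = x @ w            # applied independently at each of the 28x28 positions
assert out.shape == (28, 28, 32)   # channels shrunk 192 -> 32
```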
- Inception Network (GoogLeNet)
- Basic building block (motivation)
- When designing a ConvNet, we have to choose: 1x1, 3x3, 5x5, pooling or no pooling? → the Inception net says "use them all" → more complicated but better!
- Problem: computation cost? → consider the 5x5 layer → ~120M multiplications (28x28x32 * 5x5x192) → use a 1x1 conv first (only ~12.4M multiplications remain)
- 1x1 conv like a "bottleneck" layer
- When shrinking → it usually doesn't hurt performance! (within reason)
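Counting the multiplications for the example above (my own arithmetic, assuming a 16-channel bottleneck as in the lecture):

```python
# Direct 5x5 conv: 28x28x192 -> 28x28x32.
direct = 28 * 28 * 32 * 5 * 5 * 192                      # ~120M
# With a 1x1 "bottleneck" down to 16 channels, then the 5x5 conv.
bottleneck = 28 * 28 * 16 * 1 * 1 * 192 + 28 * 28 * 32 * 5 * 5 * 16
print(direct)      # 120422400
print(bottleneck)  # 12443648  (~10x fewer)
```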
- Inception module
- Inception → combine inception modules together
- Where does the name "inception" come from? → from the movie "Inception" → "We need to go deeper" (Leo's line in the movie) → we need to go deeper in the inception network!
👉 Read more: https://dlapplications.github.io/2018-07-06-CNN/#5-googlenet2014 (backup saved in raindrop)
- Practical advices for using ConvNets
- Use open source implementations! → e.g. GitHub
- Transfer learning → using already-trained model → download weights
- ImageNet, MS COCO, PASCAL,...
- Download not only the code but also the weights
- Forget the last (softmax) layer → create your own
- If you don't have much labeled data (training) → freeze all layers except the last one(s)
- If you have a "larger" labeled dataset (training) → freeze fewer layers and train the later layers
- If you have a lot of data → use the whole set of trained weights as initialization + retrain the whole network!
- Data augmentation
- In practice, we run in parallel:
- 1 thread → load + augmentation task
- other threads → training
- State of Computer Vision
- Ensembling:
- Train several networks independently and average their outputs. (3-15 networks)
- Multi-crop at test time
- Run classifier on multiple versions of test images and average results.
- Object detection: Object Localization:
- draw bounding box around object → classification with localization.
- bounding box → based on the image's coordinate system → output the coordinates of the bounding box.
- How to define the label y? → object or not? If yes → we need to specify the rest; if not, we don't care about the rest
- Landmark detection
- Just want the NN to output the x,y coords of important points on images (called "landmarks") → e.g. corners of someone's eyes, mouth shape,...
- Eg: snapchat apps, computer graphic effects, AR,...
- Person's pose?
- The labels must be consistent across images,...
- Object detection algorithms
- Use ConvNets to build object detection using sliding window detection algorithm.
- Start with images cropped closely around the cars.
- Sliding windows detection: start with a small window and slide it (running a ConvNet on each crop) over the entire image → then increase the window size → hopefully some window size fits the car. → high computational cost! → there is a solution!
- Convolutional Implementation of Sliding Windows
- Turning FC layers into convolutional layers. ← use this idea to implement sliding windows with much less computation!
- What we want: example using a stride of 2 (yellow in the figure) → we move the window around (4 times) → 4 separate runs of the ConvNet → costly!
→ Idea of the conv implementation of sliding windows: share the computation of the above 4 runs! - How? → run the ConvNet once over the whole (padded) region → the final output gives all 4 cases at once (the common regions share computation)
- Weakness? → position of bounding box is not accurate → read next to fix!
- Bounding box predictions → Using YOLO (You Only Look Once) → the points after this one are the main notes on YOLO.
- Divide the image into grid cells, let's say a 3x3 grid → apply the NN on each grid cell to output a label y → the total target output is 3x3x8 (3x3 = grid, 8 = dim of y).
In each cell, we detect the midpoint of the object and check which cell it belongs to. - Even if an object spans multiple grid cells, its midpoint is located in exactly 1.
- Instead of 3x3, we could use 19x19 to reduce the chance of multiple objects in the same grid cell.
- CNNs run fast in this case → can be used in real-time object detection!
- How to encode coordinates of boxes?
- bx, by → coords relative to the grid cell's coordinate system
- bh, bw → fractions of length w.r.t. the side of the cell → could be > 1 (box extends out of the cell)
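The encoding above can be sketched as a decoder (a hedged sketch — the helper and its exact conventions are my own, not the paper's code): bx, by are fractions within the cell; bh, bw are fractions of the cell side and may exceed 1.

```python
# Decode a YOLO-style box relative to its grid cell into absolute coordinates
# (image coordinates normalized to [0, 1] here; all names are hypothetical).
def decode_box(cell_row, cell_col, bx, by, bw, bh, grid=3, img_size=1.0):
    cell = img_size / grid
    cx = (cell_col + bx) * cell          # absolute midpoint x
    cy = (cell_row + by) * cell          # absolute midpoint y
    return cx, cy, bw * cell, bh * cell  # absolute width / height

# Midpoint at the center of cell (1, 1) in a 3x3 grid; box 1.5 cells wide:
cx, cy, w, h = decode_box(1, 1, bx=0.5, by=0.5, bw=1.5, bh=0.8)
# cx, cy ≈ 0.5, 0.5 (image center); w ≈ 0.5 even though bw > 1 within the cell
```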
- YOLO paper is very hard to understand (even for Andrew)
- Intersection over union (IoU) ← Evaluating method!!!
- Evaluating object localization: if your algo outputs the violet bound → good or not? → compute size of the intersection / size of the union ⇒ "correct" if IoU ≥ 0.5 (convention)
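A minimal IoU sketch, assuming corner-format boxes (x1, y1, x2, y2):

```python
# Intersection over union of two axis-aligned boxes.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union

# Two unit squares overlapping in a 1 x 0.5 strip: 0.5 / 1.5
print(iou((0, 0, 1, 1), (0, 0.5, 1, 1.5)))   # ≈ 0.333
```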
- Non-max suppression ← clean up multiple detections and keep only 1!
- Problem: 1 object detected multiple times (many midpoints accepted)
- Make sure your algo detects each object only once!!!
- Non-max → take the max probability (light blue) and then suppress the remaining rectangles that have high IoU with the light blue one.
- Algo:
If there are multiple object classes → apply non-max once per class.
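The procedure above can be sketched as follows (a minimal version; the corner box format and the 0.5 threshold are the usual conventions, not code from the course):

```python
# Non-max suppression over (probability, box) detections for one class.
def nms(detections, iou_threshold=0.5):
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
        return inter / union

    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # highest-probability detection
        kept.append(best)
        # suppress everything that overlaps the kept box too much
        remaining = [d for d in remaining if iou(d[1], best[1]) < iou_threshold]
    return kept

dets = [(0.9, (0, 0, 2, 2)), (0.6, (0.1, 0.1, 2, 2)), (0.8, (5, 5, 7, 7))]
print([p for p, _ in nms(dets)])   # [0.9, 0.8]  (0.6 overlaps the 0.9 box)
```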
- Anchor boxes → what if the algo needs to detect multiple objects in one grid cell? → lets your algorithm specialize better
- Without anchor: each object assigned to grid cell containing the midpoint.
- With anchor: each object assigned to grid cell containing the midpoint + anchor box for the grid cell with highest IoU.
- How to choose anchor box? → by hand / using K-Means algo (more advanced)...
- YOLO algorithm
- How to construct the training set?
- Making predictions?
- Output the non-max suppressed outputs:
- Each cell has 2 anchors → discard all the cells with low probability
- On the remaining ones (as in the figure below) → use non-max suppression to generate the final prediction.
- Region proposals → very influential in computer vision, but used less often
- Check & try YOLO / DarkNet project!
- What's Neural Style Transfer?
- 1 of the most fun and exciting applications of ConvNets
- Re-create a picture in the style of another image.
- In order to do this → look at the features extracted by ConvNets at various layers
- What are deep ConvNets Learning? (at each layer)
- Cost Function → J(G) contains 2 different cost functions: J_content (looks like the content image) and J_style (looks like the style image)
- Content Cost Function
- Style Cost Function
- Meaning of "style" of an image? = correlation between activations across channels
- Example of "correlated":
- the red channel corresponds to neurons detecting vertical texture
- the yellow channel corresponds to neurons detecting orange-colored patches
- "highly correlated" → whenever there is vertical texture, there is orange-ish tint!
- "uncorrelated" → whenever there is vertical texture, there is NO orange-ish tint!
- We measure the same "degree" of correlation between these channels in the generated image.
- The detailed formulas can be found in this ppt file (slide 37, 38)
→ high-level texture components tend to occur or not occur together in parts of the image.
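The channel-correlation idea is usually captured with a Gram matrix; a minimal sketch (unnormalized, names my own — see the slides for the exact formulas):

```python
import numpy as np

# "Style" of a layer: G[k, k'] = sum over positions of a_k * a_k'.
def gram_matrix(activations):            # activations: H x W x C
    h, w, c = activations.shape
    flat = activations.reshape(h * w, c) # one row per spatial position
    return flat.T @ flat                 # C x C channel correlations

a = np.random.default_rng(0).standard_normal((4, 4, 3))
G = gram_matrix(a)
assert G.shape == (3, 3)
assert np.allclose(G, G.T)               # correlations are symmetric
```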
- 1D and 3D Generalizations
- example in 1D: EKG signal (Electrocardiogram)
- ConvNets can be used on 1D data, although for 1D data people usually use RNNs
- Example of 3D: CT scan
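A toy 1D example of the same n - f + 1 rule on a signal (using numpy's cross-correlation, which is what ConvNets actually compute):

```python
import numpy as np

# Length-7 pulse signal, length-3 edge filter -> 5 outputs (7 - 3 + 1).
signal = np.array([0., 0., 1., 1., 1., 0., 0.])
edge_filter = np.array([1., 0., -1.])
out = np.correlate(signal, edge_filter, mode='valid')
print(out)                 # edges of the pulse show up as -1 / +1
assert out.shape == (5,)
```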