- HxWxC → too large → too many params → need conv
- edge detection by a filter (3x3 for example) (or kernel) → multiply with the original image using the convolution operator (*)
- Sobel filter, Scharr filter (not only 0, 1, -1)
- Can use backprop to learn the values of the filter.
- Not only vertical/horizontal edges; the filter can learn edges at any angle of the image.
- Padding
- we don't want the image to shrink every time
- 6x6 * 3x3 (filter) → 4x4
- (nxn) * (fxf) → (n-f+1) x (n-f+1)
- we don't want pixels on the corners or edges to be used much less in the outputs
- 8x8 * 3x3 → 6x6 (instead of 4x4)
- → (n+2p-f+1) x (n+2p-f+1)
- 2 common choices
- valid conv → no padding (nxn * fxf → (n-f+1) x (n-f+1))
- same conv → output size = input size → choose p = (f-1)/2
- f is usually odd → just convention
- 3x3, 5x5 are very common
→ we can pad the image (extend it all around): 6x6 → (pad 1) → 8x8
- Strided convolutions
- the filter moves more than 1 step at a time
- ex: 7x7 * 3x3 (with stride=2) → 3x3
- nxn * fxf (padding p, stride s) → floor((n+2p-f)/s + 1) x floor((n+2p-f)/s + 1)
- if the fraction is not an integer → round down → we take the floor()
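The size formulas above (valid, padded, strided) can be checked with a tiny helper; a minimal sketch using the examples from these notes:

```python
import math

def conv_output_size(n, f, p=0, s=1):
    """Output side length of an n x n input convolved with an f x f filter,
    padding p and stride s: floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * p - f) / s) + 1

# Examples from the notes:
print(conv_output_size(6, 3))        # valid conv: 6x6 * 3x3 -> 4
print(conv_output_size(6, 3, p=1))   # pad 6x6 to 8x8, then 3x3 filter -> 6
print(conv_output_size(7, 3, s=2))   # stride 2: 7x7 * 3x3 -> 3
```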
- Conv over volumes (not just 2D images) → ex: RGB images (3 channels)
- 6x6x3 * 3x3x3 (3 layers of filters) → 4x4x1
- We multiply each layer elementwise and then sum over all 3 layers → gives only 1 number in the resulting matrix → that's why we only get 4x4x1
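A quick NumPy sketch of this channel-summing step (no padding, stride 1), just to confirm the 6x6x3 * 3x3x3 → 4x4 shape:

```python
import numpy as np

def conv3d_single_filter(image, kernel):
    """Convolve an HxWxC image with an fxfxC filter (no padding, stride 1).
    At each position, all C channels are multiplied elementwise and summed
    into ONE number -- hence 6x6x3 * 3x3x3 gives a 4x4x1 output."""
    h, w, c = image.shape
    f = kernel.shape[0]
    out = np.zeros((h - f + 1, w - f + 1))
    for i in range(h - f + 1):
        for j in range(w - f + 1):
            out[i, j] = np.sum(image[i:i+f, j:j+f, :] * kernel)
    return out

img = np.random.rand(6, 6, 3)
filt = np.random.rand(3, 3, 3)
print(conv3d_single_filter(img, filt).shape)  # (4, 4)
```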
- if we want to detect vertical edges only on the red channel → the 1st layer (of the 3x3x3 filter) can detect it, the other 2 are all 0s.
- multiple filters at the same time? → 1st filter (vertical), 2nd (horizontal) → 4x4x2 (the 2 here is the 2 filters)
- we can use 100+ filters
- 1 layer of ConvNet
- if 10 filters (3x3x3) → how many parameters?
- each filter = 3x3x3 = 27 weights + 1 bias → 28 params
- 10 filters → 280 params
- Notations:
- The number of filters used will be the number of channels in the output
→ no matter the size of the image, we only have 280 params with 10 filters (3x3x3)
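The parameter count above can be written as a one-liner (numbers from the notes' example):

```python
def conv_layer_params(f, in_channels, num_filters):
    # each filter: f*f*in_channels weights + 1 bias; independent of image size
    return num_filters * (f * f * in_channels + 1)

print(conv_layer_params(3, 3, 10))  # 280
```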
- Simple example of ConvNet
- Types of layers in a ConvNet
- Convolution (conv)
- Pooling (pool)
- Fully connected (FC)
- Pooling layer
- Purpose?
- to reduce the size of the representation → speed up
- to make some of the detected features a bit more robust
- Max pooling → take the max of each region
- Idea? if a feature is detected anywhere in this region → keep the high number (upper left); if this feature is not detected (doesn't exist) (upper right), the max is still quite small.
- People use it a lot because it works well in ConvNets, but often nobody really knows the meaning behind it.
- Max pooling has no parameters to learn → grad desc doesn't change anything.
- usually we don't need any padding! (p=0)
- Average pooling → like max pool, but we take the average
→ max is used much more than avg
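A minimal NumPy sketch of both pooling modes on a 4x4 example (note: no learnable parameters anywhere):

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    """f x f pooling with stride s over a 2D array; no parameters to learn."""
    h, w = x.shape
    out_h, out_w = (h - f) // s + 1, (w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = x[i*s:i*s+f, j*s:j*s+f]
            out[i, j] = region.max() if mode == "max" else region.mean()
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)
print(pool2d(x, mode="max"))  # max of each 2x2 block: [[9, 2], [6, 3]]
```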
- LeNet-5 (NN example) → inspired by it; introduced by Yann LeCun
- Convention: 1 layer contains (conv + pool) → when people talk about layers, they mean layers with params to learn → pool doesn't have any params to learn.
- height/width decrease, #channels increases as we go deeper
- A typical NN looks something like this: conv → pool → conv → pool → ... → FC → FC → FC → softmax
- activation size goes down
- Why convolutions?
- 2 main advantages of conv: (if we used fully connected → way too many parameters!!)
- parameter sharing: a feature detector useful in one part of the image is probably useful in another part → no need to learn a separate detector for each position.
- sparsity of connections: in each layer, each output value depends only on a small number of inputs.
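The "too many parameters" point is easy to check with arithmetic; a sketch assuming a hypothetical 32x32x3 input mapped to a 28x28x6 output (the classic LeNet-style shapes):

```python
# Fully connected: every input unit connects to every output unit
fc_params = (32 * 32 * 3) * (28 * 28 * 6)
# Conv: 6 filters of 5x5x3, each with one bias -- shared across all positions
conv_params = 6 * (5 * 5 * 3 + 1)
print(fc_params)    # 14,450,688
print(conv_params)  # 456
```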
- Case studies of effective ConvNets
- a ConvNet that works well on 1 computer vision task → (usually) works well on other tasks.
- Classic networks: LeNet-5, AlexNet, VGG
- (deeper) ResNet (152 layers)
- Inception NN (GoogLeNet)
- Classic networks
- LeNet-5 → focus on sections II, III in the article, they're interesting! → uses sigmoid/tanh
- 1x1 convolution (network in network)
- with 1 filter → it just multiplies by some number
- with more filters → it makes more sense → applies weights across the channels
- It's basically a fully connected network applied at each spatial position → it takes the channel values (e.g. 32 numbers) and outputs #filters numbers
- Useful for:
- Shrinking #channels: e.g. 28x28x192 → (32 filters of 1x1x192) → 28x28x32
- it allows the network to learn a more complex function by adding non-linearity (in the case 28x28x192 → (1x1x192) → 28x28x192).
- Shrink? (e.g. 28x28x192)
- 28x28 → lower? → use a pooling layer
- 192 channels → lower? → use a 1x1 conv
→ very useful in building the Inception Network!
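A NumPy sketch of the channel-shrinking example: a 1x1 conv is just a per-position matrix multiply over channels (ReLU added for the non-linearity):

```python
import numpy as np

def conv_1x1(x, w):
    """x: (H, W, C_in), w: (C_in, C_out). A 1x1 conv is a fully connected
    layer applied independently at every spatial position."""
    return np.maximum(0, x @ w)  # ReLU supplies the non-linearity

x = np.random.rand(28, 28, 192)
w = np.random.rand(192, 32)           # 32 filters of 1x1x192
print(conv_1x1(x, w).shape)  # (28, 28, 32): channels shrunk 192 -> 32
```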
- Inception Network (GoogLeNet)
- Basic building block (motivation)
- When designing a ConvNet, we have to choose: 1x1, 3x3, or 5x5 filters, and whether or not to use pooling layers → the inception net says "use them all" → more complicated, but better!
- Problem: computational cost → consider the 5x5 layer → ~120M multiplications (28x28x32 * 5x5x192) → use a 1x1 conv first (only ~12.4M multiplications)
- the 1x1 conv acts like a "bottleneck" layer
- Shrinking this way usually doesn't hurt performance! (within reason)
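The 120M vs 12.4M numbers check out by counting multiplications (output positions × multiplications per output), using 16 intermediate bottleneck channels as in the lecture's example:

```python
# Direct 5x5 conv: 28x28x192 -> 28x28x32
direct = 28 * 28 * 32 * (5 * 5 * 192)

# Bottleneck: 1x1 conv down to 16 channels, then 5x5 conv up to 32
step1 = 28 * 28 * 16 * (1 * 1 * 192)
step2 = 28 * 28 * 32 * (5 * 5 * 16)

print(direct)         # 120,422,400  (~120M)
print(step1 + step2)  # 12,443,648   (~12.4M)
```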
- Inception module
- Inception network → combine inception modules together
- Where does the name "inception" come from? → from the movie "Inception" → "We need to go deeper" (Leo's line in the movie) → we need to go deeper in the inception network!
📖 Further reading: https://dlapplications.github.io/2018-07-06-CNN/#5-googlenet2014 (backup saved in Raindrop)
- Practical advice for using ConvNets
- Use open source implementations! → e.g. GitHub
- Transfer learning → use an already-trained model → download the weights
- ImageNet, MS COCO, Pascal,...
- Download not only the code but also the weights
- Drop the last (softmax) layer → create your own
- If you don't have much labeled data (training) → freeze all layers except the last one(s)
- If you have a "larger" labeled dataset (training) → freeze fewer layers and train the later layers
- If you have a lot of data → use the whole set of trained weights as initialization + retrain the whole network!
- Data augmentation
- In practice, we run in parallel:
- 1 thread → loading + augmentation tasks
- other threads → training
- State of Computer Vision
- Ensembling:
- Train several networks independently and average their outputs. (3-15 networks)
- Multi-crop at test time
- Run the classifier on multiple versions of the test images and average the results.
- Object detection: Object Localization:
- draw a bounding box around the object → classification with localization.
- bounding box → based on the image's coordinate system → output the coordinates of the bounding box.
- How to define the label y? → is there an object or not? if yes → we need to specify the rest; if not, we don't care about the others
- Landmark detection
- We just want the NN to output the x,y coords of important points on the image (called "landmarks") → e.g. corners of someone's eyes, mouth shape,...
- E.g.: Snapchat apps, computer graphics effects, AR,...
- Person's pose?
- The labels must be consistent across images,...
- Object detection algorithms
- Use ConvNets to build object detection using the sliding windows detection algorithm.
- Start with images closely cropped around the cars.
- Sliding windows detection: start with a small window and run a ConvNet on each crop across the entire image → then increase the window size → hopefully some window size contains the car → high computational cost! → there is a solution!
- Convolutional Implementation of Sliding Windows
- Turn the FC layers into convolutional layers → use this idea to implement sliding windows with much less computation!
- What we want: example using a stride of 2 (yellow in the figure) → we move the window around (4 times) → 4 separate ConvNet runs → costly!
→ Idea of the conv implementation of sliding windows: share the computation across those 4 runs! - How? → keep the padded region → run the ConvNet once over the whole image → the final output contains all 4 cases (the common regions are computed once)
- Weakness? → the position of the bounding box is not accurate → read on for the fix!
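The key "FC layer = conv layer" equivalence can be verified in NumPy; a sketch with hypothetical shapes (an FC layer from a flattened 5x5x16 volume to 400 units equals a conv layer with 400 filters of 5x5x16):

```python
import numpy as np

x = np.random.rand(5, 5, 16)          # input volume
W = np.random.rand(400, 5 * 5 * 16)   # FC weight matrix: 400 output units

# Classic FC: flatten, then matrix multiply -> 400 numbers
fc_out = W @ x.flatten()

# "Convolutionalized": each output unit is a 5x5x16 filter covering the input
conv_out = np.array([np.sum(x * W[k].reshape(5, 5, 16)) for k in range(400)])

print(np.allclose(fc_out, conv_out))  # True: identical computation
```

Because the FC step is now a conv, the same network can slide over a larger image in one forward pass, sharing the overlapping computation.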
- Bounding box predictions → using YOLO (You Only Look Once) → after this, the following points are the main YOLO notes.
- Divide the image into grid cells, say a 3x3 grid → apply the NN to each grid cell to output a label y → in total a 3x3x8 target output (3x3 = grid, 8 = dim of y).
- In each cell, we find the midpoint of the object and check which cell it belongs to. - Even if an object spans multiple grid cells, its midpoint is located in exactly 1.
- Instead of 3x3, we could use 19x19 to reduce the chance of multiple objects in the same grid cell.
- the CNN runs fast in this case → can be used for real-time object detection!
- How to encode the coordinates of boxes?
- bx, by → coords relative to the grid cell
- bh, bw → fraction of the length w.r.t. the side of the cell → could be > 1 (extends out of the cell)
- the YOLO paper is very hard to understand (even for Andrew)
- Intersection over Union (IoU) → evaluation method!!!
- Evaluating object localization: if your algo outputs the violet bound → good or not? → compute size of the intersection / size of the union → "correct" if IoU ≥ 0.5 (convention)
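A minimal IoU sketch for axis-aligned boxes given as (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """IoU of two boxes (x1, y1, x2, y2): intersection area / union area."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ~ 0.143 -> not "correct"
```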
- Non-max suppression → clean up multiple detections and keep only 1!
- Problem: 1 object detected multiple times (many midpoints accepted)
- Make sure your algo detects each object only once!!!
- Non-max → take the max probability (light blue) and then suppress the remaining rectangles that have high IoU with the light blue one.
- Algo:
- If there are multiple object classes → apply non-max suppression once per class.
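The algo above as a sketch (boxes as (prob, x1, y1, x2, y2); the 0.5 IoU threshold is the usual convention):

```python
def non_max_suppression(boxes, iou_threshold=0.5):
    """Keep the highest-probability box, discard remaining boxes that overlap
    it too much (IoU > threshold), then repeat on what's left."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = (a[2]-a[0]) * (a[3]-a[1]) + (b[2]-b[0]) * (b[3]-b[1]) - inter
        return inter / union

    remaining = sorted(boxes, reverse=True)  # highest prob first
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [b for b in remaining if iou(best[1:], b[1:]) <= iou_threshold]
    return kept

dets = [(0.9, 0, 0, 10, 10), (0.8, 1, 1, 10, 10), (0.7, 20, 20, 30, 30)]
print(non_max_suppression(dets))  # keeps the 0.9 box and the distant 0.7 box
```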
- Anchor boxes → what if the algo must detect multiple objects in one cell? → allow your algo to specialize better
- Without anchors: each object is assigned to the grid cell containing its midpoint.
- With anchors: each object is assigned to the grid cell containing its midpoint + the anchor box with the highest IoU for that grid cell.
- How to choose anchor boxes? → by hand / using the K-Means algo (more advanced)...
- YOLO algorithm
- How to construct the training set?
- Making predictions?
- Output the non-max suppressed outputs:
- Each cell has 2 anchors → drop all the cells with low prob
- On the remaining ones (as in the figure below) → use non-max suppression to generate the final prediction.
- Region proposals → very influential in computer vision, but used less often
- Check & try the YOLO / DarkNet project!
- What's Neural Style Transfer?
- 1 of the most fun and exciting applications of ConvNets
- Re-create a picture in the style of another image.
- In order to do it → look at the features extracted by ConvNets at various layers
- What are deep ConvNets learning? (at each layer)
- Cost function → J(G) contains 2 different cost functions: J_content (how similar the content is) and J_style (how similar the style is)
- Content Cost Function
- Style Cost Function
- Meaning of the "style" of an image? = correlation between activations across channels
- Example of "correlated":
- red channel corresponds to red neurons (vertical texture)
- yellow channel corresponds to yellow neurons (orange color patches)
- "highly correlated" → whenever there is vertical texture, there is an orange-ish tint!
- "uncorrelated" → whenever there is vertical texture, there is NO orange-ish tint!
- We measure the same "degree" of correlation between these channels in the generated image.
- The detailed formulas can be found in this ppt file (slides 37, 38)
→ high-level texture components tend to occur or not occur together in parts of the image.
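The channel-correlation idea is usually captured with a Gram matrix; a NumPy sketch on a hypothetical 14x14x8 activation volume:

```python
import numpy as np

def gram_matrix(activations):
    """activations: (H, W, C) from one layer. G[k, k'] measures how correlated
    channels k and k' are across all spatial positions -- the "style"."""
    h, w, c = activations.shape
    a = activations.reshape(h * w, c)  # rows = positions, cols = channels
    return a.T @ a                     # (C, C) style matrix

a = np.random.rand(14, 14, 8)
print(gram_matrix(a).shape)  # (8, 8)
```

J_style then compares the Gram matrices of the style image and the generated image.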
- 1D and 3D Generalizations
- example in 1D: EKG signal (electrocardiogram)
- ConvNets can be used on 1D data, although for 1D data people usually use RNNs
- Example of 3D: CT scan