NN 09
Agenda
CNN Architectures for
• Object Detection
• Segmentation
• Semantic Segmentation
Object Detection
• The task of assigning a label and a bounding box to all objects in the image.
• There are more than just class labels in an image.
Object Detection
• Object detection as CLASSIFICATION
• Object detection as REGRESSION Problem
(Figure: image windows classified one at a time: Cat? NO, Dog? NO, Duck? NO.)
Object Detection as Classification
Problem:
Too many positions & scales to test
Shortcoming:
• Too slow for real-time detection.
• Hard to optimize.
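To get a feel for "too many positions & scales", the sketch below counts how many windows a sliding-window classifier would have to evaluate. The image size, window sizes, and stride are hypothetical values chosen only for illustration; they do not come from the slides.

```python
def count_windows(img_w, img_h, win, stride):
    """Number of square win x win windows that fit in the image at the given stride."""
    if win > img_w or win > img_h:
        return 0
    nx = (img_w - win) // stride + 1  # window positions along x
    ny = (img_h - win) // stride + 1  # window positions along y
    return nx * ny


# Hypothetical setup: 448x448 image, three window scales, stride of 8 pixels.
total = sum(count_windows(448, 448, win, 8) for win in (64, 128, 256))
print(total)  # 4707 classifier evaluations for a single image
```

Each of those windows needs its own forward pass through the classifier, which is why this approach struggles to run in real time.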
Object Detection as Regression Problem
YOLO Features:
• Extremely Fast (45 frames per second).
• Global reasoning on the Entire Image.
Reference: Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016
YOLO, Detection Procedure
1- Divide the image into an S×S grid.
• If the center of an object falls into a grid cell, that cell is responsible for detecting the object.
7x7 grid
2- Each grid cell predicts B boxes (x, y, w, h) and a confidence P(Object) for each box.
B = 2
Reference: Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016
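The "responsible cell" rule from step 1 can be sketched as a small helper. The function name and the pixel-coordinate convention are assumptions made for this illustration, not something the slides define.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the S x S grid cell containing the box center (cx, cy).

    Coordinates are in pixels with the origin at the top-left corner (assumed).
    """
    col = min(int(cx / img_w * S), S - 1)  # clamp so a center on the right edge stays in the grid
    row = min(int(cy / img_h * S), S - 1)  # clamp so a center on the bottom edge stays in the grid
    return row, col


# Hypothetical example: a 448x448 image with an object centered at (300, 100).
print(responsible_cell(300, 100, 448, 448))  # (1, 4)
```

Only that one cell is asked to predict the object, even if the object's box extends over many neighboring cells.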
Then combine the box and class predictions:
Pr(Class_i | Object) * P(Object) = P(Class_i)
Finally, threshold the detections and apply Non-Maximum Suppression (NMS).
Reference: Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016
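A minimal sketch of the combination step: each box confidence P(Object) is multiplied by the cell's conditional class probabilities, and low scores are zeroed out before NMS. The helper names and the threshold value of 0.2 are assumptions for illustration.

```python
def class_scores(box_confidence, class_probs):
    """Pr(Class_i | Object) * P(Object) for every class i of one box."""
    return [box_confidence * p for p in class_probs]


def threshold(scores, min_score=0.2):
    """Zero out scores below min_score (0.2 is a hypothetical cutoff)."""
    return [s if s >= min_score else 0.0 for s in scores]


# Hypothetical box with P(Object) = 0.5 and two class probabilities.
scores = class_scores(0.5, [0.8, 0.2])
print(threshold(scores))  # the second class falls below the cutoff and is zeroed
```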
YOLO: Experiment with 20 classes
• There are 20 categories.
• 7x7 grid cell.
• Each grid cell needs to predict 2 bounding boxes and 20 class probabilities.
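The numbers above fix the size of the network output. The short calculation below assumes each box carries five values (x, y, w, h, and a confidence), as in the detection procedure described earlier.

```python
S, B, C = 7, 2, 20        # grid size, boxes per cell, number of classes
values_per_box = 5        # x, y, w, h, confidence

per_cell = B * values_per_box + C  # values predicted by one grid cell
total = S * S * per_cell           # values in the full S x S x per_cell output

print(per_cell, total)  # 30 1470
```

So the final layer produces a 7x7x30 tensor, or 1470 numbers per image.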
YOLO: Experiment with 20 classes, cont.
Reference: https://zhuanlan.zhihu.com/p/24916786?refer=xiaoleimlnote
YOLO Design
• The YOLO detection network includes
• 24 convolutional layers and 2 fully connected layers.
• The convolutional layers extract image features, and the fully connected layers predict the bounding-box positions and class probabilities.
Reference: Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016
YOLO Design, cont.
• Modified GoogLeNet.
• 1x1 reduction layers followed by 3x3 convolutional layers, in place of GoogLeNet's inception modules.
Reference: Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016
YOLO: Training
1) Pretrain on the ImageNet 1000-class competition dataset.
YOLO: Training, cont.
2) “Network on Convolutional Feature Maps”
Increase the input resolution from 224x224 to 448x448.
Epochs = 135
batch_size = 64
momentum_a = 0.9
Decay = 0.0005
lr = [10^-3, 10^-2, 10^-3, 10^-4]
dropout_rate = 0.5
augmentation = [scaling, translation, exposure, saturation]
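The four learning-rate values listed above (10^-3, 10^-2, 10^-3, 10^-4) can be read as a piecewise schedule over the 135 epochs. The sketch below is one possible reading: the epoch boundaries (75, then 30, then 30 epochs) follow the schedule described by Redmon et al., while the warm-up length and the linear ramp are assumptions, since the slide gives only the values.

```python
def learning_rate(epoch, warmup_epochs=5):
    """Piecewise learning-rate schedule over 135 epochs (0-indexed).

    warmup_epochs is a hypothetical value; the slide does not state it.
    """
    if epoch < warmup_epochs:
        # Assumed linear ramp from 1e-3 up to 1e-2 during warm-up.
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < 75:
        return 1e-2   # high rate for the bulk of training
    if epoch < 105:
        return 1e-3   # next 30 epochs
    return 1e-4       # final 30 epochs


for e in (0, 50, 100, 134):
    print(e, learning_rate(e))
```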
YOLO: Inference
Look at detection Procedure
Get first class scores for each bbox
Non-Maximum Suppression
How is it computed?
Intersection over Union (IoU)
Do this procedure for next class
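The per-class procedure can be sketched as IoU plus a greedy suppression loop. This is a minimal illustration, not the authors' implementation: the corner-format boxes (x1, y1, x2, y2), the function names, and the 0.5 IoU threshold are assumptions.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # zero if the boxes do not overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS for one class; returns indices of the boxes that are kept."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap an already kept, higher-scoring box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical scores of one class for three boxes.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first and is suppressed
```

Running `nms` once per class, with that class's scores, mirrors the "do this procedure for next class" step.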
Object Detection Datasets
• COCO: Dataset for object detection, image segmentation and image captioning. It
has more than 200k images with 80 object categories.
http://cocodataset.org/#home
• Pascal VOC: Dataset of 20k images labelled with bounding boxes and 20 classes.
http://host.robots.ox.ac.uk/pascal/VOC/
• Open Images: 9M images that have been annotated with image-level labels and
object bounding boxes.
https://storage.googleapis.com/openimages/web/index.html
What is Segmentation?
1. Input: images
2. Output: regions, structures
3. Most of the time, we need to "process the image"
1. filters
2. gradient information
3. color information
4. etc.
• That is not quite how a human sees an image.
• What if we want to understand the image?
Semantic Segmentation
• Semantic segmentation refers to the process of linking each pixel in an image to a
class label. These labels could include a person, car, flower, piece of furniture, etc.,
just to mention a few. We can think of semantic segmentation as image classification
at a pixel level.
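"Image classification at a pixel level" can be illustrated with a tiny example: given per-class scores for every pixel, the label map is the index of the highest score at each pixel. The nested-list layout and the scores are hypothetical; a real network would produce these scores as a tensor.

```python
def segment(scores):
    """scores[r][c] is a list of per-class scores for pixel (r, c).

    Returns a label map with the highest-scoring class index for each pixel.
    """
    return [
        [max(range(len(px)), key=px.__getitem__) for px in row]
        for row in scores
    ]


# Hypothetical 2x2 image with two classes (0 = background, 1 = object).
scores = [
    [[0.1, 0.9], [0.8, 0.2]],
    [[0.3, 0.7], [0.6, 0.4]],
]
print(segment(scores))  # [[1, 0], [1, 0]]
```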
Why Semantic Segmentation?
It is an important step towards complete scene understanding in computer vision.
1. robot vision and understanding
2. autonomous driving
3. medical purposes (ISBI Challenge)
From left to right: Li et al. (CSI), CVPR 2013; Long et al. (FCN), CVPR 2015; Chen et al. (DeepLab), PAMI 2018
Deep Learning in Segmentation
Deep Learning in Semantic Segmentation: FCNs
• Fully Convolutional Networks (FCNs) for Semantic Segmentation
Resources
1. Roger Grosse and Jimmy Ba, CSC421/2516 Winter 2019, Neural Networks and Deep Learning, http://www.cs.toronto.edu.
2. Related lecture from CS231n, Stanford, http://cs231n.stanford.edu/
3. Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
4. YOLO, https://zhuanlan.zhihu.com/p/24916786?refer=xiaoleimlnote