
Artificial Neural Network

and Deep Learning


Lecture 9
Convolutional Neural Networks
(CNN)
CNN Architectures for Object Detection

Agenda
CNN Architectures for
• Object Detection
• Segmentation
• Semantic Segmentation

Object Detection
• The task of assigning a label and a bounding box to all objects in the image.
• An image contains more information than a single class label.

(PASCAL VOC 2012)

Different Deep Learning Architectures for Object Detection
• R-CNN: Girshick et al., "Rich feature hierarchies for accurate object detection and semantic segmentation," CVPR 2014
• Fast R-CNN: Girshick, "Fast R-CNN," ICCV 2015
• Faster R-CNN: Ren et al., "Faster R-CNN: Towards real-time object detection with region proposal networks," NIPS 2015
• SSD: Liu et al., "SSD: Single Shot MultiBox Detector," arXiv 2015
• YOLO: Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," CVPR 2016

Object Detection
• Object detection as a CLASSIFICATION problem
• Object detection as a REGRESSION problem

Object Detection as Classification


Classes = [cat, dog, duck]

[Figure: a window slides over the image; at each position the classifier answers "Cat? Dog? Duck?" All answers are NO except at the window centered on the cat, where "Cat? YES".]

Object Detection as Classification

Problem:
Too many window positions and scales to test.

Solution: If your classifier is fast enough, go for it.
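
To see why, here is a minimal sketch of sliding-window detection; classify_crop is a hypothetical classifier returning a (label, score) pair for an image crop:

def sliding_window_detect(image, classify_crop,
                          window_sizes=(64, 128, 256), stride=16):
    # Exhaustively test every window position at several scales.
    height, width = image.shape[:2]
    detections = []
    for size in window_sizes:                       # one pass per scale
        for y in range(0, height - size + 1, stride):
            for x in range(0, width - size + 1, stride):
                label, score = classify_crop(image[y:y + size, x:x + size])
                if label is not None:
                    detections.append((x, y, size, size, label, score))
    return detections

Even at stride 16 with three scales, a 640x480 image already requires thousands of classifier evaluations, which is why the classifier must be very fast.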

Object Detection with R-CNNs


• R-CNN: first find regions likely to contain objects (region proposals), then use a CNN to classify each region.

Shortcomings:
• Slow; impractical for real-time detection.
• Hard to optimize, since the pipeline is trained in separate stages.

Object Detection as Regression Problem

Reference: Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," CVPR 2016

• YOLO frames detection as a single regression problem.

YOLO Features:
• Extremely fast (45 frames per second).
• Reasons globally about the entire image.

YOLO Neural Network: You Only Look Once


• You Only Look Once (YOLO) only needs to
process an image once to perform
detection.
• A neural network predicts
• bounding boxes and
• class probabilities
directly from full images in one evaluation.

Reference: Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016

YOLO, Detection Procedure
1- Divide the image into an S x S grid.
• If the center of an object falls into a grid cell, the cell is
responsible for detecting that object.

7x7 grid
2- Each grid cell predicts B boxes (x, y, w, h) and a confidence P(Object) for each box.

P(Object): the probability that the box contains an object.

Reference: Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," CVPR 2016 (B = 2)
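
A small illustrative sketch (not the authors' code) of step 1: mapping an object's normalized center coordinates to the grid cell responsible for detecting it:

def responsible_cell(cx, cy, S=7):
    # (cx, cy) is the object's center, normalized to [0, 1].
    col = min(int(cx * S), S - 1)   # grid column containing the center
    row = min(int(cy * S), S - 1)   # grid row containing the center
    return row, col

row, col = responsible_cell(0.52, 0.31)   # cell (2, 3) on the 7x7 grid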

YOLO, Detection Procedure, cont.


3- Each grid cell also predicts C conditional class probabilities (conditioned on the grid cell containing an object), the class scores:
Pr(Class_i | Object)

Then combine the box confidences and class predictions:
Pr(Class_i | Object) * P(Object) = P(Class_i)

Finally:
- threshold the detections, and
- apply Non-Maximum Suppression (NMS).

Reference: Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," CVPR 2016
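
A minimal NumPy sketch of this score combination, assuming the PASCAL VOC shapes (S=7, B=2, C=20); the random arrays are stand-ins for network outputs:

import numpy as np

S, B, C = 7, 2, 20
class_probs = np.random.rand(S, S, C)   # Pr(Class_i | Object), one set per cell
box_conf = np.random.rand(S, S, B)      # P(Object), one per box

# class_scores[i, j, b, c] = Pr(Class_c | Object) * P(Object) for box b of cell (i, j)
class_scores = class_probs[:, :, None, :] * box_conf[:, :, :, None]
print(class_scores.shape)               # (7, 7, 2, 20)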

YOLO: Experiment Using 20 Classes
• There are 20 categories.
• 7x7 grid cells.
• Each grid cell predicts 2 bounding boxes and 20 class probabilities.

YOLO: Experiment Using 20 Classes, cont.


• The output of the last layer is S x S x (5B + C) = 7 x 7 x (5x2 + 20) = 7 x 7 x 30 dimensions.
• Each 1 x 1 x 30 slice corresponds to one of the 7 x 7 cells of the original image, and contains that cell's B box coordinate predictions and category predictions.
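
An illustrative NumPy sketch of slicing that output tensor, assuming the B boxes are stored first, each as (x, y, w, h, confidence); the actual memory layout in an implementation may differ:

import numpy as np

S, B, C = 7, 2, 20
output = np.random.rand(S, S, 5 * B + C)   # stand-in for the last layer's output

boxes = output[..., :5 * B].reshape(S, S, B, 5)
box_coords = boxes[..., :4]                # B box coordinate predictions per cell
box_conf = boxes[..., 4]                   # P(Object) per box
class_probs = output[..., 5 * B:]          # C conditional class probabilities per cell
print(box_coords.shape, box_conf.shape, class_probs.shape)
# (7, 7, 2, 4) (7, 7, 2) (7, 7, 20)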

YOLO: Experiment Using 20 Classes, cont.

Reference: https://zhuanlan.zhihu.com/p/24916786?refer=xiaoleimlnote

YOLO Design
• The YOLO detection network includes
• 24 convolutional layers and 2 fully connected layers.
• The convolutional layers extract image features, and the fully connected layers predict the box coordinates and class probabilities.

Reference: Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016

YOLO Design, cont.
• A modified GoogLeNet: instead of Inception modules, it uses 1x1 reduction layers followed by 3x3 convolutional layers.

Reference: Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016

YOLO: Training
1) Pretrain on the ImageNet 1000-class classification dataset.

YOLO: Training, cont.
2) Convert the network to detection ("Networks on Convolutional Feature Maps") and increase the input resolution from 224x224 to 448x448.

YOLO: Training Strategy

Epochs = 135
batch_size = 64
momentum = 0.9
weight_decay = 0.0005
lr = [1e-3, 1e-2, 1e-3, 1e-4] (schedule sketched below)
dropout_rate = 0.5
augmentation = [scaling, translation, exposure, saturation]
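
The schedule is piecewise: the paper slowly raises the rate from 1e-3 to 1e-2 over the first epochs, then trains at 1e-2 for 75 epochs, 1e-3 for 30 epochs, and 1e-4 for 30 epochs. A sketch (the warm-up length here is an assumption):

def learning_rate(epoch, warmup_epochs=5):
    # Warm-up: raise the rate linearly from 1e-3 to 1e-2 (length assumed).
    if epoch < warmup_epochs:
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < warmup_epochs + 75:    # 75 epochs at 1e-2
        return 1e-2
    if epoch < warmup_epochs + 105:   # then 30 epochs at 1e-3
        return 1e-3
    return 1e-4                       # final epochs at 1e-4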

YOLO: Inference

• Inference works just like training: a single network evaluation per image.


S=7, B=2 for Pascal VOC

A Look at the Detection Procedure

Get the first class's scores for each bounding box.

Non-Maximum Suppression

1. Get the bbox with the maximum score; denote it "bbox_max".

2. Compare "bbox_max" with each remaining lower-scoring (non-zero!) bbox; denote the current one "bbox_cur".

3. If IoU(bbox_max, bbox_cur) > 0.5, set the score of bbox_cur to 0.

How is IoU computed?

Intersection over Union (IoU)
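
IoU is the area of the two boxes' intersection divided by the area of their union. A minimal sketch for axis-aligned boxes given as (x1, y1, x2, y2) corners:

def iou(box_a, box_b):
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = sum of the two areas minus the intersection.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # 1 / 7, about 0.143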

Repeat this procedure for the next class, and then for all classes (a full sketch follows below).
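
Putting the steps together, a sketch of the per-class suppression loop described above, reusing iou() from the previous snippet; boxes holds (x1, y1, x2, y2) tuples and scores the matching class scores:

def nms_per_class(boxes, scores, iou_threshold=0.5):
    scores = list(scores)
    # Visit boxes from highest to lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    for idx, i in enumerate(order):
        if scores[i] == 0:                 # already suppressed
            continue
        for j in order[idx + 1:]:          # lower-scoring boxes
            if scores[j] != 0 and iou(boxes[i], boxes[j]) > iou_threshold:
                scores[j] = 0              # suppress the duplicate detection
    return scores                          # zeroed entries mark suppressed boxes

Run this once per class; boxes whose scores survive for some class are the final detections.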

Object Detection Dataset

• COCO: Dataset for object detection, image segmentation and image captioning. It
has more than 200k images with 80 object categories.
http://cocodataset.org/#home
• Pascal VOC: Dataset of 20k images labelled with bounding boxes and 20 classes.
http://host.robots.ox.ac.uk/pascal/VOC/
• Open Images: 9M images that have been annotated with image-level labels and
object bounding boxes.
https://storage.googleapis.com/openimages/web/index.html

What is Segmentation?
1. Input: images
2. Output: regions, structures
3. Most of the time, we need to "process the image"
1. filters
2. gradient information
3. color information
4. etc.
• This kind of low-level processing is far from how humans perceive images.
• What if we want to understand the image?

Semantic Segmentation
• Semantic segmentation refers to the process of linking each pixel in an image to a
class label. These labels could include a person, car, flower, piece of furniture, etc.,
just to mention a few. We can think of semantic segmentation as image classification
at a pixel level.
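
A tiny NumPy sketch of this view: given per-pixel class scores, the segmentation map is simply the argmax over the class axis (shapes and names here are illustrative):

import numpy as np

H, W, C = 4, 4, 3                        # tiny image, 3 classes
pixel_scores = np.random.rand(H, W, C)   # stand-in for a network's per-pixel scores
segmentation_map = pixel_scores.argmax(axis=-1)   # (H, W): one class label per pixel
print(segmentation_map)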

"Two men riding on a bike


in front of a building on
the road. And there is a
car."

Why Semantic Segmentation?
It is an important step towards complete scene understanding in computer vision.
1. robot vision and understanding
2. autonomous driving
3. medical purposes (ISBI Challenge)

Deep Learning in Segmentation


• After the success of CNN classifiers, segmentation models quickly moved away from hand-crafted features and pipelines and instead use CNNs as the main structure.
• Pre-trained ImageNet classification networks serve as building blocks for all the state-of-the-art CNN-based segmentation models.

From left to right: Li et al. (CSI), CVPR 2013; Long et al. (FCN), CVPR 2015; Chen et al. (DeepLab), PAMI 2018

Deep Learning in Segmentation

Driving Scene Segmentation

Deep Learning in Semantic Segmentation: FCNs
• Fully Convolutional Networks (FCNs) for Semantic Segmentation

(Fully Convolutional Networks, 2015)
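
A minimal FCN-style sketch in PyTorch, not the paper's exact architecture: the fully connected layers of a classifier are replaced by a 1x1 convolution, and the coarse score map is upsampled back to the input resolution:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        # Small convolutional backbone that downsamples by 4x.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # 1x1 convolution plays the role of the fully connected classifier.
        self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        h = self.classifier(self.backbone(x))      # coarse per-class score map
        return F.interpolate(h, size=x.shape[2:],  # upsample to input resolution
                             mode="bilinear", align_corners=False)

scores = TinyFCN()(torch.randn(1, 3, 64, 64))      # shape: (1, 21, 64, 64)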

Resources
1. Roger Grosse and Jimmy Ba, CSC421/2516 Winter 2019, Neural Networks and Deep Learning, http://www.cs.toronto.edu.
2. Related lecture from CS231n @ Stanford, http://cs231n.stanford.edu/
3. Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
4. YOLO, https://zhuanlan.zhihu.com/p/24916786?refer=xiaoleimlnote

