
Artificial Neural Network

and Deep Learning


Lecture 9
Convolutional Neural Networks
(CNN)
CNN Architectures for Object Detection

Agenda
CNN Architectures for
• Object Detection
• Segmentation
• Semantic Segmentation

Object Detection
• The task of assigning a label and a bounding box to all objects in the image.
• An image contains more information than a single class label.

(PASCAL VOC 2012)

Different Deep Learning Architectures for Object Detection
• R-CNN: Girshick et al., "Rich feature hierarchies for accurate object detection and semantic segmentation," CVPR 2014
• Fast R-CNN: Girshick, "Fast R-CNN," ICCV 2015
• Faster R-CNN: Ren et al., "Faster R-CNN: Towards real-time object detection with region proposal networks," NIPS 2015
• SSD: Liu et al., "SSD: Single Shot MultiBox Detector," arXiv 2015
• YOLO: Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," CVPR 2016

Object Detection
• Object detection as a CLASSIFICATION problem
• Object detection as a REGRESSION problem

Object Detection as Classification


Classes = [cat, dog, duck]

[Figure: a window slides over the image; at each position the classifier answers "Cat? Dog? Duck?" All answers are NO except at the window centered on the cat, where "Cat? YES".]

Object Detection as Classification

Problem:
Too many window positions and scales to test.

Solution: If your classifier is fast enough, go for it.
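
To see why, here is a minimal sketch of sliding-window detection; classify_crop is a hypothetical classifier returning a (label, score) pair for an image crop:

def sliding_window_detect(image, classify_crop,
                          window_sizes=(64, 128, 256), stride=16):
    # Exhaustively test every window position at several scales.
    height, width = image.shape[:2]
    detections = []
    for size in window_sizes:                       # one pass per scale
        for y in range(0, height - size + 1, stride):
            for x in range(0, width - size + 1, stride):
                label, score = classify_crop(image[y:y + size, x:x + size])
                if label is not None:
                    detections.append((x, y, size, size, label, score))
    return detections

Even at stride 16 with three scales, a 640x480 image already requires thousands of classifier evaluations, which is why the classifier must be very fast.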

Object Detection with R-CNNs


• R-CNN: first find regions likely to contain objects (region proposals), then use a CNN to classify each region.

Shortcomings:
• Slow; impractical for real-time detection.
• Hard to optimize, since the pipeline is trained in separate stages.

Object Detection as Regression Problem

Reference: Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," CVPR 2016

• YOLO frames detection as a single regression problem.

YOLO Features:
• Extremely fast (45 frames per second).
• Reasons globally about the entire image.

YOLO Neural Network: You Only Look Once


• You Only Look Once (YOLO) only needs to
process an image once to perform
detection.
• A neural network predicts
• bounding boxes and
• class probabilities
directly from full images in one evaluation.

Reference: Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016

YOLO, Detection Procedure
1- Divide the image into an S x S grid.
• If the center of an object falls into a grid cell, the cell is
responsible for detecting that object.

7x7 grid
2- Each grid cell predicts B boxes (x, y, w, h) and a confidence P(Object) for each box.

P(Object): the probability that the box contains an object.

Reference: Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," CVPR 2016 (B = 2)
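
A small illustrative sketch (not the authors' code) of step 1: mapping an object's normalized center coordinates to the grid cell responsible for detecting it:

def responsible_cell(cx, cy, S=7):
    # (cx, cy) is the object's center, normalized to [0, 1].
    col = min(int(cx * S), S - 1)   # grid column containing the center
    row = min(int(cy * S), S - 1)   # grid row containing the center
    return row, col

row, col = responsible_cell(0.52, 0.31)   # cell (2, 3) on the 7x7 grid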

YOLO, Detection Procedure, cont.


3- Each grid cell also predicts C conditional class probabilities (conditioned on the grid cell containing an object), the class scores:
Pr(Class_i | Object)

Then combine the box confidences and class predictions:
Pr(Class_i | Object) * P(Object) = P(Class_i)

Finally:
- threshold the detections, and
- apply Non-Maximum Suppression (NMS).

Reference: Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," CVPR 2016
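
A minimal NumPy sketch of this score combination, assuming the PASCAL VOC shapes (S=7, B=2, C=20); the random arrays are stand-ins for network outputs:

import numpy as np

S, B, C = 7, 2, 20
class_probs = np.random.rand(S, S, C)   # Pr(Class_i | Object), one set per cell
box_conf = np.random.rand(S, S, B)      # P(Object), one per box

# class_scores[i, j, b, c] = Pr(Class_c | Object) * P(Object) for box b of cell (i, j)
class_scores = class_probs[:, :, None, :] * box_conf[:, :, :, None]
print(class_scores.shape)               # (7, 7, 2, 20)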

YOLO: Experiment Using 20 Classes
• There are 20 categories.
• 7x7 grid cells.
• Each grid cell predicts 2 bounding boxes and 20 class probabilities.

YOLO: Experiment Using 20 Classes, cont.


• The output of the last layer is S x S x (5B + C) = 7 x 7 x (5x2 + 20) = 7 x 7 x 30 dimensions.
• Each 1 x 1 x 30 slice corresponds to one of the 7 x 7 cells of the original image, and contains that cell's B box coordinate predictions and category predictions.
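
An illustrative NumPy sketch of slicing that output tensor, assuming the B boxes are stored first, each as (x, y, w, h, confidence); the actual memory layout in an implementation may differ:

import numpy as np

S, B, C = 7, 2, 20
output = np.random.rand(S, S, 5 * B + C)   # stand-in for the last layer's output

boxes = output[..., :5 * B].reshape(S, S, B, 5)
box_coords = boxes[..., :4]                # B box coordinate predictions per cell
box_conf = boxes[..., 4]                   # P(Object) per box
class_probs = output[..., 5 * B:]          # C conditional class probabilities per cell
print(box_coords.shape, box_conf.shape, class_probs.shape)
# (7, 7, 2, 4) (7, 7, 2) (7, 7, 20)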

YOLO: Experiment Using 20 Classes, cont.

Reference: https://zhuanlan.zhihu.com/p/24916786?refer=xiaoleimlnote

YOLO Design
• The YOLO detection network includes
• 24 convolutional layers and 2 fully connected layers.
• The convolutional layers extract image features, and the fully connected layers predict the box coordinates and class probabilities.

Reference: Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016

YOLO Design, cont.
• A modified GoogLeNet: instead of Inception modules, it uses 1x1 reduction layers followed by 3x3 convolutional layers.

Reference: Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016

YOLO: Training
1) Pretrain on the ImageNet 1000-class classification dataset.

YOLO: Training, cont.
2) Convert the network to detection ("Networks on Convolutional Feature Maps") and increase the input resolution from 224x224 to 448x448.

YOLO: Training Strategy

Epochs = 135
batch_size = 64
momentum = 0.9
weight_decay = 0.0005
lr = [1e-3, 1e-2, 1e-3, 1e-4] (schedule sketched below)
dropout_rate = 0.5
augmentation = [scaling, translation, exposure, saturation]
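
The schedule is piecewise: the paper slowly raises the rate from 1e-3 to 1e-2 over the first epochs, then trains at 1e-2 for 75 epochs, 1e-3 for 30 epochs, and 1e-4 for 30 epochs. A sketch (the warm-up length here is an assumption):

def learning_rate(epoch, warmup_epochs=5):
    # Warm-up: raise the rate linearly from 1e-3 to 1e-2 (length assumed).
    if epoch < warmup_epochs:
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < warmup_epochs + 75:    # 75 epochs at 1e-2
        return 1e-2
    if epoch < warmup_epochs + 105:   # then 30 epochs at 1e-3
        return 1e-3
    return 1e-4                       # final epochs at 1e-4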

YOLO: Inference

• Inference works just like training: a single network evaluation per image.


S=7, B=2 for Pascal VOC

A Look at the Detection Procedure

Get the first class's scores for each bounding box.

Non-Maximum Suppression

1. Get the bbox with the maximum score; denote it "bbox_max".

2. Compare "bbox_max" with each remaining lower-scoring (non-zero!) bbox; denote the current one "bbox_cur".

3. If IoU(bbox_max, bbox_cur) > 0.5, set the score of bbox_cur to 0.

How is IoU computed?

Intersection over Union (IoU)
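
IoU is the area of the two boxes' intersection divided by the area of their union. A minimal sketch for axis-aligned boxes given as (x1, y1, x2, y2) corners:

def iou(box_a, box_b):
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = sum of the two areas minus the intersection.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # 1 / 7, about 0.143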

Repeat this procedure for the next class, and then for all classes (a full sketch follows below).
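
Putting the steps together, a sketch of the per-class suppression loop described above, reusing iou() from the previous snippet; boxes holds (x1, y1, x2, y2) tuples and scores the matching class scores:

def nms_per_class(boxes, scores, iou_threshold=0.5):
    scores = list(scores)
    # Visit boxes from highest to lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    for idx, i in enumerate(order):
        if scores[i] == 0:                 # already suppressed
            continue
        for j in order[idx + 1:]:          # lower-scoring boxes
            if scores[j] != 0 and iou(boxes[i], boxes[j]) > iou_threshold:
                scores[j] = 0              # suppress the duplicate detection
    return scores                          # zeroed entries mark suppressed boxes

Run this once per class; boxes whose scores survive for some class are the final detections.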

Object Detection Dataset

• COCO: Dataset for object detection, image segmentation and image captioning. It
has more than 200k images with 80 object categories.
http://cocodataset.org/#home
• Pascal VOC: Dataset of 20k images labelled with bounding boxes and 20 classes.
http://host.robots.ox.ac.uk/pascal/VOC/
• Open Images: 9M images that have been annotated with image-level labels and
object bounding boxes.
https://storage.googleapis.com/openimages/web/index.html

What is Segmentation?
1. Input: images
2. Output: regions, structures
3. Most of the time, we need to "process the image"
1. filters
2. gradient information
3. color information
4. etc.
• This kind of low-level processing is far from how humans perceive images.
• What if we want to understand the image?

Semantic Segmentation
• Semantic segmentation refers to the process of linking each pixel in an image to a
class label. These labels could include a person, car, flower, piece of furniture, etc.,
just to mention a few. We can think of semantic segmentation as image classification
at a pixel level.
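
A tiny NumPy sketch of this view: given per-pixel class scores, the segmentation map is simply the argmax over the class axis (shapes and names here are illustrative):

import numpy as np

H, W, C = 4, 4, 3                        # tiny image, 3 classes
pixel_scores = np.random.rand(H, W, C)   # stand-in for a network's per-pixel scores
segmentation_map = pixel_scores.argmax(axis=-1)   # (H, W): one class label per pixel
print(segmentation_map)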

"Two men riding on a bike


in front of a building on
the road. And there is a
car."

Why Semantic Segmentation?
It is an important step towards complete scene understanding in computer vision.
1. robot vision and understanding
2. autonomous driving
3. medical purposes (ISBI Challenge)

Deep Learning in Segmentation


• After the success of CNN classifiers, segmentation models quickly moved away from hand-crafted features and pipelines and instead use CNNs as the main structure.
• Pre-trained ImageNet classification networks serve as building blocks for all the state-of-the-art CNN-based segmentation models.

From left to right: Li et al. (CSI), CVPR 2013; Long et al. (FCN), CVPR 2015; Chen et al. (DeepLab), PAMI 2018

Deep Learning in Segmentation

Driving Scene Segmentation

Deep Learning in Semantic Segmentation: FCNs
• Fully Convolutional Networks (FCNs) for Semantic Segmentation

(Fully Convolutional Networks, 2015)
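
A minimal FCN-style sketch in PyTorch, not the paper's exact architecture: the fully connected layers of a classifier are replaced by a 1x1 convolution, and the coarse score map is upsampled back to the input resolution:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        # Small convolutional backbone that downsamples by 4x.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # 1x1 convolution plays the role of the fully connected classifier.
        self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        h = self.classifier(self.backbone(x))      # coarse per-class score map
        return F.interpolate(h, size=x.shape[2:],  # upsample to input resolution
                             mode="bilinear", align_corners=False)

scores = TinyFCN()(torch.randn(1, 3, 64, 64))      # shape: (1, 21, 64, 64)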

Resources
1. Roger Grosse and Jimmy Ba, CSC421/2516 Winter 2019, Neural Networks and Deep Learning, http://www.cs.toronto.edu.
2. Related lecture from CS231n @ Stanford, http://cs231n.stanford.edu/
3. Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
4. YOLO, https://zhuanlan.zhihu.com/p/24916786?refer=xiaoleimlnote

