Object Detection and Recognition for a Pick and Place Robot
Rahul Kumar
The University of the South Pacific, Suva, Fiji
Email: rahul.kumar@usp.ac.fj

Sanjesh Kumar
The University of the South Pacific, Suva, Fiji
Email: s11065712@student.usp.ac.fj

Sunil Lal
The University of the South Pacific, Suva, Fiji
Email: sunil.lal@usp.ac.fj

Praneel Chand
The University of the South Pacific, Suva, Fiji
Email: chand_pc@usp.ac.fj

Abstract—Controlling a robotic arm for applications such as object sorting with the use of vision sensors requires a robust image processing algorithm to recognize and detect the target object. This paper is directed towards the development of the image processing algorithm which is a pre-requisite for the full operation of a pick and place robotic arm intended for an object sorting task. For this type of task, the objects are first detected, and this is accomplished by a feature extraction algorithm. Next, the extracted image (with parameters in compliance with the classifier) is sent to the classifier to recognize what object it is and, once this is finalized, the output is the type of the object along with its coordinates, ready for the robotic arm to execute the pick and place task. The major challenge faced in developing this image processing algorithm was that, upon making the test subjects comply with the classifier parameters, resizing of the images resulted in the loss of pixel data. Therefore, a centered image approach was taken. The accuracy of the classifier developed in this paper was 99.33% and, for the feature extraction algorithm, the accuracy was 83.6443%. Finally, the overall system performance of the image processing algorithm developed after experimentation was 82.7162%.

Keywords – Object Detection, Object Recognition, Feature Extraction, Classifier.

I. INTRODUCTION

Vision based control of a robotic system is the use of visual sensors as feedback information to control the operation of the robot. Integration of vision based algorithms can enhance the performance and the efficiency of the system. Vision based configurations have been implemented to mimic human visual sensors. Orienting towards robotic arms, object recognition is vital for the operation of arms in navigation and grasping tasks. Often it has been the case that image processing (IP) algorithms require huge processing time for the successful implementation of object recognition.

The work presented in [1] critically explains the basic algorithms to be addressed before applying image processing techniques. These techniques include image enhancement, noise reduction and a visual loop algorithm (based on a trial and error approach). Moreover, the works of [2], [3] and [4] present IP algorithms and approaches to reduce response time and increase the efficiency of object recognition tasks. In [8], the discussion is based on the reduction of computation time using the Trainarp algorithm (derived from ANN). It also presents the method to migrate from the statistical approach to Artificial Neural Networks (ANN). The author has stated the efficiency as 95% and the response time as 94 ms. Likewise, [3] has conversed on employing a parallel programming approach called the object surface reconstruction method. Upon comparison with the serial approach, the parallel programming method is ten times faster. To reduce cost and improve performance, [4] has presented the communication of a vision system via USB. The vision system used was a webcam through which, via MATLAB, the system is enabled to perceive the environment through artificial vision via IP algorithms.

Along with the classification part, the concept of Feature Extraction (FE) is also studied. FE mostly acts as a pre-processing algorithm to furnish the dataset for the classifier to make important decisions/classifications. The work of [5] elaborated on the usage of a multi-stereo vision technique for the detection of 3D objects. By eliminating the background, i.e. the objects of least interest, using opening and closing morphological techniques, 3D detection of a particular object was achieved. Similarly, [6] conversed about one of the robust object detection algorithms. This algorithm is known as the Viola-Jones IP method, a state of the art face detector. The robustness of this algorithm is due to a cascaded architecture of strong classifiers arranged in order of complexity. This approach was incorporated to reduce the processing time. Lastly, feature extraction via contour matching is also one of the best methods to detect objects [7]. The trained shape is matched according to a probabilistically motivated distance measure which enhances the shape comparisons within the framework. The work in [7] also presented noise reduction and other image optimization via segmentation and other IP techniques.
The goal of this paper is to develop an IP technique which will involve the FE and classification algorithms suitable for an object sorting task. Additionally, the system to be developed needs to be robust as it will be tested on a real time basis. It is planned to use the developed IP technique on the SCORBOT ER-4U (robotic arm platform) [9], which will be refurbished and utilized to sort electronic components such as resistors and capacitors for laboratory technicians. The remainder of this paper covers the algorithms of feature extraction and classification, further discusses the determination of object location, and portrays all the results carried out for the development of the algorithms. Finally, it makes concluding remarks on the results and gives further recommendations on how to improve the developed model.

II. CONCEPTUAL FRAMEWORK OF THE ENTIRE SYSTEM

Figure 1: Conceptual Framework of the whole system (block diagram titled "Vision Based Classification of Objects": image taken from workspace and resized according to workspace dimensions → Feature Extraction → complying to the test standards of the classifier → Classifier → object classified and coordinates of the object given → Robotic Arm performs the sorting task)

The above figure shows how the image processing algorithm will work. The constant variable in this case is the x-y dimensions of the workspace. The image taken is first standardized according to the workspace dimensions. This is achieved by resizing the taken image according to the dimensions of the workspace:

$\text{width of image (pixels)} = (37.875275591)\, w_s$   (1)
$\text{height of image (pixels)} = (37.875275591)\, h_s$   (2)

Note: $w_s$ and $h_s$ are the width and height of the workspace (where the components are) in centimeters.

Next, via the feature extraction algorithm, objects are detected, cropped and resized according to the classifier specifications. Then, using the ANN classifier, the detected object is classified and the coordinates of the corresponding object are determined. The scope of this paper is limited up to this point; from here on it is the task of the robotic arm to pick the object from the location (coordinates) specified and place it according to the user's discretion.

III. THE FEATURE EXTRACTION ALGORITHM

The feature extraction part in the development of this model plays a vital role as it furnishes the raw image and complies it according to the classifier's specifications.

Figure 2: Feature Extraction before Classification

The above framework represents the images in the cluttered scene (test subjects) to be tested for correct classification. For accurate classification of the objects, the feature extraction algorithm needs to be considered prudently.

A. Algorithm

1. The image is read and converted to grayscale. The grayscale conversion is achieved by eliminating hue and saturation information while preserving the luminance. The colored image (RGB image) is 3 dimensional. To convert to a 2 dimensional grayscale image, the following equation represents the correct proportion of RED, GREEN and BLUE pixels to be taken into account:

$0.2989\,R + 0.5870\,G + 0.1140\,B$   (3)

Figure 3: The Original Image

2. Then the grayscale image is converted into a binary image. The method used to convert to a binary image is known as Otsu's method. This is a non-parametric and unsupervised method of automatic threshold selection for picture segmentation. The threshold selected by Otsu's method will intend to minimize the intra-class variance of black and white pixels.
*Even though the threshold value can be selected independently by the user, it is set to auto mode since the images will vary during the tests.
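As an illustration of the standardization and the first two steps of the algorithm, the following is a minimal MATLAB sketch (assuming the Image Processing Toolbox; the file name and workspace size are illustrative and not taken from the paper):

    ws = 30;  hs = 20;                          % hypothetical workspace width/height in cm
    rgb  = imread('scene.jpg');                 % raw scene image
    rgb  = imresize(rgb, round([37.875275591*hs, 37.875275591*ws]));  % equations (1)-(2)
    gray = rgb2gray(rgb);                       % equation (3): 0.2989R + 0.5870G + 0.1140B
    bw   = im2bw(gray, graythresh(gray));       % step 2: Otsu's automatic threshold

Here graythresh implements Otsu's method, so the binarization threshold is selected automatically, matching the auto-mode behavior described above.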


Figure 4: Binary Image

3. Edge Detection
Once the images are converted to binary, an edge detection criterion is applied to detect edges in a given image. This part is very crucial as this criterion will act as a pre-requisite for the BLOB analysis in the later procedures. The edge detection algorithm developed is based on the "Canny method". This method finds edges by looking for the local maxima of the gradient of the image. This gradient is based on the Gaussian filter.

Overview of the Canny Method

a. Smoothing: the effect of Gaussian smoothing is blurring of an image, and the degree of blurring is entirely based on the standard deviation of the Gaussian distribution. The larger the standard deviation, the more blurred the image. In this model a value of 1.41 is set as the standard deviation value.
b. Finding Gradients: the Canny method finds edges where the intensity of the image changes the most. The gradient magnitudes are determined by the Euclidean distance. Hence, the gradients are calculated by:

$|G| = \sqrt{G_x^2 + G_y^2}$   (4)
$|G| = |G_x| + |G_y|$   (5)

where $G_x$ and $G_y$ are the gradients in the x and y directions.
The direction of the gradient needs to be stored to keep track for the upcoming Non-Maximum Suppression method. The direction is stored in:

$\theta = \arctan\!\left(\frac{G_y}{G_x}\right)$   (6)

c. Non-Maximum Suppression: at the above step the edges were blurry; this method converts these blurry edges to sharp edges by preserving all the local maxima in the gradient image and deleting the rest.
d. Double Thresholding: this is where the edge selections are optimized. This step uses double thresholding and is not likely to be fooled by noise/color variations due to rough surfaces. The pixels which are stronger than the high threshold value are preserved; however, the pixels which are weaker than the lower threshold value are suppressed.
e. Edge Tracking by Hysteresis: here, strong edges (with higher threshold values) and weak edges (if any) which are connected to the strong edges are finally preserved. By default, 8-connected pixels in the Canny method are divided into BLOBs. The BLOBs containing at least one strong pixel are selected and preserved, while the others are suppressed.

Figure 5: Edge Detection

4. Image Dilation: This procedure enlarges the edges created by the edge detection algorithm on the basis of a rectangular structuring element.

Figure 6: Image Dilation

5. Image Filling: The dilated edges are filled such that the outline of the edges is more clear and visible.
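A rough MATLAB sketch of steps 3 to 5 is given below, assuming bw is the binary image from step 2; the 3 by 3 size of the rectangular structuring element is an assumption, as the paper does not state its dimensions:

    edges  = edge(bw, 'canny', [], 1.41);   % step 3: Canny edges, sigma = 1.41 as stated above
    se     = strel('rectangle', [3 3]);     % assumed rectangular structuring element
    thick  = imdilate(edges, se);           % step 4: image dilation
    filled = imfill(thick, 'holes');        % step 5: image filling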
Figure 7: Image Filling

6. BLOB Analysis: Since the image being analyzed is in the form of a cluttered scene, connected objects as per the strong filled edges will form a BLOB and are identified as an object. Likewise, each BLOB will form an object.

Once the objects are detected, a rectangular bounding box is formed around the detected object and it is cropped as outlined by the bounding box. Then that cropped image is resized to 20 by 20 pixels and converted to grayscale to be ready for testing (for classification).

Figure 8: Object Detection
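The BLOB analysis and cropping stage can be sketched in MATLAB as follows, assuming filled is the filled binary image and gray the grayscale scene from the earlier steps (variable names are illustrative, not the authors' code):

    stats = regionprops(filled, 'BoundingBox');        % one entry per BLOB (detected object)
    X = zeros(numel(stats), 400);                      % one 20x20 sample per row
    for k = 1:numel(stats)
        crop    = imcrop(gray, stats(k).BoundingBox);  % cut out the detected object
        sample  = imresize(crop, [20 20]);             % comply with the classifier input size
        X(k, :) = double(sample(:)') / 255;            % 400-element feature vector
    end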

IV. CONCEPTUAL FRAMEWORK FOR THE TRAINING OF CLASSIFIER

Figure 9: Conceptual Framework for the training phase (block diagram titled "Training the Classifier: Artificial Neural Network": images from the feature extraction algorithm → batch grayscale and resizing to 20 by 20 pixels, write image data to a .txt file → initialize random weights and perform forward propagation → run Back-Propagation to optimize weights → perform 5-fold cross-validation and select the best model)

The above block diagram shows the process by which the training data is manipulated and trained by the classifier. The JPEG formatted images are the training data, which consist of images of resistors and capacitors equally weighted (same number). These images are first converted to grayscale and then resized to 20 by 20 pixels. Thereafter, the data file (in .txt format) is used to train the classifier. Once the weights are optimized via the Back-Propagation algorithm, the values of the weights are written to a file which is used to perform tests (to determine the reliability of the classifier).

Software package used: MATLAB r2012b

V. DEVELOPMENT AND TRAINING OF THE CLASSIFIER

The classifier is developed using Artificial Neural Networks, whereby the following describes the actual modelling oriented towards the object sorting task:

1. Organize and partition the data into test and training sets.
2. Assign a number to capacitor images and resistor images (e.g. capacitor images are assigned the value of 1 and resistor images are assigned the value of 2). This will be the output (y) for each training example, as this is a supervised learning algorithm.
3. Put the training output data (y) in the form of a vector.
4. Decide on the number of layers (entirely based on the features of the images which are being manipulated). For the system proposed in this paper, the total number of layers is 3, i.e. an input layer, one hidden layer and an output layer.
5. Implement the Feed-Forward Propagation and Cost function:

$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[-y_k^{(i)}\log\big(h_\theta(x^{(i)})\big)_k - \big(1-y_k^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)_k\right]$   (7)

Regularizing the Cost function:

$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[-y_k^{(i)}\log\big(h_\theta(x^{(i)})\big)_k - \big(1-y_k^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)_k\right] + \frac{\lambda}{2m}\left[\sum_{j=1}^{H}\sum_{k=1}^{400}\big(\theta_{j,k}^{(1)}\big)^2 + \sum_{j=1}^{2}\sum_{k=1}^{H}\big(\theta_{j,k}^{(2)}\big)^2\right]$   (8)

Regularization of the parameters depends on the number of layers and the learning rate λ. The value of K = 2 as there are only two possible outputs, m stands for the number of training examples, $y_k^{(i)}$ is the output of the ith training example and $h_\theta(x^{(i)})$ is the hypothesis function, whereby:

$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$   (9)

The hypothesis function is a sigmoid function which has 1 as its upper bound and 0 as its lower bound. In addition, the sigmoid function is differentiable at all points.

6. The above process computes the cost of the feed-forward propagation. To determine the weights of the model, a random run must first be carried out to start off the optimization via Back-Propagation. Therefore, in this initial stage, the theta values (weights) are randomized for the input and hidden layers.
7. Implement Back-Propagation for the optimization of the theta values: once the hypothesis/prediction is made according to the initialized random weights, Back-Propagation starts off with the output layer; it measures the difference between the network's activation value and the true target value and then works back towards the input layer, assigning the error to each neuron.
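The cost and hypothesis of equations (7) to (9) for this single hidden layer network can be sketched in MATLAB as follows. It assumes X (an m x 400 feature matrix), Y (m x 2 one-hot targets), weight matrices Theta1 (H x 401) and Theta2 (2 x (H+1)) that include bias columns, and a regularization parameter lambda; these names are illustrative and this is not the authors' implementation:

    sigmoid = @(z) 1 ./ (1 + exp(-z));                           % equation (9)
    m  = size(X, 1);
    a1 = [ones(m,1), X];                                         % input layer plus bias unit
    a2 = [ones(m,1), sigmoid(a1 * Theta1')];                     % hidden layer activations
    h  = sigmoid(a2 * Theta2');                                  % network output, m x 2
    J  = (1/m) * sum(sum(-Y .* log(h) - (1 - Y) .* log(1 - h))); % equation (7)
    J  = J + (lambda/(2*m)) * (sum(sum(Theta1(:,2:end).^2)) ...
                             + sum(sum(Theta2(:,2:end).^2)));    % equation (8)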
Figure 10: Back-Propagation (weights adjustment)

The above block diagram represents how the error is assigned to each term for a particular layer.

8. For higher accuracies, the Neural Network is trained using a higher number of iterations. For the development of this model, 20000 iterations were run to obtain the best values of the weights (theta values).

VI. MODELS

For the object sorting task, in terms of image processing, two models of the classifier were developed. The differences in the models are specified below:

Figure 11: Model 1 Block Diagram (cluttered scene → feature extraction and object location filed → object cropped → convert to grayscale and resize image to 20 by 20 pixels → send to the classifier for testing)

The Model 1 technique converts the cropped image to grayscale and resizes it to comply with the testing standards (20 by 20 pixel images used for testing). However, in Model 2, considering that resizing an image deteriorates its quality, the cropped image is placed on the center of a white background.

Figure 12: Model 2 Block Diagram

Note: For images which exceed the boundary of 20 by 20 pixels, only the respective dimension is resized (e.g. if an image is 35 by 18 pixels, 35 will be resized to 20 and 18 will remain the same).
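The Model 2 placement can be sketched as a small MATLAB helper, assuming crop is a uint8 grayscale patch; the function name is hypothetical. Only dimensions larger than 20 pixels are shrunk, per the note above, and the patch is then placed at the center of a 20 by 20 white background:

    function out = centerOnWhite(crop)
        [h, w] = size(crop);
        crop   = imresize(crop, [min(h, 20), min(w, 20)]);  % resize only the oversized dimensions
        [h, w] = size(crop);
        out    = 255 * ones(20, 20, 'uint8');               % white 20 by 20 canvas
        r = floor((20 - h)/2) + 1;
        c = floor((20 - w)/2) + 1;
        out(r:r+h-1, c:c+w-1) = crop;                       % center the cropped object
    end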
VII. CROSS VALIDATION OF THE CLASSIFIER

The total training data contained 312 images and the classifier accuracy was determined by carrying out n-fold cross validation. The number of folds created was 5. For comparison purposes, the following models were developed and cross validated to determine the best classifier. The number of neurons in the output layer (2) and in the input layer (400) was kept constant, and the number of neurons in the hidden layer was varied.

Model   No. of Neurons   % Accuracy
1       25               99.33333
1       30               99.66667
1       40               99.00000
1       50               98.66667
1       70               99.00000
2       25               99.00000
2       30               98.00000
2       40               98.66667
2       50               98.66667
2       70               98.66667
Table 1: Cross Validation Results

The best model selected from the table is:

Best Model: Model 1
• Input layer: 400 neurons
• Hidden layer: 25 neurons
• Output layer: 2 neurons
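One way such a 5-fold cross validation could be organized in MATLAB is sketched below, assuming X (312 x 400 features) and y (312 x 1 labels as assigned in section V); trainNet and predictNet are hypothetical stand-ins for the authors' ANN training and prediction routines, and cvpartition requires the Statistics Toolbox:

    c   = cvpartition(y, 'KFold', 5);
    acc = zeros(c.NumTestSets, 1);
    for k = 1:c.NumTestSets
        net    = trainNet(X(training(c,k),:), y(training(c,k)));   % hypothetical training routine
        yhat   = predictNet(net, X(test(c,k),:));                  % hypothetical prediction routine
        acc(k) = mean(yhat == y(test(c,k)));
    end
    fprintf('Mean cross-validation accuracy: %.4f%%\n', 100*mean(acc));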
VIII. LOCATION OF THE EXTRACTED IMAGE (COORDINATES)

The location of the object is a very important parameter because, without the coordinates of the classified object, the object sorting task would be impossible. Before applying feature extraction, the whole image was resized as per the workspace dimensions (converting the metric dimensions to pixels, which then becomes the size of the image). From the scene, during the feature extraction, the locations of the objects were filed and, upon testing (classification), the program gave out the center coordinates of the object, assuming the robotic arm is confined within the boundary of the scene.

The locations of the objects were obtained in the form of a bounding box. In MATLAB this was in the form [xmin, ymin, width, height]. The (xmin, ymin) pair is the coordinate of the top-left corner of the rectangular bounding box, and (width, height) are the width in the x direction and the height in the y direction.
Figure 13: Location for the bounding box

The center of the object is given by:

$x_c = x_{\min} + \tfrac{1}{2}\,\text{width}$   (10)
$y_c = y_{\min} + \tfrac{1}{2}\,\text{height}$   (11)

whereby $x_c$ and $y_c$ are in pixels.

Converting from pixels to centimeters:

$[x_n, y_n] = 0.0264583333\,(x_c, y_c)$   (12)

where 0.0264583333 is the conversion factor to obtain cm values from pixels.
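Equations (10) to (12) translate directly into MATLAB, assuming bbox holds the [xmin, ymin, width, height] vector of one detected object:

    xc = bbox(1) + 0.5 * bbox(3);          % equation (10): object center x in pixels
    yc = bbox(2) + 0.5 * bbox(4);          % equation (11): object center y in pixels
    xy_cm = 0.0264583333 * [xc, yc];       % equation (12): convert pixels to centimeters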
IX. EXPERIMENTATION

A. Training Dataset Description
The training data consisted of 312 images, comprising 156 individual capacitor images and 156 individual non-capacitor images. The capacitor images were taken on a white background and the positioning of the objects was varied: the objects were not only centered but also cornered in the training images. The training data was converted to grayscale, resized to 20 by 20 pixels and then sent in for training.

B. Test Dataset Description
The test data had 32 images with cluttered scenes. The test data was converted to grayscale, complied with the classifier specifications and then sent for testing.

C. Iterations and Learning Rate Used for Training
For training on the dataset, 20000 iterations were first run with randomized values of the weights; once new weights (theta) were determined via Back-Propagation, the 20000 iterations were rerun to get optimized values of the weights (theta). Moreover, to avoid large deviations in the gradient descent algorithm, the value of the learning rate was kept at 0.1.

X. RESULTS

A. Feature Extraction Accuracy
The feature extraction algorithm was tested using a separate set of test data and its accuracy was 83.6443%.

B. Classifier Accuracy

$\text{Classifier Accuracy} = \frac{\text{No. of True Positives} + \text{No. of True Negatives}}{\text{Total Samples}} \times 100$   (13)

The classifier accuracy was determined using equation (13).

C. Final Results
Below is the result table for 32 scenes which altogether contain a total of 448 objects, inclusive of both capacitor and non-capacitor images. The testing runs from the feature extraction through to the classification task, and these test images are a separate set of data from the training examples.

Scene     Capacitors   Non Capacitors   FE Accuracy   TP   TN   Overall Accuracy
1         6            4                1             4    2    0.6
2         3            0                1             2    1    1
3         0            6                1             0    6    1
4         12           13               0.36          6    3    1
5         25           3                0.57          16   0    1
6         7            8                0.8           4    7    0.916667
7         9            0                1             1    3    0.444444
8         13           5                0.83          11   0    0.733333
9         1            6                0.57          1    3    1
10        6            12               0.5           3    5    0.888889
11        8            1                1             4    1    0.555556
12        5            0                1             3    2    1
13        10           0                0.9           8    0    0.888889
14        7            1                1             5    3    1
15        6            0                1             3    0    0.5
16        58           0                1             40   0    0.689655
17        0            13               0.38          0    5    1
18        0            6                0.5           0    2    0.666667
19        10           7                0.59          6    4    1
20        4            3                0.86          3    2    0.833333
21        12           0                0.83          10   0    1
22        24           0                0.67          14   2    1
23        2            1                1             1    2    1
24        18           17               0.6           10   11   1
25        4            0                1             3    0    0.75
26        8            9                1             6    5    0.647059
27        1            0                1             1    0    1
28        6            7                1             4    6    0.769231
29        6            5                1             1    5    0.545455
30        10           15               1             6    10   0.64
31        0            5                0.8           0    2    0.5
32        8            12               1             6    12   0.9
Average   -            -                0.83644       -    -    0.82716
Table 2: Final Results
The above test results are from the test data described in section IX (Experimentation). For this particular test data, the FE accuracy is 83.6443% and the Overall System Performance (OSP), as a function of FE and classification, is 82.7162%.

XI. CONCLUSION

The modeling and implementation of a feature extraction algorithm and two classifiers for object recognition and detection were presented in this paper. The feature extraction algorithm yielded an accuracy of 83.6443%. The classifier yielded an accuracy of 99.33% (upon cross validation) and 82.7162% (the OSP, as it also takes into account the FE accuracy) upon final testing given a cluttered scene. For the selection of the classifier, the accuracies of the models developed were very close. However, the best one chosen from those two models was Model 1, which had 25 neurons in its hidden layer. This is because, upon inspecting the accuracies of the individual CV folds, the majority of the folds in Model 1 had their accuracies ranging towards 100%. The choice of the best model was not made according to accuracy alone; the selection also considered the following criteria: least cost to attain favorable results (i.e. fewer neurons), processing and execution time, and simplicity of the model. Once the object is classified, the MATLAB program also outputs the coordinates of the classified object, ready for the robotic arm to execute the sorting task.

XII. RECOMMENDATION

The major challenge when developing the models was that, upon resizing of the images, there is a loss of pixel data as the images were in raster formats. A potential solution to this predicament would be to invoke the concept of Scalable Vector Graphics (SVG) formatting of the images. However, since MATLAB is not able to process vector graphics files, the first proposition would be to write a function file which enables MATLAB to read and modify .svg or other vector files. Having formed this foundation would solve many problems in terms of scaling the images.

REFERENCES

[1] T. P. Cabre, M. T. Cairol, D. F. Calafell, M. T. Ribes, and J. P. Roca, "Project-Based Learning Example: Controlling an Educational Robotic Arm With Computer Vision," IEEE Revista Iberoamericana de Tecnologias del Aprendizaje, vol. 8, pp. 135-142, 2013.
[2] P. J. Sanz, R. Marin, and J. S. Sánchez, "Including efficient object recognition capabilities in online robots: from a statistical to a Neural-network classifier," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 35, pp. 87-96, 2005.
[3] V. Lippiello, F. Ruggiero, B. Siciliano, and L. Villani, "Visual grasp planning for unknown objects using a multifingered robotic hand," IEEE/ASME Transactions on Mechatronics, vol. 18, pp. 1050-1059, 2013.
[4] A. C. Bernal and G. M. Aguilar, "Vision System via USB for Object Recognition and Manipulation with Scorbot-ER 4U," International Journal of Computer Applications, vol. 56, 2012.
[5] F. Farahmand, M. T. Pourazad, and Z. Moussavi, "An intelligent assistive robotic manipulator," in Proc. 27th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (IEEE-EMBS 2005), pp. 5028-5031, 2006.
[6] A. Elgammal, "Object Detection and Recognition," lecture notes, Dept. of Computer Science, Rutgers University, Spring 2005.
[7] K. Schindler and D. Suter, "Object detection by global contour shape," Pattern Recognition, vol. 41, no. 12, pp. 3736-3748, 2008.
[8] P. J. Sanz, R. Marin, and J. S. Sánchez, "Including efficient object recognition capabilities in online robots: from a statistical to a Neural-network classifier," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 35, pp. 87-96, 2005.
[9] Intelitek, SCORBASE Version 4.9 and Higher for SCORBOT-ER 4u, SCORBOT-ER 2u, User Manual, Catalog #100342 Rev. E, 2006.