2018 IEEE International Conference on Robotics and Automation (ICRA)

May 21-25, 2018, Brisbane, Australia

Robotic Pick-and-Place of Novel Objects in Clutter
with Multi-Affordance Grasping and Cross-Domain Image Matching
Andy Zeng1, Shuran Song1, Kuan-Ting Yu2, Elliott Donlon2, Francois R. Hogan2, Maria Bauza2, Daolin Ma2,
Orion Taylor2, Melody Liu2, Eudald Romo2, Nima Fazeli2, Ferran Alet2, Nikhil Chavan Dafle2, Rachel Holladay2,
Isabella Morona2, Prem Qu Nair1, Druck Green2, Ian Taylor2, Weber Liu1, Thomas Funkhouser1, Alberto Rodriguez2
1 Princeton University 2 Massachusetts Institute of Technology

http://arc.cs.princeton.edu
https://youtu.be/6fG7zwGfIkI

Abstract— This paper presents a robotic pick-and-place system that is capable of grasping and recognizing both known and novel objects in cluttered environments. The key new feature of the system is that it handles a wide range of object categories without needing any task-specific training data for novel objects. To achieve this, it first uses a category-agnostic affordance prediction algorithm to select and execute among four different grasping primitive behaviors. It then recognizes picked objects with a cross-domain image classification framework that matches observed images to product images. Since product images are readily available for a wide range of objects (e.g., from the web), the system works out-of-the-box for novel objects without requiring any additional training data. Exhaustive experimental results demonstrate that our multi-affordance grasping achieves high success rates for a wide variety of objects in clutter, and our recognition algorithm achieves high accuracy for both known and novel grasped objects. The approach was part of the MIT-Princeton Team system that took 1st place in the stowing task at the 2017 Amazon Robotics Challenge. All code, datasets, and pre-trained models are available online at http://arc.cs.princeton.edu

Fig. 1. Our picking system grasping a towel from a bin full of objects, holding it up away from clutter, and recognizing it by matching observed images of the towel to an available representative product image. The entire system works out-of-the-box for novel objects (appearing for the first time during testing) without the need for additional data collection or re-training.

The authors would like to thank the MIT-Princeton ARC team members for their contributions to this project, and ABB Robotics, Mathworks, Intel, Google, NSF (IIS-1251217 and VEC 1539014/1539099), and Facebook for hardware, technical, and financial support.
I. INTRODUCTION

A human's remarkable ability to grasp and recognize unfamiliar objects with little prior knowledge of them is a constant inspiration for robotics research. This ability to grasp the unknown is central to many applications: from picking packages in a logistic center to bin-picking in a manufacturing plant; from unloading groceries at home to clearing debris after a disaster. The main goal of this work is to demonstrate that it is possible – and practical – for a robotic system to pick and recognize novel objects with very limited prior information about them (e.g. with only a few representative images scraped from the web).

Despite the interest of the research community, and despite its practical value, robust manipulation and recognition of novel objects in cluttered environments still remains a largely unsolved problem. Classical solutions for robotic picking require recognition and pose estimation prior to model-based grasp planning, or require object segmentation to associate grasp detections with object identities. These solutions tend to fall short when dealing with novel objects in cluttered environments, since they rely on 3D object models and/or large amounts of training data to achieve robust performance. Although there has been inspiring recent work on detecting grasps directly from RGB-D pointclouds as well as learning-based recognition systems to handle the constraints of novel objects and limited data, these methods have yet to be proven in the constraints and accuracy required by a real task with heavy clutter, severe occlusions, and object variability.

In this paper, we propose a system that picks and recognizes objects in cluttered environments. We have designed the system specifically to handle a wide range of objects novel to the system without gathering any task-specific training data for them. To make this possible, our system consists of two components: 1) a multi-modal grasping framework featuring four primitive behaviors, which uses deep convolutional neural networks (ConvNets) to predict affordances for a scene without a priori object segmentation and classification; and 2) a cross-domain image matching framework for recognizing grasped objects by matching them to product images, which uses a ConvNet architecture that adapts to novel objects without additional re-training. Both components work hand-in-hand to achieve robust picking performance of novel objects in heavy clutter.


We provide exhaustive experiments and ablation studies to evaluate both components. We demonstrate that the multi-affordance predictor for grasp planning achieves high success rates for a wide variety of objects in clutter, and the recognition algorithm achieves high accuracy for known and novel grasped objects. These algorithms were developed as part of the MIT-Princeton Team system that took 1st place in the stowing task of the Amazon Robotics Challenge (ARC), being the only system to have successfully stowed all known and novel objects from an unstructured tote into a storage system within the allotted time frame. Fig. 1 shows our robot in action during the competition.

In summary, our main contributions are:
• An object-agnostic picking framework using four primitive behaviors for fast and robust picking, utilizing a novel approach for estimating parallel jaw grasp affordances (Section IV).
• A perception framework for recognizing both known and novel objects using only product images without extra data collection or re-training (Section V).
• A system combining these two frameworks for picking novel objects in heavy clutter.

All code, datasets, and pre-trained models are available online at http://arc.cs.princeton.edu [1]. We also provide a video summarizing our approach at https://youtu.be/6fG7zwGfIkI, and a supplementary appendix with more details on our system at https://arxiv.org/abs/1710.01330.

Fig. 2. The bin and camera setup. Our system consists of 4 units (top), where each unit has a bin with 4 stationary cameras: two overlooking the bin (bottom-left) are used for predicting grasp affordances while the other two (bottom-right) are used for recognizing the grasped object.
II. RELATED WORK

In this section, we review works related to robotic picking systems. Works specific to grasping (Section IV) and recognition (Section V) are in their respective sections.

A. Recognition followed by Model-based Grasping

A large number of autonomous pick-and-place solutions follow a standard two-step approach: object recognition and pose estimation followed by model-based grasp planning. For example, Jonschkowski et al. [2] designed object segmentation methods over handcrafted image features to compute suction proposals for picking objects with a vacuum. More recent data-driven approaches [3], [4], [5], [6] use ConvNets to provide bounding box proposals or segmentations, followed by geometric registration to estimate object poses, which ultimately guide handcrafted picking heuristics [7], [8]. Nieuwenhuisen et al. [9] improve many aspects of this pipeline by leveraging robot mobility, while Liu et al. [10] add a pose correction stage when the object is in the gripper. These works typically require 3D models of the objects during test time, and/or training data with the physical objects themselves. This is practical for tightly constrained pick-and-place scenarios, but is not easily scalable to applications that consistently encounter novel objects, for which only limited data (i.e. product images from the web) is available.

B. Recognition in parallel with Object-Agnostic Grasping

It is also possible to exploit local features of objects without object identity to efficiently detect grasps [11], [12], [13], [14], [15], [16], [17], [18], [19]. Since these methods are agnostic to object identity, they better adapt to novel objects and experience higher picking success rates by eliminating error propagation from a prior recognition step. Matsumoto et al. [20] apply this idea in a full picking system by using a ConvNet to compute grasp proposals, while in parallel predicting semantic segmentations for a fixed set of known objects. Although these pick-and-place systems use object-agnostic grasping methods, they still require some form of in-place object recognition in order to associate grasp proposals with object identities, which is particularly challenging when dealing with novel objects in clutter.

C. Active Perception

Active perception – exploiting control strategies for acquiring data to improve perception [21], [22] – can facilitate the recognition of novel objects in clutter. For example, Jiang et al. [23] describe a robotic system that actively rearranges objects in the scene (by pushing) in order to improve recognition accuracy. Other works [24], [25] explore next-best-view based approaches to improve recognition, segmentation and pose estimation results. Inspired by these works, our system applies active perception by using a grasp-first-then-recognize paradigm where we leverage object-agnostic grasping to isolate each object from clutter in order to significantly improve recognition accuracy for novel objects.

III. SYSTEM OVERVIEW

We present a robotic pick-and-place system that grasps and recognizes both known and novel objects in cluttered environments. The "known" objects are provided to the system at training time, both as physical objects and as representative product images (images of objects available on the web); while the "novel" objects are provided only at test time in the form of representative product images.

Overall approach. The system follows a grasp-first-then-recognize work-flow. For each pick-and-place operation, it first performs an object-agnostic affordance prediction, considering multiple different grasping modes from suction to parallel-jaw grasps (Section IV). It then selects the best affordance, picks up one object, isolates it from the clutter, holds it up in front of cameras, recognizes its category, and places it in the appropriate bin. Although the object recognition algorithm is trained only on known objects, it is able to recognize novel objects through a learned cross-domain image matching embedding between observed images of held objects and product images (Section V).
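To make this work-flow concrete, the sketch below walks through a single pick-and-place cycle. It is only an illustration of the control flow described above, not the authors' released code: every function name (capture_bin_images, predict_affordances, execute, recognize, place_into_bin) and the Proposal container are hypothetical placeholders for the affordance, primitive-execution, and recognition components detailed in Sections IV and V.

```python
# Minimal sketch of the grasp-first-then-recognize cycle (hypothetical names).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Proposal:
    primitive: str      # "suction-down", "suction-side", "grasp-down" or "flush-grasp"
    position: tuple     # (x, y, z) in robot coordinates
    confidence: float   # predicted affordance score in [0, 1]

def pick_and_place_once(
    capture_bin_images: Callable[[], object],
    predict_affordances: Callable[[object], List[Proposal]],
    execute: Callable[[Proposal], bool],
    capture_gripper_images: Callable[[], object],
    recognize: Callable[[object], str],
    place_into_bin: Callable[[str], None],
) -> bool:
    """One cycle: grasp first, then recognize the isolated object."""
    rgbd = capture_bin_images()                    # two views overlooking the bin
    proposals = predict_affordances(rgbd)          # suction + grasp proposals with scores
    best = max(proposals, key=lambda p: p.confidence)
    if not execute(best):                          # guarded move with fast success/failure feedback
        return False
    held_views = capture_gripper_images()          # two cameras pointed at the gripper
    object_id = recognize(held_views)              # cross-domain match against product images
    place_into_bin(object_id)
    return True
```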

Fig. 3. Multi-functional gripper with a retractable mechanism that enables quick and automatic switching between suction (pink) and grasping (blue).

Fig. 4. Multiple motion primitives for suction and grasping to ensure successful picking for a wide variety of objects in any orientation.

Advantages. This system design has several advantages. First, the affordance prediction algorithm is model-free and agnostic to object identities and generalizes to novel objects without re-training. Second, the category recognition algorithm works without task-specific data collection or re-training for novel objects, which makes it scalable for applications in warehouse automation and service robots where the range of observed object categories is large and dynamic. Third, the affordance prediction algorithm supports multiple grasping modes and thus handles a wide variety of objects. Finally, the entire processing pipeline requires only two forward passes through deep networks and thus executes quickly (Table II).

System setup. Our system features a 6DOF ABB IRB 1600id robot arm next to four picking work-cells. The robot arm's end-effector is a multi-functional gripper with two fingers for parallel-jaw grasps and a retractable suction cup (Fig. 3). This gripper was designed to function in cluttered environments: finger and suction cup length are specifically chosen such that the bulk of the gripper body does not need to enter the cluttered space. Each work-cell has a storage bin and four statically-mounted RealSense SR300 RGB-D cameras (Fig. 2): two cameras overlooking the storage bins are used to predict grasp affordances, while the other two pointing towards the robot gripper are used to recognize objects in the gripper. Although our experiments were performed with this setup, the system was designed to be flexible for picking and placing between any number of reachable work-cells and camera locations. Furthermore, all manipulation and recognition algorithms in this paper were designed to be easily adapted to other system setups.

IV. MULTI-AFFORDANCE GRASPING

The goal of the first step in our system is to robustly grasp objects from a cluttered scene without relying on their object identities or poses. To this end, we define a set of motion primitives that are complementary to each other in terms of utility across different object types and scenarios – empirically maximizing the variety of objects and orientations that can be picked with at least one primitive. Given RGB-D images of the cluttered scene at test time, we predict a set of affordances to generate grasp proposals with confidence scores for each primitive. These are then used by a task planner to choose which primitive to use (a minimal selection sketch follows the primitive descriptions below).

A. Motion primitives

We define four motion primitives to achieve robust picking for typical household objects. Fig. 4 shows example motions for each primitive. Each of them is implemented as a set of guarded moves, with collision avoidance and quick success or failure feedback mechanisms. They are as follows:

Suction down grasps objects with a vacuum gripper vertically. This primitive is particularly robust for objects with large and flat suctionable surfaces (e.g. boxes, books, wrapped objects), and performs well in heavy clutter.

Suction side grasps objects from the side by approaching with a vacuum gripper tilted at an angle. This primitive is robust to thin and flat objects resting against walls, which may not have suctionable surfaces from the top.

Grasp down grasps objects vertically using the two-finger parallel-jaw gripper. This primitive is complementary to the suction primitives in that it is able to pick up objects with smaller, irregular surfaces (e.g. small tools, deformable objects), or made of semi-porous materials that prevent a good suction seal (e.g. cloth).

Flush grasp retrieves unsuctionable objects that are flush against a wall. The primitive is similar to grasp down, but with the additional behavior of using a flexible spatula to slide between the target object and the wall.
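The selection step itself can be summarized with a minimal sketch, assuming the planner simply takes the highest-confidence proposal across the four affordance maps; the names below (PRIMITIVES, select_primitive) are illustrative, and the real task planner may add further logic such as reachability checks or penalties for previously failed picks.

```python
# A minimal selection sketch (not the authors' planner): pick the primitive and
# pixel location with the highest predicted affordance across the four behaviors.
import numpy as np

PRIMITIVES = ("suction-down", "suction-side", "grasp-down", "flush-grasp")

def select_primitive(affordance_maps: dict) -> tuple:
    """affordance_maps maps a primitive name to a 2D array of confidences in [0, 1].
    Returns (primitive_name, (row, col), confidence) for the single best proposal."""
    best = None
    for name in PRIMITIVES:
        scores = np.asarray(affordance_maps[name])
        r, c = np.unravel_index(np.argmax(scores), scores.shape)
        candidate = (name, (int(r), int(c)), float(scores[r, c]))
        if best is None or candidate[2] > best[2]:
            best = candidate
    return best

# Example call with random maps, just to show the interface:
maps = {name: np.random.rand(480, 640) for name in PRIMITIVES}
print(select_primitive(maps))
```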

Fig. 5. Suction and grasp affordance prediction. Given multi-view RGB-D images, we estimate suction affordances for each image with a fully
convolutional residual network. We then aggregate the predictions on a 3D point cloud, and generate suction down or suction side proposals based on
surface normals. In parallel, we merge RGB-D images into an RGB-D heightmap, rotate it by 16 different angles, and estimate horizontal grasp affordances for each
heightmap. This effectively produces affordance maps for 16 different grasp angles, from which we generate the grasp down and flush grasp proposals.

B. Affordance Prediction

Given the set of pre-defined picking primitives and RGB-D images of the scene, we predict pixel-level affordances for each motion primitive, from which we can generate suction and grasp proposals. Our approach relies on the assumption that graspable regions can be deduced from the local geometry and material properties, as reflected in visual information. This is inspired by recent data-driven methods for grasp planning [11], [12], [13], [15], [16], [17], [18], [19], which do not rely on object identities or state estimation. We extend these data-driven approaches by training models to predict pixel-level affordances for multiple types of grasps, and employ fully convolutional networks (FCN) [26] to efficiently obtain dense predictions over a single image of the scene to achieve faster run time speeds.

In this subsection, we present an overview of how we predict affordances for our suction and grasping primitives. For more details about our network architectures, their training parameters, post-processing steps, and training datasets, please refer to our project webpage [1].

Predicting Suction Affordances. We define suction proposals as 3D positions where the vacuum gripper's suction cup should come in contact with the object's surface in order to successfully grasp it. Good suction proposals should be located on suctionable surfaces, and nearby the target object's center of mass to avoid an unstable suction seal (e.g. particularly for heavy objects). Each suction proposal is defined as a 3D position x, y, z, its surface normal nx, ny, nz, and confidence score cs.

We train a fully convolutional residual network (ResNet-101 [27]) that takes an RGB-D image as input, and outputs a densely labeled pixel-level binary probability map cs, where values closer to one imply a more preferable suction location, shown in Fig. 5 first row. Our network architecture is multi-modal, where the color data is fed into one ResNet-101 tower, and 3-channel depth (cloned across channels, normalized by subtracting mean and dividing by standard deviation) is fed into another ResNet-101 tower. Features from the ends of both towers are concatenated across channels, followed by 3 additional spatial convolution layers to merge the features; then spatially bilinearly upsampled and softmaxed to output a single binary probability map. We train our model over a manually annotated dataset of RGB-D images of cluttered scenes with diverse objects, where pixels are densely labeled either positive, negative, or neither (using wide-area brushstrokes from the labeling interface). We train our network with 0 loss propagation for the regions that are labeled as neither positive nor negative.

During testing, we feed each captured RGB-D image through our trained network to generate probability maps for each view. As a post-processing step, we use calibrated camera intrinsics and poses to project the probability maps and aggregate the affordance predictions onto a combined 3D point cloud. We then compute surface normals for each 3D point, which are used to classify which suction primitive (down or side) to use for the point. To handle objects without depth, we use a simple hole filling algorithm [28] on the depth images, and project predicted probability scores onto the hallucinated depth.

Predicting Grasp Affordances. Each grasp proposal is represented by the x, y, z position of the gripper in 3D space, the orientation θ of the gripper around the vertical axis, the desired gripper opening distance do, and confidence score cg.

To predict grasping affordances, we first aggregate the two RGB-D images of the scene into a registered 3D point cloud, which is then orthographically back-projected upwards in the gravity direction to obtain a "heightmap" image representation of the scene, with both color (RGB) and height from bottom (D) channels. To handle objects without depth, we triangulate no-depth regions in the heightmap using both views, and fill in the regions with a height of 3cm. We feed this RGB-D heightmap as input to a fully convolutional ResNet-101 [27], which densely predicts pixel-level binary probability maps, which serve as confidence values cg for horizontally oriented grasps, shown in Fig. 5 second row. The architecture of this network is similar in structure to the network predicting suction affordances. By rotating the heightmap in 16 different orientations and feeding each individually through the network, we obtain 16 binary probability maps, each representing a confidence map for a grasp in a different orientation. We find this network architecture to be more flexible to various grasp orientations, and less likely to diverge during training due to the sparsity of manual grasp annotations.

We train our model over a manually annotated dataset of RGB-D heightmaps, where each positive and negative grasp label is represented by a pixel on the heightmap as well as a corresponding angle parallel to the jaw motion of the gripper. Our grasp affordance predictions return grasp locations (x, y, z), orientations (θ), and confidence scores (cg). During post-processing, we use the geometry of the 3D point cloud to estimate grasp widths (do) for each proposal. We also use the location of each proposal relative to the bin to classify which grasping primitive (down or flush) should be used.
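As a reference for the suction branch, the following is a rough sketch of a two-tower RGB-D affordance FCN in the spirit of the network described above: one ResNet-101 trunk for color, one for 3-channel depth, channel-wise concatenation, three merging convolutions, bilinear upsampling, and a softmax. Layer widths, the torchvision backbone, and the two-class output head are assumptions for illustration, not the authors' released architecture.

```python
# Illustrative two-tower RGB-D affordance FCN (an approximation, not the
# released model). Uses torchvision's ResNet-101 trunks as backbones.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

def resnet_trunk():
    # Keep everything up to the last residual block; drop avgpool/fc so the
    # output stays spatial (2048 channels at 1/32 of the input resolution).
    backbone = torchvision.models.resnet101(weights=None)
    return nn.Sequential(*list(backbone.children())[:-2])

class TwoTowerAffordanceFCN(nn.Module):
    def __init__(self):
        super().__init__()
        self.color_tower = resnet_trunk()   # RGB input (3 channels)
        self.depth_tower = resnet_trunk()   # depth cloned to 3 channels
        self.merge = nn.Sequential(         # 3 convolutions merging both towers
            nn.Conv2d(4096, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 2, 3, padding=1),  # 2 classes: suctionable vs. not
        )

    def forward(self, rgb, depth3):
        feats = torch.cat([self.color_tower(rgb), self.depth_tower(depth3)], dim=1)
        logits = self.merge(feats)
        # Bilinearly upsample to input resolution, softmax over the two classes;
        # channel 1 is the per-pixel suction affordance in [0, 1].
        logits = F.interpolate(logits, size=rgb.shape[-2:], mode="bilinear",
                               align_corners=False)
        return torch.softmax(logits, dim=1)[:, 1]

# Shape check with a dummy 480x640 frame:
model = TwoTowerAffordanceFCN().eval()
with torch.no_grad():
    out = model(torch.rand(1, 3, 480, 640), torch.rand(1, 3, 480, 640))
print(out.shape)  # torch.Size([1, 480, 640])
```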

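The grasp branch can be exercised with a similar sketch of the 16-orientation inference loop: rotate the heightmap, run the network, rotate the resulting confidence map back, and keep the best pixel and angle. The grasp_fcn callable and the (H, W, 4) heightmap layout below are placeholders standing in for the trained model and the actual representation.

```python
# Sketch of the 16-angle grasp affordance inference (not the released code).
# `grasp_fcn` is a stand-in for the trained FCN: it takes an RGB-D heightmap of
# shape (H, W, 4) and returns an (H, W) map of horizontal-grasp confidences.
import numpy as np
from scipy.ndimage import rotate

def predict_grasp_affordances(heightmap: np.ndarray, grasp_fcn, n_angles: int = 16):
    """Returns (best_row, best_col, best_angle_deg, best_confidence)."""
    best = (0, 0, 0.0, -1.0)
    for k in range(n_angles):
        angle = k * 180.0 / n_angles      # horizontal grasps, so 0..180 degrees
        # Rotate the heightmap so that a "horizontal" grasp in the rotated frame
        # corresponds to a grasp at `angle` in the original frame.
        rotated = rotate(heightmap, angle, axes=(1, 0), reshape=False, order=1)
        conf = grasp_fcn(rotated)         # (H, W) confidences for this angle
        # Rotate the confidence map back into the original heightmap frame.
        conf = rotate(conf, -angle, axes=(1, 0), reshape=False, order=1)
        r, c = np.unravel_index(np.argmax(conf), conf.shape)
        if conf[r, c] > best[3]:
            best = (int(r), int(c), angle, float(conf[r, c]))
    return best

# Dummy usage with a random "network", just to exercise the loop:
dummy_fcn = lambda hm: np.random.rand(*hm.shape[:2])
print(predict_grasp_affordances(np.random.rand(200, 300, 4), dummy_fcn))
```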
Fig. 6. Recognition framework for novel objects. We train a two-stream convolutional neural network where one stream computes 2048-dimensional
feature vectors for product images while the other stream computes 2048-dimensional feature vectors for observed images, and optimize both streams so
that features are more similar for images of the same object and dissimilar otherwise. During testing, product images of both known and novel objects are
mapped onto a common feature space. We recognize observed images by mapping them to the same feature space and finding the nearest neighbor match.

V. RECOGNIZING NOVEL OBJECTS

After successfully grasping an object and isolating it from clutter, the goal of the second step in our system is to recognize the identity of the grasped object.

Since we encounter both known and novel objects, and we have only product images for the novel objects, we address this recognition problem by retrieving the best match among a set of product images. Of course, observed images and product images can be captured in significantly different environments in terms of lighting, object pose, background color, post-process editing, etc. Therefore, we need a model that is able to find the semantic correspondences between images from these two different domains. This is a cross-domain image matching problem [29], [30], [31].

A. Metric Learning for Cross-Domain Image Matching

To do the cross-domain image matching between observed images and product images, we learn a metric function that takes in an observed image and a candidate product image and outputs a distance value that models how likely the images are of the same object. The goal of the metric function is to map both the observed image and product image onto a meaningful feature embedding space so that smaller ℓ2 feature distances indicate higher similarities. The product image with the smallest metric distance to the observed image is the final matching result.

We model this metric function with a two-stream convolutional neural network (ConvNet) architecture where one stream computes features for the observed images, and a different stream computes features for the product images. We train the network by feeding it a balanced 1:1 ratio of matching and non-matching image pairs (one observed image and one product image) from the set of known objects, and backpropagate gradients from the distance ratio loss (Triplet loss [32]). This effectively optimizes the network in a way that minimizes the ℓ2 distances between features of matching pairs while pulling apart the ℓ2 distances between features of non-matching pairs. By training over enough examples of these image pairs across known objects, the network learns a feature embedding that encapsulates object shape, color, and other visual discriminative properties, which can generalize and be used to match observed images of novel objects to their respective product images (Fig. 6).

Avoiding metric collapse by guided feature embeddings. One issue commonly encountered in metric learning occurs when the number of training object categories is small – the network can easily overfit its feature space to capture only the small set of training categories, making generalization to novel object categories difficult. We refer to this problem as metric collapse. To avoid this issue, we use a model pre-trained on ImageNet [33] for the product image stream and train only the stream that computes features for observed images. ImageNet contains a large collection of images from many categories, and models pre-trained on it have been shown to produce relatively comprehensive and homogenous feature embeddings for transfer tasks [34] – i.e. providing discriminating features for images of a wide range of objects. Our training procedure trains the observed image stream to produce features similar to the ImageNet features of product images – i.e., it learns a mapping from observed images to ImageNet features. Those features are then suitable for direct comparison to features of product images, even for novel objects not encountered during training.

Using multiple product images. For many applications, there can be multiple product images per object. However, with multiple product images, supervision of the two-stream network can become confusing – on which pair of matching observed and product images should the backpropagated gradients be based? To solve this problem, we add a module we call a "multi-anchor switch" in the network. During training, this module automatically chooses which "anchor" product image to compare against based on nearest neighbor ℓ2 distance. We find that allowing the network to select its own criterion for choosing "anchor" product images provides a significant boost in performance in comparison to alternative methods like random sampling.
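A condensed sketch of one training step under this scheme is given below. It assumes ResNet-50 backbones with 2048-dimensional pooled features, a frozen ImageNet-pretrained product-image stream (the guided embedding), a trainable observed-image stream, PyTorch's triplet margin loss standing in for the distance ratio loss, and a nearest-neighbor choice of the anchor product image (the multi-anchor switch). None of these specifics are claimed to match the released training code.

```python
# Hedged sketch of one metric-learning step: frozen product-image stream,
# trainable observed-image stream, triplet-style loss, multi-anchor switch.
import torch
import torch.nn as nn
import torchvision

def feature_extractor(pretrained: bool) -> nn.Module:
    weights = torchvision.models.ResNet50_Weights.IMAGENET1K_V1 if pretrained else None
    net = torchvision.models.resnet50(weights=weights)
    net.fc = nn.Identity()               # expose the 2048-d pooled feature
    return net

product_stream = feature_extractor(pretrained=True).eval()   # frozen guided embedding
observed_stream = feature_extractor(pretrained=False)        # trained to match it
for p in product_stream.parameters():
    p.requires_grad_(False)

criterion = nn.TripletMarginLoss(margin=0.2)                 # assumed margin
optimizer = torch.optim.SGD(observed_stream.parameters(), lr=1e-3, momentum=0.9)

def training_step(observed, product_pos, product_neg):
    """observed: (B,3,224,224); product_pos: (B,K,3,224,224) candidate product
    images of the matching object; product_neg: (B,3,224,224) non-matching."""
    b, k = product_pos.shape[:2]
    with torch.no_grad():
        pos_feats = product_stream(product_pos.flatten(0, 1)).view(b, k, -1)
        neg_feats = product_stream(product_neg)
    obs_feats = observed_stream(observed)
    # Multi-anchor switch: for each observed image, pick the nearest matching
    # product image (in l2 distance) as the positive anchor.
    dists = torch.cdist(obs_feats.unsqueeze(1), pos_feats).squeeze(1)  # (B, K)
    anchor = pos_feats[torch.arange(b), dists.argmin(dim=1)]           # (B, 2048)
    loss = criterion(obs_feats, anchor, neg_feats)   # pull matches together, push non-matches apart
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```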

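At test time, recognition then reduces to a nearest-neighbor lookup in the shared feature space (Fig. 6): embed the candidate product images, embed the observed image of the grasped object, and return the closest product image. A minimal sketch with assumed shapes, reusing the two streams from the sketch above:

```python
# Minimal sketch of test-time matching in the learned feature space.
import torch

@torch.no_grad()
def match_observed_image(observed_img, product_imgs, observed_stream, product_stream):
    """observed_img: (1,3,224,224); product_imgs: (N,3,224,224).
    Returns (best_index, distance) of the nearest product image."""
    obs_feat = observed_stream(observed_img)      # (1, 2048)
    prod_feats = product_stream(product_imgs)     # (N, 2048)
    dists = torch.cdist(obs_feat, prod_feats)[0]  # l2 distances, shape (N,)
    best = int(dists.argmin())
    return best, float(dists[best])
```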
B. Two Stage Framework for a Mixture of Known and Novel Objects

In settings where both types of objects are present, we find that training two different network models to handle known and novel objects separately can yield higher overall matching accuracies. One is trained to be good at "overfitting" to the known objects (K-net) and the other is trained to be better at "generalizing" to novel objects (N-net).

Yet, how do we know which network to use for a given image? To address this issue, we execute our recognition pipeline in two stages: a "recollection" stage that determines whether the observed object is known or novel, and a "hypothesis" stage that uses the appropriate network model based on the first stage's output to perform image matching.

First, the recollection stage predicts whether the input observed image from test time is that of a known object that has appeared during training. Intuitively, an observed image is of a novel object if and only if its deep features cannot match to that of any images of known objects. We explicitly model this conditional by thresholding on the nearest neighbor distance to product image features of known objects. In other words, if the ℓ2 distance between the K-net features of an observed image and the nearest neighbor product image of a known object is greater than some threshold k, then the observed image is a novel object.

In the hypothesis stage, we perform object recognition based on one of two network models: K-net for known objects and N-net for novel objects. The K-net and N-net share the same network architecture. However, the K-net has an additional auxiliary classification loss during training for the known objects. This classification loss increases the accuracy of known objects at test time to near perfect performance, and also boosts the accuracy of the recollection stage, but fails to maintain the accuracy of novel objects. On the other hand, without the restriction of the classification loss, N-net has a lower accuracy for known objects, but maintains a better accuracy for novel objects.

By adding the recollection stage, we can exploit both the high accuracy of known objects with K-net and good accuracy of novel objects with N-net, though incurring a cost in accuracy from erroneous known vs novel classification. We find that this two-stage system overall provides higher total matching accuracy for recognizing both known and novel objects (mixed) than all other baselines (Table III).

VI. EXPERIMENTS

In this section, we evaluate our multi-affordance prediction for suction and grasp primitives, our recognition algorithm over both known and novel objects, as well as our full system in the context of the Amazon Robotics Challenge 2017.

A. Multi-Affordance Prediction Experiments

Datasets. To generate datasets for affordance predictions, we designed a simple labeling interface that prompts users to manually annotate suction and grasp proposals over RGB-D images collected from the real system. For suction, users who have had experience working with our suction gripper are asked to annotate pixels of suctionable and non-suctionable areas on raw RGB-D images overlooking cluttered bins full of various objects. Similarly, users with experience using our parallel-jaw gripper are asked to sparsely annotate positive and negative grasps over re-projected height maps of cluttered bins, where each grasp is represented by a pixel on the height map and an angle parallel to the jaw motion of the gripper. We further augment each grasp label by adding additional labels with small jittering (less than 1.6cm). In total, the dataset contains 1837 RGB-D images with suction and grasp labels. We use a 4:1 training/testing split across this dataset to train and evaluate different models.

Evaluation. In the context of our system, an affordance prediction method is robust if it is able to consistently find at least one suction or grasp proposal that works. To reflect this, our evaluation metric is the precision of predicted proposals versus manual annotations. For suction, a proposal is considered a true positive if its pixel center is manually labeled as a suctionable area. For grasping, a proposal is considered a true positive prediction if its pixel center is within 4 pixels and 11.25 degrees from a positive grasp label.

We report the precision of our predicted proposals for different confidence percentiles in Table I. The precision of the top-1 proposal is reliably above 90% for both suction and grasping. We further compare our methods to heuristic-based baseline algorithms that compute suction affordances by estimating surface normal variance over the observed 3D point cloud (lower variance = higher affordance), and compute anti-podal grasps by detecting hill-like geometric structures in the 3D point cloud. Baseline details and code are available on our project webpage [1].

TABLE I
MULTI-AFFORDANCE PREDICTION PERFORMANCE
(% precision of predictions across different confidence percentiles)

Primitive   Method     Top-1   Top 1%   Top 5%   Top 10%
Suction     Baseline   35.2    55.4     46.7     38.5
Suction     ConvNet    92.4    83.4     66.0     52.0
Grasping    Baseline   92.5    90.7     87.2     73.8
Grasping    ConvNet    96.7    91.9     87.6     84.1

Speed. Our suction and grasp affordance algorithms were designed to achieve fast run-time speeds during test time by densely predicting affordances over a single image of the entire scene. In Table II, we compare our run-time speeds to several state-of-the-art alternatives for grasp planning. Our own numbers measure the time of each FCN forward pass, reported with an NVIDIA Titan X on an Intel Core i7-3770K clocked at 3.5 GHz, excluding time for image capture and other system-related overhead.

TABLE II
GRASP PLANNING RUN-TIMES (SEC.)

Method                  Time
Lenz et al. [12]        13.5
Zeng et al. [4]         10 - 15
Hernandez et al. [3]    5 - 40 (a)
Schwarz et al. [5]      0.9 - 3.3
Dex-Net 2.0 [17]        0.8
Matsumoto et al. [20]   0.2
Redmon et al. [13]      0.07
Ours (suction)          0.06
Ours (grasping)         0.05×n (b)

(a) times reported from [20] derived from [3].
(b) n = number of possible grasp angles.

TABLE III
RECOGNITION EVALUATION (% ACCURACY OF TOP-1 PREDICTION)

Method                          K vs N   Known   Novel   Mixed
Nearest Neighbor                69.2     27.2    52.6    35.0
Siamese ([31], [35])            70.3     76.9    68.2    74.2
Two-stream                      70.8     85.3    75.1    82.2
Two-stream + GE                 69.2     64.3    79.8    69.0
Two-stream + GE + MP (N-net)    69.2     56.8    82.1    64.6
N-net + AC (K-net)              93.2     99.7    29.5    78.1
Two-stage K-net + N-net         93.2     93.6    77.5    88.6
B. Recognition of Novel Objects Evaluation

We evaluate our recognition algorithms using a 1 vs 20 classification benchmark. Each test sample in the benchmark contains 20 possible object classes, where 10 are known and 10 are novel, chosen at random. During each test sample, we feed the recognition algorithm the product images for all 20 objects as well as an observed image of a grasped object. In Table III, we measure performance in terms of top-1 accuracy for matching the observed image to a product image of the correct object match. We evaluate our method against a baseline algorithm, a state-of-the-art network architecture for both visual search [31] and one shot learning without retraining [35], and several variations of our method. The latter provides an ablation study to show the improvements in performance with every added component:

Nearest Neighbor is a baseline algorithm where we compute features of product images and observed images using a ResNet-50 pre-trained on ImageNet, and use nearest neighbor matching with ℓ2 distance.

Siamese network with weight sharing is a re-implementation of Bell et al. [31] for visual search and Koch et al. [35] for one shot recognition without retraining. We use a Siamese ResNet-50 pre-trained on ImageNet and optimized over training pairs in a Siamese fashion. The main difference between this method and ours is that the weights between the networks computing deep features for product images and observed images are shared.

Two-stream network without weight sharing is a two-stream network, where the networks' weights for product images and observed images are not shared. Without weight sharing the network has more flexibility to learn the mapping function and thus achieves higher matching accuracy. All the models described later in this section use this two-stream network without weight sharing.

Two-stream + guided-embedding (GE) includes a guided feature embedding with ImageNet features for the product image stream. We find this model has better performance for novel objects than for known objects.

Two-stream + guided-embedding (GE) + multi-product-images (MP) By adding a multi-anchor switch, we see more improvements to accuracy for novel objects. This is the final network architecture for N-net.

Two-stream + guided-embedding (GE) + multi-product-images (MP) + auxiliary classification (AC) By adding an auxiliary classification, we achieve near perfect accuracy of known objects for later models, however, at the cost of lower accuracy for novel objects. This also improves known vs novel (K vs N) classification accuracy for the recollection stage. This is the final network architecture for K-net.

Two-stage system As described in Section V, we combine the two different models – one that is good at known objects (K-net) and the other that is good at novel objects (N-net) – in the two stage system. This is our final recognition algorithm, and it achieves better performance than any single model for test cases with a mixture of known and novel objects.

C. Full System Evaluation in Amazon Robotics Challenge

To evaluate the performance of our system as a whole, we used it as part of our MIT-Princeton entry for the 2017 Amazon Robotics Challenge (ARC), where state-of-the-art pick-and-place solutions competed in the context of a warehouse automation task. Participants were tasked with designing a robot system to grasp and recognize a large variety of different objects in unstructured storage systems. The objects were characterized by a number of difficult-to-handle properties. Unlike earlier versions of the competition [36], half of the objects were novel in the 2017 edition of the competition. The physical objects as well as related item data (i.e. product images, weight, 3D scans) were given to teams just 30 minutes before the competition. While other teams used the 30 minutes to collect training data for the new objects and retrain models, our system did not require any of that during those 30 minutes.

Setup. Our system setup for the competition features several differences. We incorporated weight sensors into our system, using them as a guard to signal stop or modify primitive behavior during execution. We also used the measured weights of objects provided by Amazon to boost recognition accuracy to near perfect performance. Green screens made the background more uniform to further boost accuracy of the system in the recognition phase. For predicting affordances, Table I shows that our data-driven methods with ConvNets give improved affordance predictions for both suction and grasping, with respect to the baseline algorithms. For the case of grasping, however, we did not have time to develop a fully stable ConvNet before the day of the competition, so we decided to avoid risks and use the baseline grasping algorithm. The ConvNet approach became stable with the reduction to predicting only horizontal grasps and rotating the heightmaps. Additionally for the competition, we also designed a placing algorithm that uses heightmaps and object bounding boxes to determine stable placements for the objects after recognition.

Results. During the ARC 2017 final stowing task, we had a 58.3% pick success with suction, 75% pick success with grasping, and 100% recognition accuracy, stowing all 20 objects within 24 suction attempts and 8 grasp attempts. Our system took 1st place in the stowing task, being the only system to have successfully stowed all known and novel objects and to have finished the task well within the allotted time frame.

VII. DISCUSSION AND FUTURE WORK

We present a system to pick and recognize novel objects with very limited prior information about them (a handful of product images). The system first uses a category-agnostic affordance prediction algorithm to select among four different grasping primitive behaviors, and then recognizes grasped objects by matching them to their product images. We evaluate both components and demonstrate their combination in a robot system that picks and recognizes novel objects in heavy clutter, and that took 1st place in the stowing task of the Amazon Robotics Challenge 2017. Here are some of the most salient features/limitations of the system:

Object-Agnostic Manipulation. The system finds grasp affordances directly in the RGBD image. This proved faster and more reliable than doing object segmentation and state estimation prior to grasp planning [4]. The ConvNet learns the visual features that make a region of an image graspable or suctionable. It also seems to learn more complex rules, e.g., that tags are often easier to suction than the object itself, or that the center of a long object is preferable to its ends. It would be interesting to explore the limits of the approach, for example learning affordances for more complex behaviors, e.g., scooping an object against a wall, which require a more global understanding of the geometry of the environment.

Pick First, Ask Questions Later. The standard grasping pipeline is to first recognize and then plan a grasp. In this paper we demonstrate that it is possible and sometimes beneficial to reverse the order. Our system leverages object-agnostic picking to remove the need for state estimation in clutter. Isolating the picked object drastically increases object recognition reliability, especially for novel objects. We conjecture that "pick first, ask questions later" is a good approach for applications such as bin-picking, emptying a bag of groceries, or clearing debris. It is, however, not suited for all applications – nominally when we need to pick a particular object. In that case, the described system needs to be augmented with state tracking/estimation algorithms.

Towards Scalable Solutions. Our system is designed to pick and recognize novel objects without extra data collection or retraining. This is a step forward towards robotic solutions that scale to the challenges of service robots and warehouse automation, where the daily number of novel objects ranges from the tens to the thousands, making data-collection and retraining cumbersome in one case and impossible in the other. It is interesting to consider what data, besides product images, is available that could be used for recognition using out-of-the-box algorithms like ours.

REFERENCES

[1] Webpage for code and data. [Online]. Available: arc.cs.princeton.edu
[2] R. Jonschkowski, C. Eppner, S. Höfer, R. Martín-Martín, and O. Brock, "Probabilistic multi-class segmentation for the amazon picking challenge," 2016.
[3] C. Hernandez, M. Bharatheesha, W. Ko, H. Gaiser, J. Tan, K. van Deurzen, M. de Vries, B. Van Mil, et al., "Team delft's robot winner of the amazon picking challenge 2016," arXiv, 2016.
[4] A. Zeng, K.-T. Yu, S. Song, D. Suo, E. Walker Jr, A. Rodriguez, and J. Xiao, "Multi-view self-supervised deep learning for 6d pose estimation in the amazon picking challenge," in ICRA, 2017.
[5] M. Schwarz, A. Milan, C. Lenz, A. Munoz, A. S. Periyasamy, M. Schreiber, S. Schüller, and S. Behnke, "Nimbro picking: Versatile part handling for warehouse automation," in ICRA, 2017.
[6] J. M. Wong, V. Kee, T. Le, S. Wagner, G.-L. Mariottini, A. Schneider, L. Hamilton, R. Chipalkatty, M. Hebert, et al., "Segicp: Integrated deep semantic segmentation and pose estimation," arXiv, 2017.
[7] A. Bicchi and V. Kumar, "Robotic Grasping and Contact," ICRA.
[8] A. Miller, S. Knoop, H. Christensen, and P. K. Allen, "Automatic grasp planning using shape primitives," ICRA, 2003.
[9] M. Nieuwenhuisen, D. Droeschel, D. Holz, J. Stückler, A. Berner, J. Li, R. Klein, and S. Behnke, "Mobile bin picking with an anthropomorphic service robot," in ICRA, 2013.
[10] M.-Y. Liu, O. Tuzel, A. Veeraraghavan, Y. Taguchi, T. K. Marks, and R. Chellappa, "Fast object localization and pose estimation in heavy clutter for robotic bin picking," IJRR, 2012.
[11] A. Morales et al., "Using experience for assessing grasp reliability," in IJHR, 2004.
[12] I. Lenz, H. Lee, and A. Saxena, "Deep learning for detecting robotic grasps," in IJRR, 2015.
[13] J. Redmon and A. Angelova, "Real-time grasp detection using convolutional neural networks," in ICRA, 2015.
[14] A. ten Pas and R. Platt, "Using geometry to detect grasp poses in 3d point clouds," in ISRR, 2015.
[15] L. Pinto and A. Gupta, "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours," in ICRA, 2016.
[16] L. Pinto, J. Davidson, and A. Gupta, "Supervision via competition: Robot adversaries for learning tasks," in ICRA, 2017.
[17] J. Mahler et al., "Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics," in RSS, 2017.
[18] M. Gualtieri, A. ten Pas, K. Saenko, and R. Platt, "High precision grasp pose detection in dense clutter," in arXiv, 2017.
[19] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, "Learning hand-eye coordination for robotic grasping with large-scale data collection," in ISER, 2016.
[20] E. Matsumoto, M. Saito, A. Kume, and J. Tan, "End-to-end learning of object grasp poses in the amazon robotics challenge."
[21] R. Bajcsy and M. Campos, "Active and exploratory perception," CVGIP: Image Understanding, vol. 56, no. 1, 1992.
[22] S. Chen, Y. Li, and N. M. Kwok, "Active vision in robotic systems: A survey of recent developments," IJRR, 2011.
[23] D. Jiang, H. Wang, W. Chen, and R. Wu, "A novel occlusion-free active recognition algorithm for objects in clutter," in ROBIO, 2016.
[24] K. Wu, R. Ranasinghe, and G. Dissanayake, "Active recognition and pose estimation of household objects in clutter," in ICRA, 2015.
[25] D. Jayaraman and K. Grauman, "Look-ahead before you leap: End-to-end active recognition by forecasting the effect of motion," in ECCV, 2016.
[26] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in CVPR, 2015.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[28] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in ECCV, 2012.
[29] K. Saenko, B. Kulis, M. Fritz, and T. Darrell, "Adapting visual category models to new domains," ECCV, 2010.
[30] A. Shrivastava, T. Malisiewicz, A. Gupta, and A. A. Efros, "Data-driven visual similarity for cross-domain image matching," in TOG, 2011.
[31] S. Bell and K. Bala, "Learning visual similarity for product design with convolutional neural networks," TOG, 2015.
[32] E. Hoffer, I. Hubara, and N. Ailon, "Deep unsupervised learning through spatial contrasting," arXiv, 2016.
[33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in CVPR, 2009.
[34] M. Huh, P. Agrawal, and A. A. Efros, "What makes imagenet good for transfer learning?" arXiv, 2016.
[35] G. Koch, R. Zemel, and R. Salakhutdinov, "Siamese neural networks for one-shot image recognition," in ICML Workshop, 2015.
[36] N. Correll, K. Bekris, D. Berenson, O. Brock, A. Causo, K. Hauser, K. Okada, A. Rodriguez, J. Romano, and P. Wurman, "Analysis and Observations from the First Amazon Picking Challenge," T-ASE, 2016.
