Robotic Pick-and-Place of Novel Objects in Clutter With Multi-Affordance Grasping and Cross-Domain Image Matching
http://arc.cs.princeton.edu
https://youtu.be/6fG7zwGfIkI
[Fig. 4 panel labels: suction down, suction side, grasp down, flush grasp]
Fig. 3. Multi-functional gripper with a retractable mechanism that enables quick and automatic switching between suction (pink) and grasping (blue).
Fig. 4. Multiple motion primitives for suction and grasping to ensure successful picking for a wide variety of objects in any orientation.
first performs an object-agnostic affordance prediction, considering multiple different grasping modes from suction to parallel-jaw grasps (Section IV). It then selects the best affordance, picks up one object, isolates it from the clutter, holds it up in front of cameras, recognizes its category, and places it in the appropriate bin. Although the object recognition algorithm is trained only on known objects, it is able to recognize novel objects through a learned cross-domain image matching embedding between observed images of held objects and product images (Section V).

Advantages. This system design has several advantages. First, the affordance prediction algorithm is model-free and agnostic to object identities and generalizes to novel objects without re-training. Second, the category recognition algorithm works without task-specific data collection or re-training for novel objects, which makes it scalable for applications in warehouse automation and service robots where the range of observed object categories is large and dynamic. Third, the affordance prediction algorithm supports multiple grasping modes and thus handles a wide variety of objects. Finally, the entire processing pipeline requires only two forward passes through deep networks and thus executes quickly (Table II).

System setup. Our system features a 6-DOF ABB IRB 1600id robot arm next to four picking work-cells. The robot arm’s end-effector is a multi-functional gripper with two fingers for parallel-jaw grasps and a retractable suction cup (Fig. 3). This gripper was designed to function in cluttered environments: finger and suction cup lengths are specifically chosen such that the bulk of the gripper body does not need to enter the cluttered space. Each work-cell has a storage bin and four statically-mounted RealSense SR300 RGB-D cameras (Fig. 2): two cameras overlooking the storage bins are used to predict grasp affordances, while the other two pointing towards the robot gripper are used to recognize objects in the gripper. Although our experiments were performed with this setup, the system was designed to be flexible for picking and placing between any number of reachable work-cells and camera locations. Furthermore, all manipulation and recognition algorithms in this paper were designed to be easily adapted to other system setups.

IV. MULTI-AFFORDANCE GRASPING

The goal of the first step in our system is to robustly grasp objects from a cluttered scene without relying on their object identities or poses. To this end, we define a set of motion primitives that are complementary to each other in terms of utility across different object types and scenarios, empirically maximizing the variety of objects and orientations that can be picked with at least one primitive. Given RGB-D images of the cluttered scene at test time, we predict a set of affordances to generate grasp proposals with confidence scores for each primitive. These are then used by a task planner to choose which primitive to use.

A. Motion primitives

We define four motion primitives to achieve robust picking for typical household objects. Fig. 4 shows example motions for each primitive. Each is implemented as a set of guarded moves, with collision avoidance and quick success or failure feedback mechanisms. They are as follows:

Suction down grasps objects with a vacuum gripper vertically. This primitive is particularly robust for objects with large and flat suctionable surfaces (e.g. boxes, books, wrapped objects), and performs well in heavy clutter.

Suction side grasps objects from the side by approaching with a vacuum gripper tilted at an angle. This primitive is robust to thin and flat objects resting against walls, which may not have suctionable surfaces from the top.

Grasp down grasps objects vertically using the two-finger parallel-jaw gripper. This primitive is complementary to the suction primitives in that it is able to pick up objects with smaller, irregular surfaces (e.g. small tools, deformable objects), or made of semi-porous materials that prevent a good suction seal (e.g. cloth).

Flush grasp retrieves unsuctionable objects that are flush against a wall. The primitive is similar to grasp down, but with the additional behavior of using a flexible spatula to slide between the target object and the wall.

B. Affordance Prediction

Given the set of pre-defined picking primitives and RGB-D images of the scene, we predict pixel-level affordances for each motion primitive, from which we can generate suction and grasp proposals. Our approach relies on the assumption that graspable regions can be deduced from the local geometry and material properties, as reflected in visual information. This is inspired by recent data-driven methods for grasp planning [11], [12], [13], [15], [16], [17], [18], [19], which do not rely on object identities or state estimation. We
[Fig. 5 diagram labels: Input RGB-D Images; Suction Affordance ConvNet (suction down, suction side); Rotated Heightmaps; Horizontal Grasp Affordance ConvNet (grasp down, flush grasp)]
Fig. 5. Suction and grasp affordance prediction. Given multi-view RGB-D images, we estimate suction affordances for each image with a fully
convolutional residual network. We then aggregate the predictions on a 3D point cloud, and generate suction down or suction side proposals based on
surface normals. In parallel, we merge RGB-D images into an RGB-D heightmap, rotate it by 16 different angles, and estimate horizontal grasp affordances for each
heightmap. This effectively produces affordance maps for 16 different grasp angles, from which we generate the grasp down and flush grasp proposals.
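
As a rough illustration of the two selection steps this caption describes, the sketch below labels a suction proposal as suction down or suction side from its surface normal, and picks the highest-confidence grasp across a set of per-angle affordance maps. The z-up coordinate frame, the 45-degree tilt threshold, the function names, and the toy angle keys are all assumptions made for illustration; the paper does not state these values.

```python
import math


def classify_suction_primitive(normal, max_tilt_deg=45.0):
    """Label a suction proposal from its surface normal (nx, ny, nz).

    Assumes +z points up (against gravity). The 45-degree threshold is a
    hypothetical choice, not a value taken from the paper.
    """
    nx, ny, nz = normal
    norm = math.sqrt(nx * nx + ny * ny + nz * nz)
    if norm == 0.0:
        raise ValueError("surface normal must be non-zero")
    # Angle between the normal and the vertical axis, clamped for safety.
    cos_tilt = max(-1.0, min(1.0, nz / norm))
    tilt_deg = math.degrees(math.acos(cos_tilt))
    return "suction down" if tilt_deg <= max_tilt_deg else "suction side"


def best_grasp(affordance_maps):
    """Return (confidence, angle, row, col) of the most confident grasp.

    affordance_maps maps a grasp angle in degrees to a 2D list of per-pixel
    confidences, one map per rotated heightmap.
    """
    best = None
    for angle, amap in affordance_maps.items():
        for r, row in enumerate(amap):
            for c, conf in enumerate(row):
                if best is None or conf > best[0]:
                    best = (conf, angle, r, c)
    return best


if __name__ == "__main__":
    print(classify_suction_primitive((0.0, 0.0, 1.0)))
    print(classify_suction_primitive((1.0, 0.0, 0.0)))
    toy_maps = {0.0: [[0.1, 0.2]], 22.5: [[0.9, 0.3]]}
    print(best_grasp(toy_maps))
```

In a real pipeline the per-angle maps would come from the network's forward passes over the rotated heightmaps; here they are small hand-written lists so the selection logic can be read in isolation.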
extend these data-driven approaches by training models to predict pixel-level affordances for multiple types of grasps, and employ fully convolutional networks (FCN) [26] to efficiently obtain dense predictions over a single image of the scene to achieve faster run-time speeds.

In this subsection, we present an overview of how we predict affordances for our suction and grasping primitives. For more details about our network architectures, their training parameters, post-processing steps, and training datasets, please refer to our project webpage [1].

Predicting Suction Affordances. We define suction proposals as 3D positions where the vacuum gripper’s suction cup should come in contact with the object’s surface in order to successfully grasp it. Good suction proposals should be located on suctionable surfaces, and near the target object’s center of mass to avoid an unstable suction seal (particularly for heavy objects). Each suction proposal is defined as a 3D position x, y, z, its surface normal n_x, n_y, n_z, and confidence score c_s.

We train a fully convolutional residual network (ResNet-101 [27]) that takes an RGB-D image as input and outputs a densely labeled pixel-level binary probability map c_s, where values closer to one imply a more preferable suction location, shown in Fig. 5 first row. Our network architecture is multi-modal, where the color data is fed into one ResNet-101 tower, and 3-channel depth (cloned across channels, normalized by subtracting mean and dividing by standard deviation) is fed into another ResNet-101 tower. Features from the ends of both towers are concatenated across channels, followed by 3 additional spatial convolution layers to merge the features; the result is then spatially bilinearly upsampled and softmaxed to output a single binary probability map. We train our model over a manually annotated dataset of RGB-D images of cluttered scenes with diverse objects, where pixels are densely labeled either positive, negative, or neither (using wide-area brushstrokes from the labeling interface). We train our network with zero loss propagation for the regions that are labeled as neither positive nor negative.

During testing, we feed each captured RGB-D image through our trained network to generate probability maps for each view. As a post-processing step, we use calibrated camera intrinsics and poses to project the probability maps and aggregate the affordance predictions onto a combined 3D point cloud. We then compute surface normals for each 3D point, which are used to classify which suction primitive (down or side) to use for the point. To handle objects without depth, we use a simple hole filling algorithm [28] on the depth images, and project predicted probability scores onto the hallucinated depth.

Predicting Grasp Affordances. Each grasp proposal is represented by the x, y, z position of the gripper in 3D space, the orientation θ of the gripper around the vertical axis, the desired gripper opening distance d_o, and confidence score c_g.

To predict grasping affordances, we first aggregate the two RGB-D images of the scene into a registered 3D point cloud, which is then orthographically back-projected upwards in the gravity direction to obtain a “heightmap” image representation of the scene, with both color (RGB) and height from bottom (D) channels. To handle objects without depth, we triangulate no-depth regions in the heightmap using both views, and fill in the regions with a height of 3 cm. We feed this RGB-D heightmap as input to a fully convolutional ResNet-101 [27], which densely predicts pixel-level binary probability maps that serve as confidence values c_g for horizontally oriented grasps, shown in Fig. 5 second row. The architecture of this network is similar in structure to the network predicting suction affordances. By rotating the heightmap in 16 different orientations and feeding each individually through the network, we obtain 16 binary probability maps, each representing a confidence map for a grasp in a different orientation.

We find this network architecture to be more flexible to various grasp orientations, and less likely to diverge during training due to the sparsity of manual grasp annotations. We train our model over a manually annotated dataset of RGB-D heightmaps, where each positive and negative grasp label is represented by a pixel on the heightmap as well as a corresponding angle parallel to the jaw motion of the gripper.

Our grasp affordance predictions return grasp locations (x, y, z), orientations (θ), and confidence scores (c_g). During post-processing, we use the geometry of the 3D point cloud to estimate grasp widths (d_o) for each proposal. We also use
[Fig. 6 diagram labels: Training; Testing; product images; observed images; input; ℓ2 distance ratio loss; feature embedding; known; novel; match?; match; softmax loss for K-Net only]
Fig. 6. Recognition framework for novel objects. We train a two-stream convolutional neural network where one stream computes 2048-dimensional
feature vectors for product images while the other stream computes 2048-dimensional feature vectors for observed images, and optimize both streams so
that features are more similar for images of the same object and dissimilar otherwise. During testing, product images of both known and novel objects are
mapped onto a common feature space. We recognize observed images by mapping them to the same feature space and finding the nearest neighbor match.
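
The nearest-neighbor matching step named in this caption can be sketched as follows. This is a minimal illustration under stated assumptions: the feature vectors are taken as already computed by the two network streams, the toy vectors are 2-dimensional rather than 2048-dimensional, and the names `l2_distance` and `match_product` are hypothetical rather than taken from the authors' code.

```python
import math


def l2_distance(a, b):
    """Euclidean (l2) distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_product(observed_feature, product_features):
    """Return (label, distance) of the closest product image.

    product_features maps an object label to a list of feature vectors,
    one per product image of that object.
    """
    best_label, best_dist = None, float("inf")
    for label, features in product_features.items():
        for feature in features:
            dist = l2_distance(observed_feature, feature)
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label, best_dist


if __name__ == "__main__":
    # Toy 2-D features standing in for 2048-D embeddings.
    products = {
        "mug": [[0.0, 1.0]],
        "brush": [[1.0, 0.0], [0.9, 0.1]],
    }
    print(match_product([0.8, 0.2], products))
```

Returning the distance alongside the label keeps the sketch usable for any later step that needs to reason about how close the best match is.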
the location of each proposal relative to the bin to classify which grasping primitive (down or flush) should be used.

V. RECOGNIZING NOVEL OBJECTS

After successfully grasping an object and isolating it from clutter, the goal of the second step in our system is to recognize the identity of the grasped object.

Since we encounter both known and novel objects, and we have only product images for the novel objects, we address this recognition problem by retrieving the best match among a set of product images. Of course, observed images and product images can be captured in significantly different environments in terms of lighting, object pose, background color, post-process editing, etc. Therefore, we need a model that is able to find the semantic correspondences between images from these two different domains. This is a cross-domain image matching problem [29], [30], [31].

A. Metric Learning for Cross-Domain Image Matching

To perform cross-domain image matching between observed images and product images, we learn a metric function that takes in an observed image and a candidate product image and outputs a distance value that models how likely the images are of the same object. The goal of the metric function is to map both the observed image and product image onto a meaningful feature embedding space so that smaller ℓ2 feature distances indicate higher similarities. The product image with the smallest metric distance to the observed image is the final matching result.

We model this metric function with a two-stream convolutional neural network (ConvNet) architecture where one stream computes features for the observed images, and a different stream computes features for the product images. We train the network by feeding it a balanced 1:1 ratio of matching and non-matching image pairs (one observed image and one product image) from the set of known objects, and backpropagate gradients from the distance ratio loss (Triplet loss [32]). This effectively optimizes the network in a way that minimizes the ℓ2 distances between features of matching pairs while pulling apart the ℓ2 distances between features of non-matching pairs. By training over enough examples of these image pairs across known objects, the network learns a feature embedding that encapsulates object shape, color, and other discriminative visual properties, which can generalize and be used to match observed images of novel objects to their respective product images (Fig. 6).

Avoiding metric collapse by guided feature embeddings. One issue commonly encountered in metric learning occurs when the number of training object categories is small: the network can easily overfit its feature space to capture only the small set of training categories, making generalization to novel object categories difficult. We refer to this problem as metric collapse. To avoid this issue, we use a model pre-trained on ImageNet [33] for the product image stream and train only the stream that computes features for observed images. ImageNet contains a large collection of images from many categories, and models pre-trained on it have been shown to produce relatively comprehensive and homogeneous feature embeddings for transfer tasks [34], i.e. providing discriminating features for images of a wide range of objects. Our training procedure trains the observed image stream to produce features similar to the ImageNet features of product images, i.e., it learns a mapping from observed images to ImageNet features. Those features are then suitable for direct comparison to features of product images, even for novel objects not encountered during training.

Using multiple product images. For many applications, there can be multiple product images per object. However, with multiple product images, supervision of the two-stream network can become confusing: on which pair of matching observed and product images should the backpropagated gradients be based? To solve this problem, we add a module we call a “multi-anchor switch” in the network. During training, this module automatically chooses which “anchor” product image to compare against based on nearest neighbor ℓ2 distance. We find that allowing the network to select its own criterion for choosing “anchor” product images provides a significant boost in performance in comparison to alternative methods like random sampling.

B. Two Stage Framework for a Mixture of Known and Novel Objects

In settings where both types of objects are present, we find that training two different network models to handle
known and novel objects separately can yield higher overall matching accuracies. One is trained to be good at “overfitting” to the known objects (K-net) and the other is trained to be better at “generalizing” to novel objects (N-net).

Yet, how do we know which network to use for a given image? To address this issue, we execute our recognition pipeline in two stages: a “recollection” stage that determines whether the observed object is known or novel, and a “hypothesis” stage that uses the appropriate network model based on the first stage’s output to perform image matching.

First, the recollection stage predicts whether the input observed image from test time is that of a known object that has appeared during training. Intuitively, an observed image is of a novel object if and only if its deep features cannot be matched to those of any images of known objects. We explicitly model this conditional by thresholding on the nearest neighbor distance to product image features of known objects. In other words, if the ℓ2 distance between the K-net features of an observed image and the nearest neighbor product image of a known object is greater than some threshold k, then the observed image is of a novel object.

In the hypothesis stage, we perform object recognition based on one of two network models: K-net for known objects and N-net for novel objects. The K-net and N-net share the same network architecture. However, the K-net has an additional auxiliary classification loss during training for the known objects. This classification loss increases the accuracy on known objects at test time to near-perfect performance, and also boosts the accuracy of the recollection stage, but fails to maintain the accuracy on novel objects. On the other hand, without the restriction of the classification loss, N-net has a lower accuracy for known objects, but maintains a better accuracy for novel objects.

By adding the recollection stage, we can exploit both the high accuracy on known objects with K-net and good accuracy on novel objects with N-net, though incurring a cost in accuracy from erroneous known vs novel classification. We find that this two-stage system overall provides higher total matching accuracy for recognizing both known and novel objects (mixed) than all other baselines (Table III).

VI. EXPERIMENTS

In this section, we evaluate our multi-affordance prediction for suction and grasp primitives, our recognition algorithm over both known and novel objects, as well as our full system in the context of the Amazon Robotics Challenge 2017.

A. Multi-Affordance Prediction Experiments

Datasets. To generate datasets for affordance predictions, we designed a simple labeling interface that prompts users to manually annotate suction and grasp proposals over RGB-D images collected from the real system. For suction, users who have had experience working with our suction gripper are asked to annotate pixels of suctionable and non-suctionable areas on raw RGB-D images overlooking cluttered bins full of various objects. Similarly, users with experience using our parallel-jaw gripper are asked to sparsely annotate positive and negative grasps over re-projected heightmaps of cluttered bins, where each grasp is represented by a pixel on the heightmap and an angle parallel to the jaw motion of the gripper. We further augment each grasp label by adding additional labels with small jittering (less than 1.6 cm). In total, the dataset contains 1837 RGB-D images with suction and grasp labels. We use a 4:1 training/testing split across this dataset to train and evaluate different models.

Evaluation. In the context of our system, an affordance prediction method is robust if it is able to consistently find at least one suction or grasp proposal that works. To reflect this, our evaluation metric is the precision of predicted proposals versus manual annotations. For suction, a proposal is considered a true positive if its pixel center is manually labeled as a suctionable area. For grasping, a proposal is considered a true positive prediction if its pixel center is within 4 pixels and 11.25 degrees from a positive grasp label.

We report the precision of our predicted proposals for different confidence percentiles in Table I. The precision of the top-1 proposal is reliably above 90% for both suction and grasping. We further compare our methods to heuristic-based baseline algorithms that compute suction affordances by estimating surface normal variance over the observed 3D point cloud (lower variance = higher affordance), and compute antipodal grasps by detecting hill-like geometric structures in the 3D point cloud. Baseline details and code are available on our project webpage [1].

TABLE I
MULTI-AFFORDANCE PREDICTION PERFORMANCE
Primitive  Method    Top-1  Top 1%  Top 5%  Top 10%
Suction    Baseline  35.2   55.4    46.7    38.5
Suction    ConvNet   92.4   83.4    66.0    52.0
Grasping   Baseline  92.5   90.7    87.2    73.8
Grasping   ConvNet   96.7   91.9    87.6    84.1
% precision of predictions across different confidence percentiles.

Speed. Our suction and grasp affordance algorithms were designed to achieve fast run-time speeds during test time by densely predicting affordances over a single image of the entire scene. In Table II, we compare our run-time speeds to several state-of-the-art alternatives for grasp planning. Our own numbers measure the time of each FCN forward pass, reported with an NVIDIA Titan X on an Intel Core i7-3770K clocked at 3.5 GHz, excluding time for image capture and other system-related overhead.

B. Recognition of Novel Objects Evaluation

We evaluate our recognition algorithms using a 1-vs-20 classification benchmark. Each test sample in the benchmark contains 20 possible object classes, where 10 are known and 10 are novel, chosen at random. For each test sample, we feed the recognition algorithm the product images for all 20 objects as well as an observed image of a grasped object. In Table III, we measure performance in terms of top-1 accuracy for matching the observed image to a product image of the correct object. We evaluate our method against a baseline algorithm, a state-of-the-art network architecture for both visual search [31] and one-shot learning without
TABLE II
GRASP PLANNING RUN-TIMES (SEC.)
TABLE III
RECOGNITION EVALUATION (% ACCURACY OF TOP-1 PREDICTION)
VII. DISCUSSION AND FUTURE WORK

We present a system to pick and recognize novel objects with very limited prior information about them (a handful of product images). The system first uses a category-agnostic affordance prediction algorithm to select among four different grasping primitive behaviors, and then recognizes grasped objects by matching them to their product images. We evaluate both components and demonstrate their combination in a robot system that picks and recognizes novel objects in heavy clutter, and that took 1st place in the stowing task of the Amazon Robotics Challenge 2017. Here are some of the most salient features/limitations of the system:

Object-Agnostic Manipulation. The system finds grasp affordances directly in the RGB-D image. This proved faster and more reliable than doing object segmentation and state estimation prior to grasp planning [4]. The ConvNet learns the visual features that make a region of an image graspable or suctionable. It also seems to learn more complex rules, e.g., that tags are often easier to suction than the object itself, or that the center of a long object is preferable to its ends. It would be interesting to explore the limits of the approach, for example by learning affordances for more complex behaviors, such as scooping an object against a wall, which require a more global understanding of the geometry of the environment.

Pick First, Ask Questions Later. The standard grasping pipeline is to first recognize and then plan a grasp. In this paper we demonstrate that it is possible and sometimes beneficial to reverse the order. Our system leverages object-agnostic picking to remove the need for state estimation in clutter. Isolating the picked object drastically increases object recognition reliability, especially for novel objects. We conjecture that “pick first, ask questions later” is a good approach for applications such as bin-picking, emptying a bag of groceries, or clearing debris. It is, however, not suited for all applications, namely when we need to pick a particular object. In that case, the described system needs to be augmented with state tracking/estimation algorithms.

Towards Scalable Solutions. Our system is designed to pick and recognize novel objects without extra data collection or retraining. This is a step towards robotic solutions that scale to the challenges of service robots and warehouse automation, where the daily number of novel objects ranges from the tens to the thousands, making data collection and retraining cumbersome in one case and impossible in the other. It is interesting to consider what data, besides product images, is available that could be used for recognition using out-of-the-box algorithms like ours.

REFERENCES

[1] Webpage for code and data. [Online]. Available: arc.cs.princeton.edu
[2] R. Jonschkowski, C. Eppner, S. Höfer, R. Martín-Martín, and O. Brock, “Probabilistic multi-class segmentation for the amazon picking challenge,” 2016.
[3] C. Hernandez, M. Bharatheesha, W. Ko, H. Gaiser, J. Tan, K. van Deurzen, M. de Vries, B. Van Mil, et al., “Team delft’s robot winner of the amazon picking challenge 2016,” arXiv, 2016.
[4] A. Zeng, K.-T. Yu, S. Song, D. Suo, E. Walker Jr, A. Rodriguez, and J. Xiao, “Multi-view self-supervised deep learning for 6d pose estimation in the amazon picking challenge,” in ICRA, 2017.
[5] M. Schwarz, A. Milan, C. Lenz, A. Munoz, A. S. Periyasamy, M. Schreiber, S. Schüller, and S. Behnke, “Nimbro picking: Versatile part handling for warehouse automation,” in ICRA, 2017.
[6] J. M. Wong, V. Kee, T. Le, S. Wagner, G.-L. Mariottini, A. Schneider, L. Hamilton, R. Chipalkatty, M. Hebert, et al., “Segicp: Integrated deep semantic segmentation and pose estimation,” arXiv, 2017.
[7] A. Bicchi and V. Kumar, “Robotic Grasping and Contact,” ICRA.
[8] A. Miller, S. Knoop, H. Christensen, and P. K. Allen, “Automatic grasp planning using shape primitives,” ICRA, 2003.
[9] M. Nieuwenhuisen, D. Droeschel, D. Holz, J. Stückler, A. Berner, J. Li, R. Klein, and S. Behnke, “Mobile bin picking with an anthropomorphic service robot,” in ICRA, 2013.
[10] M.-Y. Liu, O. Tuzel, A. Veeraraghavan, Y. Taguchi, T. K. Marks, and R. Chellappa, “Fast object localization and pose estimation in heavy clutter for robotic bin picking,” IJRR, 2012.
[11] A. Morales et al., “Using experience for assessing grasp reliability,” in IJHR, 2004.
[12] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” in IJRR, 2015.
[13] J. Redmon and A. Angelova, “Real-time grasp detection using convolutional neural networks,” in ICRA, 2015.
[14] A. ten Pas and R. Platt, “Using geometry to detect grasp poses in 3d point clouds,” in ISRR, 2015.
[15] L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” in ICRA, 2016.
[16] L. Pinto, J. Davidson, and A. Gupta, “Supervision via competition: Robot adversaries for learning tasks,” in ICRA, 2017.
[17] J. Mahler et al., “Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” in RSS, 2017.
[18] M. Gualtieri, A. ten Pas, K. Saenko, and R. Platt, “High precision grasp pose detection in dense clutter,” arXiv, 2017.
[19] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, “Learning hand-eye coordination for robotic grasping with large-scale data collection,” in ISER, 2016.
[20] E. Matsumoto, M. Saito, A. Kume, and J. Tan, “End-to-end learning of object grasp poses in the amazon robotics challenge.”
[21] R. Bajcsy and M. Campos, “Active and exploratory perception,” CVGIP: Image Understanding, vol. 56, no. 1, 1992.
[22] S. Chen, Y. Li, and N. M. Kwok, “Active vision in robotic systems: A survey of recent developments,” IJRR, 2011.
[23] D. Jiang, H. Wang, W. Chen, and R. Wu, “A novel occlusion-free active recognition algorithm for objects in clutter,” in ROBIO, 2016.
[24] K. Wu, R. Ranasinghe, and G. Dissanayake, “Active recognition and pose estimation of household objects in clutter,” in ICRA, 2015.
[25] D. Jayaraman and K. Grauman, “Look-ahead before you leap: End-to-end active recognition by forecasting the effect of motion,” in ECCV, 2016.
[26] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
[27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
[28] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGBD images,” in ECCV, 2012.
[29] K. Saenko, B. Kulis, M. Fritz, and T. Darrell, “Adapting visual category models to new domains,” ECCV, 2010.
[30] A. Shrivastava, T. Malisiewicz, A. Gupta, and A. A. Efros, “Data-driven visual similarity for cross-domain image matching,” in TOG, 2011.
[31] S. Bell and K. Bala, “Learning visual similarity for product design with convolutional neural networks,” TOG, 2015.
[32] E. Hoffer, I. Hubara, and N. Ailon, “Deep unsupervised learning through spatial contrasting,” arXiv, 2016.
[33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.
[34] M. Huh, P. Agrawal, and A. A. Efros, “What makes imagenet good for transfer learning?” arXiv, 2016.
[35] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks for one-shot image recognition,” in ICML Workshop, 2015.
[36] N. Correll, K. Bekris, D. Berenson, O. Brock, A. Causo, K. Hauser, K. Okada, A. Rodriguez, J. Romano, and P. Wurman, “Analysis and Observations from the First Amazon Picking Challenge,” T-ASE, 2016.