Abstract
Localizing an object accurately with respect to a robot is a key step for autonomous robotic manipulation. In this work, we propose to tackle this task knowing only 3D models of the robot and object in the particular case where the scene is viewed from uncalibrated cameras—a situation which would be typical in an uncontrolled environment, e.g., on a construction site. We demonstrate that this localization can be performed very accurately, with millimetric errors, without using a single real image for training, a strong advantage since acquiring representative training data is a long and expensive process. Our approach relies on a classification Convolutional Neural Network trained using hundreds of thousands of synthetically rendered scenes with randomized parameters. To evaluate our approach quantitatively and make it comparable to alternative approaches, we build a new rich dataset of real robot images with accurately localized blocks.
Notes
The project page for the UnLoc dataset (Uncalibrated Relative Localization) is imagine.enpc.fr/~loingvi/unloc.
Additional information
Communicated by Adrien Gaidon, Florent Perronnin and Antonio Lopez.
Appendices
A Training Dataset Based on Synthetic Images
As introduced in Sect. 2.2, we generated three synthetic datasets, one to train the network of each subtask:
1. robot and block in random pose, for coarse estimation;
2. robot with clamp in a random vertical pose pointing downwards, for 2D tool detection;
3. close-up of the vertical clamp and a random block nearby, for fine estimation.
We created a simple room model in which a robot is placed on the floor and a cuboid block is laid nearby. The robot we experimented with is an IRB 120 from ABB, for which we have a 3D model; we also have a 3D model of the clamp. However, we did not model the cables on the robot base or on the clamp (compare, e.g., Fig. 1a–c). We considered configurations similar to those found in the evaluation dataset, although with slightly greater variations for robustness. The randomization used to generate the images is as follows:
- The size of the room is 20 m × 20 m, so that the walls are visible in some images.
- The orientation and position of the robot base (20 cm × 30 cm) are sampled randomly. (The robot height is around 70 cm and the arm length around 50 cm.)
- The orientations (angles) of the robot joints are sampled randomly among all possible values, except for the clamp 2D detection and fine estimation tasks, where the arm extremity is placed vertically above the floor.
- Each dimension of the cuboid block is sampled randomly between 2.5 and 10 cm.
- The block is laid flat on the floor, with an orientation and a position sampled randomly within the attainable robot positions for the coarse estimation task, or in a 12 cm square below the clamp for the fine estimation task.
- All the textures (floor, robot and block) are sampled among 69 widely different texture images.
- The camera center is sampled randomly at a height between 70 and 130 cm above the floor, in a cylindrical sleeve of minimum radius 1 m and maximum radius 2.8 m centered on the robot, as illustrated in Fig. 9.
- For the coarse estimation setting (wide views), the camera target is sampled in a cylinder of 30 cm radius and 50 cm height around the robot base.
- For the fine estimation setting (close-ups), the target is the center of the clamp, with a small random shift.
- The camera is rotated around its main axis (the line between the camera center and the camera target) by an angle sampled randomly between -8 and +8 degrees.
- The camera focal length is sampled randomly between 45 and 75 mm, for an equivalent sensor frame size of 24 mm × 24 mm.
- Synthetic images are 256 × 256 pixels.
The pictures were generated with the Unreal Engine 4 game engine. The dataset we created for coarse estimation consists of approximately 420k images (examples in Fig. 4a), the one for 2D clamp detection of approximately 2800k images (Fig. 4b), and the one for fine estimation of approximately 600k images (Fig. 4c). We used 90% of the images for training and the remaining 10% for validation.
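As an illustration of this randomization, the short Python sketch below samples scene parameters with the ranges listed above. It is only a sketch: the sample_scene() interface, the dictionary layout and the placeholder ranges for the robot base position and joint angles are ours, and the actual images were rendered with Unreal Engine 4 rather than produced by such a script.

```python
import random

# Illustrative sketch of the scene randomization described above; names and
# the placeholder ranges marked below are assumptions, not the actual
# Unreal Engine 4 generation code.
TEXTURES = [f"texture_{i:02d}" for i in range(69)]  # 69 widely different textures

def sample_scene(task="coarse"):
    scene = {
        "room_size_m": (20.0, 20.0),
        "robot_base_pose": (random.uniform(-0.5, 0.5),    # placeholder position range
                            random.uniform(-0.5, 0.5),
                            random.uniform(0.0, 360.0)),  # base orientation (degrees)
        "block_dims_cm": [random.uniform(2.5, 10.0) for _ in range(3)],
        "textures": {part: random.choice(TEXTURES)
                     for part in ("floor", "robot", "block")},
        # Camera center in a cylindrical sleeve centered on the robot.
        "camera_radius_m": random.uniform(1.0, 2.8),
        "camera_height_m": random.uniform(0.70, 1.30),
        "camera_roll_deg": random.uniform(-8.0, 8.0),
        "focal_length_mm": random.uniform(45.0, 75.0),    # 24 mm x 24 mm sensor
    }
    if task == "coarse":
        # Random joint angles (placeholder bounds) and a wide view of the robot.
        scene["joint_angles_deg"] = [random.uniform(-180.0, 180.0) for _ in range(6)]
    else:
        # Clamp vertical, block laid in a 12 cm square below it.
        scene["block_offset_cm"] = (random.uniform(-6.0, 6.0), random.uniform(-6.0, 6.0))
    return scene
```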
B Evaluation Dataset Based on Real Images
As explained in Sect. 5.1, our evaluation dataset divides into three parts, corresponding to different settings illustrated in Fig. 8:
1. in the ’lab’ dataset, the robot and the block are on a table with no particular distractor or texture;
2. in the ’field’ dataset, the table is covered with dirt, sand and gravel, making the flat surface uneven;
3. in the ’adv’ dataset, the table is covered with pieces of paper that can be confused with cuboid blocks.
We use the robot itself to accurately move the block to random poses, which provides a reliable measure of its relative position and orientation for each configuration. In practice, the block can slowly drift from its expected position as the robot repeatedly picks it, moves it and puts it down. To ensure there is no drift, the block position is checked and realigned every ten positions. Because of the limited stroke of the clamp, we considered only a single block size: 5 cm × 8 cm × 4 cm. Note however that our method does not exploit this information; we believe a robust method should be able to process a wide range of block shapes. As we want to model situations where the robot can pick a block, we restrict the reach of the arm to the range for which the tool can be set vertically above the block, i.e., 0.505 m.
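For concreteness, the following sketch mimics this acquisition loop: block poses are drawn in the disc of radius 0.505 m that the arm can reach with a vertical tool, and the block is realigned every ten positions. The names move_block_to and realign_block are hypothetical placeholders for the robot controller calls, and the uniform-disc sampling is our own simplification of the per-region sampling described in the next paragraph.

```python
import math
import random

MAX_REACH_M = 0.505  # reach of the arm with the tool held vertically

def sample_block_pose():
    """Uniformly sample a block pose (x, y, yaw) in the reachable disc."""
    r = MAX_REACH_M * math.sqrt(random.random())  # sqrt gives uniform area density
    a = random.uniform(0.0, 2.0 * math.pi)
    yaw = random.uniform(0.0, 360.0)
    return r * math.cos(a), r * math.sin(a), yaw

def collect_poses(n_poses, move_block_to, realign_block):
    """Move the block through n_poses random poses, realigning every ten."""
    for i in range(n_poses):
        if i % 10 == 0:
            realign_block()  # cancel any drift accumulated by repeated picking
        move_block_to(*sample_block_pose())
```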
We collected images from 3 cameras at various viewpoints, looking at the scene slightly from above. To make sure the block is visible in most pictures, we successively considered different regions of the experiment table for sampling block poses, moving the cameras to ensure good visibility for each region. The cameras were moved manually, without any particular care. The distance between a camera and the block is typically between 1 and 2.5 m. The maximum angle between the left-most and the right-most cameras is typically on the order of 120 degrees. This setting is illustrated in Fig. 6.
For each block position with respect to the robot base, we consider two main articulations of the robot arm: a random arm configuration for the coarse location subtask, and an articulation where the clamp is vertical, pointing downward and positioned next to the block for the fine location subtask. In the latter case, we positioned the clamp 150 mm above the table surface, at a random horizontal position in a 120 mm square around the block.
We actually recorded two clamp orientations along the vertical axis: a first random orientation, and then a second orientation where the clamp is rotated by 90 degrees (see Fig. 5). As the clamp orientation, with respect to which the fine estimation is defined, can be hard to estimate for some configurations, using two orientations allows more accurate predictions.
In total, we considered approximately 1300 poses (positions and orientations) of the robot and the block together. This led to a total of approximately 12,000 images, of size 1080 × 1080 for wide views and 432 × 432 for close-ups, with corresponding ground-truth relative position and orientation. The cameras used are eLight full HD 1080p webcams from Trust. Camera intrinsics were neither available nor estimated. Nevertheless, the focal length was roughly determined to be about 50 mm and, in the synthetic images, the camera focal length was randomly set between 45 and 75 mm (see “Appendix A”).
Datasets and 3D models are available from imagine.enpc.fr/~loingvi/unloc.
C Network Architecture Details
We define here the bins used by the three networks addressing the three subtasks.
The number of bins for the last layer of the coarse estimation network depends on the bin size and on the maximum range of the robotic arm with the tool maintained vertically, i.e., 0.505 m (see “Appendix B”). In practice, we defined bins of 5 mm for the coarse estimation and 2 mm for the fine estimation, both visualized by the fine red grid in Figs. 1a and 3. These figures also give a sense of how accurate the localization is compared to the sizes of the block and of the robot. For the angular estimation, we used bins of 5 and 2 degrees, respectively. For the clamp detection network, we defined bins whose size is 2% of the picture width. Since, as stated above, we predict each dimension separately, this leads to 202 bins for x and y and 36 bins for θ for the coarse localization network, 60 bins for x and y and 90 bins for θ for the fine localization network, and 50 bins for x and y for the clamp localization network.
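As a worked illustration of this binning, the sketch below maps a continuous value to a class index and back; the helper names and the exact bin boundaries (e.g., x and y in [-0.505 m, 0.505 m] for the coarse network, a 120 mm range for the fine network) are assumptions consistent with the bin counts above, not the authors' exact implementation.

```python
def to_bin(value, low, bin_size, n_bins):
    """Map a continuous value to a class index, clamped to the valid range."""
    idx = int((value - low) / bin_size)
    return min(max(idx, 0), n_bins - 1)

def bin_center(idx, low, bin_size):
    """Map a class index back to the center of its bin."""
    return low + (idx + 0.5) * bin_size

# Coarse localization: x, y in [-0.505 m, 0.505 m] with 5 mm bins -> 202 classes;
# theta in [0, 180) degrees with 5 degree bins -> 36 classes.
coarse_x_bin = to_bin(0.1235, low=-0.505, bin_size=0.005, n_bins=202)

# Fine localization: x, y over a 120 mm range with 2 mm bins -> 60 classes;
# theta with 2 degree bins -> 90 classes.
fine_theta_bin = to_bin(47.0, low=0.0, bin_size=2.0, n_bins=90)

# Clamp 2D detection: normalized image coordinates with bins of 2% of the
# picture width -> 50 classes for x and y.
clamp_x_bin = to_bin(0.63, low=0.0, bin_size=0.02, n_bins=50)
```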
Cite this article
Loing, V., Marlet, R. & Aubry, M. Virtual Training for a Real Application: Accurate Object-Robot Relative Localization Without Calibration. Int J Comput Vis 126, 1045–1060 (2018). https://doi.org/10.1007/s11263-018-1102-6