research-article
Open access

Learning Three-dimensional Skeleton Data from Sign Language Video

Published: 17 April 2020

Abstract

Data for sign language research is often difficult and costly to acquire. We therefore present a novel pipeline that generates three-dimensional (3D) skeleton motion data from single-camera sign language videos alone. First, three recurrent neural networks are trained to infer the 3D positions of body, face, and finger joints, yielding a high-resolution skeleton of the signer. Subsequently, the angular displacements of all joints over time are estimated via inverse kinematics and mapped onto a virtual sign avatar for animation. Finally, the generated data are evaluated in detail in both a sign language recognition and a sign language synthesis scenario. Using a neural word classifier trained on real motion capture data, we reliably classify word segments built from our newly generated position data with accuracy similar to that achieved on motion capture data (absolute difference 3.8%). Furthermore, a qualitative evaluation of the sign animations shows that the avatar performs natural movements that are comprehensible and resemble animations created from the original motion capture data.
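The second pipeline stage converts the inferred 3D joint positions into per-joint angular displacements for avatar animation. The abstract does not give the paper's exact inverse-kinematics formulation, so the sketch below only illustrates one common building block of such a step: computing, per frame, the shortest-arc quaternion that rotates a bone's rest-pose direction onto its observed direction. All function names here are hypothetical, and this is a minimal NumPy sketch, not the authors' implementation.

```python
import numpy as np

def shortest_arc_quaternion(v_from, v_to):
    """Quaternion (w, x, y, z) rotating unit vector v_from onto v_to."""
    v_from = v_from / np.linalg.norm(v_from)
    v_to = v_to / np.linalg.norm(v_to)
    d = np.dot(v_from, v_to)
    if d > 1.0 - 1e-8:                    # vectors already aligned: identity
        return np.array([1.0, 0.0, 0.0, 0.0])
    if d < -1.0 + 1e-8:                   # opposite: 180 deg about any orthogonal axis
        axis = np.cross(v_from, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < 1e-8:
            axis = np.cross(v_from, [0.0, 1.0, 0.0])
        axis = axis / np.linalg.norm(axis)
        return np.array([0.0, *axis])
    axis = np.cross(v_from, v_to)         # rotation axis (unnormalized)
    q = np.array([1.0 + d, *axis])        # half-angle trick: w = 1 + cos(theta)
    return q / np.linalg.norm(q)

def bone_rotations(positions, parent, child, rest_dir):
    """Per-frame rotations aligning rest_dir with the observed bone direction.

    positions: array of shape (T, J, 3) with T frames and J joints.
    """
    out = []
    for frame in positions:
        bone = frame[child] - frame[parent]
        out.append(shortest_arc_quaternion(rest_dir, bone))
    return np.array(out)
```

A sequence of such per-bone rotations, expressed in each joint's parent frame, is exactly the kind of channel data a BVH-style avatar rig consumes.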




Published In

ACM Transactions on Intelligent Systems and Technology  Volume 11, Issue 3
Survey Paper and Regular Papers
June 2020
286 pages
ISSN:2157-6904
EISSN:2157-6912
DOI:10.1145/3392081
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 April 2020
Accepted: 01 December 2019
Revised: 01 November 2019
Received: 01 July 2019
Published in TIST Volume 11, Issue 3


Author Tags

  1. 3D pose estimation
  2. data augmentation
  3. recurrent neural networks
  4. sign language recognition
  5. sign language synthesis

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Honda Research Institute Europe


Article Metrics

  • Downloads (last 12 months): 518
  • Downloads (last 6 weeks): 40
Reflects downloads up to 24 Sep 2024


Cited By

  • (2024) Building Sign Language Datasets. In: Sign Language Processing, 109--127. DOI: 10.1007/978-3-031-68763-1_7. Online publication date: 1-Sep-2024.
  • (2023) Augmentative and Alternative Communication / Hearing Impairments. In: Computer Assistive Technologies for Physically and Cognitively Challenged Users, 117--134. DOI: 10.2174/9789815079159123020008. Online publication date: 21-Mar-2023.
  • (2023) Denoising Diffusion for 3D Hand Pose Estimation from Images. In: 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 3128--3137. DOI: 10.1109/ICCVW60793.2023.00338. Online publication date: 2-Oct-2023.
  • (2023) Sign language recognition based on skeleton and SK3D-Residual network. Multimedia Tools and Applications 83, 6, 18059--18072. DOI: 10.1007/s11042-023-16117-y. Online publication date: 22-Jul-2023.
  • (2023) Towards a Multidisciplinary Approach for Designing Multimodal Sensory Communication Devices for Aeronautics. In: Design for Sustainable Inclusion, 146--155. DOI: 10.1007/978-3-031-28528-8_16. Online publication date: 24-Mar-2023.
  • (2022) A Systematic Review of User Studies as a Basis for the Design of Systems for Automatic Sign Language Processing. ACM Transactions on Accessible Computing 15, 4, 1--33. DOI: 10.1145/3563395. Online publication date: 11-Nov-2022.
  • (2022) An Approach to Sri Lankan Sign Language Recognition Using Deep Learning with MediaPipe. In: Digital Technologies and Applications, 449--459. DOI: 10.1007/978-3-031-01942-5_45. Online publication date: 8-May-2022.
  • (2021) Artificial Intelligence Technologies for Sign Language. Sensors 21, 17, 5843. DOI: 10.3390/s21175843. Online publication date: 30-Aug-2021.
  • (2020) Capture of 3D Human Motion Pose in Virtual Reality Based on Video Recognition. Complexity 2020. DOI: 10.1155/2020/8857748. Online publication date: 1-Jan-2020.
  • (2020) Teaching American Sign Language in Mixed Reality. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 4, 4, 1--27. DOI: 10.1145/3432211. Online publication date: 18-Dec-2020.
