research-article
Open access

Learning Three-dimensional Skeleton Data from Sign Language Video

Published: 17 April 2020

Abstract

Data for sign language research is often difficult and costly to acquire. We therefore present a novel pipeline that generates three-dimensional (3D) skeleton motion data from single-camera sign language videos alone. First, three recurrent neural networks are trained to infer the 3D positions of body, face, and finger joints, yielding a high-resolution skeleton of the signer. Subsequently, the angular displacements of all joints over time are estimated via inverse kinematics and mapped onto a virtual sign avatar for animation. Finally, the generated data are evaluated in detail in both a sign language recognition and a sign language synthesis scenario. Using a neural word classifier trained on real motion capture data, we reliably classify word segments built from our newly generated position data with accuracy similar to that achieved on motion capture data (absolute difference 3.8%). Furthermore, a qualitative evaluation of the sign animations shows that the avatar performs natural movements that are comprehensible and resemble animations created from the original motion capture data.
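The second pipeline stage converts the inferred 3D joint positions into per-joint angular displacements for avatar animation. The abstract does not give the paper's exact inverse-kinematics formulation, so the sketch below only illustrates one common building block of such a step: computing, per frame, the shortest-arc quaternion that rotates a bone's rest-pose direction onto its observed direction. All function names here are hypothetical, and this is a minimal NumPy sketch, not the authors' implementation.

```python
import numpy as np

def shortest_arc_quaternion(v_from, v_to):
    """Quaternion (w, x, y, z) rotating unit vector v_from onto v_to."""
    v_from = v_from / np.linalg.norm(v_from)
    v_to = v_to / np.linalg.norm(v_to)
    d = np.dot(v_from, v_to)
    if d > 1.0 - 1e-8:                    # vectors already aligned: identity
        return np.array([1.0, 0.0, 0.0, 0.0])
    if d < -1.0 + 1e-8:                   # opposite: 180 deg about any orthogonal axis
        axis = np.cross(v_from, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < 1e-8:
            axis = np.cross(v_from, [0.0, 1.0, 0.0])
        axis = axis / np.linalg.norm(axis)
        return np.array([0.0, *axis])
    axis = np.cross(v_from, v_to)         # rotation axis (unnormalized)
    q = np.array([1.0 + d, *axis])        # half-angle trick: w = 1 + cos(theta)
    return q / np.linalg.norm(q)

def bone_rotations(positions, parent, child, rest_dir):
    """Per-frame rotations aligning rest_dir with the observed bone direction.

    positions: array of shape (T, J, 3) with T frames and J joints.
    """
    out = []
    for frame in positions:
        bone = frame[child] - frame[parent]
        out.append(shortest_arc_quaternion(rest_dir, bone))
    return np.array(out)
```

A sequence of such per-bone rotations, expressed in each joint's parent frame, is exactly the kind of channel data a BVH-style avatar rig consumes.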




Published In

ACM Transactions on Intelligent Systems and Technology  Volume 11, Issue 3
Survey Paper and Regular Papers
June 2020
286 pages
ISSN:2157-6904
EISSN:2157-6912
DOI:10.1145/3392081
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 April 2020
Accepted: 01 December 2019
Revised: 01 November 2019
Received: 01 July 2019
Published in TIST Volume 11, Issue 3


Author Tags

  1. 3D pose estimation
  2. data augmentation
  3. recurrent neural networks
  4. sign language recognition
  5. sign language synthesis

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Honda Research Institute Europe


Article Metrics

  • Downloads (last 12 months): 518
  • Downloads (last 6 weeks): 40
Reflects downloads up to 24 Sep 2024


Cited By

  • (2024) Building Sign Language Datasets. In: Sign Language Processing, 109--127. DOI: 10.1007/978-3-031-68763-1_7. Online publication date: 1-Sep-2024.
  • (2023) Augmentative and Alternative Communication / Hearing Impairments. In: Computer Assistive Technologies for Physically and Cognitively Challenged Users, 117--134. DOI: 10.2174/9789815079159123020008. Online publication date: 21-Mar-2023.
  • (2023) Denoising Diffusion for 3D Hand Pose Estimation from Images. In: 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 3128--3137. DOI: 10.1109/ICCVW60793.2023.00338. Online publication date: 2-Oct-2023.
  • (2023) Sign language recognition based on skeleton and SK3D-Residual network. Multimedia Tools and Applications 83, 6, 18059--18072. DOI: 10.1007/s11042-023-16117-y. Online publication date: 22-Jul-2023.
  • (2023) Towards a Multidisciplinary Approach for Designing Multimodal Sensory Communication Devices for Aeronautics. In: Design for Sustainable Inclusion, 146--155. DOI: 10.1007/978-3-031-28528-8_16. Online publication date: 24-Mar-2023.
  • (2022) A Systematic Review of User Studies as a Basis for the Design of Systems for Automatic Sign Language Processing. ACM Transactions on Accessible Computing 15, 4, 1--33. DOI: 10.1145/3563395. Online publication date: 11-Nov-2022.
  • (2022) An Approach to Sri Lankan Sign Language Recognition Using Deep Learning with MediaPipe. In: Digital Technologies and Applications, 449--459. DOI: 10.1007/978-3-031-01942-5_45. Online publication date: 8-May-2022.
  • (2021) Artificial Intelligence Technologies for Sign Language. Sensors 21, 17, 5843. DOI: 10.3390/s21175843. Online publication date: 30-Aug-2021.
  • (2020) Capture of 3D Human Motion Pose in Virtual Reality Based on Video Recognition. Complexity 2020. DOI: 10.1155/2020/8857748. Online publication date: 1-Jan-2020.
  • (2020) Teaching American Sign Language in Mixed Reality. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 4, 4, 1--27. DOI: 10.1145/3432211. Online publication date: 18-Dec-2020.
