Abstract
Existing sign language recognition (SLR) methods have made significant progress, but problems remain. Traditional SLR technology relies on external devices such as data gloves and position trackers, and has achieved only limited success. Moreover, current state-of-the-art vision-based methods are hard to apply in practice because they struggle to balance accuracy and speed: most of them pay for better sign language classification accuracy with longer running time. In this paper, we propose (2+1)D-SLR, a network based on (2+1)D convolution. Unlike other methods, the proposed network achieves higher accuracy at a faster speed, because it learns spatio-temporal features directly from raw RGB sign frames. In addition, existing Chinese sign language datasets struggle to capture both the individual differences between signers and the variation across repeated performances by the same signer. To address this, we introduce NCSL, a large-scale Chinese sign language video dataset of 300 sign language words demonstrated by 30 volunteers, 10 times each. We validate our method on NCSL and on another large-scale sign language dataset, LSA64, achieving 96.4% and 98.7% accuracy, respectively. These results demonstrate that our method not only achieves competitive accuracy but is also much faster than current well-known sign language recognition methods.
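The abstract does not spell out the layer design, so the following is only a minimal sketch of the general (2+1)D factorization that the network's name refers to, in the form described by Tran et al. (2018): a full t x d x d 3D convolution is replaced by a 1 x d x d spatial convolution followed by a t x 1 x 1 temporal convolution. The function names, the default kernel sizes, and the rule for choosing the intermediate channel count (picked so the factorized block has roughly as many weights as the 3D convolution it replaces) are assumptions for illustration, not details confirmed for (2+1)D-SLR itself.

```python
def conv3d_params(c_in, c_out, t=3, d=3):
    """Weight count of a full t x d x d 3D convolution (bias ignored)."""
    return t * d * d * c_in * c_out


def mid_channels(c_in, c_out, t=3, d=3):
    """Intermediate channels M chosen so that the factorized block has
    about as many weights as the full 3D convolution (assumed rule,
    following the generic (2+1)D formulation)."""
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)


def conv2plus1d_params(c_in, c_out, t=3, d=3, m=None):
    """Weight count of a (2+1)D block: spatial conv, then temporal conv."""
    if m is None:
        m = mid_channels(c_in, c_out, t, d)
    spatial = d * d * c_in * m   # 1 x d x d kernel, c_in -> m channels
    temporal = t * m * c_out     # t x 1 x 1 kernel, m -> c_out channels
    return spatial + temporal


if __name__ == "__main__":
    for c_in, c_out in [(3, 64), (64, 64)]:
        print(c_in, c_out,
              conv3d_params(c_in, c_out),
              mid_channels(c_in, c_out),
              conv2plus1d_params(c_in, c_out))
```

With this choice of intermediate width the factorized block never has more weights than the 3D convolution, while a nonlinearity can be placed between the spatial and temporal steps. Any speed advantage in practice depends on the full architecture, which this sketch does not model.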
Acknowledgements
This work was supported in part by the Foundation of the Fundamental Research Funds for the Central Universities of China under Grants N182612002 and N2104008, the Central Government Guides The Local Science and Technology Development Special Fund under Grant 2021JH6/10500129, and the Program for Liaoning Innovative Talents in University under Grant LR2020047.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Wang, F., Du, Y., Wang, G. et al. (2+1)D-SLR: an efficient network for video sign language recognition. Neural Comput & Applic 34, 2413–2423 (2022). https://doi.org/10.1007/s00521-021-06467-9
DOI: https://doi.org/10.1007/s00521-021-06467-9