(2+1)D-SLR: an efficient network for video sign language recognition

  • Original Article
Neural Computing and Applications

Abstract

Existing sign language recognition (SLR) methods have made significant progress, but problems remain. Traditional SLR technology relies on external devices such as data gloves and position trackers and has achieved limited success. Moreover, current state-of-the-art vision-based methods are hard to apply in practice because they struggle to balance accuracy and speed: most of them pay for better sign language classification accuracy with longer running time. In this paper, we propose (2+1)D-SLR, a network based on (2+1)D convolution. Unlike other methods, the proposed network achieves higher accuracy at a faster speed, because it learns spatio-temporal features directly from raw RGB sign frames. In addition, existing Chinese sign language datasets do not adequately capture the differences between individual signers or the variation across repeated performances by the same signer. We therefore propose NCSL, a large-scale Chinese sign language video dataset that includes 300 different sign language words demonstrated by 30 volunteers, 10 times each. We validated our method on NCSL and on another large-scale sign language dataset, LSA64, achieving 96.4% and 98.7% accuracy, respectively. These results demonstrate that our method not only achieves competitive accuracy but is also much faster than current well-known sign language recognition methods.
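The abstract does not spell out how a (2+1)D convolution is built, so the following is only a rough sketch of the general idea as described by Tran et al. (2018): a full t x d x d 3D convolution is replaced by a 1 x d x d spatial convolution followed by a t x 1 x 1 temporal convolution, with an intermediate channel width chosen so that the weight count stays close to that of the 3D layer. The function names and the default kernel sizes are illustrative assumptions, and whether (2+1)D-SLR uses this exact width rule is not stated in the abstract.

```python
# Illustrative only: weight counts for a 3D convolution versus its
# (2+1)D factorization. Names and defaults are assumptions, not the
# authors' implementation.

def conv3d_params(c_in, c_out, t=3, d=3):
    """Weights in a full t x d x d 3D convolution (bias ignored)."""
    return c_in * c_out * t * d * d


def mid_channels(c_in, c_out, t=3, d=3):
    """Intermediate width that keeps the (2+1)D block within the 3D budget."""
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)


def conv2plus1d_params(c_in, c_out, t=3, d=3):
    """Weights in a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv."""
    m = mid_channels(c_in, c_out, t, d)
    spatial = c_in * m * d * d   # 2D convolution applied to each frame
    temporal = m * c_out * t     # 1D convolution applied across frames
    return spatial + temporal


if __name__ == "__main__":
    for c_in, c_out in [(3, 64), (64, 64)]:
        print(c_in, c_out,
              conv3d_params(c_in, c_out),
              mid_channels(c_in, c_out),
              conv2plus1d_params(c_in, c_out))
```

With this width rule the factorized block never exceeds the weight count of the 3D layer it replaces, while the extra nonlinearity that can be placed between the spatial and temporal steps is one commonly cited reason such blocks are easier to optimize.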

Acknowledgements

This work was supported in part by the Foundation of the Fundamental Research Funds for the Central Universities of China under Grants N182612002 and N2104008, the Central Government Guides the Local Science and Technology Development Special Fund under Grant 2021JH6/10500129, and the Program for Liaoning Innovative Talents in University under Grant LR2020047.

Author information

Corresponding author

Correspondence to Fei Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, F., Du, Y., Wang, G. et al. (2+1)D-SLR: an efficient network for video sign language recognition. Neural Comput & Applic 34, 2413–2423 (2022). https://doi.org/10.1007/s00521-021-06467-9

  • DOI: https://doi.org/10.1007/s00521-021-06467-9
