FT-HID: a large-scale RGB-D dataset for first- and third-person human interaction analysis

Original Article · Neural Computing and Applications

Abstract

Analysis of human interaction is an important topic within human motion analysis. It has been studied using either first-person vision (FPV) or third-person vision (TPV); however, joint learning over both types of vision has so far attracted little attention. One reason is the lack of suitable datasets that cover both FPV and TPV. In addition, existing benchmark datasets for either FPV or TPV have several limitations, including limited numbers of samples, participating subjects, interaction categories, and modalities. In this work, we contribute a large-scale human interaction dataset, named FT-HID. FT-HID contains pair-aligned samples of first-person and third-person vision. The dataset was collected from 109 distinct subjects and contains more than 90K samples across three modalities. The dataset has been validated using several existing action recognition methods. In addition, we introduce a novel multi-view interaction mechanism for skeleton sequences and a joint-learning multi-stream framework for first-person and third-person vision. Both methods yield promising results on the FT-HID dataset. We expect that this vision-aligned large-scale dataset will promote the development of both FPV and TPV methods, and of techniques for their joint learning, in human action analysis.
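The joint-learning framework itself is specified in the full paper, not in this abstract. Purely as an illustration of the core idea, pair-aligned FPV and TPV clips processed by separate streams whose features are fused for a single prediction, the following is a minimal conceptual sketch in PyTorch. The encoder depth, the concatenation fusion, the clip shape, and the class count are all illustrative assumptions, not the authors' architecture.

```python
# Conceptual sketch of joint FPV/TPV learning on pair-aligned clips.
# NOT the paper's architecture: layer sizes, fusion by concatenation,
# and num_classes are hypothetical placeholders.
import torch
import torch.nn as nn


class StreamEncoder(nn.Module):
    """Small 3D-CNN encoder mapping one video clip to a feature vector."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),  # global pool over (T, H, W)
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 3, frames, height, width)
        return self.fc(self.conv(clip).flatten(1))


class JointFPVTPVNet(nn.Module):
    """One encoder per vision type; features fused by concatenation."""

    def __init__(self, num_classes: int = 30, feat_dim: int = 256):
        super().__init__()
        self.fpv_encoder = StreamEncoder(feat_dim)
        self.tpv_encoder = StreamEncoder(feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, fpv_clip: torch.Tensor, tpv_clip: torch.Tensor) -> torch.Tensor:
        f = self.fpv_encoder(fpv_clip)  # first-person stream
        t = self.tpv_encoder(tpv_clip)  # pair-aligned third-person stream
        return self.classifier(torch.cat([f, t], dim=1))


if __name__ == "__main__":
    model = JointFPVTPVNet()
    fpv = torch.randn(2, 3, 16, 112, 112)  # dummy FPV clips
    tpv = torch.randn(2, 3, 16, 112, 112)  # dummy pair-aligned TPV clips
    print(model(fpv, tpv).shape)  # torch.Size([2, 30])
```

In a full multi-stream setting, each vision type would typically contribute several modality streams (for example RGB, depth, and skeleton), with the fusion strategy chosen empirically on the dataset.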



Data availability

The FT-HID dataset that supports the findings of this study is available on Baidu Cloud at https://pan.baidu.com/s/1RHMRF-O8VLljLo5j9DxoRA?pwd=wr6u. All other data are available from the authors upon reasonable request.


Acknowledgements

The work of Zhimin Gao was supported in part by the National Natural Science Foundation of China under Grant No. 61906173.

Author information

Corresponding author

Correspondence to Zhimin Gao.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Guo, Z., Hou, Y., Wang, P. et al. FT-HID: a large-scale RGB-D dataset for first- and third-person human interaction analysis. Neural Comput & Applic 35, 2007–2024 (2023). https://doi.org/10.1007/s00521-022-07826-w

