Abstract
This paper addresses the problem of recognizing person-person interaction using multi-view data captured by depth cameras. Due to the complex spatio-temporal structure of interactions between two persons, it is difficult to characterize different classes of person-person interactions for recognition. To handle this difficulty, we divide each person-person interaction into body part interactions and analyze the person-person interaction using the pairwise features of these body part interactions. We first employ two features that represent the relative movement and local physical contact between the body parts of the two people, and extract pairwise features to characterize the corresponding body part interaction. For each camera view, we propose a regression-based learning approach with a sparsity-inducing regularizer that models each person-person interaction as a combination of pairwise features over a sparse set of body part interactions. To take full advantage of the information in all depth camera views, we further extend the proposed interaction learning model to combine features from multiple views in order to improve recognition performance. Our approach is evaluated on three public activity recognition datasets captured with depth cameras. Experimental results on all three datasets demonstrate the efficacy of the proposed method.
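For intuition, the per-view model can be pictured as a group-sparse regression over concatenated pairwise features. The sketch below is illustrative only: it assumes the pairwise features of all body part pairs are stacked into one vector per sample and uses a generic ℓ2,1-regularized least-squares objective solved by proximal gradient descent; the function names, group layout, and objective are stand-ins for the paper's actual formulation and its multi-view extension, which are not reproduced here.

```python
import numpy as np

def group_soft_threshold(w, groups, tau):
    # Proximal operator of the l2,1 (group-lasso) penalty: each body-part-pair
    # feature group is shrunk toward zero, and groups whose l2 norm falls
    # below tau are zeroed entirely, yielding a sparse set of interactions.
    out = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        out[g] = 0.0 if norm <= tau else (1.0 - tau / norm) * w[g]
    return out

def fit_group_sparse(X, y, groups, lam=0.1, n_iter=500):
    # Least-squares regression with an l2,1 regularizer, solved by
    # proximal gradient descent (ISTA). X holds one concatenated pairwise
    # feature vector per sample; y holds per-class regression targets.
    n, d = X.shape
    w = np.zeros(d)
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = group_soft_threshold(w - step * grad, groups, step * lam)
    return w

# Hypothetical layout: 15 x 15 body part pairs, 8-dim pairwise feature each.
groups = [np.arange(i * 8, (i + 1) * 8) for i in range(15 * 15)]
```

Groups whose weights are driven to zero correspond to body part interactions that the model deems uninformative for the given interaction class.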
Notes
When \(\|\mathbf{w}_{m}^{\phi_{s}}\|_{2} = 0\), the objective function in (6) is not differentiable. In this case, we can regularize the \(\phi_{s}\)-th diagonal block of \(\mathbf{D}_{m}\) as \(\frac{1}{2\sqrt{\|\mathbf{w}_{m}^{\phi_{s}}\|_{2}^{2}+\eta}}\,\mathbf{I}_{\phi_{s}}\), where \(\eta \to 0\).
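As a concrete illustration of this smoothing, the snippet below builds the regularized diagonal blocks of such a reweighting matrix. It is a minimal sketch: the names (reweight_blocks, groups) and the vectorized layout are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def reweight_blocks(w_m, groups, eta=1e-8):
    # Diagonal of D_m for iteratively reweighted minimization of an l2,1
    # regularizer: the phi_s-th block is I / (2 * sqrt(||w_m^{phi_s}||_2^2 + eta)).
    # The small eta keeps the weight finite when a group norm is exactly zero;
    # letting eta -> 0 recovers the original non-differentiable objective.
    d = np.empty_like(w_m)
    for g in groups:
        d[g] = 1.0 / (2.0 * np.sqrt(np.dot(w_m[g], w_m[g]) + eta))
    return np.diag(d)
```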
About this article
Cite this article
Li, M., Leung, H. Multi-view depth-based pairwise feature learning for person-person interaction recognition. Multimed Tools Appl 78, 5731–5749 (2019). https://doi.org/10.1007/s11042-018-5738-6