Abstract
The analysis of customer pose is drawing increasing attention from retailers and researchers, because it can reveal customer habits and the level of customer interest in merchandise. In a retail store environment, customers' poses are closely related to their body orientations. For example, a customer picking an item from a merchandise shelf must face the shelf, whereas a customer whose body orientation is parallel to the shelf is probably just walking past. Based on this observation, we propose a customer pose estimation system that applies an orientational spatio-temporal deep neural network to surveillance camera footage. The system first generates initial joint heatmaps using a fully convolutional network. We then propose a set of novel orientational message-passing layers that refine these heatmaps by introducing body orientation information into conventional message-passing layers. In addition, we apply a bi-directional recurrent neural network on top of the system to improve estimation accuracy by considering both the forward and backward image sequences. The system thus jointly considers global body orientation, local joint connections, and temporal pose continuity. Finally, we conduct a series of comparison experiments to demonstrate the effectiveness of our system.
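As a rough illustration of the orientational message-passing idea summarized above, the following minimal sketch refines one joint heatmap using a neighbouring joint's heatmap and a body orientation label. It is not the network described in this article: the learned message-passing kernels are replaced by a fixed translation, and the names `OFFSETS`, `pass_message`, the orientation labels, and the 4x4 grid are all hypothetical choices made for this example.

```python
# Toy sketch of orientation-conditioned message passing between two joints.
# Assumption: each orientation implies a fixed (dy, dx) offset from parent to
# child joint. A real system would learn convolution kernels instead.

OFFSETS = {
    "facing_shelf": (1, 0),       # hypothetical: elbow directly below shoulder
    "parallel_to_shelf": (1, 1),  # hypothetical: elbow below and to the right
}

def shift(heatmap, dy, dx):
    """Translate a 2D heatmap by (dy, dx), filling vacated cells with 0."""
    h, w = len(heatmap), len(heatmap[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = heatmap[y][x]
    return out

def normalize(heatmap):
    """Scale a heatmap so that its entries sum to 1 (unchanged if all zero)."""
    total = sum(sum(row) for row in heatmap)
    if total == 0:
        return [row[:] for row in heatmap]
    return [[v / total for v in row] for row in heatmap]

def pass_message(parent, child, orientation, eps=1e-3):
    """Refine the child heatmap with a message sent from the parent joint."""
    dy, dx = OFFSETS[orientation]
    message = shift(parent, dy, dx)
    refined = [
        [c * (m + eps) for c, m in zip(child_row, msg_row)]
        for child_row, msg_row in zip(child, message)
    ]
    return normalize(refined)

def argmax(heatmap):
    """Return the (row, column) of the largest heatmap value."""
    best = max(
        (v, y, x) for y, row in enumerate(heatmap) for x, v in enumerate(row)
    )
    return best[1], best[2]

# Shoulder is confidently located at (1, 1); the elbow has two equal candidates.
shoulder = [[0.0] * 4 for _ in range(4)]
shoulder[1][1] = 1.0
elbow = [[0.0] * 4 for _ in range(4)]
elbow[2][1] = 0.5
elbow[2][2] = 0.5

print(argmax(pass_message(shoulder, elbow, "facing_shelf")))       # (2, 1)
print(argmax(pass_message(shoulder, elbow, "parallel_to_shelf")))  # (2, 2)
```

In this toy setting the same ambiguous elbow heatmap resolves to different locations depending on the assumed body orientation, which is the intuition behind conditioning message passing on orientation.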
Acknowledgements
The authors thank Haitao Wang, Yongjie Liu, and Qianlong Wang for their help with labeling data. The faces of customers are blurred in this paper to protect their privacy. This research was permitted by the Compliance Committee of the University of Tokyo.
Additional information
Communicated by M. Cooper.
About this article
Cite this article
Liu, J., Gu, Y. & Kamijo, S. Customer pose estimation using orientational spatio-temporal network from surveillance camera. Multimedia Systems 24, 439–457 (2018). https://doi.org/10.1007/s00530-017-0570-9
DOI: https://doi.org/10.1007/s00530-017-0570-9