Abstract
In recent years, human activity recognition (HAR) methods have developed rapidly. However, most existing methods rely on a single input modality and suffer from accuracy and robustness issues. In this paper, we present a novel multi-modal HAR architecture that fuses signals from RGB visual data and inertial measurement unit (IMU) data. For the RGB modality, a speed-weighted star RGB representation is proposed to aggregate temporal information, and a convolutional network is employed to extract features. For the IMU modality, the fast Fourier transform and a multi-layer perceptron are employed to extract dynamical features. For feature fusion, a global soft attention layer is designed to adjust the weights of the concatenated features, and L-softmax with soft voting is adopted to classify activities. The proposed method is evaluated on the UP-Fall dataset, achieving F1-scores of 0.92 on the 11-class classification task and 1.00 on the fall/non-fall binary classification task.
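To make the described architecture concrete, the following is a minimal PyTorch sketch of the two-branch fusion scheme. All dimensions, the ResNet-18 backbone, the IMU window length, and the layer sizes are illustrative assumptions rather than the authors' implementation; the speed-weighted star RGB image is assumed to be precomputed and passed in as an ordinary image tensor.

```python
# Minimal sketch of the fusion architecture described in the abstract.
# Backbone, window length, and layer sizes are illustrative assumptions,
# not values taken from the paper.
import torch
import torch.nn as nn
import torchvision.models as models


class MultiModalHAR(nn.Module):
    def __init__(self, num_classes=11, imu_channels=3, win_len=50):
        super().__init__()
        # RGB branch: CNN features from the speed-weighted star RGB image.
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()  # expose the 512-d feature vector
        self.rgb_net = backbone
        # IMU branch: MLP on the FFT magnitude spectrum of each channel.
        fft_bins = win_len // 2 + 1
        self.imu_net = nn.Sequential(
            nn.Linear(imu_channels * fft_bins, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        # Global soft attention: re-weights the concatenated feature vector.
        fused_dim = 512 + 128
        self.attention = nn.Sequential(
            nn.Linear(fused_dim, fused_dim), nn.Softmax(dim=-1),
        )
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, star_rgb, imu_window):
        # star_rgb: (B, 3, H, W); imu_window: (B, imu_channels, win_len)
        rgb_feat = self.rgb_net(star_rgb)                    # (B, 512)
        spectrum = torch.fft.rfft(imu_window, dim=-1).abs()  # (B, C, fft_bins)
        imu_feat = self.imu_net(spectrum.flatten(1))         # (B, 128)
        fused = torch.cat([rgb_feat, imu_feat], dim=-1)      # (B, 640)
        fused = fused * self.attention(fused)                # soft attention
        return self.classifier(fused)                        # class logits


model = MultiModalHAR()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 50))
print(logits.shape)  # torch.Size([2, 11])
```

In the full pipeline, training would use the large-margin (L-)softmax loss rather than plain cross-entropy, and per-window class probabilities would be averaged across windows (soft voting) to produce the final activity label; both steps are omitted from this sketch for brevity.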
Data availability
No data was used for the research described in the article.
Funding
This work was supported by the National Key Research and Development Plan under Grant 2020YFB2104400.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, C., Zu, T., Hou, Y. et al. Human activity recognition based on multi-modal fusion. CCF Trans. Pervasive Comp. Interact. 5, 321–332 (2023). https://doi.org/10.1007/s42486-023-00132-x