Automatic Evaluation Method for Functional Movement Screening Based on Multi-Scale Lightweight 3D Convolution and an Encoder–Decoder
Figure captions:
- ML3D-ED network architecture.
- Three-dimensional filter equivalently transformed into two-dimensional + one-dimensional filters.
- ML3D architecture.
- Encoder–decoder.
- Comparison of the feature extraction models for the number of parameters and computational cost.
- Three-dimensional convolution decoupling methods.
Abstract
1. Introduction
- In this paper, an ML3D module is designed as an alternative to the I3D feature extraction module. Compared with the I3D model, the number of parameters and the computational cost (floating-point operations, FLOPs) of the ML3D-ED model are reduced by 59.55% and 77.67%, respectively.
- This paper proposes an encoder–decoder (ED) network that processes the features extracted by the ML3D module and learns the subtle movement changes required for advanced movement quality evaluation; applied to functional movement screening, it improves the accuracy of the evaluation results.
- The paper employs a score prediction approach that transforms the label data processed by the ED into a score distribution. Using a Gaussian distribution, it compares the loss between the true and predicted values for each sample (a minimal sketch of this distribution-based loss follows this list). Compared with the currently most popular approach, the accuracy of this method is improved by nearly 9%.
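A minimal sketch of this distribution-based supervision, assuming a PyTorch-style setup in which an integer FMS score label is expanded into a discretized Gaussian over the possible scores and compared against the predicted distribution with KL divergence; the bin count, standard deviation, and function names are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def label_to_gaussian(score, num_bins=3, sigma=0.5):
    """Convert an integer FMS score (1..num_bins) into a discretized
    Gaussian distribution over the score bins (illustrative parameters)."""
    bins = torch.arange(1, num_bins + 1, dtype=torch.float32)
    dist = torch.exp(-0.5 * ((bins - float(score)) / sigma) ** 2)
    return dist / dist.sum()  # normalize to a probability distribution

def kl_score_loss(pred_logits, score):
    """KL divergence between the Gaussian-smoothed label distribution
    and the model's predicted score distribution."""
    target = label_to_gaussian(score, num_bins=pred_logits.shape[-1])
    log_pred = F.log_softmax(pred_logits, dim=-1)
    # F.kl_div expects log-probabilities for the input and probabilities for the target
    return F.kl_div(log_pred, target, reduction="sum")

# usage: a single sample with 3 possible scores
logits = torch.tensor([0.2, 1.5, 0.1])   # raw network outputs for scores 1..3
loss = kl_score_loss(logits, score=2)
print(loss.item())
```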
2. Relevant Theories
2.1. Functional Movement Screening
2.2. Video-Based Action Quality Evaluation
2.3. I3D Architecture
3. The Protocol Proposed in This Paper
3.1. Data Preprocessing
3.2. ML3D
3.3. Encoder–Decoder
3.4. Score Prediction
3.4.1. Gaussian Distribution of the Initial Data
3.4.2. Kullback–Leibler (KL) Divergence
4. Experiment
4.1. Data and Experimental Environment
4.2. Evaluation Metrics
- Accuracy: This represents the effectiveness of the model’s predictions. It is the ratio of the number of correctly predicted samples to the total number of samples, as shown in Formula (5).
- Macroscopic F1 (macro_F1): This is used to measure the accuracy of multiclass classification. Calculating macro_F1 first requires the per-class F1_Score, which can be derived from Formula (6); F1_Score is a standard measure for classification tasks, defined as the harmonic mean of precision and recall. macro_F1 is then calculated from the per-class F1_Score values according to Formula (7). In Formula (6), $R_i$ is the recall of the $i$-th class and $P_i$ is the precision of the $i$-th class; in Formula (7), $C$ is the number of classes.
- Kappa coefficient: This is used to measure agreement and can also serve as a measure of precision. For classification tasks, agreement is defined as the degree of consistency between the model’s predictions and the actual classifications. The Kappa coefficient is computed from the confusion matrix; its value lies between −1 and 1 and is usually greater than 0, as shown in Formula (8). In Formula (8), $p_o$ is the accuracy and $p_e$ represents chance (accidental) agreement, derived from Formula (9). In Formula (9), $a_i$ is the number of actual samples of the $i$-th class, $b_i$ is the number of predicted samples of the $i$-th class, $C$ is the total number of classes, and $n$ is the total number of samples. Standard forms of Formulas (5)–(9) are reconstructed after this list.
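The formulas referenced above did not survive extraction; the following are their assumed standard forms, consistent with the descriptions in the text (the symbols $TP_i$, $P_i$, $R_i$, $p_o$, $p_e$, $a_i$, and $b_i$ are names introduced here for clarity).

```latex
\begin{align*}
\text{Accuracy} &= \frac{\sum_{i=1}^{C} TP_i}{n} \tag{5}\\
F1\_Score_i &= \frac{2\, P_i R_i}{P_i + R_i} \tag{6}\\
macro\_F1 &= \frac{1}{C}\sum_{i=1}^{C} F1\_Score_i \tag{7}\\
\kappa &= \frac{p_o - p_e}{1 - p_e} \tag{8}\\
p_e &= \frac{1}{n^{2}}\sum_{i=1}^{C} a_i\, b_i \tag{9}
\end{align*}
```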
4.3. Experiment and Result Analysis
4.3.1. Comparative Experiment Analysis
4.3.2. Ablation Experiment Analysis
- Three-dimensional convolution decoupling methods: A 3D convolutional filter can be decoupled into a 2D convolutional filter in the spatial domain (S) and a 1D convolutional filter in the temporal domain (T). Inspired by [30], there are three combination patterns based on the interactions between the two filters. The first pattern is a cascade combination of a spatial 2D filter and a temporal 1D filter: the two filters interact directly on the same path, and only the temporal 1D filter directly affects the final output, as shown in Figure 7 (ML3D-a). The second pattern is a parallel combination, in which the two filters interact only indirectly, on separate paths through the network, as shown in Figure 7 (ML3D-b). The third pattern is a variant of the first that adds a residual connection between S and T, so that the output of S can also directly affect the output, as shown in Figure 7 (ML3D-c); a minimal sketch of the three patterns is given after this list. The effects of the different decoupling methods are shown in Table 7, and the choice of decoupling method has a significant impact on performance. The fact that ML3D-b and ML3D-c perform better than ML3D-a shows that directly connecting the output of the spatial 2D filter to the final output strengthens the model’s information flow, allowing spatial features to influence the final prediction more directly. ML3D-c improves on ML3D-b by approximately 1% in accuracy, 2% in maF1, and 3% in Kappa, confirming that the direct influence of both types of filters has a positive effect on the model’s performance.
- Downsampling methods: In the ML3D module, downsampling reduces the feature dimensions of the raw input so that they match for the residual connection. The downsampling can be implemented either as a 3D convolution with learnable parameters or as a parameter-free pooling layer that reduces dimensionality directly (both variants appear in the second sketch after this list). Table 8 shows the number of parameters and the computational cost of the two downsampling methods: compared with parameter-free pooling, 3D convolution increases the number of model parameters by 0.005 M and the computational cost by 7.398 G. Its impact on performance is shown in Table 9 (the 3D decoupling method used is ML3D-c). Performance improves considerably with 3D convolution compared to pooling, in particular the Kappa value, which increases by 5.31%. The downsampling method used in this paper is therefore 3D convolution, trading 0.005 M additional parameters and 7.398 G of additional computation for a considerable performance improvement.
- Multi-scale learning: The convolution size in mainstream I3D feature extraction is fixed, yet extensive practice shows that capturing multi-scale information benefits model performance. The ML3D model uses 2D convolutional filters at four scales for initial feature extraction (see the second sketch after this list). Table 10 shows the impact of convolutional filters of different sizes on performance. The first row shows the performance of single-scale convolutional filters. The analysis of rows 2 and 3 of the table shows that the combination of two small-sized and two large-sized filters leads to improvements of 1–3% across all metrics, indicating that moderately increasing the filter size enhances the model’s feature extraction capability and thus its overall performance. Compared with row 3, row 4 shows decreases of approximately 3% in accuracy, 4% in maF1, and 7% in Kappa, indicating that excessively large convolutional filters harm performance.
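A minimal PyTorch-style sketch of the three decoupling patterns described above (cascade, parallel, and cascade with a residual connection from the spatial filter to the output); the channel count and kernel shapes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ML3Da(nn.Module):
    """Cascade: spatial 2D filter followed by temporal 1D filter (ML3D-a)."""
    def __init__(self, c):
        super().__init__()
        self.spatial = nn.Conv3d(c, c, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(c, c, kernel_size=(3, 1, 1), padding=(1, 0, 0))
    def forward(self, x):
        return self.temporal(self.spatial(x))

class ML3Db(ML3Da):
    """Parallel: spatial and temporal filters on separate paths (ML3D-b)."""
    def forward(self, x):
        return self.spatial(x) + self.temporal(x)

class ML3Dc(ML3Da):
    """Cascade with a residual connection from S to the output (ML3D-c)."""
    def forward(self, x):
        s = self.spatial(x)
        return s + self.temporal(s)

# usage: (batch, channels, frames, height, width)
x = torch.randn(1, 16, 8, 56, 56)
print(ML3Dc(16)(x).shape)   # torch.Size([1, 16, 8, 56, 56])
```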
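A second sketch combining the remaining two ablation choices in a simplified block: multi-scale spatial filters (e.g., sizes 3/7/13/15) whose outputs are concatenated, plus a shortcut branch that matches dimensions for the residual connection, selectable as a learnable 3D convolution or a parameter-free pooling layer. Channel counts, strides, and the class name are illustrative assumptions, not the paper's exact ML3D design.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Multi-scale spatial filtering with a downsampling shortcut for the residual."""
    def __init__(self, in_c, branch_c, sizes=(3, 7, 13, 15), conv_shortcut=True):
        super().__init__()
        # one (1, k, k) spatial branch per scale; branch outputs are concatenated on channels
        self.branches = nn.ModuleList(
            nn.Conv3d(in_c, branch_c, kernel_size=(1, k, k),
                      stride=(1, 2, 2), padding=(0, k // 2, k // 2))
            for k in sizes
        )
        out_c = branch_c * len(sizes)
        if conv_shortcut:
            # learnable 3D-convolution downsampling (the option the ablation favors)
            self.shortcut = nn.Conv3d(in_c, out_c, kernel_size=1, stride=(1, 2, 2))
        else:
            # parameter-free pooling shortcut; assumes in_c == out_c so shapes still match
            self.shortcut = nn.AvgPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2))

    def forward(self, x):
        y = torch.cat([b(x) for b in self.branches], dim=1)
        return y + self.shortcut(x)

# usage: 16 input channels, 8 channels per branch -> 32 output channels
x = torch.randn(1, 16, 8, 56, 56)        # (batch, channels, frames, H, W)
print(MultiScaleBlock(16, 8)(x).shape)   # torch.Size([1, 32, 8, 28, 28])
```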
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Jiao, L.; Zhang, R.; Liu, F.; Yang, S.; Hou, B.; Li, L.; Tang, X. New Generation Deep Learning for Video Object Detection: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 3195–3215. [Google Scholar] [CrossRef]
- Pareek, P.; Thakkar, A. A Survey on Video-Based Human Action Recognition: Recent Updates, Datasets, Challenges, and Applications. Artif. Intell. Rev. 2021, 54, 2259–2322. [Google Scholar] [CrossRef]
- Baccouche, M.; Mamalet, F.; Wolf, C.; Garcia, C.; Baskurt, A. Sequential Deep Learning for Human Action Recognition. In Proceedings of the Human Behavior Understanding: Second International Workshop, HBU 2011, Amsterdam, The Netherlands, 16 November 2011; Proceedings 2011. pp. 29–39. [Google Scholar]
- Zhou, Y.; Sun, X.; Zha, Z.-J.; Zeng, W. Mict: Mixed 3d/2d Convolutional Tube for Human Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 449–458. [Google Scholar]
- Spilz, A.; Munz, M. Automatic Assessment of Functional Movement Screening Exercises with Deep Learning Architectures. Sensors 2022, 23, 5. [Google Scholar] [CrossRef] [PubMed]
- Duan, L. Empirical analysis on the reduction of sports injury by functional movement screening method under biological image data. Rev. Bras. Med. Esporte 2021, 27, 400–404. [Google Scholar] [CrossRef]
- Zhou, S.K.; Greenspan, H.; Davatzikos, C.; Duncan, J.S.; van Ginneken, B.; Madabhushi, A.; Prince, J.L.; Rueckert, D.; Summers, R.M. A review of deep learning in medical imaging: Imaging traits, technology trends, case studies with progress highlights, and future promises. Proc. IEEE 2021, 109, 820–838. [Google Scholar] [CrossRef]
- Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Proceedings of the ICML, Virtual Event, 18–24 July 2021; p. 2. [Google Scholar]
- Lin, X.; Huang, T.; Ruan, Z.; Yang, X.; Chen, Z.; Zheng, G.; Feng, C. Automatic Evaluation of Functional Movement Screening Based on Attention Mechanism and Score Distribution Prediction. Mathematics 2023, 11, 4936. [Google Scholar] [CrossRef]
- Lin, X.; Chen, R.; Feng, C.; Chen, Z.; Yang, X.; Cui, H. Automatic Evaluation Method for Functional Movement Screening Based on a Dual-Stream Network and Feature Fusion. Mathematics 2024, 12, 1162. [Google Scholar] [CrossRef]
- Wu, W.L.; Lee, M.H.; Hsu, H.T.; Ho, W.H.; Liang, J.M. Development of an automatic functional movement screening system with inertial measurement unit sensors. Appl. Sci. 2020, 11, 96. [Google Scholar] [CrossRef]
- Bochniewicz, E.M.; Emmer, G.; McLeod, A.; Barth, J.; Dromerick, A.W.; Lum, P. Measuring functional arm movement after stroke using a single wrist-worn sensor and machine learning. J. Stroke Cerebrovasc. Dis. 2017, 26, 2880–2887. [Google Scholar] [CrossRef]
- Hong, R.; Xing, Q.; Shen, Y.; Shen, Y. Effective Quantization Evaluation Method of Functional Movement Screening with Improved Gaussian Mixture Model. Appl. Sci. 2023, 13, 7487. [Google Scholar] [CrossRef]
- Bai, Y.; Zhou, D.; Zhang, S.; Wang, J.; Ding, E.; Guan, Y.; Long, Y.; Wang, J. Action quality assessment with temporal parsing transformer. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 422–438. [Google Scholar]
- Gordon, A.S. Automated video assessment of human performance. In Proceedings of the AI-ED, Washington, DC, USA, 16–19 August 1995; Volume 2. [Google Scholar]
- Li, Y.; Chai, X.; Chen, X. Scoringnet: Learning key fragment for action quality assessment with ranking loss in skilled sports. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Springer International Publishing: Cham, Switzerland, 2018; pp. 149–164. [Google Scholar]
- Tao, L.; Elhamifar, E.; Khudanpur, S.; Vidal, G.D.; Vidal, R. Sparse hidden markov models for surgical gesture classification and skill evaluation. In Information Processing in Computer-Assisted Interventions: Third International Conference, IPCAI 2012, Pisa, Italy, 27 June 2012. Proceedings; Springer: Berlin/Heidelberg, Germany, 2012; Volume 3, pp. 167–177. [Google Scholar]
- Parmar, P.; Morris, B.T. Measuring the quality of exercises. In Proceedings of the 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Orlando, FL, USA, 16–20 August 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 2241–2244. [Google Scholar]
- Xu, C.; Fu, Y.; Zhang, B.; Chen, Z.; Jiang, Y.; Xue, X. Learning to score figure skating sport videos. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 4578–4590. [Google Scholar] [CrossRef]
- Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Sharma, V.; Gupta, M.; Pandey, A.; Mishra, D.; Kumar, A. A review of deep learning-based human activity recognition on benchmark video datasets. Appl. Artif. Intell. 2022, 36, 2093705. [Google Scholar] [CrossRef]
- Hu, K.; Jin, J.; Zheng, F.; Weng, L.; Ding, Y. Overview of behavior recognition based on deep learning. Artif. Intell. Rev. 2023, 56, 1833–1865. [Google Scholar] [CrossRef]
- Hara, K.; Kataoka, H.; Satoh, Y. Can spatiotemporal 3D cnns retrace the history of 2D cnns and imagenet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 23 June 2018. [Google Scholar]
- Wang, X.; Miao, Z.; Zhang, R.; Hao, S. I3d-lstm: A new model for human action recognition. IOP Conf. Ser. Mater. Sci. Eng. 2019, 569, 032035. [Google Scholar] [CrossRef]
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Xing, Q.-J.; Shen, Y.Y.; Cao, R.; Zong, S.X.; Zhao, S.X.; Shen, Y.F. Functional movement screen dataset collected with two azure kinect depth sensors. Sci. Data 2022, 9, 104. [Google Scholar] [CrossRef]
- Parmar, P.; Tran Morris, B. Learning to score olympic events. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 20–28. [Google Scholar]
- Tang, Y.; Ni, Z.; Zhou, J.; Zhang, D.; Lu, J.; Wu, Y.; Zhou, J. Uncertainty-aware score distribution learning for action quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9839–9848. [Google Scholar]
- Qiu, Z.; Yao, T.; Mei, T. Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
Training Set | Test Set | |||||
---|---|---|---|---|---|---|
ID | 1 | 2 | 3 | 1 | 2 | 3 |
M01 | 13 | 69 | 17 | 4 | 23 | 5 |
M03 | 28 | 54 | 18 | 9 | 18 | 8 |
M05 | 8 | 75 | 17 | 2 | 25 | 8 |
M07 | 18 | 9 | 5 | 6 | 3 | 2 |
M09 | 9 | 54 | 39 | 3 | 18 | 12 |
M11 | 7 | 88 | 9 | 3 | 18 | 12 |
M12 | 3 | 77 | 8 | 2 | 26 | 3 |
M14 | 6 | 88 | 1 | 2 | 28 | 1 |
Model | Accuracy/% | maF1/% | Kappa/% |
---|---|---|---|
Improved GMM [13] | 80.00 | 77.00 | 67.00 |
C3D-LSTM [28] | 74.44 | 74.35 | 61.66 |
I3D-LSTM | 71.11 | 70.90 | 56.66 |
I3D-MLP [29] | 84.44 | 84.53 | 76.66 |
Ours | 93.33 | 89.82 | 85.00 |
Range of Kappa Values | Meaning |
---|---|
0.00~0.20 | Very low agreement (slight) |
0.21~0.40 | General agreement (fair) |
0.41~0.60 | Intermediate agreement (moderate) |
0.61~0.80 | High agreement (substantial) |
0.81~1.00 | Nearly complete agreement (almost perfect) |
Model | Params | FLOPs |
---|---|---|
I3D | 12.287 M | 223.013 G |
ML3D | 4.977 M | 49.800 G |
Feature Extraction | Model | Accuracy/% | maF1/% | Kappa/%
---|---|---|---|---|
I3D | MLP | 90.00 | 83.86 | 77.83 |
ML3D | MLP | 90.83 | 87.16 | 79.71 |
I3D | ED | 90.83 | 85.85 | 79.13 |
ML3D | ED | 93.33 | 89.82 | 85.00 |
Model | Params | FLOPs |
---|---|---|
ED | 5.410 M | 5.538 k |
MLP | 55.092 M | 689.540 k |
Decoupling Method | Accuracy/% | maF1/% | Kappa/%
---|---|---|---|
ML3D-a | 90.41 | 85.65 | 78.93 |
ML3D-b | 92.08 | 87.76 | 82.18 |
ML3D-c | 93.33 | 89.82 | 85.00 |
Downsampling Method | Params | FLOPs
---|---|---|
3D convolution | 4.977 M | 49.800 G |
pooling | 4.972 M | 42.402 G |
Downsampling Method | Accuracy/% | maF1/% | Kappa/%
---|---|---|---|
3D convolution | 93.33 | 89.82 | 85.00 |
pooling | 91.25 | 85.15 | 79.69 |
Filter Size | Accuracy/% | maF1/% | Kappa/%
---|---|---|---|
7,7,7,7 | 90.83 | 86.09 | 79.75 |
3,7,9,11 | 92.08 | 88.46 | 82.42 |
3,7,13,15 | 93.33 | 89.82 | 85.00 |
3,7,13,17 | 90.42 | 85.22 | 77.91 |
Share and Cite
Lin, X.; Liu, Y.; Feng, C.; Chen, Z.; Yang, X.; Cui, H. Automatic Evaluation Method for Functional Movement Screening Based on Multi-Scale Lightweight 3D Convolution and an Encoder–Decoder. Electronics 2024, 13, 1813. https://doi.org/10.3390/electronics13101813