A teacher–student deep learning strategy for extreme low resolution unsafe action recognition in construction projects

Published: 02 July 2024

Abstract

A large proportion of construction accidents are caused by workers’ unsafe actions. Because of the complexity of the work environment and the heavy demands of safety supervision on construction sites, modern information technologies, notably computer vision (CV), have been increasingly applied and have gradually replaced traditional manual supervision by automatically identifying unsafe behaviors in surveillance video. Current models focus on high resolution (HR) video containing high-quality features. However, challenges remain in more specific real-world construction scenarios, including but not limited to a lack of high-quality data, far-field video recognition, privacy protection, and limited computing resources. To address these challenges, this paper proposes a simple but effective method based on a Teacher–Student Learning Architecture, in which the teacher network helps the student network capture, from extreme low resolution (eLR) video, features that are useful for downstream action recognition and can easily be transferred to other tasks. In addition, a Knowledge Learning Model and a similarity loss function are proposed to better guide the training of the student network. In numerical experiments and case studies, the proposed training strategy achieves top-1 accuracies of 28.53% and 71.81% on the eLR HMDB and eLR CMA datasets, respectively, 5.96% and 3.89% higher than the baseline model. The proposed strategy can make automatic safety monitoring of construction work more accurate and reliable, thereby effectively reducing accident rates and management costs. Moreover, because its performance rests on efficient knowledge distillation, it may inspire future applications of CV-based deep learning models in construction management that achieve higher recognition accuracy while maintaining low computational cost.
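
To make the training pattern described above concrete, the following is a minimal PyTorch sketch of a generic teacher–student distillation step for this setting. It is an illustrative assumption rather than the paper’s implementation: the Knowledge Learning Model and the exact similarity loss are not reproduced here, so the cosine feature-similarity term, the temperature tau, the loss weights alpha and beta, and the convention that each network returns a (feature, logits) pair are all hypothetical choices.

import torch
import torch.nn.functional as F

def distillation_step(teacher, student, hr_clip, elr_clip, labels,
                      optimizer, tau=4.0, alpha=0.5, beta=0.5):
    # One training step: a frozen HR teacher guides the eLR student.
    teacher.eval()
    with torch.no_grad():
        t_feat, t_logits = teacher(hr_clip)      # features and class scores from the HR clip

    s_feat, s_logits = student(elr_clip)         # student sees only the extreme-low-resolution clip

    # Hard-label classification loss on the student's predictions.
    ce = F.cross_entropy(s_logits, labels)

    # Soft-label distillation loss (Hinton et al.) with temperature scaling.
    kd = F.kl_div(F.log_softmax(s_logits / tau, dim=1),
                  F.softmax(t_logits / tau, dim=1),
                  reduction="batchmean") * tau * tau

    # Similarity loss: pull student features toward the teacher's features.
    sim = 1.0 - F.cosine_similarity(s_feat, t_feat, dim=1).mean()

    loss = ce + alpha * kd + beta * sim
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Here hr_clip and elr_clip would be paired high- and extreme-low-resolution versions of the same surveillance clip; after training, only the lightweight student is deployed, which is what keeps inference affordable on site hardware.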



Published In

Advanced Engineering Informatics, Volume 59, Issue C
Jan 2024
1632 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. Construction safety
  2. Low resolution
  3. Action recognition
  4. Knowledge distillation

Qualifiers

  • Research-article
