
Research Article

Improving single‐stage activity recognition of excavators using knowledge distillation of temporal gradient data

Published: 29 January 2024

Abstract

Single‐stage activity recognition methods have been gaining popularity within the construction domain. However, their low per‐frame accuracy necessitates additional post‐processing to link the per‐frame detections, limiting their real‐time monitoring capability, which is an indispensable component of emerging construction digital twins. This study proposes knowledge DIstillation of temporal Gradient data for construction Entity activity Recognition (DIGER), built upon the you only watch once (YOWO) method, to improve its activity recognition and localization performance. Activity recognition is improved by designing an auxiliary backbone that exploits the complementary information in temporal gradient data, transferred into YOWO through knowledge distillation, while localization is improved primarily through integration of the complete intersection over union (CIoU) loss. On a large custom dataset, DIGER achieved a per‐frame activity recognition accuracy of 93.6% and a localization mean average precision at 50% IoU (mAP@50) of 79.8%, outperforming state‐of‐the‐art methods without requiring additional computation during inference and making it highly effective for real‐time monitoring of construction site activities.
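As a rough illustration of the two mechanisms the abstract names, the sketch below computes temporal gradient (TG) data as simple frame differences and applies a feature‐level distillation loss that pulls the RGB stream's features toward those of the auxiliary TG backbone. This is a minimal sketch under assumed tensor shapes and an assumed MSE distillation objective, not the paper's implementation; the function names are hypothetical.

```python
# Minimal sketch (hypothetical, not DIGER's actual code): temporal gradient
# data as frame differences, plus a feature-level knowledge-distillation
# loss from a TG (teacher) stream into an RGB (student) stream.
import torch
import torch.nn.functional as F

def temporal_gradient(clip: torch.Tensor) -> torch.Tensor:
    """clip: (B, C, T, H, W) RGB clip -> (B, C, T-1, H, W) frame differences."""
    return clip[:, :, 1:] - clip[:, :, :-1]

def distillation_loss(student_feat: torch.Tensor,
                      teacher_feat: torch.Tensor) -> torch.Tensor:
    """MSE between RGB-stream and detached TG-stream features, so only the
    student (RGB) backbone receives gradients."""
    return F.mse_loss(student_feat, teacher_feat.detach())

# Usage: a batch of two 8-frame 224x224 clips (shapes are illustrative).
clip = torch.randn(2, 3, 8, 224, 224)
tg = temporal_gradient(clip)  # (2, 3, 7, 224, 224)
# Stand-in backbone features with matching (assumed) dimensions:
student_feat, teacher_feat = torch.randn(2, 256), torch.randn(2, 256)
loss = distillation_loss(student_feat, teacher_feat)
```

Because the distillation term only enters the training objective, the auxiliary TG backbone can be discarded after training, which is consistent with the abstract's claim of no additional computation during inference.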




Published In

Computer-Aided Civil and Infrastructure Engineering, Volume 39, Issue 13, 1 July 2024, 163 pages
EISSN: 1467-8667
DOI: 10.1111/mice.v39.13
This is an open access article under the terms of the Creative Commons Attribution‐NonCommercial‐NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non‐commercial and no modifications or adaptations are made.

Publisher

John Wiley & Sons, Inc., United States

