Simple Conditional Spatial Query Mask Deformable Detection Transformer: A Detection Approach for Multi-Style Strokes of Chinese Characters
Figure 1. (a) The simple process of the deformable attention used in existing methods. (b) The simple process of the proposed deformable attention with a mask mechanism. The proposed mask mechanism determines whether a sampling point contributes to the current reference point by predicting a mask layer and discards the sampling points that do not contribute. The gray block represents the query vector, the yellow block represents the mask vector corresponding to the query vector, and the green block represents the query vector after filtering and resampling. The red box represents the current reference point, and the green boxes represent the random sampling points.
Figure 2. The proposed deformable attention module based on the mask mechanism. The mask mechanism generates a mask prediction layer that predicts which random sampling points do not contribute to the current reference point and discards those points. The random sampling process adds random integer offsets to the current reference point to obtain the exact positions of the sampling points. Blocks of different colors represent different vectors, blocks of different sizes represent feature maps of different sizes, and the dotted lines represent the correspondences between the blocks.
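To make the mask mechanism concrete, the following is a minimal PyTorch sketch of single-scale, single-head deformable attention with a per-sampling-point mask. It is an illustration, not the authors' implementation: all names (`MaskDeformableAttention`, `mask_head`, `num_points`) are assumptions, and offsets are predicted from the query as in standard deformable attention, whereas the paper describes integer offsets around the reference point.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskDeformableAttention(nn.Module):
    """Single-scale, single-head deformable attention with a mask head that
    discards sampling points predicted not to contribute (illustrative)."""

    def __init__(self, d_model: int = 256, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offsets = nn.Linear(d_model, num_points * 2)  # (dx, dy) per point
        self.weights = nn.Linear(d_model, num_points)      # attention weights
        self.mask_head = nn.Linear(d_model, num_points)    # keep/discard logits
        self.value_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, query, ref_points, value, spatial_shape):
        # query: (B, Nq, C); ref_points: (B, Nq, 2) in [0, 1]; value: (B, H*W, C)
        B, Nq, C = query.shape
        H, W = spatial_shape
        offsets = self.offsets(query).view(B, Nq, self.num_points, 2)
        attn = self.weights(query).softmax(-1)                  # (B, Nq, P)
        # Hard 0.5 gate for clarity; end-to-end training would need a soft
        # or straight-through variant of this threshold.
        keep = (self.mask_head(query).sigmoid() > 0.5).float()  # (B, Nq, P)
        attn = attn * keep                                      # drop masked points
        attn = attn / attn.sum(-1, keepdim=True).clamp(min=1e-6)
        # Bilinearly sample value features at reference + offset locations.
        v = self.value_proj(value).transpose(1, 2).reshape(B, C, H, W)
        grid = 2.0 * (ref_points[:, :, None, :] + offsets) - 1.0
        sampled = F.grid_sample(v, grid, align_corners=False)   # (B, C, Nq, P)
        return self.out_proj((sampled * attn[:, None]).sum(-1).transpose(1, 2))
```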
Figure 3. The resampling strategy for sampling points in the mask-based deformable attention module. For a sampling point that does not contribute to the current reference point, the offset in that direction is decreased, replacing the invalid sampling point with another one in the same direction that is closer to the reference point. Arrows of different colors represent different sampling points, and the coordinates where the arrows are located represent the offsets from the current reference point. Dashed arrows represent the original sampling points, solid arrows in the same direction represent the reassigned sampling offsets, and the dot represents discarding the current offset.
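The resampling rule itself can be sketched in a few lines, again as an assumption-laden illustration: a discarded point keeps its direction but its offset is shrunk by a factor (`shrink`, a hypothetical hyperparameter) so the point is resampled closer to the reference point rather than simply dropped.

```python
import torch

def resample_offsets(offsets: torch.Tensor, keep: torch.Tensor,
                     shrink: float = 0.5) -> torch.Tensor:
    """offsets: (..., P, 2) sampling offsets relative to the reference point.
    keep: (..., P), 1.0 for contributing points, 0.0 for discarded ones.
    A discarded offset keeps its direction but is scaled toward the
    reference point, yielding a closer replacement sampling point."""
    scale = keep[..., None] + (1.0 - keep[..., None]) * shrink
    return offsets * scale
```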
Figure 4. The structure of the proposed SFN module. The corresponding weight matrices are obtained from the vectors after linear and sigmoid computations. The weight matrices of the classification features are used as the weight coefficients of the regression task, and the weight matrices of the regression features are used as the weight coefficients of the classification task.
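A minimal sketch of this cross-weighting, assuming simple linear-plus-sigmoid gates over per-query feature vectors (the layer names and shapes are assumptions): weights derived from the classification features modulate the regression features, and vice versa.

```python
import torch
import torch.nn as nn

class SFN(nn.Module):
    """Cross-task feature weighting: gates from one task's features
    re-weight the other task's features (layer names are assumptions)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.cls_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.reg_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, cls_feat: torch.Tensor, reg_feat: torch.Tensor):
        w_cls = self.cls_gate(cls_feat)  # classification weights -> regression
        w_reg = self.reg_gate(reg_feat)  # regression weights -> classification
        return cls_feat * w_reg, reg_feat * w_cls
```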
Figure 5. The overall network structure of the proposed SCSQ-MDD (simple conditional spatial query mask deformable DETR). The proposed mask deformable attention mechanism is incorporated into the improved multiscale deformable self-attention module. The purpose of the SCSQ is to obtain a spatial query vector, which is spliced and fused with the content query vector and used for the cross-attention computations.
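The splice-and-fuse step might look as follows. This is a sketch assuming a conditional-DETR-style sinusoidal embedding of the reference point for the spatial query; `sine_embed`, `SCSQFusion`, and the projection layers are all illustrative names rather than the paper's definitions.

```python
import torch
import torch.nn as nn

def sine_embed(ref_points: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Sinusoidal embedding of (x, y) reference points -> (B, Nq, dim)."""
    half = dim // 2
    freqs = 10000.0 ** (torch.arange(half // 2, dtype=torch.float32) * 2 / half)
    pos = ref_points[..., None] / freqs              # (B, Nq, 2, half//2)
    emb = torch.cat([pos.sin(), pos.cos()], dim=-1)  # (B, Nq, 2, half)
    return emb.flatten(-2)                           # (B, Nq, dim)

class SCSQFusion(nn.Module):
    """Splice a spatial query (from the reference point) onto the content
    query and fuse back to the model dimension for cross-attention."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.spatial_proj = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, content_q: torch.Tensor, ref_points: torch.Tensor):
        spatial_q = self.spatial_proj(sine_embed(ref_points, content_q.size(-1)))
        return self.fuse(torch.cat([content_q, spatial_q], dim=-1))
```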
Figure 6. The overall flow of the Chinese character stroke detection network. Each Chinese character image consists of one style of stroke, and images in a total of 5 stroke styles are used as input. The backbone extracts the basic features of the image. The proposed SCSQ-MDD generates a high-level representation containing the locations and categories of the strokes of the Chinese characters, which is ultimately used by the head detector.
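A hypothetical end-to-end wiring of this flow (glue code only; `backbone` and `scsq_mdd` stand in for the modules sketched above, and the 52-way classifier matches the stroke categories tabulated later in this page):

```python
import torch
import torch.nn as nn

class StrokeDetector(nn.Module):
    """Hypothetical wiring: backbone features -> SCSQ-MDD transformer ->
    per-query stroke class and box predictions."""

    def __init__(self, backbone: nn.Module, scsq_mdd: nn.Module,
                 dim: int = 256, num_classes: int = 52):
        super().__init__()
        self.backbone = backbone
        self.scsq_mdd = scsq_mdd
        self.cls_head = nn.Linear(dim, num_classes)  # 52 stroke categories
        self.box_head = nn.Linear(dim, 4)            # normalized (cx, cy, w, h)

    def forward(self, images: torch.Tensor):
        feats = self.backbone(images)        # basic image features
        hs = self.scsq_mdd(feats)            # (B, Nq, dim) stroke representations
        return self.cls_head(hs), self.box_head(hs).sigmoid()
```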
Figure 7. The process of applying the SCSQ-MDD method to robotic arms. A Chinese character consists of multiple strokes, and the position and category information of all the strokes are obtained by stroke detection.
Figure 8. Trends in the loss function for the deformable DETR and the SCSQ-MDD during network training. The training loss of the proposed SCSQ-MDD decreased faster than that of the deformable DETR.
Figure 9. Trends in the accuracies of the deformable DETR and the SCSQ-MDD during the training/validation stage. The proposed SCSQ-MDD performed significantly better than the deformable DETR in terms of detection accuracy.
Abstract
1. Introduction
2. Related Works
2.1. Deformable DETR and Multiscale Deformable Attention Mechanism
2.2. Conditional DETR and Conditional Spatial Query Module
3. Methods
3.1. Mask Deformable Attention in the Multiscale Deformable Attention Module
3.2. The Simple Conditional Spatial Query Strategy
3.3. SFN Structure and Cross-Fused Module
3.4. SCSQ-MDD Pipeline
3.5. Chinese Stroke Detection Method Based on the SCSQ-MDD
3.6. Application of the SCSQ-MDD to Robotic Arms
4. Experiments and Results
4.1. Implementation Details
4.2. Comparison between the SCSQ-MDD and the Deformable DETR
4.3. Comparison between the SCSQ Mask Deformable DETR and Mainstream Detection Methods
4.4. Ablation Study on Improved Mask Deformation Attention
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Ma, C.H.; Lu, C.L.; Shih, H.C. Vision-Based Jigsaw Puzzle Solving with a Robotic Arm. Sensors 2023, 23, 6913. [Google Scholar] [CrossRef]
- Xia, X.; Li, T.; Sang, S.; Cheng, Y.; Ma, H.; Zhang, Q.; Yang, K. Path Planning for Obstacle Avoidance of Robot Arm Based on Improved Potential Field Method. Sensors 2023, 23, 3754. [Google Scholar] [CrossRef]
- Zhang, Z.; Wang, Z.; Zhou, Z.; Li, H.; Zhang, Q.; Zhou, Y.; Li, X.; Liu, W. Omnidirectional Continuous Movement Method of Dual-Arm Robot in a Space Station. Sensors 2023, 23, 5025. [Google Scholar] [CrossRef]
- Chao, F.; Huang, Y.; Lin, C.M.; Yang, L.; Hu, H.; Zhou, C. Use of Automatic Chinese Character Decomposition and Human Gestures for Chinese Calligraphy Robots. IEEE Trans. Hum.-Mach. Syst. 2019, 49, 47–58. [Google Scholar] [CrossRef]
- Wang, T.Q.; Jiang, X.; Liu, C.L. Query Pixel Guided Stroke Extraction with Model-Based Matching for Offline Handwritten Chinese Characters. Pattern Recognit. 2022, 123, 108416. [Google Scholar] [CrossRef]
- Girshick, R.B.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
- Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
- Girshick, R.B. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks. arXiv 2016, arXiv:1605.06409. [Google Scholar]
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving Into High Quality Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
- Redmon, J.; Divvala, S.K.; Girshick, R.B.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
- Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.E.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. arXiv 2015, arXiv:1512.02325. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.B.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar] [CrossRef] [PubMed]
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
- Xie, T.; Zhang, Z.; Tian, J.; Ma, L. Focal DETR: Target-Aware Token Design for Transformer-Based Object Detection. Sensors 2022, 22, 8686. [Google Scholar] [CrossRef] [PubMed]
- Li, S.; Sultonov, F.; Tursunboev, J.; Park, J.H.; Yun, S.; Kang, J.M. Ghostformer: A GhostNet-Based Two-Stage Transformer for Small Object Detection. Sensors 2022, 22, 6939. [Google Scholar] [CrossRef]
- Bello, I.; Zoph, B.; Vaswani, A.; Shlens, J.; Le, Q.V. Attention Augmented Convolutional Networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3285–3294. [Google Scholar]
- Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-Attention with Relative Position Representations. arXiv 2018, arXiv:1803.02155. [Google Scholar]
- Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; Shlens, J. Stand-Alone Self-Attention in Vision Models. arXiv 2019, arXiv:1906.05909. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. arXiv 2020, arXiv:2005.12872. [Google Scholar]
- Wu, K.; Peng, H.; Chen, M.; Fu, J.; Chao, H. Rethinking and Improving Relative Position Encoding for Vision Transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 10013–10021. [Google Scholar]
- Chen, Q.; Chen, X.; Zeng, G.; Wang, J. Group DETR: Fast Training Convergence with Decoupled One-to-Many Label Assignment. arXiv 2022, arXiv:2207.13085. [Google Scholar]
- Bar, A.; Wang, X.; Kantorov, V.; Reed, C.; Herzig, R.; Chechik, G.; Rohrbach, A.; Darrell, T.; Globerson, A. DETReg: Unsupervised Pretraining with Region Priors for Object Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 14585–14595. [Google Scholar]
- Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13609–13617. [Google Scholar]
- Zhang, G.; Luo, Z.; Yu, Y.; Cui, K.; Lu, S. Accelerating DETR Convergence via Semantic-Aligned Matching. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 939–948. [Google Scholar]
- Gao, P.; Zheng, M.; Wang, X.; Dai, J.; Li, H. Fast Convergence of DETR with Spatially Modulated Co-Attention. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 3601–3610. [Google Scholar]
- Kitaev, N.; Kaiser, L.; Levskaya, A. Reformer: The Efficient Transformer. arXiv 2020, arXiv:2001.04451. [Google Scholar]
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
- Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for Fast Training Convergence. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 3631–3640. [Google Scholar]
- Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9756–9765. [Google Scholar]
| Stroke categories (pinyin, 52 classes) | | | | |
| --- | --- | --- | --- | --- |
| dian | fandian | duanheng | heng | changheng |
| shu | zuoxieshu | youxieshu | pie | shupie |
| fanpie | na | ti | piedian | shuti |
| hengzheti | wangou | shugou | shuwangou | xiegou |
| wogou | henggou | hengzhegou | hengzhexiegou | tizhegou |
| hengzhewangou | hengzuozhewangou | hengpiewangoungzhegou | hengpiewanwan | hengzhezhezhegou |
| hengzuozhezhezhegou | shuzhezhegou | shuwan | hengzhewan | hengzhe |
| hengzuozhe | xieshuzhe | shuzhe | shutizhe | piezhe |
| banpiezhe | hengpie | hengxiaopie | tixiaopie | banhengpie |
| hengna | hengzhezhepie | shuzhepie | hengxiegou | shuzhezhe |
| hengzhezhe | hengzhezhezhe | | | |
| Method | Epochs | AP | AP50 | AP75 | APS | APM | AR | Params | FLOPs | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Deformable DETR [35] | 150 | 79.8 | 90.5 | 89.1 | 71.8 | 79.8 | 90.6 | 40M | 144G | 5 |
| Mask Deformable DETR | 150 | 81.3 | 92.5 | 91.7 | 71.7 | 81.3 | 91.4 | 41M | 185G | 4 |
| SCSQ-MDD | 150 | 83.6 | 93.6 | 93.0 | 71.7 | 81.7 | 91.7 | 46M | 185G | 4 |
| Method | Backbone | SFN | AP | AP50 | AP75 | APS | APM | AR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Faster RCNN [8] | ResNet50 | | 78.5 | 96.4 | 94.4 | 68.9 | 78.5 | 82.6 |
| ATSS [37] | ResNet50 | | 71.8 | 89.8 | 83.1 | 79.6 | 71.7 | 79.0 |
| YOLOv5 | ResNet50 | | 84.1 | 94.1 | – | – | – | – |
| YOLOv5 | ResNet50 | √ | 84.9 | 94.9 | – | – | – | – |
| YOLOv7 [17] | ResNet50 | | 87.9 | 95.0 | – | – | – | – |
| SCSQ-MDD | ResNet50 | √ | 88.1 | 95.6 | 95.5 | 72.1 | 88.2 | 92.9 |
| SCSQ-MDD | ResNet101 | √ | 88.6 | 96.9 | 96.0 | 75.1 | 88.6 | 93.5 |
| Deformable DETR | Mask Deformable Attn | SCSQ | SFN | AP | AP50 | AP75 | AR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| √ | | | | 79.8 | 90.5 | 89.1 | 90.6 |
| √ | √ | | | 81.3 | 92.5 | 91.7 | 91.4 |
| √ | √ | √ | | 82.4 | 92.4 | 92.0 | 91.3 |
| √ | √ | | √ | 82.5 | 93.1 | 92.6 | 90.7 |
| √ | | √ | √ | 81.5 | 91.1 | 89.7 | 92.1 |
| √ | √ | √ | √ | 83.6 | 93.6 | 93.3 | 91.7 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).