DeIoU: Toward Distinguishable Box Prediction in Densely Packed Object Detection

Published: 01 November 2024

Abstract

The Intersection over Union (IoU) is widely employed across the stages of object detection because it objectively quantifies the similarity between boxes. In densely packed scenes filled with crowded, small objects, however, adjacent positive boxes often overlap heavily. This overlap interference breaks the consistency between box quality evaluation and predicted confidence, leading to ambiguous box predictions in previous IoU-based models. To address this issue, we design a novel IoU-based learning paradigm tailored for dense scenes, called DeIoU. It suppresses unnecessary overlap between predicted boxes and thereby strengthens representation learning for non-salient objects. Specifically, DeIoU consists of a dense box regression loss $\mathcal{L}_{DeIoU}$ and a one-to-many (O2M) label matching strategy guided by DeIoU. These components calibrate position and shape prediction quality during training and learn distinguishable object features by penalizing overlap interference between neighboring boxes. Extensive experiments on four object detection datasets, SKU-110K, CrowdHuman, MS COCO 2017, and DIOR, demonstrate that our DeIoU-based learning strategy outperforms other state-of-the-art methods. Notably, the proposed method delivers a substantial improvement (on average $1.3$ AP and $1.8$ $MR^{-2}$) across popular detectors on SKU-110K and CrowdHuman while remaining distinctly competitive on small objects in natural scenes.
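
The abstract does not give the exact form of $\mathcal{L}_{DeIoU}$, so the sketch below is only illustrative: it computes the standard IoU between two axis-aligned boxes and a hypothetical dense-scene regression loss that adds an overlap-interference penalty between neighboring predictions to the usual $1 - \mathrm{IoU}$ term, which is the general idea described above. The names iou, dense_regression_loss, and penalty_weight are assumptions for illustration, not the paper's API.

# Illustrative sketch only: the exact L_DeIoU is defined in the full paper, not here.
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def dense_regression_loss(pred, gt, neighbor_preds, penalty_weight=0.5):
    """Hypothetical dense-scene regression loss (NOT the paper's exact L_DeIoU):
    a standard IoU loss toward the ground truth plus a penalty on the overlap
    between the prediction and neighboring predicted boxes."""
    loss = 1.0 - iou(pred, gt)                       # usual IoU regression term
    if neighbor_preds:                               # overlap-interference penalty
        interference = sum(iou(pred, nb) for nb in neighbor_preds) / len(neighbor_preds)
        loss += penalty_weight * interference
    return loss

# Example: two crowded predictions competing for adjacent objects.
pred = (10.0, 10.0, 50.0, 50.0)
gt = (12.0, 12.0, 52.0, 52.0)
neighbors = [(30.0, 10.0, 70.0, 50.0)]              # overlapping neighboring prediction
print(dense_regression_loss(pred, gt, neighbors))

In this toy version the penalty grows as a predicted box intrudes on its neighbors, pushing crowded predictions apart; the actual loss and the O2M matching strategy are specified in the full text.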



Published In

IEEE Transactions on Circuits and Systems for Video Technology, Volume 34, Issue 11, Part 1, November 2024, 789 pages

Publisher

IEEE Press
Qualifiers

  • Research-article
