DeIoU: Toward Distinguishable Box Prediction in Densely Packed Object Detection

Published: 01 November 2024

Abstract

The Intersection over Union (IoU) is widely employed across the stages of object detection because it objectively quantifies the similarity between boxes. In densely packed scenes filled with crowded, small objects, however, adjacent positive boxes often overlap heavily. This overlap interference breaks the consistency between box quality evaluation and predicted confidence, leading to ambiguous box predictions in previous IoU-based models. To address this issue, we design a novel IoU-based learning paradigm tailored for dense scenes, called DeIoU. It suppresses unnecessary overlap between predicted boxes and thereby strengthens representation learning for non-salient objects. Specifically, DeIoU consists of a dense box regression loss $\mathcal{L}_{DeIoU}$ and a one-to-many (O2M) label matching strategy guided by DeIoU. These components calibrate position and shape prediction quality during training and learn distinguishable object features by penalizing overlap interference between neighboring boxes. Extensive experiments on four object detection datasets, SKU-110K, CrowdHuman, MS COCO 2017, and DIOR, demonstrate that our DeIoU-based learning strategy outperforms other state-of-the-art methods. Notably, the proposed method delivers a substantial improvement (on average $1.3$ AP and $1.8$ $MR^{-2}$) across popular detectors on SKU-110K and CrowdHuman while remaining distinctly competitive on small objects in natural scenes.
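
The abstract does not give the exact form of $\mathcal{L}_{DeIoU}$, so the sketch below is only illustrative: it computes the standard IoU between two axis-aligned boxes and a hypothetical dense-scene regression loss that adds an overlap-interference penalty between neighboring predictions to the usual $1 - \mathrm{IoU}$ term, which is the general idea described above. The names iou, dense_regression_loss, and penalty_weight are assumptions for illustration, not the paper's API.

# Illustrative sketch only: the exact L_DeIoU is defined in the full paper, not here.
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def dense_regression_loss(pred, gt, neighbor_preds, penalty_weight=0.5):
    """Hypothetical dense-scene regression loss (NOT the paper's exact L_DeIoU):
    a standard IoU loss toward the ground truth plus a penalty on the overlap
    between the prediction and neighboring predicted boxes."""
    loss = 1.0 - iou(pred, gt)                       # usual IoU regression term
    if neighbor_preds:                               # overlap-interference penalty
        interference = sum(iou(pred, nb) for nb in neighbor_preds) / len(neighbor_preds)
        loss += penalty_weight * interference
    return loss

# Example: two crowded predictions competing for adjacent objects.
pred = (10.0, 10.0, 50.0, 50.0)
gt = (12.0, 12.0, 52.0, 52.0)
neighbors = [(30.0, 10.0, 70.0, 50.0)]              # overlapping neighboring prediction
print(dense_regression_loss(pred, gt, neighbors))

In this toy version the penalty grows as a predicted box intrudes on its neighbors, pushing crowded predictions apart; the actual loss and the O2M matching strategy are specified in the full text.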



Published In

IEEE Transactions on Circuits and Systems for Video Technology, Volume 34, Issue 11, Part 1, November 2024, 789 pages

Publisher

IEEE Press
Qualifiers

  • Research-article
