An Adaptive YOLO11 Framework for the Localisation, Tracking, and Imaging of Small Aerial Targets Using a Pan–Tilt–Zoom Camera Network
Figures
Figure 1. SAM and Stable Diffusion augmentation process. (a) Original frame. (b) Object mask. (c) Augmented frame.
Figure 2. LLM and Stable Diffusion augmentation processes. (a) Hilly terrain—raining. (b) Hilly terrain—evening. (c) Hilly terrain—winter.
Figure 3. YOLO11x network architecture diagram.
Figure 4. Calibration pattern visualisation. (a) Intrinsic parameters. (b) Extrinsic parameters.
Figure 5. Experimental setup.
Figure 6. Top view—environment visualisation.
Figure 7. mAP50 vs. inference time.
Figure 8. mAP50-95 vs. inference time.
Figure 9. YOLO training metrics for 30 epochs. (a) Train DFL. (b) Precision. (c) Recall. (d) Val DFL. (e) mAP50. (f) mAP50-95.
Figure 10. YOLO confusion matrices on test datasets. (a) YOLOv8n. (b) YOLO11x.
Figure 11. Experiment 3—path visualisation.
Figure 12. Experiment 3—predicted vs. ground truth coordinates.
Figure 13. Experiment 6—path visualisation.
Figure 14. Experiment 6—predicted vs. ground truth coordinates.
Figure 15. Error distribution across axes for all experiments. Each box plot displays the quartile distribution alongside the measured outliers, shown as red crosses towards the right.
Figure 16. Zoom vs. focal length (left). Zoom vs. principal point (right).
Figure 17. Pan, tilt, and zoom feed (UF = 0.2).
Figure 18. Zoom-only feed (UF = 0.4).
Figure A1. YOLO F1-confidence curves. (a) YOLO11x. (b) YOLOv8n.
Figure A2. Mean reprojection error per image pair. (a) Intrinsic parameters. (b) Extrinsic parameters.
Figure A3. Experiment 1—fixed flight. (a) Target path visualisation. (b) Predicted vs. ground truth coordinates.
Figure A4. Experiment 2—fixed flight. (a) Target path visualisation. (b) Predicted vs. ground truth coordinates.
Figure A5. Experiment 4—random flight. (a) Target path visualisation. (b) Predicted vs. ground truth coordinates.
Figure A6. Experiment 5—random flight. (a) Target path visualisation. (b) Predicted vs. ground truth coordinates.
Abstract
1. Introduction
2. Related Work
2.1. Data Augmentation
2.2. Object Detection
2.3. Multi-Object Tracking
2.4. PTZ Tracking Applications
2.5. Camera Modelling and Calibration
2.6. Triangulation
3. Methodology
3.1. Dataset Collection and Annotation
3.2. Dataset Augmentation
3.3. Object Detection
3.4. Object Tracking
- Number of tracks (NOT): Since it is known in advance that only one target appears in the video from start to finish, the optimal result is for all detections to be assigned a single track ID; a larger number of track IDs therefore indicates lower tracking stability. Note that track IDs are not sequential in ByteTrack: when the target temporarily fails to match, the algorithm assumes a new target has appeared and assigns a new track ID. Once that detection is successfully re-matched with the previous target, the previous track ID is reused and the newly assigned ID is discarded, so track IDs are not equivalent to the number of tracks.
- Tracking length (TL): The number of consecutive frames in which the algorithm can continuously track the target in the longest identified trajectory in the video.
- Average tracking length (ATL): The mean tracking length of all trajectories. A higher value indicates a more robust algorithm.
- Matching rate (MR): The percentage of frames in which detected targets are assigned a track ID, relative to all frames in which targets are detected. A higher value indicates a more robust algorithm.
- Long-term matching rate (LTMR): The percentage of frames belonging to trajectories whose tracking length exceeds a set threshold, relative to all frames in which targets are detected. A higher value indicates a more robust algorithm. A sketch of how these metrics can be computed is given after this list.
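To make these definitions concrete, the following is a minimal Python sketch of how the five metrics could be computed from per-frame tracker output. The trajectory-grouping rule (maximal runs of consecutive frames carrying the same track ID) and the long-term threshold value are assumptions made for illustration, not details taken from the paper's implementation.

```python
from itertools import groupby
from typing import Optional, Sequence


def tracking_metrics(track_ids: Sequence[Optional[int]],
                     detected: Sequence[bool],
                     long_term_threshold: int = 30) -> dict:
    """Single-target tracking stability metrics (NOT, TL, ATL, MR, LTMR).

    track_ids[i] is the track ID assigned in frame i (None if no ID was assigned);
    detected[i] is True if the detector fired in frame i. A 'trajectory' is treated
    here as a maximal run of consecutive frames carrying the same track ID -- an
    assumption, since the exact grouping rule is not spelled out above.
    """
    # Number of tracks: distinct IDs that appear at least once.
    unique_ids = {tid for tid in track_ids if tid is not None}
    num_tracks = len(unique_ids)

    # Trajectories: maximal consecutive runs of a single track ID.
    runs = [len(list(group)) for tid, group in groupby(track_ids) if tid is not None]
    tl = max(runs, default=0)                      # longest trajectory
    atl = sum(runs) / len(runs) if runs else 0.0   # mean trajectory length

    detected_frames = sum(detected)
    matched_frames = sum(1 for tid, det in zip(track_ids, detected)
                         if det and tid is not None)
    long_term_frames = sum(r for r in runs if r > long_term_threshold)

    mr = 100.0 * matched_frames / detected_frames if detected_frames else 0.0
    ltmr = 100.0 * long_term_frames / detected_frames if detected_frames else 0.0
    return {"NOT": num_tracks, "TL": tl, "ATL": atl, "MR": mr, "LTMR": ltmr}
```

Because the sketch only needs per-frame track IDs and detection flags, it can be applied to unannotated videos, which is the setting these metrics were designed for.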
3.5. Camera Calibration
3.6. Stereo Triangulation
3.7. Coordinate Transformation
3.8. Model Deployment and Experimental Setup
- E2E time: Median end-to-end processing time from webcam frame capture to the PTZ command.
- Success rate: Percentage of frames in which the target lies completely within the PTZ frame (a computation sketch follows this list).
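The following is a minimal sketch of how these two deployment metrics might be computed offline from logged frames. The box-containment test and all function names are illustrative assumptions rather than the paper's implementation.

```python
import statistics
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in PTZ pixel coordinates


def box_fully_inside(box: Box, frame_w: int, frame_h: int) -> bool:
    """True if the target's bounding box lies entirely within the PTZ frame."""
    x1, y1, x2, y2 = box
    return x1 >= 0 and y1 >= 0 and x2 <= frame_w and y2 <= frame_h


def deployment_metrics(e2e_times_ms: Sequence[float],
                       target_boxes: Sequence[Optional[Box]],
                       frame_w: int, frame_h: int) -> Tuple[float, float]:
    """Median end-to-end latency (ms) and imaging success rate (%).

    e2e_times_ms: per-frame latency from webcam frame capture to the PTZ command.
    target_boxes: per-frame target box in the PTZ image, or None when the target
    does not appear in the PTZ feed at all.
    """
    median_e2e = statistics.median(e2e_times_ms)
    successes = sum(1 for b in target_boxes
                    if b is not None and box_fully_inside(b, frame_w, frame_h))
    success_rate = 100.0 * successes / len(target_boxes)
    return median_e2e, success_rate
```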
4. Results and Discussion
4.1. Detection Performance Metrics
4.2. Stereo Validation
4.3. PTZ Tracking Accuracy
| E2E Time (ms) | Success Rate (%) |
|---|---|
| 64.23 | 92.58 |
5. Conclusions
5.1. Contributions
- Collection and annotation of a large dataset of videos for object detection in flight under various backgrounds, seasons, and weather conditions.
- A data augmentation pipeline that utilises knowledge distillation based on the collaboration of open-source pre-trained large models for different tasks (a hedged sketch of one such augmentation step follows this list).
- Comparative experiments analysing how detection metrics vary when combining representative model structures (head, neck, and backbone) and backbone network sizes. In particular, this study incorporates the latest object detection model at the time of publication: YOLO11.
- Metrics that aid the evaluation of single-object tracking performance and the optimisation of hyperparameters in the absence of video annotations.
- A solution to the lack of depth perception in PTZ-based imaging systems reported in the literature, improving the characterisation of aerial targets through localisation.
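As referenced above, a hedged sketch of one such augmentation step is given below: SAM segments the target from its bounding-box annotation, and a Stable Diffusion inpainting model re-renders the background from a text prompt (cf. Figures 1 and 2). The model identifiers, checkpoint path, prompt, and example box are illustrative assumptions, not the authors' exact configuration.

```python
# A hedged sketch of a SAM + Stable Diffusion background-augmentation step.
# Model IDs, checkpoint paths, and the prompt are illustrative assumptions.
import numpy as np
import torch
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor
from diffusers import StableDiffusionInpaintPipeline


def augment_background(frame: Image.Image, target_box, prompt: str) -> Image.Image:
    """Keep the target pixels, inpaint everything else as a new scene."""
    # 1. Segment the target inside its (x1, y1, x2, y2) bounding box with SAM.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)
    predictor.set_image(np.array(frame.convert("RGB")))
    masks, _, _ = predictor.predict(box=np.array(target_box), multimask_output=False)
    object_mask = masks[0]  # True where the target is

    # 2. Inpaint the background (mask is white where pixels should be replaced).
    background_mask = Image.fromarray((~object_mask * 255).astype(np.uint8))
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")
    out = pipe(prompt=prompt,
               image=frame.convert("RGB").resize((512, 512)),
               mask_image=background_mask.resize((512, 512))).images[0]
    return out.resize(frame.size)


# Example (hypothetical frame and box): re-render the scene as hilly winter terrain.
# augmented = augment_background(Image.open("frame.png"), (410, 220, 470, 260),
#                                "hilly terrain, winter, snow, overcast sky")
```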
5.2. Limitations and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| ATL | average tracking length |
| DFL | distribution focal loss |
| DLT | direct linear transformation |
| E2E | end-to-end |
| FLOPS | floating point operations per second |
| FPN | feature pyramid network |
| FPS | frames per second |
| GAN | generative adversarial network |
| GPU | graphics processing unit |
| IP | Internet Protocol |
| LOS | line of sight |
| LTMR | long-term matching rate |
| mAP | mean average precision |
| MMLab | Multimedia Laboratory |
| MOT | multi-object tracking |
| MR | matching rate |
| NOT | number of tracks |
| PTZ | pan–tilt–zoom |
| R-CNN | region-based convolutional neural network |
| RT-DETR | Real-Time Detection Transformer |
| SAM | Segment Anything Model |
| SOTA | state-of-the-art |
| SORT | simple, online, and real-time |
| TL | tracking length |
| UAV | unmanned aerial vehicle |
| UF | undershoot factor |
| YOLO | You Only Look Once |
Appendix A
| Model | Scale | mAP50 | mAP50:100 | Parameters | FLOPs (G) | Inference (ms) |
|---|---|---|---|---|---|---|
| YOLOv5 | n | 0.783 | 0.486 | 2,509,634 | 7.2 | 0.6 |
| | s | 0.789 | 0.508 | 9,124,514 | 24.1 | 1.2 |
| | m | 0.816 | 0.535 | 25,068,610 | 64.4 | 2.5 |
| | l | 0.827 | 0.557 | 53,167,970 | 135.3 | 4.3 |
| | x | 0.828 | 0.551 | 97,205,186 | 246.9 | 7.4 |
| YOLOv6 | n | 0.765 | 0.486 | 4,238,738 | 11.9 | 0.6 |
| | s | 0.800 | 0.521 | 16,307,010 | 44.2 | 1.3 |
| | m | 0.795 | 0.519 | 51,998,962 | 161.6 | 3.8 |
| | l | 0.774 | 0.510 | 110,897,826 | 391.9 | 7.4 |
| | x | 0.789 | 0.515 | 173,025,874 | 611.2 | 8.3 |
| YOLOv7 | tiny | 0.721 | 0.404 | 8,116,226 | 21.3 | 0.7 |
| | vanilla | 0.836 | 0.549 | 44,224,385 | 132.2 | 4.7 |
| | x | 0.837 | 0.553 | 44,224,386 | 132.2 | 5.3 |
| | w6 | 0.826 | 0.552 | 102,496,192 | – | 5.7 |
| | e6 | 0.827 | 0.556 | 141,203,328 | – | 9.3 |
| | e6e | 0.836 | 0.562 | 195,713,904 | – | 13.0 |
| | d6 | 0.835 | 0.562 | 197,285,568 | – | 10.0 |
| YOLOv8 | n | 0.822 | 0.538 | 3,012,018 | 8.2 | 0.6 |
| | s | 0.845 | 0.570 | 11,137,922 | 28.7 | 1.3 |
| | m | 0.856 | 0.585 | 25,859,794 | 79.1 | 2.7 |
| | l | 0.860 | 0.596 | 43,634,466 | 165.4 | 4.7 |
| | x | 0.866 | 0.607 | 68,158,386 | 258.1 | 7.6 |
| YOLOv9 | t | 0.808 | 0.534 | 2,006,578 | 7.9 | 0.7 |
| | s | 0.831 | 0.563 | 7,289,730 | 27.4 | 1.6 |
| | m | 0.851 | 0.590 | 20,162,658 | 77.6 | 3.3 |
| | c | 0.852 | 0.588 | 25,533,858 | 103.7 | 4.5 |
| | e | 0.854 | 0.587 | 58,149,538 | 192.7 | 9.8 |
| YOLOv10 | n | 0.804 | 0.525 | 2,709,380 | 8.4 | 0.8 |
| | s | 0.840 | 0.570 | 8,070,980 | 24.8 | 1.6 |
| | m | 0.846 | 0.575 | 16,491,076 | 64.0 | 2.9 |
| | b | 0.848 | 0.585 | 20,460,276 | 98.7 | 4.0 |
| | l | 0.851 | 0.592 | 25,774,580 | 127.2 | 4.8 |
| | x | 0.855 | 0.594 | 31,666,420 | 171.1 | 7.7 |
| YOLO11 | n | 0.819 | 0.535 | 2,591,010 | 6.4 | 0.9 |
| | s | 0.850 | 0.581 | 9,430,098 | 21.6 | 1.8 |
| | m | 0.862 | 0.595 | 20,057,618 | 68.2 | 4.0 |
| | l | 0.862 | 0.599 | 25,315,090 | 87.3 | 5.3 |
| | x | 0.867 | 0.609 | 56,880,690 | 195.5 | 9.3 |
| Cascade R-CNN | ResNet | 0.504 | 0.316 | 69,167,000 | 166 | 21.4 |
| | Swin Transformer | 0.618 | 0.417 | 93,883,000 | 229 | 36.1 |
| | ConvNeXt | 0.704 | 0.522 | 94,501,000 | 224 | 29.9 |
| DyHead | ResNet | 0.739 | 0.506 | 38,901,000 | 70.5 | 59.5 |
| | Swin Transformer | 0.785 | 0.545 | 210,000,000 | 569 | 78.7 |
| | ConvNeXt | 0.750 | 0.522 | 64,276,000 | 130 | 61.4 |
| Deformable DETR | ResNet | 0.573 | 0.245 | 40,100,000 | 127 | 28.5 |
| | Swin Transformer | 0.111 | 0.040 | 61,908,000 | 191 | 42.6 |
| | ConvNeXt | 0.626 | 0.310 | 62,525,000 | 184 | 36.7 |
References
- O’Malley, J. The no drone zone. Eng. Technol. 2019, 14, 34–38. [Google Scholar] [CrossRef]
- Metz, I.C.; Ellerbroek, J.; Mühlhausen, T.; Kügler, D.; Hoekstra, J.M. Analysis of risk-based operational bird strike prevention. Aerospace 2021, 8, 32. [Google Scholar] [CrossRef]
- Zhang, W.; Cong, M.; Wang, L. Algorithms for optical weak small targets detection and tracking. In Proceedings of the International Conference on Neural Networks and Signal Processing, Nanjing, China, 14–17 December 2003; IEEE: New York, NY, USA, 2003; Volume 1, pp. 643–647. [Google Scholar]
- Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open mmlab detection toolbox and benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv 2021, arXiv:2112.10752. [Google Scholar]
- Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
- Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 6023–6032. [Google Scholar]
- Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. arXiv 2018, arXiv:1710.09412. [Google Scholar]
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
- Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://www.scirp.org/reference/referencespapers?referenceid=3532980 (accessed on 15 October 2024).
- Hui, Y.; Wang, J.; Li, B. STF-YOLO: A small target detection algorithm for UAV remote sensing images based on improved SwinTransformer and class weighted classification decoupling head. Measurement 2024, 224, 113936. [Google Scholar] [CrossRef]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Brooks, T.; Holynski, A.; Efros, A.A. InstructPix2Pix: Learning to Follow Image Editing Instructions. arXiv 2022, arXiv:2211.09800. [Google Scholar]
- Wang, S.; Xia, C.; Lv, F.; Shi, Y. RT-DETRv3: Real-time End-to-End Object Detection with Hierarchical Dense Positive Supervision. arXiv 2024, arXiv:2409.08475. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I. Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
- Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In IEEE Transactions on Pattern Analysis and Machine Intelligence; IEEE: New York, NY, USA, 2017. [Google Scholar]
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: High quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1483–1498. [Google Scholar] [CrossRef]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
- Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
- Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
- Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7373–7382. [Google Scholar]
- Tan, M.; Le, Q.V. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv 2019, arXiv:1905.11946. [Google Scholar]
- Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; IEEE: New York, NY, USA, 2016; Volume 9. [Google Scholar] [CrossRef]
- Wojke, N.; Bewley, A.; Paulus, D. Simple Online and Realtime Tracking with a Deep Association Metric. arXiv 2017, arXiv:1703.07402. [Google Scholar]
- Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022. [Google Scholar]
- Aharon, N.; Orfaig, R.; Bobrovsky, B.Z. BoT-SORT: Robust associations multi-pedestrian tracking. arXiv 2022, arXiv:2206.14651. [Google Scholar]
- Kang, S.; Paik, J.K.; Koschan, A.; Abidi, B.R.; Abidi, M.A. Real-time video tracking using PTZ cameras. In Proceedings of the Sixth International Conference on Quality Control by Artificial Vision; SPIE: Bellingham, WA, USA, 2003; Volume 5132, pp. 103–111. [Google Scholar]
- Di Caterina, G.; Hunter, I.; Soraghan, J.J. An embedded smart surveillance system for target tracking using a PTZ camera. In Proceedings of the 4th European Education and Research Conference (EDERC 2010), Nice, France, 1–2 December 2010; pp. 165–169. [Google Scholar]
- Unlu, H.U.; Niehaus, P.S.; Chirita, D.; Evangeliou, N.; Tzes, A. Deep learning-based visual tracking of UAVs using a PTZ camera system. In Proceedings of the IECON 2019, 45th Annual Conference of the IEEE Industrial Electronics Society, Lisbon, Portugal, 14–17 October 2019; IEEE: New York, NY, USA, 2019; Volume 1, pp. 638–644. [Google Scholar]
- Zhang, Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334. [Google Scholar] [CrossRef]
- Sinha, S.N.; Pollefeys, M. Towards calibrating a pan-tilt-zoom camera network. In Proceedings of the 5th Workshop Omnidirectional Vision, Camera Networks and Non-Classical Cameras; Citeseer: University Park, PA, USA, 2004; pp. 42–54. [Google Scholar]
- Wu, Z.; Radke, R.J. Keeping a pan-tilt-zoom camera calibrated. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 1994–2007. [Google Scholar] [CrossRef] [PubMed]
- Nasiri, S.M.; Hosseini, R.; Moradi, H. The optimal triangulation method is not really optimal. IET Image Process. 2023, 17, 2855–2865. [Google Scholar] [CrossRef]
- Lee, S.H.; Civera, J. Triangulation: Why optimize? arXiv 2019, arXiv:1907.11917. [Google Scholar]
- Hartley, R.I.; Sturm, P. Triangulation. Comput. Vis. Image Underst. 1997, 68, 146–157. Available online: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=970be04e6e469d841cb8a214f3ad95e5c659cc2f (accessed on 15 October 2024).
- Zhu, J.Y.; Zhang, R.; Pathak, D.; Darrell, T.; Efros, A.A.; Wang, O.; Shechtman, E. Toward multimodal image-to-image translation. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Schmalz, C.; Forster, F.; Angelopoulou, E. Camera calibration: Active versus passive targets. Opt. Eng. 2011, 50, 113601. [Google Scholar]
| Component | Manufacturer | Details | Manufacturer Location |
|---|---|---|---|
| GPU | NVIDIA | GeForce RTX 3050 | Santa Clara, CA, USA |
| CPU | AMD | Ryzen 9 5900HX | Santa Clara, CA, USA |
| Webcam | Logitech | C922 | Lausanne, Switzerland |
| PTZ camera | FLIR | M300C | Washington, DC, USA |
| Embedded system module | NVIDIA | Jetson TX2 | Santa Clara, CA, USA |
| Tracker | NOT | TL | ATL | MR (%) | LTMR (%) |
|---|---|---|---|---|---|
| BoT-SORT | 45 | 262 | 79.91 | 87.79 | 67.92 |
| ByteTrack | 51 | 276 | 72.37 | 96.64 | 68.48 |
| Calibration | Number of Image Pairs | Overall Mean Error (px) |
|---|---|---|
| Intrinsic | 33 | 0.40 |
| Extrinsic | 35 | 0.10 |
| Experiment | Mean Target Velocity (mm/s) | Median X Error (mm) | Median Y Error (mm) | Median Z Error (mm) | Recall (%) |
|---|---|---|---|---|---|
| Fixed: 1 | 345 | 10 | 36 | 33 | 74.96 |
| Fixed: 2 | 357 | 12 | 40 | 44 | 66.78 |
| Fixed: 3 | 363 | 30 | 29 | 28 | 72.78 |
| Random: 4 | 449 | 16 | 38 | 39 | 68.30 |
| Random: 5 | 684 | 16 | 42 | 39 | 77.68 |
| Random: 6 | 1026 | 14 | 72 | 34 | 85.71 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Cite as: Lui, M.H.; Liu, H.; Tang, Z.; Yuan, H.; Williams, D.; Lee, D.; Wong, K.C.; Wang, Z. An Adaptive YOLO11 Framework for the Localisation, Tracking, and Imaging of Small Aerial Targets Using a Pan–Tilt–Zoom Camera Network. Eng 2024, 5, 3488–3516. https://doi.org/10.3390/eng5040182