Article

An Adaptive YOLO11 Framework for the Localisation, Tracking, and Imaging of Small Aerial Targets Using a Pan–Tilt–Zoom Camera Network

1 School of Aerospace, Mechanical and Mechatronic Engineering, The University of Sydney, Sydney, NSW 2006, Australia
2 School of Engineering, Australian National University, Canberra, ACT 2601, Australia
3 SiNAB Pty Ltd., Sydney, NSW 2229, Australia
4 Department of Unmanned Aircraft Systems, Hanseo University, Seosan 31963, Republic of Korea
* Author to whom correspondence should be addressed.
Eng 2024, 5(4), 3488-3516; https://doi.org/10.3390/eng5040182
Submission received: 16 October 2024 / Revised: 9 December 2024 / Accepted: 10 December 2024 / Published: 20 December 2024
(This article belongs to the Special Issue Feature Papers in Eng 2024)
Figure 1. SAM and Stable Diffusion augmentation process. (a) Original frame. (b) Object mask. (c) Augmented frame.
Figure 2. LLM and Stable Diffusion augmentation processes. (a) Hilly terrain—raining. (b) Hilly terrain—evening. (c) Hilly terrain—winter.
Figure 3. YOLO11x network architecture diagram.
Figure 4. Calibration pattern visualisation. (a) Intrinsic parameters. (b) Extrinsic parameters.
Figure 5. Experimental setup.
Figure 6. Top view—environment visualisation.
Figure 7. mAP50 vs. inference time.
Figure 8. mAP50-95 vs. inference time.
Figure 9. YOLO training metrics for 30 epochs. (a) Train DFL. (b) Precision. (c) Recall. (d) Val DFL. (e) mAP50. (f) mAP50-95.
Figure 10. YOLO confusion matrices on test datasets. (a) YOLOv8n. (b) YOLO11x.
Figure 11. Experiment 3—path visualisation.
Figure 12. Experiment 3—predicted vs. ground truth coordinates.
Figure 13. Experiment 6—path visualisation.
Figure 14. Experiment 6—predicted vs. ground truth coordinates.
Figure 15. Error distribution across axes for all experiments. Each box plot displays the distribution of each quartile alongside the measured outliers, which are shown as red crosses towards the right.
Figure 16. Zoom vs. focal length (left). Zoom vs. principal point (right).
Figure 17. Pan, tilt, and zoom feed (UF = 0.2).
Figure 18. Zoom-only feed (UF = 0.4).
Figure A1. YOLO F1-confidence curves. (a) YOLO11x. (b) YOLOv8n.
Figure A2. Mean reprojection error per image pair. (a) Intrinsic parameters. (b) Extrinsic parameters.
Figure A3. Experiment 1—fixed flight. (a) Target path visualisation. (b) Predicted vs. ground truth coordinates.
Figure A4. Experiment 2—fixed flight. (a) Target path visualisation. (b) Predicted vs. ground truth coordinates.
Figure A5. Experiment 4—random flight. (a) Target path visualisation. (b) Predicted vs. ground truth coordinates.
Figure A6. Experiment 5—random flight. (a) Target path visualisation. (b) Predicted vs. ground truth coordinates.

Abstract

This article presents a cost-effective camera network system that employs neural network-based object detection and stereo vision to assist a pan–tilt–zoom camera in imaging fast, erratically moving small aerial targets. Compared to traditional radar systems, this approach offers advantages in supporting real-time target differentiation and ease of deployment. Based on the principle of knowledge distillation, a novel data augmentation method is proposed to coordinate the latest open-source pre-trained large models in semantic segmentation, text generation, and image generation tasks to train a BicycleGAN for image enhancement. The resulting dataset is tested on various model structures and backbone sizes of two mainstream object detection frameworks, Ultralytics’ YOLO and MMDetection. Additionally, the algorithm implements and compares two popular object trackers, BoT-SORT and ByteTrack. The experimental proof-of-concept deploys the YOLOv8n model, which achieves an average precision of 82.2% and an inference time of 0.6 ms. Alternatively, the YOLO11x model maximises average precision at 86.7% while maintaining an inference time of 9.3 ms without bottlenecking subsequent processes. Stereo vision achieves accuracy within a median error of 90 mm while following a drone flying at over 1 m/s in an 8 m × 4 m area of interest. Stable single-object tracking with the PTZ camera is successful at 15 fps with an accuracy of 92.58%.

1. Introduction

The widespread use of small drones has created opportunities for efficiency gains and technological advancement in many applications, such as precision agriculture, wildlife monitoring, and urban logistics. However, booming drone usage has also increased drone incidents and threats in both the commercial and defence sectors. Unauthorised drone flights pose a significant safety threat to the public, as they may collide with airliners and buildings. One of the most notorious recent examples is the 2018 incident at London’s Gatwick Airport, where multiple unregistered drones resulted in the cancellation of approximately 800 flights [1]. In addition, large outdoor gatherings such as open-air concerts and carnivals attract unauthorised drone flights, which pose a serious threat to public safety. Yet, a suitable system that can detect and track these illegal drones is almost non-existent. This paper presents the development and experimental validation of a portable and field-deployable PTZ camera system that can detect, track, and classify small aerial targets, with the object detection computer vision algorithm being a key component of the study. To demonstrate the versatility of the method, the cameras used in this study are unmodified commercial off-the-shelf products controlled by the original firmware.
Detecting and tracking aerial targets has always been highly challenging in object detection. The low cost and accessibility of unmanned aerial vehicles (UAVs), more commonly known as drones, will only increase the likelihood of airspace obstacles and covert attacks. Wildlife also falls under this category of aerial targets, with frequent bird strikes below altitudes of 1000 m threatening aircraft take-off and approach [2]. In these cases, vision-based autonomous airspace monitoring holds advantages over existing radar and surveillance systems, which struggle to detect small targets.
Small targets may be classified as those taking up less than 0.1% of the frame, adapted from the definition of the International Society for Optics and Photonics [3]. For a 1080p frame (1920 × 1080, roughly 2.07 million pixels), this corresponds to approximately 2000 pixels, or a bounding box of around 45 × 45 pixels. Fewer pixels, and thus a narrower feature space, create challenges when identifying small objects in varying environmental conditions. Preliminary studies have shown that targets such as birds were only a few pixels in size, far below the standard definition of small objects in typical detection tasks. The size, movement speed, and trajectory of aerial targets add complexity on top of environmental clutter and occlusions. Fast-moving objects become blurred or distorted in camera frames, making them difficult to identify compared to when stationary. Furthermore, an overlooked concern arises when annotating many small targets. Beyond being a significant burden, segments of objects are prone to being excluded from their bounding boxes as they easily blend into their surroundings. Models trained on datasets that are not sufficiently expansive may also encounter targets that cannot be identified. Finally, balancing model performance against fixed computational resources tends to be neglected in the existing literature. Current object detection frameworks such as Ultralytics and MMDetection [4] support hundreds of detection algorithms, though they cannot assess algorithm suitability for system-specific deployment.
The proposed detection and tracking framework first investigates the effectiveness of two mainstream object detection frameworks: You Only Look Once (YOLO) and MMDetection. The former is known for its accurate real-time capabilities, while the latter provides modularity depending on the required task. Both are actively developed in the computer vision community. To construct a dataset, semi-supervised techniques assist in labelling small targets. Two dataset augmentation methods are proposed: one is based on the Segment Anything Model (SAM) [5] and Stable Diffusion [6] to replace image backgrounds; the other uses Llama 3 [7] and Stable Diffusion to generate image pairs for training generative adversarial networks (GANs) [8] to simulate environmental variations. Target ID tracking between frames is implemented and compared using two accessible, state-of-the-art object trackers: BoT-SORT and ByteTrack. While object detection effectiveness is determined by the mean average precision (mAP) and inference time, object tracking is independently measured against a set of proposed metrics.
The complete framework is deployed into a pan–tilt–zoom (PTZ) camera network for detailed imaging of small aerial objects. The network comprises two low-cost cameras for target triangulation and a PTZ camera for high-definition visualisation and tracking. Such a system offers advantages over standalone PTZ trackers by enabling target localisation and depth perception over a significantly larger monitoring area. To ensure real-time capabilities, the YOLO model is deployed through TensorRT to dramatically decrease inference time. Following camera calibration, triangulation accuracy is validated using a motion capture system. The resulting PTZ frame balances between keeping the target within view and having it fill a notable proportion of the frame.
The primary contribution of this study is the development and experimental validation of a networked camera system for ground-to-air surveillance. The computer vision techniques used are specifically optimised for the detection and classification of small flying targets. Furthermore, the proposed PTZ camera network implementation offers the advantage of depth perception and, thus, target localisation over single PTZ camera tracking.

2. Related Work

2.1. Data Augmentation

Data augmentation is a common training strategy used to enhance the performance and robustness of object detection models. It is commonly integrated into existing image detection frameworks. Traditional image augmentation typically includes synchronously rotating and flipping images and bounding boxes, altering brightness and contrast, and randomly adding motion blur and noise.
Modern augmentation methods have appeared through Mixup [10] and CutMix [9], proposed at ICLR 2018 and ICCV 2019, respectively. These methods improve a model’s generalisation ability by combining two images and their labels. YOLO adopts the mosaic augmentation method, an extension of CutMix, which randomly crops multiple images and combines them into a single image. Each version from YOLOv4 [11] onward has refined mosaic augmentation, up to YOLOv8 [12], which introduces negative samples of background frames into the mosaic. This enhancement significantly improves training efficiency and effectiveness.
Yanming et al. [13] proposed a particular data augmentation method against small targets. It divides the image into nine equal parts and randomly changes the scale of the targets in each part, replicating them within that section. The advantage of this approach is that it ensures the semantics of the augmented targets in the overall image remain unchanged. Thus, the augmented results become more suitable for backbone networks like Swin Transformer [14] and ViT [15] that tend to extract global features. However, these methods do not resolve interference from background pixels of irrelevant objects within the bounding boxes. This leads to poor model generalisation and an over-sensitivity to backgrounds during training.
With the introduction of GANs, pixel-level augmentation schemes for datasets have been proposed where generative tasks assist object detection tasks. However, training GANs often requires paired datasets or two sets of datasets with sufficiently different styles, which are almost nonexistent in aviation.
The advent of the Stable Diffusion model [6] allows for generating high-quality images based on prompts and original images. However, there are two issues with directly using Stable Diffusion in practice. First, the inference time is excessively long, leading to low augmentation efficiency. Second, some small targets may not be preserved during generation, or details may be severely distorted. Therefore, adapted from the methods in InstructPix2Pix [16], this article proposes a new augmentation scheme.

2.2. Object Detection

The fundamental goal of object detection tasks is to classify relevant targets in an image and to locate the dimensions of bounding boxes that most tightly enclose these targets. In practice, when the quality of dataset annotations meets the requirements of semantic segmentation tasks, object detection models can be extended to instance segmentation tasks. This is attributed to the principle of instance segmentation, which identifies the region where the instance is located and classifies the pixels within that region based on whether they belong to such an instance. Since it can more precisely distinguish between the target and the background, it is more accurate than general object detection models. However, inference time increases because this process requires pixel-level processing.
This article collectively defines object detection and instance segmentation as generalised object detection tasks, typically implemented using three independent sequential network structures: backbone, neck, and head. These are used, respectively, for feature extraction, feature fusion, and outputting results.
Currently, the mainstream object detection frameworks are Ultralytics and MMDetection. Ultralytics primarily supports algorithms based on one-stage object detection (without region proposal networks), covering the YOLO series and Real-Time Detection Transformer (RT-DETR) [17] algorithms. It allows developers to freely integrate their modules into YAML files to improve existing YOLO models. Due to its flexibility, ease of deployment, and rapid inference, its primary users are from industry.
MMDetection integrates state-of-the-art (SOTA) object detection algorithms proposed at top computer vision conferences up to 2022. It supports users in integrating their developed backbone, neck, and head structures and combining them with existing structures. As it is highly suitable for comparative experiments with existing SOTA algorithms and routinely maintained by the Multimedia Laboratory (MMLab), its primary users are from academia.
Since its inception, the YOLO model has evolved to its 11th generation. The original YOLO simplified object detection into an end-to-end regression problem, utilising a single neural network to simultaneously predict bounding boxes and categories. Starting from the second generation, YOLO drew inspiration from another well-known single-stage object detection algorithm of the same era, SSD [18], incorporating prior anchor boxes and detection based on multi-scale feature maps. The third generation introduced the iconic Darknet-53 backbone network, later extended to CSPDarknet53. Since then, each generation of YOLO has made improvements to its signature one-stage detection head and backbone network, reducing the number of parameters and inference latency.
The region-based convolutional neural network (R-CNN) series is the most iconic two-stage object detection framework. The original R-CNN [19] was proposed very early and used CNNs for feature extraction, SVMs for object classification, and linear regression for bounding box adjustment, a typical combination for SOTA algorithms of that era. Subsequently, Fast R-CNN [20] and Faster R-CNN [21] gradually replaced all non-neural network components with neural networks that can be graphics processing unit (GPU)-accelerated. Mask R-CNN, proposed by He et al. [22], extended the R-CNN series to instance segmentation tasks. Currently, the highest-scoring R-CNN algorithm on the MMDetection leaderboard is Cascade Mask R-CNN [23], an improved method proposed by Cai and Vasconcelos. It enhances the model’s performance in detecting difficult (low Intersection over Union) targets by adding a cascade matching mechanism to the detection head.
In recent years, Transformer-based object detection algorithms in the DETR [24] series have been proposed. These solutions can better handle occlusion and densely packed objects in complex scenes through global features, which aligns well with scenarios such as detecting flocks of birds. However, DETR faces challenges such as difficulty in convergence, slow inference, and poor performance in small object detection. To address these issues, DINO [25], RT-DETR [17], and Deformable DETR [26] have been proposed. The improvements in the aforementioned algorithms focus on the head component. Standard neck optimisations are usually based on the classic feature pyramid network (FPN), such as PAFPN [27] and BiFPN [28]. Additionally, the SOTA solution on MMDetection’s leaderboard, DyHead [29], improves the detection head: it adaptively extracts features through dynamic convolution and enhances feature representation and fusion by introducing scale, spatial, and task attention. This design balances classification and regression tasks during training, thus achieving significant performance improvements.
The backbone network is the core of object detection tasks and is responsible for extracting features from images. It largely determines the algorithm’s inference time and accuracy. Common backbone network baselines include Darknet from the YOLO series and ResNet [32], proposed by He et al. With the advent of neural architecture search (NAS) technology, EfficientNet [30] has become the SOTA solution for lightweight convolutional backbone networks. With the application of Transformers in visual tasks, using Transformer-based SOTA feature extractors, such as Swin Transformer, as backbone networks has become common due to their ability to capture global features. Recent SOTA backbone networks, such as ConvNeXt [31], have combined the advantages of both CNNs and Transformers. ConvNeXt has been implemented on Cascade Mask R-CNN to achieve results surpassing ResNet [32], but comparative experiments are lacking. This study aims to fill this gap in the literature.

2.3. Multi-Object Tracking

Both Ultralytics and MMDetection support the integration of object trackers into existing object detection models. In recent years, mainstream multi-object tracking (MOT) tasks mostly use tracking-by-detection methods based on Kalman filter motion state estimation. End-to-end methods have also been explored to integrate detection and tracking into a single network.
The most classic MOT algorithm is Simple Online and Realtime Tracking (SORT) [33], based on Kalman filters and the Hungarian matching algorithm. The DeepSORT [34] algorithm builds upon this by adding a CNN-based Re-ID component to extract target features and calculate similarity, assisting in matching trajectories with low reliability. The ByteTrack [35] algorithm proposed in 2022, an improvement over SORT, omits Re-ID to enhance computational efficiency. Instead, it re-matches low-confidence detections with the trajectories of high-confidence targets that were mismatched in previous frames, thereby reducing the mismatch rate of small and occluded targets and improving trajectory continuity. In the same year, the BoT-SORT [36] algorithm was proposed to improve the ByteTrack model. It uses an enhanced Kalman filter and camera motion compensation while reintegrating Re-ID to assist in trajectory matching.

2.4. PTZ Tracking Applications

Most literature on object tracking using a PTZ camera integrates object detection within the PTZ camera itself rather than using supplementary cameras. This method relies on the target remaining visible over consecutive frames to apply the relevant PTZ commands, which imposes a limitation on smaller targets that move quickly. Tracking aerial targets such as first-person-view drones thus becomes difficult, as the system must balance observing the target with sufficient zoom against the risk of losing sight of it.
PTZ tracking is more popular in fixed surveillance systems, where greater background consistency simplifies object detection. Kang et al. [37] leveraged this observation using the concept of background subtraction. While the PTZ camera remains at a single view, a corresponding background frame is generated from observations over 20–40 s. Their results were some of the earliest examples showing the viability of real-time PTZ tracking. Caterina et al. [38] provided a benchmark for real-time capability by achieving an execution time of under 38 ms with an internet protocol (IP) PTZ camera.
Unlu et al. [39] presented the literature closest to this study, exploring a UAV-focused PTZ tracking system based on deep learning techniques. Their work used K-nearest neighbours background subtraction to segment moving objects and validate the presence of a UAV using a modified ResNet identifier. This object detector was trained on a 55:45 ratio of positive samples (a mix of open-source datasets and in-house images) to negatives. The UAV target was successfully tracked for 71.2% and 60.8% of frames in indoor and outdoor locations, respectively, estimating the 3D position to be within 0.67 m of the ground truth within a 2 × 2 × 2 m space.

2.5. Camera Modelling and Calibration

Conventional optical cameras are modelled through the pinhole projection model, which relates 3D world coordinates to their projected points on a camera frame. This model requires knowledge of camera characteristics including the intrinsic parameters, extrinsic parameters, and distortion coefficients. Such values are estimated through camera calibration. This process became considerably more accessible with Zhang’s method [40], which requires only images of a calibration pattern at varying orientations. Camera calibration is currently freely available via MATLAB’s Camera Calibration Toolbox or OpenCV’s calibrateCamera.
With PTZ cameras, the calibration approach remains largely similar. Varying zoom changes the lens properties and, thus, the camera’s intrinsic parameters. Hence, the most conventional method for PTZ camera calibration is calibrating at discrete zoom values. Calibration results from Sinha and Pollefeys [41] have shown multiple possible trends across PTZ camera models. Focal length is expected to increase with a linear or parabolic relationship to zoom level, while the principal point deviates more severely from the frame centre with increasing zoom. Distortion effects are less critical for this study as the PTZ camera is only used for visual identification. However, it is noted that radial distortion effects are less prominent with increasing zoom.
Additional errors in a PTZ camera may arise from mechanical variation depending on manufacturing quality. Wu and Radke [42] proposed a method to determine such errors by capturing a calibration pattern before and after a pan–tilt command. A homography matrix may be calculated to reflect the true amount of rotation. The results showed that angular error increased linearly as the camera rotated further from its directional origin, as expected of the stepper motor used in the construction. Other errors that arose included random, accumulated, and power-cycling errors. The same azimuth and elevation command was observed to have varied by 5–7 pixels after 30 min of movement, increasing to 38 pixels after 200 h.

2.6. Triangulation

Triangulation describes extracting 3D information from 2D information in two or more images of a scene. Apart from stereo vision, it is also commonly used for tasks such as structure from motion and simultaneous localisation and mapping. After detecting a target’s bounding box, a line of sight (LOS) may be determined from the camera’s optical centre through the bounding box centroid. These LOS rays ideally intersect at a single point representing the object’s 3D world coordinates. In reality, the lines are always skew due to measurement noise, uncertainties in intrinsic or extrinsic parameters, and subpixel 2D inaccuracies [43].
The optimality of a triangulation algorithm is most widely judged by how well it minimises the reprojection error [44], known as the 2D error. This describes how well the calculated 3D point projects back onto the observed 2D points of each frame. The 3D error is also a valuable metric, taken from a comparison to the true 3D position in world space. Additionally, the robustness of an algorithm is determined by its invariance under affine transformations (scaling, rotation, etc.) and projective transformations (changes in perspective), and by its behaviour under Euclidean (position and orientation) reconstruction.
An elementary form of triangulation is the midpoint method, which takes the midpoint of the shortest segment connecting the two rays. Hartley and Sturm [45] criticise this method for lacking affine and projective invariance despite its ease of computation. Lee and Civera [44] proposed a generalised weighted midpoint method that extracts the depth from each frame to a single 3D point, locates a point on each ray at the corresponding depth, and computes a weighted average of these points. More recently, Nasiri et al. [43] claimed that the midpoint method is advantageous over alternative methods due to its lower sensitivity to error when there is uncertainty in the extrinsic camera parameters.
Another widely used form of triangulation is the Direct Linear Transformation (DLT). With a known camera projection matrix $P$ and a set of homogeneous frame coordinates $\mathbf{x}$, the corresponding homogeneous 3D coordinates $\mathbf{X}$, with $(x_W, y_W, z_W)$ representing the world frame, may be estimated. The resulting linear system may be solved through a least-squares solution using pseudo-inverses, singular value decomposition, or the eigenvector of $A^T A$ corresponding to its smallest eigenvalue [45]. Triangulation via DLT is implemented in OpenCV’s triangulatePoints.
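As an illustration only (not the configuration used in this study), the following minimal sketch triangulates one matched pixel pair with OpenCV’s triangulatePoints, assuming two hypothetical projection matrices built from a shared intrinsic matrix and a 0.5 m baseline:

```python
import cv2
import numpy as np

# Hypothetical shared intrinsics and a 0.5 m baseline along the x-axis.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])                   # reference camera
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])   # translated camera

# Matched pixel observations of the same point, shape (2, N).
pts1 = np.array([[320.0], [240.0]])
pts2 = np.array([[310.0], [240.0]])

X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)   # homogeneous 4xN solution via DLT
X = (X_h[:3] / X_h[3]).T                          # Euclidean 3D coordinates (here ~[0, 0, 40] m)
print(X)
```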
Additionally, Hartley and Sturm [45] proposed a notable alternative to the midpoint and DLT methods by solving a sixth-degree polynomial to minimise a chosen cost function. While its computational cost is higher than that of the alternatives, it was shown to possess affine and projective invariance.

3. Methodology

3.1. Dataset Collection and Annotation

This study obtained 49 videos containing birds, drones, aircraft, ships, and unidentified flying objects through on-site filming. Five of these videos were reserved for a test set, with the remaining videos used as the training set. From the training videos, one frame out of every five frames was extracted for annotation.
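The exact extraction tooling is not specified; as a minimal sketch, sampling every fifth frame can be scripted with OpenCV as follows:

```python
import cv2
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, step: int = 5) -> int:
    """Save every `step`-th frame of a training video as a JPEG for annotation."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved = idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(str(out / f"{Path(video_path).stem}_{idx:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```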
Subsequently, manual annotations on 18,205 images, including images without targets, were performed using LabelImg in the VOC format. Some birds occupied too few pixels to be annotated individually; however, birds usually appear in groups. In such cases, the movement patterns of dense bird groups were observed across adjacent frames and annotated as several bird flocks.
Finally, a semi-supervised annotation tool was trained using YOLOv7x to label the remaining images, with manual checks and adjustments conducted on these annotations. The manual and semi-supervised annotations amounted to 199,141 labels. Images without targets were deleted, resulting in a total of 21,217 images. The cleaned dataset was divided into an 80:20 split for training and validation, then annotations were converted into COCO and YOLO formats using X-Anylabeling.

3.2. Dataset Augmentation

This article proposes a novel data augmentation method to overcome the over-sensitivity of object detection models to background pixels within bounding boxes. It relies on the coordinated work among the latest open-source pre-trained large models in segmentation, text generation, and image generation tasks.
First, there must be considerations towards segmenting the target from the original image background rather than cropping the entire bounding box. This article leverages SAM to accomplish this task. Compared to traditional segmentation models, SAM has the advantage of accepting bounding boxes and coordinate points as input prompts, accurately extracting the masks of targets within the bounding boxes or associated with the coordinate points. Based on this feature, an algorithm filters bounding boxes of birds, aircraft and drones occupying more than 30 pixels. These bounding boxes are passed to SAM to complete instance segmentation and obtain object masks.
Next, a method is required to generate backgrounds appropriate to deployment environments. Limitations arise from directly using existing background image libraries or generating images via text prompts with Stable Diffusion, as the scene composition is not guaranteed to remain consistent between the original and augmented backgrounds. For aerial object detection, unrealistic scenes can be generated, such as an aircraft in front of a forest background. These combinations stray from real-world conditions while increasing the training burden, and thus cannot effectively improve the model’s generalisation ability.
This study’s dataset includes environments of farmland, forests, oceans, courtyards, and indoor laboratories. Based on detailed text prompts and ControlNet, which can use the original image as a control condition, this article refines and guides the generation process of Stable Diffusion. This process generates frames with new scenes, such as parks, roads, cities, hills, lakes, warehouses, ports, and beaches. The targets segmented by SAM are randomly copied within preset pixel coordinate ranges to obtain new images with enhanced backgrounds, demonstrated in Figure 1. Simultaneously, this process records the coordinates of the new target bounding boxes to generate new labels.
A procedure was adopted to ensure bounding box locations were consistent when converted into augmented images. First, the segmentation mask of the object was extracted, where a value of 1 represents the segmented pixels of the target and a value of 0 represents the background. Element-wise multiplication of the mask with the original image was performed to isolate the segmented target, resulting in an image where all pixels outside the target are set to 0. Subsequently, the new background, resized to match the dimensions of the original image, was multiplied element-wise with the binary inverse of the segmentation mask, effectively removing the regions corresponding to the embedded object. Finally, the isolated target and the background image with the target region removed were combined through element-wise addition to generate the final augmented image.
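A minimal sketch of this compositing step is shown below, assuming the SAM mask and the generated background are already available as NumPy arrays (function and variable names are illustrative):

```python
import cv2
import numpy as np

def composite_target(original_bgr: np.ndarray,
                     mask: np.ndarray,
                     new_background_bgr: np.ndarray) -> np.ndarray:
    """Paste a SAM-segmented target onto a generated background.

    `mask` is the binary object mask (1 = target pixel, 0 = background).
    """
    h, w = original_bgr.shape[:2]
    background = cv2.resize(new_background_bgr, (w, h))          # match original dimensions
    mask3 = np.repeat(mask[..., None].astype(original_bgr.dtype), 3, axis=-1)
    target_only = original_bgr * mask3                           # isolate the segmented target
    background_holed = background * (1 - mask3)                  # remove the target region
    return target_only + background_holed                        # element-wise addition
```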
Apart from background sensitivity, aerial object detectors are also highly sensitive to lighting conditions affected by the time of day, weather, and seasons. Hence, the following process also enhances images of the same background under these conditions while preserving existing details. Using Stable Diffusion directly for this purpose is impractical due to its computational cost and volatility; GANs can overcome these two issues. This step is motivated by the CVPR highlight article “InstructPix2Pix” [16] in obtaining a high-quality training set for the GAN. Suitable images from the LAION-Aesthetics dataset are filtered based on captions containing words representing seasons, weather, day and night, and outdoor scenes.
Next, using Llama 3 and prompt templates, the captions are batch-modified into Stable Diffusion prompts. For example, if the original image’s caption is “a snowy lake in winter”, the generated prompt is “Please strictly base the generated image on the scene provided in my image and generate a lake in sunny winter”. The generated prompts and filtered original images are fed into a Stable Diffusion model guided by ControlNet to obtain a large-scale, high-quality set of paired images. Large-scale generation is conducted on 16 NVIDIA cloud GPUs operating for over 168 h. The paired image dataset trains BicycleGAN [46], which augments these images while retaining their core features. If particular image pairs have notably different details, CycleGAN [47] is trained instead. The GAN weights are then used to obtain augmented images based on the required changes in time of day, weather and season, such as those in Figure 2.
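As a rough sketch of this pair-generation step, the snippet below uses the Hugging Face diffusers library with a Canny-conditioned ControlNet; the actual conditioning model, base checkpoint, and sampler settings used in this work are not specified and are assumptions here:

```python
import cv2
import numpy as np
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

# FP16 loading, as noted in Section 3.2; model choices here are illustrative.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

def restyle(image_path: str, prompt: str) -> Image.Image:
    """Generate a restyled counterpart of an image while keeping its scene layout."""
    bgr = cv2.imread(image_path)
    edges = cv2.Canny(bgr, 100, 200)                      # structural control signal
    control = Image.fromarray(np.stack([edges] * 3, axis=-1))
    return pipe(prompt, image=control, num_inference_steps=25).images[0]

# Prompt in the style produced by the Llama 3 template described above.
paired = restyle("lake_winter.jpg", "a lake in sunny winter, photorealistic")
```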
This approach consumes a large amount of time in the data preparation stage for training the GANs. However, it is a once-and-for-all solution: once the GANs are trained, images can be rendered quickly without repeatedly calling the Stable Diffusion model. The efficiency of these large models is improved through inference acceleration methods provided by Hugging Face. The lightweight version of SAM, MobileSAM, was evaluated but found to be less precise in segmentation, despite its efficiency benefits. For Stable Diffusion, optimised samplers were utilised to reduce the number of sampling steps. Model deployment techniques were also used to accelerate the inference of Stable Diffusion and Llama 3, including loading pre-trained models in mixed precision (FP16) and multi-GPU pipeline parallelism.

3.3. Object Detection

The presented framework supports both the Ultralytics and MMDetection libraries. The Ultralytics library was further developed to integrate enhancement modules open-sourced by YOLO-related developer communities and ensure compatibility with YOLOv7, the only version from 5 to 11 omitted from the native library. Most importantly, this modification allows the framework to fetch pre-trained models from the Hugging Face timm library as backbone networks.
This study selects mAP50 and mAP50-95 as the primary evaluation metrics while also considering floating-point operations (FLOPs), parameter count, and inference time to ensure that the model size and computational requirements remain within reasonable limits for real-time inference. Confusion matrices are generated to determine class-specific detection capabilities.
Hyperparameter selection is a further consideration. Based on the champion solution of the 2023 ICCV UAV Detection Challenge, increasing the input size significantly enhances the performance of detection models. Therefore, the input size was set to 1080p for all models. Additionally, the number of epochs is set to 30, as most models converged at this value during preliminary experiments. Other parameters, such as training strategies and augmentation methods, use the models’ default settings.
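For reference, a training run with these settings might look as follows in the Ultralytics API; the dataset YAML name is hypothetical, and the image size is rounded to the nearest multiple of the model stride:

```python
from ultralytics import YOLO

model = YOLO("yolo11x.pt")
results = model.train(
    data="aerial_targets.yaml",  # hypothetical dataset config (bird, bird flock, drone, aircraft, ship, ...)
    imgsz=1088,                  # closest stride multiple to the 1080p input used here
    epochs=30,                   # convergence point observed in preliminary experiments
    close_mosaic=10,             # Ultralytics disables mosaic augmentation for the final N epochs
)
metrics = model.val(split="test")  # reports mAP50 and mAP50-95 on the held-out split
```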
Initially, experiments are conducted on 38 models from the Ultralytics library, encompassing various network architectures from the YOLOv5 to YOLO11 generations. Subsequently, the following algorithms are chosen from the MMDetection framework: the SOTA DyHead algorithm, the best-performing Cascade Mask R-CNN from the R-CNN series, and the highest-scoring Transformer-based representative algorithm Deformable DETR. These algorithms are paired with backbone networks, including ResNet, Swin Transformer, ConvNeXt, and EfficientNet, resulting in 12 comparative experiments. An example of a standard network architecture layout is visualised in Figure 3, showing that of YOLO11x.

3.4. Object Tracking

The chosen object tracking methods employed the representative tracking-by-detection algorithms ByteTrack and BoT-SORT. They determine whether the targets identified in consecutive frames belong to the same object and assign IDs accordingly. BoT-SORT is derived from ByteTrack, though the former possesses more complex hyperparameters and distance calculation logic than the latter. The robustness of these algorithms is evaluated by running them on test videos containing only a single drone. The drone performs steady flight at constant speeds, then flight with complex trajectories at varying speeds.
Ultralytics does not provide evaluation methods to determine MOT metrics. Therefore, assuming that only one target exists in the video, an appropriate set of metrics is proposed as follows (a computational sketch is given after the list):
  • Number of tracks (NOT): Since it is known in advance that only one target appears in the video from start to finish, the optimal result is that all detected targets are assigned a single track ID. Therefore, a larger number of tracks corresponds to lower model stability. It is noted that track IDs are not sequential in ByteTrack. When the target temporarily mismatches, the algorithm assumes a new target has appeared and assigns a new track ID. Once these targets are successfully re-matched with the previous target, the previous track ID is reused, and the newly assigned ID is removed. Therefore, the largest assigned track ID is not equivalent to the number of tracks.
  • Tracking length (TL): The number of consecutive frames in which the algorithm can continuously track the target in the longest identified trajectory in the video.
  • Average tracking length (ATL): The mean tracking length of all trajectories. A higher value indicates a more robust algorithm.
  • Matching rate (MR): The percentage of frames in which detected targets are assigned IDs, out of all frames in which targets are detected. A higher value indicates a more robust algorithm.
  • Long-term matching rate (LTMR): The percentage of frames belonging to trajectories whose tracking length exceeds a set threshold, out of all frames in which targets are detected. A higher value indicates a more robust algorithm.
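The sketch below illustrates one way these metrics could be computed from per-frame tracker output; the exact bookkeeping used in this study (particularly how removed and reused IDs are counted for NOT) is an assumption, and the input format is hypothetical:

```python
from typing import Optional, Sequence

def tracking_metrics(ids: Sequence[Optional[int]], lt_threshold: int = 30) -> dict:
    """Single-target tracking metrics from one track ID per detected frame (None = unmatched)."""
    detected = len(ids)
    matched = [i for i in ids if i is not None]

    # Split the sequence into trajectories: runs of consecutive frames sharing an ID.
    lengths, run_id, run_len = [], None, 0
    for i in ids:
        if i is not None and i == run_id:
            run_len += 1
        else:
            if run_len:
                lengths.append(run_len)
            run_id, run_len = i, (1 if i is not None else 0)
    if run_len:
        lengths.append(run_len)

    return {
        "NOT": len(set(matched)),                                  # distinct track IDs in the output
        "TL": max(lengths, default=0),                             # longest trajectory (frames)
        "ATL": sum(lengths) / len(lengths) if lengths else 0.0,    # average trajectory length
        "MR": len(matched) / detected if detected else 0.0,        # matching rate
        "LTMR": sum(l for l in lengths if l > lt_threshold) / detected if detected else 0.0,
    }
```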

3.5. Camera Calibration

Camera calibration is conducted using MATLAB 2024a’s stereo camera calibrator. Intrinsic parameters are calibrated by placing the two cameras sufficiently close together so that a checkerboard pattern fills most of both frames. Outlier frame pairs are removed until the reprojection error is sufficiently low; Schmalz et al. suggested aiming for a reprojection error of less than 0.3 pixels for “day-to-day calibrations” [48]. The calculated intrinsic parameters are then fixed when calculating the extrinsic parameters, as distortion effects become much more difficult to determine once the calibration pattern occupies a much smaller portion of the frame. A comparison of pattern distances to the cameras is visualised in Figure 4.
MATLAB’s single-camera calibrator is then used to determine the intrinsic parameters of the PTZ camera. Frames are captured at appropriate increments of the zoom value, with smaller increments taken closer to 100% zoom due to the dramatic increase in focal length observed initially. The scale factor ($S$) of a frame then varies with the corresponding focal length ($f$) relative to the focal length at 0% zoom ($f_0$):
$S = \frac{f}{f_0}$
A relationship between the zoom ($Z$) and the focal length is then curve-fitted with an appropriate functional form $f = g(Z)$.
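As a sketch of this step (with made-up sample values; Section 4.2 reports that an exponential form fitted the data best), the curve fit and resulting scale factor can be computed as follows:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical calibration samples: zoom percentage vs. estimated focal length (pixels).
zoom = np.array([0, 20, 40, 60, 80, 90, 95, 100], dtype=float)
fx = np.array([1.4e3, 2.6e3, 4.8e3, 8.9e3, 1.65e4, 2.25e4, 2.6e4, 3.1e4])

def g(Z, a, b, c):
    """Exponential zoom-to-focal-length model f = g(Z)."""
    return a * np.exp(b * Z) + c

params, _ = curve_fit(g, zoom, fx, p0=(1.4e3, 0.03, 0.0), maxfev=10000)

def scale_factor(Z: float) -> float:
    """Frame scale factor S = f / f0 at zoom percentage Z."""
    return g(Z, *params) / g(0.0, *params)

print(scale_factor(50.0))
```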

3.6. Stereo Triangulation

Object detector models are deployed on two optical cameras facing an area of interest. As variations in the target pose are unknown, the bounding box centroid is extracted from each frame for triangulation. Following results from Nasiri et al. on the midpoint method’s advantages under uncertainty of intrinsic parameters, this method was chosen for indoor testing and validation [43]. The deployment pipeline is largely implemented in Python 3.11. First, lens distortion is corrected through OpenCV’s undistortPoints, which iterates from the normalised distorted pixels to the corresponding undistorted pixels.
When introducing equations for the midpoint method, the notation is equivalent to that used in Section 2.6. First, the unit vector $\mathbf{v}_j$ of each line of sight, from frame coordinates $(u_j, v_j)$ into the world frame, is determined from the pinhole projection model using the concept of similar triangles:
$\mathbf{v}_j = R_j^T \left[ \frac{u_j - u_0}{f_x}, \; \frac{v_j - v_0}{f_y}, \; 1 \right]^T \quad \text{for } j = 1, 2$
The shortest distance between these skew rays is then determined, corresponding to the distance between a point on each ray, $\mathbf{p}_1$ and $\mathbf{p}_2$. The distances $d_1$ and $d_2$ from each optical centre to these points are as follows, with $R_j$ and $\mathbf{t}_j$ representing the rotation matrix and translation vector of Camera $j$:
$d_1 = \frac{\left[ \left( R_1^T \mathbf{t}_1 - R_2^T \mathbf{t}_2 \right) \times \mathbf{v}_2 \right] \cdot \left( \mathbf{v}_1 \times \mathbf{v}_2 \right)}{\left\| \mathbf{v}_1 \times \mathbf{v}_2 \right\|^2}$
$d_2 = \frac{\left[ \left( R_2^T \mathbf{t}_2 - R_1^T \mathbf{t}_1 \right) \times \mathbf{v}_1 \right] \cdot \left( \mathbf{v}_2 \times \mathbf{v}_1 \right)}{\left\| \mathbf{v}_2 \times \mathbf{v}_1 \right\|^2}$
The coordinates of the calculated point on each ray are then:
$\mathbf{p}_j = -R_j^T \mathbf{t}_j + d_j \mathbf{v}_j \quad \text{for } j = 1, 2$
From this step, the midpoint method takes the average of the two points. When multiple objects are detected in the frame, a distance threshold between the points is imposed: the calculated midpoint for a given frame is only passed on if the two predicted points lie within this threshold. These equations extend naturally to averaging points across $n$ cameras.
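A minimal sketch of this midpoint triangulation is given below, assuming undistorted pixel centroids and extrinsics following the convention $\mathbf{x}_{\text{cam}} = R\,\mathbf{x}_{\text{world}} + \mathbf{t}$ (function and argument names are illustrative):

```python
import numpy as np

def midpoint_triangulation(uv1, uv2, K1, K2, R1, t1, R2, t2, max_gap=None):
    """Midpoint triangulation of a target from two undistorted bounding-box centroids."""
    def ray(uv, K, R):
        fx, fy, u0, v0 = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
        v = R.T @ np.array([(uv[0] - u0) / fx, (uv[1] - v0) / fy, 1.0])
        return v / np.linalg.norm(v)

    v1, v2 = ray(uv1, K1, R1), ray(uv2, K2, R2)
    c1, c2 = -R1.T @ t1, -R2.T @ t2                 # camera optical centres in world coordinates
    cross = np.cross(v1, v2)
    denom = float(np.dot(cross, cross))
    d1 = np.dot(np.cross(c2 - c1, v2), cross) / denom
    d2 = np.dot(np.cross(c2 - c1, v1), cross) / denom
    p1, p2 = c1 + d1 * v1, c2 + d2 * v2             # closest point on each line of sight

    if max_gap is not None and np.linalg.norm(p1 - p2) > max_gap:
        return None                                 # reject ambiguous multi-target pairings
    return 0.5 * (p1 + p2)
```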

3.7. Coordinate Transformation

The direction vector $\mathbf{v} = (x, y, z)$ from the PTZ camera position to the target may be converted to azimuth $\theta$, elevation $\phi$, and distance $|\mathbf{v}|$:
$\theta = \operatorname{atan2}(y, x)$
$\phi = \tan^{-1}\left( \frac{z}{\sqrt{x^2 + y^2}} \right)$
$|\mathbf{v}| = \sqrt{x^2 + y^2 + z^2}$
The pan and tilt commands correspond to azimuth and elevation, respectively. To determine an appropriate zoom command, an object’s size is assumed to be inversely proportional to its distance from the camera. This consideration is used to determine the required scale factor $S$ for the frame. For Camera $j$, we have the following:
$S_j = \frac{|\mathbf{v}|}{\text{Cam-to-Point Dist.}} \times \min\left( \frac{\text{Frame Width}}{\text{BBox Width}}, \; \frac{\text{Frame Height}}{\text{BBox Height}} \right)$
The minimum of $S_1$ and $S_2$ is taken after calculating an instance for each camera, ensuring the whole target is within the frame. A zoom undershoot factor $0 < UF < 1$ is also applied to account for the PTZ camera errors addressed by Wu and Radke [42] and to display a complete view of the surrounding environment:
$S = UF \times \min(S_1, S_2)$
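The sketch below converts a triangulated world point into pan, tilt, and zoom set-points following these equations; the frame resolution, the use of degrees, and the helper names are assumptions:

```python
import numpy as np

def ptz_command(target_w, ptz_pos_w, bbox_obs, frame_size=(1920, 1080), uf=0.4):
    """Pan, tilt, and zoom set-points from a triangulated target position.

    bbox_obs: [(bbox_w, bbox_h, cam_to_point_dist), ...] for each fixed camera.
    """
    v = np.asarray(target_w, dtype=float) - np.asarray(ptz_pos_w, dtype=float)
    x, y, z = v
    azimuth = np.degrees(np.arctan2(y, x))                   # pan command
    elevation = np.degrees(np.arctan2(z, np.hypot(x, y)))    # tilt command
    dist = np.linalg.norm(v)

    frame_w, frame_h = frame_size
    scales = [dist / d * min(frame_w / bw, frame_h / bh)     # S_j for each fixed camera
              for bw, bh, d in bbox_obs]
    return azimuth, elevation, uf * min(scales)              # apply the zoom undershoot factor

# Example: target 6 m away, appearing as a 40 x 30 px box roughly 5 m from each fixed camera.
print(ptz_command((4.0, 3.0, 2.0), (0.0, 0.0, 0.5), [(40, 30, 5.0), (35, 28, 5.5)]))
```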

3.8. Model Deployment and Experimental Setup

The chosen YOLO models are trained in Python and saved as PyTorch weights. This format generally has a longer inference time and is less advantageous for real-time deployment. TensorRT is widely accepted as the most effective way to minimise inference time while sacrificing negligible mAP, and it has been tested with YOLOv5, as announced on the Ultralytics forum.
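In the Ultralytics API, the conversion can be sketched as follows; the weights filename and inference settings are placeholders:

```python
from ultralytics import YOLO

# Export the trained PyTorch weights to a TensorRT engine (FP16) for deployment.
model = YOLO("yolov8n_aerial.pt")                 # hypothetical fine-tuned weights
model.export(format="engine", half=True, device=0)

# The generated .engine file is loaded like any other Ultralytics model.
trt_model = YOLO("yolov8n_aerial.engine")
results = trt_model.predict("webcam_frame.jpg", imgsz=1088, conf=0.25)
```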
Stereo vision is conducted over an area of interest measuring 8 m × 4 m, visualised in Figure 5 and Figure 6. The experimental hardware is listed in Table 1. Before conducting extrinsic calibration, the webcams are positioned with appropriately high parallax to minimise triangulation error. Notably, tracking performance is independent of any obstructions in front of the PTZ camera, such as the glass wall, since the configuration relies only on the visibility of the fixed cameras.
Inference is conducted on a Jetson TX2 for portability, which returns results to a Ryzen 9 5900HX laptop for localisation and PTZ control. If real-time capabilities cannot be achieved, inference is conducted solely on the laptop, which supports TensorRT through a mobile RTX 3050 GPU. The motion capture system is positioned overhead. An asynchronous framework in Python allows the task of dual object detection and localisation to run concurrently with the task of sending PTZ inputs through cURL commands.
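A self-contained sketch of this asynchronous structure is shown below; the localise() coroutine is a stand-in for the detection and triangulation steps of Sections 3.3 and 3.6, and the camera endpoint is hypothetical:

```python
import asyncio
import random

PTZ_URL = "http://192.0.2.10/cgi-bin/ptz"   # hypothetical PTZ camera HTTP endpoint

async def localise() -> tuple[float, float, float]:
    """Stand-in for dual-camera detection and triangulation (returns pan, tilt, zoom)."""
    await asyncio.sleep(1 / 15)              # ~15 FPS detection loop
    return random.uniform(-90, 90), random.uniform(0, 45), random.uniform(1, 5)

async def producer(queue: asyncio.Queue) -> None:
    while True:
        command = await localise()
        if queue.full():
            queue.get_nowait()               # drop stale commands, keep the freshest
        await queue.put(command)

async def consumer(queue: asyncio.Queue) -> None:
    while True:
        pan, tilt, zoom = await queue.get()
        cmd = f"curl -s '{PTZ_URL}?pan={pan:.1f}&tilt={tilt:.1f}&zoom={zoom:.2f}'"
        proc = await asyncio.create_subprocess_shell(cmd)   # IP delay dominates E2E time
        await proc.wait()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=1)
    await asyncio.gather(producer(queue), consumer(queue))

if __name__ == "__main__":
    asyncio.run(main())
```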
A motion capture system is used to validate the accuracy of triangulation and determine the 3D error. The effectiveness of the PTZ camera in tracking a moving aerial target is determined by its computational efficiency and the frequency with which it maintains the target within the frame. Hence, PTZ tracking performance is evaluated as follows:
  • E2E time: Median end-to-end processing time from webcam frame capture to the PTZ command.
  • Success rate: Percentage of frames in which the target is completely within the PTZ frame.

4. Results and Discussion

4.1. Detection Performance Metrics

Appendix A Table A1 provides model comparisons over the YOLO, Cascade R-CNN, DyHead, and Deformable DETR algorithms. The accuracy rankings of the models implemented using MMDetection align with their corresponding official accuracy rankings. However, the best model configuration, DyHead combined with Swin Transformer, only achieved performance comparable to the early lightweight YOLO models. Notably, MMDetection models typically converge faster than the YOLO series, which may be due to differences in training strategies. A significant difference in inference time is noted between the YOLO and MMDetection series. YOLO indeed has an advantage in inference time, though this is likely related to the implementation of the frameworks: the Ultralytics framework, geared towards industrial applications, likely incorporates more inference optimisations.
Most of the MMDetection models are from work published two years ago, so it is reasonable that they have been surpassed in performance by the more recent YOLO versions. Furthermore, most models exhibited abnormal convergence when using EfficientNet as the backbone, so it was not included in the comparison.
Due to its notable performance advantages, the focus shall be placed on comparing the YOLO models. A performance comparison of YOLO models from v5 to v11 is shown in Figure 7 and Figure 8, with extensive metrics in Appendix A Table A1. The best-performing model based on test set precision is YOLO11x, with an mAP50 of 86.7% and an mAP50-95 of 60.9%. However, the YOLOv8n model holds the highest mAP50 among the lightweight variants at 82.2% while maintaining the advantage of the lowest inference time at 0.6 ms. The older v5 and v7 versions show abnormal results that do not follow the trend in which mAP and inference time increase with larger models. YOLOv7 likely lacks streamlined integration, as Ultralytics does not natively support it. The larger YOLOv6 models likely overfitted during training, causing lower mAP despite having more parameters. At IoU thresholds of 50–95%, overfitting is also observed with YOLOv6, though YOLOv7 holds results that are more consistent with expected trends. The mAP degrades from v8 to v10 for the smaller models, suggesting that the model architecture or hyperparameters are unsuited to the existing dataset.
Figure 9 presents the variation of key metrics during the training process to demonstrate stability. Notably, there was a significant spike in training loss around the 20th epoch as the training process disabled YOLO’s built-in mosaic augmentation. While the training distribution focal loss (DFL) increased, the validation metrics did not deteriorate, validating the rationale behind the proposed training strategy. Precision plateaus at 30 epochs while recall, and thus mAP, retains a positive gradient. There is potential for improvement by increasing the number of epochs, though it comes with the risk of overfitting to the test set.
The confusion matrices for the highlighted YOLOv8n and YOLO11x models are shown in Figure 10. Both models achieved high detection accuracy for the critical targets of drones, aircraft, and flocks of birds. YOLO11x showed notable improvements in detecting larger objects such as aircraft and ships, as more parameters were available for feature extraction. Both models struggled with the bird and bird flock classes, commonly declaring false positives on background objects. The converse was also true, with birds often failing to be detected. Annotations of the bird class likely lacked distinctive features, such as being predominantly composed of one colour in cases of crows and ravens at a distance. The proposed data augmentation strategy also could not be applied to objects under 30 pixels due to the limitations of SAM. Class-based F1 score metrics are provided in Appendix A Figure A1 to visualise the balance of precision and recall with varying confidence.
Table 2 compares BoT-SORT and ByteTrack through the proposed object tracking metrics. ByteTrack holds higher matching and long-term matching rates, showing higher confidence in linking an existing track to the corresponding ID. However, the higher NOT suggests that ByteTrack is more susceptible to hastily assigning new IDs when a track disappears and is thus less stable. A higher average tracking length also validates that BoT-SORT holds higher stability than ByteTrack and will thus be chosen for subsequent experiments. However, both algorithms are seamlessly interchangeable within Ultralytics, and ByteTrack can be substituted if experiments miss too many existing IDs.
Deployment with the TensorRT model is tested on a Jetson TX2 due to its small form factor, achieving an inference time of 105.08 ms at 9.38 frames per second (FPS). Applied to two cameras, the FPS halves. While real-time localisation and tracking are difficult to implement on this device, the feasibility of the pipeline is proven. Improvements may be made through network pruning, further inference optimisation, and switching to higher-performing devices. Hence, deployment is conducted on a laptop (specifications are given in Section 3.8).

4.2. Stereo Validation

Outlier image pairs of high mean error were filtered until the overall mean error reached around 1 pixel or less. Image pairs and overall mean reprojection error are shown in Table 3 with individual errors shown in Appendix A Figure A2. Image pairs with higher mean errors, such as pairs 3 and 14, had their calibration planes at a greater angle to the camera field of view. Further uncertainties arise from calibrating extrinsic parameters using an A1-sized checkerboard pattern, as the grid corners are no longer in focus at greater distances. Without fixing the intrinsic parameters before extrinsic calibration, earlier calibration experiments have also shown overfitting on a local cluster of patterns. In these cases, the mean reprojection error becomes misleadingly low. Overfitting can be noticed by observing discrepancies between the estimated camera extrinsic characteristics and pattern locations.
Intrinsic parameters were generated with 33 image pairs to a mean reprojection error of 0.40 pixels. These parameters were set as defaults when calculating extrinsic parameters. A mean reprojection error of 0.10 pixels is achieved, validating against overfitting.
Triangulation 3D error was verified through controlled drone flying experiments along fixed and randomised paths. Three flight tests were conducted for each scenario and compared with the motion capture system. Figure 11, Figure 12, Figure 13 and Figure 14 provide a visualisation of the travelled paths and 1D comparisons between the ground truth and predicted coordinates. Pre-processing was conducted to correct any constant offsets in each axis. The results for the highest drone target velocity and, thus, the highest median Cartesian error are shown. Additional results are shown in Appendix A Figure A3, Figure A4, Figure A5 and Figure A6.
Predictions do not return a positive result at around 10 s, consistently across all three experiments. At this position, the drone is furthest from both cameras and appears much smaller in frame; hence, the detection framework may struggle more to return positive detections. The prediction error is most significant when the drone approaches closest to Camera 1, around 25 to 30 s. Camera calibration accuracy may become less reliable as the target strays from the cameras’ working distances. In this case, the object is so close to Camera 1 that it almost leaves the field of view of Camera 2. Both observations expose the shortcomings of the asymmetrical camera positioning, which is limited by device connection distances and laboratory space.
Results from the random flight tests expose more significant errors in the y and z axes. The geometry of the camera setup is separated by a greater x-distance, resulting in a more considerable disparity along the corresponding x-axis. Otherwise, the magnitude of error remains proportional to the target velocity. Less significant predictive discrepancies also remain in the experimental setup. Algorithm errors may arise from skewed camera lines of sight or mean reprojection errors during calibration, especially as only one pair of sight lines is used for localisation. Up to two tracking markers may also disappear due to the inherent motion capture configuration and propeller interference. In these cases, one marker is used to estimate the ground truth, which will deviate from the true centroid. Outside this article’s scope of investigation, temperature variation and mechanical changes over time will also influence camera accuracy.
Table 4 displays the 1D axis and Cartesian errors of all experiments alongside the drone velocity, reflecting reliable accuracy in estimating the 3D position of the target. The median errors for the x- and y-axes fall within half the dimensions of the DJI Spark, which are 289 × 245 × 56 mm. However, the z-axis error may localise the drone’s position away from the ground truth on the order of $10^2$ mm. It should be noted that these errors may compound as the viewable area of interest increases. The recall metric requires a true positive from both cameras. Greater variation was present in the random experiments, with a higher value corresponding to the drone flying more often towards Camera 2. The drone class within the dataset likely holds fewer annotations at the scale viewed by Camera 2.
Figure 15 visualises the 3D error distributions combined over the six experiments over 5123 data points, taken as the norm of all 1D errors. Similar trends are observable, with the x-direction having a smaller error than the other directions at a median of less than 20 mm. Considering outliers, the upper bound of errors approaches equal magnitudes of up to 180 mm. The results remain positively skewed, supporting prediction reliability.
Single camera calibration at zoom segments verified that zoom and focal length did not follow a linear relationship. Thus, calibration points were curve-fit using an exponential relationship. Figure 16 displays results for the x-axis focal length, as the difference in x- and y-directions was negligible.
A relationship between the zoom and the principal point was also determined to ensure that corresponding PTZ commands would keep the target close to centred. The linear fit on u mostly aligns with the ideal optical centre of $u_0 = 540$ for a 1080p display, while the variation in v holds uncertainty due to deviations at 100% zoom. A horizontal correction is included during deployment. Calibration at each zoom was repeated for over 50 image pairs with the same requirement of a mean reprojection error under 0.50 pixels. Thus, errors may have arisen from limited pattern angles or motion blur as frame pairs were adjusted.

4.3. PTZ Tracking Accuracy

Results are extracted using the fixed path from the earlier experiments over 40 s. For this case, an undershoot factor of 0.2 is applied to show the surrounding environment adequately, as seen in Figure 17. Table 5 quantifies the final performance of the end-to-end system, which holds real-time performance at 15 FPS with a high tracking efficacy of 92.58%.
Table 5. PTZ tracking performance metrics.
E2E Time (ms): 64.23
Success Rate (%): 92.58
Since asynchronous processes are used to send a PTZ command and calculate the subsequent 3D coordinates, the end-to-end processing time depends entirely on the PTZ IP delay. The final deployment runs at an equivalent of 15 FPS. The results show a notably higher success rate than the existing literature, such as that of Unlu et al. [39] at 71.2%, while displaying the target at a higher resolution. Figure 17 and Figure 18 display frame segments where the PTZ camera undergoes a pan–tilt–zoom combination.
Figure 17. Pan, tilt, and zoom feed (UF = 0.2).
The drone veering from the frame centre reflects the input lag contributed by the inherent delay of IP cameras, distinguishable in Figure 17. This deviation grows as the drone's velocity increases, likely because motion blur causes missed detections. When stationary, the drone exhibits a consistent central offset that varies with its position within the area of interest, suggesting a coordinate offset between the world frame and the PTZ coordinate frame. Figure 18 shows that zoom estimation is feasible within this small-scale environment: an undershoot factor of 0.4 corresponds to the drone width occupying approximately 40% of the PTZ camera frame. Frames that lose sight of the drone are almost always the result of the detection framework failing to detect it.
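The behaviour above can be made concrete with a small geometric sketch: the pan and tilt angles follow from the triangulated position relative to the PTZ camera, and the zoom is chosen so that the target spans roughly the undershoot factor's fraction of the frame. The PTZ pose, target width, frame width, and focal-length model below are illustrative assumptions, not the article's calibrated values.
```python
import numpy as np

def ptz_command(target_xyz, ptz_xyz=(0.0, 0.0, 1000.0), target_width_mm=289.0,
                uf=0.4, frame_width_px=1080,
                fx_at_zoom=lambda z: 1000.0 * np.exp(0.022 * z)):
    """Return (pan_deg, tilt_deg, zoom_pct) that centres and frames the target."""
    d = np.asarray(target_xyz, float) - np.asarray(ptz_xyz, float)
    pan = np.degrees(np.arctan2(d[1], d[0]))                   # azimuth about the vertical axis
    tilt = np.degrees(np.arctan2(d[2], np.hypot(d[0], d[1])))  # elevation above the horizontal
    dist = np.linalg.norm(d)

    # Projected width (px) = fx * target_width / distance; pick the zoom level whose
    # focal length makes that projection equal uf * frame width.
    desired_fx = uf * frame_width_px * dist / target_width_mm
    zooms = np.linspace(0.0, 100.0, 1001)
    zoom = zooms[np.argmin(np.abs(fx_at_zoom(zooms) - desired_fx))]
    return pan, tilt, zoom

print(ptz_command((4000.0, 2000.0, 1200.0)))
```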

5. Conclusions

The proposed camera network successfully detected, localised, and tracked a single aerial target in real time. The pipeline operated at 15 FPS, with the target visible in 92.58% of the PTZ camera frames. The object detection framework augmented a dataset of aerial targets to train YOLO models, achieving a mAP50 of at least 82.2% even with the most lightweight version, YOLOv8n. Maximising accuracy through YOLO11 at 86.7% retained real-time capabilities.

5.1. Contributions

This article provides the following contributions:
Collection and annotation of a large dataset of videos for object detection in flight under various backgrounds, seasons, and weather conditions.
A data augmentation pipeline that utilises knowledge distillation based on the collaboration of open-source pre-trained large models for different tasks.
Comparative experiments to analyse how detection metrics vary by combining representative model structures (head, neck, and backbone) and backbone network sizes. In particular, this study incorporates the latest object detection model at the time of publication: YOLO11.
Metrics to aid the evaluation of single object tracking performance and optimise hyperparameters in the absence of video annotations.
A solution addressing the absence of depth perception in PTZ-based imaging systems reported in the literature, improving the characterisation of aerial targets through localisation.

5.2. Limitations and Future Work

In real-world scenarios, aerial objects may remain unrecognised or sparsely annotated in existing datasets. Techniques such as out-of-distribution detection or zero-shot learning can be introduced to improve the detector's performance in such cases. The linear Kalman filter may perform poorly when estimating the motion of targets with complex trajectories; tracking can therefore be extended to non-linear motion estimation algorithms or end-to-end trained trackers to achieve more accurate motion predictions. This would also resolve the cases where the PTZ camera loses track of the target, by interpolating between the available predictions.
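For reference, the linear baseline discussed here is a constant-velocity filter of the kind sketched below, which can also propagate the 3D position through frames where the detector misses the target. The model and noise values are illustrative assumptions rather than the parameters used in this work.
```python
import numpy as np

dt = 1.0 / 15.0                                   # frame period of the 15 FPS deployment
F = np.eye(6)
F[:3, 3:] = dt * np.eye(3)                        # constant-velocity transition on [p, v]
H = np.hstack([np.eye(3), np.zeros((3, 3))])      # only the 3D position is observed
Q = 1e-2 * np.eye(6)                              # process noise (illustrative)
R = 4e2 * np.eye(3)                               # measurement noise, ~20 mm std (illustrative)

x = np.zeros(6)                                   # state: [px, py, pz, vx, vy, vz]
P = 1e3 * np.eye(6)

def kf_step(z=None):
    """One predict/update cycle; pass z=None when the detector misses the target."""
    global x, P
    x = F @ x                                     # predict
    P = F @ P @ F.T + Q
    if z is not None:                             # update only when a measurement exists
        y = np.asarray(z, float) - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ y
        P = (np.eye(6) - K @ H) @ P
    return x[:3]                                  # filtered (or extrapolated) 3D position
```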
Providing rectangular boxes to SAM generates masks for the instances within those boxes, yielding segmentation labels, as sketched below. This approach allows object detection datasets to be extended into instance segmentation datasets, enabling the training of more capable detectors. Furthermore, many studies have proposed neural network modules specifically designed to enhance small object detection and moving object detection, such as SPD convolution and DyHead; these modules can be integrated into YOLO to improve inference accuracy.
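The following is a minimal sketch of the box-prompted SAM workflow, using the open-source segment-anything package; the checkpoint path, frame file, and bounding box are hypothetical placeholders rather than values from this study.
```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Hypothetical paths: a SAM checkpoint and a video frame with a YOLO-style box.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("frame_0001.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

box = np.array([320, 180, 420, 260])              # hypothetical detection box (x0, y0, x1, y1)
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
mask = masks[0]                                   # boolean mask usable as a segmentation label
print(mask.shape, scores[0])
```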
Furthermore, the object detection dataset augmentation method proposed in this study holds significant potential for further research. In our future work, we will explore how to better optimise the coordination between these large models, which involves engineering optimisations, such as employing inference acceleration techniques. It may also involve structural and learning strategy optimisations, such as improving the network architecture at the task interface of the collaborative large models and leveraging fine-tuning techniques to make the proposed workflow as end-to-end as possible. Alternatively, we may explore knowledge distillation to learn more lightweight networks as substitutes for large models.
Deployment shows functionality within a small-scale environment of an 8 m × 4 m area of interest. However, accuracy will deteriorate with increasing surveillance area, given that object scale and camera hardware remain identical. Deployment should be considered for a large-scale environment where cameras can be connected wirelessly, such as through 900 MHz LoRa radio modules. In the case of monitoring very large airspace, the system is envisioned to be a modular setup where each computational node only communicates with a few cameras that cover a smaller part of the airspace of interest. The computer vision is done locally at each node. The global position of the tracked target is then computed by a central computer once the data from each individual node is collected. In this case, the accuracy and coverage may improve by including more modules without compromising the stability of the whole system.
An extrinsic calibration process suggested in research by Wu and Radke [42] may be explored to improve accuracy in keeping the drone precisely in the PTZ frame centre. Alternative PTZ camera models should also be tested to minimise end-to-end processing and reduce latency, thus improving the operational FPS.

Author Contributions

Conceptualisation, Z.W., D.W. and K.C.W.; methodology, M.H.L., H.L., Z.T., H.Y., Z.W. and D.W.; software, M.H.L., H.L., Z.T., H.Y., Z.W. and D.W.; validation, M.H.L., H.L. and Z.T.; formal analysis, M.H.L. and H.L.; investigation, M.H.L., H.L., Z.T., H.Y., Z.W. and D.L.; resources, Z.W., D.W. and K.C.W.; data curation, M.H.L., H.L., Z.T. and H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This project is supported by the Vacation Research Scholarship from the Faculty of Engineering at the University of Sydney. The PTZ camera is provided by SiNAB Pty Ltd.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The dataset was provided by Zihao Wang, and annotated by Haixu Liu and Hang Yuan. The use of this dataset requires citation of this article.

Conflicts of Interest

Author David Williams was employed by the company SiNAB Pty Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ATL average tracking length
DFL distribution focal loss
DLT direct linear transformation
E2E end-to-end
FLOPS floating point operations per second
FPN feature pyramid network
FPS frames per second
GAN generative adversarial network
GPU graphics processing unit
IP Internet Protocol
LOS line of sight
LTMR long-term matching rate
mAP mean average precision
MMLab Multimedia Laboratory
MOT multi-object tracking
MR matching rate
NOT number of tracks
PTZ pan–tilt–zoom
R-CNN region-based convolutional neural network
RT-DETR Real-Time Detection Transformer
SAM Segment Anything Model
SOTA state-of-the-art
SORT simple, online, and real-time
TL tracking length
UAV unmanned aerial vehicle
UF undershoot factor
YOLO You Only Look Once

Appendix A

Table A1. Performance metrics of tested models. Bolded data relate to chosen models of the main body.
Model | Scale | mAP50 | mAP50:100 | Parameters | Flops (G) | Inference (ms)
YOLOv5 | n | 0.783 | 0.486 | 2,509,634 | 7.2 | 0.6
YOLOv5 | s | 0.789 | 0.508 | 9,124,514 | 24.1 | 1.2
YOLOv5 | m | 0.816 | 0.535 | 25,068,610 | 64.4 | 2.5
YOLOv5 | l | 0.827 | 0.557 | 53,167,970 | 135.3 | 4.3
YOLOv5 | x | 0.828 | 0.551 | 97,205,186 | 246.9 | 7.4
YOLOv6 | n | 0.765 | 0.486 | 4,238,738 | 11.9 | 0.6
YOLOv6 | s | 0.800 | 0.521 | 16,307,010 | 44.2 | 1.3
YOLOv6 | m | 0.795 | 0.519 | 51,998,962 | 161.6 | 3.8
YOLOv6 | l | 0.774 | 0.510 | 110,897,826 | 391.9 | 7.4
YOLOv6 | x | 0.789 | 0.515 | 173,025,874 | 611.2 | 8.3
YOLOv7 | tiny | 0.721 | 0.404 | 8,116,226 | 21.3 | 0.7
YOLOv7 | vanilla | 0.836 | 0.549 | 44,224,385 | 132.2 | 4.7
YOLOv7 | x | 0.837 | 0.553 | 44,224,386 | 132.2 | 5.3
YOLOv7 | w6 | 0.826 | 0.552 | 102,496,192 | — | 5.7
YOLOv7 | e6 | 0.827 | 0.556 | 141,203,328 | — | 9.3
YOLOv7 | e6e | 0.836 | 0.562 | 195,713,904 | — | 13
YOLOv7 | d6 | 0.835 | 0.562 | 197,285,568 | — | 10
YOLOv8 | n | 0.822 | 0.538 | 3,012,018 | 8.2 | 0.6
YOLOv8 | s | 0.845 | 0.570 | 11,137,922 | 28.7 | 1.3
YOLOv8 | m | 0.856 | 0.585 | 25,859,794 | 79.1 | 2.7
YOLOv8 | l | 0.860 | 0.596 | 43,634,466 | 165.4 | 4.7
YOLOv8 | x | 0.866 | 0.607 | 68,158,386 | 258.1 | 7.6
YOLOv9 | t | 0.808 | 0.534 | 2,006,578 | 7.9 | 0.7
YOLOv9 | s | 0.831 | 0.563 | 7,289,730 | 27.4 | 1.6
YOLOv9 | m | 0.851 | 0.590 | 20,162,658 | 77.6 | 3.3
YOLOv9 | c | 0.852 | 0.588 | 25,533,858 | 103.7 | 4.5
YOLOv9 | e | 0.854 | 0.587 | 58,149,538 | 192.7 | 9.8
YOLOv10 | n | 0.804 | 0.525 | 2,709,380 | 8.4 | 0.8
YOLOv10 | s | 0.840 | 0.570 | 8,070,980 | 24.8 | 1.6
YOLOv10 | m | 0.846 | 0.575 | 16,491,076 | 64.0 | 2.9
YOLOv10 | b | 0.848 | 0.585 | 20,460,276 | 98.7 | 4.0
YOLOv10 | l | 0.851 | 0.592 | 25,774,580 | 127.2 | 4.8
YOLOv10 | x | 0.855 | 0.594 | 31,666,420 | 171.1 | 7.7
YOLO11 | n | 0.819 | 0.535 | 2,591,010 | 6.4 | 0.9
YOLO11 | s | 0.85 | 0.581 | 9,430,098 | 21.6 | 1.8
YOLO11 | m | 0.862 | 0.595 | 20,057,618 | 68.2 | 4
YOLO11 | l | 0.862 | 0.599 | 25,315,090 | 87.3 | 5.3
YOLO11 | x | 0.867 | 0.609 | 56,880,690 | 195.5 | 9.3
Cascade R-CNN | ResNet | 0.504 | 0.316 | 69,167,000 | 166 | 21.4
Cascade R-CNN | Swin Transformer | 0.618 | 0.417 | 93,883,000 | 229 | 36.1
Cascade R-CNN | ConvNeXt | 0.704 | 0.522 | 94,501,000 | 224 | 29.9
DyHead | ResNet | 0.739 | 0.506 | 38,901,000 | 70.5 | 59.5
DyHead | Swin Transformer | 0.785 | 0.545 | 210,000,000 | 569 | 78.7
DyHead | ConvNeXt | 0.75 | 0.522 | 64,276,000 | 130 | 61.4
Deformable DETR | ResNet | 0.573 | 0.245 | 40,100,000 | 127 | 28.5
Deformable DETR | Swin Transformer | 0.111 | 0.04 | 61,908,000 | 191 | 42.6
Deformable DETR | ConvNeXt | 0.626 | 0.31 | 62,525,000 | 184 | 36.7
Figure A1. YOLO F1-confidence curves. (a) YOLO11x. (b) YOLOv8n.
Figure A2. Mean reprojection error per image pair. (a) Intrinsic parameters. (b) Extrinsic parameters.
Figure A3. Experiment 1—fixed flight. (a) Target path visualisation. (b) Predicted vs. ground truth coordinates.
Figure A4. Experiment 2—fixed flight. (a) Target path visualisation. (b) Predicted vs. ground truth coordinates.
Figure A5. Experiment 4—random flight. (a) Target path visualisation. (b) Predicted vs. ground truth coordinates.
Figure A6. Experiment 5—random flight. (a) Target path visualisation. (b) Predicted vs. ground truth coordinates.

References

  1. O’Malley, J. The no drone zone. Eng. Technol. 2019, 14, 34–38. [Google Scholar] [CrossRef]
  2. Metz, I.C.; Ellerbroek, J.; Mühlhausen, T.; Kügler, D.; Hoekstra, J.M. Analysis of risk-based operational bird strike prevention. Aerospace 2021, 8, 32. [Google Scholar] [CrossRef]
  3. Zhang, W.; Cong, M.; Wang, L. Algorithms for optical weak small targets detection and tracking. In Proceedings of the 2003 International Conference on Neural Networks and Signal Processing, Nanjing, China, 14–17 December 2003; IEEE: New York, NY, USA, 2003; Volume 1, pp. 643–647. [Google Scholar]
  4. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open mmlab detection toolbox and benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar]
  5. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
  6. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv 2021, arXiv:2112.10752. [Google Scholar]
  7. Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar]
  8. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  9. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 6023–6032. [Google Scholar]
  10. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. arXiv 2018, arXiv:1710.09412. [Google Scholar]
  11. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  12. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://www.scirp.org/reference/referencespapers?referenceid=3532980 (accessed on 15 October 2024).
  13. Hui, Y.; Wang, J.; Li, B. STF-YOLO: A small target detection algorithm for UAV remote sensing images based on improved SwinTransformer and class weighted classification decoupling head. Measurement 2024, 224, 113936. [Google Scholar] [CrossRef]
  14. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  15. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  16. Brooks, T.; Holynski, A.; Efros, A.A. InstructPix2Pix: Learning to Follow Image Editing Instructions. arXiv 2022, arXiv:2211.09800. [Google Scholar]
  17. Wang, S.; Xia, C.; Lv, F.; Shi, Y. RT-DETRv3: Real-time End-to-End Object Detection with Hierarchical Dense Positive Supervision. arXiv 2024, arXiv:2409.08475. [Google Scholar]
  18. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I. Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  19. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  20. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  21. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  22. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In IEEE Transactions on Pattern Analysis and Machine Intelligence; IEEE: New York, NY, USA, 2017. [Google Scholar]
  23. Cai, Z.; Vasconcelos, N. Cascade R-CNN: High quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1483–1498. [Google Scholar] [CrossRef]
  24. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  25. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  26. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  27. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  28. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  29. Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7373–7382. [Google Scholar]
  30. Tan, M. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv 2019, arXiv:1905.11946. [Google Scholar]
  31. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  33. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; IEEE: New York, NY, USA, 2016; Volume 9. [Google Scholar] [CrossRef]
  34. Wojke, N.; Bewley, A.; Paulus, D. Simple Online and Realtime Tracking with a Deep Association Metric. arXiv 2017, arXiv:1703.07402. [Google Scholar]
  35. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022. [Google Scholar]
  36. Aharon, N.; Orfaig, R.; Bobrovsky, B.Z. BoT-SORT: Robust associations multi-pedestrian tracking. arXiv 2022, arXiv:2206.14651. [Google Scholar]
  37. Kang, S.; Paik, J.K.; Koschan, A.; Abidi, B.R.; Abidi, M.A. Real-time video tracking using PTZ cameras. In Proceedings of the Sixth International Conference on Quality Control by Artificial Vision; SPIE: Bellingham, WA, USA, 2003; Volume 5132, pp. 103–111. [Google Scholar]
  38. Di Caterina, G.; Hunter, I.; Soraghan, J.J. An embedded smart surveillance system for target tracking using a PTZ camera. In Proceedings of the 4th European Education and Research Conference (EDERC 2010), Nice, France, 1–2 December 2010; pp. 165–169. [Google Scholar]
  39. Unlu, H.U.; Niehaus, P.S.; Chirita, D.; Evangeliou, N.; Tzes, A. Deep learning-based visual tracking of UAVs using a PTZ camera system. In Proceedings of the IECON 2019—45th Annual Conference of the IEEE Industrial Electronics Society, Lisbon, Portugal, 14–17 October 2019; IEEE: New York, NY, USA, 2019; Volume 1, pp. 638–644. [Google Scholar]
  40. Zhang, Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334. [Google Scholar] [CrossRef]
  41. Sinha, S.N.; Pollefeys, M. Towards calibrating a pan-tilt-zoom camera network. In Proceedings of the 5th Workshop Omnidirectional Vision, Camera Networks and Non-Classical Cameras; Citeseer: University Park, PA, USA, 2004; pp. 42–54. [Google Scholar]
  42. Wu, Z.; Radke, R.J. Keeping a pan-tilt-zoom camera calibrated. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 1994–2007. [Google Scholar] [CrossRef] [PubMed]
  43. Nasiri, S.M.; Hosseini, R.; Moradi, H. The optimal triangulation method is not really optimal. IET Image Process. 2023, 17, 2855–2865. [Google Scholar] [CrossRef]
  44. Lee, S.H.; Civera, J. Triangulation: Why optimize? arXiv 2019, arXiv:1907.11917. [Google Scholar]
  45. Hartley, R.I.; Sturm, P. Triangulation. Comput. Vis. Image Underst. 1997, 68, 146–157. Available online: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=970be04e6e469d841cb8a214f3ad95e5c659cc2f (accessed on 15 October 2024). [Google Scholar]
  46. Zhu, J.Y.; Zhang, R.; Pathak, D.; Darrell, T.; Efros, A.A.; Wang, O.; Shechtman, E. Toward multimodal image-to-image translation. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  47. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  48. Schmalz, C.; Forster, F.; Angelopoulou, E. Camera calibration: Active versus passive targets. Opt. Eng. 2011, 50, 113601. [Google Scholar]
Figure 1. SAM and Stable Diffusion augmentation process. (a) Original frame. (b) Object mask. (c) Augmented frame.
Figure 2. LLM and Stable Diffusion augmentation processes. (a) Hilly terrain—raining. (b) Hilly terrain—evening. (c) Hilly terrain—winter.
Figure 3. YOLO11x network architecture diagram.
Figure 4. Calibration pattern visualisation. (a) Intrinsic parameters. (b) Extrinsic parameters.
Figure 5. Experimental setup.
Figure 6. Top view—environment visualisation.
Figure 7. mAP50 vs. inference time.
Figure 8. mAP50-95 vs. inference time.
Figure 9. YOLO training metrics for 30 epochs. (a) Train DFL. (b) Precision. (c) Recall. (d) Val DFL. (e) mAP50. (f) mAP50-95.
Figure 10. YOLO confusion matrices on test datasets. (a) YOLOv8n. (b) YOLO11x.
Figure 11. Experiment 3—path visualisation.
Figure 12. Experiment 3—predicted vs. ground truth coordinates.
Figure 13. Experiment 6—path visualisation.
Figure 14. Experiment 6—predicted vs. ground truth coordinates.
Figure 15. Error distribution across axes for all experiments. Each box plot displays the distribution of each quartile alongside the measured outliers, which are shown as red crosses towards the right.
Figure 16. Zoom vs. focal length (left). Zoom vs. principal point (right).
Figure 18. Zoom-only feed (UF = 0.4).
Table 1. Experimental hardware.
Component | Manufacturer | Details | Manufacturer Country
GPU | NVIDIA | GeForce RTX 3050 | Santa Clara, CA, USA
CPU | AMD | Ryzen 9 5900HX | Santa Clara, CA, USA
Webcam | Logitech | C922 | Lausanne, Switzerland
PTZ Camera | FLIR | M300C | Washington, DC, USA
Embedded System Module | NVIDIA | Jetson TX2 | Santa Clara, CA, USA
Table 2. Object tracking metrics.
Tracker | NOT | TL | ATL | MR (%) | LTMR (%)
BoT-SORT | 45 | 262 | 79.91 | 87.79 | 67.92
ByteTrack | 51 | 276 | 72.37 | 96.64 | 68.48
Table 3. Calibration metrics.
Number of Image Pairs | Overall Mean Error (px)
Intrinsic | 33 | 0.40
Extrinsic | 35 | 0.10
Table 4. Predicted vs. ground truth metrics.
Experiment | Mean Target Velocity (mm/s) | Median X Error (mm) | Median Y Error (mm) | Median Z Error (mm) | Recall (%)
Fixed: 1 | 345 | 10 | 36 | 33 | 74.96
Fixed: 2 | 357 | 12 | 40 | 44 | 66.78
Fixed: 3 | 363 | 30 | 29 | 28 | 72.78
Random: 4 | 449 | 16 | 38 | 39 | 68.30
Random: 5 | 684 | 16 | 42 | 39 | 77.68
Random: 6 | 1026 | 14 | 72 | 34 | 85.71
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
