Technical Note

STCA: High-Altitude Tracking via Single-Drone Tracking and Cross-Drone Association

1 School of Information Engineering, Shenyang University of Chemical Technology, Shenyang 110016, China
2 State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China
3 Key Laboratory of Manufacturing Industrial Integrated Automation, Shenyang University, Shenyang 110003, China
4 University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(20), 3861; https://doi.org/10.3390/rs16203861
Submission received: 30 September 2024 / Revised: 14 October 2024 / Accepted: 14 October 2024 / Published: 17 October 2024
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)
Graphical abstract
Figure 1. In scenarios where targets appear similar within a single-drone view and exhibit deformation across multiple drones, Re-ID methods struggle to correctly assign IDs to targets. Single-Drone Tracking is used to differentiate similar targets within a single drone, while Cross-Drone Association is employed to accurately associate each target across multiple drones.
Figure 2. Principles of LightGlue on the MDMT dataset.
Figure 3. Proposed T-LightGlue framework.
Figure 4. Proposed STCA framework. Single-Drone Tracking requires training, while Cross-Drone Association is used only during the inference phase. During inference, Single-Drone Tracking provides tracking information between consecutive frames. Cross-Drone Association is then used to establish target associations within the current frame, directly generating all tracking relationships in that frame.
Figure 5. The results of the Common View Area Model are presented as a visualization of the target labeling rules under three different conditions. The targets represented by Line 1 and Line 3 have an even number of intersection points with the common field of view and are labeled as outside the common area. The target represented by Line 2 has an odd number of intersections with the common field of view, so it is labeled as inside the common area. The common area is outlined by a blue line. The centroids of objects within the area are highlighted in green, while those outside the area are displayed in red.
Figure 6. Schematic comparison of the Hungarian algorithm, K-Nearest Neighbors algorithm, and Local-Matching Model.
Figure 7. Proposed Adjacent Frame Association Model framework.
Figure 8. Visualization of the tracking effect of STCA on the MDMT dataset.
Figure 9. The per-frame MDA analysis of sequence 34 demonstrates that the proposed model consistently maintains an advantage.

Abstract

In this paper, we introduce a high-altitude multi-drone multi-target (HAMDMT) tracking method called STCA, which aims to collaboratively track similar targets that are easily confused. We approach this challenge by dividing HAMDMT tracking into two principal tasks: Single-Drone Tracking and Cross-Drone Association. Single-Drone Tracking combines positional and appearance information into joint feature vectors to overcome the challenges arising from similar target appearances within the field of view of a single drone. Cross-Drone Association employs image-matching technology (LightGlue) to ascertain the topological relationships between images captured by different drones, thereby accurately determining the associations between targets across multiple drones. For Cross-Drone Association, we enhance LightGlue into a more efficient variant, designated T-LightGlue, for cross-drone target tracking. This approach markedly accelerates the tracking process with only a small decline in the evaluation metrics. To narrow down the range of targets involved in Cross-Drone Association, we develop a Common View Area Model based on the four vertices of the image. To mitigate the occlusion encountered by high-altitude drones, we design a Local-Matching Model that assigns the same ID to the mutually nearest pair of targets from different drones after mapping the target centroids across drones. The MDMT dataset is the only known dataset captured by high-altitude drones and contains a substantial number of similar vehicles. On the MDMT dataset, STCA achieves the highest MOTA and the second-highest IDF1 in Single-Drone Tracking, and the highest MDA in Cross-Drone Association.

Graphical Abstract">

Graphical Abstract

1. Introduction

Multi-camera multi-target tracking [1] presents a significant challenge in computer vision. Using multiple cameras captures more information and helps mitigate occlusion issues compared with single-camera tracking [2,3,4,5]. For example, a target might be occluded in one camera’s field of view but visible in another, leading to more accurate tracking results. The primary objective of multi-camera multi-target tracking is to ensure consistent target representation in both single-camera and cross-camera scenarios. This field has numerous potential applications, including crowd behavior analysis [6], urban surveillance [7,8], military security [9,10], and traffic scene understanding [11,12,13].
Target tracking tasks usually follow the tracking-by-detection paradigm [14,15,16,17]: before targets are associated across frames, a detector first produces the bounding boxes of all targets in each frame for subsequent tracking. The robust performance of modern detectors enables even very small targets to be identified, propelling the advancement of target-tracking technology in the domain of high-altitude drones. Operating at elevated altitudes [18,19] expands the monitoring area and offers a solution to the challenge of detecting and tracking targets within a confined region. However, high-altitude drones are also characterized by high flight altitude and rapid movement, which present significant challenges to target tracking.
Most existing target association methods for multi-drone multi-target tracking are based on target appearance. However, when drones operate at high altitudes, the captured targets appear small because of the increased distance between the camera and the target. In some scenarios, such as traffic scenes, a single drone captures a large number of identical or highly similar vehicles. These small targets share similar appearance characteristics, and during multi-drone tracking they additionally undergo large deformation under different viewpoints. Appearance-based tracking methods therefore cannot achieve good results in high-altitude drone scenarios.
Small, similar targets captured by high-altitude drones are difficult to distinguish in Single-Drone Tracking by appearance alone, and the combination of similar small targets and large deformation also causes association errors in appearance-based cross-drone tracking methods. To address both problems, we propose a novel tracking model, STCA. By constructing a joint feature vector that incorporates target appearance and positional data, we effectively address the target association confusion caused by target similarity in Single-Drone Tracking. This vector is then fed into the Adjacent Frame Association Model to generate the tracking matrix. Cross-Drone Association uses an improved model based on LightGlue [20] to reduce the error rate by determining accurate inter-drone associations, especially when similar targets appear within a single drone view. To demonstrate the effectiveness of STCA, we conduct extensive experiments on the more challenging MDMT dataset. In the MDMT dataset, the targets are often numerous identical or similar vehicles captured by drones at high altitudes, resulting in smaller and similar-looking targets and thereby increasing the complexity of HAMDMT tracking. To illustrate the improvements STCA offers over the Re-ID approach, we extracted a region comprising 2.5% of the original image from the MDMT [21] dataset, as shown in Figure 1.
The transformation matrix of images captured by different drones requires the image-matching algorithm to be rotationally invariant. However, LightGlue, a deep learning-based algorithm, lacks this property. Consequently, we designed a new model, T-LightGlue, which is both faster and more suitable for cross-drone tracking, given its ability to generate a specified number of matches with each iteration.
In the field of multi-drone multi-object tracking, the tracked targets can be classified into two categories. The first category comprises targets that participate in both single-drone and cross-drone tracking; in the absence of occlusion or low confidence, these targets are visible in the fields of view of multiple drones simultaneously. The second category comprises targets that appear in the field of view of only a single drone. In Cross-Drone Association, we designed the Common View Area Model by mapping the four vertex coordinates of the images, based on the concept of the area shared by the views of the two drones. This model considers only the association relationships of targets within the common view area during the Cross-Drone Association process.
In addition, we propose a Local-Matching Model. Unlike the traditional Hungarian and K-Nearest Neighbors algorithms, the Local-Matching Model fully considers the situation where a target is detected in one field of view, but the corresponding target in another field of view does not participate in the subsequent matching due to occlusion, low target confidence, and other reasons. Ultimately, the Single-Drone Tracking matrix and the Cross-Drone Association matrix are integrated to generate a target association matrix, establishing tracking relationships across consecutive frames. The main contributions are summarized as follows:
  • We divide the task of HAMDMT tracking into two sub-tasks: Single-Drone Tracking and Cross-Drone Association, to collaboratively track easily confusable similar targets. Single-Drone Tracking constructs a joint vector from the target appearance and position, leveraging positional data to handle appearance similarities. The Cross-Drone Association utilizes the improved T-LightGlue algorithm to accurately link object relations between the two views.
  • We utilize the four vertices of the image to design a Common View Area Model, considering only targets within this area for Cross-Drone Association. Additionally, the Local-Matching Model ensures that the nearest center points of targets from different drones, after mapping, are identified as the same target.
  • Extensive experiments conducted on the only known high-altitude multi-drone dataset, MDMT, demonstrate that the STCA achieves state-of-the-art performance.

2. Related Work

In this section, we explain the relevant concepts of object detection, object tracking, and image matching. More specifically, Section 2.1 focuses on explaining the target detection algorithm. Section 2.2 presents the object-tracking task, and Section 2.3 discusses image-matching techniques.

2.1. Object Detection

Target detection for target tracking means locating and identifying specific targets within a video sequence. This information is then used to determine the target positions required for subsequent tracking, which is the primary task of most target tracking systems. Detection models are primarily classified into two categories according to their detection process: single-stage (one-stage) models and two-stage models. A single-stage model performs target localization and classification directly on the input image, so the entire detection process consists of a single stage. This type of detector is distinguished by its rapid detection speed, rendering it well suited to scenarios that demand fast detection; a convolutional neural network simultaneously generates the bounding boxes and category predictions of the targets. A two-stage model divides detection into two phases: a set of candidate regions of interest is first identified, and classification and bounding box regression are then performed on these regions. Such detectors are distinguished by high detection accuracy and are particularly effective in complex scenes or when detecting smaller targets. Commonly used one-stage detection algorithms include YOLOv1 [22], YOLOv2 [23], and SSD [24]; commonly used two-stage algorithms include Mask R-CNN [25] and Faster R-CNN [26]. Given the emphasis on detection speed in target tracking, we adopt a single-stage detector, CenterNet [27], to identify the targets in each frame of the video.

2.2. Target Tracking

In recent years, considerable attention has been paid to unifying target detection and target tracking into an end-to-end model for multi-drone multi-target tracking. The majority of these approaches rely on re-identification (Re-ID). As an example of a graph-based method, the Reconfigurable Spatiotemporal Graph Model (ReST) [28] uses target appearance to construct graph nodes and thereby completes the tracking task. Cross-View Multi-Object Tracking (CrossMOT) [29] jointly trains an object detection head, a single-camera Re-ID head, and a cross-camera Re-ID head to form a comprehensive tracking model. This reliance on appearance information is also evident in other tracking models [30,31]. Re-ID models (including [32,33]) require detailed appearance features for different targets, as reflected in datasets such as MvMHAT [34], VisionTrack [35], and WILDTRACK [36].
This over-reliance on target appearance is poorly suited to tracking similar small targets with high-altitude multi-drone systems. Despite the attention paid to HAMDMT tracking, little research has addressed multi-drone multi-target tracking of similar targets. Three-dimensional spatial consistency and topological mapping relationships (TSC-TMR) [37] fully exploits the three-dimensional spatial consistency among targets and proposes a highly consistent constraint-based association method. The Multi-matching Identity Authentication network (MIA-Net) [21] employs the SIFT model (a traditional image-matching technique) to build a HAMDMT tracking model, yielding markedly better tracking precision than appearance-based methods. Additionally, some methods [38,39] utilize the topological structure among targets for association. While existing methods have demonstrated incremental improvements in tracking performance, inherent limitations remain. Methods based on three-dimensional spatial consistency can be susceptible to association errors caused by inaccuracies in camera pose measurement. Tracking methods based on SIFT may generate erroneous correlation matrices due to the inherent constraints of SIFT, which in turn degrades tracking performance. Some topology-based tracking methods also impose requirements on the number of targets involved.

2.3. Image Matching

When a drone flies at high altitude, the angle between the ground and the line connecting the camera to the center of the imaged area is generally 60 to 90 degrees: if the camera were tilted much further, targets captured from high altitude would undergo large deformation, which is not conducive to detecting and tracking targets within a fixed area. Because the onboard cameras are therefore never strongly tilted, the images captured by different drones are related by a topological (projective) relationship, which makes it possible to track small targets with image-matching methods.
Image-matching methods can be broadly classified into two categories. The first category encompasses traditional methods, including SIFT [40] and ORB [41], which represent the more established approaches in this field. The second category comprises deep learning-based methods, including LightGlue and DeDoDe [42]. Traditional image-matching methods exhibit rotational invariance during matching; however, they struggle to simultaneously achieve the accuracy and speed required for specific tasks. The advent of deep learning has led to a surge of interest in learning-based image matching. These methods offer a significant advantage over traditional approaches, as they adaptively learn features from the data without manually designed feature descriptors. Furthermore, deep learning-based image matching generalizes well under occlusion, viewpoint change, and illumination change, which closely matches the real-world conditions encountered by drones. Consequently, we employ a deep learning-based image-matching method, LightGlue, to implement cross-drone tracking. LightGlue determines when to cease inference by adapting to the complexity of each image pair, and it eliminates unmatched points at an early stage, thereby concentrating on the common view area. Because only part of the scene is shared between the images captured by different drones, LightGlue can disregard the non-common view areas and establish a topological relationship that is well suited to the targets within the common view area.
When the deep learning image-matching method LightGlue is applied to a multi-drone multi-target tracking task, an inherent limitation becomes evident: LightGlue lacks rotational invariance and tolerates at most about 60 degrees of in-plane rotation, which is insufficient for target tracking between multiple drones. In this study, we use the number of matchable points between two images as the criterion for applying LightGlue, despite its lack of rotational invariance, to cross-scene multi-target tracking. LightGlue is applied to multi-drone multi-target tracking as follows: the image captured by Drone 2 is fixed, the image from Drone 1 is rotated by 0, 90, 180, and 270 degrees, and each rotated image is matched against the Drone 2 image with LightGlue in turn. Because the two drones share a common view area, the number of feature points matched between the two images differs across rotation angles according to the properties of LightGlue. The angle that yields the greatest number of matched feature points is selected as the optimal rotation angle. As illustrated in Figure 2, the 270-degree rotation produces the highest number of matching points in this example, so 270 degrees is the optimal rotation angle.
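A minimal sketch of this rotation-angle selection is given below. It assumes a hypothetical helper `count_matches(img_a, img_b)` that wraps the LightGlue matcher and returns the number of matched keypoints; only the selection logic is taken from the description above.

```python
import numpy as np

def select_rotation_angle(img_drone1, img_drone2, count_matches):
    """Pick the rotation of Drone 1's image that yields the most matches.

    `count_matches(img_a, img_b)` is a hypothetical helper that wraps the
    LightGlue matcher and returns the number of matched keypoints.
    """
    best_angle, best_count = 0, -1
    for k, angle in enumerate((0, 90, 180, 270)):
        rotated = np.rot90(img_drone1, k=k)     # rotate by k * 90 degrees
        n = count_matches(rotated, img_drone2)  # Drone 2's image stays fixed
        if n > best_count:
            best_angle, best_count = angle, n
    return best_angle, best_count
```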

3. Preliminaries

This section briefly outlines the preparations before tracking. In particular, Section 3.1 delineates the scenarios, Section 3.2 elucidates the target detectors employed, and Section 3.3 expounds upon the deployment of LightGlue, which lacks rotational invariance, for target tracking purposes.

3.1. Problem Definition

Drones $C_1$ and $C_2$ are deployed at high altitudes to track ground targets. Multi-drone cooperative tracking indicates that some targets appear in the field of view of both drones simultaneously. If we assume that the capture area of drone $C_1$ is $H$ and that the area captured by drone $C_2$ is $I$, then the intersection of $H$ and $I$ represents the common area of the two drones, which we also define as the common view area. In cross-drone tracking, we only consider the target association across multiple drones within the common field of view, while for targets outside the common view, we only need to consider the tracking relationship within a single drone.
The viewpoint with the fewest targets within the common view area is designated as the mapping viewpoint. The remaining viewpoints are classified as matching viewpoints. The targets mapped by the mapping viewpoints are then matched with the targets that were originally present in the matching viewpoints.

3.2. Centernet

The STCA model employs CenterNet as the target detector and DLA34 as the backbone network for feature extraction. CenterNet is a single-stage object detection framework that recognizes objects through their center points. In contrast to conventional detection methodologies that depend on bounding box proposals, CenterNet directly predicts a heatmap of object centers and estimates the size and offset of each object. This streamlined approach yields fast and precise results and is particularly well suited to target tracking tasks. The detection results of Drone 1 and Drone 2 are represented by the set $D = \{D_M^{1,t}, D_N^{2,t}\}$, where $M$ and $N$ are the numbers of targets present in the images returned by Drone 1 and Drone 2 at frame $t$, respectively.
During training, CenterNet employs a composite loss function identical to that described in the original article, given below, where $\lambda_{size} = 0.1$ and $\lambda_{off} = 1.0$ are hyperparameters that weight the different loss terms, $L_{heatmap}$ denotes the heatmap center-point loss with $\hat{Y}_{xyc}$ the predicted heatmap value, $L_{off}$ denotes the center-point offset loss, and $L_{size}$ denotes the target size loss:
$$L_{heatmap} = -\frac{1}{N}\sum_{x,y,c}\begin{cases}\left(1-\hat{Y}_{xyc}\right)^{\alpha}\log\left(\hat{Y}_{xyc}\right), & \text{if } Y_{xyc}=1\\ \left(1-Y_{xyc}\right)^{\beta}\left(\hat{Y}_{xyc}\right)^{\alpha}\log\left(1-\hat{Y}_{xyc}\right), & \text{otherwise}\end{cases}$$

$$L_{size} = \frac{1}{N}\sum_{i}\left|\hat{b}_{i}-b_{i}\right|$$

$$L_{off} = \frac{1}{N}\sum_{i}\left|\hat{o}_{i}-o_{i}\right|$$

$$L_{det} = L_{heatmap} + \lambda_{size}L_{size} + \lambda_{off}L_{off}$$
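For concreteness, a hedged PyTorch sketch of this detection loss is shown below. It assumes dense $(B, C, H, W)$ heatmaps and size/offset predictions already gathered at the $N$ positive center locations; the focal-loss exponents $\alpha = 2$ and $\beta = 4$ follow the defaults of the original CenterNet paper rather than values stated here.

```python
import torch

def centernet_detection_loss(pred_heat, gt_heat, pred_size, gt_size,
                             pred_off, gt_off, alpha=2.0, beta=4.0,
                             lambda_size=0.1, lambda_off=1.0, eps=1e-6):
    """L_det = L_heatmap + lambda_size * L_size + lambda_off * L_off.

    pred_heat, gt_heat: (B, C, H, W) heatmaps with values in (0, 1);
    pred_size/gt_size and pred_off/gt_off: (N, 2) tensors gathered at the
    N positive center locations.
    """
    pred_heat = pred_heat.clamp(eps, 1.0 - eps)
    pos = gt_heat.eq(1.0).float()          # positive center locations
    neg = 1.0 - pos
    n_pos = pos.sum().clamp(min=1.0)

    # Penalty-reduced focal loss on the center heatmap (first equation above).
    pos_loss = -((1.0 - pred_heat) ** alpha) * torch.log(pred_heat) * pos
    neg_loss = -((1.0 - gt_heat) ** beta) * (pred_heat ** alpha) \
               * torch.log(1.0 - pred_heat) * neg
    l_heat = (pos_loss.sum() + neg_loss.sum()) / n_pos

    # L1 losses for object size and center offset, averaged over positives.
    l_size = torch.abs(pred_size - gt_size).sum() / n_pos
    l_off = torch.abs(pred_off - gt_off).sum() / n_pos

    return l_heat + lambda_size * l_size + lambda_off * l_off
```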

3.3. T-LightGlue

To ensure the robustness of the computed transformation matrix, we retain the feature point pairs with the highest confidence scores at the optimal matching angle. These points are then used to compute the transformation matrix $T$ between the two drones' images with the RANSAC (Random Sample Consensus) algorithm. However, computing keypoints four times with LightGlue for every input frame affects the tracking speed to some extent.
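A minimal sketch of this estimation step is given below; it assumes OpenCV is used for the RANSAC fit (the text specifies RANSAC but not a particular implementation), and the reprojection threshold is illustrative.

```python
import cv2
import numpy as np

def estimate_transform(pts_drone1, pts_drone2, ransac_thresh=5.0):
    """Fit the transformation matrix T between the two drone images.

    pts_drone1, pts_drone2: (K, 2) arrays of the highest-confidence matched
    keypoint coordinates kept at the optimal rotation angle.
    """
    src = np.asarray(pts_drone1, dtype=np.float32).reshape(-1, 1, 2)
    dst = np.asarray(pts_drone2, dtype=np.float32).reshape(-1, 1, 2)
    # RANSAC rejects outlier correspondences before fitting the homography.
    T, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, ransac_thresh)
    return T, inlier_mask
```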
For this purpose, we designed the T-LightGlue model based on the characteristics of multi-drone multi-target tracking, as shown in Figure 3. In the first frame, T-LightGlue rotates the image from Drone 1 by 0, 90, 180, and 270 degrees and matches each rotated image against the image from Drone 2 with the same LightGlue model. Because the number of keypoints differs greatly across angles, we save the rotation angle that yields the most keypoints, together with that keypoint count, and pass both to the next frame. In the next frame, we compute the number of keypoints between the two drone images only at the saved optimal rotation angle and compare it with 0.8 times the count saved from the previous frame. If the value exceeds this threshold, recalculating the keypoints for the remaining three angles is unnecessary. Conversely, if the value falls below the threshold, the keypoints are recalculated only for the two adjacent angles, because the drone can hardly undergo a large rotational change between adjacent frames; the opposite, non-adjacent angle therefore need not be evaluated.
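The per-frame decision rule can be summarized by the following sketch, using the same hypothetical `count_matches` helper as before; the 0.8 threshold is the one stated above, and the control flow is a reading of the description rather than code from the paper.

```python
import numpy as np

ANGLES = (0, 90, 180, 270)

def t_lightglue_angle(img1, img2, count_matches, prev_angle=None, prev_count=None):
    """Return (best_angle, keypoint_count) for the current frame.

    `prev_angle`/`prev_count` carry the result saved from the previous frame.
    """
    if prev_angle is None:
        candidates = ANGLES                       # first frame: try all four rotations
    else:
        n = count_matches(np.rot90(img1, ANGLES.index(prev_angle)), img2)
        if n >= 0.8 * prev_count:                 # still consistent with the last frame
            return prev_angle, n
        i = ANGLES.index(prev_angle)              # otherwise re-check only adjacent angles
        candidates = (prev_angle, ANGLES[(i - 1) % 4], ANGLES[(i + 1) % 4])
    counts = {a: count_matches(np.rot90(img1, ANGLES.index(a)), img2) for a in candidates}
    best = max(counts, key=counts.get)
    return best, counts[best]
```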

4. Methods

In this section, as shown in Figure 4, we introduce the STCA model and divide multi-drone multi-target tracking into two sub-problems: single-drone and cross-drone matching. The Single-Drone Tracking and Cross-Drone Association sub-problems obtain the corresponding matrices, respectively, and establish global associations directly from the current frame.

4.1. Cross-Drone Association

To address the cross-drone association errors caused by the combined influence of similar small targets and large deformation in appearance-based cross-drone tracking methods, we use the T-LightGlue algorithm to calculate the transformation matrix between the images from the two drones.
After calculating the transformation matrix, based on the actual conditions of the high-altitude drone, as shown in Figure 5, we designed the Common View Area Model to prevent targets outside the common view area from entering the subsequent matching algorithm. The transformation matrix is employed to map the four vertices of the input image, thereby facilitating the construction of the Common View Area Model. The calculated points are connected in sequence to form a quadrilateral. Subsequently, the ray casting method is employed to extend a ray horizontally to the right from the center point of the target. The number of intersections between this ray and the edges of the polygon determines whether the point in question is within the common view area. If the number of intersections is odd, the target is deemed to be within the common view area. Conversely, if the number of intersections is even, the target is considered to be outside the common view area.
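A minimal sketch of this ray casting test, applied to a target centroid and the quadrilateral formed by the four mapped image vertices, is given below; it is a standard point-in-polygon routine written to match the description above, not code taken from the paper.

```python
def inside_common_view(point, polygon):
    """Ray casting test: cast a horizontal ray to the right from `point` and
    count crossings with the polygon edges; an odd count means "inside".

    `polygon` is the ordered list of the four mapped image vertices.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                              # edge spans the ray's row
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)   # x where the edge meets the row
            if x < x_cross:
                inside = not inside                           # toggle on each crossing
    return inside
```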
A common challenge in multi-drone multi-target tracking is that a target may be fully detected within the field of view of one drone while its counterpart in the other drone remains undetected owing to non-maximum suppression (NMS) in target detection, low target confidence, or other factors. This leads to errors when traditional association algorithms commonly used in target tracking, such as the Hungarian and K-Nearest Neighbors algorithms, are applied to high-altitude multi-drone scenarios, as shown in Figure 6. The maximum-weight matching found by the Hungarian algorithm is globally optimal for the overall objective but does not guarantee local optimality for each target in Cross-Drone Association; after the targets of the mapping viewpoint are projected, the Hungarian algorithm may therefore impose wrong associations on individual targets in pursuit of the overall optimum. The K-Nearest Neighbors algorithm, in turn, seeks the closest objects within a given set, but the inherent ambiguity of the matching objective makes it difficult to obtain a unique solution.
To resolve these issues, the Local-Matching Model is designed to compute locally optimal matches and to guarantee unique matching solutions for cross-drone targets, based on the principle of local optimality. As with the Hungarian and K-Nearest Neighbors algorithms, local matching uses distance as the matching criterion. The detection box of each target provided by CenterNet is extracted, the coordinates of its center point are calculated, and the center point is used to represent the entire target. Using the transformation matrix provided by T-LightGlue, we map these points to their new positions; a mapped point and the closest point from the other viewpoint are paired to form matching points. The Local-Matching Model thus yields the cross-drone target association matrix $M^{Cross}$ for the current frame.
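A hedged sketch of this centroid mapping and mutual-nearest pairing (the mutual-nearest criterion is the one stated in the abstract) is shown below; the gating threshold `max_dist` is an illustrative assumption, as the text does not specify a distance cutoff.

```python
import numpy as np

def local_matching(centers_map_view, centers_match_view, T, max_dist=50.0):
    """Mutual-nearest-neighbour pairing of mapped target centroids.

    `max_dist` is an assumed pixel gate, not a value from the paper.
    """
    centers_map_view = np.asarray(centers_map_view, dtype=np.float64)
    centers_match_view = np.asarray(centers_match_view, dtype=np.float64)

    # Map centroids from the mapping viewpoint into the matching viewpoint with T.
    pts = np.hstack([centers_map_view, np.ones((len(centers_map_view), 1))])
    mapped = (T @ pts.T).T
    mapped = mapped[:, :2] / mapped[:, 2:3]          # homogeneous -> Cartesian

    dists = np.linalg.norm(mapped[:, None, :] - centers_match_view[None, :, :], axis=2)
    pairs = []
    for i in range(dists.shape[0]):
        j = int(np.argmin(dists[i]))                 # nearest target in the other view
        if int(np.argmin(dists[:, j])) == i and dists[i, j] < max_dist:
            pairs.append((i, j))                     # keep only mutually nearest pairs
    return pairs                                     # unmatched targets stay unassociated
```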

4.2. Single-Drone Tracking

Drawing inspiration from GMT [35] for Single-Drone Tracking, CenterNet [27] is used as the object detector, with DLA34 as the backbone network for feature extraction. $B^t = \{b_1^t, b_2^t, \ldots, b_{N_t}^t\}$ represents the detection boxes of all high-confidence objects in frame $t$, and $N_t$ is the number of objects detected in frame $t$. Here, $b_i = (x_1^i, y_1^i, x_2^i, y_2^i)$ denotes the top-left and bottom-right corners of the $i$-th target.
Given that target appearance and position remain relatively stable between consecutive frames in Single-Drone Tracking, and that positional data can disambiguate targets with similar appearance, we combine appearance and positional information to construct the feature vector. Once the targets within frame $t$ have been identified by the detection model, the Re-ID encoder generates an appearance feature vector $f_t^{ReID} = \{f_{1t}^r, f_{2t}^r, \ldots, f_{N_t t}^r\}$, where $f_{it}^r \in \mathbb{R}^{D/2}$. The position encoder generates the position feature vector $f_t^{Pos} = \{f_{1t}^p, f_{2t}^p, \ldots, f_{N_t t}^p\}$, where $f_{it}^p \in \mathbb{R}^{D/2}$. The appearance and position feature vectors of the same target are concatenated to obtain the feature vector $F_t = \{f_{1t}, f_{2t}, \ldots, f_{N_t t}\}$ for the current frame. $F = F_{t-1} \cup F_t \in \mathbb{R}^{N \times D}$ serves as the input for the Adjacent Frame Association Model, where $N$ is the total number of targets in frames $t$ and $t-1$.
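A minimal sketch of this feature construction is given below; the MLP form of the position encoder is an assumption, since the text does not specify its architecture, and only the concatenation step is taken directly from the description above.

```python
import torch
import torch.nn as nn

class PositionEncoder(nn.Module):
    """Assumed small MLP embedding box coordinates into D/2 dimensions."""
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, boxes):            # boxes: (N_t, 4) = (x1, y1, x2, y2)
        return self.mlp(boxes)

def build_joint_features(reid_feats, pos_feats):
    """Concatenate appearance (N_t, D/2) and position (N_t, D/2) embeddings."""
    return torch.cat([reid_feats, pos_feats], dim=-1)   # joint features, (N_t, D)
```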
The Adjacent Frame Association Model employs the Transformer [43] architecture, as illustrated in Figure 7, comprising an encoder layer and a decoder layer. The model processes $F \in \mathbb{R}^{N \times D}$ and $Q \in \mathbb{R}^{N_t \times D}$, generating enhanced feature matrices $F_e \in \mathbb{R}^{N \times D}$ and $Q_e \in \mathbb{R}^{N_t \times D}$. A similarity matrix $M^{Single} = Q_e F_e^{\top} \in \mathbb{R}^{N_t \times N}$ between $F$ and $Q$ is derived through matrix multiplication. The score $M_{ij}^{Single}$ represents the degree of similarity between the $i$-th target in $Q$ and the $j$-th target in $F$; $M_i^{Single} = 0$ indicates no association.
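The following PyTorch sketch illustrates this step; the specific layer and head counts are assumptions, since only the encoder-decoder structure and the similarity computation are specified above.

```python
import torch
import torch.nn as nn

class AdjacentFrameAssociation(nn.Module):
    """Sketch of the encoder-decoder association head (Figure 7)."""
    def __init__(self, d_model=1024, nhead=8):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)

    def forward(self, F, Q):                      # F: (N, D), Q: (N_t, D)
        F_e = self.encoder(F.unsqueeze(0))        # enhanced features of both frames
        Q_e = self.decoder(Q.unsqueeze(0), F_e)   # decoder attends to F_e
        M_single = Q_e @ F_e.transpose(1, 2)      # similarity matrix, (1, N_t, N)
        return M_single.squeeze(0)
```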
In the training phase of Single-Drone Tracking, $F$ and $Q$ represent the targets of the current frame and the preceding frame, respectively. The degree of overlap between bounding boxes is measured with IoU. If the IoU between a bounding box detected in the current frame and a bounding box from a ground-truth trajectory exceeds a certain threshold, the target is considered to belong to that trajectory. A target $i$ in frame $t$ that corresponds to target $j$ in frame $t-1$ is designated as belonging to the same trajectory, denoted $p_i \rightarrow p_j$; conversely, a target $i$ in frame $t$ that does not belong to any trajectory is denoted $p_i = \varnothing$.
The loss for Single-Drone Tracking is divided into two parts, corresponding to $p_i = \varnothing$ and $p_i \rightarrow p_j$. The loss for targets that do not belong to any trajectory is $L_i$. The first part, with $p_i = \varnothing$, has the following loss:
$$Y_i = \begin{cases}1 & \text{if } p_i = \varnothing\\ 0 & \text{otherwise}\end{cases}$$

$$L_i = -\frac{1}{N}\sum_{i=1}^{N} Y_i \log\left(M_i^{Single}\right)$$

where $M_i^{Single}$ denotes the result of applying the softmax function to $X_i^{Single}$. The loss for objects $i$ in frame $t$ assigned to the same trajectory is $L_{ij}$. The second part, with $p_i \rightarrow p_j$, has the following loss:

$$Y_{ij} = \begin{cases}1 & \text{if } p_i \rightarrow p_j\\ 0 & \text{otherwise}\end{cases}$$

$$L_{ij} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} Y_{ij} \log\left(M_{ij}^{Single}\right)$$

where $M_{ij}^{Single}$ denotes the result of applying the softmax function to $X_{ij}^{Single}$. The total loss for Single-Drone Tracking is as follows:

$$L_{Single} = \sum_{c=1}^{2}\sum_{t=1}^{T}\left(L_{ij} + L_i\right)$$
The total loss of STCA consists of the object detection loss and the loss from Single-Drone Tracking. LightGlue is only applied during the inference phase. The loss function for CenterNet remains the same as in the original paper and is denoted $L_{det}$. The total loss is as follows:

$$L_{all} = L_{det} + L_{Single}$$

5. Experiments

5.1. Datasets

The MDMT dataset was utilized, which is currently the only known dataset captured by high-altitude drones. The test set consists of 14 sequences, all of which are used for evaluation. To assess performance, we employ MOTA and IDF1 for single-drone tracking and MDA for cross-drone target association.

5.2. Experimental Details

In the offline training phase, the images in the training set of the MDMT dataset are resized to 1280 × 720 pixels. Before entering the backbone network, the images undergo random scaling between 0.8 and 1.2. The optimizer is Adam, the dimension of the appearance features and position features is 1024, the initial learning rate is $5 \times 10^{-5}$, and the model is trained on the MDMT dataset for 30,000 iterations with a batch size of 32. Subsequently, the Single-Drone Tracking module is trained for 20,000 iterations using Adam with a batch size of 12 and an initial learning rate of $5 \times 10^{-5}$. The experiments were conducted on a server equipped with two 48 GB RTX A6000 GPUs, using Python 3.8 and PyTorch 1.9.1.
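A minimal sketch of this setup (optimizer and random-scaling augmentation) is shown below; the placeholder module only stands in for the full network, and everything beyond the learning rate and scale range stated above is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnf

# Placeholder module standing in for the full STCA network (illustrative only).
model = nn.Linear(1024, 1024)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)   # initial learning rate 5e-5

def random_scale(image, low=0.8, high=1.2):
    """Randomly rescale a (C, H, W) image tensor before it enters the backbone."""
    scale = float(torch.empty(1).uniform_(low, high))
    return nnf.interpolate(image.unsqueeze(0), scale_factor=scale,
                           mode="bilinear", align_corners=False).squeeze(0)
```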
For online tracking, the threshold of the CenterNet detector is set to 0.55, and the dimensions of the position features and appearance features in Single-Drone Tracking are both set to 1024. In Cross-Drone Association, the LightGlue model loads the pre-trained SuperPoint v1 weights: the feature point extractor is SuperPoint, and the number of extracted keypoints is set to 2048 to ensure good matching. The online tracking phase was run on a server equipped with a 48 GB RTX A6000 GPU.

5.3. Evaluation Metrics

We use MOTA and IDF1 to evaluate the Single-Drone Tracking performance of STCA and MDA to evaluate its Cross-Drone Association performance. These metrics are described below:
MOTA is a comprehensive metric used to evaluate the performance of multi-object tracking algorithms. It aims to quantify how well an algorithm can handle aspects such as object detection, tracking, and identity maintenance. The calculation formula is as follows:
$$\mathrm{MOTA} = 1 - \frac{\mathrm{FN} + \mathrm{FP} + \mathrm{IDs}}{\mathrm{GT}}$$
FN (False Negatives): False Negatives represent the number of actual targets (Ground Truth, GT) that were incorrectly identified as not detected by the algorithm. A higher FN count indicates poorer performance in object detection.
FP (False Positives): False Positives are the number of instances where the algorithm incorrectly identifies non-target areas as targets. An increase in FP suggests a high False Positive rate, resulting in unnecessary target tracking.
IDs (Identity Switches): Identity Switches refer to the instances when the algorithm mistakenly transfers the identity of one target to another during tracking. ID switches reduce the coherence and accuracy of tracking.
GT (Ground Truth): Ground Truth denotes the actual number of targets present in the test dataset. It serves as a baseline to evaluate the relative performance of the tracking algorithm.
IDF1 is a metric designed to assess the consistency of target identities maintained by a tracking algorithm. It combines precision and recall to provide a comprehensive performance measure. The calculation formula is as follows:
$$\mathrm{IDF1} = \frac{2 \cdot \mathrm{TP}}{2 \cdot \mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$$
TP (True Positives): True Positives indicate the number of targets that were correctly identified and tracked by the algorithm. TP is a crucial indicator of the detection capability of the algorithm, reflecting its accuracy.
FP (False Positives): As mentioned earlier, False Positives refer to the incorrect identifications made by the algorithm, affecting the final precision score.
FN (False Negatives): Similarly, False Negatives represent the number of targets that were not detected by the algorithm, influencing the recall rate.
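The two formulas translate directly into code; the count values in the example below are illustrative only and do not come from the paper's experiments.

```python
def mota(fn, fp, id_switches, gt):
    """MOTA = 1 - (FN + FP + IDs) / GT."""
    return 1.0 - (fn + fp + id_switches) / gt

def idf1(tp, fp, fn):
    """IDF1 = 2*TP / (2*TP + FP + FN)."""
    return 2.0 * tp / (2.0 * tp + fp + fn)

# Illustrative numbers only: 1200 GT boxes, 100 misses, 80 false positives, 20 ID switches.
print(round(mota(fn=100, fp=80, id_switches=20, gt=1200), 4))  # 0.8333
print(round(idf1(tp=1100, fp=80, fn=100), 4))                  # 0.9244
```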
To evaluate the efficacy of cross-drone tracking, we utilize mean deviation accuracy (MDA) as the evaluation metric, as defined in the equation below. Here, the indices $i$ and $j$ denote the $i$-th frame and the $j$-th acquisition camera device, respectively. $TA(jk,i)$ denotes the number of target ID pairs truthfully associated by the multi-device multi-target tracking algorithm, $GA(jk,i)$ denotes the number of all associated multi-device ID pairs in the ground truth, $FA(jk,i)$ denotes the number of multi-device ID pairs incorrectly associated by the algorithm, and $MA(jk,i)$ denotes the number of ID pairs that the algorithm fails to associate correctly although they appear in the ground truth. The term inside the sum measures the accuracy of associating the same targets between the $j$-th and $k$-th devices, and the effectiveness of the STCA method is evaluated with the MDA metric.
$$\mathrm{MDA} = \frac{1}{C_N^2 \cdot F}\sum_{j=1}^{N-1}\sum_{k=j+1}^{N}\sum_{i=1}^{F}\frac{TA(jk,i)}{GA(jk,i)+FA(jk,i)+MA(jk,i)}, \qquad C_N^2 = \frac{N!}{(N-2)! \cdot 2!}$$
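A hedged sketch of this computation is shown below; the per-pair, per-frame counts $(TA, GA, FA, MA)$ are assumed to be available from an association evaluator, and the dictionary layout is illustrative.

```python
from itertools import combinations
from math import comb

def mda(counts, num_devices, num_frames):
    """MDA as defined above.

    `counts[(j, k, i)]` is assumed to hold the tuple (TA, GA, FA, MA) for the
    device pair (j, k), j < k, in frame i.
    """
    total = 0.0
    for j, k in combinations(range(num_devices), 2):
        for i in range(num_frames):
            ta, ga, fa, ma = counts[(j, k, i)]
            denom = ga + fa + ma
            total += ta / denom if denom > 0 else 0.0
    return total / (comb(num_devices, 2) * num_frames)
```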

5.4. Comparisons with State-of-the-Art Algorithms

Initially, we evaluated the Single-Drone Tracking performance of the proposed method, as shown in Table 1. The cross-drone methods of the first two approaches are based on the Re-ID paradigm, where AutoAssign and Carafe are used as detectors, and Bytetrack serves as the tracker. Experimental results demonstrate that the proposed model achieves optimal performance regarding the MOTA and MDA metrics. The sub-optimal IDF1 score of the proposed model may result from its dual focus on Single-Drone Tracking and Cross-Drone Association, which could affect the performance of individual drones. Cross-Drone Association methods based on the Re-ID paradigm achieve MDA values of only 0.1830 and 0.1819, indicating severe limitations in their performance.
Secondly, we evaluated the speed of existing methods; thanks to T-LightGlue, the number of LightGlue runs in STCA is greatly reduced. FPS denotes the number of frames processed per second. From the experimental results, AutoAssign with Bytetrack runs the least efficiently at 1.02 FPS, while Carafe with Bytetrack runs at a comparable speed of approximately 1.12 FPS. MIA-Net shows a substantial improvement, achieving 1.74 FPS with a runtime of 6744 s. STCA achieves 4.29 FPS with a runtime of 2736 s, which is 2.55 FPS faster and 4008 s shorter than MIA-Net.
We further evaluated the Cross-Drone Association performance of STCA, as shown in Table 2. Since the MDA values of Re-ID-based Cross-Drone Association are relatively low, we do not perform per-sequence MDA analysis for the Re-ID paradigm. The existing methods suitable for HAMDMT tracking are the three variants of MIA-Net. MIA-Net(g) employs SIFT for global matching, using SIFT to provide a transformation matrix between images. MIA-Net(l) employs a localized matching method that uses SIFT in the initial frame or matches with a transformation matrix built from at least four confirmed matching targets. MIA-Net strikes a balance between global and local matching: when there are fewer than four confirmed matching targets, SIFT matching is employed; when there are four or more, a transformation matrix built from the four confirmed matching targets is used. The proposed model secured the best results across 13 sequences, achieving an average MDA of 0.5095, which is 0.0923 higher than the next best method, thereby validating the efficacy of STCA.
In the 71st sequence, the MDA value of the algorithm is lower than that of other methods. This is due to the randomized selection of mapping viewpoints, which depends on the viewpoint with fewer targets within the common view area of the two viewpoints. In this sequence, the number of targets within the common view area is relatively balanced, leading to frequent switching between mapping viewpoints and thereby affecting the MDA value to some extent.

5.5. Ablation Study

To validate the effectiveness of the proposed model, we conducted ablation experiments, as shown in Table 3. For comparison, we used the K-Nearest Neighbors algorithm and the Hungarian matching algorithm as baselines in the first two columns of Table 3, where we set K-Nearest Neighbors to match the nearest target.
We conducted experiments on the Local-Matching Model separately; its average MDA reached 0.4960, exceeding the performance of the Hungarian algorithm and the K-Nearest Neighbors algorithm on all sequences. This is because the MDMT dataset contains a large number of occlusions, which the Local-Matching Model alleviates, demonstrating its superiority.
The final column of Table 3, which depicts the integration of the Common View Area Model into the Local-Matching Model, shows an average MDA improvement of 0.0135. This outcome suggests that the Common View Area Model effectively prevents targets outside the common view area from entering the subsequent model. In sequence 52, the MDA value decreased slightly after the Common View Area Model was added, mainly because many targets lie at the edge of the common view area, so in some frames targets particularly close to the edge were labeled as being outside it. However, the effectiveness of the Common View Area Model was demonstrated on the remaining thirteen test sequences. The basis for delineating the common view area is the transformation matrix provided by T-LightGlue; thanks to T-LightGlue's focus on the common view area, the generated matrix is reasonably accurate. When there is no target in the common view area, we simplify multi-drone multi-target tracking to independent single-drone tracking, monitor in real time whether any target enters the common view area, and resume cross-drone association as soon as one does.
In addition, as shown in Table 4, we evaluate the contribution of the added location information in the Single-Drone Tracking task and the speedup provided by T-LightGlue in Cross-Drone Association. The first two rows show that after adding position information, MOTA increases by 0.98, IDF1 increases by 3.33, and MDA increases by 0.275, which proves the effectiveness of the position information, although it increases the runtime by 87 s.
The final two rows of Table 4 illustrate the contribution of T-LightGlue to the association process, where STCA denotes the configuration using T-LightGlue. With LightGlue, 23,532 image-matching operations were performed, with a total matching time of 6530 s, whereas T-LightGlue required only 6045 operations and 2736 s, at the cost of a slight decline in the various indicators. Compared with LightGlue, the number of matching operations was reduced by 17,487 and the matching time by 3794 s. The experimental results show that T-LightGlue significantly improves tracking speed with only a small decrease in each metric.

5.6. Visualization

To visualize the effect of STCA, we first visualize the MDMT dataset as shown in Figure 8. From the figure, it can be observed that the task presents a certain degree of difficulty due to the relatively small size of the targets involved, with the more similar ones circled using yellow dashed lines. It can be seen that STCA provides better tracking results in both cross-drone and single-drone small target scenarios, even when there are a large number of identical or similar targets. Next, to facilitate a more intuitive comparison with other methods, the MDA for each frame of sequence 34 was visualized, as shown in Figure 9.

6. Discussion

The prevailing multi-drone multi-target tracking algorithms rely extensively on target appearance across drone viewpoints. Nevertheless, in the context of high-altitude drones, where targets are frequently small and appearance information is less reliable, single-drone tracking can be effectively constrained with position-based cues. For high-altitude drones, relying solely on target appearance is suboptimal, as the same target observed from different viewpoints may exhibit significant variations in visual features. Moreover, targets observed by different drones lack consistent relationships in terms of position, movement trajectory, and other spatial-temporal attributes.
As discussed in Section 2.3, when a drone flies at high altitude the angle between the ground and the line connecting the camera to the center of the imaged area is generally 60 to 90 degrees, because a strongly tilted camera would cause large target deformation that is not conducive to detection and tracking within a fixed area. This geometry offers high-altitude drones a new way to associate the same target across different viewpoints: image matching. To the best of our knowledge, STCA is the first model to apply high-accuracy, deep learning-based image matching to high-altitude drone tracking. The findings indicate that STCA retains practical utility in high-altitude drone tracking, particularly in traffic monitoring, even in scenarios characterized by a high degree of visual resemblance among targets.

7. Conclusions

To address the challenging conditions faced by high-altitude drones, we propose a HAMDMT tracking model (STCA) for collaborative tracking of small targets. In Single-Drone Tracking, the model uses appearance and positional information to construct a joint feature vector and determine the tracking relationships of targets. In Cross-Drone Association, the model utilizes the LightGlue algorithm to establish target associations across multiple drones. The experimental results on the MDMT dataset demonstrate that STCA is more effective than existing state-of-the-art methods. Furthermore, the enhanced T-LightGlue accelerates tracking while incurring only a minor decline in tracking accuracy.
To improve the accuracy of Cross-Drone Association, we design a Common View Area Model and a Local-Matching Model for practical situations. The Common View Area Model restricts Cross-Drone Association to targets within the common view area of multiple drones; targets outside the common view area participate only in single-drone matching. In addition, we analyze the behavior of the traditional Hungarian and K-Nearest Neighbors algorithms on this task, consider their advantages, and design a Local-Matching Model that fully accounts for the fact that a target within the common view region may go undetected from a particular drone viewpoint, thereby improving cross-drone tracking accuracy. The method is specifically designed for high-altitude drone tracking of small targets. Even on the MDMT dataset, where most targets are vehicles and a large number of targets are therefore similar or even identical, STCA exhibits superior robustness. Extensive experiments show that STCA is more effective than existing state-of-the-art methods and confirm the validity of the Common View Area Model and the Local-Matching Model.

Author Contributions

All authors made significant contributions to this work. Conceptualization, H.F.; methodology, Y.Q., H.F., and Q.W.; investigation, Y.Q.; programming, Y.Q. and H.F.; data curation, H.F. and Y.Q.; validation, Y.Q.; writing—original draft preparation, Y.Q. and H.F.; writing—review and editing, Q.W., T.Z., and Y.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 62273339, 61991413, and U20A20200).

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

  1. Li, Y.J.; Weng, X.; Xu, Y.; Kitani, K.M. Visio-temporal attention for multi-camera multi-target association. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9834–9844. [Google Scholar] [CrossRef]
  2. Yu, C.; Feng, Z.; Wu, Z.; Wei, R.; Song, B.; Cao, C. HB-YOLO: An Improved YOLOv7 Algorithm for Dim-Object Tracking in Satellite Remote Sensing Videos. Remote Sens. 2023, 15, 3551. [Google Scholar] [CrossRef]
  3. Hong, Y.; Li, D.; Luo, S.; Chen, X.; Yang, Y.; Wang, M. An Improved End-to-End Multi-Target Tracking Method Based on Transformer Self-Attention. Remote Sens. 2022, 14, 6354. [Google Scholar] [CrossRef]
  4. Wang, H.; Jin, L.; He, Y.; Huo, Z.; Wang, G.; Sun, X. Detector–Tracker Integration Framework for Autonomous Vehicles Pedestrian Tracking. Remote Sens. 2023, 15, 2088. [Google Scholar] [CrossRef]
  5. Xue, Y.; Zhang, J.; Lin, Z.; Li, C.; Huo, B.; Zhang, Y. SiamCAF: Complementary Attention Fusion-Based Siamese Network for RGBT Tracking. Remote Sens. 2023, 15, 3252. [Google Scholar] [CrossRef]
  6. Ma, Z.; Wei, X.; Hong, X.; Gong, Y. Bayesian loss for crowd count estimation with point supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6142–6151. [Google Scholar] [CrossRef]
  7. Li, B.; Leroux, S.; Simoens, P. Decoupled appearance and motion learning for efficient anomaly detection in surveillance video. Comput. Vis. Image Understanding 2021, 210, 103249. [Google Scholar] [CrossRef]
  8. Li, D.; Wei, X.; Hong, X.; Gong, Y. Infrared-visible cross-modal person re-identification with an x modality. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 4610–4617. [Google Scholar] [CrossRef]
  9. Wang, S.; Jiang, F.; Zhang, B.; Ma, R.; Hao, Q. Development of UAV-based target tracking and recognition systems. IEEE Trans. Intell. Transp. Syst. 2019, 21, 3409–3422. [Google Scholar] [CrossRef]
  10. Wang, X.; Chen, X.; Wang, F.; Xu, C.; Tang, Y. Image Recovery and Object Detection Integrated Algorithms for Robots in Harsh Battlefield Environments. In Proceedings of the Intelligent Robotics and Applications, Singapore, 16–18 December 2023; pp. 575–585. [Google Scholar] [CrossRef]
  11. Tang, Z.; Naphade, M.; Liu, M.Y.; Yang, X.; Birchfield, S.; Wang, S.; Kumar, R.; Anastasiu, D.; Hwang, J.N. CityFlow: A City-Scale Benchmark for Multi-Target Multi-Camera Vehicle Tracking and Re-Identification. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8789–8798. [Google Scholar] [CrossRef]
  12. Shim, K.; Yoon, S.; Ko, K.; Kim, C. Multi-Target Multi-Camera Vehicle Tracking for City-Scale Traffic Management. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 20–25 June 2021; pp. 4188–4195. [Google Scholar] [CrossRef]
  13. Hsu, H.M.; Cai, J.; Wang, Y.; Hwang, J.N.; Kim, K.J. Multi-target multi-camera tracking of vehicles using metadata-aided re-id and trajectory-based camera link model. IEEE Trans. Image Process. 2021, 30, 5198–5210. [Google Scholar] [CrossRef]
  14. Hong, T.; Liang, H.; Yang, Q.; Fang, L.; Kadoch, M.; Cheriet, M. A Real-Time Tracking Algorithm for Multi-Target UAV Based on Deep Learning. Remote Sens. 2023, 15, 2. [Google Scholar] [CrossRef]
  15. Quach, K.G.; Nguyen, P.; Le, H.; Truong, T.D.; Duong, C.N.; Tran, M.T.; Luu, K. DyGLIP: A Dynamic Graph Model with Link Prediction for Accurate Multi-Camera Multiple Object Tracking. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13779–13788. [Google Scholar] [CrossRef]
  16. Wei, H.; Wan, G.; Ji, S. ParallelTracker: A Transformer Based Object Tracker for UAV Videos. Remote Sens. 2023, 15, 2544. [Google Scholar] [CrossRef]
  17. Wang, X.; Chen, X.; Ren, W.; Han, Z.; Fan, H.; Tang, Y.; Liu, L. Compensation Atmospheric Scattering Model and Two-Branch Network for Single Image Dehazing. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 2880–2896. [Google Scholar] [CrossRef]
  18. Zhuo, L.; Liu, B.; Zhang, H.; Zhang, S.; Li, J. MultiRPN-DIDNet: Multiple RPNs and Distance-IoU Discriminative Network for Real-Time UAV Target Tracking. Remote Sens. 2021, 13, 2772. [Google Scholar] [CrossRef]
  19. Ma, J.; Liu, D.; Qin, S.; Jia, G.; Zhang, J.; Xu, Z. An Asymmetric Feature Enhancement Network for Multiple Object Tracking of Unmanned Aerial Vehicle. Remote Sens. 2024, 16, 70. [Google Scholar] [CrossRef]
  20. Lindenberger, P.; Sarlin, P.E.; Pollefeys, M. LightGlue: Local Feature Matching at Light Speed. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 17581–17592. [Google Scholar] [CrossRef]
  21. Liu, Z.; Shang, Y.; Li, T.; Chen, G.; Wang, Y.; Hu, Q.; Zhu, P. Robust multi-drone multi-target tracking to resolve target occlusion: A benchmark. IEEE Trans. Multimedia 2023, 25, 1462–1476. [Google Scholar] [CrossRef]
  22. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  23. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar] [CrossRef]
  24. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar] [CrossRef]
  25. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar] [CrossRef]
  26. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  27. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6568–6577. [Google Scholar] [CrossRef]
  28. Cheng, C.C.; Qiu, M.X.; Chiang, C.K.; Lai, S.H. ReST: A Reconfigurable Spatial-Temporal Graph Model for Multi-Camera Multi-Object Tracking. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 10017–10026. [Google Scholar] [CrossRef]
  29. Hao, S.; Liu, P.; Zhan, Y.; Jin, K.; Liu, Z.; Song, M.; Hwang, J.N.; Wang, G. Divotrack: A novel dataset and baseline method for cross-view multi-object tracking in diverse open scenes. Proc. Int. J. Comput. Vis. 2024, 132, 1075–1090. [Google Scholar] [CrossRef]
  30. Specker, A.; Stadler, D.; Florin, L.; Beyerer, J. An Occlusion-aware Multi-target Multi-camera Tracking System. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 20–25 June 2021; pp. 4168–4177. [Google Scholar] [CrossRef]
  31. Hsu, H.M.; Wang, Y.; Hwang, J.N. Traffic-Aware Multi-Camera Tracking of Vehicles Based on ReID and Camera Link Model. In Proceedings of the 28th ACM International Conference on Multimedia, New York, NY, USA, 12–16 October 2020; pp. 964–972. [Google Scholar] [CrossRef]
  32. Jiang, N.; Bai, S.; Xu, Y.; Xing, C.; Zhou, Z.; Wu, W. Online Inter-Camera Trajectory Association Exploiting Person Re-Identification and Camera Topology. In Proceedings of the 26th ACM International Conference on Multimedia, New York, NY, USA, 28 October–1 November 2018; pp. 1457–1465. [Google Scholar] [CrossRef]
  33. Ristani, E.; Tomasi, C. Features for Multi-target Multi-camera Tracking and Re-identification. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6036–6046. [Google Scholar] [CrossRef]
  34. Gan, Y.; Han, R.; Yin, L.; Feng, W.; Wang, S. Self-supervised Multi-view Multi-Human Association and Tracking. In Proceedings of the 29th ACM International Conference on Multimedia, New York, NY, USA, 20–24 October 2021; pp. 282–290. [Google Scholar] [CrossRef]
  35. Fan, H.; Zhao, T.; Wang, Q.; Fan, B.; Tang, Y.; Liu, L. GMT: A Robust Global Association Model for Multi-Target Multi-Camera Tracking. arXiv 2024, arXiv:2407.01007. [Google Scholar]
  36. Chavdarova, T.; Baqué, P.; Bouquet, S.; Maksai, A.; Jose, C.; Bagautdinov, T.; Lettry, L.; Fua, P.; Van Gool, L.; Fleuret, F. WILDTRACK: A Multi-camera HD Dataset for Dense Unscripted Pedestrian Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5030–5039. [Google Scholar] [CrossRef]
  37. Pan, T.; Dong, H.; Deng, B.; Gui, J.; Zhao, B. Robust Cross-Drone Multi-Target Association Using 3D Spatial Consistency. IEEE Signal Process. Lett. 2024, 31, 71–75. [Google Scholar] [CrossRef]
  38. Lua, C.G.; Lau, Y.H.; Heimsch, D.; Srigrarom, S. Multi-Target Multi-Camera Aerial Re-identification by Convex Hull Topology. In Proceedings of the 2022 Sensor Data Fusion: Trends, Solutions, Applications (SDF), Bonn, Germany, 12–14 October 2022; pp. 1–6. [Google Scholar] [CrossRef]
  39. Li, X.; Wu, L.; Niu, Y.; Jia, S.; Lin, B. Topological similarity-based multi-target correlation localization for aerial-ground systems. Guid. Navig. Control. 2021, 1, 2150016. [Google Scholar] [CrossRef]
40. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
41. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar] [CrossRef]
  42. Edstedt, J.; Bökman, G.; Wadenbäck, M.; Felsberg, M. DeDoDe: Detect, Don’t Describe—Describe, Don’t Detect for Local Feature Matching. In Proceedings of the 2024 International Conference on 3D Vision (3DV), Davos, Switzerland, 18–21 March 2024; pp. 148–157. [Google Scholar] [CrossRef]
  43. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
44. Zhu, B.; Wang, J.; Jiang, Z.; Zong, F.; Liu, S.; Li, Z.; Sun, J. AutoAssign: Differentiable Label Assignment for Dense Object Detection. arXiv 2020, arXiv:2007.03496. [Google Scholar]
  45. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 1–21. [Google Scholar] [CrossRef]
  46. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-Aware ReAssembly of FEatures. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016. [Google Scholar] [CrossRef]
Figure 1. In scenarios where targets appear similar within a single-drone view and exhibit deformation across multiple drones, Re-ID methods struggle to correctly assign IDs to targets. Single-Drone Tracking is used to differentiate similar targets within a single drone, while Cross-Drone Association is employed to accurately associate each target across multiple drones.
Figure 2. Principles of LightGlue on the MDMT dataset.
Figure 3. Proposed T-LightGlue framework.
Figure 4. Proposed STCA framework. Single-Drone Tracking requires training, while Cross-Drone Association is used only during the inference phase. During inference, Single-Drone Tracking provides tracking information between consecutive frames. Cross-Drone Association is then used to establish target associations within the current frame, directly generating all tracking relationships in that frame.
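The inference flow in the Figure 4 caption can be summarized in a few lines. The sketch below is a minimal illustration only: single_drone_update and cross_drone_associate are hypothetical placeholder functions standing in for the Single-Drone Tracking and Cross-Drone Association modules, and the toy detections are invented for the example; none of this is the authors' implementation.

def single_drone_update(prev_tracks, detections):
    # Placeholder tracker: assigns IDs by detection index so the example runs end to end
    # (a real single-drone tracker would use prev_tracks to link IDs across frames).
    return {i: det for i, det in enumerate(detections)}

def cross_drone_associate(tracks_a, tracks_b):
    # Placeholder association: pairs IDs present in both views within the current frame.
    return [(i, i) for i in tracks_a if i in tracks_b]

def stca_inference(stream_a, stream_b):
    tracks_a, tracks_b, per_frame = {}, {}, []
    for det_a, det_b in zip(stream_a, stream_b):
        # Single-Drone Tracking: link detections between consecutive frames in each view.
        tracks_a = single_drone_update(tracks_a, det_a)
        tracks_b = single_drone_update(tracks_b, det_b)
        # Cross-Drone Association: match the two views within the current frame,
        # producing all tracking relationships for that frame.
        per_frame.append(cross_drone_associate(tracks_a, tracks_b))
    return per_frame

# Toy usage with two synchronized frames of single-point detections per drone:
print(stca_inference([[(10, 10)], [(12, 11)]], [[(200, 50)], [(202, 52)]]))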
Figure 5. Visualization of the Common View Area Model results, illustrating the target labeling rule under three conditions. The targets represented by Line 1 and Line 3 have an even number of intersection points with the boundary of the common field of view and are therefore labeled as outside the common area, whereas the target represented by Line 2 has an odd number of intersections and is labeled as inside. The common area is outlined in blue; the centroids of targets inside the area are shown in green, and those outside are shown in red.
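The labeling rule described in the Figure 5 caption is the standard even-odd (ray-casting) point-in-polygon test: a ray cast from a target centroid that crosses the boundary of the common view an odd number of times lies inside the area, otherwise outside. A minimal sketch of this rule follows; the polygon and centroid coordinates are invented for illustration and are not taken from the dataset.

def inside_common_area(centroid, polygon):
    # Even-odd rule: cast a horizontal ray to the right of the centroid and count
    # how many polygon edges it crosses; an odd count means the point is inside.
    x, y = centroid
    crossings = 0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through the centroid
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                crossings += 1
    return crossings % 2 == 1

# Illustrative common-view polygon and two centroids (labeled inside/outside as in Figure 5).
common_area = [(100, 80), (900, 60), (950, 620), (120, 650)]
for c in [(500, 300), (40, 40)]:
    print(c, "inside" if inside_common_area(c, common_area) else "outside")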
Figure 6. Schematic comparison of the Hungarian algorithm, K-Nearest Neighbors algorithm, and Local-Matching Model.
Figure 7. Proposed Adjacent Frame Association Model framework.
Figure 8. Visualization of STCA tracking results on the MDMT dataset.
Figure 9. The per-frame MDA analysis of sequence 34 demonstrates that the proposed model consistently maintains an advantage.
Table 1. Single-Drone Tracking performance.
Method | MOTA ↑ | IDF1 ↑ | MDA ↑ | FPS ↑ | Time ↓
Re-ID (AutoAssign [44] + Bytetrack [45]) | 42.27 | 55.29 | 0.1819 | 1.02 | 11,566 s
Re-ID (Carafe [46] + Bytetrack) | 47.89 | 56.44 | 0.1830 | 1.12 | 10,467 s
MIA-Net | 49.68 | 68.24 | 0.4172 | 1.74 | 6744 s
STCA | 50.53 | 59.27 | 0.5104 | 4.29 | 2736 s
Table 2. Cross-Drone Association performance (MDA per sequence).
Sequence | MIA-Net(l) | MIA-Net(g) | MIA-Net | STCA
26 | 0.40400 | 0.12515 | 0.43552 | 0.55808
31 | 0.33815 | 0.30541 | 0.33896 | 0.41771
34 | 0.38728 | 0.38659 | 0.38703 | 0.77514
48 | 0.53765 | 0.53518 | 0.58822 | 0.71156
52 | 0.61217 | 0.51815 | 0.61239 | 0.66466
55 | 0.31009 | 0.36450 | 0.34638 | 0.50004
56 | 0.49026 | 0.45156 | 0.49613 | 0.63470
57 | 0.82487 | 0.80000 | 0.83107 | 0.85503
59 | 0 | 0.04830 | 0.04490 | 0.04900
61 | 0.34016 | 0.33832 | 0.34118 | 0.42205
62 | 0.56124 | 0.51826 | 0.57184 | 0.67832
68 | 0.17331 | 0.05196 | 0.19562 | 0.22983
71 | 0.35241 | 0.33847 | 0.35235 | 0.24305
73 | 0.29016 | 0.24864 | 0.29889 | 0.40614
All | 0.4013 | 0.3593 | 0.4172 | 0.5104
Table 3. Ablation study of the proposed method on the MDMT dataset (MDA per sequence).
Sequence | K-Nearest Neighbors | Hungarian | Local-Matching | STCA
26 | 0.44655 | 0.49797 | 0.55215 | 0.55809
31 | 0.35238 | 0.29300 | 0.41215 | 0.41771
34 | 0.39859 | 0.52203 | 0.75854 | 0.77514
48 | 0.63772 | 0.62413 | 0.68639 | 0.71156
52 | 0.58071 | 0.60141 | 0.67348 | 0.66466
55 | 0.33905 | 0.31859 | 0.47265 | 0.500038
56 | 0.45160 | 0.33676 | 0.62377 | 0.63470
57 | 0.67221 | 0.71469 | 0.83382 | 0.855028
59 | 0.03001 | 0.04258 | 0.04610 | 0.04900
61 | 0.28270 | 0.30133 | 0.40755 | 0.42205
62 | 0.52447 | 0.54246 | 0.65054 | 0.67832
68 | 0.13347 | 0.13023 | 0.22184 | 0.22983
71 | 0.23521 | 0.17159 | 0.23275 | 0.243051
73 | 0.31843 | 0.29595 | 0.37303 | 0.40614
All | 0.3859 | 0.3852 | 0.4960 | 0.5104
Table 4. Experiments on the effectiveness of location information and T-LightGlue.
Method | MOTA ↑ | IDF1 ↑ | MDA ↑ | FPS ↑ | Time ↓ | Calculation Times ↓
STCA (Re-ID + LightGlue) | 49.67 | 56.08 | 0.4829 | 1.82 | 6443 s | 23,532
STCA (Re-ID + Position + LightGlue) | 50.65 | 59.41 | 0.5197 | 1.79 | 6530 s | 23,532
STCA | 50.53 | 59.27 | 0.5104 | 4.29 | 2736 s | 6045
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
