1. Introduction
In recent years, capturing videos from satellites has become a reality, with the launch of commercial satellites such as SkySat, ChangGuang, and Urthecast. Unlike conventional satellites, video satellites can capture consecutive images (videos) by staring at a designated area, thus providing critical information for dynamic monitoring [1,2,3].
Video satellites have opened up new areas for remote sensing applications, such as large-scale traffic surveillance, 3D reconstruction of buildings, and urban management. In traffic surveillance, extracting moving vehicles from satellite videos is one of the most critical tasks. Moving vehicle detection from ground-based cameras has been an active research topic for many years, and many state-of-the-art methods have been proposed, for example, Temporal Differencing [4], the Gaussian Mixture Model (GMM) [5], the Codebook method [6], and Visual Background Extraction (ViBE) [7]. However, these methods do not work well on satellite videos because of the following challenges [2,8]:
- (1) Tiny Size of Targets: In ground-based surveillance videos, targets are relatively large and can be easily distinguished from noise. In satellite videos, moving vehicles are usually composed of several pixels without distinctive colour and texture, making them difficult to distinguish from noise. Under such conditions, motion information is probably the most robust feature that can be used to identify moving vehicles.
- (2) Pseudo-Moving Objects: Due to the camera's ego-motion, the parallax of tall buildings causes many pseudo-moving objects, which significantly affect detection accuracy. False objects often appear on the edges of tall buildings. With few appearance features available, efficiently distinguishing between true and false targets is a crucial challenge.
- (3) Cluttered Background: Satellite videos cover expansive views and capture various ground objects, such as roads, buildings, water bodies, and vegetation. These objects form complex and cluttered backgrounds, making the tiny vehicles challenging to identify. As Ao et al. [2] aptly noted, the task is like looking for a needle in a haystack. Moreover, illumination changes in the video cause further difficulties, resulting in inconsistent recognition results.
Motion detection has been a hot research area for decades, and many methods for detecting objects from videos have been proposed. Despite the difference between satellite- and ground-based surveillance videos, some basic techniques still apply to detect objects from satellite-based videos.
Background subtraction for motion segmentation in static scenes is a commonly used technique to detect moving objects from videos [9]. This technique relies on subtracting the current image from a reference image (also called a background image), resulting in a difference image. The pixels where the difference is above a given threshold are classified as moving objects (i.e., foreground). The critical task is to create the background (reference) image, which can be solved by the median background model [10], the GMM [5], Kernel Density Estimation (KDE) [11], the ViBE model [7], etc. As the most popular and effective methods for motion detection, Background Subtraction Models (BSMs) also play an essential role in generating initial detection results from satellite videos. For instance, Kopsiaftis and Karantzalos [1] applied BSMs and mathematical morphology to estimate traffic density. Yang et al. [12] presented a novel saliency-based background model adapted to detect tiny vehicles from satellite videos. Chen [13] evaluated different background models and identified the best method for detecting vehicle candidates from satellite videos.
Besides the BSMs, the temporal differencing method, in which moving objects are identified from the difference between two or three consecutive frames, has been widely used to detect moving vehicles because of its efficiency. Although frame differencing can be used for real-time processing, it generally fails to obtain complete outlines of moving objects and produces ghost regions when objects move fast. Therefore, it is commonly combined or adapted with other techniques to overcome these drawbacks. More recently, Chen et al. [14] proposed a method for detecting moving objects, called adaptive separation, which is based on the idea of frame difference. Their method can be easily implemented and offers high performance. Likewise, Shi et al. [15] developed a normalised frame difference method, which is also a variant of frame differencing. Shu et al. [16] proposed a hybrid method to detect small vehicles from satellite videos by combining three-frame differencing with the Gaussian mixture model.
Witnessing the great success of convolutional neural networks (CNNs) in object detection, recent works have utilised CNNs to detect moving vehicles. For example, Pflugfelder et al. [17] trained and fine-tuned a FoveaNet for detecting vehicles from satellite videos. Chen et al. [13] proposed a lightweight CNN (LCNN) to reject false targets from preliminary detection results and thereby increase recognition accuracy. Zhang et al. [18] trained a weakly supervised deep convolutional neural network (DCNN) to improve detection accuracy, using pseudo labels extracted by extended low-rank and structured sparse decomposition.
Some methods exploit motion information to improve detection accuracy. Unlike moving objects in ground surveillance videos, moving vehicles in satellite videos lack texture and colour features, so motion information becomes essential. For instance, Xu et al. [19] used global scene motion compensation and a local dynamic updating method to remove false vehicles caused by the moving background. Likewise, Ao et al. [2] proposed a motion-based detection algorithm via noise modelling. Hu et al. [20] also utilised motion features in deep neural networks for accurate object detection and tracking. Feng et al. [21] proposed a deep-learning framework guided by spatial motion information. More recently, Pi et al. [22] integrated motion information from adjacent frames into an end-to-end neural network framework to detect tiny moving vehicles. Besides spatial information, Lei et al. [23] utilised multiple types of prior information to improve the accuracy of detecting tiny moving vehicles from satellite videos.
Besides the above methods, some studies deal with detecting moving objects using low-rank and sparse representation [24]. Inspired by low-rank decomposition, Zhang et al. [25] extended the decomposition formulation with bounded errors, called Extended LSD (E-LSD), which improves background subtraction and boosts detection precision over other methods. In addition, Zhang et al. [26] proposed online low-rank and structured sparse decomposition (O-LSD) to reduce the computational cost, addressing the processing delay in seeking rank minimisation.
Detecting moving vehicles from satellite videos has been studied for only about a decade, and the existing detection methods still have limitations. Methods based on BSMs often introduce noise that is difficult to distinguish from actual vehicles, while methods based on frame differencing are sensitive to noise. Although deep-learning-based approaches have achieved excellent performance, they require many samples and repetitive training, which is highly time-consuming. In addition, deep learning requires expensive computational resources, making it unsuitable for high-speed processing.
Due to the cluttered background and absence of conspicuous appearance features, moving vehicles are often mistaken for false targets caused by complex moving backgrounds. Although some algorithms allow us to adjust parameters to achieve higher detection precision or recall, generally increasing precision results in low recall or vice versa. The critical task is to eliminate the effects of background movement and environmental changes (e.g., illumination change). Overall, a low computational detection algorithm with high precision and recall is lacking, and further studies are needed.
Among the state-of-the-art techniques for detecting moving objects from satellite videos, BSMs are still the most widely used because they are practical. Nevertheless, we have noticed that some BSM-based approaches produce results with low precision but high recall. That is to say, they can identify almost all true objects, but also misidentify many false ones. The key difficulty is to distinguish these pseudo-moving vehicles from true ones. If we can remove false objects and keep true vehicles, both precision and recall can be guaranteed. To this end, this paper proposes a moving vehicle detection approach based on BSMs and tracklet analysis.
Our proposed method fully uses motion information, which is a potential cue for identifying actual moving vehicles. This study aims to distinguish between true and false vehicles using the features of their tracklets, which have not been studied in the existing literature. Object tracking tries to obtain an object’s trajectory, which refers to the entire path an object travels from its appearance to its disappearance. A tracklet is considered one of the fragments of the whole path and can be used to distinguish different moving objects.
To distinguish between these tracklets, we need to investigate the characteristics of different tracklets and design appropriate features accordingly. This paper has several contributions as follows: (1) an object matching algorithm considering the directional distance metrics is proposed to create reliable tracklets of moving objects; (2) a tracklet feature classification method is proposed to distinguish between true and false targets; and (3) a tracklet rasterisation method is proposed to create a confidence map that can be used to adaptively alter the parameters in tracklet classification and derive road regions that can be utilised to verify actual moving vehicles. The following sections cover the details of our procedure and experiments.
2. Method
The proposed method mainly consists of the following steps: (1) image filtering and enhancement, (2) background modelling, (3) object tracking, (4) tracklet feature extraction, and (5) tracklet classification, as summarised in Figure 1.
In the first step, we apply a Gaussian filter to the input videos, suppressing some of the random noise that often appears at object edges. This step removes part of the false objects from the object detection results. In the second step, a non-parametric background modelling method [27] is chosen to obtain initial vehicle recognition results with high recall but low precision. In the third step, we extract object tracklets using our newly developed tracking method, derived from the Hungarian Matching Algorithm (HMA) [28] and Optical Flow (OF) [29]; then, features are extracted for each tracklet and used to distinguish true vehicles from false targets. At the same time, a confidence map can be derived from vehicle tracklets and then used to adjust parameters in the tracklet classification. In the final step, the confidence map can also be used as a constraint to verify true moving vehicles, achieving both high precision and recall.
2.1. Video Filtering
The challenge of detecting moving vehicles from satellite videos lies in the target size and random distracting noise. We observed that noise often occurred at building edges and road curbs. Image smoothing is an essential technique for dealing with noise. Some commonly used filters are available for our task, such as the mean, median, and Gaussian. For the detection of such tiny objects, the filter should be able to be fine-tuned through its parameters. Among the three, the mean and median filters can be adjusted by only one parameter: the window size. Thus, the targets tend to be over-blurred even when a small window size is applied. In contrast, the Gaussian filter can adjust the smoothing strength by tuning two parameters: the window size (w) and the standard deviation (σ) of the Gaussian function. The Gaussian filter was utilised in our experiments. The appropriate parameters are determined by repetitive testing until the filter can remove much noise and keep targets unspoiled. Finally, the empirical parameters were set to w = 5 and σ = 1.
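To make the filtering step concrete, the following is a minimal sketch of the pre-processing, assuming OpenCV is used for video I/O (the input file name is hypothetical); it applies the empirical parameters w = 5 and σ = 1 reported above.

```python
import cv2

GAUSS_WINDOW = (5, 5)  # empirical window size w = 5
GAUSS_SIGMA = 1        # empirical standard deviation sigma = 1

cap = cv2.VideoCapture("satellite_video.avi")  # hypothetical input file
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Gaussian smoothing suppresses random noise at object edges
    # while keeping the tiny vehicle targets largely intact.
    smoothed = cv2.GaussianBlur(gray, GAUSS_WINDOW, GAUSS_SIGMA)
    # ... subsequent steps (background subtraction, tracking) use `smoothed`
cap.release()
```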
2.2. Background Subtraction Model
Satellite-based videos differ from ground-based videos. We observed that some popular background models, such as GMM [5] and ViBE [7], do not work well when used directly on satellite videos. In our study, we found that GMM generated too much noise, while ViBE tended to miss many true objects. Therefore, we generated initial detection results using a non-parametric modelling method [11] that can adapt quickly to changes in the scene and yield sensitive detection with low false alarm rates. This model applies a normal kernel function to estimate the density and can obtain a more accurate estimation. In our study, we adopted an improved version, proposed by Zivkovic and Heijden [27], and implemented it using OpenCV (an open-source package for computer vision). This model needs two parameters specified by the user, namely (1) the threshold (dT²) on the squared distance between the pixel and the sample, used to decide whether a pixel is close to a data sample; and (2) the history length (Lhis), which is the number of recent frames that affect the background model.
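As an illustration only, the sketch below shows how this non-parametric model can be invoked through OpenCV's KNN background subtractor; the parameter values are placeholders rather than the settings used in our experiments, and the mapping of dT² and Lhis onto dist2Threshold and history reflects our reading of the OpenCV interface.

```python
import cv2

# dT^2  -> dist2Threshold (squared distance deciding if a pixel matches a sample)
# Lhis  -> history        (number of recent frames that affect the model)
bsm = cv2.createBackgroundSubtractorKNN(history=200, dist2Threshold=400.0,
                                        detectShadows=False)

def detect_candidates(smoothed_frame):
    """Return a foreground mask of potential moving vehicles."""
    fg_mask = bsm.apply(smoothed_frame)
    # Optional small morphological opening removes isolated noisy pixels.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    return cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN, kernel)
```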
2.3. Tracklet Extraction
When detecting large objects from videos, noise is much smaller than the target objects and can be filtered by specifying a threshold value. However, in satellite videos, moving vehicles appear at about the same size as the noise, making it extremely challenging to remove false targets from the initial detection results. Two major types of false targets exist, namely random noise and pseudo-moving objects, the latter caused by the motion parallax of building roofs. We have noticed that a vehicle moves at a constant speed and in a steady direction, following regular trajectories, whereas random noise appears at random positions and therefore cannot form smooth trajectories. Although pseudo-moving objects also form trajectories, their trajectories are often short and erratic. Based on these observations, our method intends to discriminate true targets from false ones by taking full advantage of the features of moving objects' trajectories.
Object tracking techniques can be utilised to obtain trajectories. Unlike object tracking, however, our goal is to detect moving vehicles rather than to continuously track each object. Therefore, we do not need the entire trajectory of a moving object, but only its tracklets, as explained in the introduction. In object tracking, an object may be lost for a while and then tracked again later; in this scenario, we can only obtain separate tracklets rather than a complete trajectory. Since our study aims to detect the positions of true vehicles instead of recovering whole trajectories, the remainder of this paper only discusses the tracklet tracking method.
The background modelling mentioned above allows us to detect potential moving vehicles, which consist of true and false targets. To track their tracklets, we need to associate each object in the current frame with its corresponding identification in the next frame. This task is also known as data association in Multiple Object Tracking (MOT) [30]. One of the techniques for data association is the Hungarian Matching Algorithm (HMA) [28], which requires high-quality detection results to obtain satisfactory tracking results. However, some vehicles cannot be detected in specific frames, due to the brightness of targets or motion blur, resulting in incoherent detection results. Missed detections degrade the performance of HMA. To compensate, another technique, the Lucas-Kanade Optical Flow (LKOF) [31], is applied to track vehicles when HMA fails. Combining the two techniques makes our tracklet tracking algorithm reliable.
2.3.1. The Hungarian Matching Algorithm—HMA
Let T1, T2, T3, …, Tn denote the targets on the k-th frame, and D1, D2, D3, …, Dm denote the detected objects on the (k + 1)-th frame. To track Ti, we must find its match Dj from the next frame. If Dj cannot be found, this indicates that the tracking of Ti is interrupted, or Ti has disappeared from the image. If a detection Dj cannot be matched to any target, it is considered a new object. To match a target Ti to its corresponding detection, we need to measure the distances between the target Ti and all the detections. Given n targets and m detections, a distance matrix Cnm (as shown in Figure 2) is generated based on the distance measurement. Each element Sij measures the distance between the target Ti and the detection Dj. Through HMA, we can find the globally optimal solution to this problem. That is to say, we can assign all targets to their detections, guaranteeing that the sum of distances is minimised.
The quality of the matching algorithm depends on the distance measurement. To ensure the algorithm's reliability, we propose a measurement that considers both spatial and directional distances. As shown in Figure 3, the two triangles represent two targets, T1 and T2, and the two circles represent the two corresponding detections, D1 and D2. When matching T1 to the detections using only the spatial distance, T1 may be assigned to D2, because the spatial distance ds12 between T1 and D2 is smaller than the distance ds11 between T1 and D1. Consequently, the matching is wrong and, to avoid this, the directional distance is considered. After a period of tracking, each target's tracklet is obtained. From some recent tracking points (e.g., the last 10 points), we can predict the target's velocity, which is a vector represented by the thick, dotted lines depicted in Figure 3. The velocities of the targets T1 and T2 are represented by V1 and V2, respectively.
Here, we define the directional distance as the distance between a detection point and the line determined by a velocity. In geometry, the distance between a point and a line is defined as the length of the perpendicular segment connecting the point to the given line. For convenience, we denote dpij as the directional distance between target Ti and detection Dj. For instance, in Figure 3, the directional distance between T1 and point D1 is dp11, and dp12 is the directional distance between T1 and point D2. Given a straight line ax + by + c = 0, the distance between a point (xp, yp) and the straight line can be calculated with the following equation:
d = |a·xp + b·yp + c| / √(a² + b²)    (1)
Combining the spatial and directional distances, we define the distance measurement as:
Sij = α·dsij + (1 − α)·dpij    (2)
where dsij is the spatial distance between the target Ti and the detection Dj, and dpij is the directional distance between Vi (the velocity of target Ti) and Dj. The coefficient α (0~1.0) controls the weights between dsij and dpij. In our experiments, we found that the spatial and directional distances are equally important, and thus α was set to 0.5. Under this distance measurement, the target T1 in Figure 3 is matched to its correct detection D1.
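For illustration, the following sketch assembles a cost matrix from the combined spatial/directional distance and solves the assignment with SciPy's Hungarian solver. The point and velocity representations (NumPy (x, y) arrays) and the helper names are our own assumptions, not part of a published implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def directional_distance(det, last_pt, velocity):
    """Perpendicular distance from detection `det` to the line through
    `last_pt` along `velocity` (cf. Equation (1))."""
    v = velocity / (np.linalg.norm(velocity) + 1e-9)
    rel = det - last_pt
    # Magnitude of the component of `rel` perpendicular to the velocity direction.
    return np.abs(rel[0] * v[1] - rel[1] * v[0])

def match_targets(targets, detections, velocities, alpha=0.5):
    """Assign tracked targets to detections with the Hungarian algorithm,
    using the weighted spatial/directional distance (cf. Equation (2)).
    Unassigned detections are treated as new objects by the caller."""
    cost = np.zeros((len(targets), len(detections)))
    for i, (t, v) in enumerate(zip(targets, velocities)):
        for j, d in enumerate(detections):
            ds = np.linalg.norm(d - t)                    # spatial distance
            dp = directional_distance(d, t, v)            # directional distance
            cost[i, j] = alpha * ds + (1.0 - alpha) * dp
    rows, cols = linear_sum_assignment(cost)  # globally minimal total cost
    return list(zip(rows, cols)), cost
```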
2.3.2. Optical Flow Tracking
Optical flow (OF) is the pattern of apparent motion of image objects between consecutive frames, caused by object or camera movement. It is a two-dimensional vector field where each vector is a displacement vector showing the movement of points from one frame to the next. A practical method for calculating the OF is LKOF, which takes a patch around one point and assumes that all the points in the patch have the same motion. The brightness constancy assumption underlying the OF calculation can be represented as follows:
I(x, y, t) = I(x + dx, y + dy, t + dt)    (3)
where I(x, y, t) is the intensity of the pixel (x, y) at time t, and dx and dy are the pixel displacements in the x and y directions, respectively. By expanding Equation (3) in a first-order Taylor series and linearising it, we obtain the OF constraint equation:
(∂I/∂x)·dx + (∂I/∂y)·dy + (∂I/∂t)·dt = 0    (4)
where ∂I/∂x and ∂I/∂y are the gradients of the image in the x and y directions, respectively, and ∂I/∂t is the partial derivative with respect to time. For simplicity, Equation (4) can be rewritten as:
Ix·vx + Iy·vy + It = 0    (5)
where (vx, vy) is the OF velocity in the x and y directions. The LKOF method assumes that the flow (vx, vy) is constant within a small window. Using a point and its surrounding pixels q1, q2, …, qn in the window, the velocity (vx, vy) can be worked out through a group of equations as follows:
Ix(qi)·vx + Iy(qi)·vy = −It(qi),  i = 1, 2, …, n    (6)
which is solved in the least-squares sense. Equation (6) means that the OF can be estimated by calculating the derivatives of an image in three dimensions. LKOF is often used to track feature points in computer vision. In our research, we extracted the centroids of the detected objects as the “feature points” and then estimated the OF of each object, allowing each object to be tracked.
So far, we have introduced two methods for tracking moving objects: the HMA and the LKOF. HMA is a popular and effective method for multiple object tracking, and it is chosen as the primary tracking method in this study. However, HMA may fail if an object cannot be successfully detected in the next frame. Therefore, LKOF can be used as an auxiliary method when HMA cannot track the object.
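A minimal sketch of the LKOF fallback is given below, assuming OpenCV's pyramidal Lucas-Kanade implementation; the window size, pyramid depth, and termination criteria shown are illustrative defaults rather than the values used in our experiments.

```python
import cv2
import numpy as np

LK_PARAMS = dict(winSize=(15, 15), maxLevel=2,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 0.03))

def track_with_lkof(prev_gray, curr_gray, centroids):
    """Propagate object centroids from the previous frame to the current one
    with LKOF; used when the Hungarian matching cannot find a detection."""
    pts = np.float32(centroids).reshape(-1, 1, 2)
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None,
                                                  **LK_PARAMS)
    ok = status.ravel() == 1          # keep only successfully tracked points
    return new_pts.reshape(-1, 2)[ok], ok
```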
2.4. Tracklet Feature Analysis
It is difficult to distinguish true vehicles from false ones using only static information from each frame. Since we track each moving object's tracklet over consecutive frames, we can distinguish them using temporal information. We observed that different moving objects yielded tracklets with different shapes. These tracklets can be categorised into four types (Figure 4). In Figure 4, the first row shows the illustration for each tracklet type, and the second row shows the corresponding instances from a video. Figure 4a shows a vehicle running on a straight road, Figure 4b represents a vehicle that has just completed a right turn, and Figure 4c depicts a vehicle that is making a U-turn. All three vehicle tracklets are smooth, and the points on the tracklets are almost evenly spaced. In contrast, the rough and erratic tracklet shown in Figure 4d is tracked from a distractor on a building, which the background modelling algorithm has misinterpreted as a vehicle. Given these characteristics, the roughness of a tracklet can be used as an indicator to distinguish between true and false vehicles.
To measure the roughness of a tracklet, we propose a method based on the directional change angle (DCA). As shown in Figure 5, P1, P2, P3, and P4 are four consecutive points on a tracklet, and the angle β2, between the two consecutive line segments P1P2 and P2P3, is defined as the DCA. Suppose an object is moving along the tracklet, starting from P1; it changes its direction towards P3 when it reaches P2. Therefore, β2 measures how much the motion direction has changed at P2. Similarly, we can define another DCA β3 at P3, and so on. All these DCAs will be small if the tracklet is smooth, whereas a rough and erratic tracklet yields large DCAs.
Given a tracklet with n points, we can calculate (n − 2) DCAs, denoted as (β2, β3, …, βn−1), and then derive their descriptive statistics (in radians): the minimum (DCAmin), maximum (DCAmax), mean (DCAmean), and standard deviation (DCAstd). We selected some samples for each type of tracklet and calculated their DCA statistics; the results are presented in Table 1. We can see that even if a vehicle is on a U-turn, the DCAmean is about 0.315 (less than 0.5) and the DCAstd is 0.218 (less than 0.5). All these data indicate that vehicle tracklets (i.e., types (A), (B), and (C)) have small DCAmin, DCAmax, DCAmean, and DCAstd, whereas these values are much larger for non-vehicle tracklets.
Among the four statistical values (Table 1), DCAmin and DCAmax are easily affected by outliers in the DCAs, due to the presence of noise on the tracklet. Similarly, a non-vehicle tracklet may also contain a small DCA value, resulting in a small DCAmin. In contrast, DCAmean and DCAstd are calculated from a series of DCAs, so they are less affected by outlier values and are more reliable. By analysing the geometric features of the tracklets, we can distinguish non-vehicles from vehicles through the mean and standard deviation of the DCAs calculated from their tracklets. The following section covers tracklet classification based on these tracklet features.
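The DCA statistics can be computed directly from the tracklet points; the sketch below assumes tracklets are stored as lists of (x, y) coordinates with at least three points, and reports all angles in radians.

```python
import numpy as np

def dca_statistics(points):
    """Compute directional change angles (DCAs) along a tracklet and their
    summary statistics; `points` must contain at least three (x, y) pairs."""
    pts = np.asarray(points, dtype=float)
    segs = np.diff(pts, axis=0)            # direction vectors P1P2, P2P3, ...
    dcas = []
    for a, b in zip(segs[:-1], segs[1:]):
        denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-9
        cos_angle = np.clip(np.dot(a, b) / denom, -1.0, 1.0)
        dcas.append(np.arccos(cos_angle))  # angle between consecutive segments
    dcas = np.array(dcas)
    return {"min": dcas.min(), "max": dcas.max(),
            "mean": dcas.mean(), "std": dcas.std()}
```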
2.5. Tracklet Classification
To describe our algorithm, we define the tracklet as an object for convenience, analogous to the concept of an object in object-oriented programming (OOP). Figure 6 shows the tracklet object with some attributes in UML (Unified Modelling Language) format. DCAs.x denotes the DCAmin, DCAmax, DCAmean, and DCAstd of the DCAs discussed in the previous section. The attribute “Length” is the number of points in the tracklet, and “Active” is a Boolean attribute that is true if the tracklet is still being tracked, or false if the tracking has been completed.
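A possible rendering of the tracklet object in code is given below, mirroring the attributes listed in Figure 6; the field names are illustrative, since the figure itself is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Tracklet:
    """Tracklet object with the attributes sketched in Figure 6."""
    points: List[Tuple[float, float]] = field(default_factory=list)  # tracked centroids
    dca_min: float = 0.0
    dca_max: float = 0.0
    dca_mean: float = 0.0
    dca_std: float = 0.0
    active: bool = True              # still being tracked?

    @property
    def length(self) -> int:
        return len(self.points)      # "Length": number of points in the tracklet
```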
As we have observed in traffic videos, vehicle tracklets are smooth, even when the vehicle is travelling around a bend or making a U-turn, whereas non-vehicle tracklets are erratic and ragged. Additionally, vehicle tracklets are long (often more than 20 points), whereas non-vehicle tracklets are relatively short.
In the previous section, we mentioned that the DCAmin and DCAmax features might not be reliable in distinguishing non-vehicles from vehicles, so we chose DCAmean, DCAstd, and the length of the tracklet as the features for classifying tracklets. In addition, the attribute “Active” must be considered when judging whether a short tracklet is a vehicle, because a true vehicle tracklet may not be long enough at the beginning of tracking. If the active state were ignored, short tracklets of truly moving vehicles would be wrongly classified as non-vehicles.
In our research, we do not need to classify tracklets into all four types because we aim to distinguish real vehicles from spurious ones. Therefore, our classification is a binary classification problem. The statistics in Table 1 were calculated from many real instances in traffic videos. Note that the DCAmean of a non-vehicle is above 1.0, while all the DCAmean values of vehicles are less than 0.3. The DCAstd of a non-vehicle is usually larger than 0.5, and that of a vehicle is less than 0.5. This indicates that non-vehicles and vehicles are linearly separable. Considering the tracklet length as another feature, we set three thresholds, Tmean = 0.5C, Tstd = 0.5C, and Tlength = 20/C, for tracklet classification, where C is a self-adaptive coefficient (default value 1.0), which we will discuss in the next section. We need to process each frame as fast as possible, so the classification method should be lightweight. For this reason, we use the decision tree in Figure 6 as our classifier:
From the decision tree, we can see that if a tracklet is too short (shorter than Tlength), it will be classified as a non-vehicle once the tracklet has been completed (Active = False), since short tracklets are usually tracked from noise. If a short tracklet is still active, we cannot yet judge whether it is a true vehicle, so it is simply skipped. If the tracklet is longer than the threshold Tlength and has a small DCAmean (less than Tmean), it is a vehicle on a straight road. On the other hand, if the DCAmean is larger than Tmean, we need the third feature to make a further decision: (1) if DCAstd is also larger than the corresponding threshold Tstd, the tracklet is classified as a non-vehicle; (2) if DCAstd is less than Tstd, it is a vehicle.
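The decision tree can be written as a few nested conditions; the sketch below reuses the hypothetical Tracklet object introduced above and the thresholds Tmean, Tstd, and Tlength defined in this section.

```python
def classify_tracklet(t, c=1.0):
    """Decision-tree tracklet classifier sketched from Section 2.5.
    `t` is a Tracklet; `c` is the self-adaptive coefficient (Section 2.6).
    Returns "vehicle", "non-vehicle", or "undecided" (short but still active)."""
    t_mean, t_std, t_length = 0.5 * c, 0.5 * c, 20.0 / c   # thresholds from the paper
    if t.length < t_length:
        # Short tracklets are treated as noise once finished; otherwise wait.
        return "non-vehicle" if not t.active else "undecided"
    if t.dca_mean < t_mean:
        return "vehicle"                                   # smooth tracklet
    return "non-vehicle" if t.dca_std > t_std else "vehicle"
```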
2.6. Confidence Map
In urban areas, moving vehicles are driven on certain fixed regions: roads. Obtaining the road ROIs (Regions of Interest) in advance would help suppress false targets by using the ROIs as constraints. Yang et al. [12] and Chen et al. [14] utilised a similar strategy in their research, where a road mask was generated from accumulated trajectories and then used to filter false targets. Unlike their work, we create a confidence map from the tracked tracklets and then use it to adapt the parameter C in the tracklet classification.
In the previous section, three thresholds, Tmean = 0.5C, Tstd = 0.5C, and Tlength = 20/C, were set. When C is set to the default value of 1.0, the classification criteria are strict, and only true vehicles that meet the high criteria can be classified as vehicles. The confidence of a tracklet is computed by the following formula:
where T.DCA.mean is the mean value (i.e., DCAmean) of the tracklet's DCAs, and T.Length is the length of the tracklet. The first term of Equation (7) depends on the smoothness of the tracklet and the second on its length. The smoother and longer the tracklet, the higher the confidence. The DCAmean of a tracklet is larger than 0 and usually less than 2, so the first term ranges from 0 to 0.5. The second term also ranges from 0 to 0.5, and thus VConf varies between 0 and 1.0. The confidence measures the likelihood that a tracklet is tracked from a true moving vehicle.
A confidence map can be generated from vehicle tracklets as follows: each vehicle tracklet is rasterised into a 3-pixel-wide trace, and the corresponding pixel values are set to the tracklet's confidence value. Once all traces are created, they are overlaid. When overlaying the traces, low confidence values are replaced by high values, as shown in Figure 7a, where four traces with different confidence values (1, 2, 3, and 4) are displayed in blue, green, yellow, and red, respectively. In this way, the confidence map is accumulated and updated during tracking. Figure 7b shows a confidence map generated and dynamically updated from real tracklets.
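The rasterisation step can be sketched as follows, assuming the confidence map is stored as an 8-bit image with confidence scaled to 0–255 and that OpenCV's polyline drawing is used for the 3-pixel-wide traces; both choices are implementation assumptions.

```python
import cv2
import numpy as np

def update_confidence_map(conf_map, tracklet_points, v_conf):
    """Rasterise one tracklet into a 3-pixel-wide trace and fuse it into the
    confidence map, keeping the maximum confidence where traces overlap."""
    trace = np.zeros_like(conf_map)                 # same HxW uint8 image
    pts = np.int32(tracklet_points).reshape(-1, 1, 2)
    value = int(round(255 * float(v_conf)))         # confidence scaled to 0..255
    cv2.polylines(trace, [pts], isClosed=False, color=value, thickness=3)
    np.maximum(conf_map, trace, out=conf_map)       # keep the higher confidence
    return conf_map
```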
Once the confidence map is obtained, the coefficient C is calculated by the following formula:
As mentioned above, VConf varies between 0 and 1.0, and thus C ranges from 0.5 to 2.0. As previously mentioned, C is a self-adaptive coefficient that controls the three thresholds. If an object is in a high-confidence region, a larger value of C increases Tmean and Tstd while decreasing Tlength. This means that the smoothness and length criteria are relaxed in high-confidence areas, reducing the probability of rejecting a true vehicle with a ragged tracklet. Conversely, the criteria are tightened in low-confidence regions, so false moving vehicles can be rejected safely.
4. Discussion
The experimental results show that the frame difference method (Diff3) introduces too much noise, resulting in inferior accuracy. The ViBE method, as a recently proposed novel background subtraction model, works well with surveillance monitoring videos but not when applied to detect tiny moving objects from satellite-based videos. The MOG2 method achieves moderate performance. Both LCNN and TFC methods adopt the NPBSM as the preliminary detector and then eliminate false targets, but they use different strategies to remove them. LCNN uses a lightweight convolutional neural network that requires collecting samples and costs additional processing time, whereas TFC takes full advantage of motion information to identify false objects. Both AMS-DAT and TFC use motion information to eliminate false targets, but they adopt different strategies. AMS-DAT uses a straightforward method that accumulates the difference foreground to obtain moving trajectories. TFC applies complete temporal information and uses tracklet features to distinguish between true and false targets.
Our method uses a two-step strategy to obtain the final result. The accuracy of the final result depends on two aspects: high recall of the preliminary detection and the accuracy of the tracklet classifier. Hence, the basic premise of the TFC method is that including more false targets in the preliminary detection is more acceptable than missing true objects. This is because false targets can be filtered out through post-processing, but missed targets will never be retrieved again. Therefore, the algorithm should guarantee that all candidates can be recognised in the first step. The NPBSM used in our method can detect candidates with recall higher than 96%, fulfilling the task.
In the subsequent post-processing step, our method tracks the centroids of candidates using the HMA and LKOF, which differs from the accumulation strategy proposed by Chen et al. [14]. The edges of tall buildings tend to form tracklets similar to those of vehicles. The yellow and green rectangles in Figure 16a indicate edges that are incorrectly recognised as moving vehicles, and Figure 16b shows the tracklets extracted by accumulation, including incorrect tracklets (in the two rectangles) caused by building edges. In contrast, Figure 16c shows that our strategy can prevent false targets from forming tracklets, yielding more accurate results. In addition, our method obtains more complete tracklets than the accumulation method.
As for the detection performance of the methods involved in our experiments, a summary is drawn as follows: NPBSM has the highest recall (over 96%), indicating that it can retrieve almost all targets, but it also produces many false alarms, with precision between 48% and 55%. Therefore, NPBSM has the advantage of high recall and can be used to generate initial detections in two-step methods, outperforming Diff3, MOG2, and ViBE. As two-step methods, AMS-DAT, LCNN, and TFC apply different strategies to eliminate false alarms. AMS-DAT uses a straightforward trajectory accumulation method to verify true targets and achieves moderate precision (78–85%) and recall (68–78%). LCNN employs a lightweight convolutional neural network and achieves precision from 84% to 90% and recall of around 85%. TFC utilises tracklet features to discriminate between true and false targets and achieves precision from 82% to 93% and recall from 85% to 92%. On our datasets, TFC outperforms LCNN and AMS-DAT. Among all the methods in the experiments, TFC has the best overall performance, achieving the highest F1 score.