Article

Detecting Moving Vehicles from Satellite-Based Videos by Tracklet Feature Classification

1 School of Earth Sciences and Engineering, Hohai University, Nanjing 211100, China
2 School of Art, Nanjing University of Information Science and Technology, Nanjing 210044, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(1), 34; https://doi.org/10.3390/rs15010034
Submission received: 11 November 2022 / Revised: 19 December 2022 / Accepted: 19 December 2022 / Published: 21 December 2022
(This article belongs to the Section Remote Sensing Image Processing)
Figure 1. The flow of the proposed method.
Figure 2. Distance matrix between the targets and detections. Sij measures the distance between the target Ti and the detection Dj.
Figure 3. Distance measurement between targets and detections. ds11 and ds12 are the spatial distances from T1 to D1 and D2, respectively. dp11 and dp12 are the directional distances from line V1 to D1 and D2, respectively.
Figure 4. Types of tracklets. Panels (a–c) show a vehicle's tracklet when traveling on a straight road, a bend, and a U-turn, respectively. Panel (d) shows an erratic tracklet tracked from a non-vehicle object.
Figure 5. Calculation of the directional change angle. P1, P2, P3, and P4 are points on the tracklet. β2 and β3 are two directional change angles formed by P1, P2, P3, and P4.
Figure 6. Decision tree of classification based on the attributes of the tracklet.
Figure 7. Creation of the confidence map. Panel (a) shows a confidence map created by overlapping traces with different confidence values. Panel (b) shows a confidence map generated from true vehicle tracklets, where the jet colourmap indicates confidence values from 0 (blue) to 1 (red).
Figure 8. Enlarged parts of the video data and manually annotated ground-truth vehicles. Image (a) shows the SkyBox satellite data [13], and (b) shows the ChangGuang satellite data.
Figure 9. Short tracklets of noise from the tops of buildings. Green dots indicate true vehicles and red dots depict false ones in the current frame.
Figure 10. An example of a long tracklet of a false target from the top of a building.
Figure 11. An example of a confidence map generated from tracklets.
Figure 12. Detection of Dataset 1: yellow indicates correct targets, red shows false targets, and cyan depicts missed targets. Panels: (a) ground-truth, (b) Diff3, (c) MOG2, (d) ViBE, (e) AMS-DAT, (f) NPBSM, (g) LCNN, and (h) TFC (the proposed method).
Figure 13. Detection of Dataset 2: yellow indicates correct targets, red shows false targets, and cyan depicts missed targets. Panels: (a) ground-truth, (b) Diff3, (c) MOG2, (d) ViBE, (e) AMS-DAT, (f) NPBSM, (g) LCNN, and (h) TFC (the proposed method).
Figure 14. Noise removal of NPBSM detection results under different area thresholds: yellow indicates correct targets, red shows false targets, and cyan depicts missed targets.
Figure 15. Trend curves of precision, recall, and F1 score under different area thresholds (in pixels). Increasing precision is accompanied by decreasing recall.
Figure 16. Comparison of the tracklets extracted using different methods. (a) The initially detected candidates of moving objects; (b) motion trajectories extracted by accumulation; (c) tracklets extracted using our method, in which the incorrect tracklets inside the rectangles in (b) were removed.

Abstract

Satellite-based video enables potential vehicle monitoring and tracking for urban traffic management. However, due to the tiny size of moving vehicles and cluttered background, it is difficult to distinguish actual targets from random noise and pseudo-moving objects, resulting in low detection accuracy. In contrast to the currently overused deep-learning-based methods, this study takes full advantage of the geometric properties of vehicle tracklets (segments of moving object trajectory) and proposes a tracklet-feature-based method that can achieve high precision and high recall. The approach is a two-step strategy: (1) smoothing filtering is used to suppress noise, and then a non-parametric-based background subtracting model is applied for obtaining preliminary recognition results with high recall but low precision; and (2) generated tracklets are used to discriminate between true and false vehicles by tracklet feature classification. Experiments and evaluations were performed on SkySat and ChangGuang acquired videos, showing that our method can improve precision and retain high recall, outperforming some classical and deep-learning methods from previously published literature.

1. Introduction

In recent years, capturing videos from satellites has become a reality, with the launch of commercial satellites such as SkySat, ChangGuang, and Urthecast. Unlike conventional satellites, video satellites can capture consecutive images (videos) by staring at a designated area, thus providing critical information for dynamic monitoring [1,2,3].
Video satellites have opened up new areas for remote sensing applications, such as large-scale traffic surveillance, 3D reconstruction of buildings, urban management, etc. In traffic surveillance, extracting moving vehicles from satellite videos is one of the most critical tasks. Moving vehicle detection from ground-based cameras has been an active research topic for many years, and many state-of-the-art methods have been proposed, for example, the Temporal Differencing [4], the Gaussian Mixture Model (GMM) [5], the Codebook method [6], and the Visual Background Extraction (ViBE) [7]. However, these methods do not work well on satellite videos because of several challenges as follows [2,8]:
(1)
Tiny Size of Targets: In ground-based surveillance videos, targets are relatively large and can be easily distinguished from noise. In satellite videos, moving vehicles are usually composed of several pixels without distinctive colour and texture, making them difficult to distinguish from noise. Under such conditions, motion information is probably the most robust feature that can be used to identify moving vehicles.
(2)
Pseudo-Moving Objects: Due to the camera's ego-motion, the parallax of tall buildings causes many pseudo-moving objects, which significantly affect the detection accuracy. False objects often appear on the edges of tall buildings. In the absence of available features, efficiently distinguishing between true and false targets is a crucial challenge.
(3)
Cluttered Background: Satellite videos cover expansive views and capture various ground objects, such as roads, buildings, water bodies, vegetation, etc. These objects form complex and cluttered backgrounds, making vehicles tiny and challenging to identify. As Ao et al. [2] adequately mentioned, the task is like looking for a needle in a haystack. Moreover, video illumination changes cause more difficulties, resulting in inconsistent recognition results.
Motion detection has been a hot research area for decades, and many methods for detecting objects from videos have been proposed. Despite the difference between satellite- and ground-based surveillance videos, some basic techniques still apply to detect objects from satellite-based videos.
Background subtraction for motion segmentation in static scenes is a commonly used technique to detect moving objects from videos [9]. This technique relies on subtracting the current image from a reference image (also called a background image), resulting in the difference image. The pixels where the difference is above a given threshold are classified as moving objects (i.e., foreground). Nevertheless, the critical task is to create the background image (reference image), which can be solved by the median background model [10], the GMM [5], the Kernel Density Estimation (KDE) [11], the ViBE model [7], etc. As the most popular and effective methods for motion detection, the Background Subtraction Models (BSMs) also play an essential role in generating initial detection results from satellite videos. For instance, Kopsiaftis and Karantzalos [1] applied the BSMs and mathematical morphology to estimate traffic density. Yang et al. [12] presented a novel saliency-based background model adapted to detect tiny vehicles from satellite-based videos. Chen et al. [13] evaluated different background models and identified the most suitable one for detecting vehicle candidates from satellite-based videos.
Besides the BSMs, the temporal differencing method, in which moving objects are located by computing the difference between two or three consecutive frames, has been widely used to detect moving vehicles and is among the most efficient approaches. Although frame differencing is fast enough for real-time processing, it generally fails to obtain complete outlines of moving objects and causes ghost regions when objects move fast. Therefore, it is commonly combined with, or adapted by, other techniques to overcome these drawbacks. More recently, Chen et al. [14] proposed a method for detecting moving objects, called adaptive separation, which is based on the idea of frame difference. Their method can be easily implemented and offers high performance. Likewise, Shi et al. [15] developed a normalised frame difference method, which is also a variant of frame differencing. Shu et al. [16] proposed a hybrid method to detect small vehicles from satellite-based videos by combining three-frame differencing with the Gaussian mixture model.
Witnessing the great success of convolutional neural networks (CNN) in object detection, recent works have utilised CNN to detect moving vehicles. For example, Pflugfelder et al. [17] trained and fine-tuned a FoveaNet for detecting vehicles from satellite-based videos. Chen et al. [13] proposed a lightweight CNN (LCNN) to reject false targets from preliminary detection results to increase recognition accuracy. Zhang et al. [18] trained a weakly supervised deep convolutional neural network (DCNN) to improve detection accuracy. They used the pseudo labels extracted by extended low-rank and structured sparse decomposition.
Some methods exploit motion information to improve detection accuracy. Unlike moving objects in ground surveillance videos, moving vehicles in satellite-based videos lack texture and colour features, and therefore motion information is accounted for. For instance, Xu et al. [19] used a global scene motion compensation and a local dynamic updating method to remove false vehicles by considering the moving background. Likewise, Ao et al. [2] proposed a motion-based detection algorithm via noise modelling. Hu et al. [20] also utilised motion features in deep neural networks for accurate object detection and tracking. Feng et al. [21] proposed a deep-learning framework guided by spatial motion information. More recently, Pi et al. [22] integrated motion information from adjacent frames into an end-to-end neural network framework to detect tiny moving vehicles. Besides spatial information, Lei et al. [23] utilised multiple prior information to improve the accuracy of detecting tiny moving vehicles from satellite-based videos.
Besides the above methods, some studies deal with detecting moving objects using low-rank and sparse representation [24]. Inspired by the low-rank decomposition, Zhang et al. [25] extended the decomposition formulation with bounded errors, called Extended LSD (E-LSD). This method has improved the background subtraction method with boosted detection precision over other methods. In addition, Zhang et al. [26] proposed online low-rank and structured sparse decomposition (O-LSD) to reduce the computational cost, which addresses the processing delay in seeking rank minimisation.
Detecting moving vehicles from satellite videos has been studied for only about a decade, and the existing detection methods still have some limitations. The methods based on BSMs often introduce noise that is difficult to distinguish from actual vehicles, while the methods based on frame difference are sensitive to noise. Although deep-learning-based approaches have achieved excellent performance, they require many samples and repetitive training, resulting in a highly time-consuming task. In addition, deep learning requires expensive computational resources, making it unsuitable for high-speed processing.
Due to the cluttered background and the absence of conspicuous appearance features, moving vehicles are easily confused with false targets caused by the complex moving background. Although some algorithms allow us to adjust parameters to achieve higher detection precision or recall, increasing precision generally lowers recall, and vice versa. The critical task is to eliminate the effects of background movement and environmental changes (e.g., illumination change). Overall, a low-computation detection algorithm with high precision and recall is lacking, and further studies are needed.
Among the state-of-the-art techniques for detecting moving objects from satellite-based videos, BSMs are still the most widely used since they are practical methods. Nevertheless, we have noticed that some BSM-based approaches produce results with low precision but high recall. That is to say, they can identify almost all true objects, but also misidentify many false ones. The key difficulty is to distinguish these pseudo-moving vehicles from true ones. If we can remove false objects and keep true vehicles, both precision and recall can be guaranteed. To this end, this paper proposes a moving vehicle detection approach based on BSMs and tracklet analysis.
Our proposed method fully uses motion information, which is a potential cue for identifying actual moving vehicles. This study aims to distinguish between true and false vehicles using the features of their tracklets, which have not been studied in the existing literature. Object tracking tries to obtain an object’s trajectory, which refers to the entire path an object travels from its appearance to its disappearance. A tracklet is considered one of the fragments of the whole path and can be used to distinguish different moving objects.
To distinguish between these tracklets, we need to investigate the characteristics of different tracklets and design appropriate features accordingly. This paper has several contributions as follows: (1) an object matching algorithm considering the directional distance metrics is proposed to create reliable tracklets of moving objects; (2) a tracklet feature classification method is proposed to distinguish between true and false targets; and (3) a tracklet rasterisation method is proposed to create a confidence map that can be used to adaptively alter the parameters in tracklet classification and derive road regions that can be utilised to verify actual moving vehicles. The following sections cover the details of our procedure and experiments.

2. Method

The proposed method mainly consists of the following steps: (1) image filtering and enhancement, (2) background modelling, (3) object tracking, (4) tracklet feature extraction, and (5) tracklet classification, as summarised in Figure 1.
In the first step, we apply a Gaussian filter to the input videos, suppressing some random noise that often appears at object edges. This step removes part of the false objects from the object detection results. In the second step, a non-parametric background modelling method [27] is chosen to obtain initial vehicle recognition results with high recall but low precision. In the third step, we extract object tracklets using our newly developed tracking method, derived from the Hungarian Matching Algorithm (HMA) [28] and Optical Flow (OF) [29]; then, features are extracted for each tracklet and used to distinguish true vehicles from false targets. At the same time, a confidence map can be derived from vehicle tracklets and then used to adjust parameters in the tracklet classification. In the final step, the confidence map can also be used as a constraint to verify proper moving vehicles, achieving both high precision and recall.

2.1. Video Filtering

The challenge of detecting moving vehicles from satellite videos lies in the target size and random distracting noise. We observed that noise often occurred at building edges and road curbs. Image smoothing is an essential technique for dealing with noise. Some commonly used filters are available for our task, such as the mean, median, and Gaussian. For the detection of such tiny objects, the filter should be able to be fine-tuned through its parameters. Among the three, the mean and median filters can be adjusted by only one parameter: the window size. Thus, the targets tend to be over-blurred even when a small window size is applied. In contrast, the Gaussian filter can adjust the smoothing strength by tuning two parameters: the window size (w) and the standard deviation (σ) of the Gaussian function. The Gaussian filter was utilised in our experiments. The appropriate parameters are determined by repetitive testing until the filter can remove much noise and keep targets unspoiled. Finally, the empirical parameters were set to w = 5 and σ = 1.
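As an illustration only (not the authors' released code), the pre-filtering step could be written in Python with OpenCV as follows; the video path and frame handling are placeholder assumptions.

```python
import cv2

# Hypothetical sketch of the video filtering step described above.
# "satellite_video.mp4" is a placeholder path, not one of the datasets used in this paper.
cap = cv2.VideoCapture("satellite_video.mp4")
ok, frame = cap.read()
if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Gaussian smoothing with the empirical parameters w = 5 and sigma = 1
    smoothed = cv2.GaussianBlur(gray, (5, 5), 1)
```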

2.2. Background Subtraction Model

Satellite-based videos differ from ground-based videos. We observed that some popular background models, such as GMM [5] and ViBE [7], do not work well when used directly on satellite-based videos. In our study, we found that GMM generated too much noise, while ViBE tended to miss a lot of true objects. Therefore, we generated initial detection results using a non-parametric modelling method [11] that can adapt quickly to changes in the scene and yield sensitive results with low false alarm rates. This model applies a normal kernel function to estimate the density, and can obtain a more accurate estimation. In our study, we adopt an improved version, proposed by Zivkovic and Heijden [27], and implemented it in OpenCV (an open-source package for computer vision). This model needs two parameters specified by the user, namely (1) the threshold (dT2) on the squared distance between the pixel and the sample to decide whether a pixel is close to a data sample; and (2) the history length (Lhis), which is the number of last frames that affect the background model.
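For reference, the Zivkovic and Heijden model [27] is exposed in OpenCV as BackgroundSubtractorKNN; the sketch below is an assumption of how it could be configured, using the parameter values reported in Section 3 (Lhis = 50, dT2 = 200).

```python
import cv2

# Minimal sketch of the non-parametric background subtraction step (an assumption,
# not the authors' exact configuration). history ~ Lhis, dist2Threshold ~ dT2.
subtractor = cv2.createBackgroundSubtractorKNN(history=50,
                                               dist2Threshold=200,
                                               detectShadows=False)

# For each smoothed frame, a binary foreground mask of candidate moving objects
# would then be obtained as: foreground = subtractor.apply(smoothed)
```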

2.3. Tracklet Extraction

When detecting large objects from videos, noise is much smaller than target objects and can be filtered by specifying a threshold value. However, in satellite-based videos, moving vehicles appear at roughly the same size as the noise, making it extremely challenging to remove false targets from the initial detection results. Two major types of false targets exist, namely random noise and pseudo-moving objects, which are caused by the motion parallax of building roofs. We have noticed that a vehicle moves at a constant speed and in a steady direction, following regular trajectories, whereas random noise appears at random positions and therefore cannot form smooth trajectories. Although pseudo-moving objects also form trajectories, their trajectories are often short and erratic. Based on these observations, our method intends to discriminate true targets from false ones by taking full advantage of the features of moving objects' trajectories.
Object tracking techniques can be utilised to obtain trajectories. Unlike object tracking, however, our goal is to detect moving vehicles rather than to continuously track each object. Therefore, we do not need to track the entire trajectory of a moving object, but only its tracklets, which are explained in the introduction. In object tracking, an object may be lost for a while and then tracked again afterward, so in this scenario we can only obtain separate tracklets rather than a complete trajectory. Since our study aims to detect the positions of true vehicles instead of whole trajectories, the remainder of this paper only discusses the tracklet tracking method.
The background modelling mentioned above allows us to detect potential moving vehicles, which consist of true and false targets. To track their tracklets, we need to associate each object on the current frame with its corresponding identification on the next frame. This work is also known as data association in Multiple Object Tracking (MOT) [30]. One of the techniques for data association is the Hungarian Matching Algorithm (HMA) [28]. It requires high-quality detection results to obtain satisfactory tracking results. However, some vehicles cannot be detected on specific frames, due to the brightness of targets or motion blur, resulting in incoherent detection results. Missed detections will degrade the performance of HMA. To compensate, another technique, the Lucas-Kanade Optical Flow (LKOF) [31], is applied to track vehicles when HMA fails. Combining the two techniques makes our tracklet tracking algorithm reliable.

2.3.1. The Hungarian Matching Algorithm—HMA

Let T1, T2, T3, …, Tn denote the targets on the k-th frame, and D1, D2, D3, …, Dm denote the detected objects on the (k + 1)-th frame. To track Ti, we must find its match Dj from the next frame. If Dj cannot be found, this indicates that the tracking of Ti is interrupted, or Ti disappeared from the image. If one detection Dj cannot be matched to any target, it is considered a new object. To match a target Ti to its corresponding detection, we need to measure the distances between the target Ti and all the detections. Given n targets and m detections, a distance matrix Cnm (as shown in Figure 2) is generated, based on the distance measurement. Each element Sij measures the distance between the target Ti and the detection Dj. Through HMA, we can find the globally optimal solution for this problem. That is to say, we can assign all targets to their detections, guaranteeing that the sum of distances is minimised.
The quality of the matching algorithm depends on the distance measurement. To ensure the algorithm's reliability, we proposed a measurement that considers both spatial and directional distances. As shown in Figure 3, the two triangles represent two targets: T1 and T2, and the two circles represent the two corresponding detections: D1 and D2. When matching T1 to the detections under the measurement of only spatial distance, T1 may be assigned to D2. This is because the spatial distance ds12 between T1 and D2 is smaller than the distance ds11 between T1 and D1. Consequently, the matching is wrong and, to avoid this, directional distance is considered. After a period of tracking, each target's tracklet is obtained. From some recent tracking points (e.g., the last 10 points), we can predict the target's velocity, which is a vector defined by the thick, dotted lines depicted in Figure 3. The velocities of the targets T1 and T2 are represented by V1 and V2, respectively.
Here, we define the directional distance as the distance between a detection point and the line determined by a velocity. In geometry, the distance between a point and a line is defined as the length of the perpendicular segment connecting the point to the given line. For convenience, we denote dpij as the directional distance between target Ti and detection Dj. For instance, in Figure 3, the directional distance between T1 and point D1 is dp11, and dp12 is the directional distance between T1 and point D2. Given a straight line as ax + by + c = 0, the distance between a point (xp, yp) and the straight line can be calculated with the following equation:
d = \frac{|a x_p + b y_p + c|}{\sqrt{a^2 + b^2}}    (1)
Combining the spatial and directional distance, we define the distance measurement as:
S_{ij} = \alpha d_{s_{ij}} + (1 - \alpha) d_{p_{ij}}    (2)
where dsij is the spatial distance between the target Ti and the detection Dj, and dpij is the directional distance between Vi (velocity of target Ti) and Dj. The coefficient α (0~1.0) controls the weights between dsij and dpij. In our experiments, we found that the spatial and directional distance are equally important, and thus α was set to 0.5. Under our distance measurement, the target T1 in Figure 3 can be matched to its correct detection D1.
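A minimal sketch of this matching step is given below, assuming SciPy's Hungarian solver (linear_sum_assignment) as the HMA implementation; the array layout and the velocity representation are illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_targets(targets, velocities, detections, alpha=0.5):
    """Illustrative sketch of the matching step (Equations (1) and (2)).

    targets, detections: (n, 2) and (m, 2) arrays of centroid coordinates.
    velocities: (n, 2) array of predicted velocity vectors, one per target.
    """
    targets = np.asarray(targets, dtype=float)
    detections = np.asarray(detections, dtype=float)
    n, m = len(targets), len(detections)
    cost = np.zeros((n, m))
    for i in range(n):
        vx, vy = velocities[i]
        # Line through target i along its velocity direction: a*x + b*y + c = 0
        a, b = vy, -vx
        c = -(a * targets[i, 0] + b * targets[i, 1])
        norm = np.hypot(a, b) + 1e-9
        for j in range(m):
            d_s = np.linalg.norm(targets[i] - detections[j])                   # spatial distance
            d_p = abs(a * detections[j, 0] + b * detections[j, 1] + c) / norm  # Equation (1)
            cost[i, j] = alpha * d_s + (1 - alpha) * d_p                       # Equation (2)
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm: minimise total distance
    return list(zip(rows.tolist(), cols.tolist()))
```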

2.3.2. Optical Flow Tracking

Optical flow (OF) is the pattern of apparent motion of image objects between consecutive frames, caused by object or camera movement. It is a two-dimensional vector field where each vector is a displacement vector showing the movement of points from one frame to the second. A practical method for calculating the OF is LKOF, which takes a w × w patch around one point and assumes that all the points in the patch have the same motion. The equation for the OF calculation can be represented as follows:
I(x, y, t) = I(x + dx, y + dy, t + 1)    (3)
where I(x, y, t) is the intensity of the pixel (x, y) at time t, and dx and dy are pixel displacements in the x and y directions, respectively. By expanding Equation (3) in a first-order Taylor series and linearising, we obtain the OF constraint equation:
\frac{\partial I}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial I}{\partial y}\frac{\partial y}{\partial t} + \frac{\partial I}{\partial t} = 0    (4)
where ∂I/∂x and ∂I/∂y are the gradients of the image in the x and y directions, respectively. In addition, ∂I/∂t is the partial derivative with respect to time. For simplicity, Equation (4) can be rewritten as:
I_x v_x + I_y v_y + I_t = 0    (5)
where (vx, vy) is the OF velocity in the x and y directions. The LKOF method assumes that the flow (vx, vy) is constant in a small window of size w × w . Using a point and its surrounding pixels, the velocity (vx, vy) can be worked out through a group of equations as follows:
\begin{bmatrix} I_{x_1} & I_{y_1} \\ \vdots & \vdots \\ I_{x_n} & I_{y_n} \end{bmatrix} \begin{bmatrix} v_x \\ v_y \end{bmatrix} = - \begin{bmatrix} I_{t_1} \\ \vdots \\ I_{t_n} \end{bmatrix}    (6)
Equation (6) means that the OF can be estimated by calculating the derivatives of an image in three dimensions. LKOF is often used to track feature points in computer vision. In our research, we extracted the centroids of the detected objects as the “feature points” and then estimated the OF of each object and thus could track each object.
So far, we have introduced two methods for tracking moving objects: the HMA and the LKOF. HMA is a popular and effective method for multiple object tracking, and it is chosen as the primary tracking method in this study. However, HMA may fail if an object cannot be successfully detected in the next frame. Therefore, LKOF can be used as an auxiliary method when HMA cannot track the object.
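The following sketch shows how such an LKOF fallback could look with OpenCV's calcOpticalFlowPyrLK; the window size and pyramid level are illustrative assumptions not taken from the paper.

```python
import cv2
import numpy as np

def lkof_fallback(prev_gray, next_gray, centroids):
    """Hedged sketch of the LKOF fallback: track object centroids when HMA finds no match.

    centroids: list of (x, y) centroids of unmatched targets in the previous frame.
    """
    pts = np.float32(centroids).reshape(-1, 1, 2)
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, pts, None, winSize=(15, 15), maxLevel=2)
    # Keep only centroids whose flow was successfully found (status == 1)
    return [tuple(p.ravel()) for p, s in zip(next_pts, status) if s == 1]
```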

2.4. Tracklet Feature Analysis

It is difficult to distinguish true vehicles from false ones using only static information from each frame. Since we track each moving object's tracklet from consecutive frames, we can distinguish them using temporal information. We observed that different moving objects yielded tracklets with different shapes. These tracklets can be categorised into four types (Figure 4). In Figure 4, the first row shows the illustration for each tracklet type, and the second row shows the corresponding instances from a video. Figure 4a shows a vehicle running on a straight road, Figure 4b represents a vehicle that has just completed a right turn, and Figure 4c depicts a vehicle that is making a U-turn. All three vehicle tracklets are smooth, and the points on the tracklets are almost evenly spaced. In contrast, the rough and erratic tracklet shown in Figure 4d is tracked from a distractor on a building. The background modelling algorithm has misinterpreted it as a vehicle. From the characteristics of these tracklets, the roughness of the tracklet can be used as an indicator that can distinguish between true and false vehicles.
To measure the roughness of the tracklet, we proposed a method based on the directional change angle (DCA). As shown in Figure 5, P1, P2, P3, and P4 are four consecutive points on a tracklet, and the angle β2, between the two consecutive line segments P1P2 and P2P3, is defined as the DCA. Suppose an object is moving along the tracklet starting from P1; when it reaches P2, it changes its direction towards P3. Therefore, β2 measures how much the motion direction has changed at P2. Similarly, we can define another DCA β3 at P3, and so on. All these DCAs will be small if the tracklet is smooth, whereas a rough and erratic tracklet would yield large DCAs.
Given a tracklet with n points, we can calculate (n − 2) DCAs, denoted as (β2, β3, …, βn−1), and then derive their descriptive statistics (in radians): minimum (DCAmin), maximum (DCAmax), mean value ($\overline{DCA}$), and standard deviation ($\sigma_{DCA}$). We selected some samples for each type of tracklet and calculated their DCA statistics, and the results are presented in Table 1. We can see that even if a vehicle is on a U-turn, the $\overline{DCA}$ is about 0.315 (less than 0.5) and the $\sigma_{DCA}$ is 0.218 (less than 0.5). All these data indicate that vehicle tracklets (i.e., types (A), (B), and (C)) have small DCAmin, DCAmax, $\overline{DCA}$, and $\sigma_{DCA}$, whereas these values are much larger for non-vehicle tracklets.
Among the four statistical values (Table 1), DCAmin and DCAmax are easily affected by outliers in the DCAs, due to the presence of noise on the tracklet. Similarly, a non-vehicle tracklet may also contain a small DCA value, resulting in a small DCAmin. In contrast, $\overline{DCA}$ and $\sigma_{DCA}$ are calculated from a series of DCAs, so they are less affected by outlier values and are more reliable. From analysing the geometric features of the tracklets, we can distinguish non-vehicles from vehicles through the mean and standard deviation of the DCAs calculated from their tracklets. The following section covers the tracklet classification based on these tracklet features.
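As a small worked example (our own sketch, with angles assumed to be in radians), the DCA statistics of a tracklet could be computed as follows:

```python
import numpy as np

def dca_statistics(points):
    """Sketch of the DCA features of Section 2.4 for a tracklet given as an (n, 2) array.

    Returns the mean and standard deviation of the (n - 2) directional change angles.
    """
    points = np.asarray(points, dtype=float)
    segments = np.diff(points, axis=0)                  # vectors P_k -> P_{k+1}
    headings = np.arctan2(segments[:, 1], segments[:, 0])
    dca = np.abs(np.diff(headings))                     # direction change at each interior point
    dca = np.minimum(dca, 2 * np.pi - dca)              # wrap angles into [0, pi]
    return dca.mean(), dca.std()
```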

2.5. Tracklet Classification

To describe our algorithm, we define the tracklet as an object for convenience, like the concept of an object in object-oriented programming (OOP). Figure 6 shows the tracklet object with some attributes in UML (Unified Modelling Language) format. The attribute DCAs.x denotes the statistics of the DCAs discussed in the previous section, namely DCAmin, DCAmax, $\overline{DCA}$, and $\sigma_{DCA}$. The attribute "Length" is the number of points in the tracklet, and "Active" is a Boolean attribute that is true if the tracklet is still being tracked, or false if the tracking has been completed.
As we have observed in traffic videos, vehicle tracklets are smooth, even when the vehicle is traveling on a bend or making a U-turn, whereas non-vehicle tracklets are erratic and ragged. Additionally, vehicle tracklets are long (often more than 20 points), while non-vehicle tracklets are relatively short.
In the previous section, we mentioned that the DCAmin and DCAmax features might not be reliable for distinguishing non-vehicles from vehicles, so we chose $\overline{DCA}$, $\sigma_{DCA}$, and the length of the tracklet as the features to classify tracklets. In addition, the attribute "Active" must be used when judging whether a short tracklet is a vehicle or not, because a true vehicle tracklet may not be long enough at the beginning of the tracking. If the active state were ignored, the short tracklets of truly moving vehicles would be wrongly classified as non-vehicles.
In our research, we do not need to classify tracklets into four types because we aim to distinguish real vehicles from spurious ones. Therefore, our classification is a binary classification problem. The statistics in Table 1 were calculated based on many real instances from traffic videos. Note that the $\overline{DCA}$ of a non-vehicle is above 1.0, whereas all the $\overline{DCA}$ values of vehicles are less than 0.3. The $\sigma_{DCA}$ of a non-vehicle is usually larger than 0.5 and that of a vehicle is less than 0.5. This indicates that non-vehicles and vehicles are linearly separable. Considering the tracklet length as another feature, we set three thresholds for tracklet classification: Tmean = 0.5 C, Tstd = 0.5 C, and Tlength = 20/C, where C is a self-adaptive coefficient (default value 1.0), which we will discuss in the next section. We need to process each frame as fast as possible, so the classification method should be lightweight. For this reason, we use the decision tree in Figure 6 as our classifier:
From the decision tree, we can see that if a tracklet is too short (shorter than Tlength), it will be classified as a non-vehicle once the tracklet has been completed (Active = False), since short tracklets are usually tracked from noise. If a short tracklet is still active, we cannot yet judge whether it is a true vehicle or not and simply skip it. If the tracklet is longer than the threshold Tlength and has a small $\overline{DCA}$ value (less than Tmean), it is a vehicle on a straight road. On the other hand, if the $\overline{DCA}$ is larger than Tmean, we need the third feature $\sigma_{DCA}$ to make a further decision: (1) if $\sigma_{DCA}$ is also larger than the corresponding threshold Tstd, the tracklet is classified as a non-vehicle; (2) if $\sigma_{DCA}$ is less than Tstd, it is a vehicle.
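A compact sketch of this decision tree is shown below; the attribute names (DCA_mean, DCA_std, Length, Active) are assumed stand-ins for the UML attributes in Figure 6, not the authors' exact identifiers.

```python
def classify_tracklet(tracklet, C=1.0):
    """Sketch of the decision tree in Figure 6 (Section 2.5).

    Returns "vehicle", "non-vehicle", or "undecided" for short, still-active tracklets.
    """
    t_mean, t_std, t_length = 0.5 * C, 0.5 * C, 20 / C
    if tracklet.Length < t_length:
        # Short tracklets: noise if tracking has finished, otherwise postpone the decision
        return "undecided" if tracklet.Active else "non-vehicle"
    if tracklet.DCA_mean < t_mean:
        return "vehicle"            # long and smooth: vehicle on a straight road
    if tracklet.DCA_std < t_std:
        return "vehicle"            # turning vehicle: larger mean DCA but consistent changes
    return "non-vehicle"            # long but erratic tracklet
```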

2.6. Confidence Map

In urban areas, moving vehicles are driven on certain fixed regions: roads. Getting the road ROIs (Regions of Interest) in advance would help suppress false targets by using the ROIs as constraints. Yang et al. [12] and Chen et al. [14] utilised a similar strategy in their research, where the road mask was generated from accumulated trajectories and then used to filter false targets. Unlike their work, we created a confidence map from the tracked tracklets and then used it to adapt the parameter C in the tracklet classification.
In the previous section, three thresholds, Tmean = 0.5 C, Tstd = 0.5 C, and Tlength = 20/C were set. When C is set to the default value 1.0, the classification criteria are strict, and only true vehicles that meet the high criteria can be classified as a vehicle. The confidence of a tracklet is computed by the following formula:
V_{conf} = 0.5 \cdot \frac{\max(0,\, 2 - T.DCA.mean)}{2} + 0.5 \cdot \frac{\min(T.Length,\, 255)}{255}    (7)
where T.DCA.mean is the mean value (i.e., $\overline{DCA}$) of the tracklet's DCAs, and T.Length is the length of the tracklet. The first term of Equation (7) depends on the smoothness of the tracklet and the second on the length. The smoother and longer the tracklet, the higher the confidence. The $\overline{DCA}$ of a tracklet is larger than 0 and usually less than 2, so the first term ranges from 0 to 0.5. The second term also ranges from 0 to 0.5, and thus V_{conf} varies between 0 and 1.0. The confidence measures the likelihood that a tracklet is tracked from a true moving vehicle.
A confidence map can be generated from vehicle tracklets as follows: each vehicle tracklet is rasterised into a 3-pixel-wide trace, and the corresponding pixel values are set to the tracklet's confidence value. Once all traces are created, they are overlapped together. When overlapping these traces, low confidence values are replaced by high values, as shown in Figure 7a, where four traces with different confidence values (1, 2, 3, and 4) are displayed in blue, green, yellow, and red colours, respectively.
This way, the confidence map is accumulated and updated during tracking. Figure 7b shows a confidence map generated and dynamically updated from real tracklets.
Once the confidence map is obtained, the coefficient C is calculated by the following formula:
C = 2 \max(0.25,\; V_{conf})    (8)
As mentioned above, V_{conf} varies between 0 and 1.0, and thus C ranges from 0.5 to 2.0. As previously mentioned, C is a self-adaptive coefficient that controls the three thresholds. If an object is in a high-confidence region, C is large, so the thresholds Tmean and Tstd increase while Tlength decreases. This means that the smoothness and length criteria are lowered in high-confidence areas, thus reducing the probability of rejecting a true vehicle with a ragged tracklet. On the other hand, the criteria are raised in low-confidence regions, and thus false moving vehicles can be rejected safely.
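The sketch below illustrates Equations (7) and (8): a tracklet is rasterised into a 3-pixel-wide trace, the higher confidence value wins where traces overlap, and the coefficient C is derived from the local confidence. The map representation (a float array) and function names are our assumptions.

```python
import cv2
import numpy as np

def update_confidence_map(conf_map, tracklet_points, dca_mean, length):
    """Sketch of the confidence-map update; conf_map is assumed to be a float32 array."""
    # Tracklet confidence, Equation (7): a smoothness term and a length term, each in [0, 0.5]
    v_conf = 0.5 * max(0.0, 2.0 - dca_mean) / 2.0 + 0.5 * min(length, 255) / 255.0
    # Rasterise the tracklet into a 3-pixel-wide trace
    trace = np.zeros(conf_map.shape, dtype=np.uint8)
    pts = [tuple(int(v) for v in p) for p in tracklet_points]
    for p, q in zip(pts[:-1], pts[1:]):
        cv2.line(trace, p, q, color=255, thickness=3)
    # Where traces overlap, keep the higher confidence value
    conf_map[trace > 0] = np.maximum(conf_map[trace > 0], v_conf)
    return conf_map

def adaptive_coefficient(v_conf):
    """Self-adaptive coefficient C of Equation (8): ranges from 0.5 to 2.0."""
    return 2.0 * max(0.25, v_conf)
```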

3. Result

To evaluate the performance of the proposed method, two experiments were conducted on two different satellite-based videos. The two parameters of the background model were set as: Lhis = 50 and dT2 = 200. All other parameters were set as mentioned in the previous sections. Our approach was implemented in Python and the OpenCV package, which offers basic image processing functions, the optical flow tracking algorithm, and classical background subtraction models. The HMA, tracklet classification, and evaluation modules were implemented in Python language, along with NumPy (a mathematical package for Python).

3.1. Datasets

The datasets consist of videos from two recently launched satellites. The first video was acquired by SkySat over Las Vegas, USA, on 25 March 2014, with a frame size of 1280 × 720. This dataset was used in our previous study [13] and was reused in this experiment. Due to the large number of tiny and dim vehicles in the videos, manually labelling samples over the whole frame would be prohibitively laborious. Therefore, the evaluation was conducted in a sub-region of 360 × 460, in which tall buildings cause obvious motion parallax. Eleven frames (200, 350, 400, 500, 600, 700, 800, 900, 1000, 1100, and 1200) were chosen as the validation data. The second video was captured by the ChangGuang satellite over Atlanta, USA, on 3 May 2017, with a frame size of 1920 × 1080. A 416 × 516 sized sub-region was cropped as the experimental area, and ten frames (410, 510, 560, 610, 640, 670, 730, 760, 790, and 820) were chosen as the validation data. In the two validation datasets, true moving vehicles were manually annotated as ground-truth images. The green rectangles in Figure 8 are the moving vehicles annotated manually.

3.2. Evaluation Metrics

The validation of object detection is often conducted by comparing the detection results against the ground-truth images. If a detected object overlaps with the corresponding ground-truth object, it is considered a True Positive (TP) detection; otherwise, it is a False Positive (FP) target. A ground-truth object is a False Negative (FN) when it is not successfully detected. Once the TP, FP, and FN are calculated, the precision, recall, and F1 score can be computed as follows [32]:
Precision = \frac{TP}{TP + FP}    (9)
Recall = \frac{TP}{TP + FN}    (10)
F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}    (11)
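For completeness, a small helper computing these metrics from the raw counts might look like the following sketch:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 score from TP/FP/FN counts, following Equations (9)-(11)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1
```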

3.3. Evaluation

3.3.1. Qualitative Evaluation

This section presents some results of the tracklets and evaluates the performance by visual inspection. For qualitative evaluation, we conducted experiments on Dataset 1 (Figure 8a), which contains tall buildings causing motion parallax that resembles moving vehicles.
Figure 9 shows the tracklets from frame 359 to frame 379 of Dataset 1. We only show an enlarged portion of the image so that the tracklets and the vehicles are visible. In each frame, the green dots indicate positive cases (true moving vehicles), and the red dots show false ones (non-vehicles). Note that a green dot does not permanently mark a true vehicle, nor does a red dot permanently mark a false target. The colours only indicate the decision (true or false vehicle) the classifier can make up to the current frame. Therefore, the decision is dynamically updated from one frame to the next, and the final decision is only made once a tracklet has been completed. Let us take the two green dots in the dotted rectangle in Figure 9b as examples. They are random noise on building tops and are misidentified as moving vehicles by the background model. At this point, the classifier cannot judge whether they are vehicles or not, as they appear in frame 362 for the first time and their tracklets are not long enough to make decisions. The two objects are then classified as non-vehicles in frame 365 (Figure 9c), where their tracking is finished and the classifier can make the final decision. After frame 365, the two non-vehicle objects in the rectangle in Figure 9c become inactive and disappear.
Figure 9d–f provides more examples of such random noise. Notice that three new objects and two new objects appear in the dotted rectangles in frames 372 and 375, respectively. Similarly, some other objects appear on the building tops from frame 375 to frame 378, which are not presented here, due to space constraints. When the classifier proceeds to frame 379, the objects on building tops are correctly classified as non-vehicle—red dots in the rectangle in Figure 9f. These examples show that the classifier effectively distinguishes random noise in the detection results.
Our classifier makes decisions dynamically with the accumulation of tracklets. This can be verified by the examples in the circle in Figure 9b, where some true vehicles are marked with red dots. This does not mean that our classifier cannot make correct decisions; these vehicles are newly tracked, and their tracklets are not long enough to obtain a stable calculation of the features. With the accumulation of the tracklets, the calculation becomes stable, and the decision can be made correctly, as shown in the circle in Figure 9f, where all true vehicles are marked in green.
Apart from random noise, which cannot form long tracklets, some distractors do form long tracklets. In such a scenario, our classifier can also identify these false targets using the tracklet features discussed above. Figure 10 shows an instance in the dotted rectangle on the top of a building, which forms a long tracklet and is misidentified as a vehicle from frame 514 to frame 525. Although this situation persists for an extended period, the object is correctly identified as a non-vehicle when the algorithm proceeds to frame 530 in Figure 10d. This example shows that our classifier can make a correct decision once the tracklet is accumulated.
As mentioned before, our algorithm can generate a confidence map as a by-product during the tracklet classification. The confidence map is generated from the vehicle tracklets and used to adjust the thresholds adaptively. Figure 11 shows the phases of the confidence map generated from Dataset 1, at intervals of 100 frames. One can notice that the road regions are fully formed at frames 600 to 700 and remain almost unchanged afterward. This indicates that the confidence map can also be used as road ROIs to verify our detection results and to remove false targets outside these road ROIs. With the assistance of the confidence map, our algorithm can also remove false targets when the tracklet classification fails, thus guaranteeing detection accuracy.

3.3.2. Quantitative Evaluation

This section compares the detection results with the ground-truth images, and then the quantitative evaluation metrics are calculated. For comparison, we also implemented several other classic motion detection models and methods from previously published literature: (1) three-frame difference (Diff3): a classic moving object detection method that uses the difference image between consecutive frames [33]; (2) MOG2: an improved method based on GMM [34]; (3) ViBE [7]: a background model using an update mechanism composed of random substitutions and spatial diffusion; (4) AMS-DAT [14]: a novel moving vehicle detection approach using adaptive motion separation and difference accumulated trajectory; (5) NPBSM: a non-parametric background subtraction model [27]; (6) LCNN [13]: a moving vehicle detection method based on a background model and a lightweight convolutional neural network. For convenience, the proposed method is abbreviated as TFC (Tracklet Feature Classification).
The performance of different methods depends on the datasets’ characteristics and the videos’ content. An area threshold is often used to filter out noise to improve object detection accuracy. That is, objects smaller than the area threshold are removed. However, the performance of the Diff3, MOG2, and ViBE methods is significantly affected by the area threshold: a small threshold keeps much noise in the detection, resulting in low precision values, whereas a large threshold often omits true objects, leading to low recall values. To ensure fairness, the thresholds for the three methods were carefully adjusted to obtain the best trade-off between precision and recall (i.e., the best F1 score). NPBSM also depends on the area threshold, but it was set to a small value (3 pixels) to ensure a high recall value, because our method (i.e., TFC) took its results as input and required this high recall. The rest of the methods were also adjusted to their best performance trade-off.
The detection results of Datasets 1 and 2 are presented in Figure 12 and Figure 13, respectively, and the quantitative evaluation results are shown in Table 2. Note that the statistics in Table 2 are calculated over multiple frames, while Figure 12 and Figure 13 show only one frame of the results. Therefore, the scores may seem inconsistent with the results in some of these images. From the statistics in Table 2, one can see that TFC can achieve the highest F1 score on both datasets.
Dataset 1 contains tall buildings that cause motion parallax resembling moving vehicles. Even after removing much of the noise, Diff3 achieves fairly good recall but low precision, because this method is very susceptible to noise: it can detect subtle changes between frames and thus retrieves a large portion of the true vehicles but also introduces many false targets. ViBE has been reported to perform satisfactorily in detecting moving objects from ground surveillance videos, but it tends to omit many true vehicles in satellite-based videos, resulting in the lowest recall and F1 score. As a classical method, MOG2 achieves a precision, recall, and F1 score of around 70%. NPBSM achieves the highest recall (96.4%) but the lowest precision (48.6%). As mentioned earlier, NPBSM was not tuned to the best trade-off between precision and recall because our method needed its high-recall results as input, showing that our method can significantly improve its precision. Finally, AMS-DAT, LCNN, and TFC achieve precision and F1 scores of over 80%. TFC obtains the highest F1 score among the three methods, indicating the best overall performance, i.e., the best trade-off between precision and recall.
Dataset 2 does not contain tall buildings, but has many "distractors" from the parking lots and viaducts in the scene, introducing random noise. Moreover, some local distortions, as well as dim and blurred vehicles, make detection difficult. One can notice that Diff3's F1 score is the lowest among all the results and that its precision is poor, indicating its high sensitivity to noise. With the area threshold set to 3 pixels, MOG2 can achieve the highest precision (89.9%), but its recall is unsatisfactory (66%, the second-lowest value). ViBE again achieves the lowest recall and the second-lowest F1 score on this dataset. AMS-DAT obtains moderate precision, recall, and F1 score. Similarly, NPBSM achieves the highest recall but low precision. Both LCNN and TFC achieve precision, recall, and F1 scores of over 80%. Of the two, TFC has higher recall but lower precision; the F1 scores show that TFC achieves the best overall performance.

3.3.3. Analysis of Area Threshold

We have mentioned earlier that using area thresholds to filter out noise or false objects is not effective for detecting small moving vehicles from satellite-based videos. This is because the vehicles in the videos often resemble noise in size or shape. Removing noise by an area threshold can improve the precision to some extent, but it also removes true targets, lowering recall value. This can be confirmed by the following experiment. Among the background subtraction models used in the previous experiments, the NPBSM proved to have the best overall performance. Therefore, we chose the NPBSM method to demonstrate the results.
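The baseline area-threshold filtering evaluated below can be sketched with OpenCV's connected-component statistics; this is an illustrative assumption of the procedure, not the authors' exact implementation.

```python
import cv2
import numpy as np

def filter_by_area(foreground_mask, area_threshold):
    """Remove connected components whose area (in pixels) does not exceed the threshold."""
    num, labels, stats, _ = cv2.connectedComponentsWithStats(foreground_mask, connectivity=8)
    kept = np.zeros_like(foreground_mask)
    for label in range(1, num):                      # label 0 is the background
        if stats[label, cv2.CC_STAT_AREA] > area_threshold:
            kept[labels == label] = 255
    return kept
```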
In this experiment, the area threshold gradually increased from 1 pixel to 15 pixels, and detected objects smaller than each of these thresholds were filtered out. Thus, we obtained a series of detection results under different area thresholds, as shown in Figure 14. These detection results were quantitatively evaluated, and the assessments are presented in Table 3 (due to space limitation in the row, only assessments under odd-numbered thresholds are shown).
The first image in Figure 14 shows the detection results when objects less than, or equal to, 1 pixel are removed. Under this threshold, almost all true vehicles (objects in yellow rectangles) are correctly recognised, except for one (the object in the cyan rectangle). However, many distractors (objects in red rectangles) are also mistaken as true vehicles. With the increase in the area threshold, more red rectangles disappear and more cyan rectangles appear in the following images. This means that precision increases while recall decreases, as shown in Figure 15. In Figure 15, the precision and recall curves intersect at about 7.5 pixels, where the precision and recall reach the best compromise. The assessments in Table 3 also show that the F1 score reaches the highest value (about 83.3%) at the threshold of 7 or 8 pixels.
The experiment shows that high precision and recall cannot be obtained simultaneously by directly filtering out small objects via area thresholds. The highest F1 score the filtering method can obtain on Dataset 1 is 83.3%, which is lower than that achieved by TFC (89.5%). This indicates that the TFC method effectively improves detection precision while retaining high recall.

4. Discussion

The experimental results show that the frame difference method (Diff3) introduces too much noise, resulting in inferior accuracy. The ViBE method, as a recently proposed novel background subtraction model, works well with surveillance monitoring videos but not when applied to detect tiny moving objects from satellite-based videos. The MOG2 method achieves moderate performance. Both LCNN and TFC methods adopt the NPBSM as the preliminary detector and then eliminate false targets, but they use different strategies to remove them. LCNN uses a lightweight convolutional neural network that requires collecting samples and costs additional processing time, whereas TFC takes full advantage of motion information to identify false objects. Both AMS-DAT and TFC use motion information to eliminate false targets, but they adopt different strategies. AMS-DAT uses a straightforward method that accumulates the difference foreground to obtain moving trajectories. TFC applies complete temporal information and uses tracklet features to distinguish between true and false targets.
Our method uses a two-step strategy to obtain the final result. The accuracy of the final result depends on two aspects: high recall of the preliminary detection and the accuracy of the tracklet classifier. Hence, the basic premise of the TFC method is that including more false targets in the preliminary detection is more acceptable than missing true objects. This is because false targets can be filtered out through post-processing, but missed targets will never be retrieved again. Therefore, the algorithm should guarantee that all candidates can be recognised in the first step. The NPBSM used in our method can detect candidates with recall higher than 96%, fulfilling the task.
In the subsequent post-processing step, our method tracks the centroids of candidates using the HMA and LKOF, which is different from the accumulation strategy proposed by Chen et al. [14]. The edges of tall buildings tend to form tracklets similar to vehicles. The yellow and green rectangles in Figure 16a indicate edges that are incorrectly recognised as moving vehicles, and Figure 16b shows the tracklets extracted by accumulation, including incorrect tracklets (in the two rectangles) caused by building edges. In contrast, Figure 16c shows that our strategy can prevent the false targets from forming tracklets, yielding more accurate results. In addition, our method can obtain more complete tracklets than the accumulation method.
As for the detection performance of the methods involved in our experiments, a summary is drawn as follows: NPBSM has the highest recall (over 96%), indicating that it can retrieve almost all targets, but it also produces many false alarms, with the precision from 48 to 55%. Therefore, NPBSM has the advantage of high recall capability and can be used to generate initial detection in two-step methods, outperforming Diff3, MOG2, and ViBE. As two-step methods, AMS-DAT, LCNN, and TFC apply different strategies to eliminate false alarms. AMS-DAT uses a straightforward trajectory accumulation method to verify true targets and can achieve moderate precisions (78−85%) and recalls (68−78%). LCNN employs a lightweight convolutional neural network and achieves precisions from 84 to 90% and recalls of around 85%. TFC utilises tracklet features to discriminate between true and false targets and achieves precisions from 82 to 93% and recalls from 85 to 92%. On our datasets, TFC outperforms LCNN and AMS-DAT. Among all the methods in the experiments, TFC has the best overall performance, achieving the highest F1 score.

5. Conclusions

Satellite videos offer a new data source for monitoring moving vehicles for urban traffic management. Because of the tiny size and lack of distinctive appearance features of the vehicles, improving the accuracy of moving vehicle detection is challenging. This study proposed a two-step approach to improve detection accuracy. First, the Gaussian smoothing filtering is applied to suppress noise, and then the background subtraction model is used to obtain the preliminary detection results. Second, the Hungarian matching algorithm and Lucas–Kanade optical flow are used to generate tracklets, which are then used to discriminate between true and false targets by extracting tracklet features. Experiments on different datasets demonstrated that the proposed method could achieve both satisfactory precision and recall, outperforming some classical and other specific methods from recently published literature.
Our method takes full advantage of motion information and does not need any auxiliary data. Therefore, this method avoids much extra work, such as sample collection and training, which are indispensable and time-consuming in deep-learning-based methods. In addition, this method is easy to implement. The experiments showed that the algorithm could detect vehicles traveling at different speeds, indicating that the detection results are not affected by the vehicle speed as long as the tracklets can be tracked. It is important to note that our method depends on motion information and thus can only detect moving vehicles (it fails when vehicles stop at road intersections). Future work will be devoted to detecting both moving and stationary vehicles. Another limitation is that the method cannot recognise vehicle types, due to the low resolution and lack of colour and texture information. The only clue available is the target size, which can distinguish large vehicles (e.g., trucks and buses) from small ones (cars). This can be solved by extending this method to high-resolution data, such as unmanned aerial vehicle (UAV) videos.

Author Contributions

Conceptualisation, R.C.; Methodology, R.C.; Validation, X.L.; Writing—original draft, R.C.; Writing—review and editing, V.G.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 41471276.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the editors and reviewers for providing suggestions. The authors also thank Skybox Imaging and ChangGuang Satellite Technology for providing the satellite videos.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Kopsiaftis, G.; Karantzalos, K. Vehicle Detection and Traffic Density Monitoring from Very High Resolution Satellite Video Data. In Proceedings of the IGARSS, Milan, Italy, 26–31 July 2015; pp. 1881–1884.
2. Ao, W.; Fu, Y.; Hou, X.; Xu, F. Needles in a Haystack: Tracking City-Scale Moving Vehicles From Continuously Moving Satellite. IEEE Trans. Image Process. 2020, 29, 1944–1957.
3. Ahmadi, S.A.; Ghorbanian, A.; Mohammadzadeh, A. Moving Vehicle Detection, Tracking and Traffic Parameter Estimation from a Satellite Video: A Perspective on a Smarter City. Int. J. Remote Sens. 2019, 40, 8379–8394.
4. Shaikh, S.H.; Saeed, K.; Chaki, N. Moving Object Detection Approaches, Challenges and Object Tracking. In Moving Object Detection Using Background Subtraction; SpringerBriefs in Computer Science; Springer International Publishing: Cham, Switzerland, 2014; pp. 5–14. ISBN 978-3-319-07385-9.
5. Friedman, N.; Russell, S. Image Segmentation in Video Sequences: A Probabilistic Approach. arXiv 1997, arXiv:1302.1539.
6. Kim, K.; Chalidabhongse, T.H.; Harwood, D.; Davis, L. Background Modeling and Subtraction by Codebook Construction. In Proceedings of the 2004 International Conference on Image Processing (ICIP ’04), Singapore, 24–27 October 2004; Volume 5, pp. 3061–3064.
7. Barnich, O.; Droogenbroeck, M.V. ViBE: A Powerful Random Technique to Estimate the Background in Video Sequences. In Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan, 19–24 April 2009; pp. 945–948.
8. Zhang, W.; Jiao, L.; Liu, F.; Li, L.; Liu, X.; Liu, J. MBLT: Learning Motion and Background for Vehicle Tracking in Satellite Videos. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–15.
9. Piccardi, M. Background Subtraction Techniques: A Review. In Proceedings of the 2004 IEEE International Conference on Systems, Man and Cybernetics, The Hague, The Netherlands, 10–13 October 2004; Volume 4, pp. 3099–3104.
10. Hung, M.-H.; Pan, J.-S.; Hsieh, C.-H. A Fast Algorithm of Temporal Median Filter for Background Subtraction. J. Inf. Hiding Multimed. Signal Process. 2014, 5, 33–40.
11. Elgammal, A.; Harwood, D.; Davis, L. Non-Parametric Model for Background Subtraction. In Computer Vision—ECCV 2000; Vernon, D., Ed.; Springer: Berlin/Heidelberg, Germany, 2000; Volume 1843, pp. 751–767. ISBN 978-3-540-67686-7.
12. Yang, T.; Wang, X.; Yao, B.; Li, J.; Zhang, Y.; He, Z.; Duan, W. Small Moving Vehicle Detection in a Satellite Video of an Urban Area. Sensors 2016, 16, 1528.
13. Chen, R.; Li, X.; Li, S. A Lightweight CNN Model for Refining Moving Vehicle Detection From Satellite Videos. IEEE Access 2020, 8, 221897–221917.
14. Chen, X.; Sui, H.; Fang, J.; Zhou, M.; Wu, C. A Novel AMS-DAT Algorithm for Moving Vehicle Detection in a Satellite Video. IEEE Geosci. Remote Sens. Lett. 2020, 19, 3501505.
15. Shi, F.; Qiu, F.; Li, X.; Zhong, R.; Yang, C.; Tang, Y. Detecting and Tracking Moving Airplanes from Space Based on Normalized Frame Difference Labeling and Improved Similarity Measures. Remote Sens. 2020, 12, 3589.
16. Shu, M.; Zhong, Y.; Lv, P. Small Moving Vehicle Detection via Local Enhancement Fusion for Satellite Video. Int. J. Remote Sens. 2021, 42, 7189–7214.
17. Pflugfelder, R.; Weissenfeld, A.; Wagner, J. On Learning Vehicle Detection in Satellite Video. arXiv 2020, arXiv:2001.10900.
18. Zhang, J.; Zhang, J.; Jia, X. Learning Via Watching: A Weakly Supervised Moving Object Detector for Satellite Videos. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Brussels, Belgium, 11–16 July 2021; pp. 2333–2336.
19. Xu, A.; Wu, J.; Zhang, G.; Pan, S.; Wang, T.; Jang, Y.; Shen, X. Motion Detection in Satellite Video. J. Remote Sens. GIS 2017, 6, 194.
20. Hu, Z.; Yang, D.; Zhang, K.; Chen, Z. Object Tracking in Satellite Videos Based on Convolutional Regression Network With Appearance and Motion Features. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 783–793.
21. Feng, J.; Zeng, D.; Jia, X.; Zhang, X.; Li, J.; Liang, Y.; Jiao, L. Cross-Frame Keypoint-Based and Spatial Motion Information-Guided Networks for Moving Vehicle Detection and Tracking in Satellite Videos. ISPRS J. Photogramm. Remote Sens. 2021, 177, 116–130.
22. Pi, Z.; Jiao, L.; Liu, F.; Liu, X.; Li, L.; Hou, B.; Yang, S. Very Low-Resolution Moving Vehicle Detection in Satellite Videos. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5624517.
23. Lei, J.; Dong, Y.; Sui, H. Tiny Moving Vehicle Detection in Satellite Video with Constraints of Multiple Prior Information. Int. J. Remote Sens. 2021, 42, 4110–4125.
24. Liu, X.; Zhao, G.; Yao, J.; Qi, C. Background Subtraction Based on Low-Rank and Structured Sparse Decomposition. IEEE Trans. Image Process. 2015, 24, 2502–2514.
25. Zhang, J.; Jia, X.; Hu, J. Error Bounded Foreground and Background Modeling for Moving Object Detection in Satellite Videos. IEEE Trans. Geosci. Remote Sens. 2020, 58, 2659–2669.
26. Zhang, J.; Jia, X.; Hu, J.; Chanussot, J. Online Structured Sparsity-Based Moving-Object Detection From Satellite Videos. IEEE Trans. Geosci. Remote Sens. 2020, 58, 6420–6433.
27. Zivkovic, Z.; van der Heijden, F. Efficient Adaptive Density Estimation per Image Pixel for the Task of Background Subtraction. Pattern Recognit. Lett. 2006, 27, 773–780.
28. Kuhn, H.W. The Hungarian Method for the Assignment Problem. Nav. Res. Logist. Q. 1955, 2, 83–97.
29. Horn, B.K.P.; Schunck, B.G. Determining Optical Flow. Artif. Intell. 1981, 17, 185–203.
30. Luo, W.; Xing, J.; Milan, A.; Zhang, X.; Liu, W.; Zhao, X.; Kim, T.-K. Multiple Object Tracking: A Literature Review. arXiv 2017, arXiv:1409.7618.
31. Lucas, B.D.; Kanade, T. An Iterative Image Registration Technique with an Application to Stereo Vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence—Volume 2; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1981; pp. 674–679.
32. Goyal, K.; Singhai, J. Recursive-Learning-Based Moving Object Detection in Video with Dynamic Environment. Multimed. Tools Appl. 2021, 80, 1375–1386.
33. Zhang, Y.; Wang, X.; Qu, B. Three-Frame Difference Algorithm Research Based on Mathematical Morphology. Procedia Eng. 2012, 29, 2705–2709.
34. Zivkovic, Z.; van der Heijden, F. Recursive Unsupervised Learning of Finite Mixture Models. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 651–656.
Figure 1. The flow of the proposed method.
Figure 2. Distance matrix between the targets and detections. Sij measures the distance between the target Ti and the detection Dj.
Figure 3. Distance measurement between targets and detections. ds11 and ds12 are the spatial distances from T1 to D1 and D2, respectively. dp11 and dp12 are the directional distances from line V1 to D1 and D2, respectively.
Figure 4. Types of tracklets. Panels (a–c) show a vehicle's tracklet when traveling on a straight road, a bend, and a U-turn, respectively. Panel (d) shows an erratic tracklet tracked from a non-vehicle object.
Figure 5. Calculation of the directional change angle. P1, P2, P3, and P4 are points on the tracklet. β2 and β3 are two directional change angles formed by P1, P2, P3, and P4.
Figure 6. Decision tree of classification based on the attributes of the tracklet.
Figure 7. Creation of the confidence map. Panel (a) shows a confidence map created by overlapping traces with different confidence values. Panel (b) shows a confidence map generated from true vehicle tracklets, where the jet colourmap indicates confidence values from 0 (blue) to 1 (red).
Figure 8. Enlarged parts of the video data and manually annotated ground-truth vehicles. Image (a) shows the SkyBox satellite data [13], and (b) shows the ChangGuang satellite data.
Figure 9. Short tracklets of noise from the tops of buildings. Green dots indicate true vehicles and red dots depict false ones in the current frame.
Figure 10. An example of a long tracklet of a false target from the top of a building.
Figure 11. An example of a confidence map generated from tracklets.
Figure 12. Detection results on Dataset 1: yellow indicates correct targets, red shows false targets, and cyan depicts missed targets. Panels: (a) ground truth, (b) Diff3, (c) MOG2, (d) ViBE, (e) AMS-DAT, (f) NPBSM, (g) LCNN, and (h) TFC (the proposed method).
Figure 13. Detection results on Dataset 2: yellow indicates correct targets, red shows false targets, and cyan depicts missed targets. Panels: (a) ground truth, (b) Diff3, (c) MOG2, (d) ViBE, (e) AMS-DAT, (f) NPBSM, (g) LCNN, and (h) TFC (the proposed method).
Figure 14. Noise removal of NPBSM detection results under different area thresholds: yellow indicates correct targets, red shows false targets, and cyan depicts missed targets.
Figure 15. Trending curves of precision, recall, and F1 score under different area thresholds (in pixels). Increasing precision is accompanied by decreasing recall.
Figure 16. Comparison of the tracklets extracted using different methods. (a) The initially detected candidates of moving objects; (b) motion trajectories extracted by accumulation; (c) tracklets extracted using our method, in which the incorrect tracklets in the rectangles in (b) were removed.
Table 1. Statistics of DCAs for each tracklet type.

Type of Tracklet    DCA_min   DCA_max   Mean DCA   Std. DCA
(A) Straight        0.003     0.472     0.166      0.121
(B) Bend            0.015     0.891     0.279      0.191
(C) U-turn          0.038     0.715     0.315      0.218
(D) Non-vehicle     0.293     2.511     1.178      0.909
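The DCA statistics in Table 1 could, for instance, be derived from a tracklet as in the following sketch, which computes the angle between successive displacement vectors at each interior point of the tracklet (cf. Figure 5) and summarises them; the function name and the use of radians are our own assumptions.

import numpy as np

def dca_stats(tracklet):
    # tracklet: (N, 2) array of consecutive centroid positions, N >= 3.
    pts = np.asarray(tracklet, dtype=float)
    v = np.diff(pts, axis=0)  # displacement vectors between consecutive points
    dots = (v[:-1] * v[1:]).sum(axis=1)
    norms = np.linalg.norm(v[:-1], axis=1) * np.linalg.norm(v[1:], axis=1)
    # Angle (in radians) between successive displacement vectors, clipped for numerical safety.
    angles = np.arccos(np.clip(dots / np.maximum(norms, 1e-12), -1.0, 1.0))
    return angles.min(), angles.max(), angles.mean(), angles.std()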
Table 2. Evaluation of the detection results from different methods.

Data        Method       Precision (%)   Recall (%)   F1 Score (%)
Dataset 1   Diff3        48.8            87.2         62.5
            MOG2         68.0            72.0         69.0
            ViBE         57.0            62.7         60.0
            AMS-DAT      85.1            78.3         81.5
            NPBSM        48.6            96.4         64.6
            LCNN         90.3            84.2         87.2
            TFC (ours)   93.7            85.6         89.5
Dataset 2   Diff3        30.1            77.7         43.4
            MOG2         89.9            66.0         76.1
            ViBE         75.9            60.6         67.4
            AMS-DAT      77.9            67.6         72.4
            NPBSM        54.8            96.6         69.9
            LCNN         84.0            85.3         84.6
            TFC (ours)   81.8            91.8         86.5
Note: Bold indicates the highest score, and underline the lowest.
Table 3. Evaluation under different area thresholds.

Threshold (pixels)   1      3      5      7      9      11     13     15
Precision (%)        30.6   61.3   74.4   82.2   85.8   88.4   88.4   87.7
Recall (%)           98.5   94.4   90.3   94.4   77.6   67.8   53.9   42.2
F1 score (%)         46.7   74.3   81.6   83.3   81.5   76.8   67.0   57.0
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
