1. Introduction
Recent technological advancements have made it possible to collect large amounts of data on the movement patterns of people [1], vehicles [2], and animals. These data can be collected through sensors such as the global system for mobile communications (GSM), the global positioning system (GPS), and the Wi-Fi modules found in smartphones and cars, and they are useful in transportation planning for understanding travel behavior [3], which can be grouped into patterns or clusters [4]. Data mining techniques, such as classification, clustering, and association rule mining, are used to analyze these data and find valuable patterns related to time and space [5]. Clustering is used to group similar data points based on their characteristics. In the case of trajectory data, this means grouping together sets of spatiotemporal data points (i.e., location and time) with similar movement patterns. For example, clustering vehicle trajectory data can reveal patterns of rush hour traffic or common routes taken by vehicles [6].
Trajectory clustering allows researchers to discover information about the movement patterns of people, animals, and vehicles that would not be easily observable from individual data points [7]. By grouping similar data, researchers can identify patterns and trends in the data that would otherwise be difficult to detect. Additionally, trajectory clustering can also be used to extract meaning from the data. For example, by identifying movement patterns, researchers can gain insights into how people, animals, or vehicles use a particular area, such as which routes they prefer or which areas they avoid [8]. This information can then be used to make informed decisions, such as in transportation planning, where it can be used to optimize infrastructure and reduce traffic congestion.
However, calculating the similarity between trajectories is a challenging task, as different trajectories have different lengths and sampling frequencies. Researchers often use the Longest Common Subsequence (LCSS) [9] and Dynamic Time Warping (DTW) [10] to calculate the similarity between unequal-length trajectories, but this can be time-consuming. In addition, minimum total curvature techniques are used for the 2D interpolation of geophysical data to generate maps by computer; the resulting maps may not always be perfect but are generally sufficient to analyze spatially distributed data [11]. Ridge regression, proposed by [12], can be used to estimate the coefficients of a linear model that predicts the similarity between two trajectories based on their features or attributes. Moreover, the authors of a study on Polish urban areas, as cited in [13], discussed the challenges involved in sustainable transport development and offered potential solutions to mitigate short-term and long-term adverse impacts on urbanization.
In trajectory compression, trajectory clustering reduces the computing and storage resources consumed by trajectory data mining. Trajectory similarity measurement is the central part of trajectory clustering algorithms: before clustering, the similarity or distance between two distinct trajectories must be computed. Different similarity measurement methods have been used in previous research, each with its own characteristics, and some of them account for the underlying road network space. The Euclidean distance is no longer suitable for calculating the distance between two points in road network space; instead, the length of the shortest path between two points is used to describe their distance in the road network. Similarity measures based on the Euclidean distance also lead to poor results, mainly because trajectories have different lengths [14]. Other trajectory similarity measures for road network space, such as the Hausdorff and Frechet distances, have also been introduced; they can serve as clustering similarity (distance) measures but are impractical for comparing whole trajectories because of their high computational cost. Spatiotemporal trajectories are composed of spatial location points with time information, and different trajectories have different lengths and sampling frequencies. Therefore, methods suited only to trajectories of equal length, such as the average of pair distance (APD), are no longer applicable. In previous articles, researchers often used LCSS and DTW to calculate the similarity between unequal-length trajectories; however, road network distance calculations are time-consuming, and most spacetime trajectory similarity measures, such as LCSS, require computing the shortest path between each pair of spacetime points to obtain the similarity between the spacetime trajectories. This is not efficient enough in the era of big data. The main goal of this research is to provide efficient clustering of trajectories in network space, because trajectory clustering in road network space with traditional methods is inefficient. Here, we introduce a novel method in which we construct similarity measures using spatiotemporal proximity analysis with LCSS.
Map matching, a preprocessing step, is needed to map each point of a trajectory onto a route in the road network. Focusing only on the available vehicle trajectory data is of limited value if the associated network structure is not taken into account. In this study, we focus on the clustering problem, whose resolution requires an understanding of the underlying road network. In developing a similarity measure, we take into account that vehicle movements are constrained by the road network, even though our goal is to map trajectories to known road segments. We assume that trajectories on the same route overlap due to this network constraint (at a macroscopic scale), which helps locate the LCSS between two trajectory portions. To assess the proximity, compatibility, and degree of synchronization between two trajectories, we analyze how closely they overlap. This LCSS-based similarity computation is conceptually equivalent to the shortest route for fully network-mapped objects and to the network distance for free-moving objects. In addition, to speed up the similarity computation over time, we use a spatiotemporal buffering approach. Clusters of comparable trajectories are then created using two clustering techniques: hierarchical clustering and density-based clustering.
The main objective of this research is to provide efficient clustering of trajectories in the network space. Most existing methods require a complete similarity matrix to quantify the similarities among users' trajectories in the dataset [15]. The size of such a similarity matrix grows quadratically with the number of users in the trajectory dataset, leading to significant computational overhead.
The main challenge with trajectory clustering is the significant computational overhead required to calculate the similarity matrix for a big dataset with many user trajectories. This becomes even more severe for network-constrained trajectories, which require additional shortest-path calculations to compute trajectory similarities in road networks. As a result, most existing methods can perform the clustering analysis only for a tiny dataset of network-constrained trajectories. Therefore, efficiently performing clustering analysis for network-constrained trajectories is a critical problem that needs to be solved, given the explosive growth of network-constrained trajectory data in real-time applications.
The main contributions of this work are threefold:
We propose a method to speed up region queries in DBSCAN and reduce the time complexity of cluster generation.
Our proposed method uses a network-constrained spacetime path with a spacetime buffer and a network spacetime buffer in CLR space.
We describe the proposed density-based clustering algorithm, which overcomes this time complexity and speeds up the region query.
The remainder of the paper is structured as follows. Section 2 presents a literature review of existing clustering algorithms and similarity measures for moving objects. Section 3 describes the problem that the research aims to address, namely the inefficiency of existing methods for calculating the similarity between trajectories of different lengths. Section 4 introduces the density-based spatiotemporal proximal analysis method, which is used to construct similarity measures for spatiotemporal data. The density-based clustering algorithm, which utilizes the method from Section 4, is presented in Section 5. In Section 6, the process of data preparation and preprocessing is discussed. The effectiveness of the proposed methods is evaluated through experiments on trajectory data in Section 7. Finally, in Section 8, we conclude the findings of this study.
2. Literature Review
There are three main domains of research in the field of movement data analysis that involve trajectory clustering: time geography, moving object databases, and quantitative data analysis methods for movement [16]. Many clustering methods have been proposed throughout the years that can be used in trajectory clustering. The key elements in trajectory clustering analysis are cluster algorithms and similarity measures [16]. Yuan et al. [17] presented a review on the development and trends of moving object clustering algorithms. Yuan et al. [17] and Han et al. [18] presented clustering algorithms in five categories, including partitioning methods, hierarchical methods (e.g., BIRCH), density-based methods (e.g., DBSCAN), grid-based methods (e.g., STING), and model-based clustering algorithms.
Partition-based methods divide the trajectory dataset into a predetermined number of clusters and assign each trajectory to the cluster with the closest centroid, e.g., k-means and k-medoids [19]. Hierarchical algorithms build a hierarchy of clusters, where each cluster is formed by merging two or more smaller clusters [18]. They are simple, but they can be difficult to use because it is not always clear whether to combine or split points. They are further divided into three types [20]: bottom-up or condensation algorithms [21], top-down or decomposition algorithms [22], and compound algorithms [23]. On the other hand, density-based clustering algorithms group data points that are densely packed together and separated from other points by a low-density region. These algorithms can handle clusters of any shape, unlike distance-based clustering algorithms, which can only handle spherical clusters [24]. Examples of density-based clustering algorithms include DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points To Identify the Clustering Structure) [24].
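As an illustration of how a density-based method consumes pairwise trajectory distances, the following minimal sketch runs scikit-learn's DBSCAN with metric='precomputed' on a small, hypothetical distance matrix; the matrix values and parameter settings are placeholders for illustration, not results from this study.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical pairwise trajectory distances (symmetric, zero diagonal).
D = np.array([
    [0.0, 0.2, 0.9, 1.0],
    [0.2, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.1],
    [1.0, 0.9, 0.1, 0.0],
])

# metric='precomputed' tells DBSCAN to treat D as distances directly.
labels = DBSCAN(eps=0.3, min_samples=2, metric="precomputed").fit_predict(D)
print(labels)  # [0 0 1 1]; a label of -1 would mark a noise trajectory
```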
Grid-based clustering techniques employ a multiresolution grid data structure to quantize the data space into a grid structure. Clustering is performed on the grid, and its effectiveness depends on how well the data are compressed into the grid. These algorithms execute quickly and depend only on the number of cells in each dimension of the quantized space; processing time is independent of the number of data points. STING (STatistical INformation Grid) and CLIQUE (CLustering In QUEst) are two examples of grid-based clustering algorithms [25].
Model-based clustering algorithms use statistical models to describe the data and cluster the data according to the model [26]. Examples of model-based clustering algorithms include Gaussian mixture models and hidden Markov models.
Each of these algorithms has its own strengths and weaknesses, and the choice among them depends on the problem statement and the requirements of the users. Several similarity measures have been proposed to determine the similarity between two trajectories and can be used as the basis of clustering algorithms. Some of the common similarity measures are the Euclidean distance, DTW, the Frechet distance, the Hausdorff distance, and LCSS [17].
The Euclidean distance calculates the straight-line distance between two points in space. It can be used to measure the distance between two trajectories by summing the distances between corresponding points on the two trajectories [27]. DTW, in contrast, calculates the similarity between two time series by aligning the time points of the series and minimizing the accumulated distance between them. It is often used for trajectory clustering because it is able to handle variations in speed and direction [28]. The Frechet distance calculates the minimum distance required to travel from one trajectory to another, taking into account the path and order of the points on the trajectories. It is often used for trajectory clustering because it is able to handle nonlinear movements [29].
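To make the alignment idea behind DTW concrete, here is a minimal dynamic-programming sketch over two 2D point sequences; the function name and the use of Euclidean point distances are illustrative choices, not part of the original study.

```python
import numpy as np

def dtw_distance(A: np.ndarray, B: np.ndarray) -> float:
    """Classic DTW between two trajectories given as (n, 2) arrays of points."""
    n, m = len(A), len(B)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])  # pointwise distance
            # Extend the cheapest of the three admissible warping steps.
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m])

A = np.array([[0, 0], [1, 0], [2, 1]])
B = np.array([[0, 0], [1, 1], [2, 1], [3, 1]])
print(dtw_distance(A, B))
```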
The Hausdorff distance calculates the maximum distance between two trajectories and is often used for trajectory clustering because it is able to handle large variations in shape [30]. LCSS calculates the similarity between two sequences by identifying the longest common subsequence shared by the two sequences. It is often used for trajectory clustering because it is able to handle variations in the order of the points on the trajectories [31].
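For reference, the symmetric Hausdorff distance between two point sets can be computed from SciPy's directed Hausdorff routine, as in the short sketch below; the example coordinates are arbitrary. A threshold-based LCSS sketch is given later, after the recursion in Section 3.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

traj_a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.5]])
traj_b = np.array([[0.0, 0.2], [1.1, 0.1], [2.0, 0.9]])

# Symmetric Hausdorff distance = max of the two directed distances.
h_ab = directed_hausdorff(traj_a, traj_b)[0]
h_ba = directed_hausdorff(traj_b, traj_a)[0]
print(max(h_ab, h_ba))
```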
This study addresses trajectory clustering without road network information. Although we do map trajectories to known road segments, while creating a similarity measure we take into consideration the fact that vehicle movements are limited by the road network. More specifically, we measure how similar two trajectories are by gauging the degree to which they overlap, assuming that trajectories on the same route entirely overlap at a macroscopic scale. This method sits between the shortest route for completely network-mapped objects and the network distance for freely moving objects. To improve the efficiency of our similarity calculation over time, we utilize the spatiotemporal buffering technique. After calculating the similarity between trajectories using LCSS, we use hierarchical and density-based clustering to group similar trajectories into clusters. This approach does not require the nontrivial effort of map matching, which is the process of mapping each point in a road network route to the corresponding network topology.
3. Problem Statement
Figure 1 illustrates the method proposed in this research to address the trajectory clustering problem. The proposed method constructs similarity measures using spatiotemporal proximity analysis (SP-Analysis) [16] with LCSS.
In spatially organized road network space, individuals cannot travel freely in metropolitan regions. A road network may be modeled as a directed graph G = (N, L), where N and L represent the nodes and links, respectively. Each link l in L has a unique set of properties, such as its link identity, travel time, and link length. A link l is made up of a beginning node, an ending node, and a collection of intermediate vertices. The geometry of the link may be characterized by the set of connected segments that join each pair of the link's neighboring vertices (See Figure 2).
A network-constrained spacetime path may be seen as a collection of control points and route segments in a road network. For the road network to retain its route topology, control points should include both observations and network nodes. The position of an individual at any time t between two consecutive control points (lr_i, t_i) and (lr_{i+1}, t_{i+1}) on the same link can be computed using linear interpolation as lr(t) = lr_i + ((t - t_i) / (t_{i+1} - t_i)) (lr_{i+1} - lr_i), where lr is the location's relative position on the link.
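A minimal sketch of this interpolation step is shown below; the control-point structure (time, relative position on a link) and the function name are illustrative assumptions rather than the paper's exact data model.

```python
def interpolate_position(t: float, cp0: tuple, cp1: tuple) -> float:
    """Linearly interpolate the relative position on a link at time t.

    cp0 and cp1 are control points given as (time, relative_position),
    with cp0[0] <= t <= cp1[0] and both points on the same link.
    """
    t0, lr0 = cp0
    t1, lr1 = cp1
    if t1 == t0:
        return lr0  # degenerate segment: no movement in time
    ratio = (t - t0) / (t1 - t0)
    return lr0 + ratio * (lr1 - lr0)

# Halfway in time between the control points -> halfway along the link span.
print(interpolate_position(15.0, (10.0, 0.2), (20.0, 0.8)))  # 0.5
```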
The similarity distance metric is one of the most important components of a clustering algorithm. The similarity or distance between two different trajectories must be assessed before they can be grouped into clusters. Multiple similarity metrics exist, depending on the intended usage; some are appropriate for trajectories of varying durations.
With the constraint of the shortest path across the road network, several definitions of trajectory similarity (such as destination-based, origin-based, route-based, etc.) may be made in traffic analysis. Because the movement of trajectories is restricted by the road network, one approach is to look at whether two trajectories overlap or intersect; if they traveled the same piece of the route, parts of the trajectories would be fully overlapping (at a macroscopic scale). We calculate the similarity distance between the trajectories and the overlapping portions of the trajectories using the LCSS technique. The LCSS technique, which was developed in the related string domain, finds the longest possible common subsequence of two trajectories. Given two strings of varying lengths, the goal is to identify a sequence of elements that appears in both strings from left to right. By allowing two sequences to stretch without changing the order of their elements, yet allowing certain elements to remain unmatched, the LCSS model can match two sequences [32]. Many studies have also used the LCSS technique in the context of trajectory clustering. Typically, the LCSS is solved recursively. Two trajectory points are deemed matching when their network distance NTdist is smaller than the provided distance threshold and the difference between their timestamps is smaller than the time threshold. For two trajectory segments A and B of lengths m and n, LCSS(A, B) = 0 if m = 0 or n = 0; if the first points of A and B match, a pair of trajectory points is deemed similar and the LCSS value is raised by 1, i.e., LCSS(A, B) = 1 + LCSS(Rest(A), Rest(B)), where Rest(.) denotes the segment without its first point; otherwise, LCSS(A, B) = max(LCSS(Rest(A), B), LCSS(A, Rest(B))). The longest common subsequence may thus be determined recursively whenever neither trajectory is empty.
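The following dynamic-programming sketch computes this threshold-based LCSS for two timestamped trajectories; using straight-line distance in place of the paper's network distance (NTdist), and the particular threshold values, are simplifying assumptions for illustration.

```python
import math

def lcss(traj_a, traj_b, dist_eps=200.0, time_eps=10.0):
    """LCSS between trajectories given as lists of (x, y, t) tuples.

    Two points match when their spatial distance is below dist_eps and
    their timestamps differ by less than time_eps.
    """
    m, n = len(traj_a), len(traj_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            xa, ya, ta = traj_a[i - 1]
            xb, yb, tb = traj_b[j - 1]
            if math.hypot(xa - xb, ya - yb) < dist_eps and abs(ta - tb) < time_eps:
                dp[i][j] = dp[i - 1][j - 1] + 1  # matched pair extends the subsequence
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

a = [(0, 0, 0), (100, 0, 5), (300, 0, 12)]
b = [(50, 0, 2), (120, 0, 6), (900, 0, 13)]
print(lcss(a, b))  # 2 matched point pairs under the thresholds
```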
The two user-provided input parameters used in density-based clustering determine the number of clusters. These two input parameters are ε and MinClrs, where the former specifies the neighborhood's minimum similarity and the latter describes the minimum number of neighbors that a trajectory must have in order to be considered a core object within the given range. All of the trajectories in the vicinity of the core object are edge objects, while the other trajectories are noise objects. Using the LCSS, the distance between trajectories is determined by comparing how similar they are. The LCSS operation is used to assess the ε-neighbourhood of a trajectory, i.e., to find the trajectories whose corresponding similarity value is higher than 0; among these, a trajectory belongs to the ε-neighbourhood if its similarity value is greater than ε. A trajectory p is directly density-reachable from a trajectory q if p lies in the ε-neighbourhood of q and q has at least MinClrs neighbors, which is the fundamental requirement for a core trajectory. If there is a chain of trajectories such that each one is directly density-reachable from the previous one, then all trajectories in the chain must be core trajectories, with the potential exception of the last one. If there is a trajectory o such that both p and q are density-reachable from o in terms of ε and MinClrs, then the trajectories p and q are density-connected to one another.
Most proposed methods require computing a complete similarity matrix to measure the similarities between the trajectories of all users in the dataset. However, the size of such a similarity matrix grows quadratically with the number of users in the trajectory dataset. This results in significant computational overhead when calculating the similarity matrix for a large dataset with many user trajectories. The overhead is even more severe for network-constrained trajectories, for which additional shortest-path calculations are needed to measure trajectory similarities in road networks. Consequently, most current approaches can only perform clustering analysis for a small dataset of network-constrained trajectories. Therefore, efficiently implementing clustering analysis for network-constrained trajectories becomes an urgent problem to solve, given the rapid growth of network-constrained trajectory data in real applications.
4. Density-Based Spatiotemporal Proximal Analysis
The proposed density-based clustering algorithm is developed based on spatiotemporal buffering and overlapping operations, so we first need the spatiotemporal buffer under the network constraint. For a given trajectory, the proximal spatiotemporal trajectories in terms of network distance can be identified efficiently so that their similarities are calculated, while other trajectories with large spatiotemporal distances can be neglected in the similarity calculation.
4.1. Spatiotemporal Buffer on the Network Constraint
The network spacetime buffer results from continually building forward and backward spatial buffers for every control point along the journey across the road network. The network locations that satisfy the starting condition form a forward spatial buffer, which may be determined by a shortest forward route search using the control point as the node of origin. The network locations that satisfy the latter condition form a backward spatial buffer, which may be determined by using the control point as the node of destination in the shortest route search, as illustrated in Figure 3. Creating the forward and backward spacetime buffers of a segment only requires the forward and backward spatial buffers at its two control points. Once these buffers are generated, the network spacetime buffer can be accurately generated for any segment of the spacetime path; it is composed of a set of polygons over the traversed links, where each polygon is the spacetime polygon over one link [16].
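As a rough sketch of how such a forward (or backward) spatial buffer could be obtained on a road graph, the snippet below collects all nodes reachable within a distance cutoff from a control point using a shortest-path search; the toy graph, the edge attribute name "length", and the cutoff value are illustrative assumptions, not the paper's implementation.

```python
import networkx as nx

# Toy directed road graph with link lengths in meters (illustrative only).
G = nx.DiGraph()
G.add_weighted_edges_from(
    [("a", "b", 120.0), ("b", "c", 80.0), ("c", "d", 200.0), ("b", "e", 300.0)],
    weight="length",
)

def forward_spatial_buffer(graph, origin, cutoff):
    """Nodes reachable from `origin` within `cutoff` network distance."""
    lengths = nx.single_source_dijkstra_path_length(
        graph, origin, cutoff=cutoff, weight="length"
    )
    return set(lengths)

def backward_spatial_buffer(graph, destination, cutoff):
    """Nodes that can reach `destination` within `cutoff` network distance."""
    return forward_spatial_buffer(graph.reverse(copy=False), destination, cutoff)

print(forward_spatial_buffer(G, "a", 250.0))   # {'a', 'b', 'c'}
print(backward_spatial_buffer(G, "c", 250.0))  # {'a', 'b', 'c'}
```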
4.2. Spatiotemporal Analysis in CLR Space
We manage the moving object indexes using the Compressed Linear Reference (CLR) [33] approach and convert 3D road network-constrained time geographic entities in (x, y, t) space to 2D entities in CLR space. The link identity lid, a distinct positive integer, and the linear reference lr are integrated into a single real value clr. The linear reference lr is an equivalent one-to-one representation of the geographical position (x, y) in the road network. The clr value may be readily converted back to a location using dynamic subdivision. Every point in the road network may thus be represented as a single 2D point (clr, t), and this (clr, t) space is known as the CLR space.
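A plausible way to pack a link identity and a relative position into one real value, in the spirit of the CLR idea, is sketched below; the exact encoding used in the paper (and in [33]) is not reproduced here, so the formula clr = lid + lr, with lr normalized to [0, 1), is purely an illustrative assumption.

```python
def to_clr(lid: int, lr: float) -> float:
    """Encode (link id, relative position in [0, 1)) as one real value.

    Illustrative encoding only: integer part = link id, fractional part = lr.
    """
    assert 0.0 <= lr < 1.0
    return lid + lr

def from_clr(clr: float) -> tuple:
    """Recover (link id, relative position) from the packed value."""
    lid = int(clr)
    return lid, clr - lid

# A spacetime point on link 1024, 35% along the link, at t = 360 s,
# becomes the 2D point (clr, t) in CLR space.
clr = to_clr(1024, 0.35)
print((clr, 360.0), from_clr(clr))
```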
The spacetime path is transformed into CLR space as a set of disjoint LineString elements (polylines), each corresponding to a portion of the path in 3D space and representing the continuous movement of an individual on a particular link. Each LineString is a collection of segments, where each segment is a straight line in CLR space connecting two sequential control points on the same link (See Figure 4).
In CLR space, the network-constrained spacetime path is an equivalent representation of the path in (x, y, t) space [34]. Following previous work [34], we transform the spacetime buffer into CLR space by converting each spacetime location (x, y, t) to (clr, t). As shown in Figure 4, any network spacetime buffer can be transformed by converting each of its spacetime polygons into CLR space.
For the path segment upon a link, the network spacetime buffer is the union of the forward and backward buffers of that segment. Firstly, the forward buffer is transformed into CLR space, case by case, so that each of its spacetime polygons over a link is mapped to a unique polygon in CLR space. Secondly, the backward buffer is transformed into CLR space in the same way, again yielding a unique polygon per link in CLR space. The spacetime buffer in CLR space is then obtained as the union of these transformed polygons, and the network spacetime buffer in CLR space is an equivalent representation of the buffer in (x, y, t) space.
In CLR space, it is simple to implement the spacetime buffering and overlapping operations for SP-Analysis of movement data. In the spatial database (SQL Server), all the spacetime paths are transformed into CLR space and then stored, indexed, and searched as polyline entities. The multipolygon that represents the spacetime buffer along a route in CLR space is created and represented in CLR space, and the complex 3D spatiotemporal buffer-path intersection searches are implemented directly as spatial queries, i.e., polygon-polyline intersections in the spatial database.
To efficiently calculate the similarity between spatiotemporal trajectory data and moving object buffers, we use SP-Analysis and LCSS. First, we calculate the distance matrix D between all trajectory data using SP-Analysis. To calculate the LCSS for each trajectory, we use the overlapping spacetime operation to reduce computation time. We convert each trajectory from 3D space into CLR space and create a buffer with a given spatial tolerance along each trajectory. Then, we find the similarity measure between trajectories using a spatiotemporal buffer query. Because we perform the similarity analysis on spatiotemporal trajectory data, we compare the time duration of the LCSS rather than the number of matched elements.
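The pruning idea, buffering a trajectory in the 2D CLR plane and only running LCSS against trajectories whose geometry intersects that buffer, can be sketched with Shapely as below; the tolerance value and the toy (clr, t) coordinates are placeholders, and real coordinates would come from the CLR conversion described above.

```python
from shapely.geometry import LineString

def candidate_pairs(trajectories, tolerance):
    """Return index pairs whose CLR polylines come within `tolerance` of each other.

    `trajectories` is a list of point sequences in (clr, t) coordinates.
    Only these candidate pairs need a full LCSS computation.
    """
    lines = [LineString(pts) for pts in trajectories]
    buffers = [line.buffer(tolerance) for line in lines]
    pairs = []
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            if buffers[i].intersects(lines[j]):  # overlap test replaces a full scan
                pairs.append((i, j))
    return pairs

trajs = [
    [(0.0, 0.0), (0.5, 60.0), (1.2, 120.0)],
    [(0.1, 5.0), (0.6, 70.0), (1.3, 130.0)],
    [(5.0, 0.0), (5.8, 60.0)],
]
print(candidate_pairs(trajs, tolerance=0.3))  # [(0, 1)]
```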
5. Proposed Density-Based Clustering Algorithm
The density-based clustering method handles the clustering of point sets well. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is one of the most commonly used clustering methods in many applications. DBSCAN determines the number of clusters according to two input parameters provided by users. These two input parameters are ε and MinClrs, where the former specifies the minimum similarity of the neighborhood, while the latter represents the minimum number of neighbors a trajectory must have to be a core object (which here means a core trajectory) in the specified range. The trajectories within the core object's neighborhood are all edge objects, and the remaining trajectories are noise objects. When DBSCAN is applied to cluster spatiotemporal moving objects, the objects that need to be clustered are converted from spatial point data to spatiotemporal trajectory data. A similarity calculation between trajectories, such as LCSS, replaces the distance calculation between points. The corresponding input parameters of DBSCAN are the minimum similarity threshold and the required number of trajectories that meet the minimum similarity threshold.
Unlike calculating the distance between point pairs in Euclidean space, the similarity calculation of trajectory data in network space is very time-consuming. In addition to the effect of the trajectory sequence length on the similarity calculation, the calculation of the road network distance, i.e., the shortest path, is also more time-consuming than the Euclidean distance. In the original DBSCAN algorithm, the similarity matrix between trajectories is often computed in advance by a brute-force search, resulting in inefficiency. Therefore, SP-Analysis is applied to calculate the similarity.
Based on the above elucidation, the following definitions are given. It should be noted that, in the following, a trajectory is a continuous spacetime path represented with the CLR technique, and the trajectory similarity is calculated using LCSS:
Definition 1. ε-neighbourhood of a trajectory.
First, according to the spatiotemporal buffer of a trajectory T, perform the LCSS operation to determine the trajectories whose corresponding similarity value is greater than 0, and then further determine that the trajectories whose similarity value is greater than ε form the ε-neighbourhood of the trajectory, denoted by N_ε(T).
Definition 2. directly density-reachable.
A trajectory p is directly density-reachable from a trajectory q with respect to ε and MinClrs if p belongs to N_ε(q) and the cardinality of N_ε(q) is at least MinClrs.
Definition 3. density-reachable.
If there is a chain of trajectories such that each trajectory is directly density-reachable from the previous one, then the last trajectory is density-reachable from the first; all of the trajectories in the chain must be core objects, with the potential exception of the last one.
Definition 4. density-connected.
A trajectory p is density-connected to a trajectory q with respect to ε and MinClrs if there is a trajectory o such that both p and q are density-reachable from o with respect to ε and MinClrs.
According to the above definitions, DBSCAN with SP-Analysis can efficiently generate trajectory data clusters. A trajectory data cluster is a maximal density-connected set of trajectories. In the clustering process, the trajectory data are divided into three types: core objects, edge objects, and noise objects. Core objects are trajectories whose ε-neighbourhood contains at least MinClrs trajectories. Edge objects are trajectories that are directly density-reachable from a core object but are not core objects themselves. The trajectories in a trajectory data cluster are either core objects or edge objects. When a core object is found for the first time, determine whether the trajectories in its ε-neighbourhood are themselves core objects. If a neighbor is not a core object, mark it as an edge object and add it to the cluster of the core object. If it is a core object, add it to the cluster and expand the trajectory data cluster with its ε-neighbourhood. A noise object is a trajectory that is neither a core nor an edge object. The specific DBSCAN trajectory clustering algorithm based on LCSS with SP-Analysis is as follows: the distance matrix D, ε, and MinClrs are the input parameters, with ε and MinClrs serving as the determination conditions for core objects.
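A compact sketch of this similarity-based DBSCAN variant is given below; it treats a precomputed LCSS similarity matrix as input and uses ε as a minimum-similarity threshold and MinClrs as the minimum neighborhood size. The matrix values and parameters are hypothetical, and the buffer-based pruning of SP-Analysis is omitted for brevity.

```python
import numpy as np

def dbscan_similarity(S, eps, min_clrs):
    """DBSCAN over a similarity matrix S: neighbors are entries with S >= eps.

    Returns a label per trajectory; -1 marks noise objects.
    """
    n = len(S)
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        neighbors = [j for j in range(n) if j != i and S[i, j] >= eps]
        if len(neighbors) < min_clrs:
            continue  # not a core trajectory (may later become an edge object)
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                j_neighbors = [k for k in range(n) if k != j and S[j, k] >= eps]
                if len(j_neighbors) >= min_clrs:
                    queue.extend(j_neighbors)  # expand through core trajectories only
        cluster += 1
    return labels

# Hypothetical LCSS-based similarities for five trajectories.
S = np.array([
    [1.0, 0.8, 0.7, 0.1, 0.0],
    [0.8, 1.0, 0.9, 0.0, 0.1],
    [0.7, 0.9, 1.0, 0.1, 0.0],
    [0.1, 0.0, 0.1, 1.0, 0.2],
    [0.0, 0.1, 0.0, 0.2, 1.0],
])
print(dbscan_similarity(S, eps=0.6, min_clrs=2))  # [0 0 0 -1 -1]
```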
By using the previously proposed algorithm [16] and the spatiotemporal buffer technique in CLR space, the efficiency of the clustering increases: we do not need to calculate the similarity between every pair of trajectories in the dataset. With the conventional approach, the size of the similarity matrix grows quadratically with the number of users in the trajectory dataset, which takes a lot of time. Likewise, we only perform the overlapping operation, so we do not need to find shortest paths over the whole network, which is the main cause of the high computational cost. We only find the trajectories within the buffer, a technique unique to CLR space, and generate the clusters using DBSCAN. The implementation technique that affects the clustering performance is the buffer along CLR space, which helps curb the rapid growth of the computation with the number of trajectories in the dataset. Using this approach, the clustering benefits from a sped-up region query.
Extension to the Hierarchy-Based Clustering Approach
Hierarchical clustering (also known as hierarchical cluster analysis or HCA) is a cluster analysis method that can create a cluster hierarchy of point sets. The agglomerative (bottom-up) hierarchical clustering algorithm [17] is commonly used in many applications. Each data object begins in its own cluster, and as one moves up the hierarchy, pairs of clusters are merged based on distance/similarity. A dendrogram usually represents the results of hierarchical clustering.
The number of input parameters depends on usage; here, three input parameters are used: the number of required clusters or the minimum similarity threshold Eps, the linkage type (ward, maximum or complete, average, and single), and the similarity matrix. When the agglomerative hierarchical clustering algorithm is applied to the clustering of spatiotemporal moving objects, the objects that need to be clustered are converted from spatial point data to spatiotemporal trajectory data. A similarity calculation between trajectories, such as LCSS, replaces the distance calculation between points.
The linkage criterion calculates the distance between clusters as a function of the pairwise distances between their members. In the following, u[i] and v[j] denote the spatiotemporal trajectory similarity points in cluster u and cluster v, respectively.
Single linkage is used to find the minimum distance and is known as the nearest point algorithm: d(u, v) = min(dist(u[i], v[j])).
Complete linkage is used to find the maximum distance and is known as the farthest point algorithm: d(u, v) = max(dist(u[i], v[j])).
Average linkage is used to find the average distance and is also known as the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) algorithm: d(u, v) = sum over i, j of dist(u[i], v[j]) / (|u| |v|), where |u| and |v| are the cardinalities of clusters u and v, respectively.
Ward linkage is used in the Ward variance minimization algorithm: d(u, v) = sqrt( ((|v| + |s|) / T) d(v, s)^2 + ((|v| + |t|) / T) d(v, t)^2 - (|v| / T) d(s, t)^2 ), where u is the newly joined cluster consisting of clusters s and t, v is an unused cluster in the dataset, T = |v| + |s| + |t|, and |.| is the cardinality of its argument. This is also known as the incremental algorithm.
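These linkage criteria are implemented in SciPy; the sketch below builds a ward-linkage hierarchy from a small, hypothetical condensed distance matrix (as could be derived from LCSS similarities) and cuts it into flat clusters. The distance values and the cutoff threshold are placeholders for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical pairwise trajectory distances (e.g., 1 - normalized LCSS similarity).
D = np.array([
    [0.00, 0.10, 0.15, 0.90],
    [0.10, 0.00, 0.12, 0.85],
    [0.15, 0.12, 0.00, 0.95],
    [0.90, 0.85, 0.95, 0.00],
])

condensed = squareform(D)              # linkage expects a condensed distance vector
Z = linkage(condensed, method="ward")  # ward variance minimization
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)  # e.g., [1 1 1 2]
```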
6. Data Preparation and Preprocessing
6.1. Study Area and Datasets
For this work, we used Microsoft Research Asia's GeoLife dataset of Beijing, the capital of China. Figure 5 gives an overview of the road network in Beijing, which comprises 111,544 nodes and 152,217 directed links. Each link in the network is identified by a unique integer between 1 and 152,217. This road network serves as the reference framework for converting locations from (x, y, t) space to CLR space and vice versa. The data used in this study were collected over a period of five years (2007-2012) from 182 participants.
A GPS trajectory in this dataset is defined as a series of time-stamped points, each containing latitude, longitude, and altitude. The full dataset consists of approximately 17,000 trajectories with different transportation modes, covering a total period of 50,000 h, from which more than 30,000 trajectory transportation modes are car and taxi. These trajectories were recorded by various GPS loggers and phones with a range of sampling rates. In this dataset, 91.5% of the trajectories are recorded in a dense representation, for example, every one to five seconds or every five to ten meters per point. The dataset covers a wide range of outdoor activities. These trajectory data have also been used in other study areas, such as accessibility pattern extraction, consumer behavior identification, location-based social media platforms, information security, and location recommendation, which demonstrates their stability, correctness, and coherence.
6.2. Trajectory Data Preprocessing
6.2.1. Map Matching
Due to GPS positioning and digital road network flaws, taxi GPS observations may not lie exactly on road network links, and because of the very low sampling rate a taxi may have traveled over many network links between subsequent observations. Hence, a multicriteria dynamic programming map-matching (MDP-MM) approach is utilized to overcome this issue by accurately matching GPS observations onto the road network and reconstructing trajectory paths [35]. Building on [35], the technique for determining the missing road network links is enhanced; in this way, we avoid improperly initializing the tag/label in the path-finding process. The second strategy is establishing the candidate route by a single path-finding procedure at a candidate location rather than continuously searching the shortest routes from preceding candidate locations, as done in the standard Dijkstra algorithm.
6.2.2. Trajectory Segmentation
To prepare the data for the similarity computation, trajectory segmentation splits a trajectory into chunks/fragments by time interval. Two trajectory points separated by a considerable time interval cannot be converted into CLR, and two discontinuous points will cause an error in the result. Two trajectory points with a short time interval lie on the same road segment, and the intermediate trajectory points can be interpolated linearly. If the time interval is large, there will be significant errors in the interpolation, especially if the two points are on different roads. When the interval between two trajectory points is more than 300 s, the trajectory points are considered not to be sampled continuously, because using such a pair for linear interpolation is inaccurate. When the time interval between two trajectory points is less than 0.0000001 s, the sampling times of these two points are too close, so they are redundant points; that is why they are not used for interpolation.
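A small sketch of this gap-based segmentation rule follows; the thresholds mirror the ones described above (300 s maximum gap, a tiny minimum gap to drop redundant points), while the record layout (x, y, t) is an assumption.

```python
MAX_GAP_S = 300.0      # split when consecutive samples are farther apart in time
MIN_GAP_S = 1e-7       # drop near-duplicate samples

def segment_trajectory(points):
    """Split a list of (x, y, t) points into continuously sampled fragments."""
    fragments, current = [], []
    for p in points:
        if not current:
            current.append(p)
            continue
        gap = p[2] - current[-1][2]
        if gap < MIN_GAP_S:
            continue                    # redundant point: sampled too close in time
        if gap > MAX_GAP_S:
            fragments.append(current)   # close the fragment at the large gap
            current = [p]
        else:
            current.append(p)
    if current:
        fragments.append(current)
    return fragments

pts = [(0, 0, 0.0), (1, 0, 5.0), (1, 0, 5.0), (2, 0, 400.0), (3, 0, 410.0)]
print([len(f) for f in segment_trajectory(pts)])  # [2, 2]
```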
6.2.3. Spatiotemporal Data Model
The CLR approach handles the moving object indexes and transforms 3D road network-constrained time geographic entities in (x, y, t) space to 2D entities in CLR space. The geographic point (x, y) in the road network is equivalently represented by the linear reference lr, and lr and the link identity lid, a distinct positive integer, are integrated into a single real value clr. The clr value may be readily converted back to a location using dynamic subdivision. A unique 2D point (clr, t) in CLR space may thus be used to represent any 3D point in the road network.
The spacetime route is transformed into CLR space and consists of a number of disjoint LineString polyline components. Each line string corresponds to a portion of the path in 3D space, denoting the ongoing movement of a person along a specific link. Each segment in the set is a straight line in CLR space connecting two consecutive control points on the same link.
6.3. Prototype System Development
This research prototype is divided into two main parts: LCSS similarity and clustering. We use the LCSS approach to find the similarity distance in two different ways: one uses the buffering technique and the other does not. Before creating the buffer, we convert the spacetime path (3D trajectory) into CLR space, which is 2D, so that we can generate the buffer over CLR space and then apply LCSS on the buffered CLR space. Without the buffer, we find the similarity measure directly between two spacetime paths using time and distance thresholds: the time threshold is used to evaluate the similarity every 20 s, and the distance threshold is used to detect outliers. For the clustering of these similarities, we use two clustering approaches, hierarchical-based and density-based clustering, which provide an efficient way to perform spacetime trajectory clustering.
The following describes the hardware and software configuration. To store, index, and query spacetime paths (trajectories) in CLR space, we used ArcSDE 10.2 with SQL Server as the spatial database. The geo-computational algorithms for creating network trajectory buffers along CLR are implemented in Visual C#. The ArcEngine 10.2 development kit is used to implement the specified LCSS spacetime operations in Visual C#. Python is used for the hierarchical clustering, whereas Visual C# is used to implement the density-based clustering method DBSCAN.
7. Experimental Evaluation of Trajectory Data Clustering
Experiment 1: Impact of SP-Analysis on similarity
In this experiment, we calculate the LCSS similarity using two techniques, LCSS with buffer and LCSS without buffer, on trajectories of two different lengths and two different dataset sizes, as shown in Table 1.
Firstly, for LCSS with buffer, we convert the given dataset of trajectories into the Compressed Linear Reference (CLR). We generate the buffer in road network space along each trajectory with a given buffer threshold of 200 m, and we observe whether the other trajectory lies in the buffer or not. We then calculate the time duration of the target trajectory inside the buffer. Secondly, we use the traditional method to calculate the LCSS. This method uses a dynamic programming algorithm to solve the LCSS; the matched elements are obtained, and in the experiment the time duration represented by the LCSS of the two trajectories is derived from the matched elements. The time threshold and space threshold used to judge whether two elements match are 10 s and 200 m, respectively.
Table 2 describes the LCSS similarity-based time consumption with and without the buffer.
Based on these results, we observe that for the dataset of 17 trajectories the buffer is created within approximately 14 s, and for 50 trajectories within 37 s, meaning it takes less than 1 s per trajectory. Finding the LCSS similarity then takes approximately 34 s in total for the 17 trajectories, whose lengths are no more than 2 h. For the same dataset, the computation time of LCSS without a buffer is far higher than with a buffer, even though the dataset spans at most 2 h. This is because the traditional LCSS method needs to calculate the LCSS for each pair of trajectories, whereas the SP-Analysis-based method does not. At the same time, the traditional LCSS method needs to calculate a large number of road network distances, which can be avoided in the SP-Analysis-based method. The same pattern shows in the dataset of 50 trajectories, where the time consumption of 460.008 s with the buffer is much less than the 253,049.97 s without the buffer. This is why we use the buffering technique to speed up the LCSS similarity analysis.
Experiment 2: Analysis of DBSCAN with Hierarchical Clustering
The parameter’s settings are reasonable and dramatically influence the clustering quality in our work. The input parameters of the DBSCAN clustering method based on LCSS
and
were determined in the experiment according to the method provided by the author of the original DBSCAN algorithm. First,
is determined according to
, where
D represents the trajectory dataset and
. Second, calculate the LCSS of each trajectory to its
k-th nearest trajectory. When sorting the trajectories in descending order of the
k-th nearest similarity value, the sorted
k-th similarity graph will give us some directions about the density distribution in the database. As shown in
Figure 6, select the
k-th similarity value corresponding to the first trajectory in the first “valley” (red line in the graph) of the sorted
k-th similarity graph as the input parameter
. The trajectory dataset D2 is used for analysis in
Figure 6, where the
k value is set to 4.
After visual analysis of the sorted k-th similarity graph, the input parameter of the trajectory Dataset 1 is set to 0; the input parameter of the trajectory Dataset 1 is set to 2. Furthermore, the of the trajectory Dataset 2 is set to 0.002; the of the trajectory Dataset 2 is set to 4.
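A sketch of this sorted k-th nearest similarity procedure is shown below; it computes, for each trajectory, the similarity to its k-th most similar neighbor from a hypothetical LCSS similarity matrix and sorts the values in descending order so that a "valley" can be read off as Eps. Plotting is left out, and the matrix values are placeholders.

```python
import numpy as np

def sorted_kth_similarity(S, k):
    """For each trajectory, the similarity to its k-th most similar neighbor,
    sorted in descending order (the values of the sorted k-th similarity graph)."""
    kth = []
    for i in range(len(S)):
        others = np.delete(S[i], i)          # exclude self-similarity
        others_sorted = np.sort(others)[::-1]
        kth.append(others_sorted[k - 1])
    return np.array(sorted(kth, reverse=True))

# Hypothetical LCSS similarities between five trajectories.
S = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.1, 0.0],
    [0.1, 0.2, 0.1, 1.0, 0.05],
    [0.0, 0.1, 0.0, 0.05, 1.0],
])
print(sorted_kth_similarity(S, k=2))  # inspect for the first "valley" to choose Eps
```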
Whether the parameter configuration is suitable or not has a significant impact on the accuracy of the clustering result. There are three input parameters of the LCSS-based hierarchical clustering method: the spatial threshold Eps, the linkage type (ward, maximum or complete, average, and single), and the similarity matrix. The similarity between two trajectories is calculated by LCSS with the buffering technique along the road network space.
The main task during hierarchical clustering is determining the linkage type. We use the ward linkage type, in which the algorithm updates the distance matrix at every iteration to reflect the distance to the newly formed cluster.
Figure 7 represents the hierarchical clustering of the 50-trajectory dataset as a dendrogram. The x-axis displays the trajectory labels, which are combined with other trajectories at each level. The y-axis of the dendrogram displays the similarity value between trajectories. Each level of the trajectory cluster depends on the similarity value, which we calculate by LCSS and store in the distance matrix.
Table 3 shows the time consumption of clustering with the LCSS similarity measure. The total time for computing the LCSS of the 17-trajectory dataset and then performing DBSCAN and hierarchical clustering equals 33.690 s. For the 50-trajectory dataset spanning more than 4 h, the time consumption of finding the LCSS with the buffer and then clustering is almost the same, approximately 460 s. We also compare our work with Chen et al. [34] and Yuan et al. [16], and it can be seen that our method takes less time when combined with DBSCAN and HCA.
Visual Analysis of Trajectory Clustering
Here, we present a visualization of the trajectory clustering results in 3D (x, y, t) space, where the x-axis represents the latitude, the y-axis the longitude, and the z-axis the time. This 3D visualization of the clustering result is based on Figure 7, the hierarchical clustering dendrogram of the 50 trajectories, whose y-axis shows the similarity values. Cutting the dendrogram at a similarity value of 0.078 creates three clusters that contain different numbers of trajectories; these 3D representations correspond to the dendrogram (See Figure 7).
Figure 8 shows cluster 1 of trajectory dataset D1, which contains a set of 50 trajectories with an average length of 150.3 km and an average duration of 4 h. This cluster is based on the similarity value we find using the LCSS. Figure 9 shows the cluster of the remaining trajectories from D2, comprising 44 trajectories with different lengths, and Figure 10 and Figure 11 each show a single trajectory. Distinct colors represent unique trajectories from the data in Figure 8, Figure 9, Figure 10 and Figure 11.
8. Conclusions
This study proposes an efficient density-based clustering method for network-constrained trajectories using the GeoLife dataset of Beijing. The approach enhances spatiotemporal clustering by using the buffering technique and CLR space, and we obtain valid trajectories from the raw input data using map matching to locate missing points. We use minimum and maximum thresholds for segmentation, with interpolation providing more accuracy after applying these thresholds. Our spatiotemporal data model transforms the road network space into CLR space, and we create a buffer along CLR, finding the similarity between trajectories within the buffer using the LCSS similarity metric. We apply different clustering techniques, including DBSCAN and hierarchical clustering, based on LCSS similarity, and the results show a significant reduction in time consumption when using the buffering technique.
Our proposed method has potential applications in transportation planning and smart vehicle mobility, helping optimize traffic on roads and improve infrastructure in smart planning. Future work could improve clustering efficiency by exploring different similarity techniques or by using F-heap (Fibonacci heap) data structures, indexing techniques, or NoSQL databases to reduce computation costs. Overall, this study offers a promising approach for clustering trajectory data with network constraints, paving the way for further research in this field.