Article

Similarity Measurement and Classification of Temporal Data Based on Double Mean Representation

School of Computer Science, China University of Geosciences (Wuhan), 388 Lumo Road, Wuhan 430074, China
* Author to whom correspondence should be addressed.
Algorithms 2023, 16(7), 347; https://doi.org/10.3390/a16070347
Submission received: 16 June 2023 / Revised: 14 July 2023 / Accepted: 14 July 2023 / Published: 19 July 2023
(This article belongs to the Special Issue Algorithms for Time Series Forecasting and Classification)

Abstract

Time series data typically exhibit high dimensionality and complexity, so specific approximation methods are needed to perform computations on them. The compression methods currently in use suffer from varying degrees of feature loss, which can distort similarity measurement results. To address these challenges, this paper proposes a double mean representation method for time series data, SAX-DM (Symbolic Aggregate Approximation Based on Double Mean Representation), together with a similarity measurement approach based on it. Addressing the trade-off between compression ratio and accuracy in improved SAX representations, SAX-DM utilizes the segment mean and the segment trend distance to represent each segment of a time series. This method reduces the dimensionality of the original sequences while preserving their original features and trend information, resulting in a unified representation of time series segments. Experimental results demonstrate that, at the same compression ratio, SAX-DM combined with its similarity measurement method achieves higher expression accuracy and a better balance between compression ratio and accuracy than SAX-TD and SAX-BD on over 80% of the UCR Time Series datasets. This approach improves the efficiency and precision of similarity calculation.

1. Introduction

Time series data refer to a sequence of data points arranged in chronological order. The widespread use of smartphones, various sensors, RFID, and other devices in recent years has laid a solid foundation for the generation of massive amounts of time series data. For efficient data mining and querying of time series data, it is crucial to employ a rational and effective representation method. Time series data representation methods can be broadly classified into four categories: data-adaptive representation methods, non-data-adaptive representation methods, model-based representation methods, and data-indicative representation methods, as described in Table 1. This overview is based on the survey in reference [1].
(1) Data-adaptive representation methods express the original time series data as a combination of arbitrary-length segments, aiming to minimize the error in global representation. For example, the commonly used Singular Value Decomposition (SVD) method [4] is a typical data-adaptive representation method. It seeks c representative orthogonal vectors of dimensionality k (c ≤ k) and maps the original time series data into a smaller space, achieving dimensionality reduction. Another approach, the Piecewise Linear Approximation (PLA) method proposed by Keogh et al. in 1998 [11], fits the original time series data using line segments. Furthermore, the improved Piecewise Constant Approximation (PCA) method [12] approximates each time series data segment using a constant. Similarly, the Adaptive Piecewise Constant Approximation (APCA) method [13] divides the original time series data into variable-length segments and represents each segment using mean and time-scale values. Symbolic Aggregate Approximation (SAX) [14,15] partitions the original data and uses means to represent each segment. Then, based on the normal distribution of the values in the time series data, the means are mapped to symbols, transforming the complete time series data into a string.
Data-adaptive representation methods effectively capture the characteristics of the original time series data. However, due to the unequal lengths of the segments, similarity measurement based on this representation method becomes challenging.
(2) Non-data-adaptive representation methods based on frequency-domain transformation convert time series data from the time domain to the frequency domain and represent the data using spectral information. Commonly used methods include the Discrete Fourier Transform (DFT) [2], Discrete Wavelet Transform (DWT) [3], Discrete Cosine Transform (DCT) [4], and others. These methods approximate the original time series data by keeping the first k coefficients after the transformation, achieving compression and reducing dimensionality. Parameter determination for such domain-transformation-based methods is challenging and requires extensive tuning experiments. Non-data-adaptive methods based on piecewise approximation have also emerged, such as Piecewise Aggregate Approximation (PAA) [5] and Indexable Piecewise Linear Aggregation (IPLA) [8]. Non-data-adaptive representation methods use equally sized segments to approximate the time series data; while the approximation quality may not be as good as that of data-adaptive methods, using such methods for similarity measurement is more direct and simpler.
(3) Model-based representation methods assume that time series data are generated by a stochastic process and fit models such as AutoRegressive (AR) [24], AutoRegressive Moving Average (ARMA) [25], and Hidden Markov Models (HMMs) [26]. This approach requires the data to satisfy the models' assumptions and underlying mathematical theory; otherwise, distortion may occur. Additionally, time series data are structurally complex and contain a significant amount of noise, which makes constructing accurate models challenging.
(4) The aforementioned three types of representation methods allow users to customize the data compression ratio (the ratio of the original sequence length to the processed sequence length). However, determining this parameter is often challenging and can significantly affect the quality of the approximation. In contrast, data-indicative representation methods can automatically define the compression ratio based on the original time series data. Common methods in this category include data pruning [21] and tree-based representation methods [22].
Among these, the SAX (Symbolic Aggregate Approximation) family of representation methods has received significant attention from researchers. As a data-adaptive representation method, SAX is known for its simplicity and comprehensibility. Its implementation involves three main steps: Z-normalizing the data, applying Piecewise Aggregate Approximation (PAA) for dimensionality reduction, and finally performing symbolic representation. SAX has two important parameters: the dimensionality reduction parameter (w) and the alphabet size (α). Figure 1 illustrates the symbolization of a time series of length 128. The original series is transformed into a string of length 8, "baabccbc", drawn from an alphabet of three characters (a, b, c).
However, considering the complexity and diversity of time series data, using only the mean of time series data segments to represent their information may overlook some important features, resulting in limited expressive power of SAX. Therefore, several improved representation methods based on SAX have been proposed, such as SAX-TD [16] and SAX-BD [17]. These methods enhance the accuracy of similarity measurement by incorporating additional key feature information for each time series data subsegment during compression and dimensionality reduction.
The SAX-TD method represents the trend distance using the distance from the endpoint value to the mean, but it has limitations for sequences with complex variations. The SAX-BD method represents the trend distance using the distance from the maximum and minimum values to the mean, but it requires more symbol representations and has poor dimensionality reduction performance.
In this paper, we propose a new method called SAX-DM, where "DM" stands for Double Mean. To better quantify the trend of each subsegment, we further divide every segment obtained by the PAA algorithm into two halves. We use the distance from the mean of the left half to the overall segment mean as the trend distance, which intuitively captures the trend of the subsegment. To encode the trend direction, this distance is signed: it is the difference between the left mean and the segment mean. Since the time series data are normalized before PAA, the trend distance lies in the range Δq ∈ [−1, 1].
Compared to SAX-TD and SAX-BD, SAX-DM provides a better representation of the overall trend of time series segments with complex variations. Additionally, SAX-DM adds only one extra character on top of SAX, requiring fewer symbols. We also propose a novel similarity measurement method based on the SAX-DM representation. The experimental results demonstrate that SAX-DM outperforms SAX-BD and SAX-TD in terms of effectiveness. Furthermore, we prove that our distance measure lower-bounds the Euclidean distance while being tighter than the lower bound of the original SAX method.

2. Related Work

Assuming that time series data follow a normal distribution, we can transform the original sequence into one that conforms to the standard normal distribution. The SAX representation method draws on this idea. It first segments the transformed sequence and calculates the mean of each segment. These means are then mapped to probability intervals according to their magnitudes. Assigning a letter to each interval yields a sequence of letters, which forms the basis of SAX. Through this process, SAX transforms the original continuous time series into a discrete symbol sequence, simplifying the representation and processing of the data.

2.1. Distance Calculation by SAX

The implementation of SAX mainly consists of three steps: Z-normalization of the dataset, dimensionality reduction using PAA, and finally symbolization. First of all, it is common to normalize the time series data so that each time series has a mean of 0 and a variance of 1 because time series data with different offsets and amplitudes are not directly comparable. Then, using a sliding-window approach, the original time series data are divided into w equal-length subsequences, and each subsequence is represented by its mean value. The normalized time series data follow a normal distribution, which provides mathematical support for dividing the probability distribution corresponding to the time series data into equal area regions and can ensure that the probability of the time series data falling into each interval is equal. The number of regions is determined by the input parameter α, which also represents the size of the character set. Each region is then represented by a symbol. Finally, based on the region in which the mean value of each subsequence falls, the corresponding symbol for that region is used to replace the subsequence, resulting in the symbolization approximation representation of the time series data.
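As a concrete illustration of these three steps, the following minimal Python sketch (using NumPy and SciPy, and assuming the series length is divisible by w) computes a SAX representation; it returns region indices rather than letters, and the function names are ours, not from any SAX library:

```python
import numpy as np
from scipy.stats import norm

def z_normalize(ts):
    # Zero mean, unit variance: series with different offsets and amplitudes
    # become directly comparable.
    return (ts - ts.mean()) / ts.std()

def paa(ts, w):
    # Mean of each of the w equal-length segments (len(ts) must be divisible by w).
    return ts.reshape(w, -1).mean(axis=1)

def sax(ts, w, alpha):
    # Breakpoints cut the N(0, 1) density into alpha equiprobable regions;
    # each PAA mean is mapped to the index of the region it falls into.
    breakpoints = norm.ppf(np.arange(1, alpha) / alpha)
    return np.searchsorted(breakpoints, paa(z_normalize(ts), w))

# Example: a length-128 series reduced to w = 8 symbols over a 3-letter alphabet.
ts = np.sin(np.linspace(0, 4 * np.pi, 128))
print("".join("abc"[s] for s in sax(ts, w=8, alpha=3)))
```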
Assume two time series Q and C of the same length n, each divided into w segments denoted q and c, and let $\hat{Q}$ and $\hat{C}$ be the symbol strings obtained by applying the SAX algorithm. The SAX distance between Q and C aggregates the distances between corresponding symbols and can be expressed as follows:
$$\mathrm{MINDIST}(\hat{Q}, \hat{C}) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \left(\mathrm{dist}(\hat{q}_i, \hat{c}_i)\right)^2} \tag{1}$$
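The per-symbol distance dist(·,·) is usually read from a table derived from the Gaussian breakpoints. Below is a minimal sketch of that table and of Equation (1), under the same assumptions as the previous snippet (symbols are region indices; the helper names are ours):

```python
import numpy as np
from scipy.stats import norm

def symbol_dist_table(alpha):
    # dist(a, b) between symbol regions: 0 for identical or adjacent regions,
    # otherwise the gap between the breakpoints separating the two regions.
    bp = norm.ppf(np.arange(1, alpha) / alpha)
    table = np.zeros((alpha, alpha))
    for i in range(alpha):
        for j in range(alpha):
            if abs(i - j) > 1:
                table[i, j] = bp[max(i, j) - 1] - bp[min(i, j)]
    return table

def mindist(sym_q, sym_c, n, alpha):
    # Equation (1): per-symbol distances, summed in quadrature and rescaled.
    d = symbol_dist_table(alpha)[sym_q, sym_c]
    w = len(sym_q)
    return np.sqrt(n / w) * np.sqrt(np.sum(d ** 2))
```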

2.2. Two Improvements of SAX Distance Measure for Time Series

2.2.1. SAX-TD

In order to enhance the accuracy of the SAX representation, it is important to preserve the trend information of the time series data during dimensionality reduction. For instance, the authors of reference [16] propose storing a value together with each symbol to improve the distance calculation. Their method, SAX-TD, uses the start and end points of each segment to calculate a trend distance. Several scenarios for the trend variation of time series segments are illustrated in Figure 2, where $t_l$ denotes the left (starting) endpoint and $t_r$ the right (ending) endpoint.
The trend is indeed an important feature of time series data and plays a crucial role in their classification and analysis. For instance, if the endpoint value is greater than the starting point value, it indicates an upward trend, while the opposite suggests a downward trend. To describe this trend more accurately, it is necessary to use the actual values instead of the symbolized mapping when calculating the trend distance.
For the given time series data q and c, their trend distance is calculated as follows:
$$td(q, c) = \sqrt{\left(\Delta q(t_s) - \Delta c(t_s)\right)^2 + \left(\Delta q(t_e) - \Delta c(t_e)\right)^2} \tag{2}$$
where $t_s$ and $t_e$ denote the start and end points of the time series segment, $\Delta q(t)$ denotes the distance from the endpoint value to the segment mean for sequence q (and $\Delta c(t)$ the corresponding distance for sequence c), calculated using Formula (3):
$$\Delta q(t) = q(t) - \bar{q} \tag{3}$$
Although each segment has a starting point and an endpoint, in practice, the starting point of the next segment is actually the endpoint of the previous segment. Therefore, it is possible to embed the trend distance into the SAX symbolized sequence. As a result, the time series data Q and C can be expressed using the following representation:
$$Q: \Delta q(1)\,\hat{q}_1\; \Delta q(2)\,\hat{q}_2 \cdots \Delta q(w)\,\hat{q}_w\; \Delta q(w+1) \qquad C: \Delta c(1)\,\hat{c}_1\; \Delta c(2)\,\hat{c}_2 \cdots \Delta c(w)\,\hat{c}_w\; \Delta c(w+1)$$
$\hat{q}_1, \hat{q}_2, \ldots, \hat{q}_w$ is the sequence symbolized by SAX, and Δq(1), Δq(2), …, Δq(w) represent the trend of the time series expressed as the distance between the endpoint values and the segment means; Δq(w + 1) represents the change at the last point.
The distance measure formula for the time series data Q and C can be expressed as follows:
$$\mathrm{TDIST}(\hat{Q}, \hat{C}) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \left( \left(\mathrm{dist}(\hat{q}_i, \hat{c}_i)\right)^2 + \frac{w}{n} \left( td(q_i, c_i) \right)^2 \right)} \tag{4}$$
where $\mathrm{dist}(\hat{q}_i, \hat{c}_i)$ is the distance calculated with the SAX distance measure, $td(q_i, c_i)$ is the trend distance calculated using Equation (2), n is the length of q and c, and w is the number of time segments.
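A minimal sketch of the SAX-TD quantities follows, reusing symbol_dist_table from the MINDIST snippet above (the function names are ours, and the series length is again assumed to be divisible by w):

```python
import numpy as np

def endpoint_deltas(ts, w):
    # Equation (3): distance from each segment's start and end point to that
    # segment's mean.
    segs = ts.reshape(w, -1)
    means = segs.mean(axis=1)
    return segs[:, 0] - means, segs[:, -1] - means

def tdist(sym_q, sym_c, dq, dc, n, alpha):
    # Equation (4); dq and dc are the (start, end) delta pairs returned by
    # endpoint_deltas for the two series.
    w = len(sym_q)
    sym2 = symbol_dist_table(alpha)[sym_q, sym_c] ** 2
    td2 = (dq[0] - dc[0]) ** 2 + (dq[1] - dc[1]) ** 2   # squared Equation (2)
    return np.sqrt(n / w) * np.sqrt(np.sum(sym2 + (w / n) * td2))
```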

2.2.2. SAX-BD

The authors of reference [17] suggest adding a boundary distance instead of the trend distance and propose the SAX-BD algorithm. For each segmented time series fragment there are maximum and minimum points, and their distances to the mean value are referred to as the boundary distance. The boundary distances of a segment contribute to a more accurate measurement of the different trends in the time series data. Details are provided below.
From Figure 3, we can observe that the maximum and minimum values within each time segment serve as the boundaries. The boundary distances of case a are denoted $\Delta q(t)_{max}$ and $\Delta q(t)_{min}$, as shown in Equations (5) and (6):
$$\Delta q(t)_{max} = q(t)_{max} - \bar{q} \tag{5}$$
$$\Delta q(t)_{min} = q(t)_{min} - \bar{q} \tag{6}$$
In fact, the SAX-BD algorithm computes the trend changes (i.e., the boundary distances) of case a as $\Delta q(t)_{min}$ and $\Delta q(t)_{max}$, and these values are equivalent to $\Delta q(t_s)$ and $\Delta q(t_e)$, so SAX-BD can also distinguish this case. For cases a and b, the distance calculated using SAX-TD is 0, whereas the distance calculated using SAX-BD is nonzero, so the two sequences can be differentiated. For cases c and d, the TD and BD quantities relate as follows:
  • for case c, $\Delta q(t_s) = \Delta q(t)_{max}$ and $\Delta q(t_e) = \Delta q(t)_{min}$;
  • for case d, $\Delta q(t)_{min} \le \Delta q(t_e) \le 0$ and $\Delta q(t)_{max} \ge \Delta q(t_s) \ge 0$.
Similar to the SAX-TD distance measurement concept, the following SAX-BD distance measurement formula can be used for temporal data Q and C:
$$\mathrm{BDIST}(\hat{Q}, \hat{C}) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \left( \left(\mathrm{dist}(\hat{q}_i, \hat{c}_i)\right)^2 + \frac{w}{n} \left( bd(q_i, c_i) \right)^2 \right)} \tag{7}$$
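For comparison with the later SAX-DM snippet, here is a minimal sketch of the SAX-BD quantities under the same assumptions (our function names; symbol_dist_table is the helper from the MINDIST snippet):

```python
import numpy as np

def boundary_deltas(ts, w):
    # Equations (5) and (6): distances from each segment's maximum and minimum
    # to that segment's mean.
    segs = ts.reshape(w, -1)
    means = segs.mean(axis=1)
    return segs.max(axis=1) - means, segs.min(axis=1) - means

def bdist(sym_q, sym_c, bq, bc, n, alpha):
    # Equation (7); bq and bc are the (max, min) delta pairs from boundary_deltas.
    w = len(sym_q)
    sym2 = symbol_dist_table(alpha)[sym_q, sym_c] ** 2
    bd2 = (bq[0] - bc[0]) ** 2 + (bq[1] - bc[1]) ** 2
    return np.sqrt(n / w) * np.sqrt(np.sum(sym2 + (w / n) * bd2))
```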

3. Our Method, SAX-DM

3.1. SAX-DM Design and Analysis

In the SAX algorithm, expressing the information of a time series segment solely based on its mean would overlook some important features, resulting in limited expressive power. The SAX-TD method represents trend distance using the distance from the endpoints to the mean, which has certain limitations for sequences with complex variations. The SAX-BD method represents trend distance using the distance from the maximum and minimum values to the mean, requiring more symbols for representation and resulting in poor dimensionality reduction.
In this paper, we propose a new approach that uses the mean and the trend of time series subsegments as the key information. To better quantify the trend of a subsegment, we further divide each segment obtained through the PAA algorithm into two halves. The distance from the mean of the left half to the overall segment mean is used as the trend distance, which intuitively represents the trend of the subsegment. To capture the trend direction with its sign, we use the difference between the left mean and the overall segment mean. Since the time series data are normalized before PAA, the trend distance lies in the range Δq ∈ [−1, 1]. Figure 4 illustrates how the trend distance is determined. When the left mean is smaller than the overall mean, the trend is considered increasing and the trend distance is positive, as in the example in Figure 4a. When the left mean is larger than the overall mean, the trend is considered decreasing and the trend distance is negative, as in the example in Figure 4b.
Compared to SAX-TD and SAX-BD, SAX-DM can better represent the overall trend of time series segments with complex variations while requiring fewer symbols. For a segment with mean value $\hat{q}$, its trend distance Δq lies within the range [−1, 1]. Therefore, when dividing the time series data into n parts, it can be expressed as follows:
$$\Delta q_1 \hat{q}_1\; \Delta q_2 \hat{q}_2\; \Delta q_3 \hat{q}_3 \cdots \Delta q_n \hat{q}_n \tag{8}$$
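A minimal sketch of the SAX-DM representation, assuming even-length segments so that each splits cleanly into two halves (the sign convention follows Figure 4, and the function name is ours):

```python
import numpy as np
from scipy.stats import norm

def sax_dm(ts, w, alpha):
    # Represent each segment by a (trend distance, SAX symbol) pair.
    ts = (ts - ts.mean()) / ts.std()           # Z-normalization
    segs = ts.reshape(w, -1)                   # PAA segmentation
    means = segs.mean(axis=1)
    left_means = segs[:, : segs.shape[1] // 2].mean(axis=1)
    # Figure 4 convention: left mean below the segment mean means an upward
    # trend and a positive trend distance.
    trend = means - left_means
    symbols = np.searchsorted(norm.ppf(np.arange(1, alpha) / alpha), means)
    return trend, symbols
```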

3.2. Similarity Measurement Based on SAX-DM Expression Method

In this paper, the Euclidean distance is employed as the fundamental method for measuring similarity. For a sequence approximated using SAX-DM, it can be viewed as a point in a w-dimensional space. Therefore, the computation of similarity between time series data can be transformed into calculating the distance between different points in this w-dimensional space.
Before proceeding, let us review the method for calculating the Euclidean distance on the original time series data. Suppose we have two time series $T = (t_1, t_2, t_3, \ldots, t_n)$ and $S = (s_1, s_2, s_3, \ldots, s_n)$. The Euclidean distance between them is the straight-line distance between the points represented by the two time series objects in an n-dimensional space. It can be calculated using the following formula:
$$ED(T, S) = \sqrt{\sum_{i=1}^{n} (t_i - s_i)^2} \tag{9}$$
Due to the high-dimensional nature of time series data, calculating distances on the original time series data can lead to significant memory pressure and a substantial number of computations. This process is also susceptible to noise and deformations in the data. Therefore, it is common to compress and reduce the dimensionality of the original time series data and extract features. One popular method for compression and dimensionality reduction is using Piecewise Aggregate Approximation (PAA). After applying PAA, the similarity distance can be calculated using the following formula:
$$ED(\bar{T}, \bar{S}) = \sqrt{\sum_{j=1}^{w} (\bar{t}_j - \bar{s}_j)^2} \tag{10}$$
Each subsegment is represented by its mean as the feature information:
$$\bar{t}_i = \frac{w}{n} \sum_{j = \frac{n}{w}(i-1) + 1}^{\frac{n}{w} i} t_j \tag{11}$$
In this paper, the trend distance is used to represent time series fragments. First, a representation based on the trend distance is defined. For time series Q and C, the following expressions are defined:
$$Q: \Delta q_1 \hat{q}_1\; \Delta q_2 \hat{q}_2\; \Delta q_3 \hat{q}_3 \cdots \Delta q_n \hat{q}_n \qquad C: \Delta c_1 \hat{c}_1\; \Delta c_2 \hat{c}_2\; \Delta c_3 \hat{c}_3 \cdots \Delta c_n \hat{c}_n \tag{12}$$
According to the above expression, the formula for calculating the trend distance based on the left mean value is defined as follows:
$$md(q, c) = \sqrt{(\Delta q - \Delta c)^2} \tag{13}$$
The distance measurement can be represented by the following equation:
$$\mathrm{MDIST}(\hat{Q}, \hat{C}) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \left( \left(\mathrm{dist}(\hat{q}_i, \hat{c}_i)\right)^2 + \frac{w}{n} \cdot 2 \left( md(q_i, c_i) \right)^2 \right)} \tag{14}$$
In order to address the memory pressure and computational efficiency issues associated with high-dimensional time series data, it is generally necessary to compress and reduce the dimensionality of the original data and extract features. In this paper, the PAA method is employed to segment the original time series data, and the average value and change trend of subsegments are used as the key information for each subsegment. This approach aims to improve the compression ratio of the approximate representation while preserving the essential features of the original data as much as possible. Furthermore, by considering the trend of temporal data changes, the similarity measurement method based on the approximate representation proposed in this paper can calculate similarity more accurately.
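Putting Equations (13) and (14) together, here is a minimal sketch of the SAX-DM similarity measure (our function names; symbol_dist_table is the helper from the MINDIST snippet, and trend_q/trend_c are the trend arrays returned by the sax_dm sketch above):

```python
import numpy as np

def mdist(sym_q, sym_c, trend_q, trend_c, n, alpha):
    # Equation (14): SAX symbol distance plus a weighted trend-distance term.
    w = len(sym_q)
    sym2 = symbol_dist_table(alpha)[sym_q, sym_c] ** 2
    md2 = (trend_q - trend_c) ** 2             # squared md of Equation (13)
    return np.sqrt(n / w) * np.sqrt(np.sum(sym2 + (w / n) * 2 * md2))
```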

3.3. Lower Bound

One of the most important features of the SAX algorithm when applied for dimensionality reduction is that its distance measure lower-bounds the true distance. The lower bound is useful for controlling errors and accelerating computations. However, when performing spatial queries on the dimensionally reduced sequence, there is a risk of false negatives. To reduce the occurrence of false negatives after dimensionality reduction, it is important to design measures that satisfy a good lower-bound property.
Next, we will prove that the distance we propose also serves as a lower bound for the Euclidean distance.
The PAA distance provides a lower bound on the Euclidean distance, given by the following expression:
$$\sum_{i=1}^{n} (q_i - c_i)^2 \ge \frac{n}{w} \sum_{i=1}^{w} (\bar{q}_i - \bar{c}_i)^2 \tag{15}$$
To prove that MDIST is also a lower bound for the Euclidean distance, we reiterate some of the proofs here. Let $\bar{Q}$ and $\bar{C}$ be the means of the time series Q and C, respectively. Firstly, we consider the single-frame case (i.e., w = 1); according to Equation (15), we can obtain
$$\sum_{i=1}^{n} (q_i - c_i)^2 \ge n (\bar{Q} - \bar{C})^2 \tag{16}$$
Recall that $\bar{Q}$ is the average value of the time series, so each point can be written as $q_i = \bar{Q} + \Delta q_i$, where $\Delta q_i = q_i - \bar{Q}$. The same applies to each point $c_i$ in C. Equation (16) can then be rewritten as follows:
$$n (\bar{Q} - \bar{C})^2 + \sum_{i=1}^{n} (\Delta q_i - \Delta c_i)^2 \ge n (\bar{Q} - \bar{C})^2 \tag{17}$$
Because $\sum_{i=1}^{n} (\Delta q_i - \Delta c_i)^2 \ge 0$ and this sum contains the terms $(\Delta q(t)_1 - \Delta c(t)_1)^2 + (\Delta q(t)_2 - \Delta c(t)_2)^2$ (which clearly exist among the boundary distances $\Delta q_i$), we can obtain the following inequality:
$$\sum_{i=1}^{n} (\Delta q_i - \Delta c_i)^2 \ge (\Delta q(t)_1 - \Delta c(t)_1)^2 + (\Delta q(t)_2 - \Delta c(t)_2)^2 \tag{18}$$
Substituting Equation (18) into Equation (17), we obtain
$$n (\bar{Q} - \bar{C})^2 + \sum_{i=1}^{n} (\Delta q_i - \Delta c_i)^2 \ge n (\bar{Q} - \bar{C})^2 + \sum_{i=1}^{n} \left( md(q_i, c_i) \right)^2 \tag{19}$$
The MINDIST measure provides a lower bound for the PAA distance, namely,
$$n (\bar{Q} - \bar{C})^2 \ge n \left( \mathrm{dist}(\hat{Q}, \hat{C}) \right)^2 \tag{20}$$
In Equation (20), $\hat{Q}$ and $\hat{C}$ are the symbolic representations of Q and C in the original SAX, respectively. By transitivity, the following inequality holds:
$$n (\bar{Q} - \bar{C})^2 + \sum_{i=1}^{n} (\Delta q_i - \Delta c_i)^2 \ge n \left( \mathrm{dist}(\hat{Q}, \hat{C}) \right)^2 + \left( md(q_i, c_i) \right)^2 \tag{21}$$
Recalling Equation (15), this means
$$\sum_{i=1}^{n} (q_i - c_i)^2 \ge n \left( \left( \mathrm{dist}(\hat{Q}, \hat{C}) \right)^2 + \frac{1}{n} \left( md(q_i, c_i) \right)^2 \right) \tag{22}$$
The w-frame case is obtained by applying the single-frame proof to each frame, namely,
$$\sum_{i=1}^{n} (q_i - c_i)^2 \ge \frac{n}{w} \sum_{i=1}^{w} \left( \left( \mathrm{dist}(\hat{q}_i, \hat{c}_i) \right)^2 + \frac{w}{n} \left( md(q_i, c_i) \right)^2 \right) \tag{23}$$
The quality of a lower-bounding distance is usually measured by the tightness of the lower bound (TLB):
$$TLB = \frac{\mathrm{Lower\ Bounding\ Distance}(Q, C)}{\mathrm{Euclidean\ Distance}(Q, C)} \tag{24}$$
The value of TLB lies within the range [0, 1]; the higher the TLB value, the better the quality. Recalling the distance metrics above, we can obtain $TLB(\mathrm{MDIST}) \ge TLB(\mathrm{MINDIST})$, which means that SAX-DM has a tighter lower bound than the original SAX distance.
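A minimal sketch of Equation (24), usable to compare the tightness of any two lower-bounding measures on the same pair of series (the function name is ours):

```python
import numpy as np

def tlb(lower_bound, q, c):
    # Equation (24): ratio of a lower-bounding distance to the true Euclidean
    # distance; values closer to 1 indicate a tighter (better) bound.
    return lower_bound / np.sqrt(np.sum((q - c) ** 2))

# Usage idea: this section asserts tlb(mdist(...), q, c) >= tlb(mindist(...), q, c).
```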

4. Experimental Validation

In this section, we compare the classification results of SAX-DM representation with other representations on time series data through experiments. Firstly, we introduce the experimental dataset, followed by an explanation of the experimental methodology and parameter settings. Finally, we evaluate the advantages of SAX-DM based on the comprehensive assessment of classification accuracy and error rate.

4.1. Datasets

The evaluation of SAX-DM’s classification performance utilized the UCR Time Series Archive [28], which is a widely used collection of time series datasets in the field of time series data mining. This dataset was introduced in 2002 and has been continuously expanded over time. Due to its inclusion of time series datasets from various domains, it has become an important resource in the time series data mining community and is recommended by researchers working with time series data.
Initially, the dataset contained only 16 datasets. However, since its introduction in 2002, it has been expanded, and it currently includes 128 datasets. These datasets cover seven different domains: Device, ECG, Image, Motion, Sensor, Simulated, and Spectro. After careful analysis and verification of the data, it was found that some datasets had missing values, with some missing data lengths exceeding half of their original data lengths. As the similarity measurement method based on the approximate expression in this paper relies on the concept of Euclidean distance and is applicable to equally sized time series data, it cannot calculate the similarity between time series data of unequal lengths. Therefore, some datasets were excluded from the analysis. In the end, this paper selected 100 datasets from the UCR Time Series Archive, covering the aforementioned seven domains, for the purpose of conducting time series data classification experiments. The list of datasets used can be found in Table 2.

4.2. Experimental Methods and Parameter Settings

Since the SAX-DM, SAX, SAX-TD, and SAX-BD methods are all based on PAA for segmenting time series data, they involve the parameter window size, denoted as “w.” Additionally, the SAX, SAX-TD, and SAX-BD methods also involve the parameter symbol table size, denoted as “alpha.” For each dataset, we defined a set of parameter settings according to the specifications shown in Table 3. The range for the parameter “w” was set from 5 to 20, and the range for the parameter “alpha” was set from 3 to 16. For each dataset, we compared the classification accuracy obtained with different parameter settings and retained the best result for comparison. Furthermore, under the condition of achieving the best classification accuracy for each method, we compared the length ratio of the original sequence to the corresponding sequence obtained by different methods.
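A minimal sketch of this protocol, pairing 1-NN classification with a grid search over w and alpha as in Table 3 (make_dist is a hypothetical factory that builds a distance function for one parameter setting; everything here is our scaffolding, not code from the paper):

```python
import numpy as np

def one_nn_accuracy(train_X, train_y, test_X, test_y, dist_fn):
    # 1-NN classification: each test series takes the label of its nearest
    # training series under dist_fn.
    correct = 0
    for x, y in zip(test_X, test_y):
        nearest = int(np.argmin([dist_fn(x, t) for t in train_X]))
        correct += int(train_y[nearest] == y)
    return correct / len(test_y)

def grid_search(train_X, train_y, test_X, test_y, make_dist):
    # make_dist(w, alpha) returns a distance function; keep the best accuracy
    # over the parameter grid of Table 3.
    best_acc, best_params = -1.0, None
    for w in range(5, 21):
        for alpha in range(3, 17):
            acc = one_nn_accuracy(train_X, train_y, test_X, test_y,
                                  make_dist(w, alpha))
            if acc > best_acc:
                best_acc, best_params = acc, (w, alpha)
    return best_acc, best_params
```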

4.3. Experimental Results and Analysis

We compared the classification accuracy and error rate of SAX-DM with those of ED, SAX, SAX-TD, and SAX-BD. The experimental results show that in most of the datasets, SAX-DM achieved comparable accuracy to SAX-TD and SAX-BD, with the additional advantage of lower classification error rates.
Figure 5 presents the comparison results between the SAX-DM and ED methods. Each point in the graph represents a dataset, where the x-axis represents the classification accuracy using the ED method, and the y-axis represents the classification accuracy using the SAX-DM method. Intuitively, the more points above the diagonal line, the more datasets in which the SAX-DM achieved higher accuracy compared to the ED method. After statistical analysis, it was found that SAX-DM had a slightly lower classification accuracy than ED in 67% of the datasets. This is because the Euclidean distance, which directly measures similarity on the original time series data without compression and dimension reduction, achieves high accuracy but puts a significant burden on computer memory and has lower computational efficiency. It is typically used as a comparative method to assess the viability of an approach and is not directly used for data mining tasks. Subsequent experiments on the data compression ratio did not require a comparison with the ED method. Although the SAX-DM may perform worse or on par with the Euclidean distance method in most datasets, the accuracy gap is within 0.1 for 80% of the datasets, indicating that it achieves effective data dimension reduction while maintaining a reasonably close accuracy level compared to the Euclidean distance method.
The comparison results between the SAX-DM and SAX methods in terms of classification accuracy are shown in Figure 6. The x-axis represents the classification accuracy using the SAX method, while the y-axis represents the classification accuracy using the SAX-DM method. The results represent the average classification accuracy for both representation methods under all α and w parameters, reflecting the overall classification performance. The results are evident, as using the SAX-DM method for approximate representation and similarity measurement in classification tasks yielded a higher accuracy in 98% of the datasets.
Compared to SAX, the SAX-DM further incorporates trend distance expression, allowing for better representation of the trend characteristics in time series data. However, this comes at the cost of sacrificing some data compression and dimension reduction rates. In practical usage, different methods can be chosen based on the application’s requirements for similarity accuracy in queries.
Figure 7 presents the comparison results between the SAX-DM and SAX-TD representation methods. From the figure, it can be observed that the two methods achieved similar accuracy in the classification task. The likely reason is that SAX-TD captures the trend by measuring the distance between the left and right endpoint values and the overall mean of the segment, and combines this trend information with the mean value for symbolization. This is close to the feature extraction methodology used in this paper, but it has certain limitations in practice. Because the right endpoint of one segment can serve as the left endpoint of the next, when the number of segments is large, each segment is represented by roughly two symbols on average. However, when the time series data are split into a small number of segments, the relative number of symbols required for the approximation increases.
In contrast, the SAX-DM method proposed in this paper can represent any segment of the time series data with two symbols, while still achieving a comparable classification accuracy to SAX-TD. Therefore, the SAX-DM method is more suitable for subsequent similarity query tasks on massive time series data. Additionally, the mean and trend distance can be further encoded to enhance compression and dimensionality reduction, thereby improving the compression ratio of the time series data approximation method.
The comparison results between the SAX-DM and SAX-BD methods are shown in Figure 8. From the figure, it can be observed that the SAX-BD method performed better in the classification task on 60% of the datasets, and its classification accuracy varied significantly across different datasets. Considering the reasons behind this, the SAX-BD method incorporates the left and right extreme values in addition to the mean value to represent the time series data, while the SAX-DM method focuses on extracting the mean feature. Therefore, there is a significant difference in the expression effect of these two methods for data with smooth or drastic changes. Although SAX-BD achieves higher classification accuracy, it uses three features for dimensionality reduction, resulting in a lower compression ratio, which makes it unsuitable for subsequent similarity query tasks on large-scale time series data.
Figure 9 illustrates the comparison of classification error rates among the SAX, SAX-TD, SAX-BD, and SAX-DM methods using the classic ECG dataset as experimental data. Different α and w parameters were set for the evaluation. From the graph, it can be observed that the α parameter had a negligible impact on the classification accuracy, while the w parameter, when too large or too small, affected the accuracy. Based on the results depicted in the graph, the optimal value for the w parameter can be chosen as 16.
Since comparing compression ratios alone cannot fully demonstrate the advantages of the expression methods, Figure 10 compares the classification accuracy of SAX, SAX-TD, SAX-BD, and SAX-DM at the same compression ratio, using SAX's compression ratio as the benchmark. A higher value in the graph indicates better overall performance in terms of compression ratio and accuracy. It is evident from the graph that the SAX-DM method performed exceptionally well and offers a more convenient conversion to index items. In general, the SAX-DM method is more suitable for similarity queries on time series data, especially in scenarios that require a large number of iterations, as it effectively reduces memory pressure and enhances data mining efficiency.

5. Conclusions

The SAX-DM algorithm proposed in this paper takes into account the strengths and limitations of SAX-TD and SAX-BD: SAX-TD has limitations in handling complex sequences with varying patterns, while SAX-BD requires more symbols for representation. Our representation method adds a single extra character to the original SAX, allowing an intuitive representation of the change trend within subsegments. SAX-DM not only uses fewer characters to represent time series but also employs a new trend-based distance measure to quantify time series similarity. Furthermore, we demonstrate that our distance measure maintains a lower bound on the Euclidean distance. Using this distance measure significantly reduces the classification error rate, improves efficiency, and enhances accuracy in similarity calculations.

Author Contributions

Conceptualization, Z.H.; methodology, Z.H. and C.Z.; software, C.Z.; validation, Z.H., C.Z. and Y.C.; formal analysis, Y.C.; investigation, Y.C.; data curation, Y.C.; writing—original draft preparation, C.Z.; writing—review and editing, Z.H. and C.Z.; visualization, Y.C.; supervision, Z.H.; project administration, Z.H.; funding acquisition, Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (2022YFF0801200, 2022YFF0801203) and the National Natural Science Foundation of China (41972306).

Data Availability Statement

The data used in this experiment can be accessed through the URL https://www.cs.ucr.edu/~eamonn/time_series_data_2018/, accessed on 10 June 2023.

Acknowledgments

The work described in this paper was supported by the National Key Research and Development Program of China (2022YFF0801200, 2022YFF0801203) and the National Natural Science Foundation of China (41972306). We thank the reviewers for their constructive comments, which helped us improve this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. He, Z.; Wu, C.; Liu, G.; Tian, Y.; Zhang, X.; Chen, Q. Overview of similarity measurement and indexing methods for geoscience time series big data. Bull. Geol. Sci. Technol. 2020, 39, 44–50.
  2. Agrawal, R.; Faloutsos, C.; Swami, A. Efficient similarity search in sequence databases. In Proceedings of the Foundations of Data Organization and Algorithms: 4th International Conference, FODO'93, Chicago, IL, USA, 13–15 October 1993; Springer: Berlin/Heidelberg, Germany, 1993; pp. 69–84.
  3. Chan, K.P.; Fu, A.W.C. Efficient time series matching by wavelets. In Proceedings of the 15th International Conference on Data Engineering, Sydney, Australia, 23–26 March 1999; pp. 126–133.
  4. Korn, F.; Jagadish, H.V.; Faloutsos, C. Efficiently supporting ad hoc queries in large datasets of time sequences. SIGMOD Rec. 1997, 26, 289–300.
  5. Yi, B.K.; Faloutsos, C. Fast Time Sequence Indexing for Arbitrary Lp Norms. 2000, pp. 385–394. Available online: https://www.vldb.org/conf/2000/P385.pdf (accessed on 10 February 2023).
  6. Fu, T.C.; Chung, F.L.; Luk, R.; Ng, C.-M. Representing financial time series based on data point importance. Eng. Appl. Artif. Intell. 2008, 21, 277–300.
  7. Cai, Y.; Ng, R.T. Indexing Spatio-Temporal Trajectories with Chebyshev Polynomials. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France, 13–18 June 2004; pp. 599–610.
  8. Chen, Q.; Chen, L.; Lian, X.; Liu, Y.; Yu, X. Indexable PLA for Efficient Similarity Search. In Proceedings of the 33rd International Conference on Very Large Data Bases, Vienna, Austria, 23–27 September 2007; pp. 435–446.
  9. Stefan, A.; Athitsos, V.; Das, G. The Move-Split-Merge Metric for Time Series. IEEE Trans. Knowl. Data Eng. 2013, 25, 1425–1438.
  10. Zhang, Z.; Tang, P.; Duan, R. Dynamic time warping under pointwise shape context. Inf. Sci. 2015, 315, 88–101.
  11. Keogh, E.J.; Pazzani, M.J. An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback. In KDD 1998, 98, 239–243.
  12. Keogh, E.J. A Simple Dimensionality Reduction Technique for Fast Similarity Search in Large Time Series Databases. 2000, pp. 122–133. Available online: http://www.cs.ucr.edu/~eamonn/pakdd200_keogh.pdf (accessed on 16 February 2023).
  13. Chakrabarti, K.; Mehrotra, S.; Pazzani, M.; Keogh, E. Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. ACM Trans. Database Syst. 2002, 27, 188–228.
  14. Lin, J.; Keogh, E.J.; Lonardi, S.; Chiu, B. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, San Diego, CA, USA, 13 June 2003; pp. 2–11.
  15. Lin, J.; Keogh, E.; Wei, L.; Lonardi, S. Experiencing SAX: A novel symbolic representation of time series. Data Min. Knowl. Discov. 2007, 15, 107–144.
  16. Sun, Y.; Li, J.; Liu, J.; Sun, B.; Chow, C. An improvement of symbolic aggregate approximation distance measure for time series. Neurocomputing 2014, 138, 189–198.
  17. He, Z.; Long, S.; Ma, X.; Zhao, H. A Boundary Distance-Based Symbolic Aggregate Approximation Method for Time Series Data. Algorithms 2020, 13, 284.
  18. He, Z.; Zhang, C.; Ma, X.; Liu, G. Hexadecimal Aggregate Approximation Representation and Classification of Time Series Data. Algorithms 2021, 14, 353.
  19. Yang, D.; Kang, Y. Distance- and Momentum-Based Symbolic Aggregate Approximation for Highly Imbalanced Classification. Sensors 2022, 22, 5095.
  20. Huang, J.; Xu, X.; Cui, X.; Kang, J.; Yang, H. Time series data symbol aggregation approximation method for fusing trend information. Appl. Res. Comput. 2023, 40, 86–90.
  21. Ratanamahatana, C.; Keogh, E.; Bagnall, A.J.; Lonardi, S. A novel bit level time series representation with implication of similarity search and clustering. In Advances in Knowledge Discovery and Data Mining, Proceedings of the 9th Pacific-Asia Conference, PAKDD 2005, Hanoi, Vietnam, 18–20 May 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 771–777.
  22. Baydogan, M.G.; Runger, G.C. Learning a symbolic representation for multivariate time series classification. Data Min. Knowl. Discov. 2015, 29, 400–422.
  23. Azzouzi, M.; Nabney, I.T. Analysing time series structure with hidden Markov models. In Neural Networks for Signal Processing VIII, Proceedings of the 1998 IEEE Signal Processing Society Workshop, Cambridge, UK, 31 August–3 September 1998; IEEE: Piscataway, NJ, USA, 1998; pp. 402–408.
  24. Serra, J.; Kantz, H.; Serra, X.; Andrzejak, R.G. Predictability of Music Descriptor Time Series and its Application to Cover Song Detection. IEEE Trans. Audio Speech Lang. Process. 2012, 20, 514–525.
  25. Baydogan, M.G.; Runger, G.C. Time series representation and similarity based on local autopatterns. Data Min. Knowl. Discov. 2016, 30, 476–509.
  26. Ye, Y.; Jiang, J.; Ge, B.; Dou, Y.; Yang, K. Similarity measures for time series data classification using grid representation and matrix distance. Knowl. Inf. Syst. 2019, 60, 1105–1134.
  27. Su, Z.; Liu, Q.; Zhao, C.; Sun, F. A Traffic Event Detection Method Based on Random Forest and Permutation Importance. Mathematics 2022, 10, 873.
  28. Dau, H.A.; Bagnall, A.; Kamgar, K.; Yeh, C.-C.M.; Zhu, Y.; Gharghabi, S.; Ratanamahatana, C.A.; Keogh, E. The UCR Time Series Archive. IEEE/CAA J. Autom. Sin. 2019, 6, 6–18.
Figure 1. Symbolized approximate representation of time series data by SAX.
Figure 2. Some examples of SAX-TD. (a–f) represent the trend distances calculated by the SAX-TD method for sequences of various common shapes.
Figure 3. Some cases of SAX-BD. (a–d) represent four sequences with the same mean, where the left and right endpoints of (a,b) are the same, while the maximum and minimum values of (c,d) are the same, respectively.
Figure 4. Schematic diagram of SAX-DM trend distance. (a) represents the trend distance calculated by the SAX-DM method for sequences with an upward trend. (b) represents the trend distance calculated by the SAX-DM method for sequences with a downward trend.
Figure 5. Comparison results of SAX-DM and ED.
Figure 6. Comparison results of SAX-DM and SAX.
Figure 7. Comparison results of SAX-DM and SAX-TD.
Figure 8. Comparison results of SAX-DM and SAX-BD.
Figure 9. Comparison of classification error rates under different α and w conditions.
Figure 10. Accuracy and compression ratio of SAX and its improved methods.
Table 1. Main representation methods of time series data.

| Representation | Publication Time | Type | Algorithm Complexity | Method Source |
|---|---|---|---|---|
| Discrete Fourier Transform (DFT) | 1993 | T1 | O(n log n) | [2] |
| Discrete Wavelet Transform (DWT) | 1999 | T1 | O(n) | [3] |
| Discrete Cosine Transform (DCT) | 1997 | T1 | N | [4] |
| Piecewise Aggregate Approximation (PAA) | 2000 | T1 | O(n) | [5] |
| Perceived Importance Points (PIPs) | 2001 | T1 | N | [6] |
| Chebyshev Polynomials (CHEBs) | 2004 | T1 | N | [7] |
| Indexable Piecewise Linear Approximation (IPLA) | 2007 | T1 | N | [8] |
| Move Split Merge (MSM) | 2013 | T1 | N | [9] |
| Shape-Context-Based DTW (SC DTW) | 2015 | T1 | N | [10] |
| Singular Value Decomposition (SVD) | 1997 | T2 | O(Mn²) | [4] |
| Piecewise Linear Approximation (PLA) | 1998 | T2 | O(n log n) | [11] |
| Piecewise Constant Approximation (PCA) | 2000 | T2 | N | [12] |
| Adaptive Piecewise Constant Approximation (APCA) | 2002 | T2 | O(n) | [13] |
| Symbolic Aggregate Approximation (SAX) | 2003 | T2 | O(n) | [14,15] |
| SAX Based on Trend Distance (SAX-TD) | 2014 | T2 | N | [16] |
| SAX Based on Boundary Distance (SAX-BD) | 2020 | T2 | N | [17] |
| Hexadecimal Aggregate Approximation (HAX) | 2021 | T2 | N | [18] |
| Point Aggregation Approximation (PAX) | 2021 | T2 | N | [18] |
| Symbol Aggregation Approximation Based on Distance and Momentum | 2022 | T2 | N | [19] |
| Convergence Trend Symbol Aggregation Approximation (SAX-TI) | 2023 | T2 | N | [20] |
| Clipped Data | 2005 | T3 | N | [21] |
| Tree-Based Representation Method | 2015 | T3 | N | [22] |
| Hidden Markov Models (HMMs) | 1998 | T4 | N | [23] |
| AutoRegressive Model | 2012 | T4 | N | [24] |
| Representation Method Based on Local Autopatterns | 2016 | T4 | N | [25] |
| Based on Random Forest and Permutation Importance | 2022 | T4 | N | [27] |
| Grid-Based Representation Method | 2019 | T4 | N | [26] |

N: complexity not listed by the authors. T1: non-data-adaptive representation methods. T2: data-adaptive representation methods. T3: data-indicative representation methods. T4: model-based representation methods.
Table 2. List of time series datasets.

| Id | Type | Name | Train | Test | Class | Length |
|---|---|---|---|---|---|---|
| 1 | Device | ACSF1 | 100 | 100 | 10 | 1460 |
| 2 | Image | Adiac | 390 | 391 | 37 | 176 |
| 3 | Image | ArrowHead | 36 | 175 | 3 | 251 |
| 4 | Spectro | Beef | 30 | 30 | 5 | 470 |
| 5 | Image | BeetleFly | 20 | 20 | 2 | 512 |
| 6 | Image | BirdChicken | 20 | 20 | 2 | 512 |
| 7 | Simulated | BME | 30 | 150 | 3 | 128 |
| 8 | Sensor | Car | 60 | 60 | 4 | 577 |
| 9 | Simulated | CBF | 30 | 900 | 3 | 128 |
| 10 | Traffic | Chinatown | 20 | 343 | 2 | 24 |
| 11 | Sensor | CinCECGTorso | 40 | 1380 | 4 | 1639 |
| 12 | Spectro | Coffee | 28 | 28 | 2 | 286 |
| 13 | Device | Computers | 250 | 250 | 2 | 720 |
| 14 | Motion | CricketX | 390 | 390 | 12 | 300 |
| 15 | Motion | CricketY | 390 | 390 | 12 | 300 |
| 16 | Motion | CricketZ | 390 | 390 | 12 | 300 |
| 17 | Image | DiatomSizeReduction | 16 | 306 | 4 | 345 |
| 18 | Image | DistalPhalanxOutlineAgeGroup | 400 | 139 | 3 | 80 |
| 19 | Image | DistalPhalanxOutlineCorrect | 600 | 276 | 2 | 80 |
| 20 | Image | DistalPhalanxTW | 400 | 139 | 6 | 80 |
| 21 | Sensor | Earthquakes | 322 | 139 | 2 | 512 |
| 22 | ECG | ECG200 | 100 | 100 | 2 | 96 |
| 23 | ECG | ECGFiveDays | 23 | 861 | 2 | 136 |
| 24 | EOG | EOGHorizontalSignal | 362 | 362 | 12 | 1250 |
| 25 | EOG | EOGVerticalSignal | 362 | 362 | 12 | 1250 |
| 26 | Spectro | EthanolLevel | 504 | 500 | 4 | 1751 |
| 27 | Image | FaceAll | 560 | 1690 | 14 | 131 |
| 28 | Image | FaceFour | 24 | 88 | 4 | 350 |
| 29 | Image | FacesUCR | 200 | 2050 | 14 | 131 |
| 30 | Image | FiftyWords | 450 | 455 | 50 | 270 |
| 31 | Image | Fish | 175 | 175 | 7 | 463 |
| 32 | Sensor | FordA | 3601 | 1320 | 2 | 500 |
| 33 | Sensor | FordB | 3636 | 810 | 2 | 500 |
| 34 | HRM | Fungi | 18 | 186 | 18 | 201 |
| 35 | Motion | GunPoint | 50 | 150 | 2 | 150 |
| 36 | Motion | GunPointAgeSpan | 135 | 316 | 2 | 150 |
| 37 | Motion | GunPointMaleVersusFemale | 135 | 316 | 2 | 150 |
| 38 | Motion | GunPointOldVersusYoung | 136 | 315 | 2 | 150 |
| 39 | Spectro | Ham | 109 | 105 | 2 | 431 |
| 40 | Image | HandOutlines | 1000 | 370 | 2 | 2709 |
| 41 | Motion | Haptics | 155 | 308 | 5 | 1092 |
| 42 | Image | Herring | 64 | 64 | 2 | 512 |
| 43 | Device | HouseTwenty | 40 | 119 | 2 | 2000 |
| 44 | Motion | InlineSkate | 100 | 550 | 7 | 1882 |
| 45 | EPG | InsectEPGRegularTrain | 62 | 249 | 3 | 601 |
| 46 | EPG | InsectEPGSmallTrain | 17 | 249 | 3 | 601 |
| 47 | Sensor | InsectWingbeatSound | 220 | 1980 | 11 | 256 |
| 48 | Sensor | ItalyPowerDemand | 67 | 1029 | 2 | 24 |
| 49 | Device | LargeKitchenAppliances | 375 | 375 | 3 | 720 |
| 50 | Sensor | Lightning2 | 60 | 61 | 2 | 637 |
| 51 | Sensor | Lightning7 | 70 | 73 | 7 | 319 |
| 52 | Spectro | Meat | 60 | 60 | 3 | 448 |
| 53 | Image | MedicalImages | 381 | 760 | 10 | 99 |
| 54 | Traffic | MelbournePedestrian | 1194 | 2439 | 10 | 24 |
| 55 | Image | MiddlePhalanxOutlineAgeGroup | 400 | 154 | 3 | 80 |
| 56 | Image | MiddlePhalanxOutlineCorrect | 600 | 291 | 2 | 80 |
| 57 | Image | MiddlePhalanxTW | 399 | 154 | 6 | 80 |
| 58 | Sensor | MoteStrain | 20 | 1252 | 2 | 84 |
| 59 | ECG | NonInvasiveFetalECGThorax1 | 1800 | 1965 | 42 | 750 |
| 60 | ECG | NonInvasiveFetalECGThorax2 | 1800 | 1965 | 42 | 750 |
| 61 | Spectro | OliveOil | 30 | 30 | 4 | 570 |
| 62 | Image | OSULeaf | 200 | 242 | 6 | 427 |
| 63 | Image | PhalangesOutlinesCorrect | 1800 | 858 | 2 | 80 |
| 64 | Sensor | Phoneme | 214 | 1896 | 39 | 1024 |
| 65 | Hemodynamics | PigAirwayPressure | 104 | 208 | 52 | 2000 |
| 66 | Hemodynamics | PigArtPressure | 104 | 208 | 52 | 2000 |
| 67 | Hemodynamics | PigCVP | 104 | 208 | 52 | 2000 |
| 68 | Sensor | Plane | 105 | 105 | 7 | 144 |
| 69 | Power | PowerCons | 180 | 180 | 2 | 144 |
| 70 | Image | ProximalPhalanxOutlineAgeGroup | 400 | 205 | 3 | 80 |
| 71 | Image | ProximalPhalanxOutlineCorrect | 600 | 291 | 2 | 80 |
| 72 | Image | ProximalPhalanxTW | 400 | 205 | 6 | 80 |
| 73 | Device | RefrigerationDevices | 375 | 375 | 3 | 720 |
| 74 | Spectrum | Rock | 20 | 50 | 4 | 2844 |
| 75 | Device | ScreenType | 375 | 375 | 3 | 720 |
| 76 | Spectrum | SemgHandGenderCh2 | 300 | 600 | 2 | 1500 |
| 77 | Spectrum | SemgHandMovementCh2 | 450 | 450 | 6 | 1500 |
| 78 | Spectrum | SemgHandSubjectCh2 | 450 | 450 | 5 | 1500 |
| 79 | Simulated | ShapeletSim | 20 | 180 | 2 | 500 |
| 80 | Image | ShapesAll | 600 | 600 | 60 | 512 |
| 81 | Device | SmallKitchenAppliances | 375 | 375 | 3 | 720 |
| 82 | Simulated | SmoothSubspace | 150 | 150 | 3 | 15 |
| 83 | Sensor | SonyAIBORobotSurface1 | 20 | 601 | 2 | 70 |
| 84 | Sensor | SonyAIBORobotSurface2 | 27 | 953 | 2 | 65 |
| 85 | Spectro | Strawberry | 613 | 370 | 2 | 235 |
| 86 | Image | SwedishLeaf | 500 | 625 | 15 | 128 |
| 87 | Image | Symbols | 25 | 995 | 6 | 398 |
| 88 | Simulated | SyntheticControl | 300 | 300 | 6 | 60 |
| 89 | Motion | ToeSegmentation1 | 40 | 228 | 2 | 277 |
| 90 | Motion | ToeSegmentation2 | 36 | 130 | 2 | 343 |
| 91 | Sensor | Trace | 100 | 100 | 4 | 275 |
| 92 | ECG | TwoLeadECG | 23 | 1139 | 2 | 82 |
| 93 | Simulated | TwoPatterns | 1000 | 4000 | 4 | 128 |
| 94 | Simulated | UMD | 36 | 144 | 3 | 150 |
| 95 | Sensor | Wafer | 1000 | 6164 | 2 | 152 |
| 96 | Spectro | Wine | 57 | 54 | 2 | 234 |
| 97 | Image | WordSynonyms | 267 | 638 | 25 | 270 |
| 98 | Motion | Worms | 181 | 77 | 5 | 900 |
| 99 | Motion | WormsTwoClass | 181 | 77 | 2 | 900 |
| 100 | Image | Yoga | 300 | 3000 | 2 | 426 |
Table 3. Approximate expression method parameters.

| Method | Parameter |
|---|---|
| ED | null |
| SAX-DM | w ∈ [5, 20], α ∈ [3, 16] |
| SAX | w ∈ [5, 20], α ∈ [3, 16] |
| SAX-TD | w ∈ [5, 20], α ∈ [3, 16] |
| SAX-BD | w ∈ [5, 20], α ∈ [3, 16] |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
