Article

PM2.5 Time Series Imputation with Moving Averages, Smoothing, and Linear Interpolation

by Anibal Flores 1,*, Hugo Tito-Chura 1, Osmar Cuentas-Toledo 2, Victor Yana-Mamani 1 and Deymor Centty-Villafuerte 3

1 Departamento Académico de Ingeniería de Sistemas e Informática, Universidad Nacional de Moquegua, Urb. Ciudad Jardin-Pacocha-Ilo, Moquegua 18611, Peru
2 Departamento Académico de Ingeniería Civil, Universidad Nacional de Moquegua, Prolongación Calle Ancash S/N, Moquegua 18001, Peru
3 Departamento Académico de Ciencias Sociales y Humanidades, Universidad Nacional de Moquegua, Prolongación Calle Ancash S/N, Moquegua 18001, Peru
* Author to whom correspondence should be addressed.
Computers 2024, 13(12), 312; https://doi.org/10.3390/computers13120312
Submission received: 22 October 2024 / Revised: 17 November 2024 / Accepted: 18 November 2024 / Published: 26 November 2024

Abstract: In this work, a novel model for hourly PM2.5 time series imputation is proposed for the estimation of missing values in different gap sizes, including 1, 3, 6, 12, and 24 h. The proposed model is based on statistical techniques such as moving averages, linear interpolation smoothing, and linear interpolation. For the experimentation stage, two datasets from Ilo City in southern Peru were selected. Also, five benchmark models were implemented to compare the proposed model's results; the benchmark models include exponential weighted moving average (EWMA), autoregressive integrated moving average (ARIMA), long short-term memory (LSTM), gated recurrent unit (GRU), and bidirectional GRU (BiGRU). The results show that, in terms of average MAPEs, the proposed model outperforms the best deep learning model (GRU) by between 26.61% and 90.69%, and the best statistical model (ARIMA) by between 2.33% and 6.67%. Thus, the proposed model is a good alternative for the estimation of missing values in PM2.5 time series.

1. Introduction

Particulate matter, or PM2.5, consists of small airborne particles that are 2.5 μm or less in diameter, which is less than the thickness of a human hair [1]. The smaller the particles, the deeper they can travel into the lungs when people breathe. Particulate matter is a mixture that may include organic chemicals, dust, soot, and metals [2]. Fine particulate pollution has been shown to cause many serious health effects, including heart [3] and lung disease [4].
According to the 2023 IQAir report [5], Peru has the worst air quality in South America, with Lima ranking as the second most polluted city, following Santiago. Thus, it is important to study PM2.5 in Peru. In this work, the city of Ilo in southern Peru was taken as a study case.
One of the main challenges in analyzing PM2.5 time series is that data from environmental monitoring stations often contain missing values for various reasons. Before implementing models such as forecasting models, it is first necessary to estimate the missing values [6], as most models, e.g., those based on neural networks [7], cannot process PM2.5 time series with missing values.
There are various techniques for estimating missing values in time series, including statistical methods such as autoregressive integrated moving average (ARIMA) [8], simple moving average (SMA), linear weighted moving average (LWMA), and exponentially weighted moving average (EWMA) [9]; interpolation-based methods such as spline and Stineman [10]; machine learning-based methods, including k-nearest neighbors (KNN) [11] and support vector regression (SVR) [12]; and deep learning-based methods, including long short-term memory (LSTM) [13], gated recurrent unit (GRU), bidirectional LSTM (BiLSTM), and bidirectional GRU (BiGRU).
This work proposes the implementation of a statistical model based on moving averages, linear interpolation smoothing [14], and linear interpolation, which are combined to estimate different gaps of missing values, including 1, 3, 6, 12, and 24 h. The proposed model was evaluated against statistical techniques, including ARIMA and EWMA, as well as deep learning approaches such as LSTM, GRU, and BiGRU.
The proposed method based on moving averages, linear interpolation smoothing, and linear interpolation is inspired by two works. The first [15] structured the PM2.5 time series in matrix form (days of 24 h) and then applied weighted averages and polynomial smoothing to the matrix for time series forecasting. The second, the Local Average of Nearest Neighbors (LANN), was implemented in [16] for daily temperature time series imputation using the values immediately before and after the gap of missing data.
This work’s main contributions are summarized as follows:
- A comparative study of statistical and deep learning techniques for the estimation of missing values.
- A novel ensemble model based on moving averages, smoothing, and linear interpolation for time series imputation.
This study has several limitations. First, it relies on only two PM2.5 datasets. Additionally, the proposed model is univariate, meaning it considers only the data of the time series itself. The model is also limited to estimating a maximum of 24 h of missing data. Finally, only five benchmark models were used for comparison: EWMA, ARIMA, LSTM, GRU, and BiGRU.
The rest of the paper is organized as follows: the second section briefly reviews the literature; the third section details the methodology for implementing the proposed model; the fourth section describes and discusses the results in comparison with the benchmark models and related works; and the final section outlines the conclusions and future work.

2. Related Works

2.1. Overview of Imputation Techniques

Just like forecasting techniques, time series imputation methods have evolved. Initially, fairly simple techniques were used, such as the mean, the mode, and the last observation carried forward (LOCF) [17], among others. Later, moving average-based methods were employed, such as simple moving average (SMA), linear weighted moving average (LWMA), exponential weighted moving average (EWMA), and autoregressive integrated moving average (ARIMA), as well as interpolations like linear, spline, and Stineman.
More advanced techniques include machine learning-based methods such as k-nearest neighbors (KNN), multivariate imputation by chained equations (MICE) [18], random forest (RF) [19], multilayer perceptron (MLP) [20], local outlier factor [21], and isolation forest [21], among others. Finally, deep learning-based techniques were introduced, including long short-term memory (LSTM), bidirectional LSTM (BiLSTM), gated recurrent unit (GRU), generative adversarial networks (GANs) [22], and Transformers [23].

2.2. Related Works on PM2.5 Time Series Imputation

Related works on PM2.5 time series imputation are briefly described below.
In [24], the authors proposed an ensemble model for PM2.5 time series imputation based on decision trees such as random forests, XGBoosting, and the generalized additive model (GAM).
In [25], LSTM was proposed for PM2.5 time series imputation: missing values were randomly inserted, and LSTM, mean, and moving average imputations were compared. The best RMSE, 13.43, was achieved by LSTM for 1% of missing values.
In [26], the authors proposed a low-rank matrix completion (LRMC) algorithm for the PM2.5 time series, achieving an R2 of 0.9561.
In [27], different techniques were implemented, including mean, median, LOCF, random, Markov, Kalman, and RMM (repeated measures model), for different ratios of random missing values. The best R2 of 0.65 was achieved for 20% of missing values by Kalman.
In [28], GRU was proposed, achieving a MAPE of 11.01% and outperforming LSTM, BiLSTM, and SVM.
In [11], KNN was proposed for hourly PM2.5 time series imputation, achieving an R2 of 0.89.
In [29], GAIN was proposed for hourly PM2.5 time series, achieving an R2 of 0.895.
In [30], the authors proposed a mix of deep learning models and polynomial interpolation for the imputation of short gaps of missing values, obtaining an average MAPE of 21.43%.
In [31], the authors made a comparison of imputation techniques, including statistical and deep learning techniques, for gaps of one missing value. The best model was ARIMA.
The results of related works are summarized in Table 1.
Table 2 summarizes the key differences between the related works and the proposed model.

3. Materials and Methods

3.1. Data Collection

The data utilized in this study were obtained from the OEFA (Organismo de Evaluación y Fiscalización Ambiental) repository of the government of Peru at https://pifa.oefa.gob.pe/VigilanciaAmbiental/ (accessed on 30 July 2024). The time frame spans from 1 August 2020 to 30 April 2023. Ilo City has three environmental monitoring stations, but just two of them were selected for this study (the Pacocha and Pardo stations); they can be seen in Figure 1. Table 3 shows the data subsets used in this study.

3.2. Selection of Days

The proposed model works with test data, similar to other statistical techniques from the imputeTS [10] library such as ARIMA, LWMA, and EWMA; the estimations use only the data available in the test set. However, it is important to justify why only two days were used. For this, we analyzed the correlations between 24 days of the training data; Figure 2 and Figure 3 show these correlations. For this purpose, the corr() function for data frames from the Pandas library was used, which is based on the Pearson correlation coefficient. This work is limited to this type of correlation; other options could be explored in future work.
Correlation matrices from Figure 2 and Figure 3 were used to estimate the average correlations between each day. Equation (1) was used.
\bar{C}_j = \frac{\sum_{i=1}^{n-j} C_{i-1,\, i-1+j}}{n-j}
where j is the day offset for which the average correlation is to be determined, i traverses the pairs of days separated by j, and n is the total number of items found.
Based on (1), Table 4 is constructed, displaying the average correlations across 24 days.
According to Table 4, for a specific day i, an average correlation of 0.2044 is observed on day i − 1 and day i + 1, an average correlation of 0.1641 on day i − 2 and i + 2, an average correlation of 0.2856 on day i − 3 and i + 3, and so on.
Although the correlation with the first day was not the highest compared with the third, fourth, sixth, and other days, it was decided to work with the first days in the implementation of the proposed model, as there was not a significant difference.
Test data were organized in matrix form, so a matrix M was built. The order of this matrix is n × 24, where n is the number of days (rows) and 24 is the number of hours in a day. The imputation process is visualized graphically in Figure 4.
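To make this organization concrete, the following Python sketch reshapes an hourly series into an n × 24 day-by-hour matrix and computes the average correlation for each day offset in the spirit of Equation (1); the variable names and placeholder data are illustrative, not taken from the paper's code.

import numpy as np
import pandas as pd

# Placeholder hourly PM2.5 series covering complete days only (illustrative data).
values = pd.Series(np.random.rand(24 * 30))

# One row per day, one column per hour: the matrix M of order n x 24.
M = values.to_numpy().reshape(-1, 24)

# Day-to-day Pearson correlations, as with corr() in Pandas: each column of
# `days` holds one 24 h day, so corr() compares whole days with each other.
days = pd.DataFrame(M.T)   # 24 x n: rows are hours, columns are days
C = days.corr()            # n x n Pearson correlation matrix

# Average correlation for each day offset j (Equation (1)): the mean of the
# correlations between all pairs of days separated by j days.
n = C.shape[0]
avg_corr = {j: np.mean([C.iloc[i, i + j] for i in range(n - j)])
            for j in range(1, n)}
print(avg_corr[1])  # average correlation between consecutive days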

3.3. Insertion of Missing Values (NA)

According to Figure 4, on a given day i with missing values, day i − 1 and day i + 1 are used to estimate the missing values. So, for the experiments, missing values were inserted in alternate days (rows 1, 3, 5, … of the matrix), and Table 5 shows some samples of missing values for each gap size; in bold are the days i with missing values.
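A minimal sketch of this insertion scheme, assuming the n × 24 matrix of Section 3.2 and the alternating pattern of gaps and observed blocks visible in Table 5; the function name is hypothetical:

import numpy as np

def insert_gaps(M, gap_size):
    # Insert NA (NaN) gaps of `gap_size` hours into every other day
    # (rows 1, 3, 5, ... of the matrix), alternating gap and observed blocks.
    Mg = M.astype(float).copy()
    for i in range(1, Mg.shape[0] - 1, 2):        # days with missing values
        for start in range(0, 24, 2 * gap_size):  # gap block, then observed block
            Mg[i, start:start + gap_size] = np.nan
    return Mg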

3.4. Implementation of Moving Averages (MA)

The proposed model utilizes moving averages for the estimation of missing values. The corresponding equation for a given hour is shown in (2).
h_j = \frac{M_{i-1,\, j} + M_{i+1,\, j}}{2}
where h_j is the jth hour of the day with missing values to be estimated, M is the test data matrix with missing values, i − 1 is the prior day, and i + 1 is the next day.
Figure 5 shows the estimation of 72 h of missing values with moving averages for gaps of 24 h.
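A direct Python reading of Equation (2), assuming that days i − 1 and i + 1 are fully observed in the matrix M:

import numpy as np

def ma_estimate(M, i):
    # Equation (2): each hour of day i is the average of the same hour on
    # the previous day (i - 1) and the next day (i + 1).
    return (M[i - 1, :] + M[i + 1, :]) / 2.0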

3.5. Smoothing with Linear Interpolation (MA + LI)

In this stage, the results obtained from the moving averages were smoothed. For this process, different block sizes were evaluated, e.g., blocks of three, four, five, and six items; the best results were obtained with blocks of four items. Given the array of results, for each block of four items, items 1 and 2 are discarded, and items 0 and 3 are linearly interpolated to re-estimate items 1 and 2; then, a new block of four items is considered, and so on. Figure 6 shows the results of this process. Two points, (x1, y1) and (x2, y2), can be linearly interpolated using Equation (3):
y - y_1 = \frac{y_2 - y_1}{x_2 - x_1}\,(x - x_1)
where y is the interpolated value for a given x.
The algorithm for moving averages and linear interpolation smoothing can be seen in Algorithm 1.
Algorithm 1 Moving averages and linear interpolation smoothing (MA + LI)
1:  n = len(m) − 1
2:  i = 1
3:  soli = []
4:  while (i < n)
5:  begin
6:      j = 0
7:      rs = []
8:      while (j < 24)
9:      begin
10:         prr = i − 1
11:         nxt = i + 1
12:         ma = (m[prr][j] + m[nxt][j]) / 2
13:         rs.append(ma)
14:         j = j + 1
15:     end
16:     steps = 5
17:     lrr = []
18:     c = 0
19:     fin = 24 − steps
20:     while (c < fin)
21:     begin
22:         prr = rs[c]
23:         nxt = rs[c + steps]
24:         rr = iLinear(1, 2, prr, nxt, steps − 1)
25:         nrr = len(rr)
26:         idx = c + 1
27:         for (a = 1 to nrr)
28:         begin
29:             rs[idx] = rr[a]
30:             idx = idx + 1
31:         end
32:         c = c + steps
33:     end
34:     soli.append(rs)
35:     i = i + 2
36: end
The algorithm requires as input the matrix m of test data with missing values. In the outer loop, the matrix m is traversed using a counter i to access the rows with missing values (1, 3, 5, and so on).
Between lines 6 and 15, moving averages are estimated, considering that day i is the day with missing values and that the prior day i − 1 and the next day i + 1 do not have missing values. The result of this block is an array rs with the 24 h of day i estimated with moving averages.
Between lines 16 and 34, the linear interpolation smoothing is implemented. The array rs produced by the previous block is linearly interpolated considering a number of steps; the minimum value to consider is 2 and the maximum is 22, as there must be at least two values for linear interpolation. In this case, the number of steps was set to five, as optimal results were obtained with this configuration. In this process, an iLinear function is used with the following parameters: x1 = 1, x2 = 2, y1 = prr, y2 = nxt, and number of estimated values = steps − 1. This function estimates n linear values given a pair of points. The result of this block is the same array rs, but with values interpolated according to the number of steps. Later, rs is used to estimate missing values with MA + LI.
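The following Python sketch mirrors Algorithm 1. The iLinear helper is reconstructed under the assumption that it returns the steps − 1 evenly spaced values strictly between its two anchor points, which is one plausible reading of the pseudocode:

import numpy as np

def i_linear(x1, x2, y1, y2, m):
    # Assumed behavior of the paper's iLinear: m evenly spaced values on the
    # line through (x1, y1) and (x2, y2), excluding the endpoints. With evenly
    # spaced samples only the y anchors matter; x1 and x2 are kept for fidelity.
    fr = np.arange(1, m + 1) / (m + 1)
    return y1 + (y2 - y1) * fr

def ma_li(m, steps=5):
    # Python rendering of Algorithm 1 (MA + LI) over the n x 24 matrix m.
    soli = []
    for i in range(1, len(m) - 1, 2):  # rows with missing values: 1, 3, 5, ...
        # Lines 6-15: moving averages from days i - 1 and i + 1 (Equation (2)).
        rs = [(m[i - 1][j] + m[i + 1][j]) / 2.0 for j in range(24)]
        # Lines 16-34: smoothing; the anchors rs[c] and rs[c + steps] are kept,
        # and the interior values are re-estimated by linear interpolation.
        c = 0
        while c < 24 - steps:
            rs[c + 1:c + steps] = list(i_linear(1, 2, rs[c], rs[c + steps],
                                                steps - 1))
            c += steps
        soli.append(rs)
    return soli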

3.6. Implementation of Local Average of Nearest Neighbors (LANN)

LANN [16] is a simple moving average technique used for imputing short gaps, yielding good results. For gaps larger than one missing value, it works similarly to linear interpolation. Unlike MA + LI, which uses information from days i − 1 and i + 1, LANN uses information from the day i with missing values and, when none is available, the last observed hour before the gap and the next observed hour after it in the time series. It is important to highlight that LANN works directly with the time series containing missing values. Figure 7 shows the results of LANN.
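LANN itself can be sketched as follows, under one plausible reading of [16]: every value inside a gap is replaced by the local average of the last observed value before the gap and the first observed value after it.

import numpy as np

def lann(series):
    # Sketch of LANN: impute each gap with the average of its nearest
    # observed neighbors; assumes the series is not entirely missing.
    x = np.asarray(series, dtype=float).copy()
    missing = np.isnan(x)
    i = 0
    while i < len(x):
        if missing[i]:
            j = i
            while j < len(x) and missing[j]:
                j += 1                      # first observed index after the gap
            prev_v = x[i - 1] if i > 0 else x[j]
            next_v = x[j] if j < len(x) else x[i - 1]
            x[i:j] = (prev_v + next_v) / 2.0
            i = j
        else:
            i += 1
    return x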

3.7. Ensembling MA + LI and LANN with Weighted Averages

Since LANN produces better results than MA + LI for short gaps, and MA + LI yields better results than LANN for large gaps, different weights were tested to determine the optimal ones. Table 6 shows the optimal weights for each model. Figure 8 shows a graphical view of the results of the proposed model. Equation (4) shows how weights are used to estimate final results.
r_i = w_1 \cdot ML_j + w_2 \cdot L_j
where i is the gap size, and j is the index of the estimated missing value in the corresponding array of results. w1 and w2 are the optimal weights for the gap size. ML is the array of MA + LI results, and L is the array of LANN results.
The process of obtaining the optimal weights was carried out through an iterative process. In the first iteration, w1 = 0.05 and w2 = 0.95. In the second iteration, w1 was increased by 0.05 while w2 was decreased by 0.05, resulting in w1 = 0.10 and w2 = 0.90. In the third iteration, w1 was again increased, while w2 decreased, resulting in w1 = 0.15 and w2 = 0.85. This continued until reaching w1 = 0.95 and w2 = 0.05.
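A sketch of this grid search is shown below; the text does not state which error metric drove the selection, so RMSE is assumed here:

import numpy as np

def best_weights(ml_pred, lann_pred, observed):
    # w1 runs from 0.05 to 0.95 in steps of 0.05 with w2 = 1 - w1 (Equation (4));
    # the pair with the lowest error on the observed values is kept.
    ml_pred, lann_pred, observed = map(np.asarray, (ml_pred, lann_pred, observed))
    best = None
    for w1 in np.arange(0.05, 1.0, 0.05):
        w2 = 1.0 - w1
        r = w1 * ml_pred + w2 * lann_pred
        rmse = np.sqrt(np.mean((r - observed) ** 2))  # assumed selection metric
        if best is None or rmse < best[0]:
            best = (rmse, round(w1, 2), round(w2, 2))
    return best  # (error, w1, w2)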
According to Figure 8, it can be seen how MA + LI and LANN complement each other well; while MA + LI tends to estimate higher values, LANN tends to estimate lower values, and combined, they estimate values closer to the observed ones.

3.8. Evaluation

The results were evaluated through four metrics: the root mean squared error (RMSE), which measures the estimation error in μg/m³; the mean absolute percentage error (MAPE), which quantifies the estimation error as a percentage; the coefficient of determination R2, which measures the level of correlation between the estimated and the observed values; and the relative RMSE (RRMSE), which measures the magnitude of the mean squared error relative to the observed values. These metrics were implemented through Equations (5)–(8) as follows:
RMSE = \sqrt{\frac{\sum_{i=1}^{n} (P_i - O_i)^2}{n}}

MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{O_i - P_i}{O_i} \right| \times 100

R^2 = \left( \frac{\sum_{i=1}^{n} (O_i - \bar{O})(P_i - \bar{P})}{\sqrt{\sum_{i=1}^{n} (O_i - \bar{O})^2}\, \sqrt{\sum_{i=1}^{n} (P_i - \bar{P})^2}} \right)^2

RRMSE = \frac{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (O_i - P_i)^2}}{\frac{1}{n} \sum_{i=1}^{n} P_i}
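These four metrics translate directly into NumPy, with O the observed and P the predicted values, as in Equations (5)–(8):

import numpy as np

def evaluate(observed, predicted):
    # Equations (5)-(8): RMSE, MAPE (%), R2 (squared Pearson correlation),
    # and RRMSE relative to the mean of the predictions.
    O = np.asarray(observed, dtype=float)
    P = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((P - O) ** 2))
    mape = np.mean(np.abs((O - P) / O)) * 100
    r2 = np.corrcoef(O, P)[0, 1] ** 2
    rrmse = rmse / np.mean(P)
    return rmse, mape, r2, rrmse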

4. Results and Discussion

4.1. Results

The obtained results from the proposed model are described in this sub-section.
For Pacocha station, according to Table 7, in terms of RMSE and RRMSE, the proposed model outperforms the standalone models in all gap sizes, with the superiority being more significant in large gap sizes such as 6, 12, and 24 h. In terms of MAPE, LANN outperforms the proposed model in almost all gap sizes, except for 24 h of missing values. Finally, in terms of R2, LANN presents similar results to the proposed model for the 1 and 3 h gap sizes, and the proposed model outperforms LANN for the 12 and 24 h gap sizes; only in the 6 h gap size does LANN outperform the proposed model.
For Pardo station, according to Table 8, in terms of RMSE, MAPE, R2, and RRMSE, the proposed model outperforms LANN and MA + LI in all gap sizes. The difference is more significant in large gaps such as 6, 12, and 24.
In summary, for Pardo station, the proposed model outperforms the standalone models MA + LI and LANN in all gap sizes and all metrics. For Pacocha station, however, the proposed model is better than the standalone models only in terms of RMSE and RRMSE; in terms of R2, it is better in most cases, except for the 6 h gaps, where LANN is better; and in terms of MAPE, LANN is better in most cases, with the ensemble model being the best in just one case (24 h gaps).

4.2. Discussions

4.2.1. Benchmark Models

To determine the contribution of the proposed ensemble model, five well-known, state-of-the-art models, namely, ARIMA, EWMA, LSTM, GRU, and BiGRU, were implemented.
Statistical models were implemented in R language using the imputeTS 3.1 library.
EWMA is a technique used to smooth time series by calculating an average that gives more weight to more recent data and less weight to older data. This technique is especially useful in contexts where it is important to capture trends and patterns in data that may be noisy or volatile.
ARIMA is a technique used for the analysis and prediction of time series. It is especially effective for data showing seasonality and trend patterns. The ARIMA model combines three main components: autoregression (AR), differencing (I), and moving average (MA).
The parameters for the statistical models EWMA and ARIMA are given in Table 9.
The window size (k) for the moving average techniques was selected from three experiments: EWMA and ARIMA were implemented with k = 2, 4, and 6, and both techniques obtained their best results with k = 4.
Deep learning models were implemented in Python language using TensorFlow 2.9.0. These models were trained, validated, and tested with the information provided in Table 3.
LSTM is a recurrent neural network (RNN) designed to learn patterns in sequences of data. Unlike traditional RNNs, which have difficulty learning long-term dependencies due to the vanishing gradient problem, LSTMs are structured with memory cells that can retain information for extended periods of time.
GRU is another type of RNN designed to handle data sequences and long-term dependencies, like LSTM, but with a simpler architecture. It was introduced as a more efficient variant of LSTM, with fewer parameters and less computational complexity, sometimes making it faster and easier to train.
BiGRU is a variant of the GRU (gated recurrent unit) model that works in both directions of a data stream. While a traditional GRU processes information in only one direction, a BiGRU model uses two GRU layers, one that reads the data stream in the normal direction (from left to right) and one that reads it in the reverse direction (from right to left).
The corresponding hyperparameters of the deep learning models are given in Table 10.
For each deep learning-based model, the look-back window was set to 24 h; the layers include 24, 48, and 24 neurons, respectively; drop rates of 0.1 (10%) follow the layers to avoid overfitting; and ReLU is the activation function. Each deep learning model was trained over 100 epochs, using adam as the optimizer and mse as the loss function; the results were optimal with 100 epochs for all models.
A selection process for the number of neurons in each layer for the LSTM and GRU models was implemented. Three combinations of neuron numbers were used: [24, 24, 24, n], [24, 48, 24, n], and [48, 48, 48, n]. Additionally, three combinations of drop rates were used: [‘’, 0.1, 0.1], [‘’, 0.2, 0.2], and [‘’, 0.3, 0.3]. The optimal results were obtained with the layer configuration [24, 48, 24, n] and drop rates [‘’, 0.1, 0.1]. The BiGRU model used the optimal hyperparameters of the GRU model. For future work, other hyperparameter combinations can be explored.
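One plausible Keras construction of the GRU variant in Table 10 follows; the exact layer wiring is not given in the text, so the dropout placement mirrors the [‘’, 0.1, 0.1] pattern, and the code is a sketch rather than the authors' implementation:

import tensorflow as tf

def build_gru(gap_size, look_back=24):
    # Assumed reading of Table 10: GRU layers of 24, 48, and 24 units, dropout
    # of 0.1 after the second and third layers, ReLU activations, and a dense
    # head that emits `gap_size` imputed values.
    model = tf.keras.Sequential([
        tf.keras.layers.GRU(24, activation="relu", return_sequences=True,
                            input_shape=(look_back, 1)),
        tf.keras.layers.GRU(48, activation="relu", return_sequences=True),
        tf.keras.layers.Dropout(0.1),
        tf.keras.layers.GRU(24, activation="relu"),
        tf.keras.layers.Dropout(0.1),
        tf.keras.layers.Dense(gap_size),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="mse")
    return model

# Example: model = build_gru(gap_size=24); the paper reports 100 training epochs.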
The results of the benchmark models are presented in Table 11 and Table 12.
In Pacocha station, according to Table 11, in terms of RMSE, the best benchmark model is ARIMA, outperforming EWMA, LSTM, GRU, and BiGRU. From the deep learning models, the best is GRU. However, the proposed model is better than ARIMA; it outperformed ARIMA in all gap sizes. In terms of MAPE, the proposed model is the best in almost all gap sizes (1, 3, 6, and 24); only in the gap size of 12 h is ARIMA better. In terms of R2 and RRMSE, the proposed model is the best.
In terms of average MAPEs, the proposed model outperforms the best deep learning model (GRU) by 26.61% and the best statistical model (ARIMA) by 2.33%. A graphical comparison between the best benchmark models and the proposed model can be seen in Figure 9.
In Pardo station, according to Table 12, similar to the Pacocha station results, in terms of RMSE, the best benchmark model is ARIMA, outperforming EWMA, LSTM, GRU, and BiGRU. From the deep learning models, the best is GRU. The proposed model is better than ARIMA; it outperformed ARIMA in all gap sizes. In terms of RMSE, MAPE, R2, and RRMSE, the proposed model is the best in all gap sizes (1, 3, 6, 12, and 24).
In terms of average MAPEs, the proposed model outperforms the best deep learning model (GRU) by 90.69% and the best statistical model (ARIMA) by 6.67%. A graphical comparison between the best benchmark models and the proposed model can be seen in Figure 10.
In this study, according to Figure 11, statistical models use a window size of k = 4, which means that for estimating each block of missing values, they only use the two previous values and the two following values. On the other hand, deep learning models use only the values preceding the block of missing values according to the look-back parameter of 24 h and estimate n values according to the gap size of missing values.
Another difference between deep learning and statistical techniques is that the former have a set of strategies to improve their performance, such as the number of layers, neurons, activation functions, learning rate, drop rates, and data augmentation, among others. In contrast, the latter have fewer parameters, such as the window size (k). However, statistical techniques require less data and simpler processes to produce significant results, which is their main advantage over deep learning techniques.
This work has shown how classical statistical techniques such as EWMA and ARIMA can perform tasks more precisely than deep learning techniques, especially when data is not abundant. Likewise, the proposed model, implemented with well-known statistical techniques including moving averages, smoothing, and linear interpolation, can produce superior results compared to the benchmark models, making it a good alternative for the estimation of missing values in PM2.5 time series.

Statistical Test

To determine whether there are significant differences between the proposed model and the benchmark models, the Kolmogorov–Smirnov test was applied. For this, the stats sub-package of the SciPy 1.10.1 library [32] was used in Python. Table 13 and Table 14 show the test results.
As shown in Table 13 for the Pacocha station, when comparing the results of the proposed model with the benchmark models, a p-value of less than 0.05 is observed in most cases, indicating the rejection of the null hypothesis in favor of the alternative. This indicates that, in most cases, there is a significant difference between the proposed and the benchmark models for the different gap sizes of missing values. The proposed model does not exhibit a significant difference from EWMA when tested with 1, 3, or 6 missing values. Likewise, no significant difference is observed compared to ARIMA when dealing with a single missing value.
Similar to the previous station, according to Table 14 for the Pardo station, when comparing the proposed model to the benchmark models, the p-value is less than 0.05 in every case, indicating the rejection of the null hypothesis and the acceptance of the alternative hypothesis. This indicates that there is a significant difference between the proposed model and the benchmark models for all gap sizes of missing values.
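A minimal sketch of the test, using scipy.stats.ks_2samp on hypothetical result arrays:

import numpy as np
from scipy import stats

# Hypothetical imputation results for the same gap positions.
rng = np.random.default_rng(0)
proposed = rng.normal(20, 5, 144)   # estimations from the proposed model
benchmark = rng.normal(22, 6, 144)  # estimations from a benchmark, e.g., ARIMA

# Two-sample Kolmogorov-Smirnov test [32]; p < 0.05 rejects the null
# hypothesis that both samples come from the same distribution.
statistic, p_value = stats.ks_2samp(proposed, benchmark)
print(statistic, p_value)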

Moving Averages Limitations

One of the main issues with models based on moving averages is the inability to capture long-range dependence. In this work, with the proposed model, this problem is partially resolved due to the way the data is organized for performing the regressions. Instead of working with an extensive one-dimensional time series, the data is organized in two dimensions, that is, in matrix form. Each row corresponds to a day, and each column corresponds to an hour of the day, hence the importance of discarding incomplete days and only using days with the available full 24 h. This allows for the estimation of up to 24 consecutive hours of missing data, as long as they all correspond to a complete day. If the 24 h span across two days, the proposed model would not be able to estimate them, which is a limitation of the proposal.
Similarly, another major limitation lies in handling gaps larger than 24 h. This should be addressed in future work, where a simple moving average may not be sufficient. Since the data is organized in a matrix format, other techniques like EWMA or ARIMA with windows (k) greater than four could be used to estimate more than one missing value.
On the other hand, not all datasets contain hourly data; some contain daily data, others half-hour, fifteen-minute intervals, and so on. For these cases, the data could be organized into matrices of order n × 30, n × 48, and n × 96, respectively.

5. Conclusions and Future Work

5.1. Conclusions

Based on the obtained results, the proposed model, which combines moving averages, smoothing, and linear interpolation, proves to be an effective alternative for estimating missing values in gap sizes of 1 h, 3 h, 6 h, 12 h, and 24 h in PM2.5 time series. In terms of average MAPEs, the proposed model outperforms the best deep learning model (GRU) between 26.61% and 90.69% and the best statistical model (ARIMA) between 2.33% and 6.67%. This highlights the importance of how data is processed. While modern techniques, such as those based on deep learning, rely solely on preceding data, statistical techniques leverage preceding and subsequent data based on the position of the missing values. Additionally, with the proposed approach, the matrix organization of the data allows the application of other techniques, such as moving averages and smoothing, improving the accuracy of the estimations for up to 24 h.

5.2. Future Work

Despite the interesting results produced by the proposed model, some aspects need improvement. For example, in terms of MAPE for the Pardo station, results between 12.17% and 23.56% have been achieved, which, according to [33], would fall between a good and a reasonable model. For the Pacocha station, the MAPE is between 20.46% and 57.16%, and according to [33], the model ranges from reasonable to poor. Therefore, to improve the model, other experiments could be tried, such as using days other than i − 1 and i + 1 for the moving averages. Instead of linear interpolation for smoothing, methods like spline interpolation, inverse distance weighting (IDW), or Kriging could be used. Additionally, ARIMA or EWMA might perform better than LANN as the second ensemble component. Also, the proposed model is set up to estimate a maximum of 24 values; future work should address the longer gaps that occur in real PM2.5 time series, as well as other data frequencies.

Author Contributions

Conceptualization, A.F. and H.T.-C.; methodology, A.F.; software, V.Y.-M. and O.C.-T.; validation, A.F., H.T.-C. and V.Y.-M.; formal analysis, O.C.-T.; investigation, A.F.; resources, O.C.-T.; data curation, H.T.-C.; writing—original draft preparation, A.F.; writing—review and editing, A.F. and D.C.-V.; visualization, O.C.-T.; supervision, H.T.-C.; project administration, D.C.-V.; funding acquisition, D.C.-V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding and the APC was funded by Universidad Nacional de Moquegua.

Data Availability Statement

Data are available at https://pifa.oefa.gob.pe/VigilanciaAmbiental/ (accessed on 30 July 2024) or by contacting the corresponding author. Source code is available at the following link: https://shorturl.at/YNBK4 (accessed on 22 October 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mahalakshmi, M.; Abdul Haq, N. Foretelling of Space-Time PM2.5 Air Pollutant Using Machine Learning. In Proceedings of the 2022 4th International Conference on Advances in Computing, Communication Control and Networking, ICAC3N 2022, Greater Noida, India, 16–17 December 2022. [Google Scholar] [CrossRef]
  2. Deng, J.; Jiang, L.; Miao, W.; Zhang, J.; Dong, G.; Liu, K.; Chen, J.; Peng, T.; Fu, Y.; Zhou, Y.; et al. Characteristics of fine particulate matter (PM2.5) at Jinsha Site Museum, Chengdu, China. Environ. Sci. Pollut. Res. 2022, 29, 1173–1183. [Google Scholar] [CrossRef]
  3. Oh, J.; Choi, S.; Han, C.; Lee, D.W.; Ha, E.; Kim, S.; Bae, H.J.; Pyun, W.B.; Hong, Y.C.; Lim, Y.H. Association of long-term exposure to PM2.5 and survival following ischemic heart disease. Environ. Res. 2023, 216, 114440. [Google Scholar] [CrossRef]
  4. Ni, Y.; Shi, G.; Qu, J. Indoor PM2.5, tobacco smoking and chronic lung diseases: A narrative review. Environ. Res. 2020, 181, 108910. [Google Scholar] [CrossRef]
  5. IQAir. Interactive Global Map of 2023 PM2.5 Concentrations by City. 2023. Available online: https://www.iqair.com/world-air-quality-report (accessed on 30 August 2024).
  6. Wen, H.; Pinson, P.; Gu, J.; Jin, Z. Wind energy forecasting with missing values within a fully conditional specification framework. Int. J. Forecast. 2024, 40, 77–95. [Google Scholar] [CrossRef]
  7. Han, J.; Kang, S. Optimization of missing value imputation for neural networks. Inf. Sci. 2023, 649, 119668. [Google Scholar] [CrossRef]
  8. Box, G.E.P.; Jenkins, G.M.; Reinsel, G. Time Series Analysis, Forecasting and Control, 4th ed.; Wiley: Hoboken, NJ, USA; Prentice-Hall: Englewood Cliffs, NJ, USA, 2008. [Google Scholar]
  9. Moritz, S. Package imputeTS. 2022. Available online: https://cran.r-project.org/web/packages/imputeTS/imputeTS.pdf (accessed on 11 September 2024).
  10. Moritz, S.; Bartz-Beielstein, T. imputeTS: Time series missing value imputation in R. R J. 2017, 9, 207–218. [Google Scholar] [CrossRef]
  11. Belachsen, I.; Broday, D.M. Imputation of Missing PM2.5 Observations in a Network of Air Quality Monitoring Stations by a New kNN Method. Atmosphere 2022, 13, 1934. [Google Scholar] [CrossRef]
  12. Dahmani, S.; Latif, S.D. Streamflow Data Infilling Using Machine Learning Techniques with Gamma Test. Water Resour. Manag. 2024, 38, 701–716. [Google Scholar] [CrossRef]
  13. Qiu, C. A Method Using LSTM Networks to Impute Missing Temperatures in Temperature Datasets and to Predict Future Temperatures. Highlights Sci. Eng. Technol. 2023, 46, 116–124. [Google Scholar] [CrossRef]
  14. Saini, H.; Raicar, G.; Dehzangi, A.; Lal, S.; Sharma, A. Subcellular localization for Gram positive and Gram negative bacterial proteins using linear interpolation smoothing model. J. Theor. Biol. 2015, 386, 25–33. [Google Scholar] [CrossRef]
  15. Flores, A.; Tito-Chura, H.; Yana-Mamani, V.; Rosado-Chavez, C.; Ecos-Espino, A. Weighted Averages and Polynomial Interpolation for PM2.5 Time Series Forecasting. Computers 2024, 13, 238. [Google Scholar] [CrossRef]
  16. Flores, A.; Tito, H.; Silva, C. Local average of nearest neighbors: Univariate time series imputation. Int. J. Adv. Comput. Sci. Appl. 2019, 10, 45–50. [Google Scholar] [CrossRef]
  17. Wongoutong, C. Imputation Methods in Time Series with a Trend and a Consecutive Missing Value Pattern. Thail. Stat. 2021, 19, 866–879. [Google Scholar]
  18. Shin, Y.; Kim, D.-H.; Kim, H.-J.; Lim, C.; Woo, S.-B. Imputation of Missing SST Observation Data Using Multivariate Bidirectional RNN. J. Korean Soc. Coast. Ocean Eng. 2022, 34, 109–118. [Google Scholar] [CrossRef]
  19. Dwivedi, D.; Mital, U.; Faybishenko, B.; Dafflon, B.; Varadharajan, C.; Agarwal, D.; Williams, K.H.; Steefel, C.I.; Hubbard, S.S. Imputation of Contiguous Gaps and Extremes of Subhourly Groundwater Time Series Using Random Forests. J. Mach. Learn. Model. Comput. 2022, 3, 22. [Google Scholar] [CrossRef]
  20. Kim, G.B.; Choi, M.R.; Hwang, C.I. Comparison of missing value imputations for groundwater levels using multivariate ARIMA, MLP, and LSTM. J. Geol. Soc. Korea 2020, 56, 561–569. [Google Scholar] [CrossRef]
  21. Walkowiak, T. Feature Transformations for Outlier Detection in Classification of Text Documents. In Lecture Notes in Networks and Systems; Springer: Cham, Switzerland, 2022; Volume 484 LNNS. [Google Scholar] [CrossRef]
  22. Xu, L.; Xu, L.; Yu, J. Time series imputation with GAN inversion and decay connection. Inf. Sci. 2023, 643, 119234. [Google Scholar] [CrossRef]
  23. Yıldız, A.Y.; Koç, E.; Koç, A. Multivariate Time Series Imputation With Transformers. IEEE Signal Process. Lett. 2022, 29, 2517–2521. [Google Scholar] [CrossRef]
  24. Xiao, Q.; Chang, H.H.; Geng, G.; Liu, Y. An Ensemble Machine-Learning Model to Predict Historical PM2.5 Concentrations in China from Satellite Data. Environ. Sci. Technol. 2018, 52, 13260–13269. [Google Scholar] [CrossRef]
  25. Yuan, H.; Xu, G.; Yao, Z.; Jia, J.; Zhang, Y. Imputation of missing data in time series for air pollutants using long short-term memory recurrent neural networks. In Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers, Singapore, 8–12 October 2018. [Google Scholar] [CrossRef]
  26. Liu, X.; Wang, X.; Zou, L.; Xia, J.; Pang, W. Spatial imputation for air pollutants data sets via low rank matrix completion algorithm. Environ. Int. 2020, 139, 105719. [Google Scholar] [CrossRef]
  27. Hadeed, S.J.; O’Rourke, M.K.; Burgess, J.L.; Harris, R.B.; Canales, R.A. Imputation methods for addressing missing data in short-term monitoring of air pollutants. Sci. Total Environ. 2020, 730, 139140. [Google Scholar] [CrossRef] [PubMed]
  28. Saif-ul-Allah, M.W.; Qyyum, M.A.; Ul-Haq, N.; Salman, C.A.; Ahmed, F. Gated Recurrent Unit Coupled with Projection to Model Plane Imputation for the PM2.5 Prediction for Guangzhou City, China. Front. Environ. Sci. 2022, 9, 819616. [Google Scholar] [CrossRef]
  29. Lee, Y.S.; Choi, E.; Park, M.; Jo, H.; Park, M.; Nam, E.; Kim, D.G.; Yi, S.M.; Kim, J.Y. Feature extraction and prediction of fine particulate matter (PM2.5) chemical constituents using four machine learning models. Expert Syst. Appl. 2023, 221, 119696. [Google Scholar] [CrossRef]
  30. Flores, A.; Tito-Chura, H.; Centty-Villafuerte, D.; Ecos-Espino, A. Pm2.5 Time Series Imputation with Deep Learning and Interpolation. Computers 2023, 12, 165. [Google Scholar] [CrossRef]
  31. Flores, A.; Tito-Chura, H.; Ecos-Espino, A.; Flores-Quispe, E. Comparative Study of Imputation Techniques for Missing Value Estimation in Particulate Matter 2.5 µm Time Series. Pollution 2024, 10, 1117–1127. [Google Scholar] [CrossRef]
  32. SciPy. scipy.stats.ksTest. 2024. Available online: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html (accessed on 21 October 2024).
  33. Montaño Moreno, J.J.; Palmer Pol, A.; Sesé Abad, A.; Cajal Blasco, B. Using the R-MAPE index as a resistant measure of forecast accuracy. Psicothema 2013, 25, 500–506. [Google Scholar] [CrossRef]
Figure 1. Location of the environmental monitoring stations in Ilo City, Peru.
Figure 2. The 24-day correlation of Pacocha station.
Figure 3. The 24-day correlation of Pardo station.
Figure 4. Two days were considered to impute missing values.
Figure 5. The 72 estimated hours with the moving average equation for gaps of 24 h.
Figure 6. The 72 estimated hours with the moving average and linear interpolation smoothing for gaps of 24 h.
Figure 7. The 72 estimated hours for gaps of 24 h using LANN.
Figure 8. The 72 estimated hours for gaps of 24 h with the proposed model.
Figure 9. Imputations of 144 h for 24 h gaps using GRU, ARIMA, and the proposed model for Pacocha Station.
Figure 10. Imputations of 144 h for 24 h gaps using GRU, ARIMA, and the proposed model for Pardo Station.
Figure 11. How the benchmark models work in this study. (a) Statistical models and (b) deep learning models. NA is the not available or missing value.
Table 1. Summary of related works.

Work | Technique | Country | Frequency | Gap Size | Metric | Value
[24] | RF + XGB + GAM | China | Hourly | 1 | R2 | 0.85
[25] | LSTM | China | Hourly | Random | RMSE | 13.43
[26] | LRMC | China | Hourly | Random | R2 | 0.926
[27] | Kalman | USA | Minute | Random | R2 | 0.65
[28] | GRU | China | Daily | 1 | MAPE | 11.01
[11] | KNN | China | Half-hourly | 0.5 h, 2 y | R2 | [0.82–0.57]
[29] | GAIN | Korea | Hourly | Random | R2 | 0.89
[30] | Deep Learning + Polynomial Interpolation | Peru | Hourly | 1 | MAPE | 21.43
[31] | ARIMA | Peru | Hourly | 1 | MAPE, R2 | 10.0192, 0.8247
Table 2. The related works vs. the proposed model.

Related Works | Proposed Model
Most of them are based on machine and deep learning techniques. | It is based on statistical techniques.
Some of them work with random gap sizes. | It works with different gap sizes.
Some of them work with just one gap size. | It works with 1, 3, 6, 12, and 24 gap sizes.
Most of the models proposed in the literature rely on extensive training data. | The model proposed in this work requires very little data.
Table 3. Data subsets.

Station | Total Hours | Train (70%) * | Validation (10%) * | Test (20%)
Pacocha | 16,344 | 11,772 | 1308 | 3265
Pardo | 21,960 | 16,260 | 1756 | 4392
* Train and validation data were used for deep learning models.
Table 4. Average correlations across 24 days.

Day | Pacocha | Pardo | Avg
1 | 0.2637 | 0.1450 | 0.2044 ± 0.08
2 | 0.1903 | 0.1378 | 0.1641 ± 0.04
3 | 0.3079 | 0.2632 | 0.2856 ± 0.03
4 | 0.2539 | 0.2411 | 0.2475 ± 0.01
5 | 0.2413 | 0.1729 | 0.2071 ± 0.05
6 | 0.2404 | 0.2595 | 0.2500 ± 0.01
7 | 0.3393 | 0.2604 | 0.2999 ± 0.06
8 | 0.3589 | 0.2152 | 0.2871 ± 0.10
9 | 0.1939 | 0.2572 | 0.2256 ± 0.04
10 | 0.2409 | 0.1451 | 0.1930 ± 0.07
11 | 0.1685 | 0.1410 | 0.1548 ± 0.02
12 | 0.2378 | 0.1965 | 0.2172 ± 0.03
13 | 0.2683 | 0.1785 | 0.2234 ± 0.06
14 | 0.1554 | 0.2514 | 0.2034 ± 0.07
15 | 0.2515 | 0.2888 | 0.2702 ± 0.03
16 | 0.2057 | 0.1607 | 0.1832 ± 0.03
17 | 0.0924 | 0.0662 | 0.0793 ± 0.02
18 | 0.1612 | 0.2781 | 0.2197 ± 0.08
19 | 0.0845 | 0.0313 | 0.0579 ± 0.04
20 | 0.1950 | 0.2017 | 0.1984 ± 0.00
21 | 0.1806 | 0.1579 | 0.1693 ± 0.02
22 | 0.2255 | 0.3517 | 0.2886 ± 0.09
23 | −0.2536 | 0.1823 | −0.0357 ± 0.31
Table 5. How missing values were set in test data.

Gap Size | Insertion of Missing Values (NA)
1 | 9.28, 9.6, 17.62, 18.62, 15.94, 15.51, 31.26, 20.76, 12.3, 12.54, 10.41, 7.75, 7.82, 7.55, 6.92, 6.62, 7.3, 9.39, 9.94, 12.72, 8.26, 10.58, 12.49, 7.72, NA, 15.78, NA, 38.94, NA, 28.45, NA, 25.78, NA, 20.71, NA, 16.98, NA, 9.06, NA, 8.93, NA, 11.1, NA, 24.34, NA, 23.9, NA, 58, 52.48, 38.17, 22.3, 23.6, 30.85, 33.38, 38.77, 19.4, 21.04, 18.77, 13.98, 9.78, 8.38, 8.71, 11.03, 10.36, 9.64, 8.4, 8.7, 14.13, 13.55, 15.12, 15.18, 63.83, …
3 | 9.28, 9.6, 17.62, 18.62, 15.94, 15.51, 31.26, 20.76, 12.3, 12.54, 10.41, 7.75, 7.82, 7.55, 6.92, 6.62, 7.3, 9.39, 9.94, 12.72, 8.26, 10.58, 12.49, 7.72, NA, NA, NA, 38.94, 40.98, 28.45, NA, NA, NA, 20.71, 22.59, 16.98, NA, NA, NA, 8.93, 8.45, 11.1, NA, NA, NA, 23.9, 21.75, 58, 52.48, 38.17, 22.3, 23.6, 30.85, 33.38, 38.77, 19.4, 21.04, 18.77, 13.98, 9.78, 8.38, 8.71, 11.03, 10.36, 9.64, 8.4, 8.7, 14.13, 13.55, 15.12, 15.18, 63.83, …
6 | 9.28, 9.6, 17.62, 18.62, 15.94, 15.51, 31.26, 20.76, 12.3, 12.54, 10.41, 7.75, 7.82, 7.55, 6.92, 6.62, 7.3, 9.39, 9.94, 12.72, 8.26, 10.58, 12.49, 7.72, NA, NA, NA, NA, NA, NA, 28.66, 25.78, 24.2, 20.71, 22.59, 16.98, NA, NA, NA, NA, NA, NA, 14.07, 24.34, 29.41, 23.9, 21.75, 58, 38.77, 19.4, 21.04, 18.77, 13.98, 9.78, 8.38, 8.71, 11.03, 10.36, 9.64, 8.4, 8.7, 14.13, 13.55, 15.12, 15.18, 63.83, …
12 | 9.28, 9.6, 17.62, 18.62, 15.94, 15.51, 31.26, 20.76, 12.3, 12.54, 10.41, 7.75, 7.82, 7.55, 6.92, 6.62, 7.3, 9.39, 9.94, 12.72, 8.26, 10.58, 12.49, 7.72, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 10.83, 9.06, 11, 8.93, 8.45, 11.1, 14.07, 24.34, 29.41, 23.9, 21.75, 58, 52.48, 38.17, 22.3, 23.6, 30.85, 33.38, 38.77, 19.4, 21.04, 18.77, 13.98, 9.78, 8.38, 8.71, 11.03, 10.36, 9.64, 8.4, 8.7, 14.13, 13.55, 15.12, 15.18, 63.83, …
24 | 9.28, 9.6, 17.62, 18.62, 15.94, 15.51, 31.26, 20.76, 12.3, 12.54, 10.41, 7.75, 7.82, 7.55, 6.92, 6.62, 7.3, 9.39, 9.94, 12.72, 8.26, 10.58, 12.49, 7.72, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 52.48, 38.17, 22.3, 23.6, 30.85, 33.38, 38.77, 19.4, 21.04, 18.77, 13.98, 9.78, 8.38, 8.71, 11.03, 10.36, 9.64, 8.4, 8.7, 14.13, 13.55, 15.12, 15.18, 63.83, …
In bold, the 24 h day with missing values.
Table 6. Optimal weights of standalone models for each gap size.

Gap Size | w1 (MA + LI) | w2 (LANN)
1 | 0.05 | 0.95
3 | 0.15 | 0.85
6 | 0.25 | 0.75
12 | 0.45 | 0.55
24 | 0.65 | 0.35
Table 7. Results of proposed model in Pacocha station.

Model | Gap Size of Missing Values: 1 | 3 | 6 | 12 | 24
RMSE
MA + LI | 7.6903 | 8.0136 | 7.7564 | 8.8304 | 7.5483
LANN | 4.0533 | 5.0798 | 6.0958 | 8.5701 | 8.7844
Proposed Model | 4.0506 | 5.0736 | 5.8703 | 8.0593 | 6.9101
MAPE
MA + LI | 58.2851 | 56.1228 | 58.6521 | 67.4559 | 57.8580
LANN | 20.2896 | 26.1345 | 39.5012 | 44.9316 | 69.8399
Proposed Model | 20.4655 | 26.2066 | 41.2707 | 49.9204 | 57.1673
R2
MA + LI | 0.1091 | 0.0999 | 0.1237 | 0.0745 | 0.1120
LANN | 0.7330 | 0.6148 | 0.4742 | 0.1924 | 0.1700
Proposed Model | 0.7330 | 0.6148 | 0.4694 | 0.2039 | 0.2150
RRMSE
MA + LI | 2.5177 | 2.6120 | 2.5416 | 2.7138 | 2.4697
LANN | 1.3088 | 1.6478 | 1.8982 | 2.8595 | 2.6367
Proposed Model | 1.3083 | 1.6463 | 1.8505 | 2.5871 | 2.1901
Table 8. Results of proposed model in Pardo station.

Model | Gap Size of Missing Values: 1 | 3 | 6 | 12 | 24
RMSE
MA + LI | 2.1850 | 2.0381 | 2.2226 | 2.2861 | 2.1833
LANN | 1.1519 | 1.5624 | 1.7841 | 2.1459 | 2.1435
Proposed Model | 1.1497 | 1.5231 | 1.6329 | 2.0605 | 2.0651
MAPE
MA + LI | 25.5014 | 25.0964 | 26.1030 | 28.6020 | 25.6556
LANN | 12.1816 | 16.1074 | 25.5250 | 25.7095 | 25.7792
Proposed Model | 12.1715 | 15.9611 | 23.3307 | 24.4941 | 23.5613
R2
MA + LI | 0.2262 | 0.2620 | 0.2483 | 0.2214 | 0.2221
LANN | 0.7735 | 0.5914 | 0.5529 | 0.3411 | 0.2745
Proposed Model | 0.7742 | 0.5990 | 0.5912 | 0.3731 | 0.2983
RRMSE
MA + LI | 0.9907 | 0.9251 | 1.0229 | 1.0155 | 0.9922
LANN | 0.5067 | 0.6898 | 0.7624 | 0.9653 | 0.9988
Proposed Model | 0.5060 | 0.6334 | 0.7101 | 0.9216 | 0.9466
Table 9. Parameters for statistical models.

Technique | Window (k) | Method
EWMA | 4 | weighting = ‘exponential’
ARIMA | 4 | model = ‘auto.arima’
Table 10. Hyperparameters for deep learning models.

Model | Hyperparameters
LSTM | [24, 48, 24, n *], lr = 0.001, drop_rate = [‘’, 0.1, 0.1]
GRU | [24, 48, 24, n *], lr = 0.001, drop_rate = [‘’, 0.1, 0.1]
BiGRU | [24, 48, 24, n *], lr = 0.001, drop_rate = [‘’, 0.1, 0.1]
* n is the gap size to be estimated, e.g., 1, 3, 6, 12, or 24 h.
Table 11. Comparison of the benchmark models and the proposed model in Pacocha station.

Model | Gap Size: 1 | 3 | 6 | 12 | 24
RMSE
EWMA | 4.1962 | 5.1643 | 6.6006 | 9.0685 | 9.3926
ARIMA | 4.1006 | 5.3194 | 6.4243 | 8.5234 | 7.0857
LSTM | 11.9599 | 11.8396 | 12.0099 | 11.1642 | 11.8206
GRU | 9.3581 | 9.7258 | 9.4652 | 10.6280 | 9.2072
BiGRU | 8.7196 | 9.0307 | 8.8677 | 9.6332 | 8.6770
Proposed Model | 4.0506 | 5.0736 | 5.8703 | 8.0593 | 6.9101
MAPE
EWMA | 22.6348 | 27.0066 | 39.9906 | 46.1174 | 73.7888
ARIMA | 21.1681 | 30.9679 | 48.8931 | 45.5445 | 60.0977
LSTM | 191.6827 | 180.9878 | 194.7890 | 154.3476 | 189.6667
GRU | 65.5836 | 65.2992 | 66.9026 | 65.1567 | 65.1322
BiGRU | 89.6948 | 92.8636 | 92.1008 | 86.5079 | 93.6101
Proposed Model | 20.4655 | 26.2066 | 41.2707 | 49.9204 | 57.1673
R2
EWMA | 0.7144 | 0.6033 | 0.3892 | 0.1485 | 0.1529
ARIMA | 0.7262 | 0.5822 | 0.3548 | 0.1570 | 0.1480
LSTM | 0.0007 | 0.0110 | 0.0030 | 0.0196 | 0.0002
GRU | 0.0012 | 0.0018 | 0.0009 | 0.0017 | 0.0011
BiGRU | 0.0007 | 0.0010 | 0.0014 | 0.0059 | 0.0006
Proposed Model | 0.7330 | 0.6148 | 0.4694 | 0.2039 | 0.2150
RRMSE
EWMA | 1.3549 | 1.6863 | 2.0806 | 3.0823 | 2.8221
ARIMA | 1.3251 | 1.7224 | 2.0367 | 2.8844 | 2.2704
LSTM | 2.7730 | 2.7463 | 2.7929 | 2.6052 | 2.7459
GRU | 3.2697 | 3.4114 | 3.3119 | 3.7124 | 3.2238
BiGRU | 2.6228 | 2.6294 | 2.6799 | 2.7997 | 2.8780
Proposed Model | 1.3083 | 1.6463 | 1.8505 | 2.5871 | 2.1901
Table 12. Comparison of the benchmark models and the proposed model in Pardo station.

Model | Gap Size: 1 | 3 | 6 | 12 | 24
RMSE
EWMA | 2.3433 | 2.2220 | 2.3704 | 2.3933 | 2.3250
ARIMA | 2.0489 | 1.9030 | 2.0588 | 2.1779 | 2.0332
LSTM | 5.5238 | 5.3756 | 5.6967 | 5.4118 | 5.5557
GRU | 5.4047 | 5.3075 | 5.5605 | 5.2980 | 5.4114
BiGRU | 5.5003 | 5.3793 | 5.6604 | 5.3818 | 5.4957
Proposed Model | 1.1497 | 1.5231 | 1.6329 | 2.0605 | 2.0651
MAPE
EWMA | 29.1740 | 28.1282 | 28.2679 | 32.4197 | 29.0337
ARIMA | 26.2820 | 25.1973 | 27.2105 | 28.0468 | 26.1228
LSTM | 112.7153 | 111.2804 | 124.5182 | 114.2086 | 115.6403
GRU | 109.4710 | 105.9471 | 120.5843 | 106.8928 | 110.0650
BiGRU | 114.7408 | 112.2894 | 126.1085 | 112.9798 | 114.9861
Proposed Model | 12.1715 | 15.9611 | 23.3307 | 24.4941 | 23.5613
R2
EWMA | 0.1809 | 0.2042 | 0.1909 | 0.2144 | 0.1847
ARIMA | 0.3002 | 0.3392 | 0.3402 | 0.3064 | 0.3048
LSTM | 0.0199 | 0.0138 | 0.0161 | 0.0095 | 0.0178
GRU | 0.0231 | 0.0198 | 0.0192 | 0.0136 | 0.0220
BiGRU | 0.0256 | 0.0206 | 0.0214 | 0.0146 | 0.0236
Proposed Model | 0.7742 | 0.5990 | 0.5912 | 0.3731 | 0.2983
RRMSE
EWMA | 1.0783 | 1.0228 | 1.0911 | 1.0915 | 1.0700
ARIMA | 0.9266 | 0.8602 | 0.9301 | 0.9822 | 0.9201
LSTM | 1.8626 | 1.7956 | 1.9191 | 1.8045 | 1.8609
GRU | 1.8401 | 1.8049 | 1.8930 | 1.8074 | 1.8433
BiGRU | 1.8452 | 1.8036 | 1.9010 | 1.8047 | 1.8463
Proposed Model | 0.5060 | 0.6334 | 0.7101 | 0.9216 | 0.9466
Table 13. p-values for Kolmogorov–Smirnov test in Pacocha station.

Model | Gap Size: 1 | 3 | 6 | 12 | 24
EWMA | 0.9032 | 0.9945 | 0.0941 | 0.0000 | 0.0000
ARIMA | 0.9890 | 0.0004 | 0.0000 | 0.0000 | 0.0000
LSTM | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
GRU | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
BiGRU | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
Table 14. p-values for Kolmogorov–Smirnov test in Pardo station.

Model | p-Value According to Gap Size: 1 | 3 | 6 | 12 | 24
EWMA | 0.0000 | 0.0001 | 0.0000 | 0.0000 | 0.0000
ARIMA | 0.0004 | 0.0017 | 0.0000 | 0.0000 | 0.0014
LSTM | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
GRU | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
BiGRU | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
