Nothing Special   »   [go: up one dir, main page]

Next Article in Journal
Evolutionary Game and Simulation Analysis of New-Energy Vehicle Promotion in China Based on Reward and Punishment Mechanisms
Previous Article in Journal
A Study on Linguistic Z-Graph and Its Application in Social Networks
Previous Article in Special Issue
Laplace-Logistic Unit Distribution with Application in Dynamic and Regression Analysis
You seem to have javascript disabled. Please note that many of the page functionalities won't work as expected without javascript enabled.
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

High-Resolution Spatiotemporal Forecasting with Missing Observations Including an Application to Daily Particulate Matter 2.5 Concentrations in Jakarta Province, Indonesia

by
I Gede Nyoman Mindra Jaya
1,* and
Henk Folmer
1,2
1
Department of Statistics, Universitas Padjadjaran, Jl. Raya Bandung Sumedang km 21 Jatinangor, Sumedang 45363, Indonesia
2
Faculty of Spatial Sciences, University of Groningen, P.O. Box 800, 9700 AV Groningen, The Netherlands
*
Author to whom correspondence should be addressed.
Mathematics 2024, 12(18), 2899; https://doi.org/10.3390/math12182899
Submission received: 15 July 2024 / Revised: 27 August 2024 / Accepted: 16 September 2024 / Published: 17 September 2024
(This article belongs to the Special Issue Advanced Statistical Application for Realistic Problems)
Figure 1
<p>Two-stage high-resolution prediction model with missing observations.</p> ">
Figure 2
<p>Map of Jakarta, with an inset highlighting its location within Indonesia and Southeast Asia (note: 0–5–10 km indicates the scale of the map).</p> ">
Figure 3
<p>Jakarta Province and the distribution of air-quality-monitoring stations (Station IDs can be found in <a href="#app1-mathematics-12-02899" class="html-app">Appendix A</a>).</p> ">
Figure 4
<p>Distribution of missing observations from 1 January to 31 December 2022.</p> ">
Figure 5
<p>Time variation of PM<sub>2.5</sub> concentrations at Gading Harmony (S3) and RespoKare Mask–Wisma 76 (S8).</p> ">
Figure 6
<p>Monthly average PM<sub>2.5</sub> concentrations (μg/m<sup>3</sup>) per site.</p> ">
Figure 7
<p>Covariates: (<b>A</b>) Population density (people/km<sup>2</sup>), (<b>B</b>) altitude (m), (<b>C</b>) precipitation (mm<sup>3</sup>).</p> ">
Figure 8
<p>Observed and MTST-predicted PM<sub>2.5</sub> concentrations for 1 November–31 December 2022, for the 13 monitoring stations.</p> ">
Figure 9
<p>MSTS-predicted (solid lines, 1–31 January 2023) and observed PM<sub>2.5</sub> concentrations (red dots, 1 January–31 December 2022) for the 13 monitoring stations.</p> ">
Figure 10
<p>Box plots of the monthly distribution of the PM<sub>2.5</sub> concentrations for 1 January–31 December 2022 per observation station in μg/m<sup>3</sup>.</p> ">
Figure 11
<p>The meshed study area.</p> ">
Figure 12
<p>Observed versus predicted high-resolution PM<sub>2.5</sub> concentrations for the period 1–31 January 2023 for models M1–M5 for three selected monitoring stations (S5, S6, and S10).</p> ">
Figure 13
<p>The global temporal pattern of the PM<sub>2.5</sub> predictions versus the global temporal pattern of the observations, January 2023.</p> ">
Figure 14
<p>Temporal pattern of PM<sub>2.5</sub> and tracer gas data (CO and NO<sub>2</sub>), January 2023.</p> ">
Figure 15
<p>Posterior means of the standard deviations of the GF innovations, 1–31 January 2023.</p> ">
Figure 16
<p>The predicted high-resolution daily PM<sub>2.5</sub> concentration per grid area in January 2023 (μg/m<sup>3</sup>).</p> ">
Figure 17
<p>PM<sub>2.5</sub> exceedance probabilities for level <math display="inline"><semantics> <mrow> <mi>c</mi> <mo>=</mo> <mn>35</mn> </mrow> </semantics></math> μg/m<sup>3</sup> per 1 km × 1 km grid area for 1–31 January 2023.</p> ">
Versions Notes

Abstract

:
Accurate forecasting of high-resolution particulate matter 2.5 (PM2.5) levels is essential for the development of public health policy. However, datasets used for this purpose often contain missing observations. This study presents a two-stage approach to handle this problem. The first stage is a multivariate spatial time series (MSTS) model, used to generate forecasts for the sampled spatial units and to impute missing observations. The MSTS model utilizes the similarities between the temporal patterns of the time series of the spatial units to impute the missing data across space. The second stage is the high-resolution prediction model, which generates predictions that cover the entire study domain. The second stage faces the big N problem giving rise to complex memory and computational problems. As a solution to the big N problem, we propose a Gaussian Markov random field (GMRF) for innovations with the Matérn covariance matrix obtained from the corresponding Gaussian field (GF) matrix by means of the stochastic partial differential equation (SPDE) method and the finite element method (FEM). For inference, we propose Bayesian statistics and integrated nested Laplace approximation (INLA) in the R-INLA package. The above approach is demonstrated using daily data collected from 13 PM2.5 monitoring stations in Jakarta Province, Indonesia, for 1 January–31 December 2022. The first stage of the model generates PM2.5 forecasts for the 13 monitoring stations for the period 1–31 January 2023, imputing missing data by means of the MSTS model. To capture temporal trends in the PM2.5 concentrations, the model applies a first-order autoregressive process and a seasonal process. The second stage involves creating a high-resolution map for the period 1–31 January 2023, for sampled and non-sampled spatiotemporal units. It uses the MSTS-generated PM2.5 predictions for the sampled spatiotemporal units and observations of the covariate’s altitude, population density, and rainfall for sampled and non-samples spatiotemporal units. For the spatially correlated random effects, we apply a first-order random walk process. The validation of out-of-sample forecasts indicates a strong model fit with low mean squared error (0.001), mean absolute error (0.037), and mean absolute percentage error (0.041), and a high R² value (0.855). The analysis reveals that altitude and precipitation negatively impact PM2.5 concentrations, while population density has a positive effect. Specifically, a one-meter increase in altitude is linked to a 7.8% decrease in PM2.5, while a one-person increase in population density leads to a 7.0% rise in PM2.5. Additionally, a one-millimeter increase in rainfall corresponds to a 3.9% decrease in PM2.5. The paper makes a valuable contribution to the field of forecasting high-resolution PM2.5 levels, which is essential for providing detailed, accurate information for public health policy. The approach presents a new and innovative method for addressing the problem of missing data and high-resolution forecasting.

1. Introduction

Spatiotemporal data are data collected across both space and time [1]. The use of spatiotemporal data is rapidly gaining popularity in various fields, including regional economics, environmental economics, regional science, environmental sciences, geography, sociology, demography, and medical science—especially epidemiology [2,3]. Spatiotemporal analysis includes mapping and prediction.
Spatiotemporal data exhibit spatial dependence and temporal correlation [4]. Given both features, spatiotemporal forecasting offers advantages over pure temporal or pure spatial forecasting, especially for out-of-sample forecasting [5].
High-resolution spatiotemporal forecasting aimed at predictions for an entire study area includes sampled and non-sampled spatial units. It uses geostatistical models integrating regression and interpolation techniques [6,7]. It is rapidly developing in many fields [4]. There are two main data challenges to high-resolution spatiotemporal forecasting: the occurrence of missing observations [8] and the big N problem [9].
Missing spatiotemporal observations occur due to limitations in data collection or equipment failure [10]. Missing observations tend to have serious consequences for spatiotemporal forecasting because of over- or under-prediction and loss of predictive power [11]. Hence, adequate handling of missing observations is crucial for accurate prediction.
Correction methods for missing spatiotemporal observations are based on the nature of the missing mechanism [10]. Three types of missing observations are distinguished: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) [12]. In the case of MCAR, the missing observations are independent of both observed and unobserved values. For MAR, the probability of a missing observation is related to the observed data [13], implying that the missing observation can be imputed from observed values [14] In the case of MNAR, the probability that an observation is missing is related to observed and unobserved values [13]. According to [15], missing data are rarely MCAR or MNAR.
Various imputation methods for missing data have been proposed. They can be classified into three categories [16]: prediction-based methods, statistical learning methods, and interpolation-based methods. The first category imputes missing observations using historical data and techniques such as multiple regression models [17], autoregressive integrated moving average (ARIMA) models [18], artificial neural network (ANN) algorithms [19], and heuristic algorithms. Statistical learning approaches impute missing observations by fitting probability distributions, such as probabilistic principal component analysis (PPCA) [20] or Markov chain Monte Carlo models [21]. Interpolation-based methods impute missing observations by considering the average or weighted average of neighboring non-missing observations [16]. A detailed discussion and comparison of these methods can be found in reference [22]. In this paper, we shall apply an interpolation-based method, specifically multivariate pure spatial time series (MSTS) analysis. The method imputes missing observations by exploiting similar temporal patterns in adjacent spatial units [9,20,23].
MSTS analysis is based on a collection of time series for different spatial units for similar time intervals [24]. It is based on the simultaneous temporal behavior of the different spatial units under consideration [25,26]. Hence, the time series for a given spatial unit not only considers its own observations but also interdependencies with the time series for other spatial units.
Because high-resolution prediction is aimed at generating predictions for the entire study area, including sampled and non-sampled spatial units, the covariance matrix analyzed is a dense, continuously indexed matrix. It tends to give rise to the “big N problem”, which is a computation and memory usage problem associated with analyzing a large amount of spatial data [27,28]. Computational challenges associated with the big N problem have been tackled through the introduction of statistical methods such as subsampling, spectral techniques, likelihood and covariance function approximations, process convolutions, and Gaussian Markov random fields (GMRFs) [29]. (Note that no single method suits all types of big N problem scenarios. Challenges may still arise even when the assumptions underlying a particular methodology are entirely fulfilled [28]).
In this paper, instances of the big N problem caused by high-resolution prediction will be handled by means of a GMRF reducing interdependencies within the spatial structure under consideration. A GMRF can be obtained from a Gaussian Field (GF) using stochastic partial differential equation (SPDE) techniques and the finite element method (FEM), applying Bayesian methods and the R software version 4.3.3 package integrated nested Laplace approximation (R-INLA) [9,30]. INLA is a computationally effective alternative to the computationally intensive Markov chain Monte Carlo (MCMC) approximation method [30].
The two-stage model is associated with two kinds of missing observations in the analysis: first, “genuine” missing observations due to limitations in data collection or equipment failure; second, forecasts specified as missing observations using INLA. The presence of the two kinds of missing data increases computational complexities [31], which are handled in this paper by the two-stage model as follows. In the first stage, the pure MSTS forecasting model treats the observed data and the missing observations as stochastic time-dependent variables. The model generates forecasts for future time periods and imputes missing data by leveraging neighboring observations based on similar temporal patterns. The second stage generates the high-resolution predictions using the MTST forecasts, the covariates at a lower spatial level, and the GF and GMRF capturing spatiotemporal dependence in the data.
The proposed methodology will be applied to provide detailed predictions of daily PM2.5 concentrations in Jakarta Province, Indonesia. Rapid industrialization and urbanization have resulted in the emergence of air pollution in Jakarta Province as a major societal issue (see among others [32]).
The paper is organized as follows. In Section 2, we specify the two-stage model, Bayesian inference, the prediction framework, and INLA. Then, in Section 3, we discuss the application of the methodology to generate high-resolution forecasts of PM2.5 concentrations in Jakarta Province. Section 4 presents the discussion and Section 5 the conclusions.

2. The Two-Stage High-Resolution Forecasting Model

In this study, we present a forecasting method designed to deliver high-resolution forecasts despite the challenge of missing observations. Our method consists of two stages. The first stage involves the multivariate spatial time series forecasting model, which also addresses the missing observations problem. The second stage focuses on generating high-resolution predictions.

2.1. The First Stage: Pure Multivariate Spatial Time Series (MSTS) Forecasting Model

Let y s i , t be a realization of the spatiotemporal process Y s i , t where s i   represents the latitude and longitude coordinates for location i ( i = 1,2 ,   n )   and t denotes time, t = 1 ,   2 ,   T , T + 1 , T + 2 ,   , T ~ , with T ~ = T + h ,   h denoting the length of the forecast period. The realizations y s i , t of the spatiotemporal process, for all s i and t, are conditionally independent Gaussian random variables with error distribution ε s i , t ~ N 0 , σ ε 2 .
Consider the collection of n spatial units with time series y s i , t where each time series consists of an intercept and temporally structured random effects. The n time series are grouped into an n -vector y t = y s 1 , t , , y s n , t , t = 1 ,   2 ,   T , T + 1 , T + 2 ,   , T ~ . We take the multivariate spatial time series (MSTS) model as a multivariate dynamic linear model (MDLM) offering a flexible modeling framework for a nonstationary time series [33]. For instance, it can capture temporal trends by incorporating a time-varying intercept, enabling the model to accommodate shifts in the mean over time [33].
The mean of the MDLM, η t = η s 1 , t , , η s n , t has intercept β 0 s = β 0 s 1 , , β 0 s n varying across spatial units throughout the entire time period. The model further incorporates a latent n -dimensional temporal trend ϑ t = ϑ s i , t , , ϑ s n , t and latent n -dimensional seasonal processes φ t = φ s i , t , , φ s n , t . Hence:
η t = β 0 s + ϑ t + φ t
We adopt the common interdependence assumption for a spatial MSTS that the multivariate outcomes arise from the same underlying temporal trend and seasonal process [34]. Assuming that the spatial units share a common temporal pattern model, (1) becomes:
η t = β 0 s + ϑ t 1 n + φ t 1 n
In (2), the unit vector 1 n and the scalars ϑ t and φ t indicate that the temporal pattern varies over time but is constant across space, given t .
We make the following assumptions. The means β 0 s 1 , , β 0 s n follow a Gaussian prior with zero mean and σ β 0 2 , that is β 0 s 1 , , β 0 s n ~ N 0 , σ β 0 2 ,   i = 1 ,   ,   n . The temporal trend ϑ t   a n d seasonal component φ t   vary over time but are constant across space. ϑ t is modelled as a random walk of order one (RW1):
ϑ t + 1 ϑ t | σ ϑ 2 ~ N 0 , σ ϑ 2 ,     s i   a n d   t = 1 , . . . , T ~ 1 ,
with variance hyperparameter σ ϑ 2 . The seasonal component φ t , with periodicity q (with q smaller than the number of time periods T ) , is defined as:
φ t + φ t + 1 + + φ t + q 1 | σ φ 2 ~ N 0 , σ φ 2 ,       s i   a n d   t = 1 , . . . , T ~ q + 1 ,
where σ φ 2 is the variance hyperparameter of φ t .

2.2. The Second Stage: High-Resolution Spatiotemporal Prediction Model

The objective of the second stage is to predict the quantity of interest y s g , t at location s g for the forecast period for t = T + 1 , . . ,   T ~ , utilizing the MTST forecasts values at location s i , i = 1 , n . Note that g = n + 1 , , n + m , with m being the total number of unsampled grid points and for t = T + 1 , . . ,   T ~ .
Consider the joint linear prediction model η s l , t for l = 1 , , n , n + 1 , , n + m and t = T + 1 , . . ,   T ~ :
η s l , t = β 0 + x s l , t β + ϱ s l , t   for   t = T + 1 , T + 2 ,   , T ~
with β 0 denoting the average (overall) intercept across space during the entire forecast period. x s l , t = x 1 s l , t , , x K s l , t is the vector of K covariates for site s l at prediction time t ~ with coefficient vector β = β 1 , , β K . ϱ s l , t is the first-order temporal autoregressive process of the spatially correlated innovations following a spatiotemporal Gaussian field (GF):
  ϱ s l , t = λ ϱ s l , t 1 + u s l , t
for every t and every s l . Equation (6) captures the change over time of innovations across space and is denoted as a structured spatiotemporal interaction in the sequel, where λ < 1 is the autoregressive parameter. The distribution of the structured spatiotemporal interaction ϱ s l , 1   i s   N 0 , σ u 2 / ( 1 λ 2 )   , with σ u 2 being the marginal variance of the innovation. The first component of Equation (6), λ ϱ s l , t 1 ,   captures the influence of past values. If model (5) fails to adequately capture temporal dependence, additional autoregressive terms (3) may be considered. However, the additional terms may lose significance when other variables, notably covariates, are present [35]. The following observations apply. First, in contrast to the first stage, the second stage does not employ spatially varying intercepts to capture spatial variations in the observed variables. Instead, it utilizes the fixed effects (covariates) and the structured spatiotemporal interaction to capture spatiotemporal information. Hence, the covariates and structured spatiotemporal interaction (in the applications, spatial and temporal random effects and their unstructured interaction are also considered) take over from the spatially varying intercepts among the sampled spatial units to account for spatial variation among the unsampled spatial units [9,36].
The spatially correlated innovations u s l , t are assumed to be temporally independent and follow a GF with zero mean and a spatial covariance function for time t [36]:
Σ u s l , t , u s l , t = 0                             f o r   t t σ u 2 R d     f o r   t = t ;   f o r   s l s l
where R d denotes the spatial correlation function for the locations s l and s l through the Euclidean distance d = s l s l R . A Matérn correlation function is commonly taken for R d [36]:
R d = 1 Γ v 2 v 1 κ d v K v κ d
where K v is the Bessel function of the second kind of order v > 0 . The parameter v controls the level of spatial smoothness. Estimating v can be challenging, and it is therefore often set to a predetermined value, notably v = 1 [30]. κ is the scaling parameter associated with the range r , which is the distance at which spatial correlation starts decreasing significantly. According to [27]:
r = 8 v κ
where r is the distance at which the spatial correlation is approximately 0.13 for each value of v . (Note that temporal independence simplifies the computation process by working with covariance matrices of size ( n + m ) × ( n + m ) , as opposed to the more cumbersome ( n + m ) T ~ × ( n + m ) T ~ matrices).
Substituting Equation (8) into Equation (7) f o r   t = t , we obtain the spatiotemporal Matérn covariance function Σ d for each time point t :
Σ d = σ u 2 Γ v 2 v 1 κ d v K v κ d  
Collecting all the observations at time t into the vector y t = y s 1 , t , , y s ( n + m ) , t , Equations (5) and (6) become:
η t = β 0 1 ( n + m ) + x t ~ β + ϱ t ϱ t = λ ϱ t 1 + u t ;   u t ~ N 0 , Σ  
where 1 ( n + m ) is the unit vector of dimension ( n + m ) , I ( n + m ) is the identity matrix of dimension ( n + m ) , x t = x s 1 , t , , x s ( n + m ) , t , and ϱ t ~ = ϱ s 1 , t , , ϱ s ( n + m ) , t with ϱ 1 being the stationary AR(1) process w i t h   d i s t r i b u t i o n   N 0 , Σ 1 λ 2 and Σ the ( n + m ) dimensional Matérn covariance matrix with the elements as defined in Equation (10).

The SPDE

Because high-resolution prediction aims to generate predictions for the entire study area, including sampled units and non-sampled spatial units, the covariance matrix to be analyzed is a dense, continuously indexed matrix (the big N problem). We propose that the big N problem be handled by transforming the covariance matrix Σ d in (10) of the spatially correlated innovations u s l , t into a sparse matrix Q (d), discretely indexed by precision, by means of the SPDE [37], with Q ( d ) defined by a local neighborhood structure. For each time t ~ = T + 1 , , T ~ , the same covariance function Σ is used for all the observations and u s l , t ~ is specified in terms of an SPDE as follows:
κ 2 Δ α 2 u s l , t = W s l , t ,   s R 2 ,   l = 1 , , n , n + 1 , , n + m   a n d   t = T + 1 , . . ,   T ~
where Δ = 2 s 1 2 + 2 s 2 2 is the Laplacian operator, α = v + 1 controls the smoothness of Σ , and W s l , t is Gaussian spatiotemporal white noise with unit variance. As mentioned above, v is usually fixed at 1 [38] so that (12) reduces to:
κ 2 Δ u s l , t = W s l , t ,   s R 2   f o r   l = 1 , , n , n + 1 , , n + m
The exact solution to the SPDE (13) is the stationary and isotropic Matérn covariance m a t r i x   Σ . It can be approximated by employing the finite element method (FEM) defining u s l , t ~ as a linear combination of basic functions on a triangulated representation of the domain R 2 . Triangulation involves subdividing R 2 into a collection of non-overlapping triangles, meeting at most at one shared edge or corner. The initial triangles have vertices positioned at locations s 1 , , s n , s n + 1 , , s n + m . Additional vertices are added as needed for the desired precision level of the predictions at hand. With the triangulated framework in place, the representation of u s l , t ~ can be expressed in terms of basis functions as follows [36]:
u s l , t = z = 1 Z γ z s l , t ς z   f o r   l = 1 , , n , n + 1 , , n + m   a n d   t = T + 1 , . . ,   T ~  
where Z represents the number of vertices within the triangulation; γ z s l , t denotes a piecewise linear basis function, which is assigned a value 1 at vertex z and a value 0 at all other vertices; and ς z is a GF with mean 0 and sparse precision matrix Q . The SPDE and FEM transform the GFs with a dense and continuously indexed covariance matrix, into GMRFs with a sparse and discretely indexed precision matrix.

2.3. Bayesian Inference with INLA

INLA is an approximate Bayesian inference procedure for latent Gaussian models (LGMs) and is available in the R-INLA package. An LGM is a statistical framework establishing connections between a general response variable that can be continuous, discrete, or a combination of both, and an additive linear predictor. LGMs include GMRFs, as elucidated in [39]. LGMs for INLA inference typically exhibit a hierarchical structure consisting of three components: the likelihood model, the vector of model parameters with GFs as prior distributions, and the vector of hyperparameters of the prior distributions, called hyperpriors. Formally, these are written as:
Likelihood   model :   y | Φ , Ψ ~ p y | Φ , Ψ = t = 1 T ~ l = 1 n + m p y l t | Φ l t , Ψ
Vector   of   model   parameters   with   GFs   as   priors :   Φ | Ψ ~ p Φ | Ψ = N 0 , Q 1 Ψ
Vector   of   hyperparameters   with   hyperpriors :   Ψ ~ p Ψ
where y = y 11 , , y ( n + m ) T ~ denotes the vector of response variable, and Φ = Φ 11 , , Φ ( n + m ) T ~ is the latent vector of parameters in the linear prediction model, which may include the fixed and random effects. The vector of parameters follows a multivariate GF with a conditional independence structure, which is transformed into a sparse precision matrix Q 1 Ψ .   Ψ = Ψ 1 , , Ψ G is the vector of hyperparameters with g = 1 , , G .
The latent vector for the first stage is Φ l t = β 0 , ϑ l t , φ l t for l = 1 , . . , n and t = 1 , , T ~ , while for the second stage, we have Φ l t = β 0 , β 1 , , β K , ϱ l t for l = 1 , . . , n , n + 1 , , n + m and t = T + 1 , , T ~ . The approximate posterior marginal distributions p ~ . of the GFs for estimating statistics, such as the mean and standard deviation, can be estimated utilizing the Bayesian hierarchical model as follows:
p Φ l t | y = p Φ l t , Ψ | y d Ψ = p Φ l t | Ψ , y p Ψ | y d Ψ ;
The integral can be approximated by INLA as a finite weighted sum of B grid points Ψ ( 1 ) , , Ψ ( B )   that are systematically arranged to cover the entire parameter Ψ . Each grid point Ψ ( b ) has an associated weight Δ b . For relevant grid points { Ψ ( b ) }, the integral (18) is approximated by (19). Hence:
p ~ Φ l t | y b B p ~ Φ l t | Ψ b , y p ~ Ψ b | y Δ b .
The first part of Equation (19), p ~ Φ l t | Ψ ( b ) , y ,   is a Laplace approximation, while the second part, p ~ Ψ ( b ) | y , relies on the specific implementation chosen in the INLA program. This can be a full Laplace approximation, a Gaussian approximation, or a simplified Laplace approximation, as detailed in [30]. Note that the simplified Laplace approximation is the default choice and employed in the application presented in this paper.
To finalize the Bayesian estimation procedure, we define the hyperpriors for the standard deviations of the varying intercepts of the first stage σ β 0 s 1 , , σ β 0 s n ,   the overall intercept of the second stage σ β 0 , the slopes of the covariates σ β 1 , , σ β K , the temporal trend σ ϑ , the seasonal v a r i a t i o n   σ κ , the innovations σ u , and the disturbance term ( σ ε 2 ). We also define the hyperpriors for the autocorrelation parameter ( λ )   and the range ( r ) . For the standard deviations of the intercepts and slopes, large values are typically selected, e.g., 10 5 , allowing a wide range of possible values, rendering the model flexible. For the other hyperparameters, i.e., the standard deviations of the temporal trend, seasonal variation, the autocorrelation parameter , and the range, we apply penalized complexity (PC) hyperpriors, which are typically used to avoid overfitting problems [40]. PC hyperpriors are defined as P r o b σ > μ σ = α σ , where μ σ > 0 is a quantile of the PC distribution and 0 < α σ < 1   represents the probability value.

2.4. MSTS Forecasting and Imputing Missing Observations

We will first discuss the input structure for imputing missing observations. In R-INLA, forecast values are defined as missing values (‘NA’) and automatically generated during the model fitting process, leveraging the predictive posterior distribution (19). To forecast with R-INLA, the following input data structure is applied. Missing values at location i and time t   are denoted as y s i , t = N A , and forecast values at location i in period T ~ as y s i , T ~ = N A . To specify random effects in the prediction model, identifiers, ‘id’, are defined to distinguish between spatial and temporal units.
The input structure is specified in Equation (20). It contains dummy variables ( D i t ) for i = 1 , , n and   t = 1,2 , . . , T , T + 1 , , T ~ representing intercepts that vary by location. Specifically, we set D i t = 1 for β 0 s i ,   t and D i t = 0 elsewhere, which gives a matrix of dummy variables, D , with dimensions T ~ × n .   For the temporal trend and the seasonal variation in Equations (3) and (4), respectively, the vectors t = 1,2 , . . , T , T + 1 , , T ~   are introduced in Equation (20). Hence, the INLA input data structure is as follows:
y s 1 , 1 y s 1 , 2 y s 1 , t = N A y s 1 , T y s 2 , 1 y s 2 , T y s n , 1 y s n , T y s 1 , T + 1 = N A y s 1 , T + 2 = N A y s n , T ~ = N A ( y o b s , y f o r e c a s t ) =     1             0     0         1             0     0                                 1             0     0                                 1             0     0         0             1     0                                 0             1     0                                 0             0     1                                 0             0     1         1             0     0         1             0     0                                 0             0     1     i d : ( β 0 s 1         β 0 s n + 1 2 t T 1 T 1 T T + 1 T + 2 T ~ i d :   ϑ + 1 2 t T 1 T 1 T T + 1 T + 2 T ~ i d : φ  

2.5. High-Resolution Prediction: The INLA–SPDE Approach

The MSTS predictions for the sampled data points are used as input for the high-resolution model in Equation (11). The first step of the high-resolution prediction process is the transformation of the GF in Equation (11) into a GMRF employing the SPDE in conjunction with the FEM. The second step consists of generating predictions for the sampled and unsampled data points for the periods t = T + 1 ,   T + 2 ,   ,   T ~ . The INLA input data structure is presented in Equation (21) as follows:
y s 1 , T + 1 y s 1 , T + 2 y s 1 , T ~ y s 2 , T + 1 y s 2 , T ~ y s n , T ~ y s n + 1 , T + 1 = N A y s n + 1 , T + 2 = N A y s n + 1 , T ~ = N A y s n + 2 , T + 1 = N A y s n + 2 , T + 2 = N A y s n + m , T ~ = N A y o b s , y p r e d i c = 1 1 1 1 1 1 1 1 1 1 1 1 i d : β 0 + x s 1 , T + 1,1 x s 1 , T + 1 , K x s 1 , T + 2,1 x s 1 , T + 2 , K x s 1 , T + 2,1 x s 1 , T + 2 , K x s 2 , T + 1,1 x s 2 , T + 1 , K x s 2 , T ~ , 1 x s 2 , T ~ , K x s n , T ~ , 1 x s n , T ~ , K x s n + 1 , T + 1,1 x s n + 1 , T + 1 , K x s n + 1 , T + 2,1 x s n + 1 , T + 2 , K x s n + 1 , T ~ , 1 x s n + 1 , T ~ , K x s n + 2 , T + 1,1 x s n + 1 , T + 1 , K x s n + 2 , T + 2,1 x s n + 2 , T + 2 , K x s n + m , T ~ , 1 x s n + m , T ~ , K x 1                     x K + 1 1 1 2 2 n n + 1 n + 1 n + 1 n + 2 n + 2 n + m i d : ϱ

2.6. Exceedance Probability for High-Resolution Predictions

To detect spatiotemporal hotspots, we apply the notion of posterior exceedance probability to indicate whether the posterior o f   f o r e c a s t   y ^ l t ~ for area s l at time t surpasses a predefined threshold value c. We denote this probability as P r ^ y ^ l t > c | y and calculate it as:
P r ^ y ^ l t > c | y = 1 c p y ^ l t | y d y ^ l t
where c p y ^ l t | y d y ^ l t represents the cumulative probability of y ^ l t with threshold value c . An area is classified as a hotspot, indicating a high-risk area, when the estimated exceedance probability P r ^ y ^ l t > c | y > γ . Hence, to identify hotspots using the exceedance probability, two critical parameters need to be established: first, the threshold value c for y ^ l t ~ ; next, the cut-off value γ for the exceedance probability. For the threshold c , the value is determined based on the specific application. For instance, in the case of PM2.5 concentration, a common threshold value c is the WHO value of 35 μg/m3 [41]. The standard options for γ are 0.90, 0.95, and 0.99.

2.7. Model Accuracy

The log marginal likelihood (LML), along with metrics such as the deviance information criterion (DIC) and the Watanabe–Akaike information criterion (WAIC), are summary statistics for assessing model accuracy. The marginal likelihood p y M p measures the likelihood of the data given a model M p for   p = 1 , , P . It is obtained by integrating the likelihood function p y Φ j , M p with respect to the prior density p Φ j M p , which is defined as follows [30,42]:
p y M p = p y Φ j , M p p Φ j M p d Φ j
Note that the likelihood function is not a probability density function, so it can take on values larger than one [42,43,44]. Consequently, the marginal likelihood can be larger than one and the log-marginal likelihood is not necessarily negative; it can be either negative or positive. The model exhibiting the highest log marginal likelihood is preferred due to its superior fit to the data [30,35,45,46].
Another collection of commonly utilized evaluation approaches in INLA is based on the trade-off between model fit and model complexity, best known is the deviance information criterion (DIC), defined as [30,47,48,49]:
D I C = D Φ ^ + 2 p D I C
where D Φ ^ = 2 log p y Φ ^   d e n o t e s   t h e   m o d e l   d e v i a n c e   and p D I C the effective number of parameters indicating model complexity, defined as p D I C = D Φ ¯ D Φ ^ ,   w i t h   D Φ ¯ = E p o s t e r i o r log p y | Φ denoting the mean deviance. Note that [48] contains various alternative forms of the DIC for datasets with missing observations. The alternatives are not relevant here because the second-stage models in the application are based on a complete dataset. The missing values were imputed in the first stage using a multivariate time series approach.
An alternative to the DIC is the Watanabe–Akaike information criterion (WAIC), defined as [50]:
W A I C = 2 D W A I C p W A I C ,
where D W A I C is the deviance measuring the model fit, i.e., D W A I C = l = 1 n + m t = T + 1 T ~ log E Φ | y p y l t | Φ   , and p W A I C denotes the effective number of parameters, i.e., p W A I C = l = 1 n + m t = T + 1 T ~ V a r p o s t e r i o r log p y l t | Φ . Similar to the DIC, a smaller WAIC value indicates a better fit. The LML, DIC, and WAIC are implemented in the R-INLA package [30,35,51].
Additional indicators of predictive performance include the mean absolute error (MAE), the root mean square error (RMSE), and the adjusted pseudo-coefficient of determination ( R 2 ). Their definitions are provided by [52] and read as follows:
M A E = 1 n + m h l = 1 n + m t = T + 1 T ~ y ^ l t y l t
R M S E = 1 n + m h l = 1 n + m t = T + 1 T ~ y ^ l t y l t 2
R 2 = 1 l = 1 n + m t = T + 1 T ~ y ^ l t y l t 2 l = 1 n + m t = T + 1 T ~ y ^ l t y ¯ 2
where y ^ l t is the predicted outcome at location l at time t , and y ¯ is the mean of the observed values. All else being equal, lower values of MAE and RMSE and higher values of R 2 indicate a better fit.

2.8. Comprehensive Overview of the Two-Stage Model

Figure 1 presents a comprehensive overview of the two-stage high-resolution prediction model with missing observations.

3. Application: Daily PM2.5 Concentrations in Jakarta Province, Indonesia

3.1. Study Area and Descriptive Statistics

Jakarta Province (Figure 2) has a total area of 662.3 km2 and a population of 10,582,229 people. Hence, it has a population density of 15,978 people per km2. Jakarta’s population comprises 3.89% of the nation’s total population and contributes 16.3% to the Indonesian gross domestic product (GDP) [53]. The location serves as a hub for transportation and information exchange, catering to the needs of the local population and facilitating the efficient movement of goods throughout Indonesia.
Due to its extensive industrialization and urbanization, coupled with biomass burning emissions—especially during the dry season—air pollution has become a significant policy concern in Jakarta Province, as it is in many other urbanized areas in Southeast Asia. The pollution issue comprises two major aspects.
One aspect is the adverse impact on the environment in the form of acid rain and global warming [54,55]. The other is the impact on human health, notably cardiovascular disorders, central nerve system dysfunctions, cutaneous diseases, and chronic obstructive pulmonary diseases (COPDs), including asthma, bronchitis, and lung cancer [5,56,57] (see [32] for the impact on housing).
One of most serious health-damaging atmospheric pollutants is PM2.5 [54]. To reduce the health-damaging risk of PM2.5, a spatiotemporal high-resolution forecasting model is needed as a component of an early warning system (EWS) [9,23], which issues advanced warnings of hazardous pollutants. The aim of an EWS is to enable individuals, communities, and organizations to take preventive actions to mitigate the impacts of pollution [54].
The main aim of this section is to provide high-resolution spatiotemporal predictions of daily PM2.5 concentrations in Jakarta based on the methodology described in the previous sections. Spatiotemporal data on PM2.5 emissions were acquired from the Jakarta Air Quality Index (AQI) platform, encompassing data from 13 monitoring stations spanning the period from 1 January to 31 December 2022 (Figure 2).
Figure 3 shows that the observation stations are unevenly distributed across Jakarta Province. The distance between the stations varies between 663.6 m and 15,743.8 m, with a median distance of 7276.5 m.
The raw data were recorded per hour. We calculated the daily concentration by averaging the hourly concentrations. Figure 4 depicts the distribution of missing observations across the 13 monitoring stations throughout the period 1 January–3 December 2022. In the figure, black represents missing observations and gray available observations. The vertical axis represents time, from 1 to 365 days and the horizontal axis represents monitoring stations, ranging from S1 to S13. Overall, 27.7% of the observations was missing (Figure 4). For the stations Jimbaran 2 (S5), Simprug THL Area (S9), and Wisma Matahari Power (S13) more than 50% of the observations was missing.
Table 1 presents the minimum, mean, maximum, and standard deviation of the PM2.5 concentration for the 13 monitoring stations over the course of 12 months. The minimum was 1.00 μg/m3 (in January), the maximum 126.9 μg/m3 (in August), and the overall mean 37.46 μg/m3. The highest average was 51.25 μg/m3 in June. The standard deviation also peaked in June (23.11 μg/m3), followed by March (20.98 μg/m3). As noted, biomass burning creates a seasonal pattern in air pollution, with elevated pollution levels typically occurring during the dry season, particularly from June to August.
We used two stations with complete raw data, Gading Harmony (S3) and RespoKare Mask–Wisma 76 (S8), to describe the time variation of the PM2.5 concentration in detail (Figure 5). It ranged between 5.491 and 92.411 μg/m3 for Gading Harmony (S3) and between 6.527 and 95.750 μg/m3 for RespoKare Mask–Wisma 76 (S8). From January to June, the concentrations fluctuated significantly, while after June, they fluctuated relatively smoothly and descended over time.
The spatial pattern of PM2.5 concentration was measured using the average PM2.5 concentrations at the 13 stations between 1 January and 31 December (Figure 6). The northwest and northeast regions exhibited high concentrations. In the central regions, the concentrations were relatively low. Station S1 in the northwest showed the highest concentration (70.794 μg/m3), while S10 in the central area recorded the lowest concentration (34.150 μg/m3).

3.2. Predictors

Several geographic and environmental factors have been identified as predictors of PM2.5 concentration [58]. In this study, we employed three key predictors, namely population density, altitude, and rainfall (see Table 2).
Population Density: Data on population density data were obtained from the Jakarta Province website [59]. We found a positive correlation between population density and PM2.5 pollution levels.
Altitude: Altitude data were extracted from the Worldclim database at a resolution of 30 arc seconds, approximately equivalent to 1 km × 1 km grid cells. According to [60], areas at higher altitudes tend to exhibit lower PM2.5 concentrations.
Precipitation: Concentrations of airborne pollutants tend to exhibit a strong relationship with meteorological conditions [61]. As per [62], higher precipitation levels tend to cleanse the air from pollutants, resulting in lower PM2.5 concentrations. The weather factor considered in our analysis is rainfall. The data were obtained from meteorological, climatological, and geophysical agencies. Hourly rainfall data were aggregated to daily levels. Additionally, wind speed and direction are key predictors of PM2.5 concentration [63]. However, the literature highlights a strong correlation between these factors and both precipitation and altitude [64,65]. Including all these variables in our model could lead to significant multicollinearity, making it challenging to identify the most relevant predictor of PM2.5 concentration.
To align the resolution of the predictors with the resolution used for the PM2.5 predictions, we interpolated the data at a resolution of 0.5 km × 0.5 km, as illustrated in Figure 7.

3.3. MSTS PM2.5 Forecasts for the 13 Monitoring Stations

To correct for missing observations and thus improve the precision of our forecasts, we employed the MSTS approach. As explained in Section 2, we assume a consistent temporal trend and seasonal pattern across all stations. We applied a random walk model of order one for the temporal trend and a periodicity of h = 30 days for the seasonal pattern. To evaluate the forecasting performance, we divided the data into a training and a testing dataset. Data from the period 1 January–31 October 2022 were used for training and the November–December 2022 data for testing. The model was found to perform well, with a mean absolute percentage error (MAPE) of 49%, a mean absolute error (MAE) of 10%, and a correlation coefficient of 0.72, demonstrating a strong similarity between the observed and predicted values. Figure 8 shows the relationship between predicted and observed values for the testing dataset across all stations. The forecasts for the monitoring stations are displayed in Figure 9.

3.4. High-Resolution Prediction

In the second stage, we conducted high-resolution forecasting of PM2.5 levels for the period 1–31 January 2023. As a first step, we checked the distribution of the daily PM2.5 concentration for deviation from normality by means of a box plot for each monitoring station for the period 1 January–31 December 2022. Modeling a skewed data distribution of the response variable under the assumption of a normal distribution tends to result in heteroscedasticity. Log transformation is a common solution to heteroscedasticity [66]. In addition, log transformation prevents the occurrence of negative predictions.
Figure 10 shows the box plots of the monthly distribution of the PM2.5 concentrations per observation station for 2022 in μg/m3. The black central horizontal line in a box represents the median value, while the upper and lower boundaries correspond to the 25th and 75th percentiles, respectively. Figure 10 also encompasses the fifth and ninety-fifth percentiles of the distribution (the end points of the lower and upper black lines, respectively). Data points falling beyond the fifth and ninety-fifth percentiles (black dots) are outliers. In most of the box plots in Figure 10, the median value, denoted by the bold line in a plot, does not align with the central position but rather tends to be situated below it, suggesting a positively skewed distribution. Consequently, we log-transformed the PM2.5 concentrations for further analysis.
The next step was the triangulation of the study area to transform the continuous GF into a discrete GMRF using the SPDE and FEM approach, as illustrated in Figure 11. To ensure accurate prediction, the triangular mesh must be sufficiently dense to capture the spatial variability of the daily PM2.5 emissions [36]. We constructed a dense mesh over the areas covered by the monitoring stations, while keeping it sparse in the outer areas without observations. The sparse outer areas mitigate boundary effects while reducing computation costs [67]. We constructed a triangulation mesh comprising 950 triangles and 496 vertices, effectively covering an area of 661 km2 of the Jakarta region. This arrangement brought about an average triangle area of 0.6 km2.
Following [9,35], we postulated five different high-resolution prediction models on the log-transformed PM2.5. These models incorporate fixed effects, structured temporal random effects ϑ t to account for temporal autocorrelation, unstructured spatiotemporal interaction δ t to capture local heterogeneity arising from unobserved spatial and temporal covariates [54], structured spatiotemporal interaction ϱ t to address spatiotemporal autocorrelation, and an error term ε t . We identified the most appropriate model using the cross-validation measures in Section 2.4. The selection process involved training and testing. The five models are:
M1 : Model   with   covariates   (fixed effects)   only ln y t = β 0 1 n + x t β + ε t ;   ε t ~ N 0 , σ ε 2 I n
M2 : M1   plus   structured   temporal   variation   ( autocorrelation ) ln y t = β 0 1 n + x t β + ϑ t + ε t ;   ε t ~ N 0 , σ ε 2 I n
M3 : M2   plus   unstructured   spatiotemporal   interaction   effect   ( type   I ) ln y t = β 0 1 n + x t β + ϑ t + δ t + ε t ;   ε t ~ N 0 , σ ε 2 I n
M4 : M2   plus   structured   spatiotemporal   interaction   ( spatiotemporal   autocorrelation ) ln y t = β 0 1 n + x t β + ϑ t + ϱ t + ε t ;   ε t ~ N 0 , σ ε 2 I n
M5 : M4   plus   unstructured   spatiotemporal   interaction   effect   ( type   I ) ln y t = β 0 1 n + x t β + ϑ t + δ t + ϱ t + ε t ;   ε t ~ N 0 , σ ε 2 I n
We calculated the PM2.5 concentrations for square grids of size 1 × 1 km. For validation purposes, we randomly selected three monitoring stations (S5, S6, and S10) out of the 13 available stations for the period 1 January to 31 January 2023. We ran the model on an Apple M1 Pro with 16 GB of memory using R version 4.3.3 and the INLA package (INLA_24.02.09). The computational time for running the most complex model, M5, was approximately 1.472 min. We assessed the goodness of fit of the models by comparing the observed versus predicted PM2.5 concentrations using the DIC, WAIC, and LML. Additionally, we evaluated their predictive performance using the MSE, MAE, MAPE, and R2. Table 3 provides a summary of the results. Generally, the model exhibiting the smallest DIC, WAIC, MAE, and RMSE values, along with the largest LML and R2 values, is taken as the most suitable and accurate for forecasting.
Table 3 and Figure 12 show that M4 and M5 stand out as the most adequate models, displaying the lowest DIC, WAIC, MSE, MAE, and MAPE values and the highest LML and R2 values.
Note that the DIC and WAIC are negative in M4 and M5, whereas they are positive in Models 1–3. This is due to the presence in M4 and M5 of both structured temporal autocorrelation and, notably, structured spatiotemporal autocorrelation. M1 contains no random effects, M2 only structured temporal variation, and M3 structured temporal variation and unstructured spatiotemporal interaction. Hence, the inclusion of both structured temporal variation and structured spatiotemporal variation (M4) and both kinds of structured random effects and their unstructured interaction (M5) explains a substantial amount of the variation in PM2.5 concentrations leading to the residuals having a small standard deviation. As shown by [68,69], this may lead to negative DIC and WAIC values.
In selecting the ultimate optimal model, we prioritized simplicity by excluding unstructured spatiotemporal interaction, as it did not have a substantial impact on the predictive accuracy. Hence, we continued the analysis with M4. Note the strong correlation between the observed and predicted values of around 86% in Figure 12D,E.
Using the variance inflation factor (VIF), we examined multicollinearity between the covariates. All VIF values were below 10 (VIF 1 = 1.097, VIF 2 = 1.097, VIF 3 = 1.000), indicating that there was no serious collinearity between the covariates.
The summary statistics for the coefficients of the fixed effects in model M4 are presented in Table 4, along with the p-values and the exponentiated coefficients. The latter show that the fixed effects are not significant at conventional levels, which is related to the fact that the random effects account for a substantial portion of the variation in the PM2.5 concentrations [70]. The regression coefficients are consistent with the theoretical expectations in Section 3.2.
Particularly, altitude and precipitation have a negative effect on PM2.5 concentrations, while population density has a positive effect. The estimates further show that altitude and population density are the primary fixed effect predictors. An increase of one meter in altitude is associated with an approximate 7.8% decrease in PM2.5 concentration 0.078 × 100 % . An increase of one person per km2 in population density results in an approximate 7.0% increase in PM2.5 concentration, whereas an increase of one millimeter in rainfall results in an approximate 3.9% decrease in PM2.5 concentration.
Table 5 shows the summary statistics for the random effects. The average value for the range (r) in Equation (8) is 4,301.302 m, indicating a spatial dependence range of approximately 4.301 km. The mean autoregressive coefficient ( λ ) in Equation (11) is 0.974, indicating a high degree of consistency in air pollution levels between two consecutive days. The primary random component accounting for the spatiotemporal variation in PM2.5 concentrations is the standard deviation of the innovations ( σ ϱ ), with the fraction of its variance explaining approximately 65% of the variation. The second most influential random effect is the standard error of the temporal trend v , accounting for approximately 34% of the variance. The variation of the error term, on the other hand, contributes only 0.02% of the overall variation.
As Model 4 without the interaction component was selected rather than Model 5 with the interaction term, there is no specific temporal effect per region, implying that all locations follow the same temporal pattern. Hence, the temporal effect can be characterized as a global trend applicable to all locations.
After controlling for the covariates, Figure 13 presents the global temporal pattern of the PM2.5 predictions across the 13 monitoring stations (orange curve). Figure 13 also shows that the global temporal pattern of the predictions closely resembles the global temporal pattern of the observations (blue curve), with the exception of the first week.
The strong fluctuations suggest vehicular activity as the prime cause of PM2.5 concentrations in Jakarta. They are in line with [71], who showed that approximately 43–46% of the primary particle emissions in Jakarta are attributed to urban traffic. This suggestion is supported by the similarity between the temporal pattern of PM2.5 and the tracer gas data (CO and NO2) (Figure 14).
Each month begins with a significant influx of workers arriving in Jakarta to begin their duties, resulting in a significant increase in the number of vehicles on the roads. At the end of the month, there is an outflow of workers returning to their home regions. Moreover, there are additional upticks in PM2.5 concentrations during the month because of increased traffic related to shopping and recreational activities. During weekdays, traffic flows and PM2.5 levels are larger than during weekends, except the weekends at the beginning and end of the month because of commuting.
Fixed effects and global temporal trends fall short of fully elucidating the entire heterogeneity observed in point patterns. Specifically, certain aspects of the spatiotemporal structure can be further explained by the innovations, as they significantly contribute to the spatiotemporal variation in PM2.5 concentrations.
As explained in Section 2, GF-distributed innovations are characterized by two parameters—the standard deviation and the range. The average standard deviation of the GF-distributed innovations is 0.3885, surpassing the other standard deviations in Table 5. This suggests that the variability in the posterior average of GF-distributed innovations exhibits significant spatiotemporal fluctuations. Whereas the global trend accounts for temporal variations, the innovations contribute to variation across both space and time.
The average effective observation range (r) of a monitoring station is 4301.302 m. Consequently, grid areas situated more than 4.3 km from the nearest observation station are not monitored, and the posterior mean of the standard deviation of their GF-distributed innovations are arbitrarily fixed at zero (the white grid areas in Figure 15). Although the spatial concentration of PM2.5 was approximated by the lower-level predictions of the fixed effects, this approach is suboptimal because the posterior mean of the standard deviation of the GF-distributed innovations also provides important information for high-resolution prediction [27,36].
Having estimated all the parameters of the best model (M4) for predicting PM2.5 in unsampled locations, the subsequent step involves forecasting PM2.5 concentrations across all grid areas in Jakarta Province and identifying hotspots for 1–31 January 2023. The prediction of PM2.5 concentrations is achieved by integrating the various model components, including fixed effects, random effects, temporal trends, and the GF-distributed innovations. The identification of the hotspots is achieved by calculating the exceedance probability of the posterior mean of the PM2.5 concentrations.
Figure 16 shows the predicted daily PM2.5 concentrations for Jakarta Province for 1–31 January 2023, while Figure 17 depicts the areas with spatiotemporal exceedance probabilities of PM2.5. concentrations at level c = 35 μg/m3, at 1 km2 resolution. Figure 16 and Figure 17 show that the PM2.5 concentrations exhibit significant regional disparities. Predicted PM2.5 values range from a minimum of 4.548 µg/m³ to a maximum of 57.986 µg/m³, averaging at 20.261 µg/m³. The concentrations peak at the beginning and end of January.
Several factors contribute to the high predicted PM2.5 concentration. Industrial activities are a major source of PM2.5 emissions, notably via smoke, dust, and exhaust gases. Motor vehicle traffic also plays a substantial role, emitting pollutants such as fumes and road dust. High exceedance probabilities are predicted for the northern and eastern sectors of Jakarta. For the northern region, which is the main industrial hub, the PM2.5 concentration was predicted to average at 35 µg/m³. The region hosts a bustling port with intensive maritime traffic and high PM2.5 emissions from ships and loading/unloading activities. Another factor contributing to the high PM2.5 concentrations in the north is its low altitude. Eastern Jakarta is a densely populated region and a major host of industrial activities. Its PM2.5 concentrations are predicted to be high due to industrial emissions and dense vehicular traffic.

4. Discussion

High-resolution spatiotemporal forecasting is of paramount importance in many scientific and policy areas, including environmental science and policy [72]. The rapid expansion of industrialization and urbanization has led to increased atmospheric pollution in urban areas worldwide, resulting in significant air quality issues. A major atmospheric pollutant is PM2.5, originating from combustion or emitted in the atmosphere via chemical processes [73]. Exposure to PM2.5 emissions has adverse effects on public health and, although not discussed this paper, on the economy and ecology. The substantial literature on this topic has established a strong correlation between PM2.5 levels and morbidity and mortality, including cardiovascular and respiratory-related fatalities [74]. Therefore, accurate air pollutant concentration assessments are vital to effectively address public health risks.
An accurate early warning system (EWS) enables the timely dissemination of hazard-related information, aiding individuals, communities, and organizations in preparedness and mitigation efforts [75]. Accurate forecasts of PM2.5 concentrations in both time and space are crucial for an EWS.
Spatiotemporal high-resolution models involve interpolation, transforming discrete point observations into a continuum. This methodology relies on geostatistical regression models incorporating spatiotemporally correlated data and fine-scale area covariates (predictors) to forecast data for sampled and unsampled locations. However, this methodology often incurs substantial computational cost, mainly due to the handling of large numbers of spatiotemporal observations, a challenge commonly known as the “big N problem”. A variety of different methods has been used to solve this problem. Models focused on creating Gaussian Markov random fields (GMRFs) are among the most popular ones.
The stochastic partial differential equation (SPDE) method combined with Bayesian inference implemented in the integrated nested Laplace approximation (INLA) framework is an efficient approach to transform general spatiotemporal autocorrelation structures to GMRFs [27,36]. This framework provides an effective solution to managing extensive datasets with computational efficiency. An SPDE combined with the finite element method (FEM) discretizes a spatial domain into a sparse grid, thus allowing for fine spatiotemporal resolutions, even for large datasets.
A frequently occurring problem in environmental studies (and other spatial sciences as well) is missing data leading to inaccurate predictions. In this paper, we proposed that missing data can be handled by utilizing the similarities between the temporal patterns of the time series of the spatial units. By means of a multivariate spatial time series (MSTS) model, predictions for the sampled spatial units are generated, which are then used to impute the missing observations. However, imputing missing observations by means of the MSTS model and the application of GMRFs can become computationally complex and expensive to execute due to the fact that missing data are treated as unknown parameters that must be estimated [31].
To solve the computational complexity of high-resolution prediction and MSTS forecasting, we proposed a two-stage approach in this paper. The first stage is the MSTS model to generate forecasts for the sampled observations and impute missing observations. The second stage is the high-resolution forecasting model applying the first stage predictions for the sampled spatial units for the prediction period, the predictions of covariates at the lower-level grid area, as well as the random components accounting for spatiotemporal correlation.
The above methodology was applied to create high-resolution forecasts for daily PM2.5 concentrations in Jakarta Province for 1–31 January 2023. To that end, we used daily PM2.5 data collected from 13 air-quality-monitoring stations from 1 January 2022 to 31 December 2022. The first stage involved generating forecasts of PM2.5 concentrations for the sampled spatial units and correcting for missing observations applying a Bayesian MSTS forecasting model incorporating a temporal trend and seasonal variation. We generated PM2.5 predictions for 31 days in January 2023. In the second stage, a high-resolution map was generated, applying the multiple predictions of the covariate’s altitude, population density, and precipitation at lower-level grid area and temporal and spatiotemporal random effects with spatially correlated GMRF innovations. This stage utilized a grid area comprising 34 × 34 cells, each with a spatial resolution of 1 km² per cell, covering the entire study area. We applied the INLA-SPDE model with first-order autoregressive dynamics and a Matérn covariance function for the innovations. Utilizing the 95% posterior exceedance probability, we identified high-risk regions with PM2.5 concentrations surpassing the World Health Organization lower limit of 35 μg/m³ (hotspots).
The performance of the predictive model was evaluated through cross-validation, revealing a strong correlation between the predicted and observed values. The validation results included a mean squared error (MSE) of 0.001, a mean absolute error (MAE) of 0.037, a mean absolute percentage error (MAPE) of 0.041, and an R2 of 0.855. These results demonstrate the precision and dependability of the two-stage model to predict PM2.5 concentrations.
The research findings indicated high PM2.5 concentrations during the beginning and end of January 2023, which are primarily attributable to the volume of vehicles in circulation. Areas situated at higher altitudes and receiving more precipitation had lower levels of pollution. On the other hand, densely populated regions had higher PM2.5 concentrations. In addition, the northern regions of Jakarta Province consistently had higher pollution levels than the southern regions because of port and industrial activities.
The application has several limitations. First, the number of observation stations is relatively small. Moreover, they are not evenly distributed across Jakarta Province. These data features are likely to have led to suboptimal predictions. Second, missing basic covariates, such as the number of motor vehicles, probably also had a negative impact on the predictions and the external validity.

5. Conclusions

High-resolution spatiotemporal forecasting is essential across a variety of scientific and policy domains, including air pollution research, where detailed spatiotemporal predictions of airborne pollutants such as PM2.5 concentrations are critical for public health and health policy. However, this task is complex and costly, especially in the case of large numbers of missing observations, leading to inaccurate forecasts due to imprecise interpolation. Another challenge to high-resolution spatiotemporal forecasting is the big N problem. To address both problems, we developed a two-stage model approach consisting of a multivariate spatial time series (MSTS) model and a spatiotemporal high-resolution forecasting model. The MSTS model generates forecasts for the sample of observed spatial units while imputing missing observations. The spatiotemporal high-resolution model generates forecasts for the sampled and unsampled (lower-level) spatiotemporal units, applying the first-stage predictions for the sampled spatial units for the prediction period, the predictions of covariates at the lower-level grid area, as well as random components accounting for spatiotemporal correlation. The successful application of the two-stage model to predict PM2.5 concentrations for Jakarta Province for 1–31 January 2023 lends support to the efficiency and reliability of the model.

Author Contributions

Idea formulation, I.G.N.M.J. and H.F.; methodology, I.G.N.M.J. and H.F.; theory, I.G.N.M.J. and H.F.; algorithm design, I.G.N.M.J. and H.F.; result analysis, I.G.N.M.J. and H.F.; writing, I.G.N.M.J. and H.F.; reviewing the research, I.G.N.M.J. and H.F.; supervision, H.F.; project administration, I.G.N.M.J.; funding acquisition, I.G.N.M.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Padjadjaran University through the Directorate of Research, Community Service, and Innovation (Grant No: 1893/UN6.3.1/PT.00/2024).

Data Availability Statement

The R GUI tool utilized in this research is open-source software, accessible at https://www.r-project.org (access on 1 January 2021). The R program developed for this study is also open source, with the code available at https://github.com/mindra-bit/HighResolutionForecasting (accessed on 2 February 2024). The datasets analyzed during this study can be found in the same repository.

Acknowledgments

With gratitude, we acknowledge the invaluable support of the Rector of Universitas Padjadjaran for facilitating Research Grant Program, which significantly contributed to this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Station ID and Coordinates.
Table A1. Station ID and Coordinates.
IDStationLongitudeLatitude
S1AHP—Capital Place106.820−6.232
S2Angkasa-Kemayoran106.843−6.156
S3Gading Harmony106.900−6.166
S4Jakarta GBK106.803−6.215
S5Jimbaran 2106.857−6.120
S6Kemayoran106.846−6.164
S7Pantai Mutiara106.796−6.110
S8RespoKare Mask—Wisma 76106.798−6.191
S9Simprug THL Area106.794−6.228
S10US Embassy in Central Jakarta106.834−6.183
S11Wisma Barito Pacific106.798−6.197
S12Wisma Korindo106.844−6.244
S13Wisma Matahari Power106.784−6.209

References

  1. Han, J.; Kamber, M.; Pei, J. Data mining trends and research frontiers. In Data Mining: Concepts and Techniques, 3rd ed.; Elsevier: Amsterdam, The Netherlands, 2012; pp. 585–631. [Google Scholar]
  2. Alghamdi, T.; Elgazzar, K.; Sharaf, T. Spatiotemporal traffic prediction using hierarchical Bayesian modeling. Future Internet 2021, 13, 225. [Google Scholar] [CrossRef]
  3. Cai, B.; Lawson, A.B.; Hossain, M.M.; Choi, J. Bayesian latent structure models with space-time-dependent covariates. Stat. Model. 2012, 12, 145–164. [Google Scholar] [CrossRef] [PubMed]
  4. Mohammadzadeh, M.; Zahmatkesh, S. Modeling of spatio-temporal data with non-ignorable missing. J. Adv. Math. Model. 2020, 10, 39–61. [Google Scholar]
  5. Lenzi, A.; Steinsland, I.; Pinson, P. Benefits of spatiotemporal modeling for short-term wind power forecasting at both individual and aggregated levels. Environmetrics 2018, 29, e2493. [Google Scholar] [CrossRef]
  6. Wang, Y.; Huang, C.; Hu, J.; Wang, M. Development of high-resolution spatio-temporal models for ambient air pollution in a metropolitan area of China from 2013 to 2019. Chemosphere 2022, 291, 132918. [Google Scholar] [CrossRef]
  7. Jaya, I.G.N.M.; Handoko, B.; Chadidjah, A.; Andriyana, Y.; Antikasari, M. Multivariate Bayesian semiparametric regression model for forecasting and mapping HIV and TB risks in West Java, Indonesia. Mathematics 2023, 11, 3641. [Google Scholar] [CrossRef]
  8. Haworth, J.; Cheng, T. Non-parametric regression for space–time forecasting under missing data. Comput. Environ. Urban Syst. 2012, 36, 538–550. [Google Scholar] [CrossRef]
  9. Jaya, I.G.N.M.; Folmer, H. Spatiotemporal high-resolution prediction and mapping: Methodology and application to dengue disease. J. Geogr. Syst. 2022, 24, 527–581. [Google Scholar] [CrossRef]
  10. Glasbey, C. Imputation of missing values in spatiotemporal solar radiation data. Environmetrics 1995, 6, 363–371. [Google Scholar] [CrossRef]
  11. Hyndman, R.J.; Athanasopoulos, G. Forecasting: Principles and Practice; Otexts: Melbourne, Australia, 2018. [Google Scholar]
  12. Rubin, D.B. Inference and missing data. Biometrika 1976, 63, 581–592. [Google Scholar] [CrossRef]
  13. Carreras, G.; Miccinesi, G.; Wilcock, A.; Preston, N.; Nieboer, D.; Deliens, L. Missing not at random in end-of-life care studies: Multiple imputation and sensitivity analysis on data from the ACTION study. BMC Med. Res. Methodol. 2021, 21, 13. [Google Scholar] [CrossRef] [PubMed]
  14. Wardana, I.N.K.; Gardner, J.W.; Fahmy, S.A. Estimation of missing air pollutant data using a spatiotemporal convolutional autoencoder. Neural. Comput. Appl. 2022, 34, 16129–16154. [Google Scholar] [CrossRef]
  15. Dondersa, A.R.T.; van der Heijden, G.J.M.G.; Stijnend, T.; Moons, K.G. Review: A gentle introduction to imputation of missing values. J. Clin. Epidemiol. 2006, 59, 1087–1091. [Google Scholar] [CrossRef]
  16. Li, H.; Li, M.; Lin, X.; He, F.; Wang, Y. A spatiotemporal approach for traffic data imputation with complicated missing patterns. Transp. Res. Part C Emerg. 2020, 119, 102730. [Google Scholar] [CrossRef]
  17. Raghunathan, T.E.; Lepkowski, J.M.; Hoewyk, V.; Solenberger, P. A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models. Surv. Methodol. 2001, 27, 85–95. [Google Scholar]
  18. Sharma, S.; Lingras, P.; Zhong, M. Effect of Missing Value Imputations on Traffic Parameters Estimations from Permanent Traffic Counts. Transport. Res. Board. 2003, 1836, 132–142. [Google Scholar]
  19. Julie, M.D.; Kannan, B. Attribute reduction and missing value imputing with ANN: Prediction of learning disabilities. Neural Comput. Appl. 2012, 21, 1757–1763. [Google Scholar] [CrossRef]
  20. Qu, L.; Li, L. PPCA-Based Missing Data Imputation for Traffic Flow Volume: A Systematical Approach. IEEE Trans. Intell. Transp. Syst. 2009, 10, 512–522. [Google Scholar]
  21. Ni, D.; Leonard, J.D., II; Guin, A.; Feng, C. Multiple Imputation Scheme for Overcoming the Missing Values and Variability Issues in ITS Data. J. Transp. Eng. 2005, 131, 931–938. [Google Scholar] [CrossRef]
  22. Ahn, H.; Sun, K.; Kim, K.P. Comparison of Missing Data Imputation Methods in Time Series Forecasting. Comput. Mater. Contin. 2022, 70, 767–779. [Google Scholar] [CrossRef]
  23. Jaya, I.G.N.M.; Folmer, H. Bayesian Spatiotemporal Mapping of Relative Dengue Disease Risk in Bandung, Indonesia. J. Geogr. Syst. 2020, 22, 105–142. [Google Scholar] [CrossRef]
  24. Martínez-Hernández, I.; Genton, M.G. Surface Time Series Models for Large Spatio-Temporal Datasets. Spat. Stat. 2023, 53, 100718. [Google Scholar] [CrossRef]
  25. Mittnik, S. Multivariate time series analysis with state space models. Comput. Math. Appl. 1989, 17, 1189–1201. [Google Scholar] [CrossRef]
  26. Tao, P.; Delleur, J. Multistation, Multiyear Synthesisof HydrologicTime Series by Disaggregation. Water Resour. Res. 1976, 12, 1303–1312. [Google Scholar] [CrossRef]
  27. Lindgren, F.; Rue, H.; Lindström, J. An explicit link between Gaussian fields and Gaussian Markov random fields: The stochastic partial differential equation approach. J. R. Stat. Soc. Stat. Methodol. Ser. B 2011, 73, 423–498. [Google Scholar] [CrossRef]
  28. Lasinio, G.J.; Mastrantonio, G.; Pollice, A. Discussing the “big n problem”. Stat. Methods Appl. 2013, 22, 97–112. [Google Scholar] [CrossRef]
  29. Banerjee, S.; Carlin, B.; Gelfand, A. Hierarchical Modeling and Analysis for Spatial Data; Chapman and Hall: London, UK, 2004. [Google Scholar]
  30. Blangiardo, M.; Cameletti, M. Spatial and Spatio-Temporal Bayesian Models with R-INLA; John Wiley & Sons: Chennai, India, 2015. [Google Scholar]
  31. Gomez-Rubio, V.; Cameletti, M.; Blangiardo, M. Missing data analysis and imputation via latent Gaussian Markov random felds. SORT 2022, 46, 217–244. [Google Scholar]
  32. Yusuf, A.A.; Resosudarmo, B.P. Does clean air matter in developing countries’ megacities? A hedonic price analysis of the Jakarta housing market, Indonesia. Ecol. Econ. 2009, 68, 1398–1407. [Google Scholar] [CrossRef]
  33. Ravishanker, N.; Raman, B.; Soyer, R. Dynamic Time Series Models Using R-INLA; CRC Press: Boca Raton, FL, USA, 2023. [Google Scholar]
  34. Morrison, K.T.; Shaddick, G.; Henderson, S.B.; Buckeridg, D.L. A latent process model for forecasting multiple time series in environmental public health surveillance. Statist. Med. 2016, 35, 3085–3100. [Google Scholar] [CrossRef]
  35. Sahu, S.K. Hierarchical Bayesian Models for Space–Time Air Pollution Data. In Handbook of Statistics; Elsevier: Amsterdam, The Netherlands, 2012; pp. 477–495. [Google Scholar]
  36. Cameletti, M.; Lindgren, F.; Simpson, D. Spatio-temporal modeling of particulate matter concentration through the SPDE approach. AStA Adv. Stat. Anal. 2013, 97, 109–131. [Google Scholar] [CrossRef]
  37. Sun, X.L.; Minasny, B.; Wang, H.L.; Zhaob, Y.G.; Zhang, G.L.; Wug, Y.J. Spatiotemporal modelling of soil organic matter changes in Jiangsu, China between 1980 and 2006 using INLA-SPDE. Geoderma 2021, 384, 114808. [Google Scholar] [CrossRef]
  38. Rue, H.; Held, L. Gaussian Markov Random Fields Theory and Applications; Chapman & Hall/CRC Taylor & Francis Group: Boca Raton, FL, USA, 2005. [Google Scholar]
  39. Simpson, D.; Rue, H.; Riebler, A.; Martins, T.G.; Sørbye, S.H. Penalising Model Component Complexity: A Principled, Practical Approach to Constructing Priors. Stat. Sci. 2017, 32, 1–28. [Google Scholar] [CrossRef]
  40. Asghar, K.; Ali, A.; Tabassum, A.; Nadeem, S.G.; Hakim, S.T.; Amin, M.; Raza, G.; Bashir, S.; Afshan, N.; Usman, N.; et al. Assessment of particulate matter (PM) in ambient air of different settings and its associated health risk in Haripur city, Pakistan. Braz. J. Biol. 2022, 84, 256190. [Google Scholar] [CrossRef] [PubMed]
  41. Dey, D.K.; Ghosh, S.K.; Mallick, B.K. Generalized Linear Models a Bayesian Perspective; Marcel Dekker, Inc.: New York, NY, USA, 2000. [Google Scholar]
  42. Gentle, J.E. Statistics and Computing; Springer: London, UK, 2009. [Google Scholar]
  43. Royal, R.M. Statistical Evidence A Likelihood Paradigm; Chapman & Hall: Boca Raton, FL, USA, 2000. [Google Scholar]
  44. Rue, H.; Martino, S.; Chopin, N. Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations. J. R. Stat. Soc. Ser. B 2009, 71, 319–392. [Google Scholar] [CrossRef]
  45. Sahu, S.K.; Böhning, D. Bayesian Spatio-Temporal Joint Disease Mapping of Covid-19 Cases and Deaths in Local Authorities of England. Spat. Stat. 2022, 49, 100519. [Google Scholar] [CrossRef]
  46. Chiuchiolo, C.; van Niekerk, J.; Rue, H. Joint Posterior Inference for Latent Gaussian Models with R-INLA. J. Stat. Comput. Simul. 2023, 93, 723–752. [Google Scholar] [CrossRef]
  47. Gomez-Rubio, V. Bayesian Inference with INLA; Taylor and Francis Group: Boca Raton, FL, USA, 2020. [Google Scholar]
  48. Spiegelhalter, D.J.; Best, N.G.; Carlin, B.P.; Van Der Linde, A. Bayesian Measures of Model Complexity and Fit. J. R. Stat. Soc. Stat. Methodol. Ser. B 2002, 64, 583–639. [Google Scholar] [CrossRef]
  49. Celeux, G.; Forbes, F.; Robert, C.; Titterington, D. Deviance Information Criteria for Missing Data Models. Bayesian Anal. 2006, 1, 651–674. [Google Scholar] [CrossRef]
  50. Watanabe, S. Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory. J. Mach. Learn. Res. 2010, 11, 3571–3594. [Google Scholar]
  51. Roos, M.; Held, L. Sensitivity Analysis in Bayesian Generalized Linear Mixed Models for Binary Data. Bayesian Anal. 2011, 6, 259–278. [Google Scholar] [CrossRef]
  52. Whittle, P. On Stationary Processes in the Plane. Biometrika 1954, 41, 434–449. [Google Scholar] [CrossRef]
  53. Tentang Jakarta. Available online: https://jakarta.go.id/tentang-jakarta (accessed on 7 July 2023).
  54. Bai, L.; Wang, J.; Ma, X. Air Pollution Forecasts: An Overview. Int. J. Environ. Res. Public Health 2018, 15, 780. [Google Scholar] [CrossRef] [PubMed]
  55. Wang, J.; Jiang, H.; Zhou, Q. China’s Natural Gas Production and Consumption Analysis Based on The Multicycle Hubbert Model and Rolling Grey Model. Renew. Sustain. Energy Rev. 2006, 53, 1149–1167. [Google Scholar] [CrossRef]
  56. Manisalidis, I.; Stavropoulou, E.; Stavropoulos, A.; Bezirtzoglou, E. Environmental and Health Impacts of Air Pollution: A Review. Front. Public Health 2020, 8, 505570. [Google Scholar] [CrossRef] [PubMed]
  57. Neiderud, C.J. How urbanization affects the epidemiology of emerging infectious diseases. Infect. Ecol. Epidemiol. 2015, 5, 27060. [Google Scholar] [CrossRef]
  58. Kusuma, W.L.; Chih-Da, W.; Yu-Ting, Z.; Hapsari, H.H.; Muhama, J.L. PM2.;5 Pollutant in Asia—A Comparison of Metropolis Cities in Indonesia and Taiwan. Int. J. Environ. Res. Public Health 2019, 16, 4924. [Google Scholar] [CrossRef]
  59. Clark, L.P.; Millet, D.B.; Marshall, J.D. Air Quality and Urban Form in U.S. Urban Areas: Evidence from Regulatory Monitors. Environ. Sci. Technol. 2011, 45, 7028–7035. [Google Scholar] [CrossRef]
  60. Song, P.; Han, H.; Feng, H.; Hui, Y.; Zhou, T.; Meng, W.; Yan, J. High altitude Relieves transmission risks of COVID-19 through meteorological and environmental factors: Evidence from China. Environ. Res. 2022, 212, 113214. [Google Scholar] [CrossRef]
  61. Grange, S.K.; Carslaw, D.C.; Lewis, A.C.; Boleti, E.; Hueglin, C. Random Forest meteorological normalisation models for Swiss PM10 trend analysis. Atmos. Chem. Phys. 2018, 18, 6223–6239. [Google Scholar] [CrossRef]
  62. Fujino, R.; Miyamoto, Y. PM2.5 Decrease with Precipitation as Revealed by Single-Point Ground-Based Observation. Atmos. Sci. Lett. 2022, 23, e1088. [Google Scholar] [CrossRef]
  63. Liu, Y.; Zhou, Y.; Lu, J. Exploring the Relationship Between Air Pollution and Meteorological Conditions in China Under Environmental Governance. Sci. Rep. 2020, 10, 14518. [Google Scholar] [CrossRef] [PubMed]
  64. Solano, J.; Montaño, T.; Maldonado-Correa, J.; Ordóñez, A.; Pesantez, M. Correlation Between the Wind Speed and the Elevation to Evaluate the Wind Potential in the Southern Region of Ecuador. Energy Rep. 2021, 7, 259–268. [Google Scholar] [CrossRef]
  65. Johansson, B.; Chen, D. The Influence of Wind and Topography on Precipitation Distribution in Sweden: Statistical Analysis and Modelling. Int. J. Climatol. 2023, 23, 1523–1535. [Google Scholar] [CrossRef]
  66. Yu, W.; Li, S.; Ye, T.; Xu, R.; Song, J.; Guo, Y. Deep Ensemble Machine Learning Framework for the Estimation of PM2:5 Concentrations. Environ. Health Perspect. 2022, 120, 37004. [Google Scholar] [CrossRef] [PubMed]
  67. Bolin, D.; Kirchner, K. The Rational SPDE Approach for Gaussian Random Fields with General Smoothness. J. Comput. Graph. Stat. 2020, 29, 274–285. [Google Scholar] [CrossRef]
  68. Lunn, D.; Jackson, C.; Best, N.; Thomas, A.; Spiegelhalter, D. The BUGS Book a Practical Introduction to Bayesian Analysis; CRC Press: Boca Raton, FL, USA, 2012. [Google Scholar]
  69. Vehtari, A.; Gelman, A.; Gabry, J. Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC. Stat. Comput. 2017, 27, 1413–1432. [Google Scholar] [CrossRef]
  70. Jaya, I.G.N.M.; Folmer, H. Does the inclusion of spatiotemporally confounded covariates improve the forecasting accuracy of spatiotemporal models? A simulation study of univariate and causal forecasting models. Geogr. Syst. 2024, 1–40, undereview. [Google Scholar]
  71. Lestari, P.; Arrohman, M.K.; Damayanti, S.; Klimont, Z. Emissions and spatial distribution of air pollutants from anthropogenic sources in Jakarta. Atmos. Pollut. Res. 2022, 13, 101521. [Google Scholar] [CrossRef]
  72. Keller, J.P.; Olives, C.; Kim, S.Y.; Sheppard, L.; Sampson, P.D.; Szpi, A.A. A Unified Spatiotemporal Modeling Approach for Predicting Concentrations of Multiple Air Pollutants in the Multi-Ethnic Study of Atherosclerosis and Air Pollution. Environ. Health Perspect. 2015, 123, 301–309. [Google Scholar] [CrossRef]
  73. Thangavel, P.; Park, D.; Lee, Y.C. Recent Insights into Particulate Matter (PM2.5)-Mediated Toxicity in Humans: An Overview. Int. J. Environ. Res. Public Health 2022, 19, 7511. [Google Scholar] [CrossRef]
  74. Liu, C.; Chen, R.; Sera, F.; Vicedo-Cabrera, A.; Guo, L.; Tong, S.; Coelho, M.; Saldiva, P. Ambient Particulate Air Pollution and Daily Mortality in 652 Cities. N. Engl. J. Med. 2019, 381, 705–715. [Google Scholar] [CrossRef]
  75. Mujtaba, G.; Shahzad, S.J.H. Air pollutants, economic growth and public health: Implications for sustainable development in OECD countries. Environ. Sci. Pollut. Res. 2021, 28, 12686–12698. [Google Scholar] [CrossRef]
Figure 1. Two-stage high-resolution prediction model with missing observations.
Figure 1. Two-stage high-resolution prediction model with missing observations.
Mathematics 12 02899 g001
Figure 2. Map of Jakarta, with an inset highlighting its location within Indonesia and Southeast Asia (note: 0–5–10 km indicates the scale of the map).
Figure 2. Map of Jakarta, with an inset highlighting its location within Indonesia and Southeast Asia (note: 0–5–10 km indicates the scale of the map).
Mathematics 12 02899 g002
Figure 3. Jakarta Province and the distribution of air-quality-monitoring stations (Station IDs can be found in Appendix A).
Figure 3. Jakarta Province and the distribution of air-quality-monitoring stations (Station IDs can be found in Appendix A).
Mathematics 12 02899 g003
Figure 4. Distribution of missing observations from 1 January to 31 December 2022.
Figure 4. Distribution of missing observations from 1 January to 31 December 2022.
Mathematics 12 02899 g004
Figure 5. Time variation of PM2.5 concentrations at Gading Harmony (S3) and RespoKare Mask–Wisma 76 (S8).
Figure 5. Time variation of PM2.5 concentrations at Gading Harmony (S3) and RespoKare Mask–Wisma 76 (S8).
Mathematics 12 02899 g005
Figure 6. Monthly average PM2.5 concentrations (μg/m3) per site.
Figure 6. Monthly average PM2.5 concentrations (μg/m3) per site.
Mathematics 12 02899 g006
Figure 7. Covariates: (A) Population density (people/km2), (B) altitude (m), (C) precipitation (mm3).
Figure 7. Covariates: (A) Population density (people/km2), (B) altitude (m), (C) precipitation (mm3).
Mathematics 12 02899 g007
Figure 8. Observed and MTST-predicted PM2.5 concentrations for 1 November–31 December 2022, for the 13 monitoring stations.
Figure 8. Observed and MTST-predicted PM2.5 concentrations for 1 November–31 December 2022, for the 13 monitoring stations.
Mathematics 12 02899 g008
Figure 9. MSTS-predicted (solid lines, 1–31 January 2023) and observed PM2.5 concentrations (red dots, 1 January–31 December 2022) for the 13 monitoring stations.
Figure 9. MSTS-predicted (solid lines, 1–31 January 2023) and observed PM2.5 concentrations (red dots, 1 January–31 December 2022) for the 13 monitoring stations.
Mathematics 12 02899 g009
Figure 10. Box plots of the monthly distribution of the PM2.5 concentrations for 1 January–31 December 2022 per observation station in μg/m3.
Figure 10. Box plots of the monthly distribution of the PM2.5 concentrations for 1 January–31 December 2022 per observation station in μg/m3.
Mathematics 12 02899 g010
Figure 11. The meshed study area.
Figure 11. The meshed study area.
Mathematics 12 02899 g011
Figure 12. Observed versus predicted high-resolution PM2.5 concentrations for the period 1–31 January 2023 for models M1–M5 for three selected monitoring stations (S5, S6, and S10).
Figure 12. Observed versus predicted high-resolution PM2.5 concentrations for the period 1–31 January 2023 for models M1–M5 for three selected monitoring stations (S5, S6, and S10).
Mathematics 12 02899 g012
Figure 13. The global temporal pattern of the PM2.5 predictions versus the global temporal pattern of the observations, January 2023.
Figure 13. The global temporal pattern of the PM2.5 predictions versus the global temporal pattern of the observations, January 2023.
Mathematics 12 02899 g013
Figure 14. Temporal pattern of PM2.5 and tracer gas data (CO and NO2), January 2023.
Figure 14. Temporal pattern of PM2.5 and tracer gas data (CO and NO2), January 2023.
Mathematics 12 02899 g014
Figure 15. Posterior means of the standard deviations of the GF innovations, 1–31 January 2023.
Figure 15. Posterior means of the standard deviations of the GF innovations, 1–31 January 2023.
Mathematics 12 02899 g015
Figure 16. The predicted high-resolution daily PM2.5 concentration per grid area in January 2023 (μg/m3).
Figure 16. The predicted high-resolution daily PM2.5 concentration per grid area in January 2023 (μg/m3).
Mathematics 12 02899 g016
Figure 17. PM2.5 exceedance probabilities for level c = 35 μg/m3 per 1 km × 1 km grid area for 1–31 January 2023.
Figure 17. PM2.5 exceedance probabilities for level c = 35 μg/m3 per 1 km × 1 km grid area for 1–31 January 2023.
Mathematics 12 02899 g017
Table 1. Temporal characteristics of the monthly PM2.5 concentration (μg/m3).
Table 1. Temporal characteristics of the monthly PM2.5 concentration (μg/m3).
MonthMinimumMeanMaximumStandard Deviation
January1.0030.4396.8219.07
February2.8728.7973.9015.37
March2.1729.27106.1020.98
April2.0032.3679.4718.29
May1.9233.8088.6517.79
June2.1051.25124.6123.11
July2.7649.14103.0018.17
August19.9648.78126.2915.74
September12.1646.2380.9212.58
October4.2532.3366.2212.48
November6.6827.1968.8814.70
December4.8730.3485.5315.93
Overall1.0037.46126.2919.21
Table 2. The predictors in the spatiotemporal model.
Table 2. The predictors in the spatiotemporal model.
PredictorData SourceDescriptionUnitSpatial ResolutionTemporal Resolution
Population densityhttps://data.jakarta.go.id/
(accessed on 1 January 2021)
Population density among the 267 sub-districts in 2020people/km2Average area of 2.5 km2Annual
Altitudehttps://www.worldclim.org/
(accessed on 1 January 2021)
meter1 km × 1 kmAnnual
Precipitationhttps://bmkg.go.id/
(accessedon 1 January 2021)
mm0.5 km × 0.5 kmDaily
Table 3. Model comparison and selection.
Table 3. Model comparison and selection.
ModelPDICDICPWAICWAICLMLMSEMAEMAPER2
M14.00464.9984.268405465.094−268.80582,805.424287.75941.9230.556
M217.54345.8117.558823346.541−237.399114,456.464338.31430.3860.747
M3199.88204.763135.570765185.703−244.95714,369.369119.87210.2570.748
M4306.51−2511.548154.588434−2631.043300.0490.0010.0370.0410.855
M5304.34−2493.711154.148557−2612.901300.0260.0010.0380.0440.858
Table 4. Summary statistics for the fixed effects regression coefficients of model M4 with their p-values.
Table 4. Summary statistics for the fixed effects regression coefficients of model M4 with their p-values.
CovariateMeanSDCritical Ratiop-Value
Intercept2.9730.12523.7840.000
Altitude−0.0780.093−0.8390.281
Population density 0.0700.0790.8860.269
Precipitation−0.0390.023−1.6960.095
Table 5. Summary statistics for the random effects and the innovations and their standard deviations, p-values, and fractions of variance.
Table 5. Summary statistics for the random effects and the innovations and their standard deviations, p-values, and fractions of variance.
Random Effects and InnovationsMeanSDCritical Ratiop-ValueFraction of Variance (%)
Range (r)4301.302640.43526.71620.000
Autoregressive coefficient ( λ )0.97420.0076128.18420.000
SD error ( σ ε )0.00680.00262.61540.0130.0199
SD temporal trend ( σ v )0.28470.04106.94390.00034.9319
SD innovations ( σ ϱ )0.38850.06086.38980.00065.0482
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Jaya, I.G.N.M.; Folmer, H. High-Resolution Spatiotemporal Forecasting with Missing Observations Including an Application to Daily Particulate Matter 2.5 Concentrations in Jakarta Province, Indonesia. Mathematics 2024, 12, 2899. https://doi.org/10.3390/math12182899

AMA Style

Jaya IGNM, Folmer H. High-Resolution Spatiotemporal Forecasting with Missing Observations Including an Application to Daily Particulate Matter 2.5 Concentrations in Jakarta Province, Indonesia. Mathematics. 2024; 12(18):2899. https://doi.org/10.3390/math12182899

Chicago/Turabian Style

Jaya, I Gede Nyoman Mindra, and Henk Folmer. 2024. "High-Resolution Spatiotemporal Forecasting with Missing Observations Including an Application to Daily Particulate Matter 2.5 Concentrations in Jakarta Province, Indonesia" Mathematics 12, no. 18: 2899. https://doi.org/10.3390/math12182899

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop