3.1 Computation of Incidence Rate using Block-DSS
We use the incidence rate, the number of confirmed positive tests in a region, in a period of 14 consecutive days, divided by the population of the region, as a proxy for assessing the risk of infection. In particular, we consider the incidence rate per municipality
\(z_\alpha (t)\), defined by
where
\(c_\alpha (t)\) is the number of confirmed positive tests in the 14 days preceding a given day
t, at each municipality,
\(\alpha\), and
\(n_\alpha\) is the respective population size. We will use
\(z_u\) to denote the incidence rate at a specific location
u (or cell) in the country. In the sequence, we may drop the explicit dependence on
t to simplify the notation, when not required.
The daily numbers of new cases by the municipality are reported regularly by the Portuguese DGS, and they have been converted to incidence rates by dividing by the population count of each municipality, as given by official statistics, and scaled to 100,000 inhabitants (thus corresponding to the number of new cases in the most recent 14 days per 100,000 inhabitants). This will be the scale used in the maps and reports in this article.
Since not all cases are detected, this incidence rate may underestimate the real incidence rate. However, it is reasonable to assume that the ratio between the actual number of newly infected persons and the number of detected cases is approximately equal in the different regions of the country, enabling us to use the incidence rate computed in this way as a proxy to the real incidence rate. In practice, we may be underestimating the incidence rate by a significant factor, since it is known [
29] that a relevant fraction of the COVID-19 cases are asymptomatic and remain undetected. This has an impact on the value of the parameters of the dynamic models, since important characteristics of the process such as the group immunity threshold, depending on the actual rates of infection and recovery. In practice, we believe this underestimation does not have a strong impact on the accuracy of the models (other than a systematic underestimation), since all model parameters are estimated from actual data and updated as time goes by in order to reflect the dynamics of the pandemic. Therefore, the only significant impact of the fact that not all cases are detected is the underestimation of the incidence rate, by a factor that can be considered approximately constant, if one assumes that there are no significant differences in testing strategy from region to region. This assumption can be justified by the relative homogeneity of the health system in a country like Portugal.
We consider the country divided into a regular grid of square cells, which, in this work, have a dimension of 2 km by 2 km (see Figure
1 for an example of the discretization used).
The methodology can be trivially adapted to a different grid. We compute the incidence rate maps by estimating
\(z_u\) for each cell in the country, using geostatistical stochastic simulation, namely block-DSS [
30], which is based on a stationary spatial covariance model (i.e., stationary variogram model). This modeling technique allows the daily update of COVID-19 incidence rate maps at high-resolution together with the associated spatial uncertainty. The geostatistical incidence rate maps are continuous (up to the scale of the discretization) and their spatial distribution follows a given spatial pattern as revealed by a variogram model inferred from the data. The incidence rate maps do not show any sharp discontinuities and are not influenced by administrative boundaries, such as municipality borders. Since incidence rates refer to different population sizes, these must be weighted when calculating the experimental variogram such that municipalities with large populations have greater weighting.
Below, we briefly describe this geostatistical simulation method. A full description of the method can be found in previous work by the authors [
1]. We assume that the geometric centroid of each municipality
\(\alpha\) has coordinates
\((x_\alpha ,y_\alpha)\). Data are available for the 278 municipalities in mainland Portugal, as illustrated in Figure
2.
For clarity, in this section, we will drop the explicit dependency on t. The daily update of incidence rates \(z_\alpha\) known for each municipality is assigned to each centroid (weighted by population density) and are used as experimental data in the geostatistical simulation. This is a reasonable approximation given the relatively small area of each municipality when compared with the area of the country.
Within a geostatistical framework
\(z_\alpha\) is interpreted as a realization of a random variable
\(Z_\alpha\) corresponding to the true, and unknown, incidence rate in municipality
\(\alpha\). The expected value of
\(E[Z_\alpha ]\) is an approximation of the incidence rate.
Z depends on the population size of each municipality. For example, a given municipality with a small population size
\(n_\alpha\) (i.e., a small denominator) will have high variance and consequently higher uncertainty in the incidence rate estimate. We use the Poisson kriging model [
11] to define the risk variance
Block-DSS [
30] is a stochastic sequential simulation method based on the Poisson kriging model and an adequate modeling technique to deal with data with varying spatial support (i.e., municipalities with varying size and shape) and to make spatial predictions with a change of support, e.g., from the area (block) to point support. In this case, the scale related to the map cells can be referred to as point support, because it denotes a small area when compared with the municipality areas. The following sequence of steps summarizes the block-DSS algorithm [
30]:
(1)
Generate a random path that visits each cell, \({u}\), of the simulation grid;
(2)
at each location along the random path, \({u}\), search the conditioning data within a given neighborhood dependent on the variogram model. The conditioning data comprises the closest experimental data, previously simulated values and block data;
(3)
calculate the local covariance values considering spatial covariance matrices between block-to-block, point-to-block, and point-to-point. The point data represents the incidence rates assigned at the centroid of each municipality and the block data are defined as the spatial linear average of point values. These matrices are built to solve the block kriging system and obtain the local mean and kriging variance estimate at location
\({u}\) [
30];
(4)
draw a value, \(z_u\), from the global probability distribution function centered at the local mean and bounded by the local variance obtained in (3);
(5)
add the simulated value to the dataset and repeat steps (2) to (6) until all grid cells are simulated for one realization.
3.2 Dataset Generation and Characterization
Block-DSS generates alternative incidence rate maps, designated realizations, at each run, as the random path changes and consequently the conditioning data when simulating a location \({u}\). A given set of realizations provides the value of the uncertainty of incidence rate distribution in a given cell. We have simulated daily sets of one hundred realizations of incidence rates for mainland Portugal.
We computed the incidence values for each cell in the territory, for each day in the period under consideration. The final dataset, used to assess the performance of the models, was generated from data in 278 municipalities, for the period between March 1st, 2020 and February 28th, 2021. The territory was discretized into 40,608 square cells (in a \(288 \times 141\) grid), each cell corresponding to a surface of 2 km by 2 km. The resulting dataset consists of information (incidence and uncertainty estimates) for each cell during the first 365 days of the pandemic. The first 184 days (March 1st until August 31st) were used as the training set, to train or define the parameters of the models, and the following 181 days were used to test the prediction ability of the models.
To illustrate the data, Figure
3 depicts an example of the computation of the incidence rates, for a specific day in the period.
Since the method also derives confidence intervals for the incidence rate in each cell, we also make these available (see example in Figure
4).
The value of the incidence rate at each cell represents a good approximation, given existing data limitations, to the real incidence rate at that location. Both the incidence risk maps and the confidence intervals are made available as supplemental data.