Abstract
Balanced sampling is a random method for sample selection, whose use is preferable when auxiliary information is available for all units of a population. However, implementing balanced sampling can be a challenging task, due in part to the computational effort required and to the necessity of respecting both the balancing constraints and the inclusion probabilities. In the present paper, a new algorithm for selecting balanced samples is proposed. The method is inspired by simulated annealing algorithms, as balanced sample selection can be interpreted as an optimization problem. A set of simulation experiments and an example using real data show the efficiency and accuracy of the proposed algorithm.
1 Introduction
Balanced sampling refers to a class of techniques aimed at randomly selecting units from a given population. The selection of balanced samples was first introduced by Gini (1928) and later recalled by Yates (1946) and Thionnet (1953), stimulating from then on an increasing interest in addressing, by means of balancing, several practical survey sampling problems (see for example Falorsi and Righi 2008; Grafström and Tillé 2013; Brus 2015; Benedetti and Piersimoni 2017; Chauvet 2017; Marazzi and Tillé 2017; Tillé et al. 2018; Chauvet and Le Gleut 2019; Kermorvant et al. 2019). The peculiarity of balanced sampling consists of exploiting, at the design stage, a priori auxiliary information available for all units of a population, so that the expansion estimator (Narain 1951; Horvitz and Thompson 1952) returns the totals of the balancing variables known at the population level. The stronger the correlation between the variables of interest and the balancing variables, the higher the efficiency of a balanced sampling design in estimating the population total.
Several contributions have been proposed in the literature to select balanced samples (for a review see Valliant et al. 2000). All these methods are partial solutions for a variety of reasons, chief among which are the considerable computing time required when several balancing variables are used (e.g. in enumerative sampling; Ardilly 1991) and problems in respecting the original inclusion probabilities (e.g. in rejective sampling; Hájek 1981). A general solution for balanced sampling was finally proposed by Deville and Tillé (2004), whose cube method allows for the selection of balanced samples with equal or unequal inclusion probabilities and any number of auxiliary variables in a short execution time (Chauvet and Tillé 2006; Grafström and Lisic 2016). This method is based on a random transformation of the inclusion probabilities vector, which draws a sample that exactly, or at least approximately, satisfies the original inclusion probabilities and the balancing equations (Tillé 2011). Deville and Tillé (2004) provided a solution that allows for the selection of approximately balanced samples while respecting the inclusion probabilities (for an exhaustive presentation of the cube method, see Deville and Tillé 2004; Tillé 2006).
The selection of balanced samples may be viewed as a constrained optimization problem with discrete variables, and a solution may be borrowed from physics. In particular, simulated annealing (Kirkpatrick et al. 1983) is a well-known Markov chain Monte Carlo method aimed at solving complex combinatorial problems, such as the traveling salesman problem. This method uses the Metropolis-Hastings algorithm to approximate the global optimum of a given function and is a preferable alternative when approximating the global optimum is more important than finding a precise local one. For a review of the applications of simulated annealing, see e.g. Aarts and van Laarhoven (1987). In the present paper, we show that a deterministic version of simulated annealing may be applied in a sampling context to select high-quality, fixed-size balanced samples. A very fast algorithm called Simulated Annealing for Balanced Sampling (sabs) is presented, and its quality in terms of sample balance and respect of the inclusion probabilities is evaluated by means of Monte Carlo simulations. A comparison with the cube method is also carried out. The paper is structured as follows. In Sect. 2, preliminaries and notation are given. In Sect. 3, the new algorithm is presented and both technical and theoretical aspects are detailed. In Sect. 4, the results of a simulation study exploring several scenarios are presented. Section 5 is devoted to the practical application of the algorithm to a real dataset. Finally, Sect. 6 discusses the results and concludes.
2 Preliminaries and notation
Let U be a finite population of N units labelled \(k\in \left\{ 1,...,N\right\} \). Let y denote the variable of interest and \(y_{k},\) \(k\in U\), its value for the \(k\)-th population unit. The aim is to estimate the population total \(Y=\sum _{k\in U}y_{k}\). Let S be a random fixed-size sample, defined as a subset of U selected under a probability distribution \(p(\cdot )\) according to a without-replacement sampling design, such that \(Pr(S=s)=p(s)\) and \(\sum _{s\subset U}p(s)=1\). A random sample may also be defined as a discrete non-negative random vector \(\mathbf {a}=\left( a_{1},...,a_{k},...,a_{N}\right) ^{\text {T}}\), where \(a_{k}\) indicates the number of selections of unit \(k\in U\) in the sample. The variable \(a_{k}\) is an inclusion indicator which, in without-replacement sampling, equals 1 if unit k is selected in the sample and 0 otherwise (i.e. it follows a Bernoulli distribution). The inclusion probability \(\pi _{k}\) is the probability that unit k is selected in the sample. It is derived from the chosen sampling design as \(\pi _{k}=Pr(k\in S)=E_{p}(a_{k})=\sum _{s\subset U,s\ni k}p(s)\), for each \(k\in U\). In a design-based perspective, the population total Y can be estimated by the expansion estimator (Narain 1951; Horvitz and Thompson 1952) \(\widehat{Y}=\sum _{k\in S}\frac{y_{k}}{\pi _{k}}=\sum _{k\in U}\frac{y_{k}a_{k}}{\pi _{k}}\), with \(\pi _{k}>0\).
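As a toy illustration of the expansion estimator, the following Python sketch (the data and names are invented for the example) draws an equal-probability sample without replacement and expands each sampled value by \(1/\pi_{k}\):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative population of N = 1000 units (invented data).
N, n = 1000, 100
y = rng.exponential(scale=2.0, size=N)
Y = y.sum()                       # true population total

pi = np.full(N, n / N)            # equal inclusion probabilities n / N

# Fixed-size sample drawn without replacement.
sample = rng.choice(N, size=n, replace=False)

# Expansion (Horvitz-Thompson) estimate of the total.
Y_hat = float(np.sum(y[sample] / pi[sample]))
```

With equal probabilities the estimator reduces to \(N\) times the sample mean; the same code works unchanged for unequal \(\pi_{k}\).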
Let \(\mathbf {z}_{k}=(z_{k1},...,z_{kj},...,z_{kJ})^{\text {T}}\) be a vector of J auxiliary variables known for all the units in the population U, so that the J totals \(Z_{j}=\sum _{k\in U}z_{kj}\), \(j=1,...,J\), can be computed. When a sample is selected, the expansion estimator of the j-th auxiliary total is \(\hat{Z}_{j}=\sum _{k\in S}\frac{z_{kj}}{\pi _{k}}\). A sampling design p(s) is said to be balanced on the auxiliary variables \(\mathbf {z}_{k}\) if and only if it satisfies the balancing equations
\[
\sum _{k\in S}\frac{z_{kj}}{\pi _{k}}=\sum _{k\in U}z_{kj},\qquad j=1,...,J.
\]
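The balancing equations can be checked empirically for any selected sample by comparing the expansion estimates \(\hat{Z}_{j}\) with the known totals \(Z_{j}\). A minimal sketch with simulated auxiliary variables (an ordinary without-replacement sample will generally not be balanced, which is exactly what balanced designs aim to correct):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, J = 500, 50, 3
z = rng.uniform(size=(N, J))      # auxiliary variables known for all units
Z = z.sum(axis=0)                 # known population totals Z_j
pi = np.full(N, n / N)

# An ordinary without-replacement sample, not chosen to be balanced.
sample = rng.choice(N, size=n, replace=False)
Z_hat = (z[sample] / pi[sample, None]).sum(axis=0)

# Relative gaps in the balancing equations; zero means exact balance.
rel_gap = (Z_hat - Z) / Z
```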
The selection of a balanced sample may be a challenging task, because the inclusion indicators are integer-valued while the balancing constraints generally are not. Indeed, a rounding problem may prevent the exact satisfaction of the balancing constraints. Therefore, the aim of this study is to find a design that exactly, or at least approximately, satisfies the balancing equations. The rounding problem becomes less relevant when the sample size is large, but this condition may not be compatible with the practical constraints of conducting a survey. In addition, the respect of the inclusion probabilities cannot be overlooked, as some balanced sampling methods do not always satisfy them.
Moreover, balanced sampling may be formulated and solved as a linear programming problem. In this respect, each sample is assigned a cost C(s), equal to zero if the sample is perfectly balanced and increasing as the sample becomes more unbalanced. The aim is then to find a sampling design p(s) that minimizes the mean cost \(\sum _{s\subset U}C(s)p(s)\), subject to the constraints \(\sum _{s\subset U}p(s)=1\) and \(\sum _{s\subset U,s\ni k}p(s)=\pi _{k}\), \(k\in U\), which ensure that the inclusion probabilities are respected.
Deville and Tillé (1998) demonstrated that the solution of this problem leads to the selection of a minimal support design. Unfortunately, applying linear programming is in most cases unfeasible, as it requires the enumeration of all \(2^{N}\) possible samples, a number that quickly becomes too large. This reveals the need for a balanced sampling design that avoids such enumeration. In this respect, the cube method proposed by Deville and Tillé (2004) considers the \(2^{N}\) possible samples as \(2^{N}\) vectors in \(\mathbb {R}^{N}\). This method provides a general solution for balancing on several variables, with equal or unequal inclusion probabilities. The cube method is based on a random transformation of the inclusion probabilities vector, and a sample is obtained when the inclusion probabilities are exactly satisfied and the balancing equations are satisfied as well as possible (Deville and Tillé 2004; Chauvet and Tillé 2006).
3 Simulated annealing-based algorithm
In balanced sampling, the aim is to select a fixed-size random sample according to a multivariate distribution with conditions on its support. Simulated annealing may represent a suitable solution, since it can be used to generate samples from any high-dimensional distribution whose probability function is known. Exploiting the idea at the basis of the method, a combinatorial optimization problem can be viewed as a stochastic process (Robert and Casella 2013), in which local conditional distributions depend on a global control parameter, the so-called temperature (for a readable explanation, see Geman and Geman 1984). Simulated annealing requires the generation of a finite sequence of decreasing temperature values (the annealing schedule), combined with a probabilistic acceptance rule (the Metropolis-Hastings algorithm), in order to converge toward the set of global optima (i.e. the minimum-energy states). The parallel with sampling problems easily follows. Specifically, let us define an energy function
\[
f\left( \mathbf {a}\right) =\frac{1}{J}\sum _{j=1}^{J}\left( \frac{\hat{Z}_{j}-Z_{j}}{Z_{j}}\right) ^{2}.
\]
This represents a mean quadratic distance between the estimated totals and the known totals of the balancing variables, for any possible configuration of the sample \(\mathbf {a}\in \left\{ 0,1\right\} ^{N}\). Note that any distance function can be used as an energy function, with no restrictions on its form. Clearly, if \(f\left( \mathbf {a}\right) =0\), the balance of the sample is exactly satisfied. The proposed sampling algorithm works toward this condition by searching for an optimal configuration. Hence, from the configuration \(\mathbf {a}^{(i)}\) obtained at the \(i\)-th attempt of the algorithm, another configuration \(\mathbf {a}^{(e)}\) is obtained by randomly exchanging the inclusion indicator labels of two units at each iteration, which preserves the sample size. The second configuration is preferred to the first if \(f\left( \mathbf {a}^{(e)}\right) <f\left( \mathbf {a}^{(i)}\right) \), a deterministic acceptance rule following the idea of Besag (1986) in the image processing context. The maximum number of iterations is MAXITER, where each iteration consists of N attempts, MAXITER is chosen by the user and N is the population size. Therefore, the algorithm performs at most \(MAXITER\cdot N\) attempts. The procedure stops earlier if a pre-fixed minimal level of respect of the balancing constraints is reached. Thus, a final configuration is always obtained, and no restrictions are needed on the summary index used in the convergence (CONV) check.
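The exchange procedure just described can be sketched as follows. This is a simplified, hypothetical re-implementation in Python (the original sabs is written in C++ via Rcpp, and the energy function here assumes relative differences, which is one possible choice of mean quadratic distance):

```python
import numpy as np

def energy(a, z, Z, pi):
    """Mean squared relative distance between the estimated and the known
    totals of the balancing variables (one possible energy function)."""
    s = a == 1
    Z_hat = (z[s] / pi[s, None]).sum(axis=0)
    return float(np.mean(((Z_hat - Z) / Z) ** 2))

def sabs_sketch(z, pi, n, maxiter=10, conv=1e-6, seed=None):
    """Deterministic annealing search for a balanced fixed-size sample.

    At each attempt the inclusion labels of one sampled and one
    non-sampled unit are exchanged; the exchange is kept only if the
    energy decreases (deterministic acceptance, as in Besag 1986)."""
    rng = np.random.default_rng(seed)
    N = len(pi)
    a = np.zeros(N, dtype=int)
    a[rng.choice(N, size=n, replace=False)] = 1   # random starting sample
    Z = z.sum(axis=0)                             # known population totals
    f = energy(a, z, Z, pi)
    for _ in range(maxiter * N):                  # at most MAXITER * N attempts
        if f < conv:                              # minimal balancing reached
            break
        i = rng.choice(np.flatnonzero(a == 1))    # a sampled unit
        j = rng.choice(np.flatnonzero(a == 0))    # a non-sampled unit
        a[i], a[j] = 0, 1                         # candidate exchange
        f_new = energy(a, z, Z, pi)
        if f_new < f:
            f = f_new                             # keep the improvement
        else:
            a[i], a[j] = 1, 0                     # revert the exchange
    return a, f

# Small demonstration on simulated data.
rng = np.random.default_rng(1)
N, n, J = 200, 20, 3
z = rng.uniform(size=(N, J))
pi = np.full(N, n / N)
a, f = sabs_sketch(z, pi, n, seed=2)
```

The exchange of a 0 and a 1 keeps \(\sum_k a_k = n\) at every attempt, so the fixed sample size is respected by construction.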
This procedure implies that a local minimum, not necessarily a global one, is reached. The way to avoid entrapment in local minima lies in the use of a probabilistic decreasing rule, as in traditional implementations of simulated annealing. Unfortunately, the main drawback is the strong computational effort required, because the temperature must be decreased very slowly so that all possible solutions may be visited. This makes the traditional procedure difficult to use in practice, especially with a high number of balancing variables and large populations. In addition, the search for global optima may produce undesirable solutions in terms of the selected samples. The proposed algorithm is instead based on a deterministic decreasing rule which, according to our computational experience, satisfies the balancing constraints in a reasonable number of iterations and yields first-order inclusion probabilities very close to those desired, as shown in the following section.
From a practical point of view, sabs selects fixed-size random samples without replacement, according to the initial vector of inclusion probabilities. This means that for each configuration it is possible to resort to the expansion estimator and to exploit its properties for the estimation of the total, its variance, and the variance estimator (Horvitz and Thompson 1952).
4 Simulation experiments
In order to check the properties of the proposed sampling algorithm, several simulation experiments were carried out. Twelve populations were generated according to different distributions and sizes: a Uniform \(\mathcal {U}\left( 0,1\right) \), a Normal \(\mathcal {N}\left( 0,1\right) +6\), an Exponential \(Exp\left( 0.5\right) \), and a Bimodal distribution defined as the mixture of normals \(0.4\,\mathcal {N}(2,0.25)+0.6\,\mathcal {N}(4,0.25)\), each with population sizes equal to 1000, 5000 and 10000 units. These populations are referred to as \(U_{1}\) - \(U_{12}\), and a summary of their characteristics is reported in Table 1.
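The four generating distributions can be reproduced with a short sketch. It assumes, as the \(\mathcal{N}(\mu, 0.25)\) notation suggests, that 0.25 denotes the variance of each normal component; that reading is an assumption:

```python
import numpy as np

def make_population(dist, N, rng):
    """Generate one simulated population as described in the experiments."""
    if dist == "uniform":          # U(0, 1)
        return rng.uniform(0.0, 1.0, size=N)
    if dist == "normal":           # N(0, 1) + 6
        return rng.normal(0.0, 1.0, size=N) + 6.0
    if dist == "exponential":      # Exp(rate = 0.5), i.e. mean 2
        return rng.exponential(scale=2.0, size=N)
    if dist == "bimodal":          # 0.4 N(2, 0.25) + 0.6 N(4, 0.25)
        comp = rng.random(N) < 0.4
        sd = np.sqrt(0.25)         # 0.25 taken as the variance (assumption)
        return np.where(comp, rng.normal(2.0, sd, N), rng.normal(4.0, sd, N))
    raise ValueError(dist)

rng = np.random.default_rng(0)
populations = {
    (dist, N): make_population(dist, N, rng)
    for dist in ("uniform", "normal", "exponential", "bimodal")
    for N in (1000, 5000, 10000)
}
```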
For each population, \(J=3;5;10\) independent auxiliary variables have been generated, for a total of 36 scenarios. A relatively large number of auxiliary variables was considered in order to evaluate the performance of the proposed method in the presence of rich auxiliary information.
For each of the above scenarios, \(M=10000\) fixed-size samples have been selected with equal inclusion probabilities and sampling fractions \(\mathrm {f}=0.01;0.05;0.1\), by means of the sabs and cube methods. The proposed algorithm has been compared with the cube method since the latter is the most widely used design for the selection of balanced samples, and therefore the most appropriate competitor. Moreover, in order to investigate some extreme cases, four small populations (\(U_{13}\) - \(U_{16}\)) of size \(N=100\) have been generated according to the aforementioned distributions, with \(J=3;5;10\) independent auxiliary variables. Here also \(M=10000\) fixed-size samples have been selected, this time with sampling fractions \(\mathrm {f}=0.1;0.25\). Note that the sabs algorithm has been set to stop when the minimal balancing constraint of 0.1% is reached, with a maximum number of iterations equal to \(10\times N\).
The simulation focuses on two very important aspects in the evaluation of a balanced sampling design: the respect of the first-order inclusion probabilities and the respect of the balancing constraints. The former is investigated by means of the relative Root Mean Squared Error, defined as
\[
rRMSE_{\pi _{k}}=\sqrt{\frac{1}{N}\sum _{k\in U}\left( \frac{v_{k}/M-\pi _{k}}{\pi _{k}}\right) ^{2}},
\]
where \(v_{k}\) is the number of times the k-th unit is selected in M Monte Carlo replicates. The difference in the respect of the balancing constraints has been defined as:
with \(\mathbf {d}=\left\{ d_{1},...,d_{m},...,d_{M}\right\} \) and \(d_{m}=max_{j}\left( \frac{\hat{Z}_{j,m}-Z_{j}}{Z_{j}}\right) \).
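Under one plausible reading of these quantities (the exact form of \(rRMSE_{\pi_{k}}\) and of the CD summary over the M replicates is not fully specified here, so the helper names and the mean-based summary are assumptions), the evaluation metrics could be coded as:

```python
import numpy as np

def rrmse_pi(v, M, pi):
    """Relative RMSE of the empirical inclusion probabilities v_k / M
    with respect to the target pi_k (assumed form of the index)."""
    v, pi = np.asarray(v, float), np.asarray(pi, float)
    return float(np.sqrt(np.mean(((v / M - pi) / pi) ** 2)))

def max_rel_imbalance(Z_hat, Z):
    """d_m = max_j (Zhat_{j,m} - Z_j) / Z_j for the m-th selected sample."""
    return float(np.max((np.asarray(Z_hat, float) - Z) / Z))

def cd(d):
    """Summary of the d_m values over the M replicates; the Monte Carlo
    mean is used here as one plausible choice."""
    return float(np.mean(d))
```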
Moreover, a comparison in terms of computational time has been performed. In particular, as an example, we report the elapsed time needed to select one sample from the populations generated by \(\mathcal {U}\left( 0,1\right) \).
4.1 Respect of the inclusion probabilities and of the balancing constraints
The most relevant results of the Monte Carlo experiments on \(U_{1}\) - \(U_{4}\) and \(U_{9}\) - \(U_{12}\) are reported in Tables 2 and 3 and are graphically summarized in Figs. 1, 2, 3 and 4, while remaining simulation results are reported in the Supplementary Material.
The simulation results motivate the following comments.
With respect to the inclusion probabilities, the cube method showed better performance than the sabs method. Furthermore, the performance difference between the two methods decreased when the sampling fraction increased. Note that the largest sampling fraction was considered equal to 10% of the population size. This behaviour was detected on all of the considered populations.
With respect to the balancing constraints, the performance of sabs was very encouraging. The method showed errors very close to zero for all three population sizes and for the smaller sampling fractions considered. The differences in performance between the sabs and cube methods increased with population size (see results reported in the Supplementary Material).
The considerations made up until this point were valid for all of the considered distributions used to generate the populations presented so far.
With regard to the populations of size equal to 100 units, some simulation results are reported in Table 4, while the extended set-up is reported in the Supplementary Material. Sampling fractions have been increased compared to the previous examples in order to allow balancing on the same number of constraints.
The latter populations represent a very particular case due to the very small population sizes, which in turn imply very small sample sizes. Nevertheless, the behaviour of the sabs method is confirmed. The cube method again performed best in respecting the inclusion probabilities, although the two methods become very similar as the sampling fraction grows. It should be noted that with the bimodal distribution, the \(rRMSE_{\pi _{k}}\) is practically identical for both methods. With respect to the balancing constraints, sabs was again confirmed as the best method for every distribution and every sampling fraction considered.
4.2 Time of computation
The computing time needed for the selection of samples for a given sampling method could be an important factor in both surveys and simulations, especially if it is highly dependent on N and n. Table 5 reports the average CPU time in seconds taken by each of the algorithms to select a sample for different populations and sample sizes. Simulations were performed on a 2.3 GHz Dual-Core Intel Core i5. Samples were selected using R software. In particular, the sabs algorithm was implemented in C++ via Rcpp; the cube method was implemented by means of the BalancedSampling package (Grafström and Lisic 2016), which is a fast implementation in C++ via Rcpp, and by means of the sampling package (Tillé and Matei 2009).
The sabs algorithm was the least computationally intensive method, being considerably quicker than both implementations of the cube method. Its computing time increased gradually with n and only proportionally to N, mainly because the number of attempts per iteration was set equal to N. Thus, we can be confident that the sabs algorithm can be applied without extensive difficulties to large population sizes and without any storage or RAM limitations other than those represented by the size of the frame from which the sample is selected.
5 An experiment on real data
Real data was used in order to investigate the efficiency of the sampling design when estimating a target variable in a real context. The dataset we used is freely available from www.statbel.fgov.be and concerns fiscal statistics on income subject to personal income tax. Data refer to the 581 Belgian municipalities for the period 2005–2018. Focusing on the last available year, we aim to estimate the total Belgian net income (Y) by exploiting the three following auxiliary variables: the number of declarations with a total net taxable income greater than zero (\(z_{1}\)), the number of declarations where the amount of total taxes is greater than zero (\(z_{2}\)), and the number of residents per municipality (\(z_{3}\)). We selected \(M=10000\) fixed-size samples with equal inclusion probabilities, for sampling fractions \(\mathrm {f}=0.01;0.05;0.1\), by means of the sabs algorithm, the cube method and simple random sampling without replacement (srswor). The latter is a traditional benchmark in survey sampling. Besides the computation of \(rRMSE_{\pi _{k}}\) and CD, the population total of Y was estimated by the expansion estimator \(\hat{Y}\). We computed the relative Root Mean Square Error of the estimator, defined as \(rRMSE_{\hat{Y}}=\frac{\sqrt{\frac{\sum _{m=1}^{M}\left( \hat{Y}_{m}-Y\right) ^{2}}{M}}}{Y}\). Table 6 shows the results of \(rRMSE_{\pi _{k}}\), CD, and \(rRMSE_{\hat{Y}}\), obtained by selecting samples with the three sampling methods considered.
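The accuracy measure for the estimator is straightforward to compute from the M Monte Carlo estimates; a minimal sketch:

```python
import numpy as np

def rrmse_total(Y_hat, Y):
    """Relative root mean squared error of M Monte Carlo estimates
    Y_hat of the known population total Y."""
    Y_hat = np.asarray(Y_hat, dtype=float)
    return float(np.sqrt(np.mean((Y_hat - Y) ** 2)) / Y)
```

For example, `rrmse_total([12.0, 8.0], 10.0)` returns 0.2: both estimates miss the total of 10 by 2, i.e. by 20% in relative terms.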
The results of \(rRMSE_{\pi _{k}}\) and CD were in line with those of the previous simulations. Indeed, the sabs algorithm demonstrated greater efficiency in respecting the balancing constraints, while performing worse than srswor and the cube method in respecting the inclusion probabilities. Clearly, srswor is markedly superior to the balanced sampling methods in this respect, but this advantage is outweighed by its marked inferiority in producing efficient estimates of the target variable. In fact, the ability of the sabs and cube methods to efficiently estimate Y is evident, especially for the former. Hence, the efficiency of the proposed algorithm has been shown once again, together with its usability in practical contexts.
6 Discussion and conclusions
In the present paper, a new method for selecting balanced samples by means of a simulated annealing-based algorithm has been proposed. By interpreting balanced sampling as an optimization problem, we showed that the sabs algorithm allows for a quicker selection of well-balanced samples on a wider set of auxiliary variables, without neglecting a rigorous respect of the inclusion probabilities or strong efficiency in estimation. An in-depth investigation of both aspects is necessary in order to prove the employability of a sampling method in practical situations. In fact, a good respect of the balancing constraints results in a possible reduction of the estimation variance, while the respect of the inclusion probabilities means a reduction of the estimation bias. Hence, a crucial issue is finding a sampling method capable of combining these two desirable properties. In the present paper, we showed through extensive simulation experiments, on both simulated and real data, that the sabs algorithm achieves this goal, making it a valid and efficient alternative to the well-known cube method.
References
Aarts EH, van Laarhoven PJ (1987) Simulated annealing: a pedestrian review of the theory and some applications. In: Pattern recognition theory and applications. Springer, pp 179–192
Ardilly P (1991) Échantillonnage représentatif optimum à probabilités inégales. Annales d’Economie et de Statistique 91–113
Benedetti R, Piersimoni F (2017) A spatially balanced design with probability function proportional to the within sample distance. Biom J 59(5):1067–1084
Besag J (1986) On the statistical analysis of dirty pictures. J R Stat Soc Ser B 48(3):259–279
Brus DJ (2015) Balanced sampling: a versatile sampling approach for statistical soil surveys. Geoderma 253:111–121
Chauvet G (2017) A comparison of pivotal sampling and unequal probability sampling with replacement. Stat Prob Lett 121:1–5
Chauvet G, Le Gleut R (2019) Inference under pivotal sampling: properties, variance estimation, and application to tesselation for spatial sampling. Scand J Stat 48:108
Chauvet G, Tillé Y (2006) A fast algorithm for balanced sampling. Comput Stat 21(1):53–62
Deville J-C, Tillé Y (1998) Unequal probability sampling without replacement through a splitting method. Biometrika 85(1):89–101
Deville J-C, Tillé Y (2004) Efficient balanced sampling: the cube method. Biometrika 91(4):893–912
Falorsi PD, Righi P (2008) A balanced sampling approach for multi-way stratification designs for small area estimation. Survey Methodol 34(2):223–234
Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell 6:721–741
Gini C (1928) Une application de la methode representative aux materiaux du dernier recensement de la population italienne (ler decembre 1921). Bull Int Stat Inst 23(2):198–215
Grafström A, Lisic J (2016) BalancedSampling: balanced and spatially balanced sampling. R package version 1.2
Grafström A, Tillé Y (2013) Doubly balanced spatial sampling with spreading and restitution of auxiliary totals. Environmetrics 24(2):120–131
Hájek J (1981) Sampling from a finite population. Marcel Dekker, New York
Horvitz DG, Thompson DJ (1952) A generalization of sampling without replacement from a finite universe. J Am Stat Ass 47(260):663–685
Kermorvant C, Damico F, Bru N, Caill-Milly N, Robertson B (2019) Spatially balanced sampling designs for environmental surveys. Environ Monit Assess 191(8):524
Kirkpatrick S, Gelatt CD, Vecchi MP et al (1983) Optimization by simulated annealing. Science 220(4598):671–680
Marazzi A, Tillé Y (2017) Using past experience to optimize audit sampling design. Rev Quant Financ Account 49(2):435–462
Narain R (1951) On sampling without replacement with varying probabilities. J Indian Soc Agric Stat 3:169–174
Robert C, Casella G (2013) Monte Carlo statistical methods. Springer Science & Business Media, Berlin
Thionnet P (1953) La théorie des sondages. INSEE, Imprimerie Nationale
Tillé Y (2006) Sampling algorithms. Springer-Verlag, New York
Tillé Y (2011) Ten years of balanced sampling with the cube method: an appraisal. Surv Methodol 37(2):215–226
Tillé Y, Dickson MM, Espa G, Giuliani D (2018) Measuring the spatial balance of a sample: a new measure based on Moran's I index. Spat Stat 23:182–192
Tillé Y, Matei A (2009) sampling: survey sampling. R package version 2
Valliant R, Dorfman AH, Royall RM (2000) Finite population sampling and inference: a prediction approach. Wiley, New York
Yates F (1946) A review of recent statistical developments in sampling and sampling surveys. J R Stat Soc 109(1):12–43
Funding
Open access funding provided by Università degli Studi di Trento within the CRUI-CARE Agreement.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Benedetti, R., Dickson, M.M., Espa, G. et al. A simulated annealing-based algorithm for selecting balanced samples. Comput Stat 37, 491–505 (2022). https://doi.org/10.1007/s00180-021-01113-3