
Computers in Human Behavior 27 (2011) 1468–1474


Missing data imputation in multivariate data by evolutionary algorithms


Juan C. Figueroa García a,*,1, Dusko Kalenatic b, Cesar Amilcar Lopez Bello a,2

a Universidad Distrital Francisco José de Caldas, Bogotá, Colombia
b Universidad de la Sabana, Chía, Colombia

Article history:
Available online 4 November 2010

Keywords:
Missing data
Evolutionary optimization
Multivariate analysis
Multiple data imputation

Abstract

This paper presents a proposal based on an evolutionary algorithm to impute missing observations in multivariate data. A genetic algorithm based on the minimization of an error function derived from the covariance matrix and vector of means is presented. All methodological aspects of the genetic structure are presented, and an extended explanation of the design of the fitness function is provided. An application example is solved by the proposed method.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction and motivation

During the last 50 years, multivariate data have been considered interesting study cases due to their complexity and size. Such data often contain missing observations, and this multiple data loss is an important problem for analysts who must work under these conditions.

Genetic algorithms are powerful tools for finding solutions to complex problems where the search space is too large and the problem becomes NP-hard. Many applications to optimization and scheduling problems have been reported by different authors. For this reason, we select these techniques, since they are appropriate methods for finding good solutions to problems that are not solved efficiently by classical methods. Their flexibility and computing efficiency make them suitable for missing data imputation problems.

An evolutionary algorithm for imputing missing multivariate data is designed, since it is an efficient computational intelligence tool which provides a fast exploration of the search space. To do so, a multi-criteria fitness function is extracted from the covariance and central tendency measures of the data.

The scope of this work is to impute missing observations in an incomplete multivariate matrix by using an evolutionary structure, without loss of generality: the original properties of the available data do not change when the solution is imputed. An application example is solved by using the proposal.

1.1. State of the art

Classical works on statistical estimation should be mentioned first: Dempster, Laird, and Rubin (1977) designed the EM algorithm, which is useful for estimating missing observations in multivariate data by using correlated variables. Gaetan and Yao (2003) propose a Metropolis version of the EM algorithm based on a simulated annealing approach to improve its efficiency, Celeux and Diebolt (1993) propose a probabilistic version of the EM algorithm, and Nielsen (2000) proposes a stochastic version of the EM algorithm. Levine and Casella (2000) use Monte Carlo methods to implement the EM algorithm. Arnold (2003) estimates the parameters of a state-dependent AR model by using the EM algorithm, and Parveen and Green (2004) used autoregressive spectral densities to estimate missing data.

For further information about missing data estimation see Timm (2002), Agresti (2002), Box (1949), Anderson (1984), Krishnaia and Lee (1980), and González, Rueda, and Arcos (2008).

Some computational intelligence works have been applied to the missing data problem: Figueroa, Kalenatic, and López (2008, 2010) use evolutionary algorithms to impute missing observations in a time series context. Abdella and Marwala (2005) use neural networks trained by genetic algorithms to impute missing observations in databases. Eklund (2006) uses genetic algorithms to estimate confidence intervals of missing data. Siripitayananon, Hui-Chuan, and Kang-Ren (2002) use neural networks to impute missing observations in spatial data, and Parveen and Green (2004) use neural networks to improve speech enhancement with missing observations. Genetic algorithms, however, are not widely used in the multiple missing observations case; in fact, the multivariate data case is not covered by this method.

* Corresponding author.
E-mail addresses: jcfigueroag@udistrital.edu.co (J.C. Figueroa), duskokalenatic@yahoo.com (D. Kalenatic), clopezb@udistrital.edu.co (C.A. Lopez).
1 Laboratory for Automation, Microelectronics and Computational Intelligence (LAMIC).
2 Mathematical Modelling Applied to Industry (MMAI) research group.


2. Covariance, determinant and central tendency measures

In a multivariate data context, two important statistics are the mean and covariance functions. First, some definitions useful for constructing a fitness function are given below.

Definition 1. Multivariate data is a matrix {X} of observations x_ij, each one recorded for a specific individual i, where j indexes the variables collected.

The sample covariance and sample correlation of the data are useful statistics for estimating linear models. They are linear distances between two points {x_.j} and {x_.k}. In this way, we define them as follows.

Definition 2. The sample covariance matrix Ŝ (also known as Cov(X, Y)) is:

\hat{S} = \frac{1}{n-1}\,\tilde{X}'\tilde{X} \qquad (1)

where X̃ is the centered data matrix, i = 1, 2, ..., n, j = 1, 2, ..., m. X̃ can be decomposed as:

\tilde{X} = X - \frac{1}{n}\,\mathbf{1}\mathbf{1}'X \qquad (2)

and 1 is an (n × 1) vector of ones.

Definition 3. The sample correlation matrix R̂ is:

\hat{R} = D^{-1/2}\, S \, D^{-1/2} \qquad (3)

where D is the diagonal matrix of the variances of each variable.

Now, the determinant of S, Det(S), is a useful measure of the size of S. Det(S) is defined as follows:

\mathrm{Det}(S) = \sum (-1)^{r}\, s_{1 i_1}\, s_{2 i_2} \cdots s_{m i_m} \qquad (4)

where the sum runs over all permutations (i_1, ..., i_m) of (1, ..., m) and r is the number of inversions of the permutation.

Now, two matrices A1 and A2 are similar if their covariance matrices are similar, that is, S_A1 ≈ S_A2. Therefore, A1 and A2 are similar if the determinants of their covariance matrices are also similar, that is, Det(S_A1) ≈ Det(S_A2). For more information see Harville (2005), Peña (2000), Timm (2002), Agresti (2002).

As always, these definitions rest on a strong supposition about the data: "the dataset does not contain missing observations". When some observations are lost, it is not possible to obtain sufficient statistics, causing bias with respect to the real properties of the data and misrepresentation problems.

Now, if a multivariate matrix has multiple missing observations, then most classical methods cannot be used, forcing us to use another strategy for imputing data. At first glance, any imputation method should preserve the statistical properties of the available data, as defined in (1), (2) and (4), so we use these measures to design a genetic algorithm for imputing missing data.

3. Genetic structure

The selected strategy to find missing data in a multi-criterion context is an evolutionary algorithm based on a genetic structure programmed in MatLab®. There is a vast amount of bibliography on evolutionary optimization, but we focus on the design of a fitness function based on the available complete data and their statistical properties.

3.1. Fitness function operator

Suppose that a multivariate data matrix [X] has k missing observations, located by an index matrix [V] with elements v_ij = 1 if x_ij is missing in [X]. In this way, [Y] is defined as a matrix of imputed data on the locations given by V, where y_ij = 0 when v_ij = 0 and y_ij takes a value when v_ij = 1. By using these two matrices it is possible to complete the data, obtaining a matrix X̂:

X + Y = \hat{X} \qquad (5)

Now, the principal issue is to find a matrix [Y] which does not change the properties of the available individuals. To that effect, both the covariance matrix and the mean vector of the available individuals are used in the fitness function.

Remark 1. If [X] has missing data, then it is not possible to get Ŝ, so we use a subset of [X] called [X_a] which is composed of all its complete individuals, that is:

\hat{S}_a = \frac{1}{N_a - 1}\,\tilde{X}_a'\tilde{X}_a \qquad (6)

where a denotes the available data and N_a is the number of complete row vectors. A complete row vector has no incomplete variables (see Section 4 and X_s for examples).

Remark 2. The vector of means [X̄^a] is obtained by removing all missing data from each variable. Each element x̄_j^a of [X̄^a] is the mean of the jth variable computed from the available data per variable. It is defined as follows:

\bar{x}_j^{a} = \sum_{i=1}^{n_{aj}} \frac{x_{ij}}{n_{aj}} \qquad (7)

where n_aj is the number of available values for the jth variable.

Thus, the main goal is to find an evolutionary estimate of the k missing observations which does not change the Ŝ and X̄ statistics. Now the fitness function F can be expressed as:

F = \sum_{i=1}^{m}\sum_{j=1}^{m} \left| \hat{s}_{ij}^{a} - \hat{s}_{ij} \right| + \sum_{j=1}^{m} \left| \bar{x}_j^{a} - \bar{x}_j \right| \qquad (8)

here, ŝ_ij^a and ŝ_ij are the (i, j) elements of Ŝ_a and Ŝ respectively, and x̄_j is the mean of the jth variable after imputation. Therefore, the principal objective of the genetic structure is to minimize F, that is, the absolute difference between the covariance matrices and between the vectors of means.

3.2. Individuals

An individual is defined as a vector of a population indexed in a matrix of solutions, where each one is a solution of the problem. As always, a genetic structure contains many individuals in a population.

Each individual represents a complete vector of missing observations which will be located in X by using [V]. Thus, each individual has as many elements (genes) as there are missing observations, indexed by v_ij.

3.3. Population size and number of generations

The population size is defined by two values: the k missing observations and a pre-selected number of individuals l, creating a matrix of size l × k labeled P^g_{l,k}, where g denotes the generation index. Usually, a very small l does not cover the search space adequately; we observed that a size of l = 100 got better results than a larger size, given its computational effort.

The maximum number of generations G is a genetic operator used as a stopping criterion; henceforth we will use this value as a controller of the iterations of the algorithm, that is, max g = G.
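To make eqs. (5)–(8) concrete, the following sketch evaluates the fitness of one candidate individual. It is a minimal Python/NumPy rendering for illustration, not the authors' MatLab® code, and the names `available_stats` and `fitness` are ours:

```python
import numpy as np

def available_stats(X, V):
    """S_a from the complete rows only (Remark 1, eq. (6)) and the
    per-variable means over available entries (Remark 2, eq. (7)).
    X holds the data; V flags missing cells with 1."""
    complete = (V == 0).all(axis=1)
    S_a = np.cov(X[complete], rowvar=False)        # eq. (6), divisor N_a - 1
    xbar_a = np.array([X[V[:, j] == 0, j].mean()   # eq. (7)
                       for j in range(X.shape[1])])
    return S_a, xbar_a

def fitness(X, V, y, S_a, xbar_a):
    """F of eq. (8): place the candidate vector y at the k flagged
    positions (eq. (5)) and compare the covariance matrix and mean
    vector of the completed data against the available statistics."""
    X_hat = X.copy()
    X_hat[V == 1] = y                # X + Y = X_hat; genes in row-major order
    S = np.cov(X_hat, rowvar=False)  # eq. (1)
    xbar = X_hat.mean(axis=0)
    return np.abs(S_a - S).sum() + np.abs(xbar_a - xbar).sum()
```

Here each individual y is simply the vector of the k missing cells flattened in row-major order, matching the indexing by v_ij of Section 3.2.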

3.4. Population random generator

The selected random generator is called R_j. The original pdf of each variable without missing data can be used as a random number generator, but in practice a uniform generator exhibits better results. Its mathematical representation is:

R_j(a, b) = a + r_j\,(b - a)\, I_{[0,1]}(r_j) \qquad (9)

here, a is the minimum value, b is the maximum value, and r_j is a random number enclosed in the interval I_[0,1].

Law and Kelton (2000) and Devroye (1986) give an important discussion of pseudo-random number generation, showing the uniform generator to be a proper method for simulating variables. Therefore it is possible to use it instead of the sample distribution.

3.5. Mutation and crossover operators

The genetic structure is designed on the basis of an elite-based algorithm which preserves a set of best individuals for the next generation. In this way, the proposed algorithm is as follows:

- First, select the c best individuals from each generation g in P^g_{l,k}, ordered by F, and preserve them as the first individuals of P^{g+1}_{l,k}.
- The mutation strategy uses a parameter c1, a selection from the better individuals in the current population g, and its description is given below:
  1. Select a random position of the first individual in P^g_{l,k} ordered by F.
  2. Replace the selected position with a new value by using (9).
  3. Repeat (2.) for the c1 better individuals of each population P^g_{l,k}.
- The crossover strategy uses a parameter c2, a selection from the better individuals in the current population g, and its description is given below:
  1. Select the c2 first individuals in the population P^g_{l,k} ordered by F.
  2. Generate a new individual by replacing all even genes with the respective genes of the next individual.
  3. Generate a new individual by replacing all odd genes with the respective genes of the next individual.
  4. Repeat (3.) for the c2 better individuals of each population P^g_{l,k}.

To complete the population, an additional set of individuals is generated by replacing the worst individuals with new ones, attempting an exhaustive search. First, the best four individuals are preserved for the next generation, and later a complementary population of size (l - c - c1 - c2) × k is obtained by using (9).

3.6. Finalization and stopping strategy

Two stopping criteria are used: the first is a maximum number of iterations G, that is, g → G; the second stops the algorithm when F shows no significant improvement over a specific number of iterations. Finally, the best individual is selected by F and imputed in the original series, obtaining a complete dataset.

4. Implementation in a study case

We use a multivariate matrix which collects 6 variables on 311 students from a university in Bogotá, Colombia. 199 missing observations are present in different individuals and different variables, so it is an interesting study case.

All missing observations are due to errors in measurement devices or unavailable information per student, so inference about the population is not possible without a previous imputation of those missing observations.

First, the sample covariance matrix [Ŝ_a] and its determinant Det(Ŝ_a) of the available data are computed. The vector of means [X̄^a] and other useful statistics are shown in Table 1.

Ŝ_a =
[  5.260    1.463    7.800    7.314   0.338   0.018 ]
[  1.463  104.012   15.602    5.894   1.232   3.662 ]
[  7.800   15.602   24.389   12.024   1.302   0.593 ]
[  7.314    5.894   12.024   19.208   0.661   0.666 ]
[  0.338    1.232    1.302    0.661   3.465   0.214 ]
[  0.018    3.662    0.593    0.666   0.214   1.693 ]

Det(Ŝ_a) = 268650

As an example, the following matrix [X_s] shows a subset of the matrix [X] with multiple missing observations for the same individual in different variables (· denotes a missing entry):

X_s =
[ 38.09   ·    3   40     3.75216  3 ]
[ 36.38  47    ·   37.86  ·        · ]
[  ·     50    ·   40.71  ·        · ]
[  ·     49    ·   31.43  0.39742  · ]
[  ·     46    ·   37.5   2.45135  0 ]
[  ·     49    ·   32.14  2.62465  3 ]
[ 35.43  46    ·   34.17  4.61745  0 ]
[ 36.86  51    ·   38.57  0.39743  0 ]
[ 40.5   50    0   40     0.60558  0 ]
[ 35.32  47    4   30.83  6.1293   0 ]
[ 39     40    ·    ·     0.60558  0 ]
[ 36.14  44    ·    ·     0.60558  0 ]
[ 34.52  42    ·    ·     0.60558  2 ]
[ 36      ·    ·   30.71  2.61408  4 ]
[ 35.39   ·    ·   35     0.6609   0 ]
[ 36.59   ·    5   34     0.79742  0 ]
[ 33.24   ·    8   32.14  4.8522   4 ]
[ 35.34   ·    5   37.86  1.22425  0 ]
[ 35.29  35   10   32.14  2.698    0 ]

This subset has no complete covariates for imputing any variable, since [X] has multiple missing observations in both different individuals and different variables. For instance, only the ninth, tenth and nineteenth row vectors are complete; the remaining ones have incomplete elements in different positions.

Hence, the EM algorithm will replace some values with the expectation of the available data, even for discrete data, which is an incorrect treatment of the problem. It leads to differences between the original and estimated covariances.

Now, three methods are used to solve the problem: the EM algorithm, a regression approach, and the proposed genetic approach. All of them are shown next.

Table 1
Observed statistics of the available data.

Stat.                      x_1^a     x_2^a    x_3^a   x_4^a    x_5^a     x_6^a
a_j (missing values)       27        31       29      37       44        31
Mean (x̄_j^a)               35.974    44.132   6.510   35.514   2.365     1.564
min(x_j^a)                 31.07     14       0       15       0.0505    0
max(x_j^a)                 43.44     64       25      47.5     12.13423  4
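The generation update of Sections 3.4 and 3.5 can be sketched as follows; this is again an illustrative Python rendering rather than the original MatLab® implementation, and only the even-gene crossover variant is written out, the odd-gene variant being symmetric:

```python
import numpy as np

rng = np.random.default_rng(0)

def next_generation(P, fit, a, b, c=4, c1=4, c2=4):
    """One elite-based update (Section 3.5) of the l-by-k population P.
    fit maps an individual to F; a and b are per-gene bounds, i.e.
    min(x_j^a) and max(x_j^a) of the variable each gene imputes,
    used by the uniform generator of eq. (9)."""
    l, k = P.shape
    P = P[np.argsort([fit(y) for y in P])]    # rank by F (minimization)
    elite = P[:c].copy()                      # the c best survive unchanged
    mutated = P[:c1].copy()                   # mutation: one random gene is
    for y in mutated:                         # redrawn through eq. (9)
        j = rng.integers(k)
        y[j] = a[j] + rng.random() * (b[j] - a[j])
    crossed = P[:c2].copy()                   # crossover: even genes copied
    for i in range(c2):                       # from the next-ranked individual
        crossed[i, 0::2] = P[i + 1, 0::2]
    fresh = a + rng.random((l - c - c1 - c2, k)) * (b - a)  # complement, eq. (9)
    return np.vstack([elite, mutated, crossed, fresh])
```

With the settings reported in Section 4.3 (l = 100, c = c1 = c2 = 4, k = 199, G = 5000), such a routine would be iterated until g = G or until F stops improving, as described in Section 3.6.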

4.1. EM estimation

Most optimal imputation algorithms, such as the EM (Expectation-Maximization) algorithm, are based on conditional expectations of a random variable, obtained from a set of auxiliary variables which give an estimate of the behavior of the missing data. The principal objective is to maximize a likelihood or log-likelihood function of the sample pdf, obtaining an optimal estimation of the missing observations. The most popular is the EM algorithm given by Dempster et al. (1977), which is described as follows.

Definition 4. Suppose that we are given a parametric family of distributions P_θ and the observable vector Y is part of a so-called complete vector X = (Y, Z). Both X and Y have probability density functions, f(x; θ) and g(y; θ) say, respectively. Here, θ is a parameter belonging to some subset S of the Euclidean space R^p. Let y be the observed data. The objective is to compute the maximum likelihood estimator θ̂ = arg max_{θ∈S} g(y; θ). The EM algorithm maximizes g(y; θ) by iterating, at each stage k, two steps: the E-step and the M-step.

E-step. Given a current estimate θ_{k-1}, compute the conditional expectation function:

S(\theta, \theta_{k-1}) = E_{\theta_{k-1}}\{\log f(X; \theta) \mid Y = y\} \qquad (10)

The averaging density will be denoted by h(·| y, θ_{k-1}).

M-step. Choose θ = θ_k to maximize S(θ, θ_{k-1}).

Its popularity is large because it has a monotonic behavior: the likelihood always increases. This condition is also guaranteed by its generalized version, called GEM, where θ_k is any value that satisfies:

S(\theta_k, \theta_{k-1}) \geq S(\theta_{k-1}, \theta_{k-1}) \qquad (11)

Remark 3. Note that the EM algorithm needs at least one complete auxiliary variable to compute conditional expectations. If such a vector is not available, the EM algorithm will replace all missing values with S(θ, θ_{k-1}) = E_{θ_{k-1}}{log f(X; θ) | X = y} = E_{θ_{k-1}}(Y). This is a strong restriction for multiple missing data, because many cases have multiple missing observations in different variables.

When there are no complete covariates for any set in [X], the EM algorithm is not optimal; that is, there is no complete set of covariates with which to estimate all missing observations.

Another disadvantage of the EM algorithm is that it is based on a strong normality supposition. In the above matrix it is easy to see that there are some categorical and discrete variables, so the GEM version looks like a good option, but it has important convergence problems when working with discrete variables.

After the implementation of the EM algorithm, we obtain an estimate of all missing observations, so we compute the covariance matrix after imputation Ŝ_EM, its determinant Det(Ŝ_EM) and its vector of means x̂^EM, as presented below:

Ŝ_EM =
[ 4.363   1.159   6.704   6.358  0.312  0.124 ]
[ 1.159  90.313  12.496   4.952  0.816  3.905 ]
[ 6.704  12.496  20.373  10.37   1.179  0.323 ]
[ 6.358   4.952  10.37   17.559  0.589  0.813 ]
[ 0.312   0.816   1.179   0.589  3.041  0.272 ]
[ 0.124   3.905   0.323   0.813  0.272  1.536 ]

Det(Ŝ_EM) = 105948.14
x̂^EM = [ 36.048  44.089  6.436  35.542  2.378  1.563 ]

The results of the fitness function F are:

F = |5.260 - 4.363| + ... + |1.693 - 1.536| + |35.974 - 36.048| + ... + |1.564 - 1.563| = 40.71 \qquad (12)

As in (13) and (14), the first group of terms in (12) is the first sum in (8) and the second group is the second sum in (8). Here, (8) is evaluated with the EM estimate.

4.2. Auxiliary regressions

Auxiliary regression is another approach, which uses a set of complete covariates to estimate missing observations in a specific set. In the multiple missing data case this technique has several disadvantages, because there are no complete covariates with which to estimate missing data; in fact, all variables have missing observations.

As in the EM method, the regression method cannot estimate missing data from an incomplete covariate set. Taking up the above matrix again, it is impossible to find a complete regressor because all variables have missing data.

For these reasons, the only way is to use the available data to construct a regressor using only complete row vectors. In this way, a reduced dataset is used to get an estimator of the missing observations, so only the mean and variance of the available data are preserved, with no guarantee of preserving the covariance structure.

Even so, we applied auxiliary regressions, with the following results:

Ŝ_AR =
[ 4.640   0.772   6.976   6.589  0.295  0.211 ]
[ 0.772  96.298  11.343   4.123  1.595  3.921 ]
[ 6.976  11.343  21.952  10.45   0.828  0.127 ]
[ 6.589   4.123  10.45   18.707  0.580  0.977 ]
[ 0.295   1.595   0.828   0.580  3.548  0.340 ]
[ 0.211   3.921   0.127   0.977  0.340  1.715 ]

Det(Ŝ_AR) = 217135.84
x̂^AR = [ 36.092  44.179  6.367  35.598  2.428  1.543 ]

The results of the fitness function F are:

F = |5.260 - 4.640| + ... + |1.693 - 1.715| + |35.974 - 36.092| + ... + |1.564 - 1.543| = 36.655 \qquad (13)

4.3. Evolutionary results

Now, the algorithm proposed in the above section is implemented by using (7)–(9). The crossover, mutation and remaining parameters are: l = 100, a = min(x_j^a), b = max(x_j^a), c = 4, c1 = 4, c2 = 4, k = 199 and G = 5000.

The complete solution is displayed in the Appendix, and its properties are presented next.

Ŝ_GA =
[ 5.7135   1.463    7.799    6.6329  0.3376  0.035  ]
[ 1.463  104.024   15.6012   5.893   1.231   3.661  ]
[ 7.799   15.6012  24.3978  11.485   1.301   0.593  ]
[ 6.6329   5.893   11.485   19.5128  0.7073  0.653  ]
[ 0.3376   1.231    1.301    0.7073  3.4678  0.2891 ]
[ 0.035    3.661    0.5931   0.653   0.2891  1.7187 ]

Det(Ŝ_GA) = 262770
x̂^GA = [ 36.1405  43.7138  6.5884  35.4913  2.4765  1.6495 ]

Four population sizes, l = 50, 100, 500 and 1000, were implemented. By using l = 100, both the computing time and the quality of the solutions are improved. These results allow us to view the evolutionary strategy as an appropriate way to impute missing data, since it does not change the original properties of the data. In the Appendix, the complete evolutionary solution indexed by each position (i, j) is shown.

The results of the fitness function F are:

F = |5.260 - 5.7135| + ... + |1.693 - 1.7187| + |35.974 - 36.14| + ... + |1.564 - 1.6495| = 2.73 \qquad (14)
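To make the contrast with the baselines concrete, the sketch below shows the normal-theory conditional expectation on which EM-type imputation rests, together with its degenerate case of Remark 3. It is an illustration under an assumed multivariate normal model, not the exact routine used for the results above:

```python
import numpy as np

def conditional_mean_impute(x, mu, S):
    """Fill one row x (NaN marks a missing entry) with the conditional
    expectation E[x_m | x_o] = mu_m + S_mo S_oo^{-1} (x_o - mu_o).
    With no observed covariates left, the fill collapses to mu_m,
    i.e. plain mean imputation (the degenerate case of Remark 3)."""
    miss = np.isnan(x)
    filled = x.copy()
    if not miss.any():
        return filled
    if miss.all():
        filled[:] = mu                 # no information: impute the mean
        return filled
    obs = ~miss
    adjust = np.linalg.solve(S[np.ix_(obs, obs)], x[obs] - mu[obs])
    filled[miss] = mu[miss] + S[np.ix_(miss, obs)] @ adjust
    return filled
```

Note that such a fill is always continuous, which is why the discrete variables of the study case are imputed incorrectly by the EM and regression approaches.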

5. Output analysis: multivariate tests on means and variances

Some tests are applied to verify statistical differences between the original and the imputed data. Table 1 contains some additional statistics useful for constructing F. The hypotheses about means and variances considered in Section 4 are shown in Table 2.

Table 2
Hypotheses on means and variances.

         Test on means        Test on variances
H0:      μ_j = μ_j^a          Ŝ = Ŝ_a
Ha:      μ_j ≠ μ_j^a          Ŝ ≠ Ŝ_a

Under the first hypothesis, a multiple comparison test is made for each variable in each group. Under the second hypothesis, a likelihood-ratio based test (see Anderson (1984), Timm (2002), Box (1949)) is used to contrast the equality of variances. This test uses a statistic which is defined next.

S = \frac{v_1 \hat{S} + v_2 \hat{S}_a}{v} \qquad (15)

W = v \ln \mathrm{Det}(S) - \left[ v_1 \ln \mathrm{Det}(\hat{S}) + v_2 \ln \mathrm{Det}(\hat{S}_a) \right] \qquad (16)

where v_1 = n - 1, v_2 = n_a - 1, v = n + n_a - 2. Thus, q is defined as:

q = 1 - \frac{2m^2 + 3m - 1}{6(m+1)} \left( \frac{1}{v_1} + \frac{1}{v_2} - \frac{1}{v} \right) \qquad (17)

Under the approximation of Box (see Timm (2002), Box (1949)), the approximate distribution of the likelihood ratio is:

qW = -2q \ln \lambda \rightarrow \chi^2_{m(m+1)/2} \qquad (18)

In this case, H0 is accepted iff qW ≤ χ²_{m(m+1)/2} and rejected otherwise. In other words, there is no statistical evidence to reject H0 iff qW ≤ χ²_{m(m+1)/2}. Now, a comparison test is composed of the following expressions:

f = m(m+1)/2 \qquad (19)

q_0 = \frac{(m-1)(m+2)}{6} \left( \frac{1}{v_1^2} + \frac{1}{v_2^2} - \frac{1}{v^2} \right) \qquad (20)

f_0 = \frac{f + 2}{\left| q_0 - (1-q)^2 \right|} \qquad (21)

For q_0 - (1 - q)² > 0, the test statistic

F = W/a \rightarrow F_{f, f_0} \qquad (22)

is calculated, where a = f/[q - (f/f_0)]. If q_0 - (1 - q)² < 0, then

F = \frac{f_0 W}{f (b - W)} \rightarrow F_{f, f_0} \qquad (23)

where b = f_0/(q + 2/f_0). The hypothesis of equal variances is rejected if F > F_{1-α, f, f_0}, where F_{1-α, f, f_0} denotes the 1-α quantile of an F distribution with f and f_0 degrees of freedom. For further information see Krishnaia and Lee (1980).

5.1. Results of the evolutionary approach

The results of the multiple comparison tests are presented in Table 3. Here, the Levene test contrasts the equality of variances between the imputed and original groups, for each variable.

With this statistical evidence, the hypothesis of both equal individual means and equal variances is accepted with a confidence level of α = 0.05. The results of the log-likelihood test are as follows:

qW = 4.2775,  χ²_21 = 32.671
F = 0.20368,  F_{21, 520470} = 1.558

With this statistical evidence, the hypothesis of equal covariance matrices is accepted with a confidence level of α = 0.05.

The genetic algorithm obtains a good solution which preserves most of the original properties of [X]. All tests on means and variances show no statistical evidence to reject H0, confirming the quality of the solution.

5.2. Results of the EM algorithm and auxiliary regressions

Returning to the results of the EM algorithm and auxiliary regressions presented in Section 4, it is clear that these techniques do not obtain good results, since their covariance matrices Ŝ_EM and Ŝ_AR show evident differences from Ŝ_a.

In Table 4, some tests on means and variances to verify H0 are shown. The results of the covariance tests are: the EM algorithm obtains F = 21 and the regressions obtain F = 19.87. When these are compared with F_0 = F_{21, 520470} = 1.558, we reject the hypothesis that their covariance matrices are equal to Ŝ_a.

Although these methods obtain good results for the mean, their variances, covariances and correlations are not similar to the available values; moreover, all discrete missing observations were replaced by continuous data, which is an incorrect imputation.

In both cases, we have statistical evidence to reject the idea that these classical methods are good estimators of missing observations, since their variance and covariance structures are statistically different from those of the available data. In this way, the evolutionary imputation is a better choice than those methods.
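For reference, the covariance test of eqs. (15)–(18) can be reproduced in a few lines; this is a sketch using NumPy and SciPy, and the F approximation of eqs. (19)–(23) is analogous. With m = 6 variables the degrees of freedom are m(m + 1)/2 = 21, which is why the quantile χ²_21 = 32.671 appears in Section 5.1:

```python
import numpy as np
from scipy.stats import chi2

def box_covariance_test(S1, S2, n1, n2, alpha=0.05):
    """Likelihood-ratio test of equal covariance matrices following
    eqs. (15)-(18): pooled matrix S, statistic W, correction q, and
    qW compared against the chi-square quantile with m(m+1)/2 d.f."""
    m = S1.shape[0]
    v1, v2 = n1 - 1, n2 - 1
    v = v1 + v2
    S = (v1 * S1 + v2 * S2) / v                                # eq. (15)
    logdet = lambda A: np.linalg.slogdet(A)[1]
    W = v * logdet(S) - (v1 * logdet(S1) + v2 * logdet(S2))    # eq. (16)
    q = 1 - (2 * m**2 + 3 * m - 1) / (6 * (m + 1)) * (
        1 / v1 + 1 / v2 - 1 / v)                               # eq. (17)
    crit = chi2.ppf(1 - alpha, m * (m + 1) // 2)               # eq. (18)
    return q * W, crit, q * W <= crit                          # True: keep H0
```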

Table 3
Comparison tests for the evolutionary approach (significance levels).

Test Variable 1 Variable 2 Variable 3 Variable 4 Variable 5 Variable 6


Welch 0.371 0.614 0.844 0.948 0.476 0.428
Brown–Forsythe 0.371 0.614 0.844 0.948 0.476 0.428
T-test 0.446 0.646 0.869 0.633 0.628 0.541
Levene 0.244 0.861 0.769 0.631 0.748 0.924

Table 4
Comparison tests for the EM algorithm and auxiliary regressions (significance levels).

Test Variable 1 Variable 2 Variable 3 Variable 4 Variable 5 Variable 6


Results of the EM algorithm
Welch 0.670 0.958 0.844 0.939 0.929 0.994
Brown–Forsythe 0.670 0.958 0.844 0.939 0.929 0.994
T-test 0.669 0.958 0.844 0.939 0.928 0.994
Levene 0.070 0.016 0.043 0.047 0.084 0.072
Results of Auxiliary Regressions
Welch 0.503 0.954 0.710 0.815 0.688 0.847
Brown–Forsythe 0.503 0.954 0.710 0.815 0.688 0.847
T-test 0.503 0.954 0.710 0.815 0.688 0.847
Levene 0.091 0.014 0.035 0.067 0.005 0.034

6. Concluding remarks

After the implementation of the proposed method, some concluding remarks can be made:

The proposed genetic algorithm obtains a solution with small modifications of the properties of the original data. It shows good solutions for multiple missing observations in cases where optimal algorithms cannot work.

The presented evolutionary strategy is a proper way to impute missing data in a multivariate data context; its flexibility and nonlinear capability turn it into a powerful tool for generating solutions in many contexts, including time series data and signal or image processing problems.

The proposed evolutionary algorithm obtained successful results even without conditional information, unlike the EM algorithm and its extensions. Moreover, it can be regarded as an alternative nonparametric imputation method.

A suitable selection of the population size can reduce the computing effort. A smaller size can reach a solution faster than a larger one, but its result could be worse; in any case it is no guarantee of reaching the global optimum.

Although the EM algorithm and the auxiliary regression approach show no statistical discrepancy from the available data on means and sample variances, only the proposed genetic approach also preserves the covariance structure, so that Ŝ_a and Ŝ_GA are not statistically different.

Finally, the reader can modify these results by constructing different fitness operators F, computing routines or selection operators to avoid local optima, as appropriate to the context of the problem.

Appendix A. Genetic solution

Table 5 presents the complete genetic solution indexed by coordinates (i, j).

Table 5
Complete evolutionary results.

(i, j)     x̂_ij     (i, j)     x̂_ij     (i, j)     x̂_ij     (i, j)     x̂_ij     (i, j)     x̂_ij

(17, 1) 32.80 (173, 2) 36 (204, 3) 1 (285, 4) 30.23 (269, 5) 0.73


(18, 1) 38.15 (174, 2) 31 (223, 3) 14 (305, 4) 40.81 (270, 5) 4.00
(19, 1) 40.00 (175, 2) 27 (224, 3) 16 (306, 4) 41.29 (271, 5) 5.41
(35, 1) 39.37 (189, 2) 38 (254, 3) 1 (307, 4) 28.30 (272, 5) 2.84
(36, 1) 42.92 (192, 2) 44 (285, 3) 3 (3, 5) 5.70 (273, 5) 2.58
(55, 1) 37.32 (225, 2) 54 (295, 3) 3 (4, 5) 2.30 (299, 5) 3.89
(67, 1) 36.10 (237, 2) 45 (296, 3) 4 (5, 5) 2.18 (300, 5) 0.67
(101, 1) 37.28 (247, 2) 49 (27, 4) 31.07 (15, 5) 3.75 (301, 5) 4.02
(127, 1) 42.96 (248, 2) 49 (32, 4) 41.71 (26, 5) 1.19 (17, 6) 1
(128, 1) 30.73 (249, 2) 60 (70, 4) 31.57 (27, 5) 0.24 (18, 6) 4
(129, 1) 42.38 (250, 2) 56 (76, 4) 28.58 (28, 5) 4.71 (19, 6) 4
(130, 1) 30.35 (261, 2) 45 (99, 4) 33.32 (44, 5) 2.48 (40, 6) 3
(162, 1) 39.20 (262, 2) 26 (100, 4) 27.97 (45, 5) 1.22 (41, 6) 4
(189, 1) 37.73 (304, 2) 34 (101, 4) 29.80 (46, 5) 3.99 (42, 6) 0
(190, 1) 38.65 (305, 2) 30 (102, 4) 33.12 (47, 5) 5.32 (61, 6) 3
(191, 1) 34.86 (306, 2) 53 (103, 4) 40.19 (61, 5) 4.77 (62, 6) 3
(192, 1) 37.43 (307, 2) 42 (116, 4) 41.76 (75, 5) 0.93 (63, 6) 3
(193, 1) 37.54 (308, 2) 42 (143, 4) 34.38 (76, 5) 4.79 (93, 6) 3
(237, 1) 37.61 (25, 3) 0 (147, 4) 38.93 (77, 5) 2.51 (97, 6) 3
(238, 1) 36.96 (26, 3) 7 (165, 4) 26.00 (102, 5) 4.76 (126, 6) 0
(239, 1) 41.15 (27, 3) 10 (166, 4) 31.13 (103, 5) 5.02 (127, 6) 2
(269, 1) 38.28 (28, 3) 1 (167, 4) 34.51 (104, 5) 4.65 (128, 6) 0
(270, 1) 38.32 (45, 3) 2 (168, 4) 37.66 (126, 5) 5.03 (152, 6) 2
(271, 1) 34.24 (46, 3) 1 (169, 4) 36.80 (127, 5) 3.38 (153, 6) 3
(272, 1) 42.33 (47, 3) 8 (170, 4) 27.32 (145, 5) 1.12 (154, 6) 0
(273, 1) 36.72 (48, 3) 8 (209, 4) 28.69 (148, 5) 0.19 (155, 6) 1
(308, 1) 33.40 (49, 3) 1 (215, 4) 27.56 (155, 5) 1.02 (188, 6) 2
(6, 2) 46 (126, 3) 4 (225, 4) 36.23 (180, 5) 1.28 (189, 6) 3
(7, 2) 38 (127, 3) 0 (236, 4) 34.73 (181, 5) 3.80 (199, 6) 3
(40, 2) 34 (128, 3) 15 (248, 4) 27.93 (182, 5) 0.42 (217, 6) 3
(41, 2) 20 (129, 3) 1 (249, 4) 36.16 (193, 5) 1.91 (220, 6) 0
(42, 2) 44 (130, 3) 16 (250, 4) 42.24 (199, 5) 3.31 (244, 6) 1
(67, 2) 45 (131, 3) 9 (251, 4) 30.10 (217, 5) 2.12 (245, 6) 4
(101, 2) 27 (132, 3) 2 (252, 4) 33.15 (218, 5) 0.59 (246, 6) 2
(102, 2) 31 (168, 3) 4 (253, 4) 37.21 (219, 5) 5.70 (247, 6) 0
(125, 2) 38 (169, 3) 5 (271, 4) 32.36 (241, 5) 5.38 (273, 6) 2
(148, 2) 46 (170, 3) 14 (272, 4) 40.17 (242, 5) 1.16 (278, 6) 4
(152, 2) 38 (171, 3) 11 (273, 4) 37.15 (243, 5) 1.07 (283, 6) 3
(171, 2) 41 (172, 3) 5 (283, 4) 34.75 (244, 5) 5.43 (305, 6) 3
(172, 2) 42 (203, 3) 3 (284, 4) 34.92 (251, 5) 0.10

References

Abdella, M., & Marwala, T. (2005). The use of genetic algorithms and neural networks to approximate missing data in database. In IEEE 3rd international conference on computational cybernetics, ICCC 2005 (Vol. 3, pp. 207–212). IEEE.
Agresti, A. (2002). Categorical data analysis. John Wiley and Sons.
Anderson, T. (1984). An introduction to multivariate statistical analysis. John Wiley and Sons.
Arnold, M. (2003). Reasoning about non-linear AR models using expectation maximization. Journal of Forecasting, 22(6), 479–490.
Box, G. (1949). A general distribution theory for a class of likelihood criteria. Biometrika, 36(1), 317–346.
Celeux, G., & Diebolt, J. (1993). The SEM algorithm: A probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly, 2(1), 73–82.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.
Devroye, L. (1986). Non-uniform random variate generation. New York: Springer-Verlag.
Eklund, N. (2006). Using genetic algorithms to estimate confidence intervals for missing spatial data. IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews, 36(4), 519–523.
Figueroa, J. C., Kalenatic, D., & López, C. (2008). Missing data imputation in time series by evolutionary algorithms. Lecture Notes in Computer Science, 5227(1), 275–283.
Figueroa, J. C., Kalenatic, D., & López, C. A. (2010). An evolutionary approach for imputing missing data in time series. Journal of Circuits, Systems and Computers, 19(1), 107–121.
Gaetan, C., & Yao, J.-F. (2003). A multiple-imputation Metropolis version of the EM algorithm. Biometrika, 90(3), 643–654.
González, S., Rueda, M., & Arcos, A. (2008). An improved estimator to analyse missing data. Statistical Papers, 49(4), 791–796.
Harville, D. A. (2005). Matrix algebra from a statistician's perspective. Springer-Verlag.
Krishnaia, P., & Lee, J. (1980). Likelihood ratio tests for mean vectors and covariance matrices. Handbook of Statistics, 1(1), 513–570.
Law, A., & Kelton, D. (2000). Simulation modeling and analysis. McGraw-Hill International.
Levine, L. A., & Casella, G. (2000). Implementations of the Monte Carlo EM algorithm. Journal of Computational and Graphical Statistics, 10(1), 422–439.
Nielsen, S. F. (2000). The stochastic EM algorithm: Estimation and asymptotic results. Bernoulli, 6(1), 457–489.
Parveen, S., & Green, P. (2004). Speech enhancement with missing data techniques using recurrent neural networks. In Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP '04) (Vol. 1, pp. 733–738). IEEE.
Peña, D. (2000). Análisis Multivariante de Datos. McGraw-Hill Interamericana.
Siripitayananon, P., Hui-Chuan, C., & Kang-Ren, J. (2002). Estimating missing data of wind speeds using neural network. In Proceedings of the 2002 IEEE SoutheastCon (Vol. 1, pp. 343–348). IEEE.
Timm, N. H. (2002). Applied multivariate analysis. Springer-Verlag.
