Multiple Taxicab Correspondence Analysis of A Survey Related To Health Services
1. Introduction
The data discussed in this paper represent a survey of 3530 individuals residing in the Downtown Eastside of Vancouver, an area with a high incidence of AIDS/HIV related diseases. Table 1 displays the marginal distribution of 22 active or substantive response variables or items filled in by the 3530 respondents, where each item describes a health related service offered by municipal authorities; for instance, the first item asks whether the needle exchange service, coded NXCHG, was used or not. Each item represents a polytomous qualitative variable having four categories: (1) = used this service, (2) = never tried, (3) = tried with no access, (N) = non response or missing. In Europe, particularly in France, multiple correspondence analysis (MCA) is a popular method to describe and visually explore complex relationships among the items in such a questionnaire survey. MCA is the application of correspondence analysis (CA) to the super indicator 0/1 matrix Z of size 3530 × 88. The number of columns, 88, comes from
4 × 22, which represents the total number of categories of the 22 items. To see
how the matrix Z is constructed, refer to Section 3. An advantage of coding
the data as in Z is that the missing values are incorporated in data analysis
naturally without imputation, just like any other category value. Imputation for
missing categorical survey data is discussed in some detail by Finch (2010). The
aim of this paper is to compare the MCA results with the multiple taxicab correspondence analysis (MTCA) results, MTCA being a robust L1 version of MCA
developed by Choulakian (2006; 2008a; 2008b). Because of its robustness, MTCA
will reveal that there is a clear structure in this data set based on a simple sum
score statistic. Further, we show that such a sum score characterization always exists for any survey questionnaire data; this will help the researcher to see whether the active items are broadly similar in objective and point in the same direction.
Table 1: The marginal distribution of frequencies of the categories of the 22 health
related service items, with the symbols used for their representation

Item (symbol)                          Used this    Never      Tried with      Missing
                                       service (1)  tried (2)  no access (3)   (N)
Needle exchange (NXCHG)                1592         1832        45              61
Food bank (FB)                         2021         1404        47              58
Pharmacy (PH)                          2647          790        24              69
Methadone treatment (MET)               666         2733       107              24
HIV medications (HIVM)                  268         3119       125              18
A&D counselling (ADC)                   815         2597        88              30
Nursing care (NUR)                     1055         2353        74              48
Doctor care (DOC)                      2631          802        18              79
Mental health unit (MHU)                498         2890       119              23
Mental health worker (MHW)              504         2888       119              19
Outreach worker (OWU)                   743         2653       106              28
Detox-residential (DETR)                470         2931        84              45
Day-tox day program (DETD)               95         3295       133               7
Recovery house (RH)                     291         3123       108               8
Other drug treatment centre (ODTC)      227         3182       115               6
Ambulance pick-up (APU)                 867         2588        49              26
Emergency Department - sph (EDSPH)     1050         2396        59              25
Emergency Department - vgh (EDVGH)      412         3007        99              12
Emergency Department - other (EDO)      292         3117       115               6
Hospital admission - sph (HASPH)        542         2874        96              18
Hospital admission - vgh (HAVGH)        233         3167       119              11
Hospital admission - other (HAO)        184         3218       124               4
First we present the underlying mathematics, then we discuss the case study.
This paper is organized as follows. In Section 2 we present an overview of taxicab correspondence analysis of a contingency table; Section 3 presents the main
theoretical results; in Sections 4 and 5 we present the analysis of the survey data
by MCA and MTCA, respectively; and we conclude in Section 6.
We suppose that the theory of multiple correspondence analysis (MCA) is known; it can be found, among others, in Benzécri (1973; 1992), Greenacre (1993), Gifi (1990), Nishisato (1994), and Le Roux and Rouanet (2004). Note that MCA is also known as homogeneity analysis, reciprocal averaging, dual scaling or the third method of quantification.
2. Taxicab Correspondence Analysis: An Overview
2.1 Introduction
In a series of papers, Choulakian (2003; 2005; 2006a; 2006b) developed principal component analysis (PCA) based on matrix norms, thus generalizing classical PCA, or equivalently generalizing the well known singular value decomposition (SVD). This led to the development of taxicab principal component analysis (TPCA), based on the most robust matrix norm, named the taxicab matrix norm, on which taxicab correspondence analysis (TCA) is in turn based.
To see that TPCA is similar to and has the same mathematical framework as classical PCA, we start with an overview of classical PCA, which can be described in many ways; see Jolliffe (2002) for a comprehensive account. However, TPCA is similar to only one of those ways, which we present in the next subsection to make the paper self-contained and reader friendly.
2.2 Classical Principal Component Analysis
Let T be a centered or standardized data set of dimension I × J, where the I observations are described by the J variables; that is, T'T/I is the covariance or the correlation matrix. For a vector u ∈ R^J, we define its Euclidean or L2 norm to be ||u||_2 = (u'u)^{1/2}. Let k = rank(T). Classical principal component analysis (PCA) consists of the successive maximization of the variance, the square of the L2-norm, of a linear combination of the variables of the matrix T, subject to a quadratic constraint; that is, it is based on the optimization problem

    max ||T u||_2 subject to ||u||_2 = 1,    (1)

and its dual

    max ||T' v||_2 subject to ||v||_2 = 1.    (2)

Equation (1) is the dual of (2), and they can be reexpressed as matrix norms

    λ_1 = max_{u ∈ R^J} ||T u||_2/||u||_2
        = max_{v ∈ R^I} ||T' v||_2/||v||_2
        = max_{u ∈ R^J, v ∈ R^I} v' T u/(||u||_2 ||v||_2).    (3)
The solution to (3), λ_1, is the square root of the greatest eigenvalue of the matrix T'T or TT'. The first principal axes, u_1 and v_1, are defined as

    u_1 = arg max_u ||T u||_2 such that ||u_1||_2 = 1,    (4)

where u_1 is the eigenvector of the matrix T'T associated with the greatest eigenvalue λ_1^2; and

    v_1 = arg max_v ||T' v||_2 such that ||v_1||_2 = 1.    (5)

Let f_1 be the vector of the first principal component (pc) scores, and g_1 the vector of the first pc loadings, defined as

    f_1 = T u_1    (6)

and

    g_1 = T' v_1.    (7)

Equations (6) and (7) are named transitional formulas, because v_1 and f_1, and u_1 and g_1, are related by

    u_1 = g_1/λ_1 and v_1 = f_1/λ_1.    (8)

To obtain the second pc scores, loadings and axes, the procedure is repeated on the residual matrix

    T_2 = T_1 − f_1 g_1'/λ_1,    (9)

where T_1 = T. We note that rank(T_2) = rank(T_1) − 1, because by (6) and (7)

    T_2 u_1 = 0 and T_2' v_1 = 0.    (10)
Classical PCA can be described as the sequential repetition of the above procedure k = rank(T) times, until the residual matrix becomes 0; thus, using α = 1, ..., k as indices, the matrix T can be written as

    T = Σ_{α=1}^{k} f_α g_α'/λ_α,    (11)

or equivalently, by (8), as the singular value decomposition

    T = Σ_{α=1}^{k} λ_α v_α u_α'.    (12)

Further, we have

    λ_α = ||f_α||_2 = ||g_α||_2, and the λ_α's are decreasing for α = 1, ..., k;    (13)

and

    Tr(T'T) = Tr(TT') = Σ_{α=1}^{k} λ_α^2 = Σ_{α=1}^{k} ||f_α||_2^2 = Σ_{α=1}^{k} ||g_α||_2^2,    (14)

which represents, by the Pythagorean theorem, I times the sum of the variances of the J variables, or the sum of the squared Euclidean distances of the I rows from the origin, because we assumed that T is centered or standardized. Also, the relative cumulative explained variability by the first α axes is

    CEV(α) = Σ_{β=1}^{α} λ_β^2 / Σ_{β=1}^{k} λ_β^2 for α = 1, ..., k.    (15)
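As a numerical check (our illustration, not part of the paper), the transitional formulas (6)-(8) and the identities (13)-(14) can be verified with the SVD in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal((6, 4))
T -= T.mean(axis=0)                  # center the columns

U, s, Vt = np.linalg.svd(T, full_matrices=False)
lam1, u1, v1 = s[0], Vt[0], U[:, 0]  # first singular triple

f1 = T @ u1                          # first pc scores, equation (6)
g1 = T.T @ v1                        # first pc loadings, equation (7)

# transitional formulas (8): u1 = g1/lam1 and v1 = f1/lam1
assert np.allclose(u1, g1 / lam1)
assert np.allclose(v1, f1 / lam1)
# lam1 = ||f1||_2 = ||g1||_2, cf. equation (13)
assert np.isclose(lam1, np.linalg.norm(f1))
assert np.isclose(lam1, np.linalg.norm(g1))
# total variability, equation (14)
assert np.isclose(np.trace(T.T @ T), (s ** 2).sum())
```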
2.3 Taxicab Principal Component Analysis

Taxicab principal component analysis (TPCA) is obtained by replacing the L2 norm in the above program by the L1 norm ||u||_1 = Σ_j |u_j|; that is, it is based on the optimization problem

    max ||T u||_1 subject to ||u||_∞ = 1,    (16)

and its dual

    max ||T' v||_1 subject to ||v||_∞ = 1,    (17)

which can be reexpressed as the matrix norm

    λ_1 = max_{u ∈ R^J} ||T u||_1/||u||_∞
        = max_{v ∈ R^I} ||T' v||_1/||v||_∞
        = max_{u, v} v' T u/(||u||_∞ ||v||_∞),    (18)

which is a well known and much discussed matrix norm related to the Grothendieck problem; see, for instance, Alon and Naor (2006). The solution to (18), λ_1, is obtained from the combinatorial optimization problem

    max ||T u||_1 subject to u ∈ {−1, +1}^J.    (19)

Equation (19) characterizes the robustness of the method, in the sense that the weights assigned to the variables (and, similarly, to the individuals by duality) are uniform, ±1. The first principal axes, u_1 and v_1, are defined as

    u_1 = arg max_u ||T u||_1 such that ||u_1||_∞ = 1,    (20)

and

    v_1 = arg max_v ||T' v||_1 such that ||v_1||_∞ = 1.    (21)
Let f_1 be the vector of the first principal component (pc) scores, and g_1 the vector of the first pc loadings. These are defined as

    f_1 = T u_1    (22)

and

    g_1 = T' v_1.    (23)

Equations (22) and (23) are named transitional formulas, because v_1 and f_1, and u_1 and g_1, are related by

    u_1 = sgn(g_1) and v_1 = sgn(f_1),    (24)

where sgn(g_1) = (sgn(g_1(1)), ..., sgn(g_1(J)))', and sgn(g_1(j)) = 1 if g_1(j) > 0, sgn(g_1(j)) = −1 otherwise. Note that (24) is completely different from (8).
To obtain the second pc scores f_2, loadings g_2, and axes u_2 and v_2, we repeat the above procedure on the residual dataset

    T_2 = T_1 − T_1 u_1 v_1' T_1/λ_1 = T_1 − f_1 g_1'/λ_1,    (25)

from which it follows that

    u_1' g_α = 0    (26)

and

    v_1' f_α = 0 for α = 2, ..., k.    (27)

Repeating the procedure k = rank(T) times yields the decomposition

    T = Σ_{α=1}^{k} f_α g_α'/λ_α.    (28)

It is important to note that (28) has the same form as (11), but it cannot be rewritten as (12), because (24) is completely different from (8).
Further, similar to (13), we have

    λ_α = ||f_α||_1 = ||g_α||_1 for α = 1, ..., k.    (29)

But the dispersion measures λ_α in (29) will not satisfy (14), because the Pythagorean theorem is not satisfied in the L1 norm. Given that (14) is used for classical PCA, for both methods we define the total variability to be

    TotD = Σ_{α=1}^{k} λ_α^2,    (30)

and the relative cumulative explained variability by the first α axes is

    CEV(α) = Σ_{β=1}^{α} λ_β^2 / Σ_{β=1}^{k} λ_β^2 for α = 1, ..., k.    (31)
The iterative algorithm is statistically consistent in the sense that, as the sample size increases, there will be some observations in the direction of the principal axes, so the algorithm will find the optimal solution. For the survey dataset, the computations were done by the iterative algorithm.
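A minimal sketch of such an iterative algorithm for problem (19) (our illustration; the authors' exact implementation is not shown in the paper) alternates the sign updates u = sgn(T'v) and v = sgn(Tu), which cannot decrease the objective v'Tu, and restarts from the sign pattern of every column of T to guard against poor local optima:

```python
import numpy as np
from itertools import product

def sgn(x):
    # sign function with sgn(0) = +1, following the convention of equation (24)
    return np.where(x >= 0, 1.0, -1.0)

def taxicab_pc(T, max_iter=100):
    """First taxicab principal triple (lam, u, v): heuristically maximize
    ||T u||_1 over u in {-1,+1}^J by alternating u = sgn(T'v), v = sgn(T u),
    restarted from the sign pattern of every column of T."""
    best_lam, best_u, best_v = -np.inf, None, None
    for j in range(T.shape[1]):
        v = sgn(T[:, j])
        for _ in range(max_iter):
            u = sgn(T.T @ v)
            v_new = sgn(T @ u)
            if np.array_equal(v_new, v):
                break
            v = v_new
        lam = v @ T @ u              # = ||T u||_1 at a fixed point
        if lam > best_lam:
            best_lam, best_u, best_v = lam, u, v
    return best_lam, best_u, best_v

rng = np.random.default_rng(1)
T = rng.standard_normal((8, 5))
T -= T.mean(axis=0)                  # centered data matrix

lam, u, v = taxicab_pc(T)
assert np.isclose(lam, np.abs(T @ u).sum())   # lam = ||T u||_1

# the heuristic never exceeds the exact combinatorial maximum of (19)
brute = max(np.abs(T @ np.array(s)).sum() for s in product([-1, 1], repeat=5))
assert lam <= brute + 1e-9
```

At a fixed point λ = v'Tu = ||Tu||_1; the brute-force comparison is feasible here only because J = 5.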
2.4 Taxicab Correspondence Analysis of A Contingency Table
Often correspondence analysis (CA) is identified as categorical PCA; that is, it
is considered an adaptation of PCA to contingency tables. Similarly we consider
TCA an adaptation of TPCA to contingency tables. Here we introduce TCA of a
contingency table N = (nij ) of two nominal variables with I rows and J columns.
Let P = N/n be the associated correspondence matrix with elements p_ij, where n = Σ_{j=1}^{J} Σ_{i=1}^{I} n_ij is the sample size. We define p_i• = Σ_{j=1}^{J} p_ij and p_•j = Σ_{i=1}^{I} p_ij, the vector r = (p_i•) ∈ R^I, the vector c = (p_•j) ∈ R^J, and D_r = Diag(r), a diagonal matrix having diagonal elements p_i•, and similarly D_c = Diag(c).
The application of the TPCA algorithm to P, described in the previous subsection, is named TCA of the contingency table N. We put P_0 = P and denote by P_α the residual correspondence matrix at the α-th iteration. That is, in the calculations described in the previous subsection, we replace T by P_α, and the numbering of the iterations varies from 0 to k, where k = rank(P) − 1.

For α = 0, P_0 = P. Row and column profiles with their masses play an important role in both CA and TCA. Let R_0 = D_r^{−1} P_0 = (r_ij) = (p_ij/p_i•) designate the row profiles; that is, for each i, Σ_{j=1}^{J} r_ij = 1. The cloud of row profiles with their masses is the set {(r_0i, p_i•) | i = 1, ..., I}, where r_0i is the ith row of R_0; and the cloud of column profiles with their masses is the set {(c_0j, p_•j) | j = 1, ..., J}, where c_0j is the jth row of C_0 = D_c^{−1} P_0'. We shall interpret the steps of TCA using the row profiles; however, we remind the reader that a similar interpretation can be done using the column profiles.
For α = 0, the optimization problem (16) is

    max ||P_0 u||_1 = max ||D_r R_0 u||_1 = max Σ_{i=1}^{I} p_i• |r_0i' u| subject to ||u||_∞ = 1;    (32)

that is,

    u_0 = arg max_{u ∈ {−1,+1}^J} ||P_0 u||_1 and v_0 = arg max_{v ∈ {−1,+1}^I} ||P_0' v||_1,    (33)
which can be seen to be trivially u_0 = 1_J, the J-component vector of 1s, and v_0 = 1_I. The 0-th principal factor scores are

    f_0 = D_r^{−1} P_0 u_0 = R_0 u_0 and g_0 = D_c^{−1} P_0' v_0 = C_0 v_0,    (34)

so that

    u_0 = sgn(g_0) = 1_J and v_0 = sgn(f_0) = 1_I.    (35)

The 0-th taxicab dispersion measure can be represented in many different ways as

    λ_0 = ||P_0' v_0||_1 = ||c||_1 = ||D_c g_0||_1 = u_0' D_c g_0
        = ||P_0 u_0||_1 = ||r||_1 = ||D_r f_0||_1 = v_0' D_r f_0 = 1,    (36)

and the first residual correspondence matrix is

    P_1 = P_0 − r c'.    (37)

Note that r c' represents the correspondence matrix under the assumption that the row and column variables are independent. This solution is considered trivial both in CA and in TCA.
For α = 1, we define the residual row and column profiles to be R_1 = D_r^{−1} P_1 and C_1 = D_c^{−1} P_1'. The cloud of the residual row profiles with their masses is the set {(r_1i, p_i•) | i = 1, ..., I}, where r_1i is the ith row of R_1; and the cloud of residual column profiles with their masses is the set {(c_1j, p_•j) | j = 1, ..., J}, where c_1j is the jth row of C_1 = D_c^{−1} P_1'. We repeat steps (20) through (25), or (32) through (37), where P_0 is replaced by P_1. Note that this maximization problem is NP-hard and not trivial. In general, the α-th taxicab dispersion measure can be represented in many different ways:

    λ_α = ||P_α u_α||_1 = ||D_r f_α||_1 = v_α' D_r f_α
        = ||P_α' v_α||_1 = ||D_c g_α||_1 = u_α' D_c g_α.    (38)
The residual correspondence matrix at the (α+1)-th iteration is

    P_{α+1} = P_0 − Σ_{β=0}^{α} D_r f_β g_β' D_c / λ_β,    (39)

from which one gets the data reconstitution formula, valid both in TCA and in CA,

    p_ij = p_i• p_•j [1 + Σ_{α=1}^{k} f_α(i) g_α(j)/λ_α].    (40)
Similar to classical CA, the total dispersion is defined to be Σ_{α=1}^{k} λ_α^2, the proportion of the variation explained by the α-th principal axis is λ_α^2 / Σ_{β=1}^{k} λ_β^2, and the cumulative explained variation is

    CEV(α) = Σ_{β=1}^{α} λ_β^2 / Σ_{β=1}^{k} λ_β^2 for α = 1, ..., k.    (41)

The visual maps are obtained by plotting the points (f_α(i), f_β(i)) for i = 1, ..., I, or (g_α(j), g_β(j)) for j = 1, ..., J, for α ≠ β.
An important property of TCA and CA is that columns (or rows) with identical profiles (conditional probabilities) receive identical factor scores. One important advantage of TCA over CA is that it stays as close as possible to the original data: it acts directly on the correspondence matrix P without calculating a dissimilarity (or similarity) measure between the rows or columns. TCA does not admit a distance interpretation between profiles; there is no chi-square-like distance in TCA. Fichet (2009) described it as a scoring method.
More technical details about TCA and a deeper comparison between TCA and
CA is done in Choulakian (2006a). Further results can be found in Choulakian
et al. (2006), Choulakian (2008a), and Choulakian and de Tibeiro (2012).
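The steps of Section 2.4 can be sketched as follows (a minimal illustration, not the authors' code); the maximization over u ∈ {−1, +1}^J is done here by enumeration, which is feasible only for tiny J, and the deflation of equations (25) and (39) is repeated until the residual matrix vanishes:

```python
import numpy as np
from itertools import product

# a small illustrative contingency table (hypothetical counts)
N = np.array([[10.0, 5.0, 3.0],
              [ 2.0, 8.0, 6.0],
              [ 4.0, 1.0, 9.0]])
P = N / N.sum()                        # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)    # row and column masses
Dr_inv, Dc_inv = np.diag(1 / r), np.diag(1 / c)

P_res = P - np.outer(r, c)             # trivial step alpha = 0 removed, equation (37)
k = np.linalg.matrix_rank(P_res)       # number of nontrivial axes, k = rank(P) - 1
lams, F, G = [], [], []
for _ in range(k):
    # u maximizing ||P_res u||_1 over {-1,+1}^J (exhaustive, J = 3 here)
    u = max((np.array(s) for s in product([-1, 1], repeat=P.shape[1])),
            key=lambda s: np.abs(P_res @ s).sum())
    v = np.sign(P_res @ u)
    v[v == 0] = 1
    lam = v @ P_res @ u                # taxicab dispersion, equation (38)
    f = Dr_inv @ P_res @ u             # row factor scores
    g = Dc_inv @ P_res.T @ v           # column factor scores
    lams.append(lam); F.append(f); G.append(g)
    # deflation, equations (25)/(39): subtract D_r f g' D_c / lam
    P_res = P_res - np.outer(r * f, c * g) / lam

# after k steps the residual vanishes, i.e., the reconstitution (40) is exact
assert np.allclose(P_res, 0)
```

Each deflation step reduces the rank of the residual matrix by one, so the loop reconstitutes P − rc' exactly.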
3. Main Theoretical Results
3.1 Multiple Taxicab Correspondence Analysis
Let n individuals fill out a questionnaire survey consisting of Q items, where the qth item has J_q answer categories. Let j_q denote the value of the jth category of the qth item, for q = 1, ..., Q and j_q = 0, ..., J_q − 1. Let Z be the super indicator 0/1 matrix of order n × Σ_{q=1}^{Q} J_q. An example of a matrix Y and its 0/1 indicator matrix Z is shown below, with n = 4, Q = 3, and J_1 = 3, J_2 = 3, J_3 = 2.
    Y =
        1 0 0
        2 1 0
        0 1 1
        2 2 0

    Z =
        0 1 0   1 0 0   1 0
        0 0 1   0 1 0   1 0
        1 0 0   0 1 0   0 1
        0 0 1   0 0 1   1 0
CA of Z is named MCA of Y and TCA of Z is named MTCA of Y .
Theorem 1. (Choulakian, 2008b): Along the first principal axis, the projected
response patterns in MTCA of Y will be clustered and the number of cluster
points is less than or equal to 1 + Q.
This theorem shows that MTCA automatically clusters the response patterns,
that is the individuals, into at most 1 + Q clusters. This is an important feature
of the method, and an important help to the researcher. Note that some clusters
can be empty.
3.2 Characterization of the First MTCA Principal Factor as Sum Score
Statistic
The next theorem, which is new, completely characterizes the 1 + Q clusters by a sum score statistic, more precisely by the total number of first factor successes over all the items. So the crucial point is how to define the first factor success of an item, and its complement, the first factor failure. It is important to note that the sum score statistic of items makes sense only when the items are similar in nature, which, as we will see in Section 5, is the case for the data set considered in this paper.
First we consider the case of dichotomous items, when Jq = 2 for q = 1, , Q;
then generalize the result to polytomous items.
Theorem 2. (The first MTCA factor property for dichotomous items): Let Y ∈ R^{n×Q}, where Y_ij = 0 if the response of the ith individual to the jth dichotomous item is a failure, and Y_ij = 1 if it is a success, and consider the MTCA of Y. Then the first principal factor scores f_1(i) and the subject sum scores Y_i•, for i = 1, ..., n, are linearly related (i.e., corr(f_1(i), Y_i•) = ±1) if and only if the first principal factor item weights are u_1 = (u_11' | u_12')' = (1_Q' | −1_Q')'; and, when it is the case, f_1(i) = 2(Y_i• − Y_••/n)/Q.
Proof: Let Y ∈ R^{n×Q}, where Y_ij = 0 or 1; then Z = (Y | 1_n 1_Q' − Y), of size n × 2Q, is the 0/1 indicator matrix of Y, and P = Z/(nQ) is the correspondence matrix of size n × 2Q. Then p_i• = Σ_{j=1}^{2Q} p_ij = 1/n, p_•j = Σ_{i=1}^{n} p_ij = Y_•j/(nQ) if j = 1, ..., Q, and p_•j = Σ_{i=1}^{n} p_ij = (n − Y_•j)/(nQ) if j = Q+1, ..., 2Q. By equation (37), the first residual correspondence matrix is

    P_1 = (1/(nQ)) ((Y_ij − Y_•j/n) | −(Y_ij − Y_•j/n)).    (42)

The second matrix block in P_1, −(Y_ij − Y_•j/n), is the negative of the first matrix block, (Y_ij − Y_•j/n), so the axis u_1 = (u_11' | u_12')' = (u_11' | −u_11')' maximizes λ_1 in (19); hence

    f_1(i) = (2/Q) Σ_{j=1}^{Q} u_11(j) (Y_ij − Y_•j/n) for i = 1, ..., n.    (43)

If u_11 = 1_Q, then

    f_1(i) = (2/Q)(Y_i• − Y_••/n),    (44)

which is linear in the sum score Y_i•. Otherwise, suppose without loss of generality that the sign of the weight of the last item is reversed, u_11 = (1_{Q−1}' | −1)'; then

    f_1(i) = (2/Q) [Σ_{j=1}^{Q−1} (Y_ij − Y_•j/n) − (Y_iQ − Y_•Q/n)]
           = (2/Q) [Σ_{j=1}^{Q} (Y_ij − Y_•j/n) − 2(Y_iQ − Y_•Q/n)]
           = (2/Q) [(Y_i• − Y_••/n) + 2Y_•Q/n] if Y_iQ = 0,    (45)
           = (2/Q) [(Y_i• − Y_••/n) − 2(1 − Y_•Q/n)] if Y_iQ = 1.    (46)

Equations (45) and (46) show that the points (f_1(i), Y_i•) will locate on two parallel lines, defined by the success or failure of the ith respondent on item Q.
Definition: a) For a dichotomous item q, q = 1, ..., Q, we define the first factor success of item q to be the category of item q with first MTCA factor score g_1(j_q) > 0, for j_q = 0, 1.

b) For a polytomous item q, q = 1, ..., Q, we define the first factor success of item q to be the category set {j_q | g_1(j_q) > 0 for j_q = 0, ..., J_q − 1}.
Now we can interpret Theorem 2 in the following way:

a) All the success categories of the Q items (coded as 1 in Y), u_11 = 1_Q, oppose all the failure categories of the Q items (coded as 0 in Y), u_12 = −1_Q; that is, the first principal axis is u_1 = (u_11' | u_12')' = (1_Q' | −1_Q')'.
b) A success on item q is identical to the first factor success of item q; that is, for each item, success and first factor success coincide. If, for an item, success and first factor success differ, then, depending on the subject matter, either we delete this item from the analysis or we swap success with first factor success (failure).
If the condition of Theorem 2 holds, then the above two points imply that the
Q items are broadly similar in objective and point to the same direction towards
one general latent variable; further, principal dimensions of order higher than one
will reveal specific local factors conditioned by the first general latent variable
sum score, as will be seen in the analysis of the health survey data set.
The case of polytomous data follows easily from Theorem 2 if we define the success of a polytomous item to be identical to the first factor success given in the above definition; thus, by Theorem 2, each cluster will be perfectly characterized by the raw sum score of the first factor successes in the response patterns belonging to that cluster.
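The linear relation of Theorem 2 can be checked numerically: forcing the axis u_1 = (1_Q' | −1_Q')' and computing the factor scores from the residual matrix (42) reproduces f_1(i) = 2(Y_i• − Y_••/n)/Q exactly. A minimal sketch with arbitrary 0/1 data (our illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n, Q = 50, 6
Y = (rng.random((n, Q)) < 0.4).astype(float)   # hypothetical dichotomous responses

Z = np.hstack([Y, 1 - Y])             # 0/1 indicator matrix of the Q items
P = Z / (n * Q)                       # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)   # r_i = 1/n for every i
P1 = P - np.outer(r, c)               # first residual matrix, equation (42)

u1 = np.concatenate([np.ones(Q), -np.ones(Q)])   # the axis of Theorem 2
f1 = (1 / r) * (P1 @ u1)              # f1 = D_r^{-1} P1 u1

S = Y.sum(axis=1)                     # subject sum scores Y_i.
assert np.allclose(f1, 2 * (S - S.mean()) / Q)   # f1(i) = 2(Y_i. - Y_../n)/Q
assert np.isclose(np.corrcoef(f1, S)[0, 1], 1.0) # hence corr(f1, S) = 1
```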
For some theoretical and empirical comparisons of the sum score statistic for
binary data that point to one underlying latent variable with parametric and non
parametric models, see in particular Cox and Wermuth (2002).
4. Multiple Correspondence Analysis of the Health Survey Data
The second column in Table 2 displays the first five dispersion measures λ_α, the standard deviations, of the first five important factors resulting from the CA of Z; in CA terminology, λ_α^2 represents the inertia (variance) of the αth factor. We see that the first three values clearly stand out: λ_1 = 0.8974, being close to 1, implies that the dataset Z has a quasi two-block structure; and λ_2 ≈ λ_3 implies that the principal plane 2-3 should be examined. We do not present the percentage of the variance explained by each principal factor, because these percentages are misleading; many adjusted values have been proposed in the literature, see for instance Greenacre (1993).
Table 2: The first five dispersion measures λ_α

α    CA of Z    TCA of Z
1    0.8974     0.3014
2    0.4290     0.1910
3    0.4218     0.1759
4    0.3003     0.1703
5    0.2925     0.1647
Figures 1 and 2 show the MCA maps of the principal planes 1-2 and 2-3,
respectively. In Figure 1, we clearly see that the missing (N) and tried with no
access (3) categories dominate the map by forming two different clusters far away
from the center; and the remaining column points, representing the used this service (1) and never tried (2) category values, are clustered around the origin; further,
the second dimension separates the missing (N) categories from the tried with no
access (3) categories. Figure 2 shows the complete separation of the four category
values (1), (2), (3) and (N). Table 1 shows that the two categories, (N) and (3),
for each of the 22 items have small weights; and it is a well known fact that
often rare elements disturb the graphical displays in CA or MCA. Another way
of interpreting Figure 1 is that, the categories (3) and (N) can be considered as
outliers, and their harmful influence should be eliminated. Different approaches
have been proposed to handle missing or outlier categories by Michailidis and de
Leeuw (1998), Le Roux and Rouanet (2004, Chapter 5), Greenacre (2006) and
Greenacre (2009). Figures 3 and 4, which display the projection of the individuals on the principal planes 1-2 and 2-3, have the same form as Figures 1 and 2, and
they admit the same interpretation.
5. Multiple Taxicab Correspondence Analysis of the Health Survey Data

In Figure 5, which displays the MTCA map of the categories, the second principal axis opposes medical services [DOC1 (doctor care), NUR1 (nursing care), PH1 (pharmacy), NXCHG1 (needle exchange), MET1 (methadone treatment), EDSPH1 (emergency department, St. Paul's Hospital)] to mental health services [MHU1 (mental health unit), MHW1 (mental health worker), DETR1 (detox residential), DETD1 (day-tox day program), OWU1 (outreach worker)].
Figure 6, which should be compared with Figures 3 and 4, shows the projection of the respondents on the first principal plane. We see a very clear pattern: the 3530 individuals are clustered, and on the first axis there are 22 clusters. Theorem 1 of Section 3 states that the maximum number of clusters of respondents on the first principal axis is 23 = 22 + 1 = Q + 1, where Q is the number of questions. What is the interpretation of the 22 clusters? Theorem 2 of Section 3 states that the 22 clusters of respondents can be completely characterized by a discrete variable S, the simple sum score statistic of used this service (1) over all items, because the 22 categories used this service (1) have positive first principal factor scores. We name the category used this service (1) the first factor success category for each item; its complement, the first factor failure, is {(2), (3), (N)}. Table 3 provides some summary statistics of the clusters, which we describe in steps:
a) The first column provides the first principal factor scores of the respondents: we see 16 clusters of respondents with negative first principal factor scores and 6 clusters of respondents with positive first principal factor scores. The distance between two consecutive clusters on the first principal factor is constant and equals 0.09091, except for the first two clusters, where it equals |−1.4669 + 1.2851| = 0.18159 ≈ 2 × 0.09091, the cluster S = 1 being empty.

b) The third column provides the frequency of each cluster of respondents; for example, there are 143 individuals in the first cluster, whose first principal factor score is −1.4669, and 3 individuals in the second cluster, whose first principal factor score is −1.2851.
c) We introduce some notation to formulate mathematically the calculations done in columns 4 to 7. Let Q = 22 be the number of items or questions and C = 22 the number of clusters; let n_c be the frequency of individuals in cluster c, for example n_1 = 143. We can express the 0/1 matrix Z as a three-way array, z_iqv, for i = 1, ..., 3530, q = 1, ..., Q and v = 1, 2, 3, N. Consider the matrix W = (w_iv) of size 3530 × 4, where w_iv = Σ_{q=1}^{Q} z_iqv, which represents the number of times that respondent i chose the category value v across all the items. Let W_c, of size n_c × 4, be the subset of the rows of W belonging to cluster c; for instance, W_4 is of size 7 × 4, and its elements are given in Table 4. In Table 4 the row identified by min = (4 17 0 0) provides the minimum values in the four columns of the matrix W_4, and the row identified by max = (4 18 1 0) provides the maximum values.
Table 3: The 22 clusters of respondents on the first MTCA principal axis: first factor score, sum score of (1)s, frequency, and the (min, max) counts of each category value within the cluster

First factor   Sum score   Frequency   Used this     Never       Tried with      Missing
score          of (1)s                 service (1)   tried (2)   no access (3)   (N)
-1.467           0           143        (0,0)         (0,14)      (0,14)          (0,22)
-1.285           2             3        (2,2)         (20,20)     (0,0)           (0,0)
-1.194           3             3        (3,3)         (19,19)     (0,0)           (0,0)
-1.103           4             7        (4,4)         (17,18)     (0,1)           (0,0)
-1.012           5             7        (5,5)         (15,17)     (0,2)           (0,0)
-0.921           6            12        (6,6)         (14,16)     (0,2)           (0,0)
-0.830           7            20        (7,7)         (13,15)     (0,2)           (0,0)
-0.740           8            18        (8,8)         (14,14)     (0,0)           (0,0)
-0.649           9            29        (9,9)         (10,13)     (0,3)           (0,0)
-0.558          10            51        (10,10)       (10,12)     (0,2)           (0,0)
-0.467          11            82        (11,11)       (10,11)     (0,1)           (0,0)
-0.376          12           118        (12,12)       (8,10)      (0,2)           (0,0)
-0.285          13           181        (13,13)       (4,9)       (0,5)           (0,0)
-0.194          14           228        (14,14)       (5,8)       (0,3)           (0,1)
-0.103          15           282        (15,15)       (6,7)       (0,1)           (0,0)
-0.012          16           327        (16,16)       (5,6)       (0,1)           (0,0)
 0.079          17           358        (17,17)       (3,5)       (0,2)           (0,0)
 0.169          18           440        (18,18)       (2,4)       (0,2)           (0,0)
 0.260          19           461        (19,19)       (2,3)       (0,1)           (0,0)
 0.351          20           416        (20,20)       (1,2)       (0,1)           (0,0)
 0.442          21           224        (21,21)       (0,1)       (0,1)           (0,0)
 0.533          22           120        (22,22)       (0,0)       (0,0)           (0,0)
Table 4: The matrix W_4 of the n_4 = 7 respondents in cluster S = 4, with the column minima and maxima

Respondent      Used this     Never       Tried with      Missing
                service (1)   tried (2)   no access (3)   (N)
1                4             17          1               0
2                4             18          0               0
3                4             18          0               0
4                4             18          0               0
5                4             18          0               0
6                4             18          0               0
7                4             18          0               0
min              4             17          0               0
max              4             18          1               0
(min, max)
in cluster 4     (4,4)         (17,18)     (0,1)           (0,0)
So we see that the first principal factor of MTCA revealed that the data set has a very clear structure, based on the simple sum score statistic of the first factor success categories over all items. Further, the 22 health items are broadly similar in objective and point in the same direction.
The second principal factor has a simple interpretation: for a fixed sum score S of (1)s, it shows the intra-variability of the response patterns for that sum score. Which clusters have the most variability, and what is the nature of that variability? Going to Table 3, we check the (min, max) values for each cluster: it is evident that the first cluster, characterized by S = 0, is the most heterogeneous, followed by the clusters S = 13 and S = 14.
The cluster (S = 0) has (min, max) = (0, 14) for the categories never tried (2)
and tried with no access (3), and (min, max) = (0, 22) for the category missing
(N). We also note that all the missing non response values, save one, are found in
this cluster. Further, Figure 6 confirms this fact, where we see 8 points aligned
vertically in the third quadrant. We also note that the relative frequency of this
group is very small, 143/3530 = 0.04051. In fact, we recall that the units in this
group were designated as outliers in MCA.
The cluster (S = 13) has (min, max) = (4, 9) for the category never tried (2), (min, max) = (0, 5) for the category tried with no access (3), and (min, max) = (0, 0) for the category missing (N). This is natural variability, because the sum score statistic, being a sum of successes, has the most variability around its central values. A similar interpretation is given to the cluster (S = 14), which has a relative frequency of 228/3530 = 0.0646.
225
In Figure 5, we saw that the categories of the variables have a parabolic curved shape. Is there a parabola in Figure 6? The answer is yes: by suppressing the circled respondents, who make up less than 6% of the data, we see an inverted V-shaped band of points, which represents a taxicab parabola with a lot of dispersion; see, for instance, Krause (1986).
We conclude our analysis with the following remark: the 22 health items are broadly similar in objective and point in the same direction, towards one general latent variable (because of the parabolic shape of the projected categories in Figure 5). Further, this general latent variable is completely described by the simple sum score statistic of used this service (1) over all items.
5.2 Gender as a Passive Variable
Usually in a survey, in addition to the substantive response questions, concomitant personal information about the respondents is gathered. These variables are named passive or exogenous. In this analysis we include one such qualitative variable, gender, having three categories: male (m), female (f) and transgendered (t). Now we describe the clusters by computing the log-odds ratio (LOR) of males to females for each cluster with respect to the marginal distribution. For example,

    LOR(S = 0) = ln((94/49)/(2406/1097)) = −0.13391,

and

    LOR(S = 22) = ln((103 × 1097)/(17 × 2406)) = 1.0161.
The clusters with S ≤ 13 are positively associated with females (the LOR values are negative).
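The log-odds ratios above can be reproduced as follows (our illustration; the counts are those of Table 5):

```python
from math import log

# marginal male and female totals from Table 5
male_total, female_total = 2406, 1097

def lor(males, females):
    """Log-odds ratio of males to females in a cluster, relative to the margin."""
    return log((males / females) / (male_total / female_total))

assert abs(lor(94, 49) - (-0.13391)) < 1e-5   # cluster S = 0
assert abs(lor(103, 17) - 1.0161) < 1e-4      # cluster S = 22
```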
Table 5: The distribution of gender in each cluster

Sum score                        gender
of (1)s    Frequency    male    female    transgender    100 × LOR
 0           143          94      49        0             -13.4
 2             3           2       1        0              -9.2
 3             3           1       2        0            -147.9
 4             7           2       5        0            -170.2
 5             7           3       4        0            -107.3
 6            12           7       5        0             -44.9
 7            20           7      12        1            -132.4
 8            18          10       8        0             -56.2
 9            29          14      14        1             -78.5
10            51          27      23        1             -62.5
11            82          51      30        1             -25.5
12           118          78      39        1              -9.2
13           181         113      66        2             -24.8
14           228         153      71        4              -1.8
15           282         194      87        1               1.7
16           327         218     107        2              -7.3
17           358         252     105        1               9.0
18           440         301     136        3               0.9
19           461         319     136        6               6.7
20           416         275     139        2             -10.3
21           224         182      41        1              70.5
22           120         103      17        0             101.6

Total       3530        2406    1097       27
6. Conclusion
MCA has been a popular, well-established method since the 1970s for analyzing questionnaire surveys of qualitative variables; but it is sensitive to the presence of outliers, which usually form a small fraction of the data. MTCA is a robust L1 variant of MCA.

MCA and MTCA can produce different results, because the geometries underlying the two methods are different. We suggest analyzing a data set by both methods: each method sees the data from its own point of view, and sometimes the views are similar and other times they are not. So MCA and MTCA complement and enrich each other.

Cox (2006) titled his talk In praise of simple sum score. We showed that the first MTCA factor scores can always be interpreted as a simple sum score of first factor successes over all the items.
Vartan Choulakian
Department of Mathematics and Statistics
Université de Moncton
New Brunswick, E1A 3E9, Canada
vartan.choulakian@umoncton.ca
Jacques Allard
Department of Mathematics and Statistics
Université de Moncton
New Brunswick, E1A 3E9, Canada
jacques.allard@umoncton.ca
Biagio Simonetti
Department of Economic and Social Systems Analysis
University of Sannio
Via delle Puglie, 82, 82100, Benevento, Italy
simonetti@unisannio.it