1. Introduction
When working with a sample contingency table, a researcher might need to adjust it based on information available from other sources, such as prior surveys, censuses, or established theories. Often this information comes as marginal information, such as row and/or column totals. For example, consider a data set in which each subject is cross-classified by income (low/high) and urbanity (urban/rural), and marginal information about income and urbanity is available from a census. One would like to adjust the sample data to conform to the desired margins from the census.
For two-way contingency tables of size $r \times c$, four well-known [1,2] margin-adjusting methods for estimating cell probabilities are raking (RAKE), least squares (LSQ), minimum chi-squared (MCSQ) and maximum likelihood under random sampling (MLRS). Assume that a random sample $\mathbf{n} = (n_{11}, \ldots, n_{rc})$ is available from a multinomial $(n, \mathbf{q})$ probability distribution, where $\mathbf{q} = (q_{11}, \ldots, q_{rc})$ is the vector of cell probabilities of the sampled population and $n = \sum_{i,j} n_{ij}$. Let $\mathbf{p} = \mathbf{n}/n$ denote the sample cell proportions. Then RAKE finds the estimates $\hat{\boldsymbol{\pi}}$ that minimize the discrimination information, $\sum_{i,j} \pi_{ij} \log(\pi_{ij}/p_{ij})$, under the marginal constraints
$$\sum_{j=1}^{c} \pi_{ij} = \pi_{i+}, \;\; i = 1, \ldots, r; \qquad \sum_{i=1}^{r} \pi_{ij} = \pi_{+j}, \;\; j = 1, \ldots, c, \tag{1}$$
where $\hat{\boldsymbol{\pi}} = (\hat{\pi}_{11}, \ldots, \hat{\pi}_{rc})$ denotes the estimators of the target cell probabilities $\pi_{ij}$, $i = 1, \ldots, r$, $j = 1, \ldots, c$, and the margins $\pi_{i+}$, $\pi_{+j}$ are known, with $\sum_i \pi_{i+} = \sum_j \pi_{+j} = 1$.
Under the same constraints (1), the methods LSQ, MCSQ and MLRS find the estimates $\hat{\boldsymbol{\pi}}$ that minimize
$$\sum_{i,j} \frac{(\pi_{ij} - p_{ij})^2}{p_{ij}}, \qquad \sum_{i,j} \frac{(p_{ij} - \pi_{ij})^2}{\pi_{ij}}, \qquad \text{and} \qquad \sum_{i,j} p_{ij} \log \frac{p_{ij}}{\pi_{ij}},$$
respectively.
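For concreteness, the following Python sketch (a schematic, not our Fortran implementation; the function names and the use of scipy's SLSQP as a generic constrained optimizer are our own choices) encodes the four discrepancy measures and minimizes each under the marginal constraints (1):

```python
import numpy as np
from scipy.optimize import minimize

def rake_obj(pi, p):   # discrimination information: sum pi * log(pi / p)
    return np.sum(pi * np.log(pi / p))

def lsq_obj(pi, p):    # least squares: sum (pi - p)^2 / p
    return np.sum((pi - p) ** 2 / p)

def mcsq_obj(pi, p):   # minimum chi-squared: sum (p - pi)^2 / pi
    return np.sum((p - pi) ** 2 / pi)

def mlrs_obj(pi, p):   # ML under random sampling: sum p * log(p / pi)
    return np.sum(p * np.log(p / pi))

def margin_adjust(p, row_tot, col_tot, obj):
    """Minimize obj(., p) over a flattened r x c table subject to the marginal
    constraints (1); one column constraint is redundant and is dropped."""
    r, c = len(row_tot), len(col_tot)
    cons = [{"type": "eq",
             "fun": lambda x: np.concatenate([x.reshape(r, c).sum(axis=1) - row_tot,
                                              (x.reshape(r, c).sum(axis=0) - col_tot)[:-1]])}]
    res = minimize(obj, p, args=(p,), constraints=cons,
                   bounds=[(1e-8, 1.0)] * (r * c), method="SLSQP")
    return res.x.reshape(r, c)
```

For a $2 \times 2$ table one would call, e.g., `margin_adjust(p, np.array([0.4, 0.6]), np.array([0.55, 0.45]), rake_obj)` with `p` the flattened sample proportions; the margin values here are illustrative only.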
Instead of given marginal totals, one might like to use restrictions of a more general nature. Consider the survey data [3] from the second National Health and Nutrition Examination Survey (NHANES II). Table 1a shows the sample proportions and corresponding census proportions of $2 \times 2$ contingency tables of income by urbanity, and Table 1b shows the sample proportions and corresponding census proportions of $2 \times 2$ contingency tables of education by urbanity. We observe differences between the census and sample values, possibly due to differences in the target and sampled populations. For example, in the Table 1a census data, the magnitude of the row totals differs from that of the sample data. Similarly, in the Table 1b census data, the off-diagonal entries satisfy an order relation, but in the sample the relation goes in the opposite direction. If such constraints are known a priori (e.g., from a census or other sources), then it is wiser to incorporate them into the analysis while adjusting the sample data.
Much prior work (e.g., [2]) assumed that random samples were taken directly from the target population with known row and column margins ($\pi_{i+}$ and $\pi_{+j}$, respectively). However, in practice, there are situations in which a random sample from the target population is inaccessible. For example, sample units are often too expensive to locate or unwilling to participate in the survey. In this case, to estimate the target cell probabilities, we have to take a random sample from a sampled population that is systematically different from the target population. Clearly, the resulting estimators are typically biased. Researchers in [3] have studied such discrepancies under marginal row and column constraints. A similar problem in a regression context can be found in [4].
It is well known that all four margin-adjusting methods are asymptotically equivalent under simple random sampling. However, their small-sample results can differ. Using simulation methods, [5] found that MCSQ performs best in terms of average root mean squared error, followed by MLRS, RAKE and LSQ. However, for margin adjusting, [3] found that both RAKE and MLRS dominate MCSQ, and that LSQ is inferior to all three methods when the sampled population is systematically different from the target population. In this paper, we consider general linear constraints (not necessarily marginal) under inequality restrictions and study the performance of these four methods. For the simulation (Section 4), we have restricted our attention to $2 \times 2$ tables to facilitate comparison with Little and Wu [3].
4. A Simulation Study
We performed a simulation study to compare the methods in a systematic way. We restrict our attention to $2 \times 2$ tables so that comparison with the equality-constrained results of [3] is facilitated. In contrast to margin-adjusting methods (e.g., [3]), where a single parameter, e.g., $\pi_{11}$, is enough to consider, for inequality constraints one needs to consider all cell probabilities. In this simulation, we have sought the solution of the primal problem itself because the table dimensions ($2 \times 2$) are the smallest possible, and the duality approach does not help much to reduce the necessary computational load.
We have considered two types of inequality restrictions in the simulation: isotonic and nonisotonic (see [7] for definitions). For each of the 16 designs described below, sample sizes $n = 30, 100, 1000$ are considered. Thus, in each of the $16 \times 3 = 48$ cases, for a given $\boldsymbol{\pi}$ as the target population vector, we vary the model parameters and find the sampled-population probability vector $\mathbf{q}$ using (8). Then, we take multinomial random samples from this $\mathbf{q}$ and calculate $\mathbf{p}$. This process is repeated 200 times for each of the 48 cases.
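Schematically, one design-by-sample-size case proceeds as in the following Python sketch (an outline of our Fortran/IMSL workflow; `solve_sampled_probs` and `estimate` are hypothetical stand-ins for the solve of (8) and for the constrained estimation described below):

```python
import numpy as np

rng = np.random.default_rng(2021)  # illustrative seed

def run_case(pi_target, n, solve_sampled_probs, estimate, reps=200):
    """One design x sample-size case: draw `reps` multinomial samples from the
    sampled population and return the adjusted estimates of pi_target."""
    q = solve_sampled_probs(pi_target)    # sampled-population probabilities via (8)
    estimates = np.empty((reps, q.size))
    for k in range(reps):
        p = rng.multinomial(n, q) / n     # sample cell proportions
        estimates[k] = estimate(p)        # constrained estimate (see below)
    return estimates
```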
For isotonic constraints, we use a tree order on the cell probabilities. Two initial choices of the target vector $\boldsymbol{\pi}$ were considered; the results from the second choice are not reported because, performance-wise, they were similar to the first.
For isotonic constraints, closed-form solutions are available for all four methods as follows. The LSQ estimate under the tree order is calculated using the algorithm on page 19 of [7], and MLRS = LSQ. The RAKE and MCSQ values are obtained by least squares projections of suitable transformations of $\mathbf{p}$ onto the constraints of interest, followed by applying the inverses of those transformations (see pages 240 and 278 of [7], respectively).
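To illustrate the isotonic case numerically, the sketch below computes the least squares projection under a tree order as a generic quadratic program; this is a numerical stand-in for the closed-form algorithm of [7] (p. 19), and the particular tree order with cell (1,1) as the root is our assumption for illustration only:

```python
import numpy as np
from scipy.optimize import minimize

def tree_order_lsq(p):
    """Least squares projection of the flattened 2x2 table p onto the tree order
    x[0] <= x[j], j = 1, 2, 3 (cell (1,1) as root -- an assumed ordering),
    keeping the estimated probabilities summing to one."""
    cons = [{"type": "ineq", "fun": lambda x, j=j: x[j] - x[0]} for j in (1, 2, 3)]
    cons.append({"type": "eq", "fun": lambda x: x.sum() - 1.0})
    res = minimize(lambda x: np.sum((x - p) ** 2), p, constraints=cons,
                   bounds=[(0.0, 1.0)] * 4, method="SLSQP")
    return res.x
```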
For nonisotonic constraints, we consider two linear inequality constraints on the cell probabilities, where the pair of bounding constants takes one of two choices (one of them being (0.6, 0.7)). Two choices of the target vector $\boldsymbol{\pi}$ are used.
With the given model parameters and the target probabilities $\boldsymbol{\pi}$, we first determine the sampled-population probabilities $\mathbf{q}$ using the nonlinear equation solver NEQNF of the IMSL Fortran libraries (version 7, Rogue Wave Software, Inc., Louisville, CO, USA). Then, a multinomial random sample of size $n$ is taken from the sampled population using the multinomial random number generator GGMTN in the IMSL subroutine library, and we calculate $\mathbf{p}$.
Next, $\hat{\boldsymbol{\pi}}$ is found for each of the four methods. When $\mathbf{p}$ satisfies the constraints, no adjustment is needed. When there is a violation, the solution is found using the subroutine LCONG of IMSL.
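In Python terms, the role played by LCONG in this step can be sketched as follows (a hypothetical stand-in that reuses the objective functions from the earlier sketch, e.g., `rake_obj`; `constraints` holds whichever linear inequalities are in force):

```python
import numpy as np
from scipy.optimize import minimize

def adjust_if_needed(p, obj, constraints, tol=1e-10):
    """Return p unchanged when it already satisfies the linear inequality
    constraints; otherwise minimize obj(., p) subject to them (LCONG's role)."""
    ok = all(np.all(np.asarray(c["fun"](p)) >= -tol)
             for c in constraints if c["type"] == "ineq")
    if ok:
        return p  # no violation: no adjustment needed
    res = minimize(obj, p, args=(p,), constraints=constraints,
                   bounds=[(1e-8, 1.0)] * p.size, method="SLSQP")
    return res.x
```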
After we find the estimates $\hat{\boldsymbol{\pi}}^{(k)}$, $k = 1, \ldots, 200$, for either type of constraint, we calculate the root mean squared error of the estimates as
$$\mathrm{RMSE} = \sqrt{\frac{1}{200} \sum_{k=1}^{200} \sum_{i,j} \left( \hat{\pi}_{ij}^{(k)} - \pi_{ij} \right)^2},$$
where $\pi_{ij}$ is the true value of the target probability. To provide a more systematic comparison between these four methods, we compute a relative RMSE (RRMSE) defined as
$$\mathrm{RRMSE} = \frac{\mathrm{RMSE} - \mathrm{RMSE}_{\mathrm{ML}}}{\mathrm{RMSE}_{\mathrm{ML}}},$$
where $\mathrm{RMSE}_{\mathrm{ML}}$ is the root mean squared error of the method that is ML under the model that generated the data; that is, $\mathrm{RMSE}_{\mathrm{ML}} = \mathrm{RMSE}_{\mathrm{RAKE}}$ for data generated under the RAKE model, and similarly for each model under its corresponding method.
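Matching the RMSE and RRMSE formulas above (as reconstructed), the bookkeeping reduces to a few lines; here `estimates` is the $200 \times 4$ array of replicated estimates returned by the simulation sketch:

```python
import numpy as np

def rmse(estimates, pi_true):
    """Root mean squared error over replications (rows of `estimates`)."""
    return np.sqrt(np.mean(np.sum((estimates - pi_true) ** 2, axis=1)))

def rrmse(rmse_method, rmse_ml):
    """Relative RMSE against the method that is ML for the generating model."""
    return (rmse_method - rmse_ml) / rmse_ml
```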
Figure 1, Figure 2 and Figure 3 give visual comparisons of the methods under each model for sample sizes $n = 30, 100, 1000$, respectively. In each figure, the horizontal reference line at 0 RRMSE corresponds to the ML estimates under the model that was used to generate the data.
As mentioned earlier, for each sample size a total of 16 designs are considered: designs 1–8 are nonisotonic and designs 9–16 are isotonic; these are listed below. The designs are numbered accordingly on the horizontal axis of each of Figure 1, Figure 2 and Figure 3.
Designs 1–4 are nonisotonic with the first choice of the target vector and bounding constants; designs 5–8 are nonisotonic with the second choice; designs 9–12 are isotonic with the first choice of the target vector; designs 13–16 are isotonic with the second choice.
Overall RMSE of estimators. A crude comparison of the estimators is presented in Table 2, which gives the average RMSEs for each method over the 16 designs, separately for the isotonic and nonisotonic cases. Although the designs differ, this gives some illustration of the performance of the four methods. The RNDM values are obtained when the sample is taken directly from the target population. One would expect these values to be smaller than those generated from the sampled population, but we did not find that to be the case in our simulation study, although the values are quite close.
When the target and sampled populations differ, one would expect the method that is ML under the model that generated the data to have the lowest RMSE. For the nonisotonic cases, RAKE satisfies this property; although MLRS does not, it follows RAKE closely. The RAKE estimates had the lowest RMSE under the LSQ and MCSQ models as well. Thus, RAKE seems to perform best, with MLRS following very closely in each case. For the isotonic cases, however, a different picture emerges. Here, the LSQ estimates had the smallest RMSEs for data generated under the RAKE model. Both the LSQ and MCSQ estimates had the smallest RMSEs when the data were generated under their respective models. For data generated under the MLRS model, the MLRS estimates had slightly higher RMSEs than those of MCSQ.
Figure 1, Figure 2 and Figure 3 present RRMSEs for data generated under each of the four models for all 16 designs, with n = 30, 100, 1000, respectively. To interpret them, first note that smaller values of the constraint bounds correspond to stronger constraints. In addition, a negative value of RRMSE indicates that the bias introduced by model misspecification is offset by a variance lower than that of the method that is ML for the model that generated the data.
Certain reasonable patterns emerge from these figures: estimates based on the correct model dominate the other methods when the sample size is large or when the constraints are isotonic; in those cases, the bias from model misspecification dominates the RMSE. Results for nonisotonic constraints are more homogeneous; for them, the RRMSE of LSQ turned out to be generally larger than that of MLRS.
Panel a of the figures summarizes results for the data generated under the RAKE model. For nonisotonic constraints, RAKE and MLRS performed similarly. For n = 30, 100, LSQ is slightly inferior to the other methods for the nonisotonic constraints under the first choice of bounds but is competitive under the second. RAKE seems to dominate (or come close to the best) and MCSQ performs worst (except when n = 30) across all nonisotonic cases 1–8. RAKE performs slightly worse in the isotonic cases when n = 30, but is best again when n = 100, 1000.
Panel b of the figures summarizes results for data generated under the LSQ model. For all constraints with n = 1000, LSQ and MLRS performed similarly. For n = 30, 100, LSQ is much inferior to MLRS for the nonisotonic constraints under the first choice of bounds but performs similarly under the second. MCSQ performs worst throughout, except for the isotonic constraints with n = 30, where all three methods did better than RAKE; this was reversed when n = 100, 1000.
Panel c of the figures summarizes results for the data generated under the MCSQ model. Although LSQ = MLRS for isotonic constraints, for nonisotonic constraints LSQ performed much worse than MLRS. The MCSQ values were close to the LSQ values for all constraints, except for the isotonic designs 9 and 12 with n = 1000, where MCSQ is far off. RAKE performed competitively with MLRS in the nonisotonic cases. However, for isotonic constraints, RAKE was outperformed by the other three methods for all n.
Panel d of the figures summarizes results for data generated under the MLRS model. Although MCSQ performed best for isotonic constraints for all n, for nonisotonic constraints MCSQ was beaten by all the other methods for n = 100, and by RAKE and MLRS when n = 30. LSQ performed much worse than MLRS in all nonisotonic cases. MLRS performed best for nonisotonic constraints and was close to the best (MCSQ) for isotonic constraints, for all n.