WO2002035442A2

WO2002035442A2 - Composite haplotype counts for multiple loci and alleles and association tests with continuous or discrete phenotypes

Info

Publication number: WO2002035442A2
Application number: PCT/US2001/045393
Authority: WO
Inventors: Dmitri Zaykin
Original assignee: Glaxo Group Limited
Priority date: 2000-10-23
Filing date: 2001-10-22
Publication date: 2002-05-02
Also published as: EP1350212A2; WO2002035442A3

Abstract

Haplotype frequencies for a plurality of individuals are associated with a continuous trait. Each individual includes a pair of chromosomes having a plurality of markers thereon. Each marker has a pair of alleles for an individual. A haplotype includes a combination of alleles for a set of markers on a predetermined chromosome. A subset of markers from the set of markers that may correlate with the continuous trait, is selected. A value of the continuous trait, and a pair of alleles for each of the markers in the subset of markers, is obtained for each individual. For each individual, probabilities of haplotypes that are compatible with the alleles in the subset of markers is determined. Finally, a regression is performed on the probabilities of haplotypes that are compatible with the alleles in the subset of markers, for all the individuals, to determine correlation between the continuous trait and the haplotypes.

Description

COMPOSITE HAPLOTYPE COUNTS FOR MULTIPLE LOCI AND ALLELES AND ASSOCIATION TESTS WITH CONTINUOUS OR DISCRETE PHENOTYPES

Field of the Invention

This invention relates to data processing systems, methods and computer program products, and more particularly to bioinformatic systems, methods and computer program products.

Background of the Invention

Massive amounts of genomic data are now becoming available. The field of bioinformatics has emerged to analyze these data. For example, much literature has been devoted to detecting association of human traits with multiple genetic markers, comparing relative qualities of family-based and case-control designs. See, for example, the publications by Morton et al., entitled Tests and Estimates of Allelic Association in Complex Inheritance, Proc. Natl. Acad. Sci. USA, 1995, pp. 11389-11393; Risch et al., entitled The Relative Power of Family-Based and Case-Control Designs for Linkage Disequilibrium Studies of Complex Human Diseases, I. DNA Pooling, Genome Research, 8, pp. 1273-1288; and Teng et al., entitled The Relative Power of Family-Based and Case-Control Designs for Linkage Disequilibrium Studies of Complex Human Diseases, II. Individual Genotyping, Genome Research, 9(3), pp. 234-241.

Case-control data may not contain complete information about gametic phase, or haplotype, of the individuals. Nevertheless, haplotypes can be useful for fine mapping of disease susceptibility genes, for at least several reasons. First, despite the fact that the haplotype generally is unobservable, in many cases the haplotypes can be reasonably inferred from genotypes. Second, if recombination in the neighborhood of the disease-causing mutation is rare, then the haplotype of the original carrier may remain largely intact for many generations. Thus, a haplotype can be a good surrogate for a disease susceptibility gene.

Third, there appears to be growing evidence that complex traits are governed by complex combinations of genes. See, for example, Bevan et al., Relative Power of Linkage and Transmission Disequilibrium Test Strategies to Detect Non-HLA Linked Coeliac Disease Susceptibility Genes, Gut, 45, pp. 668-671; Barnes, Gene-Environment and Gene-Gene Interaction Studies in the Molecular Genetic Analysis of Asthma and Atopy, Clinical & Experimental Allergy, 29 Suppl. 4, pp. 47-51; El-Gabalawy et al., Association of HLA Alleles and Clinical Features in Patients With Synovitis of Recent Onset, Arthritis & Rheumatism, 42, pp. 1696-1705; Tomer et al., Mapping the Major Susceptibility Loci for Familial Graves' and Hashimoto's Diseases: Evidence for Genetic Heterogeneity and Gene Interactions, Journal of Clinical Endocrinology & Metabolism, 84, 12, pp. 4656-4664; and Wicker et al., Genetic Control of Autoimmune Diabetes in the Nod Mouse, Ann. Rev. Immunol., 13, pp. 179-200. Haplotyping can help to reveal such epistatic effects. Finally, actual haplotypes have been and may continue to be typed using atomic imaging microscopy as described in Woolley et al., Direct Haplotyping of Kilobase-Size DNA Using Carbon Nanotube Probes, Nature Biotechnology, 18, pp. 760-763. However, such data may not be routinely available in the immediately foreseeable future.

In the case of single locus genotypes, gametic phase may be completely determined from genotype and vice versa. In this case, associating haplotypes with disease can be extended to associating genotype with disease, for which there is a large literature. For example, Weir et al., Two-Locus Theory in Quantitative Genetics, Proceedings of the International Conference on Quantitative Genetics, 1977, pp. 247-269, and Nielsen, Detecting Marker-Disease Association by Testing for Hardy-Weinberg Disequilibrium at a Marker Locus, American Journal of Human Genetics, 63, 1998, pp. 1531-1540, disclose models for predicting quantitative traits from genotypes. Sasieni, From Genotypes to Genes: Doubling the Sample Size, Biometrics, 53, 1997, pp. 1253-1261, discusses the case of a binary (e.g., diseased/non-diseased) trait. Sasieni's paper discloses equivalence between two methods for disease association, where one model uses alleles as observations (2n), and the other uses individuals as observations (n).

Chiano et al., Fine Genetic Mapping Using Haplotype Analysis and the Missing Data Problem, American Journal of Human Genetics, 62, 1998, pp. 55-65, and Zhao et al., Model-Free Analysis and Permutation Tests for Allelic Associations, Human Heredity, 50, 2000, pp. 133-139, propose statistical tests for haplotype-disease associations, using inferred haplotypes, in random samples of cases and controls. However, there is often no clear dichotomy of the phenotype, so that individuals in a sample may not be classified into two or several distinct groups. Rather, a continuous trait may be present. There have been many descriptions of haplotype frequency inference when only single-locus genotypes are scored. In these situations, individuals that are heterozygous for more than one locus may convey ambiguous information about the gametic phase. Missing data techniques, such as the E-M technique, formalized by Dempster et al. in Maximum Likelihood From Incomplete Data Via the E-M Algorithm, Journal of the Royal Statistical Society B, 39, 1977, pp. 1-38, may be appropriate. Hill, Estimation of Linkage Disequilibrium in Randomly Mating Populations, Heredity, 33, 1974, pp. 229-239, provides a cubic equation for the maximum likelihood estimate of a gametic frequency for the case of two loci and two alleles. Weir et al., Estimation of Linkage Disequilibrium in Randomly Mating Populations, Heredity, 42, 1979, pp. 105-111, studied possible pitfalls of iterative techniques and described how the likelihood equations should be solved and all real roots examined, for the case of two loci.

Moreover, Little et al., Statistical Analysis With Missing Data, Wiley & Sons, Inc., 1987, pp. 181-182, provided a general description of the E-M technique for count (multinomial) data. Long et al., An E-M Algorithm and Testing Strategy for Multiple-Locus Haplotypes, American Journal of Human Genetics, 56, 1995, pp. 799-810, put the problem into genetic context, and discussed tests for linkage disequilibrium (LD) and higher order interactions, giving elaborate details for carrying out E-M computations in the three locus case. Chiano et al. outlined multiple locus E-M and testing strategies for a binary response (e.g., presence of disease). Excoffier et al., Maximum-Likelihood Estimation of Molecular Haplotype Frequencies in a Diploid Population, Molecular Biology & Evolution, 12, 1995, pp. 921-927; Slatkin et al., Testing for Linkage Disequilibrium in Genotypic Data Using the Expectation-Maximization Algorithm, Heredity, 76, 1996, pp. 377-383; and Fallin et al., Accuracy of Haplotype Frequency Estimation for Biallelic Loci, Via the Expectation-Maximization Algorithm for Unphased Diploid Genotype Data, American Journal of Human Genetics, 67, 2000, pp. 947-959, studied the importance of E-M assumptions and behavior of tests for disequilibrium on estimated frequencies.

Summary of the Invention

Embodiments of the invention associate haplotype frequencies for a plurality of individuals with a continuous trait. Each individual includes a pair of chromosomes having a plurality of markers thereon. Each marker has a pair of alleles for an individual. A haplotype comprises a combination of alleles for a set of markers on a predetermined chromosome.

According to embodiments of the invention, a subset of markers from the set of markers that may correlate with the continuous trait, is selected. A value of the continuous trait, and a pair of alleles for each of the markers in the subset of markers, is obtained for each individual. For each individual, probabilities of haplotypes that are compatible with the alleles in the subset of markers is determined. Finally, a regression is performed on the probabilities of haplotypes that are compatible with the alleles in the subset of markers, for all the individuals, to determine correlation between the continuous trait and the haplotypes.

In embodiments of the invention, a regression is performed by sampling a first haplotype from the haplotypes that are compatible with the individual's set of alleles, from the probabilities of haplotypes that are compatible with the alleles in the subset of markers, for each individual, to thereby define a second haplotype which is determined by the sampling of the first haplotype. The value of the continuous trait for the individual is assigned to both the first haplotype and the second haplotype, to thereby define a doubled sample size. An analysis of variance then is performed, by comparing average values of the trait among the sampled first and second haplotypes for all the individuals. The sampling of a first haplotype, the assigning of the value of the continuous trait and the performing of an analysis of variance are repeatedly performed, to obtain a distribution of correlations of the continuous trait and the haplotype. A value then is determined from the distribution that identifies a significance of the correlation.

According to other embodiments of the invention, the above-described analysis of variance may be performed by defining a design matrix of first and second indicator values having two rows for each individual, where the second indicator value is associated with the first and second haplotypes and remaining positions in the design matrix are set to the first indicator value in the two rows. A regression is then performed on the design matrix, to thereby identify a correlation value between the value of the continuous trait and the first and second haplotypes. The value that is determined from the distribution can be a median that is determined from the distribution that identifies a significance of the correlation.

In other embodiments of the invention, a regression is performed by assigning a rank of significance for each haplotype in the set. For each individual, a first haplotype is sampled from the haplotypes that are compatible with the individual's set of alleles, to thereby define a second haplotype which is determined by the sampling of the first haplotype. The value of the continuous trait for the individual is assigned to both the first haplotype and the second haplotype, to thereby define a doubled sample size. A one degree of freedom regression is performed on the ranks for the sampled first and second haplotypes for all the individuals. The sampling of a first haplotype, the assigning of the value of the continuous trait and the performing of a one degree of freedom regression are repeatedly performed to obtain a distribution of the correlation of the continuous trait in the haplotypes. A value is determined from the distribution that identifies a significance of the correlation. For example, a median may be determined from the distribution.

In yet other embodiments, the one degree of freedom regression may be performed by defining a design matrix having two columns of the ranks of the first and second haplotypes, and having two rows for each individual. A regression is performed on the design matrix, to thereby define a correlation value between the value of the continuous trait and the haplotypes.

In still other embodiments of the present invention, a regression is performed by relating the value of the continuous trait for each individual to a vector of estimated frequencies of all haplotypes. A multiple regression is performed of the trait values on the vectors of estimated frequencies, to thereby determine correlations between the continuous trait and the haplotypes.

In still other embodiments of the present invention, probabilities of haplotypes that are compatible with the alleles in the subset of markers are determined by a haplotype-response association test on unrelated individuals. Additionally, the probabilities that haplotypes are compatible with alleles may be determined by obtaining a composite haplotype.

It will be understood that embodiments of the invention may be provided as systems, methods and/or computer program products.

Brief Description of the Drawings

Figure 1 is a block diagram of data processing systems according to embodiments of the present invention.

Figures 2-6 are flowcharts of methods, systems and/or computer program products according to embodiments of the present invention.

Figures 7A-7J graphically illustrate simulated correlations between continuous traits and haplotypes according to embodiments of the invention.

Figures 8A-8J graphically illustrate simulated correlations between continuous traits and haplotypes according to other embodiments of the invention.

Figures 9A-9C graphically illustrate simulated correlations between traits and haplotypes according to embodiments of the invention.

Figures 10A-10C graphically illustrate the distribution of the difference P_{Λ B} - P,^'._B for three penetrance matrices.

Detailed Description of the Embodiments of the Invention

The present invention now will be described more fully hereinafter with reference to the accompanying drawings, in which embodiments of the invention are shown. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. Like numbers refer to like elements throughout.

The present invention is described below with reference to block diagrams and/or flowchart illustrations of methods, apparatus (systems) and/or computer program products according to embodiments of the invention. It is understood that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, and/or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable data processing apparatus, create means for implementing the functions specified in the block diagrams and/or flowchart block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instructions which implement the function specified in the block diagrams and/or flowchart block or blocks. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented method such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the block diagrams and/or flowchart block or blocks.

It should also be noted that in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the flowcharts. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.

Definitions

As used herein, the following terms have the following meanings:

An "allele" is an alternative form of a gene. Alleles may result from at least one mutation in the nucleic acid sequence and may result in altered mRNAs or polypeptides whose structure or function may or may not be altered. A natural or recombinant gene may have none, one, or many allelic forms. Common mutational changes which can give rise to alleles are generally ascribed to natural deletions, additions, or substitutions of nucleotides. These types of changes may occur alone, or in combination with the others, one or more times in a given sequence.

"Chromosomes" are the self-replicating genetic structures of cells containing the cellular DNA that bears in its nucleotide sequence the linear array of genes.

"Continuous trait" refers to a common detectable phenotypic variation of a particular inherited characteristic in individuals. Examples of continuous traits include blood pressure, testosterone levels, hair count, efficacy of a drug and body mass index.

Continuous traits may be contrasted with binary traits such as diseased/not diseased. The traits have an associated genetic marker.

A "haplotype" is a combination of alleles, which tend to be inherited together. "Haplotype frequencies" refers to the number of occurrences of a haplotype. "Individuals" refer to persons or organisms.

A "marker" is an identifiable physical location on a chromosome whose inheritance can be monitored. Markers can be, for example, a restriction enzyme cutting site, an expressed region of DNA (genes), or any segment of DNA with or without known coding function, whose pattern of inheritance can be determined.

Other terms that are used herein are well known to those having skill in the art and need not be described in detail herein.

Statistical tests for association of haplotype frequencies with continuous traits according to embodiments of the invention now will be described qualitatively and in quantitative detail. Such tests can have improved sensitivity because haplotypes can be stronger predictors of traits when there is lack of recombination. Conditions for asymptotic equivalence of standard regression-based methods with methods that "double the sample size" will be described in the case of known haplotypes. These models then will be extended to the case of inferred haplotypes. Haplotype frequencies can be estimated through expectation-maximization (E-M), and each individual in a sample is expanded into all possible haplotype configurations with corresponding probabilities. Embodiments of the invention then will be confirmed to have type I error control, and also can have excellent power. An application to gene mapping using epidemiologic data with adjacent markers then will be described, showing that embodiments of the invention can be used to improve the efficiency of genome scans by incorporating information from consecutive markers.

Embodiments of the invention can be more computationally efficient than conventional techniques that use time-consuming resampling coupled with numerical optimization at each step. Embodiments of the invention can allow an optimization (haplotype frequencies inference) to be performed only once. Then, each individual in a sample can be expanded into all consistent haplotype configurations with corresponding probabilities, and regression can be used to relate these probabilities to the response. This efficiency can allow embodiments of the present invention to be applied to whole genome scans. Moreover, the present invention allows continuous traits to be studied, while also allowing both discrete and continuous traits to be studied within a unified regression framework.

Qualitative Description

The present invention may be embodied in a data processing system such as illustrated in Figure 1. The data processing system 24 may be configured with computational, storage and control program resources for associating haplotype frequencies for a plurality of individuals with a continuous trait, in accordance with embodiments of the present invention. Thus, the data processing system 24 may be contained in one or more enterprise, personal and/or pervasive computing devices, that may communicate over a network which may be a wired and/or wireless, public and/or private, local and/or wide area network such as the World Wide Web and/or a sneaker network using portable media. Moreover, when integrated into a single computing device, communication may take place via an Application Program Interface (API).

Still referring to Figure 1, embodiments of the data processing system 24 may include input device(s) 52, such as a keyboard or keypad, a display 54, and a memory 56 that communicate with a processor 58. The data processing system 24 may further include a storage system 62, a speaker 64, and an input/output (I/O) data port(s) 66 that also communicate with the processor 58. The storage system 62 may include removable and/or fixed media, such as floppy disks, ZIP drives, hard disks, or the like, as well as virtual storage, such as a RAMDISK. The I/O data port(s) 66 may be used to transfer information between the data processing system 24 and another computer system or a network (e.g., the Internet). These components may be conventional components such as those used in many conventional computing devices, which may be configured to operate as described herein.

The memory 56 may include an operating system to manage the data processing system resources and one or more application programs, including one or more application programs for associating haplotype frequencies for a plurality of individuals with a continuous trait, according to embodiments of the present invention.

Figure 2 is a flowchart of methods, systems and/or computer program products 200 for associating haplotype frequencies with continuous traits according to embodiments of the present invention. It will be understood that these systems, methods and/or computer program products 200 may be stored in the memory 56 of Figure 1 and may execute on the processor 58 of Figure 1. It also will be understood that each individual includes a pair of chromosomes having a plurality of markers thereon. Each marker includes a pair of alleles for an individual. A haplotype comprises a combination of alleles for a set of markers on a predetermined chromosome.

Referring now to Figure 2, at Block 210, a subset of markers is selected from the set of markers that may correlate with the continuous trait. The selection of a subset of markers may be determined empirically and/or theoretically based on available literature, studies and/or other techniques. The selection of a subset of markers that may correlate with the continuous trait is well known to those having skill in the art and need not be described further herein.

At Block 220, for each individual, a value of the continuous trait and the pair of alleles for each of the markers in the subset of markers is obtained. As with Block 210, the obtaining of a value of the continuous trait and the pair of alleles for each of the markers may be obtained through clinical trials or other studies that may involve a control group and a sample group. The obtaining a value of a continuous trait and a pair of alleles for each of the markers in the subset of markers is well known to those having skill in the art and need not be described further herein. It also will be understood by those having skill in the art that the operations of Blocks 210 and 220 may be embodied by storing data in the memory 56 of the data processing system 24, regardless of the source of the data or the manner in which it was obtained or derived. The following Table I illustrates an example of data that may be stored in memory as a result of performing the operations at Blocks 210 and 220. As shown in Table I, data for N individuals is stored.

Table I

For each individual, the value of the continuous trait and the allele numbers for markers 1-L are stored. It will be understood that although the data of Table I is shown in tabular form, it may be stored in any other conventional manner including relational databases and/or linked lists.

Returning again to Figure 2, at Block 230, for each individual the probabilities of haplotypes that are compatible with the alleles in the subset of markers are determined. Then, at Block 240, a regression is performed on the probabilities of haplotypes that are compatible with the alleles in the subset of markers, for all of the individuals, to determine correlations between the continuous trait and the haplotypes.
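The per-individual probability step at Block 230 can be illustrated with a minimal sketch. It assumes that haplotype frequencies have already been estimated (for example by E-M), that the population is in Hardy-Weinberg equilibrium, and that each genotype is given as one unordered allele pair per marker. The function names `compatible_pairs` and `pair_probabilities` and the example frequencies are hypothetical and are not taken from the source.

```python
from itertools import product


def compatible_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of (allele, allele) tuples, one tuple per marker.
    Returns a sorted list of (hap1, hap2) tuples; each haplotype is a tuple
    holding one allele per marker.
    """
    pairs = set()
    # At every marker the two alleles can be assigned to the two haplotypes
    # in either order; try every combination and drop duplicates.
    for choice in product(*[((a, b), (b, a)) for a, b in genotype]):
        h1 = tuple(c[0] for c in choice)
        h2 = tuple(c[1] for c in choice)
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def pair_probabilities(genotype, freqs):
    """Probability of each compatible haplotype pair given the genotype.

    freqs: dict mapping haplotype tuple -> estimated population frequency.
    Assumes Hardy-Weinberg equilibrium (an assumption of this sketch), so a
    pair has weight f(h1) * f(h2), doubled when the two haplotypes differ.
    """
    weights = {}
    for h1, h2 in compatible_pairs(genotype):
        w = freqs.get(h1, 0.0) * freqs.get(h2, 0.0)
        if h1 != h2:
            w *= 2.0  # two possible orderings of a heterozygous pair
        weights[(h1, h2)] = w
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("genotype is incompatible with the given frequencies")
    return {pair: w / total for pair, w in weights.items()}


# Example: an individual heterozygous at both of two biallelic markers.
example_freqs = {(0, 0): 0.4, (1, 1): 0.3, (0, 1): 0.2, (1, 0): 0.1}
example_probs = pair_probabilities([(0, 1), (0, 1)], example_freqs)
```

In this example only two phase configurations are possible, and the configuration built from the more frequent haplotypes receives the larger probability (0.24 / 0.28 against 0.04 / 0.28).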

Figure 3 is a block diagram of operations for performing regression analysis on the probabilities of haplotypes that are compatible with the alleles in the subset of markers, for all the individuals, to determine correlations between the continuous trait and the haplotypes (Block 240 of Figure 2) according to embodiments 240' of the invention. In these embodiments 240', at Block 310, for each individual, a first haplotype from the haplotypes that are compatible with the individual's set of alleles is sampled from the probability distribution determined at Block 230, to thereby define a second haplotype which is determined by the sampling of the first haplotype. At Block 320, the value of the continuous trait for the individual is assigned to both the first haplotype and the second haplotype, to thereby define a doubled sample size. Then, at Block 330, an analysis of variance is performed by comparing average values of the trait among the sampled first and second haplotypes for all the individuals. At Block 340, the operations of Blocks 310, 320 and 330 are repeated a sufficient number of times, to obtain a distribution of correlations of the continuous trait and the haplotypes. When all the haplotypes have been processed, at Block 350, a value is determined from the distribution that identifies a significance of the correlation.

Figure 4 is a block diagram of embodiments of performing an analysis of variance (Block 330 of Figure 3). As shown in Figure 4, at Block 410, an analysis of variance may be performed by defining a design matrix of first and second indicator values (such as 0 and 1) having two rows for each individual, where the second indicator value is associated with the first and second haplotypes and remaining positions in the design matrix are set to the first indicator value in the two rows. A regression is then performed on the design matrix, to thereby identify a correlation value between the value of the continuous trait and the first and second haplotypes.
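The sample, double, analyze and repeat loop of Figures 3 and 4 can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: each individual is represented as a trait value plus a dictionary of compatible haplotype pairs with probabilities, the per-replicate summary is the one-way ANOVA F statistic instead of a p-value (the Python standard library has no F distribution), and the median is used as the value taken from the distribution. All function names are hypothetical.

```python
import random
from statistics import median


def sample_pair(pair_probs, rng):
    """Draw one (first, second) haplotype pair according to its probability."""
    pairs = list(pair_probs)
    weights = [pair_probs[p] for p in pairs]
    return rng.choices(pairs, weights=weights, k=1)[0]


def anova_f(groups):
    """One-way ANOVA F statistic for a dict {haplotype: [trait values]}."""
    values = [v for g in groups.values() for v in g]
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more values than groups")
    grand = sum(values) / n
    means = {h: sum(g) / len(g) for h, g in groups.items()}
    ssa = sum(len(g) * (means[h] - grand) ** 2 for h, g in groups.items())
    sse = sum((v - means[h]) ** 2 for h, g in groups.items() for v in g)
    return float("inf") if sse == 0 else (ssa / (k - 1)) / (sse / (n - k))


def repeated_anova(individuals, n_rep=200, seed=0):
    """Median F statistic over repeated haplotype sampling.

    individuals: list of (trait_value, pair_probs) tuples.
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_rep):
        groups = {}
        for y, pair_probs in individuals:
            h1, h2 = sample_pair(pair_probs, rng)  # sampling (Block 310)
            # Data doubling (Block 320): the same trait value is recorded
            # once for each of the two sampled haplotypes.
            groups.setdefault(h1, []).append(y)
            groups.setdefault(h2, []).append(y)
        if len(groups) > 1:
            stats.append(anova_f(groups))  # analysis of variance (Block 330)
    return median(stats)  # summary of the distribution (Block 350)


# Four individuals whose phase is unambiguous, so every replicate is identical.
example_individuals = [
    (1.0, {("A", "A"): 1.0}),
    (2.0, {("A", "A"): 1.0}),
    (5.0, {("B", "B"): 1.0}),
    (6.0, {("B", "B"): 1.0}),
]
example_median_f = repeated_anova(example_individuals, n_rep=20)
```

Grouping the doubled trait values by sampled haplotype is equivalent to fitting the 0/1 indicator design matrix with two rows per individual, because a regression on class indicators reproduces the class means.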

Figure 5 is a flowchart of other embodiments of performing a regression on the probabilities of haplotypes that are compatible with the alleles in the subset of markers for all the individuals, to determine correlations between the continuous trait and the haplotypes (Block 240 of Figure 2). Embodiments 240" of Figure 5 first assign a rank of significance for each haplotype in the set, at Block 510. Operations corresponding to Blocks 310 and 320 of Figure 3 then are performed. Then, at Block 520, a one degree of freedom regression is performed on the ranks for the sampled first and second haplotypes for all the individuals. At Block 340, the operations of Blocks 310, 320 and 520 are then repeatedly performed for all haplotypes, to obtain a distribution of the correlation of the continuous trait and the haplotypes. Then at Block 350, a value is determined from the distribution that identifies the significance of the correlation.

Referring now to Figure 6, other embodiments of performing regression analysis on the probabilities of haplotypes that are compatible with the alleles in the subset of markers, for all the individuals, to determine correlations between the continuous trait and the haplotypes (Block 240 of Figure 2) are described. As shown in Figure 6, these embodiments 240'" relate the value of the continuous trait of each individual to a vector of estimated frequencies of all haplotypes (Block 610). Then, at Block 620, a multiple regression of the trait values is performed on the vectors of estimated frequencies.
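A minimal sketch of the Figure 6 approach follows. It assumes that the "vector of estimated frequencies" for an individual can be read as the expected count of each haplotype under that individual's pair probabilities, and it fits the multiple regression without an intercept by solving the normal equations directly. Both choices, and the names `expected_dosage` and `least_squares`, are assumptions of this sketch and are not confirmed by the source.

```python
def expected_dosage(pair_probs, haplotypes):
    """Expected number of copies (0 to 2) of each haplotype for one individual.

    pair_probs: dict mapping (hap1, hap2) -> probability of that pair.
    haplotypes: ordered list of all haplotype labels (the regression columns).
    """
    dose = {h: 0.0 for h in haplotypes}
    for (h1, h2), p in pair_probs.items():
        dose[h1] += p
        dose[h2] += p
    return [dose[h] for h in haplotypes]


def least_squares(X, y):
    """Solve the normal equations X'X b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    M = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * v for row, v in zip(X, y))]
        for i in range(k)
    ]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        if abs(M[c][c]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(k):
            if r != c:
                factor = M[r][c]
                M[r] = [a - factor * b for a, b in zip(M[r], M[c])]
    return [M[i][k] for i in range(k)]


# Three individuals with known phase and trait values generated exactly as
# 1 * (copies of A) + 2 * (copies of B).
example_haplotypes = ["A", "B"]
example_X = [
    expected_dosage({("A", "A"): 1.0}, example_haplotypes),
    expected_dosage({("A", "B"): 1.0}, example_haplotypes),
    expected_dosage({("B", "B"): 1.0}, example_haplotypes),
]
example_beta = least_squares(example_X, [2.0, 3.0, 4.0])
```

Because each row of expected counts sums to two, the haplotype columns already span the constant term, which is why no separate intercept column is added here.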

Additional details of embodiments of the present invention as described in Figures 1-6 now will be provided.

Allelic Tests

In the above-cited Sasieni publication, allelic versus genotypic tests for the case-control design and bi-allelic markers were studied. As described therein, a genotypic test for association can operate on a 2 x 3 contingency table of individuals, classified according to their genotypes and the affection status. The total count of such a table is n. An allelic test would operate on a 2 x 2 table of allele counts versus affection status. Thus, each individual would contribute two alleles to the table, and the total count becomes 2n. The test implicitly assumes that the allele counts are binomially distributed, and thus may require that the population is in Hardy-Weinberg Equilibrium (HWE). Sasieni described that Armitage's trend test addresses essentially the same question; however, it does not "double" the data, and therefore can be applied to samples from non-randomly mating populations. Sasieni also provided explicit expressions for odds ratios comparing heterozygous and homozygous cases and argued that the genotypic test is sometimes a better choice, since it allows testing of genotypic effects not explained by alleles, or "dominance deviations". See the above-cited Weir et al. 1977 publication.
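The relationship between the 2 x 3 genotype table (total count n) and the 2 x 2 allele table (total count 2n) can be made concrete with a short sketch. The counts are made up for illustration, the function names are hypothetical, and the uncorrected Pearson chi-square is used here only as a familiar example of a test on the allele table.

```python
def allele_table(genotype_table):
    """Collapse a 2 x 3 genotype table into a 2 x 2 allele table.

    genotype_table: two rows (cases, controls); each row holds the counts of
    individuals with genotypes (AA, Aa, aa). Every individual contributes two
    alleles, so the total count doubles from n to 2n.
    """
    return [[2 * hom_a + het, het + 2 * hom_b] for hom_a, het, hom_b in genotype_table]


def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) for a 2 x 2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Illustrative counts: 100 cases and 100 controls.
example_genotypes = [[30, 50, 20], [20, 50, 30]]
example_alleles = allele_table(example_genotypes)   # [[110, 90], [90, 110]]
example_chi2 = chi_square_2x2(example_alleles)      # 4.0
```

The allele table sums to 400, twice the 200 individuals, which is the doubling that the allelic test relies on and that requires the binomial (HWE) assumption noted above.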

It may not be clear which test should be preferred for the multi-allelic markers, and especially haplotypes, because, for a marker with L alleles, the number of possible genotypes is L(L+1)/2, which may lead to large degrees of freedom tests and sparse tables. The allele-based test, on the other hand, will have L categories. Thus, sparseness may be less of a problem. An intrinsic assumption of allelic tests is that the response can be explained by the allelic "main effects". Then, certain situations, e.g., the two allele case when both homozygotes have the same effects, different from the effect of the heterozygote, may not be detected. Nevertheless, allele-based tests still are sensitive in many cases, if not uniformly most powerful.

A justification for the data-doubling in the case of continuous traits, according to embodiments of the invention (Block 320 of Figures 3, 5 and 6) now will be provided. Let $X$ denote a gamete that may take any one of $L$ values; $X = j$ denotes that the gamete takes the $j$th allelic value ("$j$" is a label for the particular allele). The gametes may be single locus, in which case the values of $X$ are called alleles. When $X$ is multi-locus they are called haplotypes. Individual $i$ has two gametes, $X_{i1}$ and $X_{i2}$, and the genotype of individual $i$ is denoted $(X_{i1}, X_{i2})$. Individual $i$ also has an associated phenotype, $Y_i$.

Consider the 2n-dimensional linear model relating responses to gametic phase:

$$\mathbf{Y}_2 = A\alpha + \varepsilon \qquad (1)$$

Here $A' = (A_{11}', A_{12}', A_{21}', A_{22}', \ldots, A_{n1}', A_{n2}')$, where $A_{ij}$ is the $1 \times L$ allele indicator vector for gamete $j$ of subject $i$. For example, if $X_{ij} = 2$, then $A_{ij} = (0\ 1\ 0\ \ldots\ 0)$, indicating that gamete $j$ has allelic class 2. It will be understood that labeling of gametes as either 1 or 2 within an individual may be arbitrary. Let the elements of $\mathbf{Y}_2$ denote corresponding phenotypic observations: $\mathbf{Y}_2' = (Y_1, Y_1, Y_2, Y_2, \ldots, Y_n, Y_n)$, so that the data are doubled.

Equation (1) is an Analysis of Variance (ANOVA) model relating response to allele class. One may test for no effect of allele class on phenotype using the F distribution with degrees of freedom L-1 and 2n-L. Here F = {SSA/(L-1)}/{SSE/(2n-L)}, with

SSA = Y_2'(A(A'A)^{-1}A' - J_{2n x 2n}/(2n))Y_2 and

SSE = Y_2'(I_{2n} - A(A'A)^{-1}A')Y_2,

and where J_{a x b} denotes the (a x b) matrix of 1's and I_a denotes the (a x a) identity matrix. The F test may be suspect because the response variable has been doubled. In other words, it may appear like "cheating" to artificially double the sample size. However, the F statistic can be shown to be equivalent, under the null hypothesis and HWE, to that of the following n-dimensional regression model, and the data doubling is therefore valid.
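The F statistic of Equation (1) may be illustrated with a minimal sketch. Because A is a matrix of allele indicators, the quadratic forms SSA and SSE reduce to the usual one-way ANOVA between-group and within-group sums of squares over the 2n doubled observations. The function name `doubled_anova_f` and the toy data are hypothetical, and the sketch assumes that every one of the L alleles is observed at least once (so that A'A is invertible).

```python
from collections import defaultdict

def doubled_anova_f(genotypes, y, num_alleles):
    """One-way ANOVA F statistic on the 2n doubled observations of Equation (1).

    genotypes: list of (allele1, allele2) pairs, one pair per individual
    y:         list of n trait values, one per individual
    Returns (F, SSA, SSE).
    """
    groups = defaultdict(list)            # allele label -> doubled responses
    for (a1, a2), yi in zip(genotypes, y):
        groups[a1].append(yi)             # gamete 1 carries the individual's Y
        groups[a2].append(yi)             # gamete 2 carries the same Y again
    n2 = 2 * len(y)
    grand_mean = sum(sum(v) for v in groups.values()) / n2
    # Between-allele sum of squares: sum_j n_Aj * (allelic average - grand mean)^2
    ssa = sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values())
    # Within-allele sum of squares around each allelic average
    sse = sum((yi - sum(v) / len(v)) ** 2 for v in groups.values() for yi in v)
    f_stat = (ssa / (num_alleles - 1)) / (sse / (n2 - num_alleles))
    return f_stat, ssa, sse

f_stat, ssa, sse = doubled_anova_f([(1, 1), (1, 2), (2, 2), (1, 2)],
                                   [1.0, 2.0, 4.0, 3.0], num_alleles=2)
print(f_stat, ssa, sse)
```

In this toy example SSA and SSE add up to the total sum of squares of the doubled response vector, as expected for a one-way layout.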

An alternative model to Equation (1) with similar asymptotic properties and well-known finite-sample properties now will be described. The model is an n-dimensional regression model

Y = Dβ + ε (2)

where Y_i = trait value for individual i, D' = (D_1', D_2', ..., D_n'), D_i = (D_i1, D_i2, ..., D_iL), and where

D_ij = 2 if the ith individual is homozygous for allele j,

D_ij = 1 if the ith individual is heterozygous including allele j,

D_ij = 0 otherwise.

There is some correspondence between Equations (1) and (2) in that A_i1 + A_i2 = D_i. However, it is also clear that Equation (2) may have the usual validity (or lack thereof, in cases of lack of fit) of standard regression models, whereas Equation (1) may seem outrageous since the observations are simply doubled. Nevertheless, it will be shown that these models can produce equivalent F statistics when HWE holds.
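The correspondence between the indicator vectors of Equation (1) and the rows of D in Equation (2) may be sketched as follows. The helper names `allele_indicator` and `dosage_row` are hypothetical and alleles are assumed to be labeled 1..L.

```python
def allele_indicator(allele, num_alleles):
    """1 x L indicator vector A_ij for a gamete carrying `allele` (1-based label)."""
    return [1 if k == allele else 0 for k in range(1, num_alleles + 1)]

def dosage_row(genotype, num_alleles):
    """Row D_i of Equation (2), formed as the element-wise sum A_i1 + A_i2."""
    a1 = allele_indicator(genotype[0], num_alleles)
    a2 = allele_indicator(genotype[1], num_alleles)
    return [u + v for u, v in zip(a1, a2)]

print(dosage_row((2, 2), 3))  # homozygous for allele 2 -> [0, 2, 0]
print(dosage_row((1, 3), 3))  # heterozygous 1/3        -> [1, 0, 1]
```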

To understand potential lack of fit, note that Equation (2) may correspond to that of the Weir et al. 1977 publication and the above-cited Nielsen et al. publication, in the case of no "dominance" effects. Specifically, these publications assume the mean phenotypic response for genotype (j,k) is μ_jk = μ + V_j + V_k + d_jk, where

Σ_j p_j V_j = 0, and (3)

Σ_j p_j d_jk = Σ_k p_k d_jk = 0,

the p_j denoting population allele frequencies. Equation (2) is exactly Equation (3) with d_jk ≡ 0.

Equation (2) may lack sensitivity in cases of dominance effects (d_jk ≠ 0). However, in the case of larger L (anticipating the case where the alleles are multi-locus haplotypes), the test for {H0 : V_j ≡ 0 and d_jk ≡ 0} may lose power because of the large numerator degrees of freedom (L(L+1)/2 - 1). In this case, the additive Equation (2) may be preferable despite possible lack of fit. In the additive model, the relation 2β_j = μ + 2V_j applies and in this case the test of H0 : β_1 = β_2 = ... = β_L is an L-1 numerator degrees of freedom test of no effect of allele on response. The F test uses:

F_1 = {SSA_1/(L-1)}/{SSE_1/(n-L)}, with

SSA_1 = Y'(D(D'D)^{-1}D' - J_{n x n}/n)Y and

SSE_1 = Y'(I_n - D(D'D)^{-1}D')Y

The statistics F and F_1 may be equivalent under the null hypothesis and F_1 dominates F under the alternative, under HWE. In particular, it will be shown below that:

F_1 - F →_p 0 (4)

when genotype has no effect on trait, and that:

F_1/F →_p c > 1 (5)

when V_α = Σ_j p_j α_j² - (Σ_j p_j α_j)² > 0. These results establish validity of the 2n-model Equation (1) under the null hypothesis. However, because the n-model Equation (2) does not require the HWE assumption and has an asymptotically larger F value under the alternative, it may be preferred. Nevertheless, the equivalence of the two approaches is useful for developing methodologies in the more complicated situations where alleles (specifically, multi-locus alleles, or haplotypes) are unobservable, as will now be shown.

Determining Probabilities

Operations for determining probabilities of haplotypes that are compatible with the alleles in the subset of the markers (Figure 2, Block 230) now will be described in detail. In Equations (1) and (2), the "alleles" can denote multi-locus haplotypes rather than single-locus alleles. In this case the parameter V_j refers to the main effect of haplotype j. The haplotypes are generally unobservable, and therefore missing data methods may be used for their estimation. Consider two basic types of models. One is like the ANOVA model Equation (1), but where the A_ij are generated at random from a distribution inferred through the observed single-locus genotypes, and results are then averaged over random haplotype generations. The second basic type is like the regression model Equation (2), where instead of using actual haplotype frequencies (0, 1, 2) for person i, the expected haplotype frequencies (given the observed single-locus genotypes) are used.

Expectation-Maximization (E-M) techniques for inferring haplotype frequencies have been described previously. See the above-cited publications by Hill, Weir (1996), Long et al., Slatkin et al. and Chiano et al. However, a review of E-M will be provided because it can be used in embodiments of the invention.

First, initial haplotype frequencies f are sampled from a symmetric Dirichlet distribution, Dir(γ, γ, ..., γ), where the dimension of the distribution is determined by the number of haplotypes compatible with the genotypes in the sample (Block 310). Values of γ>1 result in more similar initial frequencies. The value used is γ=1 (multivariate uniform distribution). Each person's multilocus genotype (of L loci) is expanded into a set of possible haplotypes. This may be done by generating 2^L/2 = 2^(L-1) vectors of zeros and ones, indicating whether the first or the second allele of an individual at the current locus should be taken. (This does not assume that loci are bi-allelic.) If L=3, the set is 000, 010, 100, and 110. This "primary" set of haplotypes assumes its complement, 111, 101, 011, and 001, and the maximum overall number of haplotypes compatible with a given genotype is 2^L. The above expansion only needs to be performed once, at the E-M initialization stage. Next, haplotype frequencies (real values) are mapped to corresponding vectors of possible haplotypes (vectors of integers) for each possible haplotype in a sample (Block 320). Long et al. used L-dimensional array indices for the mapping. However, the mapping may be more conveniently implemented through associative arrays, such as the generic "map" from the C++ Standard Template Library. This can make the algorithm completely general with respect to the value of L.
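The 0-1 expansion of a multilocus genotype into its primary and complementary haplotype sets may be sketched as follows. The function name `expand_genotype` and the representation of a genotype as a list of allele pairs are assumptions made for illustration. The last bit of each mask is held at zero, which reproduces the ordering 000, 010, 100, 110 given above for L=3.

```python
from itertools import product

def expand_genotype(genotype):
    """Expand an unphased genotype into (primary, complement) haplotype pairs.

    genotype: list of (allele1, allele2) tuples, one per locus.
    Returns a list of 2**(L-1) pairs of haplotypes (tuples of alleles).
    """
    num_loci = len(genotype)
    pairs = []
    # Hold the last bit at 0 so that each mask/complement pair is listed once.
    for bits in product((0, 1), repeat=num_loci - 1):
        mask = bits + (0,)
        primary = tuple(locus[b] for locus, b in zip(genotype, mask))
        complement = tuple(locus[1 - b] for locus, b in zip(genotype, mask))
        pairs.append((primary, complement))
    return pairs

# Two-locus genotype A1/A1 B1/B3: masks 00 and 10, complements 11 and 01.
for primary, complement in expand_genotype([("A1", "A1"), ("B1", "B3")]):
    print(primary, complement)
```

Because the alleles are carried as arbitrary labels, the sketch does not assume bi-allelic loci.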

Denote frequencies of haplotypes from the primary and complementary sets by f_i and f_i^c. Per-subject probabilities are then updated, under assumption of HWE, as follows:

P_j = Σ_{i ∈ G_j} f_i f_i^c, where |G_j| = 2^(L-1).

The sample log-likelihood is:

log L = Σ_{j=1}^{n} ln P_j,

summing over all n individuals. Then the f_i and f_i^c frequencies are updated. For each f_i and f_i^c, the updates f_i' and f_i^c' are computed using the counts m_ij and m_ij^c, where m_ij, m_ij^c are the numbers of times that the haplotype i (or its complement) was counted as compatible with the j-th individual's genotype. For example, if L=2, the 0-1 expansion is 00, 10 for the primary set, and 11, 01 for the complement. If the j-th individual had the A_1/A_1 B_1/B_3 genotype, that would translate into A_1B_1, A_1B_1 haplotypes for the primary set, and A_1B_3, A_1B_3 for the complementary set. For updating the haplotype A_2B_3, m_ij is zero, but if the haplotype is A_1B_3, then m_ij is equal to two.

The updating process (iteration) continues until the difference between subsequent sample log-likelihoods is sufficiently small. Several re-runs, starting from the random initialization of the haplotype frequencies, may be needed to avoid local convergence, and the run with the highest log-likelihood should be taken. Estimated haplotype counts are given by the final values of 2nf_i, 2nf_i^c.
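The E-M iteration reviewed above may be sketched as follows. This is a minimal illustration and not necessarily the exact update used in embodiments of the invention: the update shown is the standard E-M step in which each compatible phase of an individual receives a posterior share proportional to the product of its two haplotype frequencies, which is an assumption here. The function names, the dictionary used as the associative array, and the toy data are hypothetical.

```python
import math
import random
from itertools import product

def expand(genotype):
    """List the 2**(L-1) (primary, complement) haplotype pairs of one genotype."""
    pairs = []
    for bits in product((0, 1), repeat=len(genotype) - 1):
        mask = bits + (0,)
        pairs.append((tuple(loc[b] for loc, b in zip(genotype, mask)),
                      tuple(loc[1 - b] for loc, b in zip(genotype, mask))))
    return pairs

def em_haplotype_frequencies(genotypes, max_iter=500, tol=1e-10, seed=0):
    rng = random.Random(seed)
    expansions = [expand(g) for g in genotypes]       # performed once, at initialization
    haps = sorted({h for pairs in expansions for pair in pairs for h in pair})
    draws = [rng.expovariate(1.0) for _ in haps]      # normalized: Dirichlet(1,...,1) start
    total_draw = sum(draws)
    freq = {h: d / total_draw for h, d in zip(haps, draws)}   # associative array
    prev_loglik = None
    for _ in range(max_iter):
        expected = dict.fromkeys(haps, 0.0)
        loglik = 0.0
        for pairs in expansions:
            weights = [freq[a] * freq[b] for a, b in pairs]
            p_subject = sum(weights)                  # per-subject probability, up to a constant
            loglik += math.log(p_subject)             # assumes p_subject has not underflowed to 0
            for (a, b), w in zip(pairs, weights):
                expected[a] += w / p_subject          # posterior share of this phase
                expected[b] += w / p_subject
        freq = {h: c / (2 * len(genotypes)) for h, c in expected.items()}
        if prev_loglik is not None and abs(loglik - prev_loglik) < tol:
            break
        prev_loglik = loglik
    return freq, loglik

toy = [[("A", "A"), ("B", "B")], [("A", "a"), ("B", "B")], [("a", "a"), ("b", "b")]]
freq, loglik = em_haplotype_frequencies(toy)
print(freq)
```

In practice the function would be called with several different seeds, keeping the run with the highest log-likelihood, in line with the restart strategy described above.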

Regression

The performance of regression on the probabilities of haplotypes that are compatible with the alleles in the subset of markers (Block 240) now will be described in detail.

As noted previously, an F test and its p-value are available in the case of known haplotypes. However, the actual haplotypes generally are not available, only an estimate of the probability distribution of their values. Thus, the distribution of likely values of the "true p-value" is estimated and a "likely" p-value may be picked as follows:

1. For each individual i=1,...,n, a pair of compatible haplotypes is randomly drawn from the distribution specified by the compatible haplotype frequencies (Blocks 310 and 320);

2. A model specified by (1) is formed, and a test statistic (F) for the importance of including the genotype is calculated (Block 330);

3. Operations 1 and 2 are repeated many times, and the p-value from each run is saved (Block 340); then

4. The final p-value is given by the median of the distribution of p-values (Block 350).

This embodiment is referred to as "embodiment 1", which is described in Figure 3.

It is possible to greatly increase power of the test if some of the vector haplotype indicators A_i1, A_i2 can be replaced by corresponding scalar rank scores R_i1, R_i2 (in order of "importance"), based on prior tests or biological knowledge (Block 510). In that case the model is formed as

Y_i = μ + R_ij β + ε_ij

This embodiment is referred to as "embodiment 2", which is described in Figure 5. Unlike embodiment 1, this embodiment can concentrate the effect into a test with a single degree of freedom (Block 520), and can have much greater power when the rank scores are chosen well.

As described in Figure 6, yet another embodiment (referred to as "embodiment 3") is to perform a multiple regression, based on n observations instead of 2n, (Block 620) directly on the set of per-person expected haplotype frequencies (Block 610). This embodiment is motivated by Equation (2), where the traits are regressed on the observed frequencies. If all elements in the matrix D in Equation (2) are divided by two, then they can be considered as probabilities for the individuals to have a particular allele. In the single-locus case, the identification of alleles may be certain, and so 0, 0.5, and 1 generally are the only values possible. In the case of E-M inferred haplotypes the corresponding model is:

Y_i = Σ_{j=1}^{L} f_ij β_j + ε_i

Frequencies for haplotypes incompatible with the ith individual's single-locus genotypes are set to zero. Also, haplotypes with expected counts that are less than one are removed from consideration. The usual F test of H0 : β_1 = ... = β_L provides a test for effect of haplotypes on trait, and individual haplotype effects are tested as H0 : β_j = 0. The test can be made more robust by permuting the vector (Y_1,...,Y_n) independently of the haplotype frequency data. The final p-value is the proportion of permutations that yield an F-statistic p-value that is no larger than the original F-statistic p-value. However, as indicated by simulation results below, the asymptotic p-values are valid in most situations.

Additional theoretical details now will be provided. In particular, 2n may be corresponded with n by labeling each of the n individuals by i, wherein i ∈ S = {1, 2, ..., n}. Partition S disjointly and exhaustively into sets N_jk so that S = ∪_{j=1}^{L} ∪_{k=j}^{L} N_jk, where N_jk is the set of individuals with genotype (j, k). Define N_kj = N_jk and let n_jk = |N_jk| denote the number of individuals out of n having genotype (j, k).

Consider a sequence of samples for which n → ∞, under random sampling from an infinitely large population of individuals. In this case:

n_jk/n = p_jk + o_p(1) (6)

where p_jk is the population proportion of individuals with genotype (j, k), and where "o_p(1)" denotes a term that converges to 0 in probability. If the population is in HWE,

p_jj = p_j² (7) and

p_jk = 2p_jp_k, j ≠ k, (8)

where p_j denotes the population proportion of alleles with type j. In the sample of n individuals there are 2n gametes. Of these 2n, there are n_Aj = 2n_jj + (n_j1 + ... + n_j,j-1 + n_j,j+1 + ... + n_jL) gametes having allele j. Under random sampling and HWE,

n_Aj/(2n) = p_j + o_p(1) (9)

Equations (6)-(9) concern the behavior of the n_jk and the p_j. Now consider the Y_i under embodiment 3. Assume that Y_i = μ_jk + ε_i for i in set N_jk, where the ε_i are independent with Var(ε_i) = σ² > 0.

Note that the sum of the Y_i corresponding to occurrences of allele j is the "allelic sum"

2 Σ_{i ∈ N_jj} Y_i + Σ_{k≠j} Σ_{i ∈ N_jk} Y_i,

and the corresponding "allelic average", denoted Ȳ_Aj, is this sum divided by n_Aj.

Using this model, it can be seen that:

and that n^{1/2}(Ȳ_A - μ_A) →_d U (11) where Ȳ_A denotes the vector of allelic averages, where U denotes a multivariate normal L-vector, and where →_d denotes convergence in distribution. Also,

Y'Y/n→_pμ'- oo (12) and

Proof of Equation (4) now will be provided. Note that:

F_1 - F = MSA_1/MSE_1 - MSA/MSE = (MSA_1·MSE - MSA·MSE_1)/{(MSE_1)(MSE)}

Therefore, if it can be shown that

MSA - MSA_1 = o_p(1), (14)

MSE - MSE_1 = o_p(1), (15) and

MSA_1 →_d Q, (16)

for some random variable Q, then F_1 - F = o_p(1) by Equation (13), and the result will be proven. Hence, Equations (14), (15) and (16) need to be demonstrated.

The condition that genotype has no effect on trait implies that μ_jk ≡ μ, and for the remainder of the proof of Equation (4) assume without loss of generality that μ = 0, since all quadratic forms in Equations (14)-(16) are invariant to μ. To verify Equation (14), it will suffice to show SSA - SSA_1 = o_p(1). Both SSA and SSA_1 are first expressed as quadratic forms in Ȳ_A, and the difference of the defining matrices is then examined.

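Considering SSA first, note that (A'A)^{-1}A'Y_2 = Ȳ_A. Letting D_A = A'A = diag{n_Aj}, it follows that

Y_2'A(A'A)^{-1}A'Y_2 = Ȳ_A'D_AȲ_A

and that

Y_2'J_{2n x 2n}Y_2/(2n) = Ȳ_A'D_AJ_{L x L}D_AȲ_A/(2n).

Thus,

SSA = Ȳ_A'[D_A - D_AJ_{L x L}D_A/(2n)]Ȳ_A = 2(n^{1/2}Ȳ_A)'A_n(n^{1/2}Ȳ_A), (17)

where A_n = [D_A - D_AJ_{L x L}D_A/(2n)]/(2n). By Equation (9):

A_n = P - pp' + o_p(1),

where P = diag{p_j} and p' = (p_1,..., p_L). Now considering SSA_1 = Y'[D(D'D)^{-1}D' - J_{n x n}/n]Y, note that D'Y = D_AȲ_A, so that:

SSA_1 = Ȳ_A'[D_A(D'D)^{-1}D_A - D_AJ_{L x L}D_A/(4n)]Ȳ_A = 2(n^{1/2}Ȳ_A)'B_n(n^{1/2}Ȳ_A), (18)

where B_n = [D_A(D'D)^{-1}D_A - D_AJ_{L x L}D_A/(4n)]/(2n).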

To find the limit of B_n, obtain the limit of D'D/(2n), which may be expressed as:

D'D/(2n) = P + pp' + o_p(1)

under Equations (6)-(9). Since the elements of the inverse of a matrix are continuous functions of the elements of the original matrix, and since P + pp' is invertible (positive definite in fact), [(D'D)/(2n)]^{-1} = (P + pp')^{-1} + o_p(1) = P^{-1} - J_{L x L}/2 + o_p(1). Thus,

B_n = P(P^{-1} - J_{L x L}/2)P - pp'/2 + o_p(1) = P - pp' + o_p(1),

hence A_n - B_n = o_p(1). Using this result and Equation (11), SSA - SSA_1 = 2(n^{1/2}Ȳ_A)'(A_n - B_n)(n^{1/2}Ȳ_A) = o_p(1),

and Equation (14) is verified. To show Equation (15), note that

MSE = (Y_2'Y_2 - Ȳ_A'D_AȲ_A)/(2n-L) = (2Y'Y - Ȳ_A'D_AȲ_A)/(2n-L).

Also,

MSE_1 = (Y'Y - Ȳ_A'D_A(D'D)^{-1}D_AȲ_A)/(n-L).

Using Equation (11), as well as convergence results for (D'D)/(2n) and D_A/(2n) given above, it follows that

MSE - MSE_1 = o_p(1) + Ȳ_A'[2(P - pp'/2) - P + o_p(1)]Ȳ_A = o_p(1) + Ȳ_A'[P - pp']Ȳ_A = o_p(1),

since Ȳ_A →_p 0, and Equation (15) is proven.

To show Equation (16), use Equation (18). The result of Equation (16) follows by noting that n^{1/2}Ȳ_A converges in distribution and that the elements of B_n converge in probability, and Equation (4) is finally proven.
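The matrix identity (P + pp')^{-1} = P^{-1} - J/2 that is used in the proof above holds whenever the allele frequencies sum to one, and may be checked with exact rational arithmetic. The following sketch is illustrative only; the function name `check_inverse_identity` and the example frequencies are hypothetical.

```python
from fractions import Fraction

def check_inverse_identity(p):
    """Return True if (P + pp')(P^-1 - J/2) is the identity, with P = diag(p)."""
    size = len(p)
    # m = P + pp'
    m = [[(p[j] if j == k else 0) + p[j] * p[k] for k in range(size)]
         for j in range(size)]
    # inv = P^-1 - J/2, the claimed inverse
    inv = [[(1 / p[j] if j == k else 0) - Fraction(1, 2) for k in range(size)]
           for j in range(size)]
    prod = [[sum(m[j][t] * inv[t][k] for t in range(size)) for k in range(size)]
            for j in range(size)]
    return all(prod[j][k] == (1 if j == k else 0)
               for j in range(size) for k in range(size))

# Three alleles with frequencies 1/2, 3/10 and 1/5 (summing to one).
print(check_inverse_identity([Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]))  # True
```

The identity fails when the frequencies do not sum to one, which is consistent with its derivation from p'P^{-1}p = 1.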

Proof of Equation (5) now will be provided. Note from Equation (18) that

SSA_1/(2n) = Ȳ_A'B_nȲ_A.

From Equation (17),

SSA/(2n) = Ȳ_A'A_nȲ_A →_p V_α.

Hence, SSA_1/SSA →_p 1. Now consider

MSE/MSE_1 = (MSE - MSE_1)/MSE_1 + 1.

From Equations (10), (13) and (7),

(MSE - MSE_1)/MSE_1 →_p V_α/σ²,

implying MSE/MSE_1 = V_α/σ² + 1 + o_p(1), and the result is proven.

Simulations

Simulation experiments were conducted using actual programs, by running them multiple times in a UNIX shell script loop, together with programs simulating the data sets.

Type-I error rate has been studied using one, two, three and five bi-allelic markers to infer haplotype frequencies. All three embodiments (Figures 3, 5 and 6) can provide good size of the test even for small sample sizes. Between 5,000 and 10,000 simulations were used for calculating expected proportions of rejections, with 10 restarts and 1,000 samples for probability sampling (for calculating the median in embodiment 1).

A more detailed study of the type I error was conducted with embodiment 3, because of its general nature and because it may be preferred over embodiment 1 from the asymptotic point of view.

First, the observation by Fallin et al., Accuracy of Haplotype Frequency Estimation for Biallelic Loci, Via the Expectation-Maximization Algorithm for Unphased Diploid Genotype Data, American Journal of Human Genetics, 67, pp. 947-959, that the E-M performance is not adversely affected by increasing the amount of linkage disequilibrium was confirmed. Results were compared for two and three markers, simulating samples with unlinked markers, as well as Dirichlet-derived samples with markers in complete and intermediate linkage disequilibrium. In the case of unlinked markers, alleles were sampled from the symmetric Dirichlet(1) distribution, thus covering a wide spectrum of frequencies. In addition, these simulations were repeated for two markers showing excess and deficit of heterozygotes to the extent that could be detected 60% of the time in samples of 100 individuals. A normally distributed response, N(0, 1), was used with two markers and a sample size of 50, as well as a binary response, Bernoulli(0.5), with three markers and a sample size of 50. The use of the binary response is motivated by the asymptotic equivalence of the F test for the one-way ANOVA (from embodiment 1, and thus, from embodiment 3 as well) and the χ² test for contingency tables. See D'Agostino, Relation Between the Chi-Squared and ANOVA Tests for Testing Equality of k Independent Dichotomous Populations, The American Statistician, 26, pp. 30-32. The 5% nominal error rate was not exceeded, with very close correspondence of empirical and nominal rates when using the normal response.

Next, markers were allowed to be unlinked and the response to follow different models, including binary, Gamma(10,5) distributed, Normal(0,1), a mixture of two normals (N(0,1) and N(5,25)), and truncated normal. The truncated normal distribution (N(0,25), truncated at -5 and 5) was generated using the following procedure. Let f(.) be the normal (μ, σ²) density and F(.) its corresponding CDF. Define the random variable X ~ f(x). Then the quantity

Pr(a < X < x)/Pr(a ≤ X ≤ b) = ∫_a^x f(y)dy/{F(b) - F(a)} = {F(x) - F(a)}/{F(b) - F(a)}

has a uniform (0,1) distribution. Solving for X can provide a way for generating the truncated normal random variable from a uniform random number, u, as:

X = F^{-1}[u(F(b)-F(a))+F(a)]
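The inverse-CDF procedure above may be sketched with the Python standard library, which provides the normal CDF and its inverse through `statistics.NormalDist`. The function name `truncated_normal` is hypothetical. Note that N(0,25) has variance 25, so the standard deviation passed below is 5.

```python
import random
from statistics import NormalDist

def truncated_normal(mu, sigma, a, b, rng=random):
    """Draw one value from N(mu, sigma^2) truncated to [a, b] by CDF inversion."""
    dist = NormalDist(mu, sigma)
    fa, fb = dist.cdf(a), dist.cdf(b)
    u = rng.random()                          # uniform random number on [0, 1)
    # X = F^{-1}[u (F(b) - F(a)) + F(a)]; inv_cdf requires an argument strictly
    # inside (0, 1), which holds here because F(a) > 0 for these bounds.
    return dist.inv_cdf(u * (fb - fa) + fa)

rng = random.Random(1)
sample = [truncated_normal(0.0, 5.0, -5.0, 5.0, rng) for _ in range(1000)]
print(min(sample), max(sample))
```

Every draw falls between the truncation points, and no rejection step is needed.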

As before, these simulations were repeated for situations with excess and deficit of heterozygotes that could be detected 60% of the time in samples of 100 individuals.

The error rates are given in Tables II-XIII, indicating good correspondence between declared 5% and observed error rates.

Table II. Type I error under HWE and the window size=1

Table III. Type I error under excess of heterozygotes and the window size=1

Table IV. Type I error under deficit of heterozygotes and the window size=1

Table V. Type I error under HWE and the window size=2

Table VI. Type I error under excess of heterozygotes and the window size=2

Table VII. Type I error under deficit of heterozygotes and the window size=2

Table VIII. Type I error under HWE and the window size=3

Table IX. Type I error under excess of heterozygotes and the window size=3

Table X. Type I error under deficit of heterozygotes and the window size=3

Table XI. Type I error under HWE and the window size=5

Table XII. Type I error under excess of heterozygotes and the window size=5

Table XIII. Type I error under deficit of heterozygotes and the window size=5

All tests in the power studies were performed at the 5% level. First, the situation where 3-locus haplotypes were assigned one of four additive effects in the presence or absence of HWD in both directions was examined. As before, a 60% probability of detecting these deviations in samples of size 100 was allowed for. Fifty individuals were sampled and the F-test for Equation (3) was calculated, averaging over 10,000 simulation experiments. A decrease in power from 67% to 61% under excess of heterozygotes was shown, and an increase in power to 98% under deficit of heterozygotes. With a larger sample size (100 individuals), the difference becomes smaller (94%, 93% and 99%, respectively), since the amount of ambiguity in the haplotype frequency inference increases with the proportion of multiple heterozygotes.

A realistic, but time-consuming simulation experiment also was performed, in which a large population was slowly mixed for a few generations into a smaller, expanding population (up to 10,000 individuals), followed by genetic drift for several more generations. Simulation of 2,000 bi-allelic markers randomly placed in a 10 cM region was performed, allowing discrete generations and random mating. Recombination followed a Poisson(0.1) process, so that positions of recombination events during meiosis were uniformly distributed along the chromosomal region, with the mean number of recombination events equal to 0.1 per gamete. A single bi-allelic disease gene was placed before the 1000th marker. A continuous response was assumed that followed a normal distribution with σ² = 0.25. The mean of the distribution was equal to one for one of the homozygotes, and zero for the two other genotypes. At the final generation, 100 individuals were sampled, and embodiment 1 and embodiment 3 regressions were started at the beginning of the chromosome. A sliding window of one to seven markers was moved toward the end, calculating p (model p-value), and plotting -ln p against the marker number, as shown in Figures 7A-7J. Figures 8A-8J are an independent repetition of the same simulation experiment, but with a sample size of 50. The actual polymorphism causing the shift in the response mean was removed from the data, and thus was assumed "unobserved".
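The sliding-window scan described above may be sketched as follows. The callable `p_value_for_window` is a hypothetical stand-in for the model p-value of embodiment 1 or embodiment 3 computed on the markers in a window; it is not defined in this description, and the function names are illustrative.

```python
import math

def sliding_windows(num_markers, window_size):
    """Yield (start, end) marker index ranges, end exclusive, one marker per step."""
    for start in range(num_markers - window_size + 1):
        yield start, start + window_size

def scan(num_markers, p_value_for_window, max_window=7):
    """Collect (start, -ln p) for every window of 1..max_window markers."""
    results = {}
    for size in range(1, max_window + 1):
        results[size] = [(start, -math.log(p_value_for_window(start, end)))
                         for start, end in sliding_windows(num_markers, size)]
    return results

# Placeholder p-value function used only to exercise the scan.
profile = scan(5, lambda start, end: 0.5, max_window=2)
print(profile[2])
```

Each list in the result can be plotted against the marker number, one curve per window size.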

Accordingly, embodiments of the present invention appear to be quite robust, and can perform well under small sample sizes and various response models, even for binary data. Thus, embodiments of the invention can be used with case-control data as well as with continuous traits. The population simulation results described above are quite encouraging. Single-marker peaks around the true location are somewhat ragged, because of the stochastic differences in allele frequencies and amount of linkage disequilibrium with the disease gene. Some of the -ln p variation for embodiment 1 might also be due to the stochastic nature of the E-M ANOVA. At each window, 10 initial restarts and 3200 samples were used to build the F-statistic distribution for embodiment 1. The whole analysis with window size up to five takes about two hours on an Ultra-2 SPARC machine running Solaris, and seven-marker window analysis can be completed in several hours. Note that peaks around the true location are much higher for the haplotype-based tests with embodiment 3, while the overall pattern of p-values is similar for all embodiments. This confirms the theoretical considerations described above.

Thus, power may be increased by the combination of E-M inferred haplotype frequencies and regression. Although there is no explicit restriction on the number of loci, embodiments of the invention can be sufficiently fast for "sliding window" processing of two to seven markers in reasonable time in genome scans involving large numbers of loci.

Haplotype-based tests using continuous phenotype and E-M based frequencies therefore can be powerful and valid tests for association. Models based on individuals (n observations) or gametes (2n observations) can be null hypothesis-equivalent in the case of known gametic phase. Embodiments of the invention can be used as a screening tool for localizing genetic effects and/or for detecting epistatic effects involving candidate genes. Marker/disease and/or marker/trait associations can be uncovered. Systems, methods and/or computer program products according to embodiments of the invention can be efficient, and can allow rapid processing of large amounts of genetic data, including whole genome scans with dense maps of genetic markers.

Other Embodiments for Determining Probability

Other embodiments for determining probabilities of haplotypes that are compatible with alleles in a subset of markers now will be described in detail. These embodiments may be used as alternatives to the embodiments that previously were described in connection with Block 230 of Figure 2. Moreover, these embodiments also may be used independently, as alternatives to a conventional E-M technique, for various applications thereof. These embodiments can calculate haplotype-response association tests on unrelated individuals using alleles from multiple markers to statistically infer haplotypes. As was already described, it is known to apply the E-M algorithm to infer haplotype frequencies. Embodiments of the invention can calculate composite haplotype frequencies through sums of gametic and non-gametic terms. The above-cited Weir 1996 publication suggests a measure of linkage disequilibrium for two genetic markers that involves calculation of composite haplotype frequencies. Embodiments of the invention can extend the idea of composite haplotypes to an arbitrary number of markers and alleles and can provide an efficient algorithm for calculating composite haplotype frequencies.

Thus, embodiments of the invention can:

1. Extend Weir's method to arbitrary numbers of markers and alleles.

2. Allow composite haplotype frequencies to be used as initial haplotype frequencies for E-M calculations for faster convergence, because under HWE, the composite haplotype frequencies are unbiased estimates of the same population parameters as the ones obtained through E-M.

3. Provide an efficient computing algorithm.

4. Relate continuous or discrete responses using conditional sum of composite haplotype probabilities.

5. Provide higher power compared to E-M in many situations.

Embodiments of the invention may be distinguished from a conventional E-M algorithm for at least one or more of the following reasons:

1. Calculations of composite frequencies do not require the HWE assumption. This may be an important distinction between E-M-based and composite methods, since Hardy-Weinberg disequilibrium (HWD) may be expected for haplotypes related to the response. In the presence of HWE, however, the composite haplotype frequencies may lead to an unbiased estimate of LD.

2. In the case of continuous data, E-M estimates the frequencies for the whole sample. This means that abundant haplotypes with response values from one tail of the distribution can affect probabilities of ambiguous haplotype configurations of the other tail, and thus can mask conceivable effects of haplotypes of the other tail.

3. Estimation and testing with composite counts can be non-iterative and more straightforward. In particular, one of the potential dangers of the E-M estimation is potential convergence to a local maximum. Multiple restarts using different initial haplotype frequencies are employed while doing the E-M. However, there may be no clear way of calculating an optimal number of restarts.

4. Composite frequency calculations can be much faster. The amount of computing for a particular haplotype type can depend linearly on the sample size.

5. Asymptotic tests (e.g., contingency table χ² or logistic regression tests) may fail if the E-M is run separately for different categories of response, but the composite haplotype method may not be prone to this. Shuffling tests with E-M have been suggested. However, they may be notoriously slow, because they may require a new E-M estimation each time the response is scrambled.

6. There are many biologically plausible situations when genetic contribution to response is not determined solely by haplotypes. Rather, it can be important which alleles an individual has at a particular set of markers. Embodiments of the invention can capture the combination of both situations: presence of particular haplotypes as well as a particular set of alleles at different markers. Figures 9A-9C are an example of this, simulated under the assumption that pairs of haplotypes forming a genotype may additionally contribute to the response beyond what is explained by individual haplotypes. The functional (response-related) region extends up to the 50th marker, and the height of the peak reflects the statistical strength of the method. The single-marker approach (Figure 9A) does not do well in comparison with either E-M-inferred haplotypes (Figure 9B), or composite haplotypes (Figure 9C). Embodiments of the invention (Figure 9C) are far superior in this simulation.

Embodiments of the invention now will be described in detail. The multilocus, multiple allele definition derives from counting numbers of genotypes compatible with a particular haplotype. The amount of uncertainty is a function of numbers of distinct haplotypes that each genotype could expand into. This uncertainty defines weights for multilocus genotype contributions. For a multilocus genotype g_i, define H(g_i) to be the number of single-locus heterozygotes in g_i. Then the weights are given by:

w(g_i) = 2^{1 - H(g_i)} (19)

Sample composite haplotype counts are calculated as:

n_{abc...} = Σ_{i=1}^{n} w(g_i) I(a,b,c,... ∈ g_i), (20)

where n is the sample size, and I(·) is the indicator function, defined as

I(a,b,c,... ∈ g_i) = 1 if the i-th individual's genotype has alleles a,b,c,..., and 0 otherwise.

Thus, if the i-th individual has at least one copy of all required alleles, it is counted with weight w(g_i). The composite haplotype frequencies are given by:

p_{abc...} = n_{abc...}/(2n).

In a two-locus, two-allele case, Equation (20) simplifies to Weir's composite count definition: n_AB = 2n_AABB + n_AABb + n_AaBB + n_AaBb/2 (Weir, Genetic Data Analysis II (1996)), where n_AaBb, for example, is the number of individuals with the genotype Aa/Bb. In a single-locus case, it simplifies to the usual definition of the allele count:

n_A = 2n_AA + n_Aa.
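The composite haplotype counts of Equation (20) may be sketched as follows. The weight 2^(1 - H(g_i)) used here is an assumption chosen so that the sketch reproduces Weir's two-locus count (weights 2, 1, 1 and 1/2) and the single-locus allele count given above. The function name `composite_haplotype_counts` and the toy sample are hypothetical.

```python
from collections import defaultdict
from itertools import product

def composite_haplotype_counts(genotypes):
    """Composite counts n_{abc...} for every allele combination seen in the sample.

    genotypes: list of multilocus genotypes, each a list of (allele1, allele2).
    """
    counts = defaultdict(float)
    for g in genotypes:
        num_het = sum(1 for a1, a2 in g if a1 != a2)   # H(g_i)
        weight = 2.0 ** (1 - num_het)                  # assumed form of w(g_i)
        # One combination per choice of an allele the genotype carries at each
        # locus; these are exactly the combinations whose indicator equals 1.
        for combo in product(*(sorted(set(locus)) for locus in g)):
            counts[combo] += weight
    return dict(counts)

sample = [[("A", "A"), ("B", "B")],
          [("A", "A"), ("B", "b")],
          [("A", "a"), ("B", "B")],
          [("A", "a"), ("B", "b")],
          [("A", "a"), ("B", "b")]]
counts = composite_haplotype_counts(sample)
freqs = {h: c / (2 * len(sample)) for h, c in counts.items()}
print(counts[("A", "B")], freqs[("A", "B")])
```

Each individual contributes a total weight of two regardless of the number of heterozygous loci, so the counts sum to 2n and the frequencies sum to one. The work per individual is a single pass over its compatible allele combinations, with no iteration.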

To relate composite haplotype frequencies to the response, per-individual conditional probabilities are computed. They are computed from additive contribution of pairs of composite haplotypes. Specifically, for composite haplotypes h_k and h_l, with frequencies (p_hk, p_hl), the conditional probability of the pair (h_k, h_l) for the i-th individual with genotype g_i is:

where p_hk, p_hl denote sample composite haplotype frequencies, and each haplotype in a pair is equally likely, i.e.:

,Λ _l ) .

These conditional probabilities can relate to response through multiple regression. Theoretical justification and details of the multiple regression approach are given in the above-cited Application Serial No. 09/694,748.

Additional Embodiments for Determining Probabilities

Additional embodiments for determining probabilities of haplotypes that are compatible with alleles in a subset of markers now will be described in detail. These embodiments may be used as alternatives to the embodiments that previously were described in connection with Block 230 of Figure 2. Moreover, these embodiments also may be used independently, as alternatives to a conventional E-M technique, for various applications thereof.

Other embodiments for determining probabilities include Composite Haplotypes (CH). The CH embodiments introduced here can be used as a general test for association of di-genic counts with the phenotype. The comparisons presented here include the binary phenotype, so that the CH performance can be compared with an EM-based Likelihood-Ratio Test (LRT). Note however, that the power of CH can be increased if the data sets used are not dichotomized and the continuous phenotype is assumed.

The number of composite haplotypes can be determined by J = Π_{i=1}^{k} m_i, where k is the number of markers, and m_i is the number of alleles at the i-th marker. This number grows quickly with k, which may make the memory requirements demanding, and may adversely affect stability of the multiple regression. Embodiments of the present invention can allow identification of composite haplotypes with a user-specified threshold frequency (t) by randomly reconstructing pairs of haplotypes for each individual W times and keeping a list of observed haplotypes with the corresponding frequency. The number W is determined by the tolerable error associated with the binomial (nW, t) random variable. Thus, the speed of these embodiments may be affected very little by J.

Further increases in power may be obtained by replacing X by columns of its principal components as described by Johnson et al., Applied Multivariate Analysis (1982), with the user-defined proportion of variation in X, ω, that may be accounted for. The resulting matrix has fewer columns depending on the value of ω. To keep the comparisons fair, this technique was not used in the analysis presented here. However, note that using principal components of X may be especially useful when k is large and the haplotype diversity is high. CH was chosen because of the potential benefit of the power of this technique.
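As an illustration of the principal component replacement, the following sketch keeps the smallest number of leading components whose cumulative share of variance reaches ω. The function name and the use of a singular value decomposition of the column-centered matrix are assumptions of this example, not details taken from the specification.

```python
import numpy as np


def principal_component_columns(X, omega):
    """Return principal component scores of X that together account
    for at least the proportion `omega` of the variation in X."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each column
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2                                 # proportional to component variances
    total = var.sum()
    if total == 0:
        return Xc[:, :0]                         # no variation to explain
    explained = np.cumsum(var) / total
    # first index at which the cumulative proportion reaches omega
    n_keep = int(np.searchsorted(explained, omega - 1e-12)) + 1
    n_keep = min(n_keep, len(s))
    return U[:, :n_keep] * s[:n_keep]            # component scores
```

The returned matrix can be used in place of X in the regression; smaller values of ω give fewer columns.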

CH also can be considerably faster than EM-based techniques that use iterative frequency estimation at each shuffling step. Moreover, many of the steps used in obtaining statistics values for permuted data sets can be pre-computed. For example, $P = (X'X)^{-1} X'$ that is used to compute parameter estimates, $\hat{\beta} = P Y$, is invariant under shuffling and should only be calculated once. Thus, CH can be faster and more accurate.
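The pre-computation noted above can be sketched as a permutation F-test in which the projection matrix is built once and reused for every shuffle of Y. This is a hedged example: the function name, the added intercept column, and the use of `numpy.linalg.pinv` (which equals $(X'X)^{-1}X'$ when X has full column rank) are assumptions of the sketch.

```python
import numpy as np


def permutation_f_test(X, y, n_perm=3200, seed=0):
    """Overall F-test of y on X with a permutation p-value.

    P and the hat matrix H depend only on X, so they are computed
    once and reused for every permuted copy of y.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])      # intercept + predictors
    P = np.linalg.pinv(X1)                     # (X'X)^{-1} X' for full-rank X1
    H = X1 @ P                                 # hat matrix
    df_model = np.linalg.matrix_rank(X1) - 1
    df_resid = n - df_model - 1

    def f_stat(yv):
        ss_res = np.sum((yv - H @ yv) ** 2)
        ss_tot = np.sum((yv - yv.mean()) ** 2)
        return ((ss_tot - ss_res) / df_model) / (ss_res / df_resid)

    f_obs = f_stat(y)
    f_perm = np.array([f_stat(rng.permutation(y)) for _ in range(n_perm)])
    p_value = (1 + np.sum(f_perm >= f_obs)) / (1 + n_perm)
    beta = P @ y                               # parameter estimates
    return beta, f_obs, p_value
```

Each permutation then costs one matrix-vector product instead of a new least-squares fit.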

Case Study

Case-Control Di-Genic Frequencies

Assume for the moment two markers with two alleles and a binary phenotype, i.e. the subjects can be classified as either "cases" or "controls". Assume that the haplotype AB increases the probability of an individual being a "case". Denote the expected allele frequencies among cases and controls by $(p_A, p_B, p'_A, p'_B)$. Therefore, the expected frequency of the AB haplotype in cases and controls can be expressed as:

Cases: $P_{AB} = p_A p_B + D_1$

Controls: $P'_{AB} = p'_A p'_B + D_2$

where $D_1$, $D_2$ are LD coefficients. The composite di-genic frequency (Weir, 1996) estimates a quantity different from $P_{AB}$, specifically:

Cases: $\phi_{AB} = p_A p_B + D_1 + P_{A/B}$

Controls: $\phi'_{AB} = p'_A p'_B + D_2 + P'_{A/B}$

The last terms, $P_{A/B}$, $P'_{A/B}$, are the frequencies of A, B alleles that reside on two different gametes, in contrast to $P_{AB}$, $P'_{AB}$, that measure their joint frequency on the same gamete. This "intra-gametic" frequency can also be written as a product of A, B allele frequencies plus the deviation ($D_{A/B}$) unexplained by the product. Generally, this deviation is not zero if the HWE at the haplotype level does not hold. As illustrated next, in addition to $P_{AB} - P'_{AB} > 0$, generally $P_{A/B} - P'_{A/B} > 0$. Therefore a test that $\phi_{AB} - \phi'_{AB} \neq 0$ is next considered, which may be the basis of the CH embodiments.

Intra-Gametic Frequency Difference Between Cases (Y) And Controls (N)

Two-locus two-allele frequencies can be arranged into the following 3x3 table:

        BB      Bb          bb
AA      f_1     f_2         f_3
Aa      f_4     f_5 + f_6   f_7
aa      f_8     f_9         f_10

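where $f_5$ is the frequency of the AB/ab genotype and $f_6$ is the frequency of the Ab/aB genotype. The missing gametic phase implies that only the sum ($f_5 + f_6$) can be observed. Denote penetrance parameters corresponding to the genotypes $g_1, ..., g_{10}$ by $\gamma_1, ..., \gamma_{10}$. The mean penetrance (prevalence) is $\bar{\gamma} = \sum_{i=1}^{10} f_i \gamma_i$. The following conditional probabilities may be observed:

$\Pr(g_i | Y) = f_i \gamma_i / \bar{\gamma}$

$\Pr(g_i | N) = f_i (1 - \gamma_i) / (1 - \bar{\gamma})$

$\Pr(A | Y) = \Pr(g_1 | Y) + \Pr(g_2 | Y) + \Pr(g_3 | Y) + \frac{1}{2} (\Pr(g_4 | Y) + \Pr(g_5 | Y) + \Pr(g_6 | Y) + \Pr(g_7 | Y))$

Similar expressions hold for locus B. Using these, the following may be found for the cases: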

$P_{AB} = \Pr(g_1 | Y) + \frac{1}{2} (\Pr(g_2 | Y) + \Pr(g_4 | Y) + \Pr(g_5 | Y))$

$\phi_{AB} = 2 \Pr(g_1 | Y) + \Pr(g_2 | Y) + \Pr(g_4 | Y) + \frac{1}{2} (\Pr(g_5 | Y) + \Pr(g_6 | Y))$

$P_{A/B} = \phi_{AB} - P_{AB}$

$D_{AB} = P_{AB} - \Pr(A | Y) \Pr(B | Y)$

Corresponding expressions for controls are obtained by replacing Y with N in the conditional probabilities, and therefore $P_{AB} - P'_{AB}$ and $\phi_{AB} - \phi'_{AB}$ are available. Assuming, as before, that the haplotype AB increases the probability for an individual to be a "case", it may be expected that the difference $P_{AB} - P'_{AB}$ is positive. However, many such situations give an intra-gametic frequency difference, $P_{A/B} - P'_{A/B}$, that is greater than zero as well, and so is $\phi_{AB} - \phi'_{AB}$. Figures 10A-10C are a numerical illustration of this observation, obtained for three penetrance matrices: (1,1,0,1,1,0,0,0,0,0), (1,1/2,0,1/2,1/2,0,0,0,0,0), (1/2,1,0,1,1,0,0,0,0,0) corresponding to Figures 10A, 10B and 10C, respectively. Each histogram is based on 50,000 observations and was obtained by sampling four haplotype frequencies from a uniform distribution and computing $f_1, ..., f_{10}$ from the Hardy-Weinberg proportions. Only the last example has non-zero (9%) probability of $P_{A/B} - P'_{A/B} < 0$. In particular, $\Pr(P_{A/B} - P'_{A/B} > 0) = 1$ for penetrance matrices of the form: $\Gamma$ = (a b c b b c c c c c), for all a > b > c.
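The numerical illustration described above can be approximated with a short simulation. The sketch below is an assumption-laden reading of the text: genotypes are ordered as $g_1$ = AB/AB, $g_2$ = AB/Ab, $g_3$ = Ab/Ab, $g_4$ = AB/aB, $g_5$ = AB/ab, $g_6$ = Ab/aB, $g_7$ = Ab/ab, $g_8$ = aB/aB, $g_9$ = aB/ab, $g_{10}$ = ab/ab, the "uniform distribution" is taken as Dirichlet(1,1,1,1), and all names are illustrative.

```python
import numpy as np

# Contribution of each genotype g_1..g_10 to the frequency of A and B
# on the same gamete (P_AB) and on different gametes (P_A/B).
GAMETIC_W = np.array([1, .5, 0, .5, .5, 0, 0, 0, 0, 0])
NONGAMETIC_W = np.array([1, .5, 0, .5, 0, .5, 0, 0, 0, 0])


def hwe_genotype_freqs(h):
    """Ten two-locus genotype frequencies from haplotype frequencies
    h[:, (AB, Ab, aB, ab)] under Hardy-Weinberg proportions."""
    AB, Ab, aB, ab = h[:, 0], h[:, 1], h[:, 2], h[:, 3]
    return np.stack([AB**2, 2*AB*Ab, Ab**2, 2*AB*aB, 2*AB*ab,
                     2*Ab*aB, 2*Ab*ab, aB**2, 2*aB*ab, ab**2], axis=1)


def case_control_difference(f, gamma, weights):
    """Frequency among cases minus frequency among controls."""
    gamma = np.asarray(gamma, dtype=float)
    prevalence = f @ gamma                             # mean penetrance
    pr_y = f * gamma / prevalence[:, None]             # Pr(g_i | Y)
    pr_n = f * (1 - gamma) / (1 - prevalence)[:, None]  # Pr(g_i | N)
    return pr_y @ weights - pr_n @ weights


rng = np.random.default_rng(1)
h = rng.dirichlet(np.ones(4), size=50_000)
f = hwe_genotype_freqs(h)
gamma_10a = (1, 1, 0, 1, 1, 0, 0, 0, 0, 0)
diff_intra = case_control_difference(f, gamma_10a, NONGAMETIC_W)
diff_gametic = case_control_difference(f, gamma_10a, GAMETIC_W)
print("share with negative intra-gametic difference:",
      np.mean(diff_intra < 0))
```

Passing a different penetrance vector to `case_control_difference` gives the corresponding distribution for the other two matrices.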

Composite frequencies estimation and hypothesis testing

In this section, the distinction between notation for population and estimated quantities may be dropped, and $P_{AB}$ may be used, for example, to denote the sample frequency of the haplotype AB. In the two-locus bi-allelic case, the di-genic frequencies are estimated as:

$P_{AB} + P_{A/B} = n_{AB} / n$, with $n_{AB} = 2 n_{AABB} + n_{AABb} + n_{AaBB} + \frac{1}{2} n_{AaBb}$

(See Weir et al., 1996) where $n_{AaBb}$, for example, is the number of individuals with genotype Aa/Bb, $n_{AB}$ is the di-genic count, and n is the sample size. The composite disequilibrium is calculated using the sum of inter- and intra-gametic components:

$\Delta_{AB} = D_{AB} + D_{A/B} = P_{AB} + P_{A/B} - 2 p_A p_B$. Under random mating, $E(P_{A/B}) = p_A p_B$, and so $\Delta_{AB}$ is an unbiased estimate of the LD parameter, $D_{AB}$. For the present purposes it is not necessary to separate the inter- and intra-gametic components, as the work may be in terms of $p_{AB} = (P_{AB} + P_{A/B}) / 2$.

The general multi-locus multi-allelic definition of di-genic frequencies is as follows. Given the i-th individual's multi-locus genotype $g_i$, $H(g_i)$ is the number of single-locus heterozygotes in $g_i$. Weights are defined as:

$w(g_i) = 2^{1 - H(g_i)}$

Sample composite haplotype counts are calculated from summing over individual contributions:

$n_{abc...} = \sum_{i=1}^{n} w(g_i) I(a, b, c, ...; g_i)$,

where n is the sample size, and $I(\cdot)$ is the indicator function, defined as:

$I(a, b, c, ...; g_i) = 1$ if the i-th individual's genotype $g_i$ has alleles a, b, c, ..., and 0 otherwise.

Thus, if the i-th individual has at least one copy of all required alleles, it is counted with weight $w(g_i)$. The composite haplotype frequencies are given by $p_{abc...} = \frac{1}{2n} n_{abc...}$. In a single-locus case, they simplify to the usual definition of the allele frequency:

$p_a = (2 n_{aa} + n_{a\cdot}) / (2n)$, where $n_{aa}$ is the number of homozygotes for a and $n_{a\cdot}$ is the number of heterozygotes carrying a.
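A minimal sketch of the composite haplotype counts follows. It assumes the weight $w(g) = 2^{1 - H(g)}$, with H the number of heterozygous markers, and a divisor of 2n for the frequencies. These choices are not stated in the text; they are used because they reproduce the two-locus di-genic count of Weir and the ordinary single-locus allele frequency. The function names and the genotype representation are illustrative.

```python
from collections import Counter
from itertools import product


def composite_counts(genotypes):
    """Composite haplotype counts n_abc... over a sample.

    genotypes: list of individuals, each a tuple of (allele, allele)
    pairs, one pair per marker.
    """
    counts = Counter()
    for g in genotypes:
        n_het = sum(1 for a, b in g if a != b)      # H(g)
        w = 2.0 ** (1 - n_het)                      # assumed weight
        # every haplotype whose alleles are all carried by this individual
        for combo in product(*[sorted(set(pair)) for pair in g]):
            counts[combo] += w
    return counts


def composite_frequencies(genotypes):
    """Composite haplotype frequencies p_abc... = n_abc... / (2n)."""
    n = len(genotypes)
    return {h: c / (2 * n) for h, c in composite_counts(genotypes).items()}


sample = [
    (("A", "A"), ("B", "B")),   # contributes 2 to n_AB
    (("A", "A"), ("B", "b")),   # contributes 1 to n_AB
    (("A", "a"), ("B", "b")),   # contributes 1/2 to n_AB
]
print(composite_counts(sample)[("A", "B")])   # 3.5
```

Each individual contributes a total weight of 2 across all haplotypes, so the composite frequencies sum to one.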

Once the composite frequencies are calculated, a row vector of conditional composite frequencies, $x_j$, was obtained for an individual j with the multi-locus genotype $g_j$. These frequencies are estimates of conditional probabilities of composite haplotypes given the genotype $g_j$. The length of the vector is the number of haplotypes that will be used for the test of association with the phenotype. For example, a minimum required sample frequency may be used as a threshold to reduce the number of haplotypes for the analysis. The k-th component of the vector, corresponding to the k-th haplotype, is calculated as:

$x_{jk} = \frac{\Pr(g_j | h_k, \bar{h}_k) \, p_k \bar{p}_k}{\sum_l \Pr(g_j | h_l, \bar{h}_l) \, p_l \bar{p}_l}$

where $\bar{p}_k$ denotes the frequency of the composite haplotype $\bar{h}_k$ that is complementary to the haplotype $h_k$. The complementary haplotype is determined by the genotype, given the first haplotype, $h_k$. The probability $\Pr(g_j | h_k, \bar{h}_k)$ is either zero or one, so the sum in the denominator is over the haplotype pairs compatible with the genotype. Denoting the vector of phenotype values (not necessarily binary) by Y and letting X be the matrix whose j-th row is $x_j$, a generalized linear model $E(Y) = F(Xb)$, where $F(\cdot)$ is a link function, may be obtained. The hypothesis of no overall association of haplotypes with the trait is tested as $H_0 : b = 0$, and the individual haplotype k can be tested as $H_0 : b_k = 0$. Theoretical justification and details of this multiple regression approach are given in the above-cited Application Serial No. 09/694,748.
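One way to compute the row vector $x_j$ for a single individual is sketched below. It assumes that each compatible haplotype is weighted by the product of its own composite frequency and that of its complementary haplotype, normalized over all haplotypes compatible with the genotype, so that each row sums to one. The function name and data layout are hypothetical.

```python
from itertools import product


def conditional_haplotype_probs(genotype, freqs, haplotypes):
    """Estimated Pr(haplotype | genotype) for each entry of `haplotypes`.

    genotype:   tuple of (allele, allele) pairs, one pair per marker
    freqs:      dict mapping haplotype tuples to composite frequencies
    haplotypes: ordered list of haplotypes defining the columns of X
    """
    weights = {}
    for h in product(*[sorted(set(pair)) for pair in genotype]):
        # complementary haplotype: the other allele at every marker
        comp = tuple(b if a == x else a
                     for (a, b), x in zip(genotype, h))
        weights[h] = freqs.get(h, 0.0) * freqs.get(comp, 0.0)
    total = sum(weights.values())
    if total == 0:
        return [0.0] * len(haplotypes)
    return [weights.get(h, 0.0) / total for h in haplotypes]


cols = [("A", "B"), ("A", "b"), ("a", "B"), ("a", "b")]
p = {("A", "B"): 0.4, ("A", "b"): 0.2, ("a", "B"): 0.1, ("a", "b"): 0.3}
x_j = conditional_haplotype_probs((("A", "a"), ("B", "b")), p, cols)
```

Stacking these rows for all individuals gives the design matrix X used in the regression. Haplotypes incompatible with the genotype receive a value of zero.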

For the simulation studies, the identity link was assumed to allow for computation of the F-statistic. The significance values of the tests were obtained from the permutation distribution generated by randomly shuffling the entries of Y at least 3200 times.

EM Versus CH: A Two-Locus, Two-Allele Simulation Study

For this study a binary (Y,N) phenotype was assumed. Penetrance matrices (probabilities for the Y category) of several different forms were used. The first matrix:

A_1 = (a b c b b c c c c c), has the restrictions that either a,b ≥ c or a,b ≤ c, and Pr(a > b) = 1/2. This model explicitly assigns different penetrance values for genotypes that contain the AB haplotype. The matrix A_2 is A_1 with a = b. The next eight matrices (A_3, ..., A_10) followed the epistatic and heterogeneity models described in table 1 of McCarthy et al., "Sib-pair collection strategies for complex diseases", Genet Epid 17: 317-340 (1998), with the subtraction of three models that can be obtained from others by transposing one of the penetrance matrices. Next, a case when all ten penetrance values are allowed to be different without a pre-specified order (A_11) was studied. The last matrix, A_12, was set up as:

A_12 = (a a a a b a a a a)

Although A_12 does not seem to correspond to a plausible biological model, it is the case where the effect of AB is only carried through the effect of the ambiguous double heterozygous genotype, and so the largest difference in performance may be expected from the two methods. Given the restrictions, exact penetrance values for matrices A_1 through A_12 were drawn from a uniform (0, 1) distribution for each simulation run.

Population haplotype frequencies for each of 10,000 simulations were generated by (a) sampling from the multivariate uniform distribution, Dirichlet(1,1,1,1), with ten di-locus population genotype frequencies obtained assuming HWE; and (b) by sampling the ten di-locus genotype frequencies directly from the multivariate uniform distribution. The second way permits genotypes to deviate from the HWE proportions. Rejection sampling was used to obtain pre-specified values of LD (0 to 0.3 and 0.5 to 1 of the maximum possible value) and HW disequilibrium (0.5 to 1 of the maximum possible value). Samples of 50 and 100 individuals were obtained by multinomial sampling from the population frequencies. The program "FAST EH+" was used to calculate permutation p-values of the EM-based LRT. See, Zhao et al. "Model-free analysis and permutation tests for allelic associations", Human Heredity 50: 133-139 (2000). Permutation p-values for the CH were obtained from the overall test using the HTR.
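Sampling scheme (a) can be sketched as follows. Several details are assumptions of this example rather than statements of the specification: the "maximum possible value" of LD is taken to be the usual bound on the coefficient D given the allele frequencies, case or control status is assigned by a Bernoulli draw with the genotype's penetrance, genotypes are indexed in the order AB/AB, AB/Ab, Ab/Ab, AB/aB, AB/ab, Ab/aB, Ab/ab, aB/aB, aB/ab, ab/ab, and all function names are hypothetical.

```python
import numpy as np


def ld_fraction(h):
    """|D| / D_max for haplotype frequencies h = (AB, Ab, aB, ab)."""
    pA, pB = h[0] + h[1], h[0] + h[2]
    D = h[0] - pA * pB
    if D >= 0:
        d_max = min(pA * (1 - pB), (1 - pA) * pB)
    else:
        d_max = min(pA * pB, (1 - pA) * (1 - pB))
    return abs(D) / d_max if d_max > 0 else 0.0


def sample_haplotype_freqs(rng, lo=0.5, hi=1.0, max_tries=100_000):
    """Rejection sampling: draw from Dirichlet(1,1,1,1) until the LD
    fraction falls inside the requested range."""
    for _ in range(max_tries):
        h = rng.dirichlet(np.ones(4))
        if lo <= ld_fraction(h) <= hi:
            return h
    raise RuntimeError("no draw satisfied the LD restriction")


def simulate_sample(rng, h, penetrance, n=50):
    """Multinomial sample of n genotypes under HWE plus case status."""
    AB, Ab, aB, ab = h
    f = np.array([AB**2, 2*AB*Ab, Ab**2, 2*AB*aB, 2*AB*ab,
                  2*Ab*aB, 2*Ab*ab, aB**2, 2*aB*ab, ab**2])
    counts = rng.multinomial(n, f)
    genotype = np.repeat(np.arange(10), counts)     # genotype index per person
    is_case = rng.random(n) < np.asarray(penetrance, float)[genotype]
    return genotype, is_case


rng = np.random.default_rng(0)
h = sample_haplotype_freqs(rng)
genotype, is_case = simulate_sample(rng, h, [1, 1, 0, 1, 1, 0, 0, 0, 0, 0])
```

Scheme (b) would instead draw the ten genotype frequencies directly, for example from a ten-category Dirichlet distribution, and apply an analogous restriction on HW disequilibrium.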

EM Versus CH: A Multi-Locus Study

All parameters of the two-locus system are readily controlled, and Dirichlet-derived population frequencies coupled with the rejection sampling are adequate. For the multi-locus case, an isolate population model was used, as described in Almasy et al., GAW12: Simulated genome scan, sequence, and family data for a common disease, (in press, Genetic Analysis Workshop 12, Genetic Epidemiology, supplement volume) (2001) and Thomas et al., Evolution of the simulated data problem, (in press, Genetic Analysis Workshop 12, Genetic Epidemiology, supplement volume) (2001). Two starting populations (100 and 10,000 individuals for the isolate and the general populations) in linkage equilibrium were considered. The LD was created by 20 generations of genetic drift and migration from the general population into the isolate. The migration stopped after the first eight generations. The number of migrants was equal to 5% of the isolate size. The initial values of population allele frequencies were sampled from the uniform (0, 1) distribution. The recombination was modeled assuming no interference. The final generation of the isolate consisted of 10,000 individuals. 100 individuals were sampled for the subsequent analysis and 512 separate evolutions were performed.

The response was modeled by assigning a genotypic value, G_k ~ N(0,1), to each genotype in the response region defined by ten consecutive SNPs. These SNPs were assumed unobserved, and genotypes of two to eight SNPs that were 0.025 cM away from the response region were used for the analysis. The phenotype of the j-th individual was constructed as Y_jk = G_k + ε_j, where ε_j ~ N(0, 0Λ). The phenotypic value was dichotomized about the sample mean prior to the analysis. The average LD between adjacent markers was 0.35 as measured by the correlation coefficient. The program "FAST EH+" was used to carry out the LRT. See, Zhao et al.

Results

All power values were obtained as proportions of p-values at least as small as 5%. Results for the two-locus, two-allele case showed that CH and EM+LRT have similar performance under the assumption of the population HWE for the models that attribute decreased or increased penetrance to a specific haplotype (Table XIV).

Table XIV. Power values for the LRT and the CH, two-locus simulations, HWE, LD range: 0.5,...,1 of the maximum value, and the sample size of 50.

        A_1     A_2     A_12
CH      0.831   0.936   0.549
LRT     0.831   0.939   0.547

Surprisingly, the model A_12 did not reveal higher power for the EM-based test. Under the HWD (Table XV) the power of the LRT appears to be slightly affected.

Table XV. Power values for the LRT and the CH, two-locus simulations, HWD, LD range: 0.5,...,1 of the maximum value, and the sample size of 50.

        A_1     A_2     A_12
CH      0.852   0.950   0.227
LRT     0.834   0.942   0.203

Results for matrices A_3, ..., A_11 (these are not specifically haplotype-driven configurations) are shown in Table XVI.

Table XVI. Power values for the LRT and the CH, two-locus simulations, HWD, LD range: 0.5,...,1 of the maximum value, and the sample size of 50.

Overall, in these embodiments, CH shows a small improvement in power. Similar results were observed for smaller values of LD, 0 to 0.3 of the maximum value, with higher power for both tests (data not shown). This can be attributed to reduction of haplotype diversity caused by high values of LD. Table XVII presents results from multi-locus simulations. One to seven markers used in the analysis did not directly affect the phenotype; therefore the power values reflect the strength of the LD between the "unobserved" functional region and these markers. Power values are clearly higher for the CH, with the largest value (90%) observed for five-marker composite haplotypes. Although the permutation test is most likely to have the correct size, the validity of the CH test was verified under the null hypothesis. For each of 10,000 simulations, population haplotype frequencies were sampled from the Dirichlet distribution and multinomial samples of genotypes of various sizes n were obtained. These simulations were performed for normally and binary distributed Y and haplotype sizes of one to ten.

Additionally, in the present embodiments, CH shows a small improvement in power when the size of the haplotype is increased.

Table XVII. Power values for the LRT and the CH, 512 multi-locus forward simulations, sample size of 100.

haplotype size   1       2       3       4       5       6       7
CH               0.154   0.412   0.523   0.783   0.900   0.850   0.818
LRT              0.160   0.406   0.501   0.568   0.662   0.644   0.639

In the drawings and specification, there have been disclosed typical preferred embodiments of the invention and, although specific terms are employed, they are used in a generic and descriptive sense only and not for purposes of limitation.

The foregoing examples are illustrative of the present invention, and are not to be construed as limiting thereof. The invention is defined by the following claims, with equivalents of the claims to be included therein.

Claims

What is Claimed is:

1. A method of associating haplotype frequencies for a plurality of individuals with a continuous trait, each individual including a pair of chromosomes having a plurality of markers thereon, each marker having a pair of alleles, wherein a haplotype comprises a combination of alleles for a set of markers on a predetermined chromosome, the method comprising: selecting a subset of markers from the set of markers that may correlate with the continuous trait; for each individual, obtaining a value of the continuous trait and a pair of alleles for each of the markers in the subset of markers; for each individual, determining probabilities of haplotypes that are compatible with the alleles in the subset of markers; and performing a regression on the probabilities of haplotypes that are compatible with the alleles in the subset of markers, for all the individuals, to determine correlations between the continuous trait and the haplotypes.

2. A method according to Claim 1 wherein the step of performing regression comprises: for each individual, sampling a first haplotype from the haplotypes that are compatible with the individual's set of alleles, to thereby define a second haplotype which is determined by the sampling of the first haplotype; assigning the value of the continuous trait for the individual to both the first haplotype and the second haplotype, to thereby define a doubled sample size; performing an analysis of variance by comparing average values of the trait among the sampled first and second haplotypes for all the individuals; repeating the steps of sampling, assigning and performing to obtain a distribution of correlations of the continuous trait and the haplotypes; and determining a value from the distribution that identifies a significance of the correlation.

3. A method according to Claim 2 wherein the step of performing an analysis of variance comprises: defining a design matrix of first and second indicator values having two rows for each individual, where the second indicator value is associated with the first and second haplotypes and remaining positions in the design matrix are set to the first indicator value in the two rows; and performing a regression on the design matrix, to thereby identify a correlation value between the value of the continuous trait and the first and second haplotypes.

4. A method according to Claim 3 wherein the step of determining a value comprises: determining a median from the distribution that identifies a significance of the correlation.

5. A method according to Claim 1 wherein the step of performing regression comprises: for each haplotype in the set, assigning a rank of significance; for each individual, sampling a first haplotype from the haplotypes that are compatible with the individual's set of alleles, to thereby define a second haplotype which is determined by the sampling of the first haplotype; assigning the value of the continuous trait for the individual to both the first haplotype and the second haplotype, to thereby define a doubled sample size; performing a one degree of freedom regression on the ranks for the sampled first and second haplotypes for all the individuals; repeating the steps of sampling, assigning the value of the continuous trait and performing a one degree of freedom regression to obtain a distribution of the correlation of the continuous trait and the haplotypes; and determining a value from the distribution that identifies a significance of the correlation.

6. A method according to Claim 5 wherein the step of performing a one degree of freedom regression comprises: defining a design matrix having two columns of the ranks of the first and second haplotypes and having two rows for each individual; and performing a regression on the design matrix, to thereby identify a correlation value between the value of the continuous trait and the haplotypes.

7. A method according to Claim 6 wherein the step of determining a value comprises: determining a median from the distribution that identifies a significance of the correlation.

8. A method according to Claim 1 wherein the step of performing regression comprises: relating the value of the continuous trait for each individual to a vector of estimated frequencies of all haplotypes; and performing a multiple regression of the trait values on the vectors of estimated frequencies.

9. A method according to Claim 1 wherein the step of determining comprises performing an expectation-maximization.

10. A method of associating haplotype frequencies for a plurality of individuals with a continuous trait, each individual including a pair of chromosomes having a plurality of markers thereon, each marker having a pair of alleles, wherein a haplotype comprises a combination of alleles for a set of markers on a predetermined chromosome, the method operating on data for each individual including a value of the continuous trait and a pair of alleles for each of the markers in a subset of markers from a set of markers that may correlate with the continuous trait, the method comprising: for each individual, determining probabilities of haplotypes that are compatible with the alleles in the subset of markers; and performing a regression on the probabilities of haplotypes that are compatible with the alleles in the subset of markers, for all the individuals, to determine correlations between the continuous trait and the haplotypes.

11. A method according to Claim 10 wherein the step of performing regression comprises: for each individual, sampling a first haplotype from the haplotypes that are compatible with the individual's set of alleles, to thereby define a second haplotype which is determined by the sampling of the first haplotype; assigning the value of the continuous trait for the individual to both the first haplotype and the second haplotype, to thereby define a doubled sample size; performing an analysis of variance by comparing average values of the trait among the sampled first and second haplotypes for all the individuals; repeating the steps of sampling, assigning and performing to obtain a distribution of correlations of the continuous trait and the haplotypes; and determining a value from the distribution that identifies a significance of the correlation.

12. A method according to Claim 11 wherein the step of performing analysis of variance comprises: defining a design matrix of first and second indicator values having two rows for each individual, where the second indicator value is associated with the first and second haplotypes and remaining positions in the design matrix are set to the first indicator value in the two rows; and performing a regression on the design matrix, to thereby identify a correlation value between the value of the continuous trait and the first and second haplotypes.

13. A method according to Claim 12 wherein the step of determining a value comprises: determining a median from the distribution that identifies a significance of the correlation.

14. A method according to Claim 10 wherein the step of performing regression comprises: for each haplotype in the set, assigning a rank of significance; for each individual, sampling a first haplotype from the haplotypes that are compatible with the individual's set of alleles, to thereby define a second haplotype which is determined by the sampling of the first haplotype; assigning the value of the continuous trait for the individual to both the first haplotype and the second haplotype, to thereby define a doubled sample size; performing a one degree of freedom regression on the ranks for the sampled first and second haplotypes for all the individuals; repeating the steps of sampling, assigning the value of the continuous trait and performing a one degree of freedom regression to obtain a distribution of the correlation of the continuous trait and the haplotypes; and determining a value from the distribution that identifies a significance of the correlation.

15. A method according to Claim 14 wherein the step of performing a one degree of freedom regression comprises: defining a design matrix having two columns of the ranks of the first and second haplotypes and having two rows for each individual; and performing a regression on the design matrix, to thereby identify a correlation value between the value of the continuous trait and the haplotypes.

16. A method according to Claim 15 wherein the step of determining a value comprises: determining a median from the distribution that identifies a significance of the correlation.

17. A method according to Claim 10 wherein the step of performing regression comprises: relating the value of the continuous trait for each individual to a vector of estimated frequencies of all haplotypes; and performing a multiple regression of the trait values on the vectors of estimated frequencies.

18. A method according to Claim 10 wherein the step of determining comprises performing an expectation-maximization.

19. A system for associating haplotype frequencies for a plurality of individuals with a continuous trait, each individual including a pair of chromosomes having a plurality of markers thereon, each marker having a pair of alleles, wherein a haplotype comprises a combination of alleles for a set of markers on a predetermined chromosome, the system comprising: data representing a subset of markers from the set of markers that may correlate with the continuous trait; for each individual, a value of the continuous trait and a pair of alleles for each of the markers in the subset of markers; means for determining probabilities of haplotypes that are compatible with the alleles in the subset of markers, for each individual; and means for performing a regression on the probabilities of haplotypes that are compatible with the alleles in the subset of markers, for all the individuals, to determine correlations between the continuous trait and the haplotypes.

20. A system according to Claim 19 wherein the means for performing regression comprises: means for sampling a first haplotype from the haplotypes that are compatible with the individual's set of alleles, to thereby define a second haplotype which is determined by the sampling of the first haplotype; means for assigning the value of the continuous trait for the individual to both the first haplotype and the second haplotype, to thereby define a doubled sample size; means for performing an analysis of variance by comparing average values of the trait among the sampled first and second haplotypes for all the individuals; means for repeatedly activating the means for sampling, the means for assigning and the means for performing to obtain a distribution of correlations of the continuous trait and the haplotypes; and means for determining a value from the distribution that identifies a significance of the correlation.

21. A system according to Claim 20 wherein the means for performing an analysis of variance comprises: a design matrix of first and second indicator values having two rows for each individual, where the second indicator value is associated with the first and second haplotypes and remaining positions in the design matrix are set to the first indicator value in the two rows; and means for performing a regression on the design matrix, to thereby identify a correlation value between the value of the continuous trait and the first and second haplotypes.

22. A system according to Claim 21 wherein the means for determining a value comprises: means for determining a median from the distribution that identifies a significance of the correlation.

23. A system according to Claim 19 wherein the means for performing a regression comprises: data representing a rank of significance for each haplotype in the set; means for sampling a first haplotype from the haplotypes that are compatible with the individual's set of alleles, to thereby define a second haplotype which is determined by the sampling of the first haplotype for each individual; means for assigning the value of the continuous trait for the individual to both the first haplotype and the second haplotype, to thereby define a doubled sample size; means for performing a one degree of freedom regression on the ranks for the sampled first and second haplotypes for all the individuals; means for repeatedly activating the means for sampling, the means for assigning the value of the continuous trait and the means for performing a one degree of freedom regression to obtain a distribution of the correlation of the continuous trait and the haplotypes; and means for determining a value from the distribution that identifies a significance of the correlation.

24. A system according to Claim 23 wherein the means for performing a one degree of freedom regression comprises: a design matrix having two columns of the ranks of the first and second haplotypes and having two rows for each individual; and means for performing a regression on the design matrix, to thereby identify a correlation value between the value of the continuous trait and the haplotypes.

25. A system according to Claim 24 wherein the means for determining a value comprises: means for determining a median from the distribution that identifies a significance of the correlation.

26. A system according to Claim 19 wherein the means for performing regression comprises: means for relating the value of the continuous trait for each individual to a vector of estimated frequencies of all haplotypes; and means for performing a multiple regression of the trait values on the vectors of estimated frequencies.

27. A system according to Claim 19 wherein the means for determining comprises means for performing an expectation-maximization.

28. A computer program product that associates haplotype frequencies for a plurality of individuals with a continuous trait, each individual including a pair of chromosomes having a plurality of markers thereon, each marker having a pair of alleles, wherein a haplotype comprises a combination of alleles for a set of markers on a predetermined chromosome, the computer program product comprising a computer usable storage medium having computer-readable program code embodied in the medium, the computer-readable program code comprising: computer-readable program code that is configured to store a subset of markers from the set of markers that may correlate with the continuous trait; computer-readable program code that is configured to store a value of the continuous trait and a pair of alleles for each of the markers in the subset of markers, for each individual; computer-readable program code that is configured to determine probabilities of haplotypes that are compatible with the alleles in the subset of markers, for each individual; and computer-readable program code that is configured to perform a regression on the probabilities of haplotypes that are compatible with the alleles in the subset of markers, for all the individuals, to determine correlations between the continuous trait and the haplotypes.

29. A computer program product according to Claim 28 wherein the computer-readable program code that is configured to perform a regression comprises: computer-readable program code that is configured to sample a first haplotype from the haplotypes that are compatible with the individual's set of alleles, to thereby define a second haplotype which is determined by the sampling of the first haplotype, for each individual; computer-readable program code that is configured to assign the value of the continuous trait for the individual to both the first haplotype and the second haplotype, to thereby define a doubled sample size; computer-readable program code that is configured to perform an analysis of variance by comparing average values of the trait among the sampled first and second haplotypes for all the individuals; computer-readable program code that is configured to repeatedly activate the computer-readable program code that is configured to sample, the computer-readable program code that is configured to assign and the computer-readable program code that is configured to perform, to obtain a distribution of correlations of the continuous trait and the haplotypes; and computer-readable program code that is configured to determine a value from the distribution that identifies a significance of the correlation.

30. A computer program product according to Claim 29 wherein the computer-readable program code that is configured to perform an analysis of variance comprises: computer-readable program code that is configured to provide a design matrix of first and second indicator values having two rows for each individual, where the second indicator value is associated with the first and second haplotypes and remaining positions in the design matrix are set to the first indicator value in the two rows; and computer-readable program code that is configured to perform a regression on the design matrix, to thereby identify a correlation value between the value of the continuous trait and the first and second haplotypes.

31. A computer program product according to Claim 30 wherein the computer-readable program code that is configured to determine a value comprises: computer-readable program code that is configured to determine a median from the distribution that identifies a significance of the correlation.
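
By way of illustration only, the sample, assign, analyze and repeat operations recited in Claims 29 to 31 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation. Each individual is assumed to be supplied as a hypothetical tuple `(trait, pairs, probs)`, where `pairs` lists the haplotype pairs compatible with the individual's alleles and `probs` gives their probabilities. The function names `anova_f` and `sampled_anova_median` are illustrative. The sketch also assumes that the distribution is summarized by the median F statistic rather than by a p-value.

```python
import math
import random
import statistics
from collections import defaultdict


def anova_f(groups):
    """One-way ANOVA F statistic for a dict mapping haplotype -> trait values."""
    values = [v for g in groups.values() for v in g]
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        return float("nan")
    grand = sum(values) / n
    means = {h: sum(g) / len(g) for h, g in groups.items()}
    ss_between = sum(len(g) * (means[h] - grand) ** 2 for h, g in groups.items())
    ss_within = sum((v - means[h]) ** 2 for h, g in groups.items() for v in g)
    if ss_within == 0:
        return float("nan")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def sampled_anova_median(individuals, n_rounds=200, seed=0):
    """Repeat the sampling and ANOVA steps; return the median F statistic.

    individuals: list of (trait, pairs, probs) tuples (hypothetical layout).
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_rounds):
        groups = defaultdict(list)
        for trait, pairs, probs in individuals:
            # Sampling the first haplotype fixes the second one of the pair.
            h1, h2 = rng.choices(pairs, weights=probs, k=1)[0]
            # The trait value is assigned to both haplotypes (doubled sample).
            groups[h1].append(trait)
            groups[h2].append(trait)
        f = anova_f(groups)
        if not math.isnan(f):
            stats.append(f)
    return statistics.median(stats) if stats else float("nan")
```

When every individual has a single compatible haplotype pair, each round yields the same statistic, so the median equals the ordinary ANOVA F value on the doubled sample.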

32. A computer program product according to Claim 28 wherein the computer-readable program code that is configured to perform a regression comprises: computer-readable program code that is configured to store a rank of significance for each haplotype in the set; computer-readable program code that is configured to sample a first haplotype from the haplotypes that are compatible with the individual's set of alleles, to thereby define a second haplotype which is determined by the sampling of the first haplotype, for each individual; computer-readable program code that is configured to assign the value of the continuous trait for the individual to both the first haplotype and the second haplotype, to thereby define a doubled sample size; computer-readable program code that is configured to perform a one degree of freedom regression on the ranks for the sampled first and second haplotypes for all the individuals; computer-readable program code that is configured to repeatedly activate the computer-readable program code that is configured to sample, the computer-readable program code that is configured to assign the value of the continuous trait and the computer-readable program code that is configured to perform a one degree of freedom regression, to obtain a distribution of the correlation of the continuous trait and the haplotypes; and computer-readable program code that is configured to determine a value from the distribution that identifies a significance of the correlation.

33. A computer program product according to Claim 32 wherein the computer-readable program code that is configured to perform a one degree of freedom regression comprises: computer-readable program code that is configured to provide a design matrix having two columns of the ranks of the first and second haplotypes and having two rows for each individual; and computer-readable program code that is configured to perform a regression on the design matrix, to thereby identify a correlation value between the value of the continuous trait and the haplotypes.

34. A computer program product according to Claim 33 wherein the computer-readable program code that is configured to determine a value comprises: computer-readable program code that is configured to determine a median from the distribution that identifies a significance of the correlation.

35. A computer program product according to Claim 28 wherein the computer-readable program code that is configured to perform regression comprises: computer-readable program code that is configured to relate the value of the continuous trait for each individual to a vector of estimated frequencies of all haplotypes; and computer-readable program code that is configured to perform a multiple regression of the trait values on the vectors of estimated frequencies.
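
By way of illustration only, the multiple regression of trait values on per-individual vectors of estimated haplotype frequencies recited in Claim 35 can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the claimed implementation. It assumes each individual's vector sums to a constant, so the last haplotype is dropped as a baseline and an intercept is added to keep the design matrix full rank. The function names `solve_linear` and `regress_trait_on_haplotype_probs` are illustrative, and ordinary least squares via the normal equations is an assumed choice of fitting method.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design; drop a redundant haplotype column")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def regress_trait_on_haplotype_probs(traits, prob_vectors):
    """Least-squares fit of trait values on haplotype probability vectors.

    Returns [intercept, coefficient for haplotype 0, haplotype 1, ...],
    with the last haplotype omitted as the baseline (an assumption).
    """
    rows = [[1.0] + list(p[:-1]) for p in prob_vectors]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, traits)) for i in range(k)]
    return solve_linear(xtx, xty)
```

Each fitted coefficient then estimates the change in the trait associated with a haplotype relative to the omitted baseline haplotype.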

36. A computer program product according to Claim 28 wherein the computer-readable program code that is configured to determine comprises computer-readable program code that is configured to perform an expectation-maximization.
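
By way of illustration only, an expectation-maximization estimate of haplotype frequencies from unphased marker data, of the general kind referred to in Claim 36, can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the claimed implementation. It assumes Hardy-Weinberg proportions for haplotype pairs, a uniform starting point, and a fixed number of iterations. The genotype layout (one allele pair per marker) and the function names `compatible_pairs` and `em_haplotype_frequencies` are illustrative.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Enumerate unordered haplotype pairs compatible with an unphased genotype.

    genotype: tuple of (allele, allele) pairs, one per marker.
    """
    pairs = set()
    for choice in product(*[((a, b), (b, a)) for a, b in genotype]):
        h1 = tuple(c[0] for c in choice)
        h2 = tuple(c[1] for c in choice)
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, n_iter=100):
    """Estimate haplotype frequencies by EM, assuming Hardy-Weinberg pairing."""
    pair_lists = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pl in pair_lists for pair in pl for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform start (assumption)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pl in pair_lists:
            # E-step: posterior weight of each compatible pair.
            w = [freq[h1] * freq[h2] * (1 if h1 == h2 else 2) for h1, h2 in pl]
            total = sum(w)
            for (h1, h2), wi in zip(pl, w):
                counts[h1] += wi / total
                counts[h2] += wi / total
        # M-step: expected haplotype counts over 2N chromosomes.
        n_chrom = 2 * len(genotypes)
        freq = {h: counts[h] / n_chrom for h in haps}
    return freq
```

The posterior weights computed in the E-step are the per-individual probabilities of compatible haplotype pairs, which is the quantity the sampling and regression steps of the other claims operate on.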

37. A computer program product that associates haplotype frequencies for a plurality of individuals with a continuous trait, each individual including a pair of chromosomes having a plurality of markers thereon, each marker having a pair of alleles, wherein a haplotype comprises a combination of alleles for a set of markers on a predetermined chromosome, the computer program product operating on data for each individual including a value of the continuous trait and a pair of alleles for each of the markers in a subset of markers from a set of markers that may correlate with the continuous trait, the computer program product comprising a computer usable storage medium having computer-readable program code embodied in the medium, the computer-readable program code comprising: computer-readable program code that is configured to determine probabilities of haplotypes that are compatible with the alleles in the subset of markers, for each individual; and computer-readable program code that is configured to perform a regression on the probabilities of haplotypes that are compatible with the alleles in the subset of markers, for all the individuals, to determine correlations between the continuous trait and the haplotypes.

38. A computer program product according to Claim 37 wherein the computer-readable program code that is configured to perform a regression comprises: computer-readable program code that is configured to sample a first haplotype from the haplotypes that are compatible with the individual's set of alleles, to thereby define a second haplotype which is determined by the sampling of the first haplotype, for each individual; computer-readable program code that is configured to assign the value of the continuous trait for the individual to both the first haplotype and the second haplotype, to thereby define a doubled sample size; computer-readable program code that is configured to perform an analysis of variance by comparing average values of the trait among the sampled first and second haplotypes for all the individuals; computer-readable program code that is configured to repeatedly activate the computer-readable program code that is configured to sample, the computer-readable program code that is configured to assign and the computer-readable program code that is configured to perform, to obtain a distribution of correlations of the continuous trait and the haplotypes; and computer-readable program code that is configured to determine a value from the distribution that identifies a significance of the correlation.

39. A computer program product according to Claim 38 wherein the computer-readable program code that is configured to perform an analysis of variance comprises: computer-readable program code that is configured to provide a design matrix of first and second indicator values having two rows for each individual, where the second indicator value is associated with the first and second haplotypes and remaining positions in the design matrix are set to the first indicator value in the two rows; and computer-readable program code that is configured to perform a regression on the design matrix, to thereby identify a correlation value between the value of the continuous trait and the first and second haplotypes.

40. A computer program product according to Claim 39 wherein the computer-readable program code that is configured to determine a value comprises: computer-readable program code that is configured to determine a median from the distribution that identifies a significance of the correlation.

41. A computer program product according to Claim 37 wherein the computer-readable program code that is configured to perform a regression comprises: computer-readable program code that is configured to assign a rank of significance for each haplotype in the set; computer-readable program code that is configured to sample a first haplotype from the haplotypes that are compatible with the individual's set of alleles, to thereby define a second haplotype which is determined by the sampling of the first haplotype, for each individual; computer-readable program code that is configured to assign the value of the continuous trait for the individual to both the first haplotype and the second haplotype, to thereby define a doubled sample size; computer-readable program code that is configured to perform a one degree of freedom regression on the ranks for the sampled first and second haplotypes for all the individuals; computer-readable program code that is configured to repeatedly activate computer-readable program code that is configured to sample, computer-readable program code that is configured to assign the value of the continuous trait and computer-readable program code that is configured to perform a one degree of freedom regression, to obtain a distribution of the correlation of the continuous trait and the haplotypes; and computer-readable program code that is configured to determine a value from the distribution that identifies a significance of the correlation.

42. A computer program product according to Claim 41 wherein the computer-readable program code that is configured to perform a one degree of freedom regression comprises: computer-readable program code that is configured to provide a design matrix having two columns of the ranks of the first and second haplotypes and having two rows for each individual; and computer-readable program code that is configured to perform a regression on the design matrix, to thereby identify a correlation value between the value of the continuous trait and the haplotypes.

43. A computer program product according to Claim 42 wherein the computer-readable program code that is configured to determine a value comprises: computer-readable program code that is configured to determine a median from the distribution that identifies a significance of the correlation.

44. A computer program product according to Claim 37 wherein the computer-readable program code that is configured to perform regression comprises: computer-readable program code that is configured to relate the value of the continuous trait for each individual to a vector of estimated frequencies of all haplotypes; and computer-readable program code that is configured to perform a multiple regression of the trait values on the vectors of estimated frequencies.

45. A computer program product according to Claim 37 wherein the computer-readable program code that is configured to determine comprises computer-readable program code that is configured to perform an expectation-maximization.

46. A method according to Claim 1, wherein the step of determining comprises obtaining a composite haplotype.

47. A method according to Claim 10, wherein the step of determining comprises obtaining a composite haplotype.

48. A method according to Claim 19, wherein the step of determining comprises obtaining a composite haplotype.

49. A computer program product according to Claim 28 wherein the computer-readable program code that is configured to determine comprises computer-readable program code that is configured to obtain a composite haplotype.

50. The computer program product according to Claim 37 wherein the computer-readable program code that is configured to determine comprises computer-readable program code that is configured to obtain a composite haplotype.

51. A method according to Claim 1, wherein the step of determining probabilities of haplotypes that are compatible with the alleles in the subset of markers comprises determining a haplotype-response association test on unrelated individuals.

52. A method according to Claim 10, wherein the step of determining probabilities of haplotypes that are compatible with the alleles in the subset of markers comprises determining a haplotype-response association test on unrelated individuals.

53. A method according to Claim 19, wherein the step of determining probabilities of haplotypes that are compatible with the alleles in the subset of markers comprises determining a haplotype-response association test on unrelated individuals.

54. A computer program product according to Claim 28 wherein the computer-readable program code that is configured to determine comprises computer-readable program code that is configured to determine a haplotype-response association test on unrelated individuals.

55. A computer program product according to Claim 37 wherein the computer-readable program code that is configured to determine comprises computer-readable program code that is configured to determine a haplotype-response association test on unrelated individuals.