Analysis of Microarray Gene Expression Data - M. Lee (Kluwer
ANALYSIS OF MICROARRAY
GENE EXPRESSION DATA
No part of this eBook may be reproduced or transmitted in any form or by any means, electronic,
mechanical, recording, or otherwise, without written consent from the Publisher
List of Tables
Preface
2.2 Genes
2.3 DNA
2.4 RNA
2.6 Proteins
3. MICROARRAY TECHNOLOGY
3.6 Hybridization
Real-time RT-PCR
5. BACKGROUND NOISE
Data Set
Genes
Testing
Approach
Testing
Designs
List of Figures
6.1 MA plot
List of Tables
9.1 A half-replicate of a
design
9.2 The other half-replicate of a
design
9.3 An incomplete block design for 4 treatments
Preface
I thank Jeff Sklar and his former research team at the Brigham and
Women’s Hospital for introducing me to microarray technology in 1999.
At that time, statistical methodology for analyzing microarray data was
a new research field that needed much development, and published reports
in the literature were sparse. It took me several weeks at Frank
Kuo’s laboratory observing the procedures and details before I began
to understand how gene expression is measured in microarray experiments.
During the past few years, statistical models and methods for
microarray data have been studied by many investigators. There is still,
however, much room for improvement. Hence I thought it might be a
useful contribution if I published a synthesis of what I have learned.
I thank David Beier, Mason Freeman, Cynthia Morton, and Rus
Yukhananov for providing datasets for illustrations in the book. I thank
Harry Björkbacka for contributing the chapter on microarray technologies
and for providing insightful comments for the chapter on DNA,
proteins, and gene expression. I thank Ming-Hui Chen, Frank Kuo,
Weining Lu, and Pi-Wen Tsai for providing helpful comments on draft
chapters of this book. I am especially grateful to Alex Whitmore for
his many constructive comments, contributions, encouragement, and his
tireless efforts in reading preliminary drafts. I thank Paul Guttry, Jaylyn
Olivo, and Nancy Voynow of the Editorial Office at Brigham and
Women’s Hospital for proofreading the manuscript. Some errors might
remain in the book, but their number would be greater without the help
I have received. In order to write this book, I have worked in my office
every weekend and holiday for the past three years. I thank
my family for their understanding. This project is supported in part by
National Institutes of Health grants CA89756, HG02510 and HL72358.
Thanksgiving in Boston, 2003
GENOME PROBING
USING MICROARRAYS
Chapter 1
INTRODUCTION
Notes
1 Lander, E.S. (1996). Science, 274, 536-539.
2 Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown,
P.O., Botstein, D. and Futcher, B. (1998). Molecular Biology of the Cell, 9, 3273-3297.
3 Golub, T.R., Slonim, D.K., Tamayo, P., et al. (1999). Science, 286, 531-537.
4 Welsh, J.B., Zarrinkar, P.P., Sapinoso, L.M., et al. (2001). Proceedings of the National
Academy of Sciences, USA, 98, 1176-1181.
5 Harcia, J.G. (1999). Nature Genetics, 21, 42-47.
6 Martin, K.J., Graner, E., Li, Y., Price, L.M., Kritzman, B.M., Fournier, M.V., Rhei, E.,
and Pardee, A.B. (2001). Proceedings of the National Academy of Sciences, USA, 98,
2646-2651.
7 Hedenfalk, I., Ringner, M., Ben-Dor, A., Yakhini, Z., Chen, Y., Chebil, G., Ach, R.,
Loman, N., Olsson, H., Meltzer, P., Borg, A., Trent, J. (2003). Proceedings of the National
Academy of Sciences, USA, 100, 2532-2537.
8 Rosenwald, A., Alizadeh, A.A., Widhopf, G., et al. (2001). Journal of Experimental
Medicine, 194, 1639-1648.
Chapter 2
What has just been described is the central dogma of molecular biology,
which formulates how information is stored and converted to all the
components and interactions that build up a living organism.
Proteins are the most functionally versatile of the life molecules. Being
the “work horses” or “machines” of a cell, proteins catalyze an extraordinarily
wide variety of chemical reactions and also serve as the building
blocks of cellular structures. They are the building blocks of muscles,
skin, and hair, as well as the enzymes that catalyze and control all
chemical reactions in an organism, ranging from food digestion to nerve
impulses, and the components that are responsible for DNA replication,
transcription, and translation.
In the following sections we will discuss the building blocks and the
higher-order structure of the macromolecules of life.
2.2. Genes
Genes are the units of the DNA sequence that control the identifiable
hereditary traits of an organism. A gene can be defined as a segment of
DNA that specifies a functional RNA. The total set of genes carried by an
individual or a cell is called its genome. The genome defines the genetic
construction of an organism or cell, or the genotype. The phenotype,
on the other hand, is the total set of characteristics displayed by an
organism under a particular set of environmental factors. The outward
appearance of an organism (phenotype) may or may not directly reflect
the genes that are present (genotype). Today the complete genome
sequences of several species are known, including several bacteria, yeasts,
2.3. DNA
Except for some viruses, the genetic material of all known organisms
consists of one or more long molecules of deoxyribonucleic acid (DNA).
The chemical components of the DNA molecule dictate the inherent
properties of a species. DNA is made up of chains of chemical building
blocks called nucleotides. Each nucleotide consists of a phosphate
group, a deoxyribose sugar molecule, and one of four different nitrogenous
bases usually referred to by their initial letters: guanine (G), cytosine
(C), adenine (A), or thymine (T). Genetic information is encoded
in DNA by the sequence of these nucleotides. The information stored
in the sequence of nucleotides in terms of the four nitrogenous bases is
analogous to a long word in a four-letter alphabet.
The carbons in the deoxyribose sugar group of a nucleotide are assigned
numbers followed by a prime symbol (1′, 2′, etc.). In DNA, the
nucleotides are connected to each other via a link from the 5′ phosphate
group of one pentose ring of the deoxyribose sugar to the 3′ OH
group of the next pentose ring. The chemical connections between the
repeating sugar and phosphate groups are called phosphodiester bonds.
With one 5′ end and one 3′ end, each chain is said to have polarity.
It is conventional to write nucleic acid sequences in the 5′ → 3′ direction.
DNA forms a double helix of two intertwined chains (strands) of
nucleotides. The two polynucleotide chains run in opposite directions;
that is, one strand runs in the 5′ → 3′ direction, while the other strand
runs in the 3′ → 5′ direction.
It was proposed in the now classic manuscript by Watson and Crick8 in
1953 that the two nucleotide chains are held together by hydrogen bonds
that form between the nitrogenous bases. The polarity of the double
helix requires specific hydrogen bonding between the bases so that they
fit together. Guanine preferentially hydrogen-bonds with cytosine, and
adenine bonds preferentially with thymine. That is, G pairs only
with C, and A pairs only with T. These matching base pairs are referred
to as complementary. For example, a short segment with ten nucleotides
might be of the form
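
The pairing rule can be illustrated with a minimal Python sketch. The ten-nucleotide sequence below is a made-up example, not the one printed in the book, and the function names are hypothetical. Given one strand, the code derives its partner strand base by base and then rewrites that partner in its own conventional reading direction.

```python
# Watson-Crick pairing as described in the text: G pairs with C, A pairs with T.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Base-by-base partner of `strand`, listed in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Partner strand written in its own conventional direction (reversed order)."""
    return complement(strand)[::-1]


example = "ATGCCGTTAG"  # hypothetical ten-nucleotide segment
print(complement(example))          # partner bases, position by position
print(reverse_complement(example))  # partner strand read in its own direction
```

Because the two strands run in opposite directions, the reversed form is the one that would normally be written down for the second strand.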
2.4. RNA
As was introduced in section 2.1, the core biochemical flow of genetic
information can be summarized as the process of RNA synthesis (transcription)
and the process of protein synthesis (translation). The first
step in making a protein is to copy, or transcribe, the information encoded
in the DNA of the genes into a single-stranded molecule called
ribonucleic acid (RNA). Since this process is similar to the process of
copying written words, the synthesis of RNA from DNA is called transcription.
The DNA is said to be transcribed into RNA, and the RNA
is called a transcript. The nucleotides of RNA contain the sugar ribose,
while the nucleotides of DNA contain deoxyribose, which has one fewer
oxygen atom. Furthermore, instead of thymine, RNA contains uracil (U), a
base that has hydrogen-bonding properties identical to those of thymine.
Hence the RNA bases are G, C, A, and U. RNA is less stable than DNA.
RNA synthesis requires the RNA polymerase enzyme complex, which binds
to a specific sequence at one end of a gene (the promoter) and separates
the two strands of DNA. It moves along the gene, maintaining the separated
strand “bubble”, and uses only one of the separated strands as
a template, synthesizing an ever-growing tail of polymerized nucleotides
that eventually becomes the full-length transcript. Hence, RNA is a
single-stranded nucleotide chain, not a double helix. Since RNA is always
synthesized in the 5′ → 3′ direction, the addition of ribonucleotides
by RNA polymerase is at the 3′ end of the growing chain.
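
As a rough illustration of transcription, the following Python sketch builds an RNA from a template strand by complementary pairing, with uracil taking the place of thymine. The sequences and function names are hypothetical, and the template is assumed to be supplied in the order in which the polymerase reads it, so that the RNA comes out in its order of synthesis.

```python
# Pairing used while copying a DNA template into RNA: U replaces T in the product.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template: str) -> str:
    """RNA produced from a template strand given in the order the polymerase reads it."""
    return "".join(DNA_TO_RNA[base] for base in template)


def coding_strand_to_rna(coding: str) -> str:
    """The non-template strand has the same sequence as the RNA, except T becomes U."""
    return coding.replace("T", "U")


template = "TACCGGATT"  # hypothetical template strand
coding = "ATGGCCTAA"    # its complementary (non-template) strand
print(transcribe(template))          # AUGGCCUAA
print(coding_strand_to_rna(coding))  # AUGGCCUAA
```

Both routes give the same transcript, which is why an mRNA sequence can be read off either strand once it is known which one served as the template.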
There are two general classes of RNAs. Those that take part in the
process of decoding genes into proteins are referred to as “informational
RNAs”; these are the messenger RNAs (mRNA). In the other class, the RNA
itself is the final functional product. These RNAs are referred to as
“functional RNAs”. Functional RNAs are the transfer RNAs (tRNA)
and the ribosomal RNA (rRNA), which are both part of the intricate
protein synthesis machinery that translates the informational mRNA
into protein.
Figure 2.2 shows that the sequence of messenger RNA is complementary
to the sequence of the bottom strand of DNA and is identical to
the top strand of DNA, except for the replacement of T with U. A messenger
RNA includes a sequence of nucleotides that corresponds to the
sequence of amino acids in the protein. This part of the nucleic acid is
called the coding region. Because mRNA is an exact copy of the DNA
coding regions, mRNA analysis can be used to identify polymorphisms
in coding regions of DNA. A polymorphism is a DNA region for which
nucleotide sequence variants exist in a population of organisms. Such
variations can sometimes explain the occurrence of a disease or enzyme
2.6. Proteins
The primary structure of a protein is a linear chain of building blocks
called amino acids. There are 20 amino acids that commonly occur in
proteins. These amino acids are linked together by covalent bonds called
peptide bonds. A peptide bond is formed through a condensation reaction
during which one water molecule is removed. Because of the manner in
which the peptide bond forms, a polypeptide chain always has an amino
(NH2) end and a carboxyl (COOH) end. This primary chain is coiled
and folded to form a functional protein. Proteins are the most important
determinants of the properties of the cells and organisms. The biological
role of most genes is to encode, or carry, information for the composition
of proteins. This composition, together with the timing and amount of
each protein produced, determines the structure and physiology of an
organism, i.e., the phenotype.
Because the process of reading the mRNA sequence and converting it
into an amino acid sequence is like converting one language into another,
the process of protein synthesis is called translation. The four-letter
alphabet of the genes is translated into the 20-amino-acid alphabet of
proteins in ribosomes. Ribosomes are big complexes of several proteins
and ribosomal RNA (rRNA). The rRNA functions to guide mRNA into
a correct starting position by binding to special sequences present in
the beginning of all mRNAs. The translation of the genetic code into a
protein is achieved with the help of transfer RNAs (tRNAs). The tRNAs
contain a trinucleotide sequence complementary to the codon called the
anticodon. Each species of tRNA molecules is charged with a specific
amino acid in an enzymatic reaction, hence coupling a certain amino
acid to a certain anticodon nucleotide triplet on the tRNA molecule.
In essence, the translation of the DNA code to protein amino acids is
done in this enzymatic coupling step. The ribosome subsequently aligns
the mRNA codon with the matching tRNA anticodon, and if the base
pairing matches, the amino acid carried by the tRNA is attached to
the growing chain of amino acids to form a polypeptide chain. Hence,
the specific base pairing of the nucleotides once again ensures that the
correct information is transferred. When the ribosome reaches a stop
codon, it releases the polypeptide chain, which then folds into the defined
three-dimensional structure of a protein. Proteins must often undergo
post-translational modifications to become active. These modifications
can, for instance, be cleavages of the polypeptide chain at predefined
sites or binding of additional molecules like lipids, sugars, or co-factors
that assist in catalysis of chemical reactions.
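
The codon-by-codon reading performed by the ribosome can be sketched in Python. This is a simplified illustration only: it uses the standard genetic code with one-letter amino acid symbols, starts reading at the first nucleotide supplied rather than searching for a start codon, and ignores tRNA charging and post-translational modification. The names `CODON_TABLE` and `translate` are hypothetical.

```python
# Standard genetic code, listed in the conventional U, C, A, G order for each
# codon position; "*" marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna: str) -> str:
    """Read an mRNA three bases at a time until a stop codon or the end is reached."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon: the polypeptide chain is released
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGUUUGGCUGA"))  # MFG
```

The lookup table plays the role of the charged tRNAs: each three-letter codon is matched to exactly one amino acid, and the loop stops at the first stop codon.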
Notes
1 Lewin, B. (2000). Genes VII, Oxford University Press, New York.
2 Cooper, G.M. (2000). The Cell - A Molecular Approach, 2nd ed., Sinauer Associates Inc,
Sunderland, Massachusetts.
3 Griffiths, A.J.F., Gelbart, W.M., Miller, J.H., and Lewontin, R.C. (1999). Modern Genetic
Analysis, Freeman, New York.
4 Lehninger, A.L., Nelson, D.L., and Cox, M.M. (2000). Principles of Biochemistry, 3rd ed.,
Worth Publishing.
5 Griffiths, A.J.F., Miller, J.H., Suzuki, D.T., Lewontin, R.C., and Gelbart, W.M. (2000).
An Introduction to Genetic Analysis, Freeman, New York.
6 Alberts, B., Johnson, A., Lewis, J., Raff, M., Roberts, K., Walter, P. (2002). Molecular
Biology of the Cell, 4th ed., Garland Publishing.
7 Dale, J.W. and von Schantz, M. (2002). From Genes to Genomes: Concepts and Applications
of DNA Technology. John Wiley and Sons, Ltd, England.
8 Watson, J.D., and Crick, F.H.C. (1953). Nature, 171, 737-738.
Chapter 3
MICROARRAY TECHNOLOGY
trol over what part of the gene will be utilized for hybridization. The
oligonucleotides can, for instance, be designed to optimally differentiate
between highly similar transcripts that might cross-hybridize on a cDNA
array.
In-situ oligonucleotide arrays (Figure 3.5) were developed by Fodor
et al.16 and Affymetrix, Inc. In-situ oligonucleotide arrays use a combination
of photolithography and solid-phase oligonucleotide chemistry to
synthesize short oligonucleotide probes (25-mer oligos) directly on the
solid support surface. The number of oligonucleotides (50,000 probes
per 1.28 square centimeters) on a chip manufactured by this method
vastly exceeds what can be achieved by spotting solution robotically.
Affymetrix, Inc. has chosen to utilize this advantage to construct an
array with several oligonucleotide probes and cross-hybridization controls
for each target gene. However, the researcher has little, if any,
control over what probes are used on pre-manufactured arrays like the
Affymetrix GeneChip arrays. On the other hand, comparison of results
between different laboratories is facilitated by the use of products from
a common manufacturer.
For in-situ oligonucleotide arrays, the test and reference samples (or
the treatment and control samples) are hybridized separately on different
chips. In contrast, for either spotted cDNA arrays or spotted
oligonucleotide arrays, a test and a reference sample labeled with two
different fluorescent dyes are commonly hybridized simultaneously on
the same array. This difference affects how microarray data generated
with single-color or two-color arrays are analyzed (see section 3.8).
been fully sequenced, one can amplify every known and predicted open
reading frame (ORF) in the genome using reverse transcription PCR
(RT-PCR) and sequence-specific primers. In organisms with smaller
genomes and infrequent introns, such as yeast and prokaryotic microbes,
purified total genomic DNA serves as a template, and sequence-specific
oligonucleotides are used as primers.
When dealing with large genomes and genes with frequent introns,
such as those of the human and mouse, cloned expressed sequence tags
(ESTs), individual full-length cDNA clones, or collections of partially sequenced
cDNAs corresponding to each of these transcripts can be used
as the source of gene-specific detector probes in an array. Many methods
are available for recovering purified cDNA from the PCR amplification
reaction. A simple method is to prepare purified template cDNAs from
the bacterial colonies that harbor them and follow up with ethanol precipitation,
gel filtration, or both, to prepare relatively pure cDNA for
printing. The choice of template source and PCR strategy varies with the
organism being studied.
Synthesized oligonucleotides can also be used as probes in spotted
microarrays (Figure 3.4). Genes of interest are chosen from public sequence
databases including GenBank, dbEST, and UniGene. Many
variables have to be considered in selecting the sequence of the oligonucleotide
to be made. First, the length of the oligonucleotide has to be
chosen. The longer the oligonucleotide, the more specific it will be.
However, longer oligonucleotides are more costly and more difficult to
make. Today several commercial oligonucleotide sets are available for
mouse, human, and other organisms, varying in probe length between
30 and 70 nucleotides. Second, the probes must be selected so that they
are specific for their target genes. If similarities exist between probes
on the same microarray, they can cross-hybridize to more than one gene
target, making the results hard to analyze. Third, all oligonucleotides
must have similar hybridization properties. Usually all the probes are
designed so that their melting temperatures are within 1-2 degrees Celsius
of one another and that they have a similar content of G and C nucleotide
base pairs. Several probe selection algorithms have been developed, but so
far no consensus exists on the most effective design principles. For instance,
there are several algorithms just for calculating oligonucleotide
melting temperatures. Other considerations that may go into designing
the probes are self-hybridization properties (palindromic sequences) and
synthesis efficiency of certain sequences. The location of the probe along
the message may also be important.
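
Some of these probe properties are easy to compute, as the Python sketch below suggests. It is only a rough illustration: the melting temperature uses the simple Wallace rule of thumb (2 °C per A or T, 4 °C per G or C), which is intended for short oligonucleotides and is an assumption made here, not the method used by any particular probe-design package. The function names and the example sequences are hypothetical.

```python
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def gc_content(probe: str) -> float:
    """Fraction of bases in the probe that are G or C."""
    return (probe.count("G") + probe.count("C")) / len(probe)


def wallace_tm(probe: str) -> int:
    """Rule-of-thumb melting temperature in degrees Celsius (short oligos only)."""
    at = probe.count("A") + probe.count("T")
    gc = probe.count("G") + probe.count("C")
    return 2 * at + 4 * gc


def is_self_complementary(probe: str) -> bool:
    """True if the probe equals its own reverse complement (a palindromic sequence)."""
    return probe == "".join(PAIR[base] for base in reversed(probe))


for probe in ("GAATTC", "ATGCCGTTAG"):  # hypothetical candidate sequences
    print(probe, gc_content(probe), wallace_tm(probe), is_self_complementary(probe))
```

A design pipeline would typically screen candidates so that these values fall within a narrow band across the whole array and would reject self-complementary sequences; more accurate melting-temperature models exist and give different numbers.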
3.6. Hybridization
Hybridization of the labeled target to the probes on a microarray is
performed by adding the targets dissolved in hybridization buffer to the
slide within a confined space, followed by incubation for a given amount
of time at a certain temperature. The hybridization can, for instance,
be performed under a microscope slide cover slip or within a chamber
that limits the volume. Volumes are kept small to reduce the time of
hybridization. Automated hybridization stations have been developed
that agitate the hybridization solution over the slide and allow for better
control of hybridization conditions, which gives lower backgrounds and
better reproducibility. The hybridization conditions need to be set so as
to promote the specific hybridizations between the target and individual
probes and limit nonspecific hybridizations to the support itself or other
probes. This is achieved mainly by varying the temperature and the
ionic strength of the hybridization buffer. The temperature needs to be
lower than the melting temperature of the probes but sufficiently high
to reduce nonspecific hybridizations. The salt concentration, pH, and
other characteristics of the buffer may also promote specific hybridizations.
It may be advantageous to add competing DNA such as salmon sperm
DNA, Cot-1 DNA (enriched with mammalian repetitive sequences), and
poly-A DNA (to block nonspecific hybridization to poly-A regions). After
hybridization for anywhere from several hours to overnight, the hybridization
solution is discarded and the slides are subjected to washes
of varying ionic strength to remove nonspecifically bound targets with
increasing stringency. After the wash regimen, the slide is dried and is
ready to be scanned. The dyes used are degraded over time by exposure
to light, so hybridized slides and labeled target solutions need to be
stored in the dark.
ning, the fewer PCR cycles are required to reach the threshold
cycle. The threshold cycle decreases linearly with the logarithm of
the starting amount of template in the PCR reaction. By constructing
standard curves from known amounts of starting template, unknown
samples can be quantified very accurately. See Ginzinger27 (2002) for
a review of gene quantification using quantitative real-time PCR and
Bustin28 (2002) for a review of reverse transcription PCR.
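
The standard-curve idea can be sketched as follows. The sketch assumes that the threshold cycle is a straight-line function of the base-10 logarithm of the starting amount, fits that line by ordinary least squares, and inverts it for an unknown sample. All numbers are invented for illustration, and the function names are hypothetical.

```python
import math


def fit_standard_curve(amounts, threshold_cycles):
    """Least-squares line: threshold cycle = intercept + slope * log10(amount)."""
    xs = [math.log10(a) for a in amounts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(threshold_cycles) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, threshold_cycles))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_amount(threshold_cycle, slope, intercept):
    """Invert the fitted line to estimate the starting amount of an unknown sample."""
    return 10 ** ((threshold_cycle - intercept) / slope)


# Hypothetical dilution series of known template amounts (arbitrary units).
standards = [1e2, 1e3, 1e4, 1e5]
cycles = [33.36, 30.04, 26.72, 23.40]

slope, intercept = fit_standard_curve(standards, cycles)
print(slope, intercept)                         # slope is negative
print(estimate_amount(28.38, slope, intercept)) # amount for an unknown sample
```

The negative slope reflects the statement in the text: the more template present at the start, the earlier the threshold is reached.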
Notes
1 Shalon, D., Smith, S.J., Brown, P.O. (1996). Genome Research, 6, 639-645.
2 Lockhart, D.J., Dong, H., Byrne, M.C., Follettie, M.T., et al. (1996). Nature Biotechnology,
14, 1675-1680.
3 Lander, E.S. (1996). Science, 274, 536-539.
4 Lipshutz, R.J., Fodor, S.P., Gingeras, T.R., Lockhart, D.J. (1999). Nature Genetics, 21,
20-24.
5 Brown, P.O., and Botstein, D. (1999). Nature Genetics, 21, 33-37.
6 Eisen, M.B., Brown, P.O. (1999). Methods in Enzymology, 303, 179-205.
7 Southern, E., Mir, K., Shchepinov, M. (1999). Nature Genetics, 21, 5-9.
8 Bowtell, D. (1999). Nature Genetics, 21, 25-32.
9 Cheung, V.G., Morley, M., Aguilar, F., Massimi, A., Kucherlapati, R., Childs, G. (1999).
21,10-14.
16 Fodor, S.P., Rava, R.P., Huang, X.C., Pease, A.C., Holmes, C.P., Adams, C.L. (1993).
Clara, CA.
20 Affymetrix GeneChip, 700228 rev. 2.
21 Lockhart, D., Dong, H., Byrne, M., Follettie, M., Gallo, M., Chee, M., Mittmann, M.,
Wang, C., Kobayashi, M., Horton, H., et al. (1996). Nature Biotechnology, 14, 1675-1680.
22 Wodicka, L., Dong, H., Mittmann, M., Ho, M., and Lockhart, D. (1997). Nature Biotech
31-36.
24 Ausubel, F.M. et al. (editors) (1993). Current Protocols in Molecular Biology, John Wiley
Chapter 4
INHERENT VARIABILITY
IN MICROARRAY DATA
population as all those that are extant in U.S. females of a given race
in a specified age bracket at a given point in time. The connection between
such a general population and the microarray results for a small
sample of uterine tumors taken from a few women in this population
is very distant and tenuous, spanning sampling variability introduced
at several intervening stages of sample selection. In order to generalize
the findings to the general population of women, the microarray studies
should be based on a random sample taken from the general patient population.
To reiterate, the following discussion and analysis will assume
that the immediate biological specimens in the laboratory are the target
biological populations of interest for statistical inferences.
Even after the biological specimen is in hand, variability in expression
measurement can arise from many factors. One of the first sources of
variability encountered in many microarray studies is that produced by
selecting the sample of genetic material from the population of interest.
For example, a sample of tumor tissue must be selected from a patient’s
tumor in a microarray study of uterine cancer. It is clear that the
sample material may vary to the extent that the tumor is not a uniform
biological object. The genetic composition of the sample may differ
depending on whether the core biopsy is taken, for instance, from the
peripheral zone or the transition zone of the tumor. The microarray
study design must take sampling variability of this kind into account.
For example, the design might call for taking a systematic selection of
several samples from different parts of the tumor.
In general, the exact gene expression levels are unknown, and particular
strategies have to be developed to quantify systematic errors in
microarray experiments. A variety of correction methods can be found in
the literature, including comparison of duplicated spots to quantify the
variability for the same array and the same pin; analysis of control spots
to quantify the variability from pin to pin and variations across the filter;
checking the reproducibility on different filters; analysis of empty background
spots for non-specific noise and overshining; and use of dilution
series of the target7. Brazma et al.8 (2001) proposed the Minimum Information
About a Microarray Experiment (MIAME) as a standard for
recording and reporting microarray-based gene expression data. Any
single microarray output is subject to substantial variability even under
the relatively controlled conditions of an experiment. It is advisable to
consider appropriate experimental designs and perform multiple stages
of quality control before hybridizing valuable experimental samples.
sample formed the target biological sample. Only output from channel 1
(green) contained expression readings for the target tissue. Output from
channel 2 (red) contained noise alone. The study design consisted of 288
genes, each printed at three locations on the same slide. By comparing
the signals from these triplicates, Lee et al. evaluated the minimum
variability that is likely to be inherent in a microarray system and learned
more about the reproducibility of the array process and the outcome
of analysis. The experiment was designed so that 32 of the 288 genes
would be expected to be highly expressed because of Alu repeats that
should cross-hybridize to similar sequences widely distributed among expressed
and unexpressed portions of the human genome. Results based
on individual replicates, however, show that there are 55, 36, and 58
highly expressed genes in replicates 1, 2, and 3, respectively. On the
other hand, we will show in later chapters that by applying appropriate
statistical methods one can pool the readings from the three replicates
and obtain more accurate analytical results such that only 2 of the 288
genes are incorrectly classified as expressed. As a result, a minimum of
three replicates is recommended in a microarray study. This replication
test data set is used to demonstrate a number of points later in this text.
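
To give a feel for why pooling replicates helps, here is a deliberately simple Python sketch. It is not the statistical method developed later in the book; it merely averages the base-10 logarithms of the replicate readings for each gene and compares the average with a cutoff. The gene names, readings, and cutoff are all invented.

```python
import math


def pooled_log_reading(replicates):
    """Average of the base-10 logarithms of the replicate readings for one gene."""
    return sum(math.log10(r) for r in replicates) / len(replicates)


def classify_expressed(readings_by_gene, log_cutoff):
    """Label each gene as expressed (True) when its pooled log reading exceeds the cutoff."""
    return {
        gene: pooled_log_reading(reps) > log_cutoff
        for gene, reps in readings_by_gene.items()
    }


# Hypothetical triplicate intensities; one replicate of each gene disagrees.
readings = {
    "geneA": [9000, 500, 12000],  # second replicate alone would look unexpressed
    "geneB": [800, 7000, 600],    # second replicate alone would look expressed
}
print(classify_expressed(readings, log_cutoff=3.5))
```

In this toy case a single aberrant spot would flip the call for either gene if replicates were examined one at a time, whereas the pooled value is dominated by the two replicates that agree.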
Notes
1 Schena, M., Editor. (2000). DNA Microarrays, Oxford University Press, New York.
2 Bulyk, M.L., Huang, X., Choo, Y., Church, G.M., (2001), Proceedings of the National
Academy of Sciences, USA, 98, 7158-7163.
3 Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown,
P.O., Botstein, D. and Futcher, B. (1998). Molecular Biology of the Cell, 9, 3273-3297.
4 Schuchhardt, J., Dieter, B., Arif, M., et al. (2000), Nucleic Acids Research.
5 Wang, X., Ghosh, S., and Guo, S. (2001). Nucleic Acids Research, 29, No. 15, e75.
6 Yang, Y.H., Dudoit, S., Luu, P., and Speed, T.P. (2001). In Bittner, M.L., Chen, Y.,
Dorsel, A.N., and Dougherty, E.R. (eds), Microarrays: Optical Technologies and Informatics,
SPIE Society for Optical Engineering, San Jose, CA.
7 Herzel, H., Beule, D., Kielbasa, S., Korbel, J. (2000), Chaos, 11, 1-3.
8 Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert C.,
Aach, J., Ansorge, W., Ball, C.A., Causton, H.C., Gaasterland, T., Glenisson, P., Holstege,
F.C.P., Kim, I.F., Markowitz, V., Matese, J.C., Parkinson, H., Robinson, A., Sarkans, U.,
Schulze-Kremer, S., Steward, J., Taylor, R., Vilo, J., and Vingron, M. (2001). Nature
Genetics, 29, 365-371.
9 Lee, M.-L.T., Kuo, F.C., Whitmore, G.A., and Sklar, J., (2000). Proceedings of the Na
BACKGROUND NOISE
The full data set consists of gene expression measurements for all G
genes and for all N specimens. The measurements
are usually arranged in a G × N matrix of values, with genes corresponding
to rows and specimen samples corresponding to columns. Different
microarray platforms yield different types of expression measurements.
The exact relationship of a measurement to the true concentration
of a given gene in a given specimen depends on the technology, embedded
adjustments that have been used, and, importantly, on the background
noise that is present. We defer the discussion of missing and saturated
intensity values to later chapters.
The ScanAlyze system, for example, outputs measures CH1I and CH2I,
which are uncorrected mean pixel intensities for the array spot for two
fluorescent hybridizations4 and also produces background corrections
CH1B and CH2B for the same channels.
The correction procedure (5.2) may yield negative values for gene expression,
especially where the expression component is small (or zero) or the
background estimate is large. The following variant of the background-corrected
reading is obtained when negative values from (5.2) are set to
zero.
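
The two versions of the corrected reading can be sketched in Python. The sketch assumes that the correction in (5.2) subtracts the background estimate from the uncorrected spot intensity, for example ScanAlyze's CH1I minus CH1B; that form, the function names, and the numbers are assumptions made for illustration.

```python
def background_corrected(intensity: float, background: float) -> float:
    """Plain background subtraction (assumed form of the correction); may be negative."""
    return intensity - background


def background_corrected_truncated(intensity: float, background: float) -> float:
    """Variant in which negative corrected readings are set to zero."""
    return max(intensity - background, 0.0)


# Hypothetical (CH1I, CH1B) pairs for three spots.
spots = [(2100.0, 1800.0), (1900.0, 2000.0), (30000.0, 1900.0)]
for ch1i, ch1b in spots:
    print(background_corrected(ch1i, ch1b), background_corrected_truncated(ch1i, ch1b))
```

Truncation keeps the readings non-negative, which matters if logarithms are to be taken later, but a zero still has to be handled separately before any log transformation.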
human tissue sample formed the target biological sample. The experiment
was designed so that 32 of the 288 genes would be expected to be
highly expressed because of Alu repeats5 that should cross-hybridize to
similar sequences widely distributed among expressed and unexpressed
portions of the human genome. Only output from channel 1 (green)
contained expression readings for the target tissue. Output from channel
2 (red) contained noise alone. Counting the 288(3) = 864 spots as
the designated gene set, we have G = 864. There is only one biological
specimen, so N = 1. Note that the designated gene set on the array
contains triplicates of 288 distinct genes. This data set is used to
demonstrate a number of points in the discussion that follows and also
later in the text.
The readings of gene expression for the 864 cDNA spots are displayed
in a histogram in Figure 5.3. Here the specimen index takes a single
value because only one
specimen is under consideration. The histogram shows the gene expression
data in terms of their common logarithms (i.e., logarithms to base
10). The gene expression data are the output denoted as CH1I in the
ScanAlyze system, which are the uncorrected mean pixel intensities of
spots for the green fluorescent hybridization (Eisen, 1999). Observe the
large concentration of small readings that generally correspond to noise.
A scattering of larger readings also appears; these are mainly associated
with gene probes that should be highly expressed. The logarithmic scale
amplifies the detail in the lower range of the data.
The noise component of the data appears unimodal and roughly symmetrically
distributed. The distribution of the smaller number of expressed
genes stands out as a separate distribution to the right end of
the histogram scale. Thus, the distribution pattern of the data looks
very much like a mixture of gene expression readings for a large number
of unexpressed genes (noise alone) and a smaller number of genes
expressed to varying degrees.
Taking the Replication Test Data Set as a case example, a log-value
of 3.8 is used as a cutoff for unexpressed genes (based on a rough visual
judgment for Figure 5.3). It is found that 761 of the 864 spots lie below
this cutoff level. As the experiment involved 256 gene triplicates that
should be unexpressed, the count 3(256) = 768 closely matches this
count of 761. Among the 761 spots, a plot of the uncorrected readings
against the background noise estimates shows
a fairly clear linear relationship but one that does not follow the line
of identity. The scatter plot and line of identity appear in Figure 5.4.
The correlation coefficient for the two quantities is 0.773. What is a little
more surprising is that only 9 of the background-corrected readings
for the 761 unexpressed genes are negative. If the background noise
estimates are unbiased estimates of their true values, then about half
The plot shows clearly that the uncorrected readings for the 761 spots are
consistently larger than their counterpart background noise estimates
(i.e., they lie above the line of identity). This fact is also indicated
by a comparison of their mean values: 2250 being the mean for the
uncorrected readings and
1722 for the background estimates. Thus, the results of the diagnostic check suggest that
background correction for this case example is moderately successful at
best.
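The diagnostic check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name and the toy readings are hypothetical, and the cutoff is taken to be on the base-10 logarithmic scale (the base is not stated in this passage, but base 10 is consistent with mean intensities near 2000 falling below a cutoff of 3.8).

```python
import math


def background_diagnostic(intensity, background, log10_cutoff):
    """Summarise background correction for spots judged unexpressed.

    intensity, background: raw spot and background readings, same length.
    log10_cutoff: spots with log10(intensity) below this value are
    treated as unexpressed (assumed base-10 scale).
    """
    pairs = [(i, b) for i, b in zip(intensity, background)
             if math.log10(i) < log10_cutoff]
    n = len(pairs)
    # Count background-corrected readings that come out negative.
    negatives = sum(1 for i, b in pairs if i - b < 0)
    return {
        "n_unexpressed": n,
        "n_negative": negatives,
        "mean_intensity": sum(i for i, _ in pairs) / n,
        "mean_background": sum(b for _, b in pairs) / n,
    }


# Toy data (hypothetical): four low spots and one clearly expressed spot.
summary = background_diagnostic(
    [1500, 2400, 3000, 20000, 1800],
    [1600, 1700, 2100, 1900, 1500],
    3.8,
)
```

If the background estimates were unbiased for unexpressed spots, `n_negative` would be roughly half of `n_unexpressed` and the two means would be close; a large gap between them signals the kind of shortfall discussed above.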
The case example is convenient because there is a clear separation
of expressed and unexpressed genes (Figure 5.3). In many applications,
however, the separation is not as clear, and the set of unexpressed genes
required for the diagnostic test will not be so easy to discern.
Taking the logarithm (to base 2) of both sides, the multiplicative model
in (5.6) can be written as an additive model on the logarithmic scale
experience shows that the MM probes actually track the signal. They
state that “the MMs should be viewed as a set of average lower affinity
probes”. They then go on to develop a method to exploit the signal
content of the MM probes. Along a similar line, Irizarry et al.10 (2003)
remark that “Recent results . . . suggest that subtracting MM as a way
of correcting for non-specific binding is not always appropriate”. They
cite two sources for this remark and go on to say that “until a better
solution is proposed, simply ignoring these values is preferable.”
Notes
1 Brown, C.S., Goodwin, P.C., and Sorger, P.K. (2001). Proceedings of the National
Academy of Sciences, USA, 98, 8944-8949.
2 Schena, M., Shalon, D., Davis, R.W., and Brown, P.O. (1995). Science, 270, 467-470.
3 DeRisi, J.L., Iyer, V.R., and Brown, P.O. (1997). Science, 278, 680-686.
4 Eisen, M.B. (1999). ScanAlyze User Manual, Version 2.32; Stanford University: Stanford,
CA.
5 Lee, M.-L.T., Kuo, F.C., Whitmore, G.A., and Sklar, J. (2000). Proceedings of the National
Academy of Sciences, USA, 97, 9834-9839.
6 Li, C., and Wong, W.H. (2001). Proceedings of the National Academy of Sciences, USA, 98,
31-36.
7 Sasik, R., Calvo, E., and Corbeil, J. (2002). Bioinformatics, 18, 1633-1640.
8 Affymetrix (2002). http://www.affymetrix.com/products/25mer-content.html.
9 Naef, F., Socci, N.D., and Magnasco, M. (2003). Bioinformatics, 19(2), 178-184.
10 Irizarry, R.A., Bolstad, B.M., Collin, F., Cope, L.M., Hobbs, B., and Speed, T.P. (2003).
Chapter 6
TRANSFORMATION
AND NORMALIZATION
In this case, the offset parameter is chosen separately for each array
and then applied across all genes on the array. The red (R) and green (G)
color readings can be reversed in calculating this log-ratio of intensities.
Observe that the offset parameter tends to keep the average intensity
of the combined colors roughly unchanged but shifts one color reading
relative to the other. The shift is most pronounced for low intensities.
The authors propose that the offset parameter be chosen to minimize the
sum of the absolute deviations of the observations from their median
value, taken over all genes. This fitting criterion has the effect of
giving a nearly
horizontal scatter in an MA plot for the two transformed intensities.
See section 6.2.4 for a discussion of such plots.
The effect of the affine adjustment on the transformed expression
intensities may need to be explored in different applications. For example,
with a transformation the first derivative equals
for natural logarithms. This fact tells us that a small change
where
represents the proportional error that always exists,
The transformation in (6.11) has the general form reported by the
authors, but the expression for the constant in terms of the model
parameters differs from that obtained in the cited paper. The two versions
of the transformation differ little when is small. This transformation
stabilizes the asymptotic variance of data distributed according to model
(6.6). For a large value of its argument, the transformation (6.11) is
approximately the natural logarithm. For an argument near zero, the
transformation (6.11) is approximately linear. This transformation was
considered by Hawkins5 (2002) in the context of another application.
Geller et al.6 (2003) show that data from Affymetrix GeneChips conform
to the same two-component model as in Durbin et al. (2002). Huber
et al.7 (2002) also consider a family of transformations that is related to
the generalized-log family.
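The two limiting behaviours just described can be checked numerically. The sketch below assumes a generalized-log form ln(y + sqrt(y^2 + c)), which is one common member of the generalized-log family; it is not necessarily the exact form or constant of (6.11), which is not reproduced here. In this parameterization the large-argument limit is the natural logarithm up to an additive constant (ln 2).

```python
import math


def glog(y, c):
    """Assumed generalized-log form: ln(y + sqrt(y^2 + c)), with c > 0."""
    return math.log(y + math.sqrt(y * y + c))


def large_y_approx(y):
    """Logarithmic approximation for large y: ln(2y) = ln(y) + ln(2)."""
    return math.log(2.0 * y)


def near_zero_approx(y, c):
    """First-order (linear) approximation around y = 0."""
    return 0.5 * math.log(c) + y / math.sqrt(c)


c = 100.0  # hypothetical value of the transformation constant
gap_large = abs(glog(1.0e6, c) - large_y_approx(1.0e6))
gap_small = abs(glog(0.5, c) - near_zero_approx(0.5, c))
```

Both gaps are tiny for these inputs, which illustrates why such a transformation behaves like a logarithm for strongly expressed genes while remaining well defined and nearly linear for readings near zero.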
the test and the reference sample and that the number of upregulated
genes largely matches the number of downregulated genes, normalization
factors can be based on the total fluorescence, expression ratios,
or regression analysis. Total fluorescence normalization assumes that
approximately the same total amount of test and reference sample has
hybridized, and thus the total fluorescence of both dyes used should be
the same on the array. A normalization factor calculated from the ratio
of the total fluorescence of the dyes can be used to re-scale the intensity
of each gene on the array.
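A minimal sketch of total fluorescence normalization follows. The function name and the choice to rescale the red channel to match the green channel are illustrative assumptions; either channel could serve as the reference.

```python
def total_fluorescence_normalize(red, green):
    """Rescale red intensities so both channels have equal total signal.

    Returns the normalization factor and the rescaled red intensities.
    """
    factor = sum(green) / sum(red)
    return factor, [r * factor for r in red]


# Toy intensities (hypothetical) for three genes on one array.
factor, red_scaled = total_fluorescence_normalize(
    [100.0, 200.0, 300.0],
    [150.0, 300.0, 450.0],
)
```

After rescaling, the totals of the two channels agree, so a per-gene ratio of `red_scaled` to green is centered on one when most genes are not differentially expressed.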
Observe that, for any given array, if all gene expression readings
are multiplied by an arbitrary positive constant, the scale-normalized
readings would be unchanged for any of the Box-Cox transformations.
Thus, the ratios eliminate any scaling factor that cuts across all genes
within any experimental condition.
than green, producing the concave upward pattern in the scatter plot
and smooth fitted function.
Locally weighted regression (LOWESS) (Cleveland 1979, 1981) is a
method for smoothing scatterplots in which the fitted value at $x_k$ is the
value of a line fitted to the data using weighted least squares where the
weight for point $(x_i, y_i)$ is large if $x_i$ is close to $x_k$ and small
if $x_i$ is not close to $x_k$. A robust fitting procedure guards against
outliers distorting the smoothed points13.
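The core step of locally weighted regression can be sketched as below. This is a simplified illustration, not Cleveland's full procedure: it uses a tricube weight and a nearest-neighbor bandwidth, and it omits the robustness iterations that downweight outliers. The function names and the `frac` parameter are assumptions made for the example.

```python
def tricube(u):
    """Tricube weight: near 1 when |u| is small, 0 when |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_linear_fit(x, y, x0, frac=0.5):
    """Fitted value at x0 from a weighted least-squares straight line."""
    k = max(2, int(round(frac * len(x))))
    # Bandwidth: distance from x0 to its k-th nearest data point.
    h = sorted(abs(xi - x0) for xi in x)[k - 1]
    if h == 0.0:
        h = 1.0  # the k nearest points all coincide with x0
    w = [tricube((xi - x0) / h) for xi in x]
    sw = sum(w)
    if sw == 0.0:
        # Every nearby point sits exactly at the bandwidth; use equal weights.
        w = [1.0 if abs(xi - x0) <= h else 0.0 for xi in x]
        sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    if sxx == 0.0:
        return my  # no spread in x among weighted points
    return my + (sxy / sxx) * (x0 - mx)


# Exactly linear toy data: the local fit should reproduce the line.
xs = [float(i) for i in range(10)]
ys = [2.0 * xi + 1.0 for xi in xs]
smoothed = [local_linear_fit(xs, ys, x0) for x0 in xs]
```

Evaluating `local_linear_fit` at each observed x value gives the smooth curve; in a normalization setting the curve would be fitted to the log-ratio as a function of average log-intensity.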
The color asymmetry demonstrated in Figure 6.1 has led to the use of
microarray study designs in which arrays are produced in pairs with the
colors in one array reversed relative to the colors in the second array in
order to compensate for the color differences. These are called reversed-
color designs and are discussed in more detail in Chapter 9. Thus, a
reversed-color experiment copes with the color asymmetry by obtaining
two gene expression readings for each gene under each experimental
condition, one from each color channel.
Note that M captures the differential expression for the two experimental
conditions and A measures the mean intensity (both on the transformed
scale). As M and A are averages over both arrays and colors, the effects
of these two factors are neutralized. Thus, the MA plot in this case will
show a relationship between differential expression and average intensity
that is free of color bias. Figure 6.2 shows this plot for the Mouse
Juvenile Cystic Kidney Data Set. The plot uses the data only from
Arrays 2 and 3 in Table 6.1, as these contain an embedded reversed-
color design. Because the data are normalized, the MA plot has axes
centered on zero.
A LOWESS function has been fitted to the plot. Note its approximate
linearity and the relatively uniform scatter of points about the
smooth function. Also, note the tendency for differential expression to
decline with intensity. The correlation coefficient for M and A across
all genes is -0.51. The figure shows that there is a relationship between
differential expression and average intensity and that it is not solely a
color phenomenon.
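For orientation, the quantities plotted in an MA plot and their correlation can be computed as sketched below. The sketch uses the common single-array definitions, M as the difference and A as the average of base-2 log-intensities; the averaged reversed-color versions used for Figure 6.2 are not reproduced here, and the toy intensities are hypothetical.

```python
import math


def ma_values(red, green):
    """Return (M, A): log-ratio and mean log-intensity for each gene."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    var_u = sum((ui - mu) ** 2 for ui in u)
    var_v = sum((vi - mv) ** 2 for vi in v)
    return cov / math.sqrt(var_u * var_v)


m, a = ma_values([8.0, 16.0, 64.0], [2.0, 16.0, 64.0])
r_ma = pearson(m, a)
```

A clearly negative value of `r_ma`, as in this toy example, corresponds to the pattern noted above in which differential expression tends to decline as average intensity rises.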
pin tip and is the LOWESS function fitted to the data from the
pin tip.
Instead of using the entire set of genes to fit the LOWESS curve for
normalization, Tseng et al.14 (2001) fit the LOWESS curve to a selected
set of rank-invariant genes before conducting nonlinear normalization
(see subsection 6.2.6 for more details). The curve is extrapolated to
genes with the highest and lowest intensities, as these are excluded from
the rank-invariant gene set by definition.
Here and denote the green and red intensity levels of gene
respectively, and denotes the rank of intensity among all
K genes.
Notes
1 Kerr, M.K., Afshari, C.A., Bennett, L., Bushel, P., Martinez, J., Walker, N.J., and
Churchill, G.A. (2002). Statistica Sinica, 12, 203-218.
2 Rocke, D.M., Lorenzato, S. (1995). Technometrics, 37, 176-184.
3 Rocke, D.M., Durbin, B. (2001). Journal of Computational Biology, 8, 557-569.
4 Durbin, B.P., Hardin, J.S., Hawkins, D.M., and Rocke, D.M. (2002). Bioinformatics, 18,
S105-S110.
5 Hawkins, D.M. (2002). Statistics in Medicine, 21, 1913-1935.
6 Geller, S.C., Gregg, J.P., Hagerman P., Rocke, D.M. (2003). http: // handel.cipic.ucdavis.edu/
CA.
11 Dudoit, S., Yang, Y.H., Callow, M.J., Speed, T.P. (2002). Statistica Sinica, 12, 111-139.
12 Cleveland, W.S. (1981). The American Statistician, 35, 54.
13 Cleveland, W.S. (1979). Journal of the American Statistical Association, 74, 829-836.
14 Tseng, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. (2001). Nucleic Acids
MISSING VALUES
IN MICROARRAY DATA
Some researchers follow the practice of flagging readings that are
suspect, and these may be converted to missing values or otherwise excluded
from the analysis before proceeding. For instance, spots with dust
particles, irregularities, or other bad features may be flagged manually.
Spots may be flagged as ‘absent’ or ‘feature not found’ when nothing is
printed in the location of a spot or if the imaging software cannot detect
any fluorescence at the spot. Expression readings that are barely above the
background correction (using a criterion such as less than two
background standard deviations above) may also be flagged.
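The last flagging criterion can be expressed directly in code. The following is a sketch only; the function name, the argument layout, and the default of two standard deviations are assumptions chosen to mirror the example criterion mentioned above.

```python
def flag_weak_spots(intensity, background, background_sd, n_sd=2.0):
    """Flag spots whose intensity is less than n_sd background standard
    deviations above the background estimate."""
    return [
        i < b + n_sd * s
        for i, b, s in zip(intensity, background, background_sd)
    ]


# Toy readings (hypothetical) for three spots.
flags = flag_weak_spots(
    [500.0, 900.0, 1200.0],
    [400.0, 400.0, 400.0],
    [100.0, 300.0, 100.0],
)
```

Flagged readings could then be set to missing before any imputation or analysis step.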
used as the imputed values for the gene. If the nearest neighbors also
have missing values on those conditions, various backup remedies
are applied. The measure of similarity or distance that is employed
can vary; Euclidean distance is commonly used. The SAM software
includes the nearest neighbors method for imputation. It employs
Euclidean distance as the distance metric and uses a row average to
complete the imputation if the nearest neighbor method still leaves
the gene with one or more missing values.
Many variations of this method have been proposed. Some variants
suggest, for example, that the imputation average be weighted by
the similarity of the neighbor to the gene under consideration, with
more similar neighbors being given greater weight. There is also
variation in terms of the number of neighbors to be used. A study
by Troyanskaya et al. (2001) shows that results are adequate and
relatively insensitive to the number of neighbors between 10 and 20.
3. Regression estimate method:
Most commercial statistical software packages have one or more
routines available for dealing with imputation. A common one involves
using fitted regression values to replace missing values. The software
package Stata, for example, offers this routine. The method works
as follows. Suppose the expression reading is missing for a particular
gene under a particular condition. As before, consider the set of
conditions for which the gene has observed values, and the set of genes
having observed data for both the missing condition and the conditions
in that set. Regress the expression levels for the missing condition on
the expression levels for the conditions in that set, using all genes in
the gene set. Use the fitted value from this regression as the imputed
value of the missing reading. The difference between the imputed value
and the true (unobserved) value is the imputation error in this case.
The regression model can be applied to the original expression
intensities as just described or to transformed values, such as
log-intensities. Some checks should be made to assess the validity of
the regression model for the application. For example, a logarithmic
transformation of the observed expression readings may improve the
applicability of the regression model. Moreover, the model must be
chosen so that it does not yield invalid fitted values. For instance, a
regression model that yields negative imputed values for expression
on the original reading scale would be inappropriate.
4. Principal component method:
This method requires that there be at least as many genes as experimental
samples, i.e., G ≥ C, which usually poses no difficulty as most
microarray data sets have G >> C, i.e., the number of genes G is much
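The nearest neighbors imputation described in item 2 above can be sketched as follows. This is a minimal illustration, not the SAM implementation: the function name is hypothetical, missing values are represented by `None`, neighbors are restricted to genes observed on all conditions where the target gene is observed, and the backup remedy is the row average mentioned in the text.

```python
import math


def knn_impute(data, k):
    """Impute missing entries (None) in a genes-by-conditions matrix.

    Each missing value is replaced by the unweighted average of that
    condition over the k nearest genes (Euclidean distance on the
    conditions observed for the target gene). If no neighbor has a
    value for the condition, the gene's own row average is used.
    """
    result = [row[:] for row in data]
    for g, row in enumerate(data):
        missing = [c for c, v in enumerate(row) if v is None]
        if not missing:
            continue
        observed = [c for c, v in enumerate(row) if v is not None]
        candidates = []
        for h, other in enumerate(data):
            if h == g or any(other[c] is None for c in observed):
                continue
            dist = math.sqrt(sum((row[c] - other[c]) ** 2 for c in observed))
            candidates.append((dist, h))
        candidates.sort()
        neighbors = [h for _, h in candidates[:k]]
        for c in missing:
            values = [data[h][c] for h in neighbors if data[h][c] is not None]
            if values:
                result[g][c] = sum(values) / len(values)
            elif observed:
                # Backup remedy: row average of the gene's observed values.
                result[g][c] = sum(row[j] for j in observed) / len(observed)
    return result


# Toy log-intensities (hypothetical): gene 0 lacks its third reading.
toy = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 5.0],
    [9.0, 9.0, 20.0],
]
imputed = knn_impute(toy, k=2)
```

A weighted variant would replace the plain average with one that gives closer neighbors more weight, as some of the proposals mentioned above suggest.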
Notes
1 Little, R.J.A. and Rubin, D.B. (1987). Statistical Analysis with Missing Data, Wiley.
2 Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein,
D., Altman, R.B. (2001). Bioinformatics, 17 (6), 520-525.
3 Chu, G., Narasimhan, B., Tibshirani, R. and Tusher, V. SAM (Significance Analysis of
Microarrays): Users Guide and Technical Document (Version 1.21), Stanford University.
Chapter 8
SATURATED INTENSITY READINGS
We shall use the line plotted at level 15.5 to define the onset of partial
saturation. The numbers of totally saturated, partially saturated, and
unsaturated spots as a function of voltage are tabulated in Table 8.1.
We note for this data set that any spot that is saturated at one voltage
remains saturated at higher voltages.
Figure 8.2 shows the revised scatter plots, corresponding to Figure 8.1,
but now with imputed values replacing intensities exceeding 15.5 on the
logarithmic scale. The method seems to have made a reasonable
adjustment for saturation and produced a roughly linear intensity scale at
each voltage level as expected. The asymptotes at different voltages are
roughly parallel except at the highest voltage, for which the imputation
also happens to involve the greatest extent of extrapolation.
Notes
1 Dudley, A.M., Aach, J., Steffen, M.A., and Church, G.M. (2002). Proceedings of the
National Academy of Sciences, USA, 99, 7554-7559.
2 Naef, F., Socci, N.D., Magnasco, M. (2003). Bioinformatics, 19, 178-184.
3 Naef, F., Lim, D.A., Patil, N. and Magnasco, M. (2002b). Phys. Rev. E., 65, 040902.
4 Roweis, S.T. and Saul, L.K. (2000). Science, 290, 2323-2326.
II
STATISTICAL MODELS
AND ANALYSIS
Chapter 9
EXPERIMENTAL DESIGN
2. Two Factors:
In a two-way analysis, there are two factors involved in the
comparison. The number of combinations of levels in a two-way factor
structure is the product of the numbers of levels of the first and
second factors. For example, if there are two types of mice,
mutant and wild type, involved in the previous experiment, then, in
addition to the factor for toxin exposure (with three levels), a second
factor having two levels (the two types of mice) is also taken into
account in the comparison.
3. Multiple Factors:
When there are more than two factors involved in the experiment,
the multi-way structure is often called a factorial design. A
factorial design consists of several factors, each having its own
number of levels. The total number of combinations is the product of
the numbers of levels of all the factors.
If all combinations are taken into account in the design, it is called
a complete factorial design. For example, in a mouse experiment,
the multiple factors taken into account may include sex, age group,
mouse type, and other characteristics.
In either of the two half-replicates, the main effects of any factor are
orthogonal to those of the other two factors. The main effects of any
factor, however, are confounded with the two-factor interactions not
involving that factor. Hence, if one can assume that the two-factor and
higher-order interaction effects are negligible, the design of each half-
replicate allows the experimenter to estimate independently the main
effects of any of the three main factors (mouse type, age group, and sex).
Using a half-replicate design, therefore, saves half of the experimental
resources.
In experiments where both half-replicates are available, by combining
the two latin-squares, we have a complete factorial design. The benefit
of having both half-replicates is that the experimenters can estimate two-
factor interaction effects. If the two 2 × 2 latin-squares constitute blocks
in the experiment, then the blocks are confounded with the three-factor
interaction effects. In a design with both latin-squares, all the two-factor
interaction effects, in addition to the main effects, can be estimated
independently. Two-factor interaction effects cannot be estimated if
only one latin-square is used repeatedly.
5. Split-plot Design:
A split-plot design is a factorial experiment in which a main effect
is confounded with blocks (the larger experimental units). In this
setting, the blocks are called whole plots and the smaller experimental
units nested within whole plots are called subplots. Let the levels
of a factor, say factor A, be randomly assigned to the whole plots
and the levels of a second factor, say factor B, be randomly assigned
to the subplots within each whole plot. In general, subplots within
a whole plot will be more similar than subplots in different whole
plots. Consequently, within-whole-plot comparisons will generally be
more precise than between-whole-plot comparisons. So the split-plot
design is advantageous if the main effects of factor B and the AB
interactions are of greater interest than the main effects of factor
A alone. Yates9 (1935) showed that when the number of replications
and the experimental conditions are suitable, a split-plot latin
square, which eliminates the error variation arising from two types of
grouping, may be preferable to randomized blocks.
In studying the contributions of sex, genotype, and age to
transcriptional variance in adult fruit flies (Drosophila melanogaster),
Jin et al.10 (2001) conducted an experiment involving two sexes, two
genotypes, and two age groups. Six replications including dye swaps were
made for each combination of two genotypes and two sexes. A total of 24
two-color cDNA microarrays were used. Using a split-plot design,
they directly contrasted the two age groups by always having the
1-week and 6-week adult flies together on the same array block.
The design also includes another set of 12 arrays like these that have
the dyes reversed. The 12 arrays with dyes reversed are not shown.
The experiment thus consists of 24 arrays that allow a split-plot
comparison of age groups, genotypes, and sexes. The combined design
7. Other Designs:
Although experimental designs often have standard forms or standard
structural components, some sophisticated variations have been
developed to deal with special circumstances. For example, where
the main objective is to compare treatment conditions with the control
condition but not to have comparisons among the treatments,
Hedayat and Majumdar (1984) consider a class of balanced test-
treatment incomplete block designs12 that can be optimal for making
treatment-control comparisons.
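The factorial structures discussed in the list above (complete factorial designs and the two half-replicates of a three-factor, two-level design) can be enumerated with a short sketch. The coding of levels as -1 and +1, the use of the three-factor product to split the runs, and the function names are illustrative assumptions; Tables 9.1 and 9.2 themselves are not reproduced here.

```python
from itertools import product


def complete_factorial(levels):
    """All combinations of levels; `levels` is one sequence per factor."""
    return list(product(*levels))


def half_replicates():
    """Split the 2 x 2 x 2 factorial by the sign of the product A*B*C."""
    runs = list(product((-1, 1), repeat=3))
    plus = [r for r in runs if r[0] * r[1] * r[2] == 1]
    minus = [r for r in runs if r[0] * r[1] * r[2] == -1]
    return plus, minus


# Two-way example from the text: three toxin levels by two mouse types.
two_way = complete_factorial([("none", "low", "high"), ("mutant", "wild")])

plus_half, minus_half = half_replicates()
# Within a half-replicate, main effects are orthogonal to each other...
ab_orthogonal = sum(a * b for a, b, _ in plus_half) == 0
# ...but each main effect is aliased with the interaction of the other two.
a_aliased_with_bc = all(a == b * c for a, b, c in plus_half)
```

The aliasing check shows concretely why a single half-replicate estimates main effects only if the two-factor interactions can be assumed negligible, while the two halves together restore the complete factorial.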
contexts and give a sample size formula for these designs. Their general
conclusion is that it is usually not efficient to reverse the dyes for every
individual sample but rather to increase the number of samples keeping
the design balanced with respect to treatment and dye combinations.
is that each sample must be labelled with both the red and green dyes,
which means doubling the number of labelling reactions. Another
drawback is that indirect comparisons may still be required for some pairs
of treatments, as is the case for one pair of treatments in Table 9.13.
in fresh media, the media alone might induce gene expression changes.
To isolate the treatment effect from the media change effect, the media
changes should be included only as reference samples for the same time
points. The design in Table 9.14 cannot isolate the media change because
each sample has its previous time point as reference.
An alternative design might have used the replicate to reverse the
dye order for each comparison. If the experiment is one where the time
course shows small changes over time (like a slow increase) the design
might have a slight disadvantage relative to a reference design if the
variability does not allow statistical verification of small changes.
Notes
1 Fisher, R.A. (1947). The Design of Experiments, Oliver and Boyd, Edinburgh, 4th ed.
2 Comstock, R.E., and Winters, L.M. (1942). Journal of Agricultural Research, 64, 523-532.
3 Kerr, M.K., and Churchill, G.A. (2001b). Biostatistics, 2, 183-201.
4 Cochran, W.G., and Cox, G.M. (1992). Experimental Designs, Wiley, New York.
5 Winer, B.J. (1971). Statistical Principles in Experimental Design, 2nd ed., New York,
McGraw-Hill.
6 Milliken, G.A. and Johnson, D.E. (1992). Analysis of Messy Data: Volume 1, Designed
pages 821-824.
9 Yates, F. (1935). Journal of Royal Statistical Society, Suppl. 2, 181-247.
10 Jin, W., Riley, R.M., Wolfinger, R.D., White, K.P., Passador-Gurgel, G., and Gibson, G.
J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson,
J., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C.,
Weisenburger, D.D., Armitage, J.O., Warnke, R., Levy, R., Wilson, W., Grever, M.R., Byrd, J.C.,
Botstein, D., Brown, P.O., and Staudt, L.M. (2000). Nature, 403, 503-511.
14 Kerr, M.K., Martin, M., and Churchill, G.A. (2001c). Journal of Computational Biology,
7, 819-837.
15 Simon, R., Radmacher, M.D., Dobbin, K. (2002). Genetic Epidemiology, 23, 21-36.
16 DeRisi, J., Iyer, V.R., and Brown, P.O. (1997). Science, 278, 680-686.
17 Dobbin, K., Shih, J.H., Simon, R. (2003). Bioinformatics, 19, 803-810.
18 Björkbacka, H., personal communication, 2003.
Chapter 10
ANOVA MODELS
FOR MICROARRAY DATA
The model contains an overall mean parameter and main effects that can
be considered as normalizing parameters for gene and experimental
condition, respectively. Kerr and Churchill2 (2001) use the agricultural
word variety for the biological specimen, treatment or experimental
condition. The model also contains an interaction parameter for gene and
experimental condition. This interaction term reflects differential
expression for a given gene in a given condition. The last parameter is an
error term. The overall mean is defined so the error term is centered, i.e.,
it has mean zero. Thus, the expected log-intensity equals the sum of the
overall mean, the two main effects, and the interaction parameter.
The error term captures all random variability in gene intensity, whether
its source is background noise or various sources of variability affecting
hybridization. To ensure uniqueness and estimability in the present
context, we require the parameters to sum to zero over their respective
indices.
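To make the role of the sum-to-zero constraints concrete, the sketch below computes the parameter estimates for a complete genes-by-conditions table of log-intensities with one reading per cell. The function name and the toy values are hypothetical. With a single reading per cell, the doubly centered value estimates the interaction together with the error term.

```python
def two_way_estimates(y):
    """Estimates under sum-to-zero constraints for complete data.

    y[g][c] is the log-intensity of gene g under condition c.
    Returns (overall mean, gene effects, condition effects, interactions).
    """
    n_genes, n_conds = len(y), len(y[0])
    grand = sum(sum(row) for row in y) / (n_genes * n_conds)
    gene_mean = [sum(row) / n_conds for row in y]
    cond_mean = [sum(y[g][c] for g in range(n_genes)) / n_genes
                 for c in range(n_conds)]
    gene_eff = [m - grand for m in gene_mean]
    cond_eff = [m - grand for m in cond_mean]
    # Normalizing by both gene and condition leaves the interaction part.
    inter = [[y[g][c] - gene_mean[g] - cond_mean[c] + grand
              for c in range(n_conds)] for g in range(n_genes)]
    return grand, gene_eff, cond_eff, inter


grand, gene_eff, cond_eff, inter = two_way_estimates([[1.0, 3.0],
                                                      [2.0, 6.0]])
```

Each set of estimates sums to zero over its index, and the interaction table sums to zero along every row and every column, which is what makes the decomposition unique.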
Given a gene intensity reading and its logarithmic transformation,
the equations below show that if the data are normalized
by both gene and experimental condition, the resulting normalized data
estimate the interaction parameters for gene and condition, i.e., estimate
differential expression. With complete data, the estimability constraints
imply that the following correspondences exist between the normalizing
means for the data and the parameter estimates, denoted by
and
where
to denote pairwise interaction effects for factors and when they have
their respective levels and with For example, with
L = 3 factors, the parameter where signifies
the interaction parameter for factors 1 and 3 when these two factors
have their levels 5 and 4, respectively. The model can be expanded to
include third- and higher-order interaction terms if needed, as indicated
by the series of dots in (10.5).
are the intensity readings that have been normalized for all L factors.
We note that if the sets of parameters have been estimated using the
sum-to-zero constraint then the normalized readings will be centered on
zero.
stage for all factors except the gene factor. Then, parameter estimates
involving genes are derived in a second-stage analysis where the
estimation proceeds gene by gene.
We consider a simple design to illustrate the two-stage estimation
procedure. We assume for this demonstration that the study contains
G readings on gene expression obtained from C treatment conditions,
that the experiment is replicated R times, and that a separate index
identifies the replicates. We include a replicate main effect to absorb
scale differences in the replicates. The full ANOVA model is
The two-stage approach partitions the full model (10.8) into two
submodels, as follows.
Here the error terms of the first-stage model absorb all of the
gene-specific effects in the model.
Generally speaking, the first stage should be reserved for effects that
are not indexed by gene. It may be quite reasonable, however, to
include selected interactions for pairs of factors (excluding gene) that
impact response. For instance, dye and array may interact and we
can eliminate these interaction effects by using the first-stage ANOVA
as a normalization step.
10.4.1. Example
To demonstrate ANOVA procedures and the two-stage method for
analyzing microarray data, we will consider the Mouse Juvenile Cystic
Kidney Data Set, introduced earlier in Section 6.2.2. The study design
was presented in Table 6.1. We have identified four sources of variation
in this design that we wish to take into account, namely, array, dye
(green or red), tissue type (mutant or wild-type), and gene.
In the first-stage ANOVA, we include all effects in the model that are
not indexed by gene. We shall include only main effects for array,
dye, and tissue type, as follows.
where denotes the fitted value. These residuals constitute the
normalized microarray data. The normalization is such that the respective
sums of the log-intensities across all genes for each array, dye and tissue
type are zero. As an illustration, Table 10.1 shows the eight residuals
(normalized values) for a particular gene in this study. The consistent
pattern of negative residuals suggests that this gene has low expression
levels across all arrays, dyes, and tissue types, relative to the average for
all genes.
The estimates of the tissue-type effect for this gene from the regression
output are 0.03545 and -0.03545 for the mutant (type 1) and wild-type
(type 2) tissues, respectively. The judgment that remains to be made in
this case is whether or not these results suggest that the gene is truly
differentially expressed in the two tissue types. The investigation of this
issue is taken up in the next section. Of course, the kind of results we
have presented for this gene must also be generated for the remaining 1727
genes in this data set. Thus, this second-stage analysis involves
performing 1728 such regression analyses.
where
For the example, Figure 10.1 shows a plot of the statistics for
the 1728 genes. (Recall that A line of identity is also plotted
in the figure. A vertical line appears at the 90th percentile point of the
horizontal scale.
It is evident that the plot is quite linear for at least 90 percent of the
genes. Genes that have large values of that lie well above the line of
identity can be concluded to be differentially expressed. The list of the
dozen genes with the largest values of is identical to that given in Lee
et al. (2002b) as genes that differentiate between mutant and wild-type
tissues. Among the dozen genes, three genes are up-regulated in wild-
type tissue and the remainder are up-regulated in mutant tissue. For
this data set, we have Median(MST) = 0.027630. As
and it follows from (10.24) that the offset parameter is
In model (10.26) all effects are indexed by gene and are assumed to
serve similar roles to those from the normalization model in (10.25),
but at the gene level. The gene-by-array interaction term models
the effects for each spot. In addition to standard stochastic assumptions,
Wolfinger et al. (2001) assume that the random effects are all normally
distributed random variables with zero means and their own variance
components. These random effects are assumed to be independent both
across their indices and with each other. The remaining terms in the
models are assumed to be fixed effects, and thus both models (10.25) and
(10.26) are mixed models. Variance components were estimated by the
method of restricted maximum likelihood (REML). Note that the original
notation has been modified to be consistent with usage in this chapter.
The estimates of primary interest are the gene-by-condition interaction
effects, which measure the effects of treatment conditions for each
gene. Differences between these effects can be tested by using mixed-
model t-tests of all possible pairwise comparisons within a gene. The
degrees of freedom (df) for the t-tests can be set equal to the df for
error from the second-stage ANOVA model. In their article, Wolfinger et al.
also demonstrate how increasing the number of replications can increase
the statistical power of the analysis.
Wernisch, Kendall, Soneji, et al.11 (2003) also apply mixed ANOVA
models to quantify the various sources of error in microarray replicates.
Their model differs from the model of Wolfinger et al. in that Wernisch
et al. introduce common variance components for all genes. Significance
values for differential expression are obtained by a hierarchical
bootstrapping scheme applied to scaled residuals.
The two components, having superscripts (1) and (2), denote the array
and sub-array error components, respectively. The components of the
error term in (10.27) have zero means by definition. They may be
assumed to be independent and possibly also to be normally distributed.
To show the significance of recognizing the separate error components,
note that in comparing gene expression for genes on the same array, only
the variance of the sub-array error component (2) applies. The array
error component (1) is common to both genes because they lie on the same
array and, hence, is canceled in considering the difference.
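The cancellation can be written out explicitly. The notation below is assumed for illustration (it is not taken from (10.27)): $a$ indexes arrays, $g$ and $g'$ index two genes, and $\sigma_1^2$ and $\sigma_2^2$ are the variances of the array and sub-array components, which are taken to be independent.

```latex
% Assumed notation: array component (1) plus sub-array component (2).
\begin{align*}
\varepsilon_{ga} &= \varepsilon^{(1)}_{a} + \varepsilon^{(2)}_{ga}, \\
\varepsilon_{ga} - \varepsilon_{g'a}
  &= \varepsilon^{(2)}_{ga} - \varepsilon^{(2)}_{g'a},
  & \operatorname{Var}\!\left(\varepsilon_{ga} - \varepsilon_{g'a}\right)
  &= 2\sigma_2^2
  && \text{(same array)}, \\
\varepsilon_{ga} - \varepsilon_{g'a'}
  &= \varepsilon^{(1)}_{a} - \varepsilon^{(1)}_{a'}
   + \varepsilon^{(2)}_{ga} - \varepsilon^{(2)}_{g'a'},
  & \operatorname{Var}\!\left(\varepsilon_{ga} - \varepsilon_{g'a'}\right)
  &= 2\left(\sigma_1^2 + \sigma_2^2\right)
  && \text{(different arrays, } a \neq a'\text{)}.
\end{align*}
```

Under these assumptions, within-array comparisons are more precise than between-array comparisons whenever the array component has positive variance.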
The error components in (10.27) assume two levels of nesting. In
general, nesting can involve more than two levels and can be more intricate
in structure, depending on the specifics of the design. For example, in
model (10.3), arrays may be nested within each treatment condition,
which may be a reason to break error component into a sum of two
further components, say,
The experimental setting of model (10.27) is of the split-plot design
type with the arrays being one experimental unit and the spots on arrays
being a nested experimental sub-unit.
As described in section 9.2, Jin et al.12 (2001) investigated the effects
of sex, genotype, and age on transcriptional variance in adult fruit flies.
Having six replications for each combination of two genotypes and two
sexes, their experiment consisted of 24 cDNA arrays with 48 separate
labeling reactions. Their experiment involved a split-plot design such
that both sex and genotype are evaluated at the whole-plot level, while
age and dye are evaluated at the sub-plot level. Because of the split-plot
nature of the design, the error structure of the ANOVA mixed model
corresponds to the array mean square for sex and genotype terms, and
the error mean square for age and dye terms.
Treatment of microarray experiments as split-plot designs was also
considered by Emptage et al.13 (2003). Using the array as the larger
size experimental unit (the whole plot) and the spot on the array as
the smaller size experimental unit (the subplot), they discussed model
equations appropriate to different designs.
Thus, the geometric mean of the log ratios estimates the difference in
treatment effects for the treatment and control conditions in
the study. It is in this precise sense that the
color ratio and ANOVA analyses correspond.
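One elementary identity behind this correspondence is that the average of log-ratios equals the logarithm of the geometric mean of the ratios. The sketch below checks that identity on toy data; the function names and values are hypothetical, and base-2 logarithms are assumed.

```python
import math


def mean_log_ratio(treatment, control):
    """Arithmetic mean of the base-2 log-ratios."""
    logs = [math.log2(t / c) for t, c in zip(treatment, control)]
    return sum(logs) / len(logs)


def log_geometric_mean_ratio(treatment, control):
    """Base-2 log of the geometric mean of the ratios."""
    prod = 1.0
    for t, c in zip(treatment, control):
        prod *= t / c
    return math.log2(prod ** (1.0 / len(treatment)))


avg_log = mean_log_ratio([4.0, 16.0], [1.0, 1.0])
log_gm = log_geometric_mean_ratio([4.0, 16.0], [1.0, 1.0])
```

Because the two quantities agree, averaging log-intensity differences in an ANOVA on the log scale summarizes the same information as a geometric mean of color ratios.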
Given this correspondence, one may ask why the ANOVA method
with fluorescent intensity readings as the outcome variable may be
favored over the use of color ratios. The reason is that the ANOVA
statistical machinery is so standard, familiar, and flexible that it is easy
to incorporate many of the other relevant design elements of microarray
studies into the analysis, including other experimental factors, and to
deal with statistical issues, such as missing values and model
diagnostics.
Notes
1 Kerr, M.K., Martin, M., and Churchill, G.A. (2001c). Journal of Computational Biology,
7, 819-837.
2 Kerr, M.K., Churchill, G.A. (2001b). Biostatistics, 2, 183-201.
3 Lee, M.-L.T., Lu, W., Whitmore, G.A., Beier, D. (2002b). Journal of Biopharmaceutical
http://www-stat.stanford.edu/ ~ tibs/clickwrap/sam
8 Lee, M.-L.T., Bulyk, M.L., Whitmore, G.A., Church, G.M. (2002a). Biometrics, 58,
129-136.
9 Wolfinger, R.D., Gibson, G., Wolfinger, E.D., Bennett, L., Hamadeh, H., Bushel, P.,
Afshari, C., Paules, R. (2001). Journal of Computational Biology, 8, 625-637.
10 Sudarsanam, P., Vishwanath, R.T., Brown, P.O., and Winston, F. (2000). Proceedings of
7, 819-837.
Chapter 11
MULTIPLE TESTING
IN MICROARRAY STUDIES
This framework postulates that for any given gene, there are, in fact,
only two possible situations. Either the gene is not differentially
expressed (null hypothesis $H_0$ is true) or it is differentially expressed
(alternative hypothesis $H_1$ is true). The test declaration or decision is
either that the gene is differentially expressed ($H_0$ rejected) or that it
is not differentially expressed ($H_0$ accepted). Thus, there are four
possible test outcomes for each gene corresponding to the four
combinations of true hypothesis and test declaration.
The total number of genes being tested is G, with $G_1$ and $G_0$ being
the unknown numbers that are truly differentially expressed and not
differentially expressed, respectively. Usually, $G_0$ will be much larger
than $G_1$ and, indeed, in some studies it may be uncertain if any gene is
actually differentially expressed (i.e., it may be uncertain if $G_1 > 0$).
The counts of the four test outcomes are shown by the entries $A_0$,
$A_1$, $R_0$, and $R_1$ in the multiple testing framework. These counts
are random variables in advance of the analysis of the study data. The
counts $A_0$ and $A_1$ are the numbers of true and false negatives (i.e.,
true and false declarations that genes are not differentially expressed).
The counts $R_1$ and $R_0$ are the numbers of true and false positives
(i.e., true and false declarations of genes being differentially expressed).
The totals A and R are the numbers of genes that the study declares are
not differentially expressed ($H_0$ accepted) and are differentially
expressed ($H_0$ rejected), respectively.
We index the genes for which $H_0$ and $H_1$ hold by the sets
$\mathcal{G}_0$ and $\mathcal{G}_1$, respectively. We must remember, of
course, that the memberships of these index sets are unknown because we
do not know in advance if any given gene is differentially expressed or
not. The central problem of multiple testing is to classify the genes into
two sets that match $\mathcal{G}_0$ and $\mathcal{G}_1$ as closely as
possible. The classification should be done in a manner that minimizes
the scientific cost of misclassification, with costs being appropriately
defined.
The test decision for any gene is taken on the basis of a summary
statistic. In different applications discussed in Chapter 14, the summary
statistic may be a standard normal statistic, t statistic, F statistic,
chi-square statistic or other test statistic. Under the null hypothesis
that the gene is not differentially expressed, the statistic
is an outcome from a null probability density function
is defined as the false discovery rate, or FDR for short, and was
proposed by Benjamini and Hochberg (1995) as an error control criterion.
The criterion could call for the FDR to have the specified level $\alpha_F$.
Again, the subscript F is used to remind us that the error probability
applies to the whole family of G tests.
1. Šidák method:
If the G multiple tests being performed in Table 11.1 have independent
test statistics, then their P-values are also
independent. The Šidák method assumes such independence. It also
assumes that the null hypothesis holds for all G genes, i.e., that
$G_0 = G$. These assumptions lead to the following decision rule:
The explanation for the form of (11.7) is that if $H_0$ is true for every
gene and there is to be no false positive then $H_0$ must be accepted for
every gene. Under dependence, the Bonferroni inequality states that
the probability of this event will be no smaller than $1 - G\alpha$. Thus,
setting $\alpha$ equal to $\alpha_F/G$ guarantees that the actual familywise
type I error risk will not exceed $\alpha_F$. In future discussion, we will
treat $\alpha_F$ as the actual risk rather than an upper bound.
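The two single-step rules just discussed can be illustrated with a minimal Python sketch. The function names and the example values (10,000 genes, familywise level 0.05) are illustrative assumptions and are not taken from the text. The Šidák threshold is written in its usual textbook form, which assumes independent tests; the Bonferroni threshold makes no such assumption.

```python
def sidak_threshold(alpha_f: float, n_tests: int) -> float:
    """Per-test level alpha solving 1 - (1 - alpha)**n_tests = alpha_f.

    Assumes the n_tests test statistics are independent.
    """
    return 1.0 - (1.0 - alpha_f) ** (1.0 / n_tests)


def bonferroni_threshold(alpha_f: float, n_tests: int) -> float:
    """Per-test level alpha_f / n_tests; valid under dependence as well."""
    return alpha_f / n_tests


def declare_significant(p_values, threshold):
    """Flag each gene whose P-value is at or below the per-test threshold."""
    return [p <= threshold for p in p_values]


# Hypothetical example: 10,000 genes, familywise type I error risk of 0.05.
n_genes = 10_000
alpha_family = 0.05
print("Sidak per-gene level:     ", sidak_threshold(alpha_family, n_genes))
print("Bonferroni per-gene level:", bonferroni_threshold(alpha_family, n_genes))
```

The Šidák level is always at least as large as the Bonferroni level, so the Bonferroni rule is the slightly more conservative of the two.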
The righthand equality here follows from (11.7). The fact that the
Benjamini and Hochberg prove that their method, under the stated
assumptions, ensures that the false discovery rate is no larger than
$(G_0/G)\,\alpha_F$, where $\alpha_F$ is the specified FDR level. Thus, it
controls the FDR for any number of true null hypotheses (i.e., any number
of genes that are not differentially expressed). Where no genes are
differentially expressed (so $G_0 = G$),
the upper bound becomes the specified FDR requirement.
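As an illustration, the Benjamini and Hochberg step-up rule can be sketched in a few lines of Python. This is the standard published form of the procedure (order the P-values, find the largest rank k whose P-value is at most k times the FDR level divided by the number of tests, and reject the k smallest); the function name, the argument names, and the example P-values are assumptions made for this sketch.

```python
def benjamini_hochberg(p_values, fdr_level):
    """Return a list of booleans, True where the null hypothesis is rejected."""
    m = len(p_values)
    # Gene indices ordered from smallest to largest P-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up search: keep the largest rank that meets its comparison value.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr_level / m:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical P-values for ten genes, tested at an FDR level of 0.05.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p, 0.05))
```

Because the search keeps the largest qualifying rank, a gene can be rejected even when some smaller P-values individually miss their own comparison values.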
The comparison value in the B & H method will typically vary from
as index ranges over which will be a
Notes
1 Benjamini, Y., Hochberg, Y. (1995). Journal of Royal Statistical Society, B 57, 289-300.
2 Hochberg, Y. and Tamhane, A.C. (1987). Multiple Comparison Procedures. John Wiley
and Sons, New York.
3 Shaffer, J.P. (1995). Annual Review of Psychology, 46, 561-584.
4 Wolfinger, R.D., Gibson, G., Wolfinger, E.D., Bennett,L., Hamadeh, H., Bushel, P., Af
and Methods for P-value Adjustment, John Wiley and Sons, New York.
Chapter 12
PERMUTATION TESTS
IN MICROARRAY DATA
between the two groups, where denotes the mean response for group
The asterisk (*) reminds us that is the observed difference in group
means for the actual data gathered in the study. We are inclined to
accept the null hypothesis (postulating identical response patterns)
if is small and to reject if is large.
If $H_0$ is true, so the response patterns under treatment and control
conditions are identical, then each permutation of the response values
can be imagined to be a possible realization of the experimental
study and can be analyzed accordingly. This analysis would yield a
calculated difference in mean response for the two artificial groups created
by the permutation. Denote this calculated difference corresponding to
permutation by
The permutation procedure yields a total of A such differences
where A is given by equation (12.1). One of these permutations corresponds
to the actual pattern of response for the study. If happens to
be that special permutation then i.e., the calculated difference
for that particular permutation will match the observed difference in the
study.
A P-value for a test of hypotheses gauges the consistency of the
statement in the null hypothesis with the statistical evidence. Specifically,
the P-value is the probability under the null hypothesis that the test
statistic would match or be less consistent with $H_0$ than the actual
statistic observed in the study. P-values are referred to as one- or
two-sided according to whether the null hypothesis is one- or two-sided. In
a permutation test, therefore, if the null hypothesis were true then the
P-value is the fraction of the A calculated differences that are greater
than or equal to the observed difference in absolute value, i.e.,
The smallest possible P-value occurs when the observed difference takes
the most extreme value among all possible eligible permutations. In this
situation, the P-value equals 1/A (or a larger value if two or more
permutations match this most extreme outcome). Thus, 1/A is the smallest
P-value that can be given by a single permutation test. In a case where
A = 462, for instance, the smallest possible P-value would
be 1/462 = 0.0022. This technical observation is important because it
shows that the permutation test cannot signal a significant difference,
i.e., a small P-value, unless a reasonably large number of permutations
are possible. This implies that the two group sizes must be large enough
to make A large. For example, in a case like that in Table 12.1, where
A = 10, the smallest possible P-value would be
1/10 = 0.10, which would not constitute strong evidence against a null
hypothesis.
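The permutation P-value described above can be sketched in Python as follows. The sketch enumerates every way of choosing the treatment group from the pooled responses, computes the difference in group means for each, and reports the fraction of differences at least as large in absolute value as the observed one. The function name and the five response values are hypothetical; only the counts A = 10 and A = 462 come from the text.

```python
from itertools import combinations
from math import comb


def permutation_p_value(treatment, control):
    """Exact two-sided permutation P-value for a difference in group means.

    Returns (p_value, number_of_permutations).
    """
    pooled = list(treatment) + list(control)
    n1, n = len(treatment), len(pooled)
    total = sum(pooled)

    def mean_difference(sum_group1):
        # Mean of the artificial treatment group minus mean of the remainder.
        return sum_group1 / n1 - (total - sum_group1) / (n - n1)

    observed = mean_difference(sum(treatment))
    n_perm = 0
    n_extreme = 0
    for idx in combinations(range(n), n1):
        d = mean_difference(sum(pooled[i] for i in idx))
        n_perm += 1
        # Small tolerance guards against floating-point ties.
        if abs(d) >= abs(observed) - 1e-12:
            n_extreme += 1
    return n_extreme / n_perm, n_perm


# Hypothetical data with group sizes 2 and 3, so that A = C(5, 2) = 10.
p_value, n_perm = permutation_p_value([10.0, 11.0], [1.0, 2.0, 3.0])
print(n_perm, p_value)        # the observed split is the most extreme one
print(comb(11, 5))            # group sizes 5 and 6 give A = 462
```

Full enumeration is practical only for small groups; with larger groups one would typically sample permutations at random instead.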
The permutation test for any single gene, say gene yields an observed
test statistic as well as a set of test statistics for the permutations.
We now use symbol for the generic form of our test statistic and B for
the number of permutations. We denote the observed test statistic by
and the B permutation test statistics by The number
Stage 2:
Figure 12.1 and Figure 12.2 show the SAM plot and SAM output for
this illustration. Several adjustments and specifications were required
to prepare the data for analysis in SAM. First, the data were treated
as two-class data, blocked by mouse. Second, the logarithms of the
expression intensity data were transformed to base 2. Third, all settings
were given default values. The SAM parameter $\Delta$, which controls the
rejection region for the test of each gene, was chosen so that 35 genes
were identified as significant (29 positive, 6 negative).
This specification for $\Delta$ was somewhat subjective, and other values
could be chosen to produce either a longer or shorter list of significant
genes.
Notes
1 Ludbrook, J. and Dudley, H. (1998). The American Statistician, 52, 127-132.
2 Good, P. (2000). Permutation Tests: A Practical Guide to Resampling Methods for
Testing Hypotheses, 2nd edition, Springer, New York.
3 Chu, G., Narasimhan, B., Tibshirani, R., Tusher, V. (2002). Significance Analysis of
Microarrays (SAM): Users Guide and Technical Document, version 1.21,
http://www-stat.stanford.edu/~tibs/clickwrap/sam
4 Dudley, A.M., Aach, J., Steffen, M.A., and Church, G.M. (2002). Proceedings of the
National Academy of Sciences, USA, 99, 7554-7559.
Chapter 13
BAYESIAN METHODS
FOR MICROARRAY DATA
Different forms of mixture model (13.1) arise when the raw expression
data are monotonically transformed. If the transformation is
then in (13.1) is replaced by For example, may denote the
logarithm of the raw intensity measurement, i.e., In mathematical
terms, if is the p.d.f. of then and are related
by
Changing notation for the p.d.f.s with each change of variable is
potentially confusing so, for expository convenience, the notation
and is used as generic notation, whichever variable may be under
consideration. The context will make the functional form of the p.d.f.s
clear.
Different families of distributions have been proposed for the
component p.d.f.s of the mixture model, although
the settings in which they have been presented differ somewhat from
the one used here. The idea of the mixture model is sometimes
explicitly developed in these articles and at other times is only implicit in
the presentation. Lee et al. (2000), for example, present an
illustration of mixture model (13.1) for log-intensities in a simple microarray
experiment. They model the component distributions
as normal densities and compute empirical Bayes
estimates of the model parameters. Baldi and Long3 (2001) also model
log-expression data using the normal distribution family. Newton et al.4
(2001) have used the gamma distribution family as a model for gene
expression data and explicitly develop the mixture model. Ibrahim et
al.5 (2002) use the lognormal distribution family as a model for
expression data. Rocke and Durbin6 (2001) use a hybrid model in which the
two components of the additive noise model are assumed to
be independent lognormal and normal random variables, respectively.
Efron et al. (2001) propose a non-parametric approach to modeling the
components of the mixture model.
It was noted already that the background noise p.d.f. is the most
accessible. The next subsection gives an illustration of estimating the
background noise p.d.f. as a gamma distribution. In later
sections, we include other case illustrations to show how the components of
mixture models are estimated from microarray data.
Observe that multiplicative color and spot effects cancel out in measure
where and
Observe that statistic is the ratio of the total expression for treatment
1 to the total expression for treatment 0 (the control) for the two array
spots. The totals are denoted by and respectively, in (13.6).
Under appropriate assumptions, the ratio of the two totals in (13.6)
can be modeled as a ratio of two independent gamma random variables
with a common shape parameter. The requisite assumptions are: (1)
the four readings are independent gamma random variables, (2) the two
numerator readings share a common scale parameter, as do the
two denominator readings, and (3) the color and spot effects do
not alter the scale parameter but are additive in determining the shape
parameter. With these assumptions, the numerator and denominator
totals are independent gamma random variables with a common
shape parameter and their respective scale parameters.
The differential expression parameter is
in this context. Under the preceding conditions, the Newton et al.
hierarchical Bayes development will continue to apply. This development is
taken up in the next section.
The gamma model introduces a technical element that ties in with
much earlier discussion related to data normalization. In the differential
expression setting, the statistic estimates the true differential expression
parameter It has been assumed implicitly in the modeling that
statistic encapsulates all information in the data about the unknown
parameter i.e., that is a sufficient statistic for In the log-normal
model for the color-ratio statistic (13.4), statistic is sufficient. In the
gamma model for the color-ratio statistic (13.6), statistic is
not sufficient. It is found that the product statistic
where and
This difference is none other than the logarithm of the geometric mean
where and denote the events that the gene is not differentially
expressed and differentially expressed, respectively.
The posterior probabilities give the relative evidence in favor of a gene
being differentially expressed or not. The probabilities remind us that
establishing the true situation for a gene is not a certainty. A posterior
probability of 0.5 states that the odds are even that the
gene in question is differentially expressed while a posterior probability
of 0.999 represents odds of 999 to 1 in favor of differential
expression.
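A short Python sketch may help connect posterior probabilities and odds. It applies Bayes theorem to a generic two-component mixture and converts a probability to odds. The function names, the prior probability of 0.2, and the choice of normal component densities are assumptions made purely for illustration; they are not the models fitted in this chapter.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution, used here only as an example component."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_differential(x, prior_diff, f_null, f_diff):
    """Posterior probability that a gene is differentially expressed given x.

    prior_diff is the prior probability of differential expression;
    f_null and f_diff are the component density functions.
    """
    numerator = prior_diff * f_diff(x)
    denominator = (1.0 - prior_diff) * f_null(x) + numerator
    return numerator / denominator


def posterior_odds(probability):
    """Odds in favor of an event that has the given probability."""
    return probability / (1.0 - probability)


# Hypothetical components: null N(0, 1) and differentially expressed N(3, 1).
f0 = lambda x: normal_pdf(x, 0.0, 1.0)
f1 = lambda x: normal_pdf(x, 3.0, 1.0)
for x in (0.0, 1.5, 3.0):
    prob = posterior_differential(x, 0.2, f0, f1)
    print(x, round(prob, 4), round(posterior_odds(prob), 4))
```

In this sketch a statistic near the null mean yields a posterior probability close to zero, while a statistic near the alternative mean yields one close to one.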
The posterior distribution of given may be written down directly
from the mixture model, as follows.
Under quite general conditions, will tend to be a value that lies between
and the modal value of and, hence, is a shrinkage estimator.
The discussion in the previous section pointed out a case where the
statistic is not a sufficient statistic for Where that is the case, a
refinement can be made in the posterior probability distribution for in
(13.12). We will not pursue that refinement here.
The component probabilities and density functions
in (13.11) may depend on unspecified parameters, in which
case the posterior probabilities are, in fact, conditional
posterior probabilities because they are conditional on the unspecified
parameters. A full Bayesian approach would require a specification of a joint
prior distribution for these unknown parameters in order to eliminate
their conditionality. The elicitation of appropriate prior distribution
forms is a challenging aspect of the Bayesian approach. Ibrahim et al.
(2002), for example, discuss this issue in a microarray context. Another
approach to dealing with any unspecified parameters is the empirical
Bayes approach that is described in the next section.
Table 13.1 shows the parameters and their estimates. One unusual feature
arises in this application. A single gene (gene ) is outlying in
the negative domain, having a difference statistic of In
maximizing the log-likelihood function, it is found that the routine attempts
to fit a degenerate density function to this single observation. We
therefore take to be degenerate and give it parameter estimates
consistent with this degeneracy (specifically, zero variance). To estimate
the remaining parameters for the model, the p.d.f. is dropped as
a component of in the likelihood function (13.15). The outlying
observation (gene ) is also dropped from the likelihood calculation.
The parameter estimates suggest that about 18 percent of the 13,028
genes in the study are differentially expressed and virtually all of these
values of 1.0000 in the table are rounded and do not represent certainty.
How well does the fitted normal-Weibull mixture model fit the ob
Notes
1 Lee, M.-L.T., Kuo, F.C., Whitmore, G.A., Sklar, J. (2000). Proceedings of the National
Academy of Sciences, USA, 97, 9834-9839.
2 Efron, B., Tibshirani, R., Storey, J.D., Tusher, V. (2001). Journal of American Statistical
Methods and Software, Parmigiani, G., Garrett, E.S., Irizarray, R.A., and Zeger, S.L., eds.
255-271. Springer, New York.
17 Newton, M.A., Noueiry, A., Sarkar, D., Ahlquist, P. (2003). Technical Report No. 1074,
Chapter 14
statistical test.
Later, we shall study two particular types of functions
The variable is a random variable for gene that will have some
realization in the microarray study. Under null hypothesis $H_0$, summary
measure has a probability density function (p.d.f.) that we
denote by Similarly, under the alternative hypothesis $H_1$, summary
measure has a p.d.f. that we denote by We shall show
that it is the statistical distance between these two density functions, in
a precise sense, that defines the level of power for a microarray study.
POWER AND SAMPLE SIZE CONSIDERATIONS 197
for any single gene in the index set under the decision rule. Thus,
For example, if the familywise type I error probability $\alpha_F$ is 0.20
and $G_0$ is large, the Poisson mean is
$E(R_0) = -\ln(1 - 0.20) = 0.223$. In this case, the probability
of experiencing no false positive is exp(–0.223) = 0.80. The
probability of exactly one false positive is 0.223exp(–0.223) = 0.223(0.80) =
0.18. The probability of experiencing two or more false positives is
therefore 0.02. Because of the direct connection between $\alpha_F$ and the
mean $E(R_0)$ in this case, either value may be used to specify the desired
control over the familywise type I error risk.
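The arithmetic in this example can be checked with a brief Python sketch. It assumes, as in the paragraph above, that the number of false positives is approximately Poisson, so that the probability of no false positive equals the exponential of minus the Poisson mean. The function names are illustrative assumptions.

```python
import math


def poisson_mean_from_alpha(alpha_f):
    """Poisson mean number of false positives implied by a familywise risk."""
    return -math.log(1.0 - alpha_f)


def alpha_from_poisson_mean(mean_false_positives):
    """Familywise type I error risk implied by a mean number of false positives."""
    return 1.0 - math.exp(-mean_false_positives)


def poisson_pmf(k, mean):
    """Probability of exactly k events for a Poisson variable with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


mean_fp = poisson_mean_from_alpha(0.20)
p_none = poisson_pmf(0, mean_fp)
p_one = poisson_pmf(1, mean_fp)
p_two_or_more = 1.0 - p_none - p_one
print(round(mean_fp, 3), round(p_none, 2), round(p_one, 2), round(p_two_or_more, 2))
```

The second function runs the relation in the other direction, which is convenient when an investigator prefers to specify a tolerable mean number of false positives rather than a familywise risk.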
As another example, if an investigator feels that expecting 2.5 false
positives is tolerable, then this specification implies that $E(R_0) = 2.5$
and, hence, a familywise type I error probability of
$\alpha_F = 1 - \exp(-2.5) = 0.918$. This value may appear very high. The
illustration reminds us, however, that a large value of $\alpha_F$ may be
reasonable in microarray studies where a few false positives among
thousands of genes must be tolerated in order to avoid many false negatives
(i.e., to avoid missing many differentially expressed genes). The design of
a microarray study involves a careful balancing of costs of false positives and
false negatives. The connection between and in this last example
is
Substitution of (14.14) into (14.11) gives the following implied value for
the familywise type I error probability for this rule.
The mean number of false positives for this rule is approximately i.e.,
Although the form of (14.14) is motivated by the theory of
order statistics in which is a whole number, (14.14) and (14.15) can
be used with fractional values of
(14.10) may be defined by specifying the type I error probability for any
gene to be
That is, for the Bonferroni procedure, the individual error rate $\alpha$ is
defined as the desired familywise error rate $\alpha_F$ divided by the total
number $G_0$ of genes having no differential expression. This definition of
the acceptance interval guarantees that the following inequality holds for
the familywise type I error probability.
It can be seen that the expected number of false positives equals the
familywise type I error probability in this case. Thus, necessarily, the
expected number $E(R_0)$ cannot exceed one (although the actual number
$R_0$ is not so constrained).
Unlike the independence approach discussed in the preceding section,
there is no direct link between the probability distribution for the
number of false positives and the familywise type I error probability
under the Bonferroni approach. The Bonferroni procedure controls the
chance of incurring one or more false positives but provides no
probability statement about how many false positives may be present if some
do occur (i.e., the approximate Poisson distribution does not apply).
(1) The familywise type II error probability (or, equivalently, one
minus the familywise power level).
Hence,
where
is a vector of design-related coefficients specified
by the investigator.
Examples of such linear combinations include any
single differential expression estimate, say or any difference of such
estimates, say
Frequently the linear combination of interest
We see that this measure implicitly takes account of all differential
expression estimates and, hence, is responding to differential expression in
any of the C experimental conditions in the study. Statistic in (14.30)
As before, the first step is the most difficult because it requires some
knowledge of the inherent variability of the data in the planned
microarray study, which depends on the experimental error in the scientific
process, the experimental design, and the number of replicates of the
design used in the study.
This model simplifies reality in two respects. First, it assumes that the
prior probabilities are the same for all genes, although this assumption
can be relaxed easily. Second, it assumes that if a gene is differentially
expressed then it is expressed at the level specified in the alternative
hypothesis
From Bayes theorem, the posterior probabilities for any gene having
summary statistic can be calculated from the components of the
mixture model (14.32) as follows.
where there are coefficients of each sign. Thus, from (14.29) and
(14.28), we have
Table 14.3 gives the sample size of the treatment and control groups
required to achieve a specified individual power level for the
experimental design we have just described. A more extensive version of
the table is provided in Appendix A. The table is entered based on the
specified mean number of false positives ratio anticipated
number of undifferentially expressed genes and desired individual
power level If is expected to be similar to the total gene
count G, the table could be entered using G without introducing great
error. To conserve space, only two individual power levels are offered
in this illustrative table, namely, 0.90 and 0.99. The sample size shown
in the table is the smallest whole number that will yield the specified
power. The total number of experimental conditions C is double the
entry in the table, i.e., An examination of Table 14.3 shows
that the required sample size is most sensitive to the ratio and
the required power level and least sensitive to the mean number of false
positives The required sample size is also moderately sensitive
to the number of undifferentially expressed genes because of the
effect of controlling for simultaneous inferences. The practical lesson to
be drawn from this last observation is that the gene set should be
kept as small as possible, consistent with the scientific objective of the
microarray study. Inclusion of superfluous genes in the analysis, possibly
for reasons of data exploration or data mining, will have a cost in terms
of power loss. Of course, housekeeping genes and genes included on the
arrays as positive controls may be used for diagnostic and quality-control
checks but do not enter the main analysis. Such monitoring genes should
not be counted in the number used in power calculations.
A formula for the sample size in this setting has the following simple
form:
will be Observe
that the up-regulation and down-regulation have been specified in a
convenient form so the interaction parameters sum to zero. If for gene
expression for littermates receiving the same treatment is anticipated to
be 0.40 on a log-2 scale and blocks are to be used then the
noncentrality parameter in (14.40) equals
Observe that these parameter values sum to zero as required by the in
teraction sum constraint. The differential expression in question may be
either an up- or down-regulation, depending on the sign of the
difference. The non-centrality parameter (14.40) for this pattern of gene
expression has the following form.
Solving for gives i.e., eight blocks are required in the randomized
block design where an isolated effect is anticipated.
Under the Šidák approach, in which estimated differential expression
vectors are assumed to be mutually independent across genes, the
familywise power level and expected number of true positives
can be calculated from using (14.22) and (14.23). Under the
Bonferroni approach, in which estimated differential expression vectors
may be dependent across genes, a lower bound on the familywise
Notes
1 Simon, R., Radmacher, M.D., Dobbin, K. (2002). Genetic Epidemiology, 23, 21-36.
2 Lee, M.-L.T., Whitmore, G.A. (2002c). Statistics in Medicine, 21, 3543-3570.
3 Benjamini, Y., Hochberg, Y. (1995). Journal of Royal Statistical Society, B 57, 289-300.
4 Hochberg, Y. and Tamhane, A.C. (1987). Multiple Comparison Procedures. John Wiley
and Sons, New York.
5 Dudoit, S., Yang, Y.H., Callow, M.J., Speed, T.P. (2002). Statistica Sinica, 12, 111-139.
6 Lee, M.-L.T., Kuo, F.C., Whitmore, G.A., Sklar, J. (2000). Proceedings of the National
Academy of Sciences, 97, 9834-9839.
7 Efron, B., Tibshirani, R., Storey, J.D., Tusher, V. (2001). Journal of American Statistical
Association, 96, 1151-1160.
8 Lee, M.-L.T., Whitmore, G.A., Yukhananov, R.Y. (2003). Journal of Data Science, 1,
103-121.
III
UNSUPERVISED
EXPLORATORY ANALYSIS
Chapter 15
CLUSTER ANALYSIS
Depending upon the objective of the research study, interest may focus
on either finding clusters of genes having similar expression patterns
across specimen samples or finding clusters of specimen samples sharing
similar expression patterns across the gene set.
In order to cluster genes, each gene can be represented by a row
vector across N specimen samples (cell lines or experiment conditions).
Specifically, let the row vector
With appropriate care for technical considerations, one can work
readily with either kind of proximity matrix. As an example of a technical
consideration, Gower9 (1967) has shown that one can only construct
a proper distance measure from a similarity measure if the similarity
matrix S is nonnegative definite.
The following are three special cases of Minkowski measures that find
application in clustering.
Euclidean distance
City-block distance
Maximum distance
Dunn and Everitt10 (1982) used the city block distance measure as
the natural distance measure in comparing amino acids in homologous
proteins.
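The three special cases named above can be written out in a small Python sketch. The general Minkowski form used here is the standard one, in which the absolute coordinate differences are raised to a power p, summed, and the p-th root is taken; p = 2 gives the Euclidean distance, p = 1 the city-block distance, and the limit as p grows gives the maximum distance. The function names and the two example vectors are assumptions for illustration.

```python
import math


def minkowski_distance(x, y, p):
    """General Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def euclidean_distance(x, y):
    """Minkowski distance with p = 2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def city_block_distance(x, y):
    """Minkowski distance with p = 1 (sum of absolute differences)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def maximum_distance(x, y):
    """Limiting Minkowski distance as p grows (largest absolute difference)."""
    return max(abs(a - b) for a, b in zip(x, y))


# Hypothetical expression profiles for two genes across two samples.
gene_a = [0.0, 0.0]
gene_b = [3.0, 4.0]
print(euclidean_distance(gene_a, gene_b))
print(city_block_distance(gene_a, gene_b))
print(maximum_distance(gene_a, gene_b))
```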
This distance measure is symmetric and nonnegative and has the prop
and thereby reduces the number of clusters by one in each step. In
contrast, a divisive clustering method begins with one cluster containing all
G genes and successively splits the least homogeneous cluster into two
successor clusters that are each more uniform than the parent cluster.
The splitting can continue until G singleton clusters (individual genes)
are formed. In order to have a solution with an ‘optimal’ number of
clusters, the investigator will need to decide on a particular stage at
which to stop the iterative procedure.
Listed below are some commonly used linkage methods which allow
us to specify the type of joining algorithm used to amalgamate clusters.
Results of hierarchical methods can be shown in a tree diagram, known
as a dendrogram. It must be borne in mind that using different linkage
methods or encountering small changes in the data set can lead to very
different dendrograms.
the ordering can affect the final configuration of clusters. Eisen et al.
(1998) apply this hierarchical clustering method to the analysis of gene
expression data.
The centroid linkage method uses the average value of all points in a
cluster (i.e., the cluster centroid) as the reference point for distances to
other points or clusters. The distance between two clusters is defined as
the Euclidean distance between the centroids of the cluster pair. The
process proceeds by combining clusters according to the distance
between their centroids, the clusters with the shortest distance being
combined first. A disadvantage of the centroid method is that if the sizes of
the two clusters to be considered are very different, then the centroid of
the new cluster will be very close to that of the larger cluster.
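The centroid linkage rule can be sketched in Python as a simple agglomerative loop. This is a minimal illustration rather than an efficient implementation: it starts from singleton clusters, repeatedly merges the two clusters whose centroids are closest in Euclidean distance, and stops at a requested number of clusters. The function names, the stopping rule, and the example points are assumptions for this sketch.

```python
import math


def centroid(points):
    """Coordinate-wise average of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def centroid_linkage(points, n_clusters):
    """Agglomerative clustering that merges the pair with the closest centroids."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i and drop cluster j.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical two-dimensional profiles forming two well-separated groups.
data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(centroid_linkage(data, 2))

# The disadvantage noted above: nine points at 0 merged with one point at 10
# give a centroid that sits close to the larger cluster.
print(centroid([(0.0,)] * 9 + [(10.0,)]))
```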
The median linkage method uses the median distances between pairs
of points in different clusters as the inter-cluster distance measure. See
Gower (1967) for a discussion of median clustering methods.
15.5.7. Applications
Notes
1 Jain, A.K. and Dubes, R.C. (1988). Algorithms for Clustering Data, Prentice Hall,
Englewood Cliffs, New Jersey.
2 Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data, John Wiley and Sons,
New York.
3 Afifi, A.A. and Clark, V. (1990). Computer-aided Multivariate Analysis. 2nd edition,
Chapman and Hall, New York.
4 Jobson, J.D. (1992). Applied Multivariate Data Analysis, Springer-Verlag, New York.
5 Everitt, B.S. (1993). Cluster Analysis. Edward Arnold, New York.
6 Johnson, R.A., Wichern, D.W. (1998). Applied Multivariate Statistical Analysis, 4th edi
and C.B. Read, eds.), John Wiley & Sons, New York.
9 Gower, J.C. (1967). Biometrics, 23, 623-628.
10 Dunn, G. and Everitt, B.S. (1982). An Introduction to Mathematical Taxonomy, Cam
if
The 0s are zero matrices of suitable dimensions. Note that both matrices
and are nonnegative definite and symmetric and that they
have the same nonzero eigenvalues, which are all positive. Diagonal
Principal Components and Singular Value Decomposition 255
Then, one projects the resulting matrices V and D onto X and obtains
data. Hence, the strength of their method is that it does not require
complete information and is not affected by a minority of outliers.
Notes
1 Pearson, K. (1901). Phil. Mag. (6), 2, 559-572.
2 Hotelling, H. (1933). Journal of Educational Psychology, 24, 417-441.
3 Anderson, T.W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New
York.
4 Seal, H. (1964). Multivariate Statistical Analysis for Biologists, New York, Wiley.
5 Morrison, D.F. (1976). Multivariate Statistical Methods, 2nd ed., New York, McGraw-Hill.
6 Jolliffe, I.T. (2002). Principal Component Analysis, 2nd ed., Springer-Verlag, New York.
7 Flury, B. (1988). Common Principal Components and Related Multivariate Models, Wiley,
New York.
8 Krzanowski, W.J. (1988). Principles of Multivariate Analysis: A User’s Perspective,
Oxford University Press, Oxford.
9 Dunteman, G.H. (1989), Principal Component Analysis, Sage University Papers, Sage,
York.
11 Jobson, J.D. (1992). Applied Multivariate Data Analysis, Springer-Verlag, New York.
12 Johnson, R.A., Wichern, D.W. (1998). Applied Multivariate Statistical Analysis, 4th ed.,
Prentice-Hall, Inc., New Jersey.
13 Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F.,
Schwab, M., Antonescu, C.R., Peterson, C., and Meltzer, P.S. (2001). Nature Medicine,
7, 673-679.
14 Searle, S.R. (1982), Matrix Algebra Useful for Statistics, John Wiley and Sons, pages
316-317.
15 Alter, O., Brown, P.O. and Botstein, D. (2000), Proceedings of the National Academy of
13, 23-36.
19 Liu, L., Hawkins, D.M., Ghosh, S., Young, S.S. (2003). Proceedings of the National
Academy of Sciences, USA, 100, 13167-13172.
Chapter 17
SELF-ORGANIZING MAPS
Figure 17.1 shows a schematic that illustrates these elements for the
case of a one-dimensional map (i.e., an arrangement of reference vectors
17.5. Applications
17.5.1. Using SOM to Cluster Genes
38 initial leukemia samples into two classes on the basis of the
expression pattern of all 6817 genes. The SOM was constructed using the
GENECLUSTER software, with a variation filter excluding genes with
less than five-fold variation across the collection of samples. By
comparing the clusters obtained by SOM to the known AML-ALL classes,
the evaluation results show that the SOM paralleled the known classes
closely: one class contained mostly ALL (24 of 25 samples), and the other
class contained mostly AML (10 of 13 samples). The SOM was thus quite
effective in discovering the two types of leukemia.
Notes
1 Kohonen, T. (1997). Self-Organizing Maps, Springer, Berlin.
2 Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dimitrovsky, E., Lander,
E.S., Golub, T.R. (1999). Proceedings of the National Academy of Sciences, USA, 96,
2907-2912.
3 Golub, T.R., Slonim, D.K., Tamayo, P., et al. (1999). Science, 286, 531-537.
4 Ramaswamy, S., Tamayo, P., Rifkin, R. et al. (2001). Proceedings of the National Academy
of Sciences, USA, 98, 15149-15154.
5 Eisen, M., Spellman, P.T., Brown, P.O. and Botstein, D. (1998). Proceedings of the National
Academy of Sciences, USA, 95, 14863-14868.
IV
SUPERVISED
LEARNING METHODS
Chapter 18
DISCRIMINATION AND
CLASSIFICATION
In practice, mean
can be replaced by the sample mean for group
The covariance matrix can also be replaced by the sample
covariance matrix S pooled from both groups and An observation
in the test set is then allocated to group if
of a given training set and a chosen distance measure for pairs of observations,
such as the Euclidean distance or correlation coefficient, the
k-nearest neighbor method classifies an observation in the test set by
the following steps:
(1) Select an integer k.
(2) Find the k closest observations in the training set.
(3) Classify this test observation using majority vote; that is, choose
the class that is most common among these k neighbors in the training
set.
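The three steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the toy two-gene expression values, and the class labels are all hypothetical, and Euclidean distance is used as the default distance measure.

```python
from collections import Counter
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train_x, train_y, query, k=3, distance=euclidean):
    """Classify `query` by majority vote among its k nearest training observations."""
    # Step (2): rank training observations by distance to the query, keep the k closest.
    ranked = sorted(zip(train_x, train_y), key=lambda pair: distance(pair[0], query))
    neighbor_labels = [label for _, label in ranked[:k]]
    # Step (3): the most common class among the k neighbors wins.
    return Counter(neighbor_labels).most_common(1)[0][0]


# Hypothetical training set: two genes measured on six samples.
train_x = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
train_y = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

# Step (1): choose k, here k = 3.
print(knn_classify(train_x, train_y, (1.0, 1.0), k=3))
print(knn_classify(train_x, train_y, (3.0, 3.0), k=3))
```

With two classes, an odd k avoids tied votes; in this sketch a tie would be broken in favor of the class encountered first among the neighbors.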
Note that the form of the vote is very similar to the special case of the linear
maximum likelihood discriminant rule described in (18.11), except that
the sum of standard deviations is used in the denominator
of the weight instead of the variance. A positive value indicates a
vote for class 1 and a negative value indicates a vote for class 2. The total
vote for class 1 is obtained by summing the absolute
values of the positive votes over the informative genes. Analogously, the
total vote for class 2 is obtained by summing the
absolute values of the negative votes. The sample is assigned to the class
with the higher vote total, provided that the prediction strength exceeds
a predetermined threshold.
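The voting scheme can be sketched as follows. The sketch assumes the form given by Golub et al. (1999): each gene's weight is the difference of the class means divided by the sum of the class standard deviations, each gene's vote is the weight times the distance of the new expression level from the midpoint of the class means, and the prediction strength is the margin of victory divided by the total of both vote sums. The function names and the toy data are hypothetical.

```python
import statistics


def fit_weighted_voting(class1, class2):
    """Return (weight, boundary) per gene from training samples of each class.

    class1, class2: lists of samples; each sample lists the expression
    levels of the informative genes in the same order.
    """
    params = []
    for g in range(len(class1[0])):
        x1 = [sample[g] for sample in class1]
        x2 = [sample[g] for sample in class2]
        m1, m2 = statistics.mean(x1), statistics.mean(x2)
        s1, s2 = statistics.stdev(x1), statistics.stdev(x2)
        weight = (m1 - m2) / (s1 + s2)   # sum of standard deviations in the denominator
        boundary = (m1 + m2) / 2         # midpoint between the two class means
        params.append((weight, boundary))
    return params


def predict(params, sample, threshold=0.3):
    """Return (label, prediction_strength); label is 1, 2, or 'uncertain'."""
    votes = [w * (x - b) for (w, b), x in zip(params, sample)]
    v1 = sum(v for v in votes if v > 0)    # total vote for class 1
    v2 = sum(-v for v in votes if v < 0)   # total vote for class 2 (absolute values)
    total = v1 + v2
    if total == 0:
        return "uncertain", 0.0
    strength = abs(v1 - v2) / total
    if strength <= threshold:
        return "uncertain", strength
    return (1 if v1 > v2 else 2), strength


# Hypothetical training data: two informative genes, three samples per class.
class1 = [[5.0, 1.0], [6.0, 2.0], [5.5, 1.5]]
class2 = [[1.0, 5.0], [2.0, 6.0], [1.5, 5.5]]
params = fit_weighted_voting(class1, class2)

print(predict(params, [5.2, 1.1]))   # both genes vote for class 1
print(predict(params, [4.0, 4.2]))   # genes disagree, strength below the threshold
```

The default threshold of 0.3 mirrors the value reported for the leukemia analysis; any other value can be passed in.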
The gene-casting weighted models can be evaluated by leave-one-out
cross validation with the training set being used to predict the class of
a randomly withheld specimen. This procedure can be repeated for all
specimens and the cumulative error rate recorded. Thereafter, the total
number of prediction errors in cross-validation can be calculated and a
final model chosen which minimizes cross-validation error.
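Leave-one-out cross-validation as described above can be sketched generically. The sketch below withholds each specimen in turn, fits on the rest, and counts prediction errors. A nearest-centroid classifier is swapped in purely as a simple stand-in for the weighted voting model, and the data, including one deliberately mislabeled specimen, are hypothetical.

```python
def nearest_centroid_fit(xs, ys):
    """Return a dict mapping each class label to the mean profile of its samples."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {y: [sum(col) / len(col) for col in zip(*rows)] for y, rows in groups.items()}


def nearest_centroid_predict(centroids, x):
    """Assign x to the class whose centroid is closest in squared Euclidean distance."""
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)))


def loocv_errors(xs, ys, fit, predict):
    """Withhold each specimen once, train on the rest, and count misclassifications."""
    errors = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, xs[i]) != ys[i]:
            errors += 1
    return errors


# Hypothetical data; the last specimen sits among the ALL samples but is labeled AML.
xs = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (3.0, 3.1), (2.9, 3.2), (3.1, 2.8), (1.0, 0.9)]
ys = ["ALL", "ALL", "ALL", "AML", "AML", "AML", "AML"]

n_errors = loocv_errors(xs, ys, nearest_centroid_fit, nearest_centroid_predict)
print(f"cross-validation errors: {n_errors} of {len(xs)}")
```

To choose among candidate models, for example predictors built from different numbers of informative genes, the same function would be run once per candidate and the model with the fewest errors retained.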
the training set. The parameters of the predictor were determined by the
expression levels of these 50 genes in the training samples. The predictor
was then used to classify new samples, by applying it to the expression
levels of these 50 genes in the new samples. A sample was assigned to
the class with the greater prediction strength, provided that strength
exceeded a predetermined threshold. Otherwise, the sample was given
an ‘uncertain’ classification. On the basis of previous analysis, Golub et
al. (1999) used a threshold value of 0.3. The 50-gene predictors derived
in cross-validation tests assigned 36 of the 38 samples as either AML
or ALL and the remaining two as uncertain. They then applied this
50-gene predictor to an independent collection of 34 leukemia samples
that consisted of 24 bone marrow and 10 peripheral blood samples. The
predictor made strong predictions for 29 of these 34 test samples.
To test the hypothesis that class discovery could be tested by class
prediction, Golub et al. evaluated the two clusters found by the
SOM methods as discussed earlier in section 17.5.2. They constructed
predictors to assign new samples to one cluster or the other. Predictors
that used a wide range of different numbers of informative genes
performed well in cross-validation. The results suggest an iterative procedure
for refining clusters, in which a SOM is used initially to cluster
the data, a predictor is constructed, and samples not correctly predicted
in cross-validation are removed. The edited data set could then be used
to generate an improved predictor to be tested on an independent data
set.
The procedure for selecting a threshold of prediction strength is an
essential ingredient for classification. Heuristic selection rules can be
used, but with a certain unavoidable level of subjectivity.
Notes
1 Mardia, K.V., Kent, J.T., Bibby, J.M. (1979). Multivariate Analysis, Academic Press,
Inc., San Diego.
2 Johnson, R.A., Wichern, D.W. (1998). Applied Multivariate Statistical Analysis, 4th Edition,
Prentice-Hall, Inc., New Jersey.
3 McLachlan, G.J. (1992). Discriminant Analysis and Statistical Pattern Recognition. Wiley,
New York.
4 Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning,
Springer, New York.
5 Denison, D.G.T., Holmes, C.C., Mallick, B.K., and Smith, A.F.M. (2002). Bayesian Methods
for Nonlinear Classification and Regression, John Wiley & Sons, West Sussex, U.K.
6 Fix, E., and Hodges, J. (1951). Technical report, Randolph Field, Texas, USAF School of
Aviation Medicine.
7 Cover, T. and Hart, P. (1967). Proc. IEEE Trans. Inform. Theory, 11, 21-27.
8 Golub, T.R., Slonim, D.K., Tamayo, P., et al. (1999). Science, 286, 531-537.
Chapter 19
ARTIFICIAL NEURAL NETWORKS
vector are displayed in Figure 19.1 on the arrows connecting the input
nodes to the output node.
the correct class label for each sample from knowledge of the input
vector The success of the objective depends on a careful selection of
weight vector
rule:
the set of points which are misclassified by the decision rule defined
in (19.1). Then,
The iterations continue until no classification errors are made within the
loop for each The algorithm computes a linear combination of the
variables and returns the signs of the class predictions. This procedure
is guaranteed to converge if there exists a hyperplane that correctly
classifies (separates) the training data14.
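The iteration just described can be sketched as below. The sketch assumes the classical perceptron update with class labels coded as +1 and -1: whenever a training point is misclassified, the weight vector is moved toward that point in the direction of its label. The names train_perceptron, learning_rate, and max_epochs, and the toy data, are illustrative assumptions rather than the notation of (19.1).

```python
def train_perceptron(xs, ys, learning_rate=1.0, max_epochs=1000):
    """Return (weights, bias) after a full pass with no classification errors.

    xs: list of input vectors; ys: labels coded +1 or -1.
    Raises RuntimeError if no error-free pass occurs within max_epochs,
    which is what happens when the data are not linearly separable.
    """
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(xs, ys):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified (or exactly on the boundary)
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * y
                errors += 1
        if errors == 0:  # no errors within the loop: stop
            return w, b
    raise RuntimeError("no separating hyperplane found within max_epochs")


def predict(w, b, x):
    """Sign of the linear combination of the input variables."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Hypothetical linearly separable training data in two dimensions.
xs = [(2.0, 1.0), (3.0, 2.0), (-1.0, -1.5), (-2.0, -1.0)]
ys = [1, 1, -1, -1]

w, b = train_perceptron(xs, ys)
print(w, b, [predict(w, b, x) for x in xs])
```

The max_epochs guard is a practical addition: the convergence guarantee holds only for separable data, so without it the loop would never end on a non-separable training set.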
To find a local minimum for the risk function (19.12) under the set of
equality type constraints for
the method of Lagrange multipliers can be applied as follows.
Define the Lagrange function such that
equal to zero for all It then gives the following set of equations
19.3.4. Discussion
A multilayer neural network trained using back-propagation can solve
a problem that is not linearly separable. Because a neural network with
enough hidden units can approximate any continuous function to any degree of accuracy, a
neural network is useful when one does not have any idea of the functional
relationship between the input and the output variables, i.e., for
a ‘black-box’ problem. A neural network, however, cannot reveal the
true functional relation, which is buried in the summing of the sigmoidal
functions. The solution usually provides a very good fit to a training
sample because of the abundance of weight coefficients being fitted in
the model. The fitted model may perform less well in predicting outcomes
for a holdout sample. As a neural network does not embody any
random error terms, it does not have a built-in inference mechanism.
Continuing research is attempting to remedy this limitation.
The empirical risk function may have more than one minimum. The
gradient descent procedure will converge to one of them, which need not
be the global minimum. Neural networks
typically have a slow convergence rate. The quality of the resulting
solution depends on many characteristics of the network formulation
and algorithmic implementation, including the initial starting weights,
the number of hidden layers, the risk function, and the learning rate or
step size.
Hence, neural networks are not well-controlled learning machines. In
many practical applications, however, neural networks demonstrate good
results15.
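The dependence of gradient descent on the starting weights can be seen in a one-dimensional toy problem. The function below is a hypothetical stand-in for an empirical risk with two minima; it is not derived from any network in this chapter, and the learning rate and number of steps are arbitrary choices.

```python
def risk(w):
    """Hypothetical non-convex risk with two local minima."""
    return w ** 4 - 3 * w ** 2 + w


def risk_gradient(w):
    """Derivative of the toy risk function."""
    return 4 * w ** 3 - 6 * w + 1


def gradient_descent(w0, learning_rate=0.01, steps=2000):
    """Plain gradient descent from starting weight w0 with a fixed step size."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * risk_gradient(w)
    return w


w_left = gradient_descent(-2.0)   # starts to the left of both minima
w_right = gradient_descent(2.0)   # starts to the right of both minima

print(f"start -2.0 -> w = {w_left:.3f}, risk = {risk(w_left):.3f}")
print(f"start +2.0 -> w = {w_right:.3f}, risk = {risk(w_right):.3f}")
```

The two runs settle near w = -1.30 and w = 1.13 respectively. Only the first is the global minimum, so the second run returns a solution with a noticeably higher risk even though the procedure itself is identical.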
Notes
1 Bishop, C.M. (1995). Neural Networks for Pattern Recognition. Clarendon Press, Oxford.
2 Stern, H.S. (1996). Neural networks in applied statistics (with discussion), Technometrics,
38, 205-220.
3 Hertz, J., Krogh, A., Palmer, R.G. (1991). Introduction to the Theory of Neural Computation,
Addison-Wesley, Redwood City, CA.
4 Ripley, R.B. (1996). Pattern Recognition and Neural Networks, Cambridge University
Biomedical Engineering, The Institute of Electric and Electronics Engineers, Inc., New
York.
10 Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F.,
Schwab, M., Antonescu, C.R., Peterson, C., and Meltzer, P.S. (2001). Nature Medicine,
7, 673-679.
11 McCulloch, W.S. and Pitts, W. (1943). Bull. Math. Biophys, 5, 115-137.
12 Rosenblatt, F. (1958). Psychological Review, 65, 386-408. Reprinted in Shavlik & Dietterich
(1990).
13 Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain
Mechanisms. Spartan Books, Washington D.C.
14 Cristianini, N. and Shawe-Taylor, J. (2000), Support Vector Machines, pages 11-18.
15 Vapnik (1998), Statistical Learning Theory, Wiley, New York.
Chapter 20
SUPPORT VECTOR MACHINES
and the sign function denotes the true class label corresponding to
If the training set contains only two classes, we use the
conventional labels +1 and -1: the label is +1 if the sample point actually belongs
to the first class and -1 if the sample point actually belongs to the second class
between the hyperplane and the two groups in the training set determines
the maximal margin separating hyperplane. The situation is illustrated
in Figures 20.1 and 20.2. The figures show a set of samples
containing two classes plotted in the original input space, which happens
to be two-dimensional. The two groups of samples are linearly
separable in the original input space. Figure 20.1 shows a hyperplane
that separates the two groups. Figure 20.2 shows the maximal margin
hyperplane and the separating span referred to as the maximal margin.
It can be seen that the maximal margin hyperplane is a special case
of a separating hyperplane.
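The difference between an arbitrary separating hyperplane and one with a larger margin can be checked numerically. The sketch below assumes labels coded +1 and -1 and measures the margin of a hyperplane as the smallest signed distance from any training point to it. The function name, the toy points, and the two candidate hyperplanes are hypothetical.

```python
import math


def geometric_margin(w, b, xs, ys):
    """Smallest signed distance y * (w.x + b) / ||w|| over the training set.

    A positive value means the hyperplane separates the two groups;
    the larger the value, the wider the separating span.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(xs, ys)
    )


# Hypothetical two-dimensional, linearly separable training set.
xs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (3.0, 0.0), (3.0, 1.0), (4.0, 0.0)]
ys = [-1, -1, -1, 1, 1, 1]

# Two vertical hyperplanes that both separate the groups: x1 = 1.5 and x1 = 2.
margin_a = geometric_margin((1.0, 0.0), -1.5, xs, ys)
margin_b = geometric_margin((1.0, 0.0), -2.0, xs, ys)

print(margin_a, margin_b)
```

Both candidates classify every training point correctly, but the second sits midway between the closest points of the two groups and therefore has the larger margin of the two.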
label the two leading constraints in (20.5) can be combined into one.
Furthermore, for any weight and bias satisfying (20.8), any positive
multiple of them will satisfy the equation as well. Hence we can let
As a result, for linearly separable groups in the training
set, the optimization problem in (20.6) is equivalent to finding (and
that solves
Equation (20.10) implies that the value of has the same sign
as the true class label
The above conditions imply that, for the weight vector that defines
the optimal hyperplane, the following equalities hold true.
Substituting (20.13) into (20.11), and taking into account the equalities
in (20.12), the function can be written in terms of Lagrange multipliers
as follows.
Figure 20.4 illustrates such a mapping. The figure shows a set of samples
containing two classes being mapped from the original input space to a
Using the dual representation one does not represent the feature vectors
explicitly. The use of kernels makes it possible to map the
data implicitly into a feature space and to train a linear SVM in such
a space. Hence, the computational problems inherent in evaluating the
feature map can be avoided.
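A small numerical check illustrates what implicit mapping means. The sketch uses a homogeneous quadratic kernel on two-dimensional inputs, chosen here only because its feature map is easy to write down; the names phi and poly_kernel and the two input points are hypothetical.

```python
import math


def phi(x):
    """Explicit feature map into three dimensions for the quadratic kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """Quadratic kernel: the squared inner product in the original input space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


x = (1.0, 2.0)
z = (3.0, -1.0)

# Inner product computed in the feature space after mapping both points.
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
# The same quantity computed from the original inputs, with no mapping at all.
implicit = poly_kernel(x, z)

print(explicit, implicit)
```

The two values agree up to rounding, so an algorithm that needs only inner products between feature vectors can work entirely with the kernel and never evaluate the feature map. The saving grows quickly with the dimension of the feature space.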
and
radial basis:
where is a scalar.
20.6. Examples
20.6.1. Functional Classification of Genes
Brown, Grundy, Lin et al.4 (2000) use the SVM method to classify
genes by function. To begin, a set of genes in a functional class (i.e.,
Notes
1 Vapnik, V. (1998). Statistical Learning Theory, John Wiley and Sons, New York.
2 Cristianini, N. and Shawe-Taylor, J. (2000). Support Vector Machines, Cambridge University
Press, Cambridge, U.K.
3 Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning,
Springer, New York.
4 Brown, M.P.S., Grundy, W.N., Lin, D., et al. (2000), Proceedings of the National Academy
of Sciences, USA, 97, 262-267.
5 Ramaswamy, S., Tamayo, P., Rifkin, R., et al. (2001). Proceedings of the National Academy
of Sciences, USA, 98, 15149-15154.
6 The implementation of SVM-FU is available at www.ai.mit.edu/projects/cbcl
Appendix A
Sample Size Table for Treatment-control
Designs
Distance
0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00
Genes Power correctly declared as differentially expressed = 0.80
100 69 31 18 11 8 6 5 4 3 3 2
200 75 34 19 12 9 7 5 4 3 3 3
500 84 37 21 14 10 7 6 5 4 3 3
1000 90 40 23 15 10 8 6 5 4 3 3
2000 96 43 24 16 11 8 6 5 4 4 3
5000 105 47 27 17 12 9 7 6 5 4 3
10000 111 50 28 18 13 10 7 6 5 4 4
20000 117 52 30 19 13 10 8 6 5 4 4
Genes Power correctly declared as differentially expressed = 0.90
100 84 38 21 14 10 7 6 5 4 3 3
200 91 41 23 15 11 8 6 5 4 3 3
500 101 45 26 17 12 9 7 5 5 4 3
1000 108 48 27 18 12 9 7 6 5 4 3
2000 114 51 29 19 13 10 8 6 5 4 4
5000 124 55 31 20 14 11 8 7 5 5 4
10000 130 58 33 21 15 11 9 7 6 5 4
20000 137 61 35 22 16 12 9 7 6 5 4
Genes Power correctly declared as differentially expressed = 0.95
100 98 44 25 16 11 8 7 5 4 4 3
200 106 47 27 17 12 9 7 6 5 4 3
500 116 52 29 19 13 10 8 6 5 4 4
1000 123 55 31 20 14 11 8 7 5 5 4
2000 130 58 33 21 15 11 9 7 6 5 4
5000 140 63 35 23 16 12 9 7 6 5 4
10000 147 66 37 24 17 12 10 8 6 5 5
20000 155 69 39 25 18 13 10 8 7 6 5
Genes Power correctly declared as differentially expressed = 0.99
100 127 57 32 21 15 11 8 7 6 5 4
200 135 60 34 22 15 12 9 7 6 5 4
500 147 65 37 24 17 12 10 8 6 5 5
1000 155 69 39 25 18 13 10 8 7 6 5
2000 163 73 41 27 19 14 11 9 7 6 5
5000 174 78 44 28 20 15 11 9 7 6 5
10000 182 81 46 30 21 15 12 9 8 7 6
20000 190 85 48 31 22 16 12 10 8 7 6
Distance
0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00
Genes Power correctly declared as differentially expressed = 0.80
100 54 24 14 9 6 5 4 3 3 2 2
200 60 27 15 10 7 5 4 3 3 2 2
500 69 31 18 11 8 6 5 4 3 3 2
1000 75 34 19 12 9 7 5 4 3 3 3
2000 82 37 21 13 10 7 6 5 4 3 3
5000 90 40 23 15 10 8 6 5 4 3 3
10000 96 43 24 16 11 8 6 5 4 4 3
20000 103 46 26 17 12 9 7 6 5 4 3
Genes Power correctly declared as differentially expressed = 0.90
100 67 30 17 11 8 6 5 4 3 3 2
200 75 33 19 12 9 7 5 4 3 3 3
500 84 38 21 14 10 7 6 5 4 3 3
1000 91 41 23 15 11 8 6 5 4 3 3
2000 98 44 25 16 11 8 7 5 4 4 3
5000 108 48 27 18 12 9 7 6 5 4 3
10000 114 51 29 19 13 10 8 6 5 4 4
20000 121 54 31 20 14 10 8 6 5 4 4
Genes Power correctly declared as differentially expressed = 0.95
100 80 36 20 13 9 7 5 4 4 3 3
200 88 39 22 14 10 8 6 5 4 3 3
500 98 44 25 16 11 8 7 5 4 4 3
1000 106 47 27 17 12 9 7 6 5 4 3
2000 113 51 29 19 13 10 8 6 5 4 4
5000 123 55 31 20 14 11 8 7 5 5 4
10000 130 58 33 21 15 11 9 7 6 5 4
20000 138 62 35 22 16 12 9 7 6 5 4
Genes Power correctly declared as differentially expressed = 0.99
100 106 47 27 17 12 9 7 6 5 4 3
200 115 51 29 19 13 10 8 6 5 4 4
500 127 57 32 21 15 11 8 7 6 5 4
1000 135 60 34 22 15 12 9 7 6 5 4
2000 144 64 36 23 16 12 9 8 6 5 4
5000 155 69 39 25 18 13 10 8 7 6 5
10000 163 73 41 27 19 14 11 9 7 6 5
20000 172 77 43 28 20 14 11 9 7 6 5
Distance
0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00
Genes Power correctly declared as differentially expressed = 0.80
100 47 21 12 8 6 4 3 3 2 2 2
200 54 24 14 9 6 5 4 3 3 2 2
500 62 28 16 10 7 6 4 4 3 3 2
1000 69 31 18 11 8 6 5 4 3 3 2
2000 75 34 19 12 9 7 5 4 3 3 3
5000 84 37 21 14 10 7 6 5 4 3 3
10000 90 40 23 15 10 8 6 5 4 3 3
20000 96 43 24 16 11 8 6 5 4 4 3
Genes Power correctly declared as differentially expressed = 0.90
100 60 27 15 10 7 5 4 3 3 2 2
200 67 30 17 11 8 6 5 4 3 3 2
500 77 34 20 13 9 7 5 4 4 3 3
1000 84 38 21 14 10 7 6 5 4 3 3
2000 91 41 23 15 11 8 6 5 4 3 3
5000 101 45 26 17 12 9 7 5 5 4 3
10000 108 48 27 18 12 9 7 6 5 4 3
20000 114 51 29 19 13 10 8 6 5 4 4
Genes Power correctly declared as differentially expressed = 0.95
100 72 32 18 12 8 6 5 4 3 3 2
200 80 36 20 13 9 7 5 4 4 3 3
500 90 40 23 15 10 8 6 5 4 3 3
1000 98 44 25 16 11 8 7 5 4 4 3
2000 106 47 27 17 12 9 7 6 5 4 3
5000 116 52 29 19 13 10 8 6 5 4 4
10000 123 55 31 20 14 11 8 7 5 5 4
20000 130 58 33 21 15 11 9 7 6 5 4
Genes Power correctly declared as differentially expressed = 0.99
100 97 43 25 16 11 8 7 5 4 4 3
200 106 47 27 17 12 9 7 6 5 4 3
500 118 53 30 19 14 10 8 6 5 4 4
1000 127 57 32 21 15 11 8 7 6 5 4
2000 135 60 34 22 15 12 9 7 6 5 4
5000 147 65 37 24 17 12 10 8 6 5 5
10000 155 69 39 25 18 13 10 8 7 6 5
20000 163 73 41 27 19 14 11 9 7 6 5
Distance
0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00
Genes Power correctly declared as differentially expressed = 0.80
100 41 18 11 7 5 4 3 2 2 2 2
200 47 21 12 8 6 4 3 3 2 2 2
500 56 25 14 9 7 5 4 3 3 2 2
1000 62 28 16 10 7 6 4 4 3 3 2
2000 69 31 18 11 8 6 5 4 3 3 2
5000 77 35 20 13 9 7 5 4 4 3 3
10000 84 37 21 14 10 7 6 5 4 3 3
20000 90 40 23 15 10 8 6 5 4 3 3
Genes Power correctly declared as differentially expressed = 0.90
100 53 24 14 9 6 5 4 3 3 2 2
200 60 27 15 10 7 5 4 3 3 2 2
500 70 31 18 12 8 6 5 4 3 3 2
1000 77 34 20 13 9 7 5 4 4 3 3
2000 84 38 21 14 10 7 6 5 4 3 3
5000 93 42 24 15 11 8 6 5 4 4 3
10000 101 45 26 17 12 9 7 5 5 4 3
20000 108 48 27 18 12 9 7 6 5 4 3
Genes Power correctly declared as differentially expressed = 0.95
100 64 29 16 11 8 6 4 4 3 3 2
200 72 32 18 12 8 6 5 4 3 3 2
500 82 37 21 14 10 7 6 5 4 3 3
1000 90 40 23 15 10 8 6 5 4 3 3
2000 98 44 25 16 11 8 7 5 4 4 3
5000 108 48 27 18 12 9 7 6 5 4 3
10000 116 52 29 19 13 10 8 6 5 4 4
20000 123 55 31 20 14 11 8 7 5 5 4
Genes Power correctly declared as differentially expressed = 0.99
100 87 39 22 14 10 8 6 5 4 3 3
200 97 43 25 16 11 8 7 5 4 4 3
500 109 49 28 18 13 9 7 6 5 4 4
1000 118 53 30 19 14 10 8 6 5 4 4
2000 127 57 32 21 15 11 8 7 6 5 4
5000 138 62 35 23 16 12 9 7 6 5 4
10000 147 65 37 24 17 12 10 8 6 5 5
20000 155 69 39 25 18 13 10 8 7 6 5
Distance
0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00
Genes Power correctly declared as differentially expressed = 0.80
100 37 17 10 6 5 3 3 2 2 2 2
200 43 20 11 7 5 4 3 3 2 2 2
500 52 23 13 9 6 5 4 3 3 2 2
1000 59 26 15 10 7 5 4 3 3 2 2
2000 65 29 17 11 8 6 5 4 3 3 2
5000 74 33 19 12 9 6 5 4 3 3 3
10000 80 36 20 13 9 7 5 4 4 3 3
20000 86 39 22 14 10 8 6 5 4 3 3
Genes Power correctly declared as differentially expressed = 0.90
100 48 22 12 8 6 4 3 3 2 2 2
200 56 25 14 9 7 5 4 3 3 2 2
500 65 29 17 11 8 6 5 4 3 3 2
1000 73 33 19 12 9 6 5 4 3 3 3
2000 80 36 20 13 9 7 5 4 4 3 3
5000 89 40 23 15 10 8 6 5 4 3 3
10000 96 43 24 16 11 8 6 5 4 4 3
20000 103 46 26 17 12 9 7 6 5 4 3
Genes Power correctly declared as differentially expressed = 0.95
100 59 26 15 10 7 5 4 3 3 2 2
200 67 30 17 11 8 6 5 4 3 3 2
500 78 35 20 13 9 7 5 4 4 3 3
1000 86 38 22 14 10 7 6 5 4 3 3
2000 93 42 24 15 11 8 6 5 4 4 3
5000 104 46 26 17 12 9 7 6 5 4 3
10000 111 50 28 18 13 10 7 6 5 4 4
20000 119 53 30 19 14 10 8 6 5 4 4
Genes Power correctly declared as differentially expressed = 0.99
100 81 36 21 13 9 7 6 4 4 3 3
200 91 41 23 15 11 8 6 5 4 3 3
500 103 46 26 17 12 9 7 6 5 4 3
1000 113 50 29 18 13 10 8 6 5 4 4
2000 122 54 31 20 14 10 8 6 5 5 4
5000 133 59 34 22 15 11 9 7 6 5 4
10000 142 63 36 23 16 12 9 7 6 5 4
20000 150 67 38 24 17 13 10 8 6 5 5
Distance
0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00
Genes Power correctly declared as differentially expressed = 0.80
100 34 15 9 6 4 3 3 2 2 2 1
200 41 18 11 7 5 4 3 2 2 2 2
500 49 22 13 8 6 4 4 3 2 2 2
1000 56 25 14 9 7 5 4 3 3 2 2
2000 62 28 16 10 7 6 4 4 3 3 2
5000 71 32 18 12 8 6 5 4 3 3 2
10000 77 35 20 13 9 7 5 4 4 3 3
20000 84 37 21 14 10 7 6 5 4 3 3
Genes Power correctly declared as differentially expressed = 0.90
100 45 20 12 8 5 4 3 3 2 2 2
200 53 24 14 9 6 5 4 3 3 2 2
500 62 28 16 10 7 6 4 4 3 3 2
1000 70 31 18 12 8 6 5 4 3 3 2
2000 77 34 20 13 9 7 5 4 4 3 3
5000 86 39 22 14 10 8 6 5 4 3 3
10000 93 42 24 15 11 8 6 5 4 4 3
20000 101 45 26 17 12 9 7 5 5 4 3
Genes Power correctly declared as differentially expressed = 0.95
100 55 25 14 9 7 5 4 3 3 2 2
200 64 29 16 11 8 6 4 4 3 3 2
500 74 33 19 12 9 7 5 4 3 3 3
1000 82 37 21 14 10 7 6 5 4 3 3
2000 90 40 23 15 10 8 6 5 4 3 3
5000 100 45 25 16 12 9 7 5 4 4 3
10000 108 48 27 18 12 9 7 6 5 4 3
20000 116 52 29 19 13 10 8 6 5 4 4
Genes Power correctly declared as differentially expressed = 0.99
100 77 35 20 13 9 7 5 4 4 3 3
200 87 39 22 14 10 8 6 5 4 3 3
500 100 45 25 16 12 9 7 5 4 4 3
1000 109 49 28 18 13 9 7 6 5 4 4
2000 118 53 30 19 14 10 8 6 5 4 4
5000 130 58 33 21 15 11 9 7 6 5 4
10000 138 62 35 23 16 12 9 7 6 5 4
20000 147 65 37 24 17 12 10 8 6 5 5
Appendix B
Power Table for Multiple-treatment Designs
power of specific design changes. For example, if blocks are used in lieu of
then recalculation of the non-centrality parameter gives Reference
to the closest cell of the table corresponding to T = 6, and
gives an individual power level of .87, which is somewhat larger than
previously.
Example 2. Should an investigator assume that the estimated differential expression
vectors are possibly dependent then the Bonferroni approach is used. As shown
in equation (14.19), we have in the Bonferroni approach. Thus,
the expected number of false positives is necessarily smaller than 1 and the table
should be entered accordingly. Two values of smaller than 1 are displayed in
the table, namely, 0.1 and 0.5. To illustrate this use of the table, consider the same
situation as in Example 1 but now we set Reference to the closest
cell of the table corresponding to T = 6, and
gives an individual power level of .62.
Treatment-control Design
A treatment-control design corresponds to a multiple-treatment design with two
treatments (i.e., T = 2). The design has equal sample sizes for all treatments and
may be of the completely randomized or randomized block variety. The non-centrality
parameter for this special case has the following form.
For both a completely randomized design and matched-pairs design, we have the
identity where denotes the variance of the difference in log-expression
between treatment and control conditions and the error variance of the associated
ANOVA model.
Example 1. Consider a treatment-control design of the completely randomized variety
involving undifferentially expressed genes. The investigator wishes to
control the mean number of false positives at and to detect a two-fold differential
expression between the treatment and control conditions. The experimental
error standard deviation is anticipated to be about 0.40 on a log-2 scale. The
two-fold difference represents a value of 1.000 on a log-2 scale (the base-2 logarithm of 2).
Thus, the ratio equals 1.000/0.40 = 2.500. Eight replications are to be used.
For these specifications, the non-centrality parameter in (B.2) equals
on a log-2 scale for the difference between treatment and control expression levels).
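The way a non-centrality parameter translates into an individual power level for T = 2 can be sketched numerically. The sketch rests on assumptions that should be checked against the main text: that the tabulated power is the rejection probability of a chi-square test with T - 1 = 1 degree of freedom, that the per-gene significance level equals the mean number of false positives divided by the number of genes, and that the non-centrality enters as the squared shift of a standard normal statistic. The function name and argument names are hypothetical.

```python
from statistics import NormalDist


def power_two_treatments(noncentrality, n_genes, mean_false_positives):
    """Approximate individual power for a treatment-control (T = 2) design.

    With one degree of freedom, a non-central chi-square statistic is the
    square of a normal variable with mean sqrt(noncentrality) and unit
    variance, so the power is a two-sided normal tail probability.
    """
    std_normal = NormalDist()
    alpha = mean_false_positives / n_genes          # assumed per-gene test level
    z_crit = std_normal.inv_cdf(1 - alpha / 2)      # two-sided critical value
    shift = noncentrality ** 0.5
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


print(round(power_two_treatments(12, 100, 0.1), 2))
print(round(power_two_treatments(2, 100, 0.5), 2))
print(round(power_two_treatments(2, 100, 1.0), 2))
```

Under these assumptions the first call gives about 0.57, which agrees with the .57 entry for 100 genes at non-centrality 12 in the first T = 2 block of the table, provided that block corresponds to a mean of 0.1 false positives. That correspondence is an inference from the numbers, since the block captions are not reproduced here.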
Number of treatments T = 2
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .57 .67 .76 .83 .88 .92 .95 .96 .98 .99
200 .49 .60 .70 .78 .84 .89 .92 .95 .96 .98
500 .40 .51 .61 .70 .77 .83 .88 .92 .94 .96
1000 .33 .44 .54 .64 .72 .79 .84 .89 .92 .94
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .74 .80 .85 .89 .92 .95 .96 .97 .98 .99
5000 .66 .74 .80 .85 .89 .92 .94 .96 .97 .98
10000 .61 .69 .75 .81 .86 .89 .92 .94 .96 .97
20000 .55 .63 .70 .77 .82 .86 .90 .92 .95 .96
Number of treatments T = 3
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .46 .56 .66 .74 .81 .86 .90 .93 .95 .97
200 .38 .49 .59 .68 .76 .82 .87 .91 .93 .95
500 .30 .40 .50 .59 .68 .75 .81 .86 .90 .93
1000 .24 .34 .43 .53 .62 .70 .76 .82 .87 .90
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .64 .71 .77 .83 .87 .90 .93 .95 .96 .98
5000 .56 .64 .71 .77 .82 .86 .90 .93 .95 .96
10000 .50 .58 .66 .72 .78 .83 .87 .90 .93 .95
20000 .44 .52 .60 .67 .74 .79 .84 .88 .91 .93
Number of treatments T = 4
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
100 .81 .86 .90 .93 .95 .97 .98 .99 .99 .99
200 .76 .82 .87 .90 .93 .95 .97 .98 .98 .99
500 .68 .75 .81 .86 .89 .92 .94 .96 .97 .98
1000 .62 .70 .76 .82 .86 .90 .92 .94 .96 .97
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .56 .64 .71 .77 .82 .86 .90 .92 .95 .96
5000 .48 .56 .64 .71 .77 .82 .86 .89 .92 .94
10000 .42 .50 .58 .65 .72 .77 .82 .86 .90 .92
20000 .37 .45 .53 .60 .67 .73 .78 .83 .87 .90
Number of treatments T = 5
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
100 .77 .82 .87 .91 .93 .95 .97 .98 .98 .99
200 .71 .77 .83 .87 .90 .93 .95 .97 .98 .98
500 .63 .70 .76 .82 .86 .90 .92 .94 .96 .97
1000 .56 .64 .71 .77 .82 .86 .90 .92 .94 .96
Non-centrality
Genes 32 34 36 38 40 42 44 46 48 50
2000 .83 .87 .90 .92 .94 .96 .97 .98 .99 .99
5000 .77 .82 .86 .89 .92 .94 .95 .97 .98 .98
10000 .72 .78 .82 .86 .89 .92 .94 .96 .97 .98
20000 .68 .73 .79 .83 .87 .90 .92 .94 .96 .97
Number of treatments T = 6
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
100 .73 .79 .84 .88 .91 .94 .95 .97 .98 .98
200 .66 .73 .79 .84 .88 .91 .93 .95 .97 .98
500 .58 .65 .72 .78 .83 .87 .90 .93 .95 .96
1000 .51 .59 .66 .73 .78 .83 .87 .90 .93 .95
Non-centrality
Genes 32 34 36 38 40 42 44 46 48 50
2000 .79 .83 .87 .90 .93 .95 .96 .97 .98 .99
5000 .73 .78 .83 .86 .89 .92 .94 .96 .97 .98
10000 .68 .74 .79 .83 .87 .90 .92 .94 .96 .97
20000 .63 .69 .74 .79 .83 .87 .90 .92 .94 .96
Number of treatments T = 7
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
100 .69 .75 .81 .85 .89 .92 .94 .96 .97 .98
200 .62 .69 .76 .81 .85 .89 .92 .94 .96 .97
500 .53 .61 .68 .74 .80 .84 .88 .91 .93 .95
1000 .47 .55 .62 .69 .75 .80 .84 .88 .91 .93
Non-centrality
Genes 32 34 36 38 40 42 44 46 48 50
2000 .75 .80 .84 .88 .91 .93 .95 .96 .97 .98
5000 .69 .74 .79 .84 .87 .90 .92 .94 .96 .97
10000 .63 .70 .75 .80 .84 .87 .90 .92 .94 .96
20000 .58 .64 .70 .76 .80 .84 .87 .90 .92 .94
Number of treatments T = 8
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
100 .65 .72 .78 .83 .87 .90 .93 .95 .96 .97
200 .58 .66 .72 .78 .83 .87 .90 .93 .95 .96
500 .49 .57 .64 .71 .77 .81 .86 .89 .92 .94
1000 .43 .51 .58 .65 .71 .77 .82 .86 .89 .91
Non-centrality
Genes 32 34 36 38 40 42 44 46 48 50
2000 .72 .77 .82 .86 .89 .91 .93 .95 .96 .97
5000 .65 .71 .76 .81 .85 .88 .91 .93 .95 .96
10000 .59 .66 .71 .77 .81 .85 .88 .91 .93 .94
20000 .54 .61 .67 .72 .77 .81 .85 .88 .91 .93
Number of treatments T = 9
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
100 .62 .69 .75 .80 .85 .88 .91 .94 .95 .97
200 .55 .62 .69 .75 .80 .85 .88 .91 .93 .95
500 .46 .53 .61 .68 .74 .79 .83 .87 .90 .92
1000 .39 .47 .54 .62 .68 .74 .79 .83 .87 .90
Non-centrality
Genes 32 34 36 38 40 42 44 46 48 50
2000 .69 .74 .79 .83 .87 .90 .92 .94 .95 .97
5000 .61 .67 .73 .78 .82 .86 .89 .91 .93 .95
10000 .56 .62 .68 .73 .78 .82 .86 .89 .91 .93
20000 .50 .57 .63 .69 .74 .79 .83 .86 .89 .91
Number of treatments T = 10
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
100 .59 .66 .72 .78 .83 .87 .90 .92 .94 .96
200 .52 .59 .66 .72 .78 .82 .86 .90 .92 .94
500 .43 .50 .58 .64 .71 .76 .81 .85 .88 .91
1000 .36 .44 .51 .58 .65 .71 .76 .81 .85 .88
Non-centrality
Genes 32 34 36 38 40 42 44 46 48 50
2000 .65 .71 .76 .81 .85 .88 .91 .93 .95 .96
5000 .58 .64 .70 .75 .80 .84 .87 .90 .92 .94
10000 .52 .59 .65 .70 .75 .80 .84 .87 .90 .92
20000 .47 .53 .60 .66 .71 .76 .80 .84 .87 .90
Number of treatments T = 2
Non-centrality
Genes 2 4 6 8 10 12 14 16 18 20
100 .08 .21 .36 .51 .64 .74 .83 .88 .92 .95
200 .05 .15 .28 .42 .56 .67 .76 .84 .89 .93
500 .03 .10 .20 .32 .45 .57 .67 .76 .83 .88
1000 .02 .07 .15 .26 .38 .49 .60 .70 .78 .84
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
2000 .42 .53 .63 .72 .79 .85 .89 .92 .95 .97
5000 .33 .44 .54 .64 .72 .79 .84 .89 .92 .94
10000 .28 .38 .48 .57 .66 .74 .80 .85 .89 .92
20000 .23 .32 .41 .51 .60 .68 .75 .81 .86 .90
Number of treatments T = 3
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .64 .74 .81 .87 .91 .94 .96 .98 .98 .99
200 .56 .66 .75 .82 .87 .91 .94 .96 .97 .98
500 .46 .56 .66 .74 .81 .86 .90 .93 .95 .97
1000 .38 .49 .59 .68 .76 .82 .87 .91 .93 .95
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .77 .83 .87 .91 .93 .95 .97 .98 .99 .99
5000 .70 .76 .82 .87 .90 .93 .95 .96 .98 .98
10000 .64 .71 .77 .83 .87 .90 .93 .95 .96 .98
20000 .58 .66 .73 .79 .83 .88 .91 .93 .95 .97
Number of treatments T = 4
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .57 .67 .75 .82 .87 .91 .94 .96 .97 .98
200 .48 .59 .68 .76 .82 .87 .91 .94 .96 .97
500 .38 .49 .59 .67 .75 .81 .86 .90 .93 .95
1000 .31 .42 .51 .61 .69 .76 .82 .87 .90 .93
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .70 .77 .82 .87 .90 .93 .95 .97 .98 .98
5000 .62 .70 .76 .82 .86 .90 .92 .94 .96 .97
10000 .56 .64 .71 .77 .82 .86 .90 .92 .95 .96
20000 .50 .58 .66 .72 .78 .83 .87 .90 .93 .95
Number of treatments T = 5
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .51 .61 .70 .78 .84 .88 .92 .94 .96 .97
200 .43 .53 .63 .71 .78 .84 .88 .92 .94 .96
500 .33 .43 .53 .62 .70 .77 .82 .87 .91 .93
1000 .27 .36 .45 .55 .63 .71 .77 .83 .87 .90
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .65 .72 .78 .83 .87 .91 .93 .95 .96 .98
5000 .56 .64 .71 .77 .82 .86 .90 .92 .94 .96
10000 .50 .58 .65 .72 .78 .83 .87 .90 .92 .94
20000 .44 .52 .60 .67 .73 .78 .83 .87 .90 .93
Number of treatments T = 6
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .46 .57 .66 .74 .80 .85 .90 .93 .95 .96
200 .38 .48 .58 .67 .74 .80 .85 .89 .92 .95
500 .29 .38 .48 .57 .65 .73 .79 .84 .88 .91
1000 .23 .32 .41 .50 .58 .66 .73 .79 .84 .88
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .60 .67 .74 .79 .84 .88 .91 .93 .95 .97
5000 .51 .59 .66 .73 .78 .83 .87 .90 .93 .95
10000 .45 .53 .60 .67 .74 .79 .83 .87 .90 .93
20000 .39 .47 .55 .62 .68 .74 .79 .84 .87 .90
Number of treatments T = 7
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
100 .83 .87 .91 .93 .95 .97 .98 .99 .99 .99
200 .77 .83 .87 .90 .93 .95 .97 .98 .98 .99
500 .69 .75 .81 .85 .89 .92 .94 .96 .97 .98
1000 .62 .69 .76 .81 .85 .89 .92 .94 .96 .97
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .55 .63 .70 .76 .81 .85 .89 .92 .94 .95
5000 .47 .55 .62 .69 .75 .80 .84 .88 .91 .93
10000 .41 .48 .56 .63 .70 .75 .80 .84 .88 .91
20000 .35 .42 .50 .57 .64 .70 .76 .81 .85 .88
Number of treatments T = 8
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
100 .80 .85 .89 .92 .94 .96 .97 .98 .99 .99
200 .74 .80 .85 .89 .92 .94 .96 .97 .98 .99
500 .65 .72 .78 .83 .87 .90 .93 .95 .96 .97
1000 .58 .66 .72 .78 .83 .87 .90 .93 .95 .96
Non-centrality
Genes 32 34 36 38 40 42 44 46 48 50
2000 .83 .87 .90 .92 .94 .96 .97 .98 .98 .99
5000 .77 .82 .86 .89 .91 .94 .95 .96 .97 .98
10000 .72 .77 .82 .86 .89 .91 .93 .95 .96 .97
20000 .67 .72 .78 .82 .86 .89 .91 .93 .95 .96
Number of treatments T = 9
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
100 .78 .83 .87 .91 .93 .95 .97 .98 .98 .99
200 .71 .77 .82 .87 .90 .93 .95 .96 .97 .98
500 .62 .69 .75 .80 .85 .88 .91 .94 .95 .97
1000 .55 .62 .69 .75 .80 .85 .88 .91 .93 .95
Non-centrality
Genes 32 34 36 38 40 42 44 46 48 50
2000 .80 .85 .88 .91 .93 .95 .96 .97 .98 .99
5000 .74 .79 .83 .87 .90 .92 .94 .96 .97 .98
10000 .69 .74 .79 .83 .87 .90 .92 .94 .95 .97
20000 .63 .69 .75 .79 .83 .87 .90 .92 .94 .95
Number of treatments T = 10
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
100 .75 .81 .85 .89 .92 .94 .96 .97 .98 .99
200 .68 .75 .80 .85 .88 .91 .94 .95 .97 .98
500 .59 .66 .72 .78 .83 .87 .90 .92 .94 .96
1000 .52 .59 .66 .72 .78 .82 .86 .90 .92 .94
Non-centrality
Genes 32 34 36 38 40 42 44 46 48 50
2000 .78 .82 .86 .89 .92 .94 .95 .97 .98 .98
5000 .71 .76 .81 .85 .88 .91 .93 .95 .96 .97
10000 .65 .71 .76 .81 .85 .88 .91 .93 .95 .96
20000 .60 .66 .72 .77 .81 .85 .88 .91 .93 .94
APPENDIX B: Power Table for Multiple-treatment Designs 337
Number of treatments T = 2
Non-centrality
Genes 2 4 6 8 10 12 14 16 18 20
100 .12 .28 .45 .60 .72 .81 .88 .92 .95 .97
200 .08 .21 .36 .51 .64 .74 .83 .88 .92 .95
500 .05 .14 .26 .40 .53 .65 .74 .82 .88 .92
1000 .03 .10 .20 .32 .45 .57 .67 .76 .83 .88
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
2000 .49 .60 .70 .78 .84 .89 .92 .95 .96 .98
5000 .40 .51 .61 .70 .77 .83 .88 .92 .94 .96
10000 .33 .44 .54 .64 .72 .79 .84 .89 .92 .94
20000 .28 .38 .48 .57 .66 .74 .80 .85 .89 .92
Number of treatments T = 3
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .72 .80 .87 .91 .94 .96 .98 .99 .99 .99
200 .64 .74 .81 .87 .91 .94 .96 .98 .98 .99
500 .53 .64 .73 .80 .86 .90 .93 .95 .97 .98
1000 .46 .56 .66 .74 .81 .86 .90 .93 .95 .97
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .82 .87 .91 .93 .95 .97 .98 .99 .99 .99
5000 .75 .81 .86 .90 .93 .95 .96 .98 .98 .99
10000 .70 .76 .82 .87 .90 .93 .95 .96 .98 .98
20000 .64 .71 .77 .83 .87 .90 .93 .95 .96 .98
Number of treatments T = 4
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .65 .75 .82 .87 .91 .94 .96 .98 .98 .99
200 .57 .67 .75 .82 .87 .91 .94 .96 .97 .98
500 .46 .56 .66 .74 .81 .86 .90 .93 .95 .97
1000 .38 .49 .59 .67 .75 .81 .86 .90 .93 .95
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .76 .82 .87 .90 .93 .95 .97 .98 .98 .99
5000 .68 .75 .81 .86 .89 .92 .94 .96 .97 .98
10000 .62 .70 .76 .82 .86 .90 .92 .94 .96 .97
20000 .56 .64 .71 .77 .82 .86 .90 .92 .95 .96
Number of treatments T = 5
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .60 .69 .77 .84 .89 .92 .95 .96 .98 .99
200 .51 .61 .70 .78 .84 .88 .92 .94 .96 .97
500 .40 .51 .60 .69 .76 .82 .87 .91 .93 .95
1000 .33 .43 .53 .62 .70 .77 .82 .87 .91 .93
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .71 .77 .83 .87 .90 .93 .95 .97 .98 .98
5000 .63 .70 .76 .82 .86 .90 .92 .94 .96 .97
10000 .56 .64 .71 .77 .82 .86 .90 .92 .94 .96
20000 .50 .58 .65 .72 .78 .83 .87 .90 .92 .94
Number of treatments T = 6
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .55 .65 .74 .80 .86 .90 .93 .95 .97 .98
200 .46 .57 .66 .74 .80 .85 .90 .93 .95 .96
500 .36 .46 .55 .64 .72 .78 .84 .88 .91 .94
1000 .29 .38 .48 .57 .65 .73 .79 .84 .88 .91
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .66 .73 .79 .84 .88 .91 .93 .95 .97 .98
5000 .58 .65 .72 .78 .83 .87 .90 .93 .95 .96
10000 .51 .59 .66 .73 .78 .83 .87 .90 .93 .95
20000 .45 .53 .60 .67 .74 .79 .83 .87 .90 .93
Number of treatments T = 7
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .51 .61 .70 .77 .83 .88 .91 .94 .96 .97
200 .42 .52 .62 .70 .77 .83 .87 .91 .93 .95
500 .32 .42 .51 .60 .68 .75 .81 .86 .89 .92
1000 .26 .34 .44 .53 .61 .69 .75 .81 .85 .89
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .62 .69 .76 .81 .85 .89 .92 .94 .96 .97
5000 .53 .61 .68 .74 .80 .84 .88 .91 .93 .95
10000 .47 .55 .62 .69 .75 .80 .84 .88 .91 .93
20000 .41 .48 .56 .63 .70 .75 .80 .84 .88 .91
Number of treatments T = 8
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .48 .58 .67 .74 .81 .86 .90 .93 .95 .96
200 .39 .49 .58 .67 .74 .80 .85 .89 .92 .94
500 .29 .38 .48 .56 .65 .72 .78 .83 .87 .91
1000 .23 .31 .40 .49 .57 .65 .72 .78 .83 .87
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .58 .66 .72 .78 .83 .87 .90 .93 .95 .96
5000 .49 .57 .64 .71 .77 .81 .86 .89 .92 .94
10000 .43 .51 .58 .65 .71 .77 .82 .86 .89 .91
20000 .37 .44 .52 .59 .66 .72 .77 .82 .86 .89
Number of treatments T = 9
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
100 .84 .88 .91 .94 .96 .97 .98 .99 .99 .99
200 .78 .83 .87 .91 .93 .95 .97 .98 .98 .99
500 .69 .75 .81 .85 .89 .92 .94 .96 .97 .98
1000 .62 .69 .75 .80 .85 .88 .91 .94 .95 .97
Non-centrality
Genes 32 34 36 38 40 42 44 46 48 50
2000 .85 .88 .91 .93 .95 .96 .97 .98 .99 .99
5000 .79 .83 .87 .90 .92 .94 .96 .97 .98 .98
10000 .74 .79 .83 .87 .90 .92 .94 .96 .97 .98
20000 .69 .74 .79 .83 .87 .90 .92 .94 .95 .97
Number of treatments T = 10
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
100 .82 .86 .90 .93 .95 .96 .97 .98 .99 .99
200 .75 .81 .85 .89 .92 .94 .96 .97 .98 .99
500 .66 .73 .78 .83 .87 .90 .93 .95 .96 .97
1000 .59 .66 .72 .78 .83 .87 .90 .92 .94 .96
Non-centrality
Genes 32 34 36 38 40 42 44 46 48 50
2000 .82 .86 .90 .92 .94 .96 .97 .98 .98 .99
5000 .76 .81 .85 .88 .91 .93 .95 .96 .97 .98
10000 .71 .76 .81 .85 .88 .91 .93 .95 .96 .97
20000 .65 .71 .76 .81 .85 .88 .91 .93 .95 .96
Number of treatments T = 2
Non-centrality
Genes 2 4 6 8 10 12 14 16 18 20
100 .18 .37 .55 .69 .80 .87 .92 .95 .97 .98
200 .12 .28 .45 .60 .72 .81 .88 .92 .95 .97
500 .07 .19 .33 .48 .61 .72 .81 .87 .91 .94
1000 .05 .14 .26 .40 .53 .65 .74 .82 .88 .92
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
2000 .57 .67 .76 .83 .88 .92 .95 .96 .98 .99
5000 .47 .58 .68 .76 .82 .87 .91 .94 .96 .97
10000 .40 .51 .61 .70 .77 .83 .88 .92 .94 .96
20000 .33 .44 .54 .64 .72 .79 .84 .89 .92 .94
Number of treatments T = 3
Non-centrality
Genes 2 4 6 8 10 12 14 16 18 20
100 .13 .28 .44 .58 .70 .80 .86 .91 .94 .97
200 .08 .20 .35 .49 .61 .72 .80 .87 .91 .94
500 .05 .13 .24 .37 .50 .61 .71 .79 .85 .90
1000 .03 .09 .18 .30 .42 .53 .64 .73 .80 .86
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
2000 .46 .56 .66 .74 .81 .86 .90 .93 .95 .97
5000 .36 .47 .57 .66 .74 .80 .85 .90 .93 .95
10000 .30 .40 .50 .59 .68 .75 .81 .86 .90 .93
20000 .24 .34 .43 .53 .62 .70 .76 .82 .87 .90
Number of treatments T = 4
Non-centrality
Genes 2 4 6 8 10 12 14 16 18 20
100 .11 .23 .37 .51 .64 .74 .82 .87 .92 .95
200 .07 .16 .29 .42 .54 .65 .75 .82 .87 .91
500 .04 .10 .19 .31 .43 .54 .64 .73 .80 .86
1000 .02 .07 .14 .24 .35 .46 .56 .66 .74 .81
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
2000 .38 .49 .59 .67 .75 .81 .86 .90 .93 .95
5000 .29 .39 .49 .58 .67 .74 .80 .85 .89 .92
10000 .24 .33 .42 .52 .60 .68 .75 .81 .86 .89
20000 .19 .27 .36 .45 .54 .62 .70 .76 .82 .86
Number of treatments T = 5
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .69 .77 .84 .89 .93 .95 .97 .98 .99 .99
200 .60 .69 .77 .84 .89 .92 .95 .96 .98 .99
500 .48 .59 .68 .76 .82 .87 .91 .94 .96 .97
1000 .40 .51 .60 .69 .76 .82 .87 .91 .93 .95
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .77 .82 .87 .91 .93 .95 .97 .98 .98 .99
5000 .69 .76 .81 .86 .90 .92 .94 .96 .97 .98
10000 .63 .70 .76 .82 .86 .90 .92 .94 .96 .97
20000 .56 .64 .71 .77 .82 .86 .90 .92 .94 .96
Number of treatments T = 6
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .65 .74 .81 .86 .91 .94 .96 .97 .98 .99
200 .55 .65 .74 .80 .86 .90 .93 .95 .97 .98
500 .44 .54 .63 .71 .78 .84 .88 .92 .94 .96
1000 .36 .46 .55 .64 .72 .78 .84 .88 .91 .94
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .73 .79 .84 .88 .91 .94 .95 .97 .98 .98
5000 .64 .71 .77 .83 .87 .90 .93 .95 .96 .97
10000 .58 .65 .72 .78 .83 .87 .90 .93 .95 .96
20000 .51 .59 .66 .73 .78 .83 .87 .90 .93 .95
Number of treatments T = 7
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .61 .70 .78 .84 .89 .92 .95 .96 .98 .98
200 .51 .61 .70 .77 .83 .88 .91 .94 .96 .97
500 .40 .50 .59 .68 .75 .81 .86 .90 .93 .95
1000 .32 .42 .51 .60 .68 .75 .81 .86 .89 .92
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .69 .75 .81 .85 .89 .92 .94 .96 .97 .98
5000 .60 .67 .74 .79 .84 .88 .91 .93 .95 .96
10000 .53 .61 .68 .74 .80 .84 .88 .91 .93 .95
20000 .47 .55 .62 .69 .75 .80 .84 .88 .91 .93
Number of treatments T = 8
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .58 .67 .75 .81 .87 .90 .93 .95 .97 .98
200 .48 .58 .67 .74 .81 .86 .90 .93 .95 .96
500 .37 .46 .56 .64 .72 .78 .83 .88 .91 .94
1000 .29 .38 .48 .56 .65 .72 .78 .83 .87 .91
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .65 .72 .78 .83 .87 .90 .93 .95 .96 .97
5000 .56 .64 .70 .76 .81 .86 .89 .92 .94 .96
10000 .49 .57 .64 .71 .77 .81 .86 .89 .92 .94
20000 .43 .51 .58 .65 .71 .77 .82 .86 .89 .91
Number of treatments T = 9
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .55 .64 .72 .79 .85 .89 .92 .94 .96 .97
200 .45 .55 .64 .72 .78 .84 .88 .91 .94 .96
500 .34 .43 .52 .61 .69 .75 .81 .86 .89 .92
1000 .27 .35 .44 .53 .61 .69 .75 .81 .85 .89
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .62 .69 .75 .80 .85 .88 .91 .94 .95 .97
5000 .52 .60 .67 .73 .79 .83 .87 .90 .93 .95
10000 .46 .53 .61 .68 .74 .79 .83 .87 .90 .92
20000 .39 .47 .54 .62 .68 .74 .79 .83 .87 .90
Number of treatments T = 10
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .52 .61 .70 .77 .83 .87 .91 .93 .95 .97
200 .42 .52 .61 .69 .76 .82 .86 .90 .93 .95
500 .31 .40 .49 .58 .66 .73 .79 .84 .88 .91
1000 .24 .33 .41 .50 .58 .66 .73 .78 .83 .87
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .59 .66 .72 .78 .83 .87 .90 .92 .94 .96
5000 .49 .57 .64 .70 .76 .81 .85 .89 .91 .93
10000 .43 .50 .58 .64 .71 .76 .81 .85 .88 .91
20000 .36 .44 .51 .58 .65 .71 .76 .81 .85 .88
Number of treatments T = 2
Non-centrality
Genes 2 4 6 8 10 12 14 16 18 20
100 .23 .43 .61 .74 .84 .90 .94 .97 .98 .99
200 .15 .33 .51 .65 .77 .85 .90 .94 .96 .98
500 .09 .23 .38 .53 .66 .76 .84 .89 .93 .96
1000 .06 .17 .30 .44 .58 .69 .78 .85 .90 .93
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
2000 .61 .71 .80 .86 .90 .94 .96 .97 .98 .99
5000 .51 .62 .72 .79 .85 .90 .93 .95 .97 .98
10000 .44 .55 .65 .73 .80 .86 .90 .93 .95 .97
20000 .37 .48 .58 .67 .75 .82 .87 .90 .93 .95
Number of treatments T = 3
Non-centrality
Genes 2 4 6 8 10 12 14 16 18 20
100 .17 .34 .50 .64 .76 .84 .90 .93 .96 .98
200 .11 .25 .40 .54 .67 .77 .84 .89 .93 .96
500 .06 .16 .29 .42 .55 .66 .75 .83 .88 .92
1000 .04 .11 .22 .34 .46 .58 .68 .77 .83 .88
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
2000 .50 .61 .70 .78 .84 .89 .92 .95 .96 .98
5000 .40 .51 .61 .70 .77 .83 .88 .91 .94 .96
10000 .33 .44 .54 .63 .71 .78 .84 .88 .92 .94
20000 .27 .37 .47 .57 .65 .73 .79 .84 .89 .92
Number of treatments T = 4
Non-centrality
Genes 2 4 6 8 10 12 14 16 18 20
100 .14 .28 .43 .57 .69 .78 .85 .90 .94 .96
200 .09 .20 .34 .47 .60 .70 .79 .85 .90 .93
500 .05 .13 .23 .35 .48 .59 .69 .77 .84 .88
1000 .03 .09 .17 .28 .39 .51 .61 .70 .78 .84
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
2000 .43 .53 .63 .71 .78 .84 .89 .92 .94 .96
5000 .33 .43 .53 .62 .71 .77 .83 .88 .91 .94
10000 .27 .37 .46 .56 .64 .72 .78 .84 .88 .91
20000 .22 .30 .40 .49 .58 .66 .73 .79 .84 .88
Number of treatments T = 5
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .74 .82 .87 .92 .94 .96 .98 .99 .99 .99
200 .65 .74 .81 .87 .91 .94 .96 .97 .98 .99
500 .53 .63 .72 .79 .85 .89 .93 .95 .97 .98
1000 .45 .55 .65 .73 .80 .85 .89 .92 .95 .96
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .80 .85 .89 .92 .95 .96 .97 .98 .99 .99
5000 .72 .79 .84 .88 .91 .94 .96 .97 .98 .99
10000 .66 .73 .79 .84 .88 .91 .94 .95 .97 .98
20000 .60 .68 .74 .80 .85 .88 .91 .94 .95 .97
Number of treatments T = 6
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .70 .78 .85 .89 .93 .95 .97 .98 .99 .99
200 .61 .70 .78 .84 .89 .92 .95 .96 .98 .98
500 .49 .59 .68 .76 .82 .87 .91 .93 .95 .97
1000 .40 .50 .60 .68 .76 .82 .87 .90 .93 .95
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .76 .82 .86 .90 .93 .95 .96 .98 .98 .99
5000 .68 .75 .80 .85 .89 .92 .94 .96 .97 .98
10000 .61 .69 .75 .81 .85 .89 .92 .94 .96 .97
20000 .55 .63 .70 .76 .81 .85 .89 .92 .94 .96
Number of treatments T = 7
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .67 .75 .82 .87 .91 .94 .96 .97 .98 .99
200 .57 .66 .75 .81 .86 .90 .93 .95 .97 .98
500 .45 .55 .64 .72 .79 .84 .88 .92 .94 .96
1000 .36 .46 .56 .65 .72 .79 .84 .88 .91 .94
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .72 .79 .84 .88 .91 .93 .95 .97 .98 .98
5000 .64 .71 .77 .82 .86 .90 .92 .95 .96 .97
10000 .57 .65 .71 .77 .82 .86 .90 .92 .94 .96
20000 .50 .58 .66 .72 .78 .82 .86 .90 .92 .94
Number of treatments T = 8
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .64 .72 .80 .85 .90 .93 .95 .97 .98 .99
200 .53 .63 .72 .79 .84 .89 .92 .94 .96 .97
500 .41 .51 .60 .69 .76 .82 .86 .90 .93 .95
1000 .33 .43 .52 .61 .69 .76 .81 .86 .90 .92
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .69 .76 .81 .86 .89 .92 .94 .96 .97 .98
5000 .60 .67 .74 .79 .84 .88 .91 .93 .95 .96
10000 .53 .61 .68 .74 .79 .84 .88 .91 .93 .95
20000 .47 .54 .62 .68 .74 .80 .84 .88 .91 .93
Number of treatments T = 9
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .61 .70 .77 .83 .88 .91 .94 .96 .97 .98
200 .51 .60 .69 .76 .82 .87 .90 .93 .95 .97
500 .38 .48 .57 .66 .73 .79 .84 .88 .91 .94
1000 .31 .40 .49 .58 .66 .73 .79 .84 .88 .91
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .66 .73 .78 .83 .87 .90 .93 .95 .96 .97
5000 .57 .64 .71 .77 .82 .86 .89 .92 .94 .96
10000 .50 .57 .64 .71 .77 .81 .86 .89 .92 .94
20000 .43 .51 .58 .65 .71 .77 .81 .85 .89 .91
Number of treatments T = 10
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .58 .67 .75 .81 .86 .90 .93 .95 .97 .98
200 .48 .57 .66 .74 .80 .85 .89 .92 .94 .96
500 .36 .45 .54 .63 .70 .77 .82 .87 .90 .93
1000 .28 .37 .46 .55 .63 .70 .76 .82 .86 .89
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .63 .70 .76 .81 .85 .89 .92 .94 .96 .97
5000 .53 .61 .68 .74 .79 .84 .87 .90 .93 .95
10000 .46 .54 .61 .68 .74 .79 .83 .87 .90 .92
20000 .40 .48 .55 .62 .68 .74 .79 .83 .87 .90
Number of treatments T = 2
Non-centrality
Genes 2 4 6 8 10 12 14 16 18 20
100 .26 .48 .65 .78 .87 .92 .95 .97 .99 .99
200 .18 .37 .55 .69 .80 .87 .92 .95 .97 .98
500 .11 .26 .42 .57 .70 .79 .86 .91 .94 .97
1000 .07 .19 .33 .48 .61 .72 .81 .87 .91 .94
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
2000 .65 .74 .82 .88 .92 .95 .96 .98 .99 .99
5000 .54 .65 .74 .81 .87 .91 .94 .96 .97 .98
10000 .47 .58 .68 .76 .82 .87 .91 .94 .96 .97
20000 .40 .51 .61 .70 .77 .83 .88 .92 .94 .96
Number of treatments T = 3
Non-centrality
Genes 2 4 6 8 10 12 14 16 18 20
100 .20 .38 .55 .69 .79 .86 .91 .95 .97 .98
200 .13 .28 .44 .58 .70 .80 .86 .91 .94 .97
500 .07 .18 .32 .46 .59 .70 .78 .85 .90 .93
1000 .05 .13 .24 .37 .50 .61 .71 .79 .85 .90
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
2000 .53 .64 .73 .80 .86 .90 .93 .95 .97 .98
5000 .43 .54 .64 .72 .79 .85 .89 .92 .95 .96
10000 .36 .47 .57 .66 .74 .80 .85 .90 .93 .95
20000 .30 .40 .50 .59 .68 .75 .81 .86 .90 .93
Number of treatments T = 4
Non-centrality
Genes 2 4 6 8 10 12 14 16 18 20
100 .17 .32 .48 .62 .73 .82 .88 .92 .95 .97
200 .11 .23 .37 .51 .64 .74 .82 .87 .92 .95
500 .06 .15 .26 .39 .51 .63 .72 .80 .86 .90
1000 .04 .10 .19 .31 .43 .54 .64 .73 .80 .86
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
2000 .46 .56 .66 .74 .81 .86 .90 .93 .95 .97
5000 .36 .46 .56 .65 .73 .80 .85 .89 .92 .95
10000 .29 .39 .49 .58 .67 .74 .80 .85 .89 .92
20000 .24 .33 .42 .52 .60 .68 .75 .81 .86 .89
Number of treatments T = 5
Non-centrality
Genes 2 4 6 8 10 12 14 16 18 20
100 .15 .29 .43 .57 .68 .78 .85 .90 .93 .96
200 .09 .20 .33 .46 .58 .69 .77 .84 .89 .93
500 .05 .12 .22 .34 .46 .57 .67 .75 .82 .87
1000 .03 .08 .16 .26 .37 .48 .59 .68 .76 .82
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
2000 .40 .51 .60 .69 .76 .82 .87 .91 .93 .95
5000 .31 .41 .50 .59 .68 .75 .81 .86 .89 .92
10000 .25 .34 .43 .52 .61 .69 .76 .81 .86 .90
20000 .20 .28 .37 .46 .54 .63 .70 .76 .82 .86
Number of treatments T = 6
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .74 .81 .87 .91 .94 .96 .98 .98 .99 .99
200 .65 .74 .81 .86 .91 .94 .96 .97 .98 .99
500 .52 .62 .71 .78 .84 .89 .92 .94 .96 .97
1000 .44 .54 .63 .71 .78 .84 .88 .92 .94 .96
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .78 .84 .88 .91 .94 .96 .97 .98 .99 .99
5000 .71 .77 .82 .87 .90 .93 .95 .96 .97 .98
10000 .64 .71 .77 .83 .87 .90 .93 .95 .96 .97
20000 .58 .65 .72 .78 .83 .87 .90 .93 .95 .96
Number of treatments T = 7
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .71 .79 .85 .89 .93 .95 .97 .98 .99 .99
200 .61 .70 .78 .84 .89 .92 .95 .96 .98 .98
500 .48 .58 .67 .75 .81 .86 .90 .93 .95 .97
1000 .40 .50 .59 .68 .75 .81 .86 .90 .93 .95
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .75 .81 .86 .89 .92 .94 .96 .97 .98 .99
5000 .67 .73 .79 .84 .88 .91 .93 .95 .97 .98
10000 .60 .67 .74 .79 .84 .88 .91 .93 .95 .96
20000 .53 .61 .68 .74 .80 .84 .88 .91 .93 .95
Number of treatments T = 8
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .68 .76 .83 .88 .91 .94 .96 .97 .98 .99
200 .58 .67 .75 .81 .87 .90 .93 .95 .97 .98
500 .45 .55 .64 .72 .79 .84 .88 .92 .94 .96
1000 .37 .46 .56 .64 .72 .78 .83 .88 .91 .94
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .72 .78 .83 .87 .91 .93 .95 .96 .98 .98
5000 .63 .70 .76 .81 .86 .89 .92 .94 .96 .97
10000 .56 .64 .70 .76 .81 .86 .89 .92 .94 .96
20000 .49 .57 .64 .71 .77 .81 .86 .89 .92 .94
Number of treatments T = 9
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .65 .74 .80 .86 .90 .93 .95 .97 .98 .99
200 .55 .64 .72 .79 .85 .89 .92 .94 .96 .97
500 .42 .52 .61 .69 .76 .82 .86 .90 .93 .95
1000 .34 .43 .52 .61 .69 .75 .81 .86 .89 .92
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .69 .75 .81 .85 .89 .92 .94 .96 .97 .98
5000 .59 .67 .73 .79 .83 .87 .90 .93 .95 .96
10000 .52 .60 .67 .73 .79 .83 .87 .90 .93 .95
20000 .46 .53 .61 .68 .74 .79 .83 .87 .90 .92
Number of treatments T = 10
Non-centrality
Genes 12 14 16 18 20 22 24 26 28 30
100 .63 .71 .78 .84 .89 .92 .94 .96 .97 .98
200 .52 .61 .70 .77 .83 .87 .91 .93 .95 .97
500 .39 .49 .58 .66 .73 .80 .84 .88 .92 .94
1000 .31 .40 .49 .58 .66 .73 .79 .84 .88 .91
Non-centrality
Genes 22 24 26 28 30 32 34 36 38 40
2000 .66 .73 .78 .83 .87 .90 .93 .95 .96 .97
5000 .56 .64 .70 .76 .81 .85 .89 .92 .94 .95
10000 .49 .57 .64 .70 .76 .81 .85 .89 .91 .93
20000 .43 .50 .58 .64 .71 .76 .81 .85 .88 .91
Glossary of Notation
where the
denote the interaction terms in an ANOVA
model.
Population concentration of genetic material in specimen attributable
to gene
True gene expression component of instrument measurement
Statistic denotes
a summary measure of the estimated
differential expression vector for gene where
is a function
specified by the investigator.
denotes a column vector of observed expression
levels across G genes in specimen where represents the intensity
of gene measured from specimen and the prime denotes a vector
or matrix transpose.
denotes a row vector of observed expression levels
across N specimens for gene where represents the intensity
of gene measured from specimen
A variable representing a transformation of the instrument measurement
for gene expression. The transformation is selected to give
desired statistical properties.
A variable representing the instrument measurement of expression
for gene under condition possibly having undergone various adjustments
(calibration, background correction, etc.) in the internal
software of the instrument.
References
Affymetrix (2001). Affymetrix Microarray Suite 5.0 User Guide, Affymetrix Inc.,
Santa Clara, CA.
Affymetrix GeneChip, 700228 rev.2.
Afifi, A.A. and Clark, V. (1990). Computer-aided Multivariate Analysis. 2nd edition,
Chapman and Hall, New York.
Alberts, B., Johnson, A., Lewis, J., Raff, M., Roberts, K., Walter, P. (2002). Molecular
Biology of the Cell, 4th edition, Garland Publishing.
Alizadeh, A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick,
J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T.,
Hudson, J., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner,
T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., Levy, R., Wilson, W.,
Grever, M.R., Byrd, J.C., Botstein, D., Brown, P.O., and Staudt, L.M. (2000).
Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling.
Nature, 403, 503-511.
Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J.
(1999). Broad patterns of gene expression revealed by clustering analysis of
tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the
National Academy of Sciences, USA, 96, 6745-6750.
Alter, O., Brown, P.O., and Botstein, D. (2000). Singular value decomposition for
genome-wide expression data processing and modeling, Proceedings of the National
Academy of Sciences, USA, 97, 10101-10106.
Alter, O., Brown, P.O., and Botstein, D. (2003). Generalized singular value
decomposition for comparative analysis of genome-scale expression data sets of two different
organisms, Proceedings of the National Academy of Sciences, USA, 100, 3351-3356.
Anderson, T.W. (1958). An Introduction to Multivariate Statistical Analysis. John
Wiley & Sons, Inc., New York.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis,
A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver,
L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin,
G.M., Sherlock, G. (2000). Gene ontology: tool for the unification of biology. The
Gene Ontology Consortium. Nature Genetics, 25, 25-29.
Ausubel, F. M. et al. (editors) (1993). Current Protocols in Molecular Biology, John
Wiley & Sons, Inc., New York.
Baggerly, K.A., Coombes, K.R., Hess, K.R., Stivers, D.N., Abruzzo, L.V., and Zhang,
W. (2001). Identifying differentially expressed genes in cDNA microarray
experiments. Journal of Computational Biology, 8, 639-659.
Baldi, P., Hatfield, G.W. (2002). DNA Microarrays and Gene Expression, Cambridge
University Press, Cambridge, U.K.
Baldi, P., Long, A.D. (2001). A Bayesian framework for the analysis of microarray
expression data: regularized t-test and statistical inferences of gene changes.
Bioinformatics, 17, 509-519.
Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: A practical
and powerful approach to multiple testing, Journal of the Royal Statistical Society,
Series B, 57, 289-300.
Beran, R. (1988). Balanced simultaneous confidence sets, Journal of the American
Statistical Association, 83, 679-686.
Bernardo, J.M., Giron, J. (1988). A Bayesian approach to cluster analysis, Questiio,
12, 97-112.
Binder, D.A. (1978). Bayesian cluster analysis, Biometrika, 65, 31-38.
Bishop, C.M. (1995). Neural Networks for Pattern Recognition. Clarendon Press,
Oxford.
Bittner, M., Chen, Y., Amundson, S.A., Khan, J., Fornace, A.J., Dougherty, E.R.,
Meltzer, P.S., and Trent, J.M. (2000). Obtaining and evaluating gene expression
profiles with cDNA microarrays. In Genomics and Proteomics: Functional and
Computational Aspects, ed. Suhai, S., pp 5-25. Kluwer Academic, Plenum
Publisher, New York.
Bittner M., Meltzer P., Chen Y., Jiang Y., Seftor E., Hendrix M., Radmacher M.,
Simon R., Yakhini Z., Ben-Dor A., Sampas N., Dougherty E., Wang E., Marincola
F., Gooden C., Lueders J., Glatfelter A., Pollock P., Carpten J., Gillanders E.,
Leja D., Dietrich K., Beaudry C., Berens M., Alberts D., Sondak V. (2000).
Molecular classification of cutaneous malignant melanoma by gene expression profiling.
Nature, 406, 536-540.
Björkbacka, H., personal communication, 2003.
Bowtell, D.D. (1999). Options available–from start to finish–for obtaining expression
data by microarray. Nature Genetics, 21, 25-32.
Bowtell, D., Sambrook, J. (Editors) (2002). DNA Microarrays: A Molecular Cloning
Manual, Cold Spring Harbor Laboratory.
Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert,
C., Aach, J., Ansorge, W., Ball, C.A., Causton, H.C., Gaasterland, T.,
Glenisson, P., Holstege, F.C.P., Kim, I.F., Markowitz, V., Matese, J.C., Parkinson, H.,
Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J.,
and Vingron, M. (2001). Minimum information about a microarray experiment
(MIAME): toward standards for microarray data. Nature Genetics, 29, 365-371.
Breiman, L. (1996a). Bagging predictors. Machine Learning, 24, 123-140.
Dudley, A.M., Aach, J., Steffen, M.A. and Church, G.M. (2002). Measuring absolute
expression with microarrays using a calibrated reference sample and an extended
signal intensity range. Proceedings of the National Academy of Sciences, USA,
99, 7554-7559.
Dudoit, S., Fridlyand, J., and Speed, T.P. (2002). Comparison of discrimination
methods for the classification of tumors using gene expression data. Journal of the
American Statistical Association, 97, 77-87.
Dudoit, S., Shaffer, J.P., Boldrick, J.C. (2003). Multiple hypothesis testing in
microarray experiments. Technical Report No. 110, Division of Biostatistics, University of
California, Berkeley.
Dudoit, S., Shaffer, J.P., Boldrick, J.C. (2003). Multiple hypothesis testing in
microarray experiments. Statistical Science, 18, 71-103.
Dudoit, S., Yang, Y.H., Callow, M.J., Speed, T.P. (2002). Statistical methods for
identifying differentially expressed genes in replicated cDNA microarray experiments.
Statistica Sinica, 12, 111-139.
Duggan, D.J., Bittner, M., Chen, Y., Meltzer, P., Trent, J.M. (1999). Expression profiling
using cDNA microarrays. Nature Genetics, 21, 10-14.
Dunn, G. and Everitt, B.S. (1982). An Introduction to Mathematical Taxonomy,
Cambridge University Press, Cambridge.
Dunteman, G.H. (1989). Principal Component Analysis, Sage University Papers, Sage,
Newbury Park, California.
Durbin, B.P., Hardin, J.S., Hawkins, D.M., and Rocke, D.M. (2002). A
variance-stabilizing transformation for gene-expression microarray data. Bioinformatics, 18,
S105-S110.
Efron, B. (2001). Robbins, empirical Bayes, and microarrays. Technical Report
No. 2001-30B/219, Department of Statistics, Stanford University, Stanford, California.
Efron, B., Morris, C. (1973). Stein’s estimation rule and its competitors - an empirical
Bayes approach. Journal of the American Statistical Association, 68, 117-130.
Efron, B., Morris, C. (1975). Data analysis using Stein’s estimator and its
generalizations. Journal of the American Statistical Association, 70, 311-319.
Efron, B., Storey, J.D., Tibshirani, R. (2001). Microarrays, empirical Bayes
methods, and false discovery rates. Technical Report No. 2001-23B/217, Department of
Statistics, Stanford University, Stanford, California.
Efron, B., Tibshirani, R., Storey, J.D., Tusher, V. (2001). Empirical Bayes analysis
of a microarray experiment. Journal of the American Statistical Association, 96,
1151-1160.
Eisen, M.B. (1999). ScanAlyze User Manual, Version 2.32; Stanford University:
Stanford, CA.
Eisen, M.B., Brown, P.O. (1999). DNA arrays for analysis of gene expression. Methods
in Enzymology, 303, 179-205.
Eisen, M.B., Spellman, P.T., Brown, P.O. and Botstein, D. (1998). Cluster analysis and
display of genome-wide expression patterns, Proceedings of the National Academy
of Sciences, USA, 95, 14863-14868.
Emptage, M.R., Hudson-Curtis, B., Sen, K. (2003). Treatment of microarray
experiments as split-plot designs. Journal of Biopharmaceutical Statistics, 13, 159-178.
Everitt, B.S. (1993). Cluster Analysis. Edward Arnold, New York.
Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems, Annals
of Eugenics, 7, 179-188.
Fisher, R.A. (1947). The Design of Experiments, Oliver and Boyd, Edinburgh, 4th
ed.
Heden, B., Ohlin, H., Rittner, R., and Edenbrandt, L. (1997). Acute myocardial
infarction detected in the 12-lead ECG by artificial neural network. Circulation, 96,
1798-1802.
Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer,
P., Gusterson, B., Esteller, M., Raffeld, M., et al. (2001). Gene-expression profiles
in hereditary breast cancer. New England Journal of Medicine, 344, 539-548.
Hedenfalk, I., Ringner, M., Ben-Dor, A., Yakhini, Z., Chen, Y., Chebil, G., Ach, R.,
Loman, N., Olsson, H., Meltzer, P., Borg, A., Trent, J. (2003). Molecular
classification of familial non-BRCA1/BRCA2 breast cancer. Proceedings of the National
Academy of Sciences, USA, 100, 2532-2537.
Held, G.A., Grinstein, G., Tu, Y. (2003). Modeling of DNA microarray data by
using physical properties of hybridization. Proceedings of the National Academy of
Sciences, USA, 100, 7575-7580.
Heller, M.J. (2002). DNA microarray technology: devices, systems, and applications.
Annu. Rev. Biomed. Eng., 4, 129-153.
Hertz, J., Krogh, A., Palmer, R.G. (1991). Introduction to the Theory of Neural
Computation, Addison-Wesley, Redwood City, CA.
Herzel, H., Beule, D., Kielbasa, S., Korbel, J., Sers, C., Malik, A., Eickhoff, H.,
Lehrach, H., Schuchhardt, J. (2001). Extracting information from cDNA arrays.
CHAOS, 11.
Hessner, M.J., Wang, X., Hulse, K., Meyer, L., Wu, Y., Nye, S., Guo, S.-W., Ghosh,
S. (2003). Three color cDNA microarrays: quantitative assessment through the use
of fluorescein-labeled probes. Nucleic Acids Research, 31, No. 4, e14.
Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance,
Biometrika, 75, 800-802.
Hochberg, Y. and Tamhane A.C. (1987). Multiple Comparison Procedures. John Wiley
& Sons, New York.
Holm, S. (1979). A simple sequentially rejective multiple test procedure, Scandinavian
Journal of Statistics, 6, 65-70.
Holter, N.S., Maritan, A., Cieplak, M., Fedoroff, N.V., Banavar, J.R. (2001). Dynamic
modeling of gene expression data. Proceedings of the National Academy of Sciences,
USA, 98, 1693-1698.
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal
components. Journal of Educational Psychology, 24, 417-441.
Huber, W., von Heydebreck, A., Sültmann, H., Poustka, A., and Vingron, M. (2002).
Variance stabilization applied to microarray data calibration and to the
quantification of differential expression. Bioinformatics, 18, Suppl. 1, S96-S104.
Hudson, D.L. and Cohen, M.E. (2000). Neural Networks and Artificial Intelligence
for Biomedical Engineering, The Institute of Electrical and Electronics Engineers,
Inc., New York.
Hughes, T.R., Roberts, C.J., Dai, H., Jones, A.R., Meyer, M.R., Slade, D.,
Burchard, J., Dow, S., Ward, T.R., Kidd, M.J., Friend, S.H., Marton, M.J. (2000).
Widespread aneuploidy revealed by DNA microarray expression profiling. Nature
Genetics, 25, 333-337.
Ibrahim, J.G., Chen M.-H., Gray, R.J. (2002). Bayesian models for gene expression
with DNA microarray data. Journal of the American Statistical Association, 97,
88-99.
Irizarry, R.A., Bolstad, B.M., Collin, F., Cope, L.M., Hobbs, B., Speed, T.P. (2003).
Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research, 31,
No.4, 1-8.
Irizarry, K., Kustanovich, V., Li, C., Brown, N., Nelson, S., Wong, W., Lee, C.J.
(2000). Genome-wide analysis of single-nucleotide polymorphisms in human
expressed sequences. Nature Genetics, 26, 233-236.
Jackson, J.E. (1991). A User’s Guide to Principal Components, John Wiley & Sons,
New York.
Jain, A.K. and Dubes, R.C. (1988). Algorithms for Clustering Data, Prentice Hall,
Englewood Cliffs, New Jersey.
Jenssen, T.-K., Langaas, M., Kuo, W.P., Smith-Sorensen, B., Myklebost, O., Hovig,
E. (2002). Analysis of repeatability in spotted cDNA microarrays. Nucleic Acids
Research, 30, 3235-3244.
Jin, W., Riley, R.M., Wolfinger, R.D., White, K.P., Passador-Gurgel, G., and Gibson,
G. (2001). The contributions of sex, genotype and age to transcriptional variance
in Drosophila melanogaster, Nature Genetics, 29, 389-395.
Jobson, J.D. (1992). Applied Multivariate Data Analysis, Springer-Verlag, New York.
Johnson, R.A., Wichern, D.W. (1998). Applied Multivariate Statistical Analysis, 4th
ed., Prentice-Hall, Inc., New Jersey.
Johnson, S.C. (1967). Hierarchical clustering schemes. Psychometrika, 32, 241-254.
Jolliffe, I.T. (2002). Principal Component Analysis, 2nd ed., Springer-Verlag, New
York.
Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data, John Wiley &
Sons, New York.
Kendziorski, C.M., Newton, M.A., Lan, H., and Gould, M.N. (2003). On parametric
empirical Bayes methods for comparing multiple groups using replicated gene
expression profiles. Technical Report #166, Department of Biostatistics and Medical
Informatics, University of Wisconsin - Madison. Statistics in Medicine (in press).
Kerr, M.K. and Churchill, G.A. (2001). Statistical design and the analysis of gene
expression microarrays. Genetical Research, 77, 123-128.
Kerr, M.K., and Churchill, G.A. (2001a). Bootstrapping cluster analysis; assessing the
reliability of conclusions from microarray experiments. Proceedings of the National
Academy of Sciences, USA, 98, 8961-8965.
Kerr, M.K., and Churchill, G.A. (2001b). Experimental design for gene expression
microarrays. Biostatistics, 2, 183-201.
Kerr, M.K., Martin, M., and Churchill, G.A. (2001c). Analysis of variance for gene
expression microarray data. Journal of Computational Biology, 7, 819-837.
Kerr, M.K., Afshari, C.A., Bennett, L., Bushel, P., Martinez, J., Walker, N.J., and
Churchill, G.A. (2002). Statistical analysis of a gene expression microarray
experiment with replication. Statistica Sinica, 12, 203-218.
Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold,
F., Schwab, M., Antonescu, C.R., Peterson, C., and Meltzer, P.S. (2001).
Classification and diagnostic prediction of cancers using gene expression profiling and
artificial neural networks. Nature Medicine, 7, 673-679.
Kohonen, T. (1997). Self-Organizing Maps, Springer, Berlin.
Kong, C.F., Bowtell, D. (2002). Genome-wide gene expression analysis using cDNA
microarrays. Methods in Molecular Medicine, 68, 195-204.
Kruglyak, L., Nickerson, D.A. (2001). Variation is the spice of life. Nature Genetics,
27, 234-236.
Krzanowski, W.J. (1988). Principles of Multivariate Analysis: A User’s Perspective,
Oxford University Press, Oxford.
Lance, G.N. and Williams, W.T. (1967). A general theory of classificatory sorting
strategies: 1. Hierarchical systems. Comp. J., 9, 373-380.
Lander, E.S. (1996). The new genomics: global views of biology. Science, 274, 536-539.
Lander, E.S. (1999). Array of hope, Nature Genetics, 21, 3-4.
Lee, M.-L.T., Kuo, F.C., Whitmore, G.A., Sklar, J. (2000). The importance of
replication in microarray gene expression studies. Statistical methods and evidence from
repetitive cDNA hybridizations. Proceedings of the National Academy of Sciences,
USA, 97, 9834-9839.
Lee, M.-L.T., Bulyk, M.L., Whitmore, G.A., Church, G.M. (2002a). A statistical
model for investigating binding probabilities of DNA nucleotide sequences using
microarrays. Biometrics, 58, 129-136.
Lee, M.-L.T., Lu, W., Whitmore, G.A., Beier, D. (2002b). Models for microarray gene
expression data. Journal of Biopharmaceutical Statistics, 12(1), 1-19.
Lee, M.-L.T., Whitmore, G.A. (2002c). Power and sample size for microarray studies.
Statistics in Medicine, 21, 3543-3570.
Lee, M.-L.T., Whitmore, G.A., Yukhananov, R.Y. (2003). Analysis of unbalanced
microarray data. Journal of Data Science, 1, 103-121.
Lehninger, A.L., Nelson, D.L., and Cox, M.M. (2000). Principles of Biochemistry,
Worth Publishing, 3rd ed.
Lewin, B. (2000). Genes VII, Oxford University Press, New York.
Li, C. and Wong, W.H. (2001). Model-based analysis of oligonucleotide arrays: model
validation, design issues and standard error application, Genome Biology 2(8):
research0032.1-0032.11. www.genomebiology.com.
Li, C., Wong, W.H. (2001). Model-based analysis of oligonucleotide arrays: expression
index computation and outlier detection. Proceedings of the National Academy of
Sciences, USA, 98, 31-36.
Lipshutz, R.J., Fodor, S.P., Gingeras, T.R., Lockhart, D.J. (1999). High density
synthetic oligonucleotide arrays. Nature Genetics, 21, 20-24.
Little, R.J.A. and Rubin, D.B. (1987). Statistical Analysis with Missing Data, John
Wiley & Sons, New York.
Liu, L., Hawkins, D.M., Ghosh, S., Young, S.S. (2003). Robust singular value
decomposition analysis of microarray data. Proceedings of the National Academy of
Sciences, USA, 100, 13167-13172.
Lockhart, D.J., Dong, H., Byrne, M.C., Follettie, M.T., Gallo, M.V., Chee, M.S.,
Mittmann, M., Wang, C., Kobayashi, M., Horton, H., Brown, E.L. (1996).
Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature
Biotechnology, 14, 1675-1680.
Ludbrook, J. and Dudley, H. (1998). Why permutation tests are superior to t and F
tests in biomedical research. The American Statistician, 52, 127-132.
MacQueen, J.B. (1967). Some methods for classification and analysis of multivariate
observations. In: Proceedings of 5th Berkeley Symposium on Mathematical
Statistics and Probability, 1, 281-297, University of California Press, Berkeley, California.
Mardia, K.V., Kent, J.T., Bibby, J.M. (1979). Multivariate Analysis, Academic Press,
Inc., San Diego.
Martin, K.J., Graner, E., Li, Y., Price, L.M., Kritzman, B.M., Fournier, M.V., Rhei,
E., and Pardee, A.B. (2001). High-sensitivity array analysis of gene expression
for the early detection of disseminated breast tumor cells in peripheral blood.
Proceedings of the National Academy of Sciences, USA, 98, 2646-2651.
McShane, L.M., Radmacher, M.D., Freidlin, B., Yu, R., Li, M.-C., and Simon, R.
(2002). Methods for assessing reproducibility of clustering patterns observed in
analyses of microarray data. Bioinformatics, 18, 1462-1469.
McCulloch, W.S. and Pitts, W. (1943). A logical calculus of the ideas immanent in
neural nets. Bull. Math. Biophys, 5, 115-137.
McLachlan, G.J. (1992). Discriminant Analysis and Statistical Pattern Recognition.
John Wiley & Sons, New York.
Miesfeld, R. L. (1999). Applied Molecular Genetics. Wiley-Liss, New York.
Milligan, G.W. (1980). An examination of the effect of six types of error perturbation
on fifteen clustering algorithms. Psychometrika, 45, 325-342.
Milligan, G.W. and Cooper, M.C. (1985). An examination of procedures for
determining the number of clusters in a data set. Psychometrika, 50, 159-179.
Milligan, G.W. and Cooper, M.C. (1988). A study of standardization of variables in
cluster analysis. Journal of Classification, 5, 181-204.
Milliken, G.A., Johnson, D.E. (1992). Analysis of Messy Data, Chapman and Hall/CRC,
Boca Raton.
Moler, E.J., Chow, M. L., and Mian, I.S. (2000). Physiol. Genomics, 4, 109-126.
Morgan, B.J.T. and Ray, A.P.G. (1995). Non-uniqueness and inversions in cluster
analysis. Applied Statistics, 44, 117-134.
Morrison, D.F. (1976). Multivariate Statistical Methods, 2nd ed., New York,
McGraw-Hill.
Morton, C.C. (2000). personal communication.
Nadeau, J.H., Frankel, W.N. (2000). The roads from phenotypic variation to gene
discovery: mutagenesis versus QTLs. Nature Genetics, 25, 381-384.
Naef, F., Hacker, C.R., Patil, N. and Magnasco, M. (2002a). Empirical
characterization of the expression ratio noise structure in high density oligonucleotide arrays.
Genome Biology, 3, research0018.1-0018.11.
Naef, F., Lim, D.A., Patil, N. and Magnasco, M. (2002b). DNA hybridization to
mismatched templates: a chip study. Phys. Rev. E., 65, 040902.
Naef, F., Socci, N.D., Magnasco, M. (2003). A study of accuracy and precision in
oligonucleotide arrays: extracting more signal at large concentrations.
Bioinformatics, 19, 178-184.
Neter, J., Kutner, M.H., Nachtsheim, C.J., Wasserman, W. (1996). Applied Linear
Statistical Models, 4th edition, Richard D. Irwin.
Newton, M.A., Kendziorski, C.M., Richmond, C.S., Blattner, F.R., Tsui, K.W. (2001).
On differential variability of expression ratios, improving statistical inference about
gene expression changes from microarray data. Journal of Computational Biology,
8, 37-52.
Newton, M.A., Kendziorski, C.M. (2003). Parametric empirical Bayes methods for
microarrays. In: The Analysis of Gene Expression Data: Methods and Software,
Parmigiani, G., Garrett, E.S., Irizarry, R.A., and Zeger, S.L., eds. 255-271. Springer,
New York.
Newton, M.A., Noueiry, A., Sarkar, D., Ahlquist, P. (2003). Detecting differential gene
expression with a semiparametric hierarchical mixture method. Technical Report
No. 1074, Department of Statistics, University of Wisconsin, Madison.
Nguyen, D.V., Arpat, A.B., Wang, N., Carroll, R.J. (2002). DNA microarray
experiments: biological and technical aspects. Biometrics, 58, 701-717.
Nguyen, D.V., Rocke, D.M. (2002). Partial least squares proportional hazard
regression for application to DNA microarray survival data. Bioinformatics, 18,
1625-1632.
Shavlik, J.W., Dietterich, T.G. (1990). Readings in Machine Learning, Morgan
Kaufmann, San Mateo, CA.
Shipp, M.A., Ross, K.N., Tamayo, P., Weng, A.P., Kutok, J.L., Aguiar, R.C.T.,
Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G.S., Ray, T.S., Koval, M.A.,
Last, K.W., Norton, A., Lister, T.A., Mesirov, J., Neuberg, D.S., Lander, E.S.,
Aster, J.C., Golub, T.R. (2002). Diffuse large B-cell lymphoma outcome prediction
by gene-expression profiling and supervised machine learning. Nature Medicine, 8,
68-74.
Silipo, R., Gori, M., Taddei, A., Varanini, M., and Marchesi, C. (1995).
Classification of arrhythmic events in ambulatory electrocardiogram, using artificial neural
networks. Computers and Biomedical Research, 28, 305-318.
Simes, R.J. (1986). An improved Bonferroni procedure for multiple tests of
significance. Biometrika, 73, 751-754.
Simon, R., Radmacher, M.D., Dobbin, K. (2002). Design of studies using DNA
microarrays. Genetic Epidemiology, 23, 21-36.
Soille, P. (1999). Morphological Image Analysis: Principles and Applications. Springer.
Somogyi, R. (1999). Pharma Informatics, p.17.
Southern, E., Mir, K., Shchepinov, M. (1999). Molecular interactions on microarrays.
Nature Genetics, 21, 5-9.
Speed, T., Editor (2003). Statistical Analysis of Gene Expression Microarray Data,
Chapman and Hall/CRC Press, New York.
Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B.,
Brown, P.O., Botstein, D. and Futcher, B. (1998). Comprehensive identification
of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray
hybridization, Molecular Biology of the Cell, 9, 3273-3297.
Stern, H.S. (1996). Neural networks in applied statistics (with discussion).
Technometrics, 38, 205-220.
Storey, J.D. (2003). The positive false discovery rate: A Bayesian interpretation and
the q-value. Annals of Statistics, (in press).
Storey, J.D., Taylor, J.E., and Siegmund, D. (2003). Strong control, conservative point
estimation, and simultaneous conservative consistency of false discovery rates: A
unified approach. Journal of the Royal Statistical Society, Series B, (in press).
Storey, J.D. and Tibshirani, R. (2001). Estimating false discovery rates under
dependence, with applications to DNA microarrays. Technical Report, Stanford
University, 2001.
Storey, J.D. and Tibshirani, R. (2003). Statistical significance for genome-wide studies.
Proceedings of the National Academy of Sciences, USA, 100, 9440-9445.
Sudarsanam, P., Vishwanath, R.T., Brown, P.O., and Winston, F. (2000).
Whole-genome expression analysis of snf/swi mutants in Saccharomyces cerevisiae.
Proceedings of the National Academy of Sciences, USA, 97, 3364-3369.
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dimitrovsky, E.,
Lander, E.S., Golub, T.R. (1999). Interpreting patterns of gene expression with
self-organizing maps: methods and application to hematopoietic differentiation.
Proceedings of the National Academy of Sciences, USA, 96, 2907-2912.
Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G. (2002). Diagnosis of multiple
cancer types by shrunken centroids of gene expression. Proceedings of the National
Academy of Sciences, USA, 99, 6567-6572.
Tibshirani, R., Hastie, T., Narasimhan, B., Eisen, M., Sherlock, G., Brown, P.,
Botstein, D. (2002). Exploratory screening of genes and clusters from microarray
experiments. Statistica Sinica, 12, 47-59.
Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., Church, G.M. (1999).
Systematic determination of genetic network architecture. Nature Genetics, 22, 281-285.
Troendle, J.F. (1996). A permutational step-up method of testing multiple outcomes,
Biometrics, 52, 846-859.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R.,
Botstein, D., Altman, R.B. (2001). Missing value estimation methods for DNA
microarrays. Bioinformatics, 17 (6), 520-525.
Tseng, G.C., Oh, M.-K., Rohlin, L., Liao, J.C., and Wong, W.H. (2001). Issues in
cDNA microarray analysis: quality filtering, channel normalization, models of
variation and assessment of gene effects. Nucleic Acids Research, 29, 2549-2557.
Tusher, V.G., Tibshirani, R., Chu, G. (2001). Significance analysis of microarrays
applied to the ionizing radiation response. Proceedings of the National Academy of
Sciences, USA, 98, 5116-5121.
Vapnik, V. (1998). Statistical Learning Theory, John Wiley & Sons, New York.
Wang, X., Ghosh, S., and Guo, S. (2001). Quantitative quality control in microarray
image processing and data acquisition. Nucleic Acids Research, 29, No. 15, e75.
Wang, X., Hessner, M.J., Pati, N., and Ghosh, S. (2003). Quantitative quality control
in microarray experiments and the application in data filtering, normalization and
false positive rate prediction. Bioinformatics, 19, 1341-1347.
Ward, J.H. (1963). Hierarchical grouping to optimize an objective function. Journal
of the American Statistical Association, 58, 236-244.
Watson, J.D., and Crick, F.H.C. (1953). A structure for DNA. Nature, 171, 737-738.
Weiss, K.M., Terwilliger, J.D. (2000). How many diseases does it take to map a gene
with SNPs? Nature Genetics, 26, 151-157.
Welsh, J.B., Zarrinkar, P.P., Sapinoso, L.M., Kern, S.G., Behling, C.A., Monk, B.J.,
Lockhart, D.J., Burger, R.A., Hampton, G.M. (2001). Analysis of gene expression
profiles in normal and neoplastic ovarian tissue samples identifies candidate
molecular markers of epithelial ovarian cancer. Proceedings of the National Academy of
Sciences, USA, 98, 1176-1181.
Wen, X., Fuhrman, S., Michaels, G.S., Carr, D.B., Smith, S., Baker, J.L., and Somogyi,
R. (1998). Large-scale temporal gene expression mapping of central nervous system
development. Proceedings of the National Academy of Sciences, USA, 95, 334-339.
Wernisch, L., Kendall, S.L., Soneji, S., Wietzorrek, A., Parish, T., Hinds, J., Butcher,
P.D., and Stoker, N.G. (2003). Analysis of whole-genome microarray replicates
using mixed models. Bioinformatics, 19, 53-61.
Westfall, P.H. and Young, S.S. (1993). Re-sampling Based Multiple Testing: Examples
and Methods for P-value Adjustment, John Wiley & Sons, New York.
Winer, B.J. (1971). Statistical Principles in Experimental Design, 2nd ed., New York,
McGraw-Hill.
Wishart, D. (1969). An algorithm for hierarchical classifications. Biometrics, 25,
165-170.
Wit, E., McClure, J. (2003). Statistical adjustment of signal censoring in gene
expression experiments. Bioinformatics, 19, 1055-1060.
Wodicka, L., Dong, H., Mittmann, M., Ho, M., and Lockhart, D. (1997).
Genome-wide expression monitoring in Saccharomyces cerevisiae. Nature Biotechnology, 15,
1359-1367.
Wolfinger, R.D., Gibson, G., Wolfinger, E.D., Bennett, L., Hamadeh, H., Bushel, P.,
Afshari, C., Paules, R. (2001). Assessing gene significance from cDNA microarray
expression data via mixed models. Journal of Computational Biology, 8, 625-637.
Wright, S.P. (1992). Adjusted p-values for simultaneous inference, Biometrics, 48,
1005-1013.
Xiong, M.M., Jin, L., Li, W., and Boerwinkle, E. (2000). BioTechniques, 29, 1264-1270.
Yang, Y.H., Buckley, M.J., Dudoit, S., Speed, T.P. (2000). Comparison of
methods for image analysis of cDNA microarray data. Journal of Computational and
Graphical Statistics, 11, 108-136.
Yang, Y.H., Dudoit, S., Luu, P., Lin, D.M., Peng, V., Ngai, J., and Speed, T.P. (2002).
Normalization for cDNA microarray data: a robust composite method addressing
single and multiple slide systematic variation. Nucleic Acids Research, 30, No.4,
e15.
Yang, Y.H., Dudoit, S., Luu, P., Speed, T.P. (2000). Normalization for cDNA
microarray data. In M. L. Bittner, Y. Chen, A. N. Dorsel, and E. R. Dougherty
(eds), Microarrays: Optical Technologies and Informatics, Vol. 4266 of
Proceedings of SPIE. Society for Optical Engineering, San Jose, CA.
Yates, F. (1935). Complex experiments. Journal of the Royal Statistical Society, Suppl.
2, 181-247.
Yekutieli, D., Benjamini, Y. (1999). Resampling-based false discovery rate
controlling multiple test procedures for correlated test statistics. Journal of Statistical
Planning and Inference, 82, 171-196.
Zhang, H.P., Holford, T. and Bracken, M. (1995). A tree-based method of analysis
for prospective studies. Statistics in Medicine, 15, 37-50.
Zhang, H.P., Singer, B. (1999). Recursive Partitioning in the Health Sciences, Springer,
New York.
Zhou, X., Kao, M.J., Wong, W.H. (2002). Transitive functional annotation by shortest
path analysis of gene expression data. Proceedings of the National Academy of
Sciences, USA, 99, 12783-12788.
Zhu, H., Klemic, J.F., Chang, S., Bertone, P., Casamayor, A., Klemic, K.G., Smith,
D., Gerstein, M., Reed, M.A., Snyder, M. (2000). Analysis of yeast protein kinases
using protein chips. Nature Genetics, 26, 283-289.