WO2000079007A9 - Improved methods of sequence assembly for sequencing by hybridization - Google Patents

Improved methods of sequence assembly for sequencing by hybridization

Info

Publication number
WO2000079007A9
WO2000079007A9 PCT/US2000/016899 US0016899W WO0079007A9 WO 2000079007 A9 WO2000079007 A9 WO 2000079007A9 US 0016899 W US0016899 W US 0016899W WO 0079007 A9 WO0079007 A9 WO 0079007A9
Authority
WO
WIPO (PCT)
Prior art keywords
probes
probe
nucleic acid
sequence
hybridization
Prior art date
Application number
PCT/US2000/016899
Other languages
English (en)
Other versions
WO2000079007A1 (fr
Inventor
Radoje T Drmanac
Original Assignee
Hyseq Inc
Radoje T Drmanac
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hyseq Inc, Radoje T Drmanac
Priority to AU54971/00A (AU5497100A)
Publication of WO2000079007A1
Publication of WO2000079007A9

Classifications

    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869Methods for sequencing
    • C12Q1/6874Methods for sequencing involving nucleic acid arrays, e.g. sequencing by hybridisation

Definitions

  • the invention relates generally to novel methods, materials and devices for nucleic acid sequence analysis by hybridization, in which the sequence information obtainable not only from perfectly matched oligonucleotide probes but also from oligonucleotide probes that are not perfectly matched to the target nucleic acid is taken into account.
  • the rate of determining the sequence of the four nucleotides in nucleic acid samples is a major technical obstacle for further advancement of molecular biology, medicine, and biotechnology.
  • Nucleic acid sequencing methods which involve separation of nucleic acid molecules in a gel have been in use since 1978.
  • the other proven method for sequencing nucleic acids is sequencing by hybridization (SBH).
  • SBH sequencing by hybridization
  • the traditional method of determining a sequence of nucleotides i.e., the order of the A, G, C and T nucleotides in a sample
  • Resulting nucleic acid fragments in the range of 1 to 500 bp are then separated on a gel to produce a ladder of bands wherein the adjacent samples differ in length by one nucleotide.
  • the array based approach of SBH does not require single base resolution in separation, degradation, synthesis or imaging of a nucleic acid molecule.
  • K bases in length lists of constituent K-mer oligonucleotides may be determined for target nucleic acid. Sequence for the target nucleic acid may be assembled by uniquely overlapping scored oligonucleotides.
  • probes are arrayed at locations on a substrate which correspond to their respective sequences, and a labelled nucleic acid sample fragment is hybridized to the arrayed probes.
  • sequence information about a fragment may be determined in a simultaneous hybridization reaction with all of the arrayed probes.
  • the same oligonucleotide array may be reused.
  • the arrays may be produced by spotting or by in situ synthesis of probes.
  • a set may be in the form of arrays of probes with known positions, and another, labelled set may be stored in multiwell plates.
  • target nucleic acid need not be labelled.
  • Target nucleic acid and one or more labelled probes are added to the arrayed sets of probes. If one attached probe and one labelled probe both hybridize contiguously on the target nucleic acid, they are covalently ligated, producing a detected sequence equal to the sum of the length of the ligated probes.
  • the process allows for sequencing long nucleic acid fragments, e.g. a complete bacterial genome, without nucleic acid subcloning in smaller pieces.
  • To sequence long nucleic acids unambiguously, SBH involves the use of long probes. As the length of the probes increases, so does the number of probes required to generate sequence information. Each 2-fold increase in length of the target requires a one-base increase in the length of the probe, resulting in a four-fold increase in the number of probes required (the complete set of all possible sequences of probes of length k contains 4^k probes).
  • sequencing 100 bases of DNA requires 16,384 7-mers; sequencing 200 bases requires 65,536 8-mers; 400 bases, 262,144 9-mers; 800 bases, 1,048,576 10-mers; 1600 bases, 4,194,304 11-mers; 3200 bases, 16,777,216 12-mers; 6400 bases, 67,108,864 13-mers; and 12,800 bases requires 268,435,456 14-mers.
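  • As an illustrative sketch of the arithmetic above, the probe counts follow from 4^k, with each doubling of the target length adding one base to the probe. The function names and default arguments are hypothetical and are not part of the patent text.

```python
def probe_set_size(k: int) -> int:
    """Number of distinct probes of length k over the four bases A, C, G, T."""
    return 4 ** k


def probe_requirements(start_length: int = 100, start_k: int = 7, steps: int = 8):
    """Return (target_length, probe_length, probe_count) rows.

    Each row doubles the target length and adds one base to the probe,
    following the rule stated in the text.
    """
    rows = []
    length, k = start_length, start_k
    for _ in range(steps):
        rows.append((length, k, probe_set_size(k)))
        length *= 2
        k += 1
    return rows


for length, k, count in probe_requirements():
    print(f"{length:>6} bases: {count:>11,} {k}-mers")
```

  • The first row is 100 bases with 16,384 7-mers and the last is 12,800 bases with 268,435,456 14-mers, matching the figures listed above.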
  • the present invention provides novel methods and materials, including apparatus, for performing sequence analysis by hybridization (referred to herein as "SBH").
  • SBH sequence analysis by hybridization
  • Conventional methods of SBH utilize hybridization conditions selected to discriminate probe:target hybrids that are perfectly complementary in the information region (informative region) of the probe from probe:target hybrids that have even a single base pair mismatch.
  • Conventional methods also assemble sequence information using a scoring system for the probes that gives a "positive" score to fully matched (perfectly complementary) probes and a "negative" score to all other probes (i.e., probes with a single, double, or more base pair mismatch compared to target).
  • the efficiency, sensitivity and accuracy of these methods is improved by taking into account the sequence information that is obtainable not only from probes that are perfectly matched to the target nucleic acid sequence, but also from probes that have single, double or more mismatches compared to the target.
  • the present invention provides methods for analyzing the sequence of a target nucleic acid, comprising the steps of: (a) contacting a target nucleic acid with a plurality of oligonucleotide probes of predetermined length and predetermined sequence, wherein each probe comprises an information region, under conditions which produce, on average, relatively more probe:target hybrids per probe for probes that are perfectly complementary in the information region of the probe than for probes that are substantially perfectly complementary in the information region of the probe, and relatively fewer probe:target hybrids that are significantly mismatched in the information region of the probe;
  • step (d) determining the sequence of the target nucleic acid, comprising the step of summing the numerical voting scores of the probes in relation to their sequences.
  • the numerical voting score of the probe or pool of probes may be modified by a voting factor selected based on the relationship of the probe to the hypothetical sequence.
  • the number of probes used in the hybridization step may be at least about 10, at least about 100, at least about 1000, at least about 10^4, at least about 10^5, at least about 10^6, or at least about 10^7 different probes (meaning the number of distinct probe sequences), and may potentially range up to about 10^10 different probes or even more.
  • the plurality of probes may be hybridized individually with the target or in groups or pools of probes. Probes may optionally be associated with or labeled with identification tags. Each of the probes may be labeled with a unique identification tag; alternatively, a pool or a part of a pool of probes may be labeled with the same identification tag.
  • Another aspect of the invention provides an improvement over conventional SBH methods wherein a subpool of probes is collectively given one score, and wherein the subpool is differentiated from other subpools within the pool via labeling with distinct identification tags.
  • Pools and subpools can be formed either with respect to probes in solution or with respect to probes in fixed arrays. For example, a pool of 1024 pentamer probes is divided into 4 pools of 256 pentamer probes each.
  • the probes are labeled with one of four different identification tags, so that one subpool of 64 probes will bear one tag, another subpool of 64 probes will bear a second tag, a further subpool of 64 probes will bear a third tag, and a final subpool of 64 probes will bear a fourth tag.
  • This can easily be accomplished by dividing the starting number of 1024 probes into four pools of 256 probes, labeling each pool with one of four tags, dividing the pools into four aliquots of 64 probes each, and combining an aliquot from each of the four pools to create a mixture of 256 probes labeled with 4 different tags.
  • By creating subpools within a physical pool of probes through differential labeling of probes with identification tags, the advantages of using smaller subpools of probes can be obtained without requiring physical separation of pools into subpools.
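  • The label, split and recombine procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the tag names are hypothetical, and assigning probes to groups by lexicographic order is an arbitrary choice made here, since the text does not say how probes are allocated.

```python
from itertools import product

# Hypothetical tag names; the text mentions fluorescent labels of different colors.
TAGS = ["tag1", "tag2", "tag3", "tag4"]


def all_kmers(k):
    """All 4**k probe sequences of length k, in lexicographic order."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def build_tagged_pools(probes, tags=TAGS):
    """Label, split, and recombine probes into mixed pools.

    1. Divide the probes into len(tags) groups and label each group with one tag.
    2. Divide each labeled group into len(tags) aliquots.
    3. Combine one aliquot from every group into each physical pool.
    Returns a list of pools, each a dict mapping probe -> tag.
    """
    n = len(tags)
    group_size = len(probes) // n
    groups = [probes[i * group_size:(i + 1) * group_size] for i in range(n)]
    aliquot = group_size // n
    pools = []
    for j in range(n):
        pool = {}
        for tag, group in zip(tags, groups):
            for probe in group[j * aliquot:(j + 1) * aliquot]:
                pool[probe] = tag
        pools.append(pool)
    return pools


pools = build_tagged_pools(all_kmers(5))
print(len(pools), [len(p) for p in pools])
```

  • With 1024 pentamers and four tags, each of the four physical pools holds 256 probes, made up of four subpools of 64 probes that are distinguished only by their tag.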
  • Any suitable identification tag known in the art can be used; presently preferred tags are fluorescent labels of different wavelengths or "colors." This aspect of the invention is illustrated in Figure 1.
  • These subpools created through differential labeling can be used in the methods of the present invention, can be used as "informative pools", and can also be used in any SBH methods (including format 1, format 2 and format 3 methods) known in the art and for any SBH applications known in the art, including de novo sequencing, resequencing, polymorphism detection, etc., but are preferably utilized in resequencing applications.
  • this aspect of the invention provides methods of sequencing by hybridization wherein the method comprises an additional step of detecting label (identification tag).
  • the label detection step may occur before, after, or concurrently with the steps of detecting or measuring the hybridization signal.
  • This aspect of the invention also provides methods of sequencing by hybridization using subsets of probes that have been labeled with different tags and pooled in one or more pools by mixing all or a number of probes from each subset.
  • This aspect of the invention further provides a pool of probes comprising a mixture of at least 100, or at least 200, or at least 300, or at least 400 distinct probes, each probe being associated with an identification tag.
  • the probes in the pool may all be labeled with the same tag, or may be divided into two, three, four or more subpools in which all probes in a subpool are labeled with the same tag, but each subpool is associated with a different tag.
  • the entire subpool is given one collective score, and when sequence information is assembled it is assumed that each probe within the subpool is assigned the same score, which may be either a positive/negative score or a numerical voting score.
  • Another aspect of the invention provides an apparatus comprising means for carrying out the hybridization step and means for carrying out the detecting and/or measuring step(s), as described above.
  • a further aspect of the invention provides an apparatus comprising means for carrying out the sequence analysis step, e.g. a computer programmed as set forth in Appendix A herein.
  • Examples of applications that require very large numbers of probes are: (1) sequencing or resequencing of the entire human genome and other complex genomes, (2) sequencing or resequencing of total mRNA or cDNA in a human or other complex cell, (3) genotyping thousands or millions of single nucleotide polymorphisms in individual human genomes, (4) de novo sequencing of thousands of bases.
  • Figure 1 illustrates a pooling schema for the use of subpools in Format 3 sequencing by hybridization.
  • arrays of 1024 pentamers, or arrays of 1024 pools of 4 hexamers (4096 hexamers) or arrays of 4096 pools of 4 heptamers may be used.
  • the read length depends on the number of arrayed probes or arrayed probe pools and can range from 1 kb to several kb or more.
  • Figure 2 depicts the result of sequence analysis of human apo-B gene in which probes were assigned numerical voting scores based on strength of hybridization, the voting scores were further modified by a voting factor based on relationship of the probe sequence to the hypothetical target sequence, and sequence information was assembled using the informational content of both fully matched and mismatched probes.
  • the three major steps of conventional SBH are biochemical hybridization of probes to target polynucleotide, detection of positive results, and informational sequence assembly from the results.
  • In the biochemical hybridization step, a set of oligonucleotide probes of known sequence is allowed to hybridize with a target polynucleotide of unknown sequence.
  • In the detection step, a subset of these probes is scored as positive (predominantly those that hybridize to and fully match the target polynucleotide sequence).
  • the target sequence is assembled using different algorithms, usually executed by computer programs, that uniquely overlap all sequences of the real (or true) positive probe subset.
  • Drmanac et al. U.S. Patent No. 5,525,464 (hereby incorporated by reference herein) - Issued June 11, 1996; Drmanac, PCT Patent Appln. No. WO 95/09248 (hereby incorporated by reference); Drmanac et al., Genomics, 4, 114-128 (1989); Drmanac et al., Proceedings of the First Intl. Conf. Electrophoresis Supercomputing Human Genome Cantor et al. eds, World
  • Drmanac et al. J. Biomol. Struct. Dyn., 5, 1085 (1991); Hoheisel et al., Mol. Gen., 4, 125-132 (1991); Strezoska et al., Proc. Natl. Acad. Sci. (USA), 88, 10089 (1991); Drmanac et al., Nucl. Acids Res., 19, 5839 (1991); and Drmanac et al., Int. J. Genome Res., 1, 59-79 (1992), all of which are incorporated by reference herein.
  • Target samples which are hybridized to labeled probes (Format 1), or arrays of probes which are hybridized to labeled target samples (Format 2), for efficient parallel scoring of multiple hybridization events.
  • target samples or probes are attached to solid supports in the form of beads that serve to separate parallel hybridization reactions in the reading or detection step. Beads or other markers can be used as tags to identify probes.
  • Mass spectrometry technology can also be used to distinguish probe species on the basis of their mass even when the probes are not tagged.
  • In a Format 3 type SBH method, two sets of shorter probes are combinatorially connected to simulate a much larger set of longer probes.
  • the first set of probes is fixed in the array, and the second set of probes is labeled.
  • the labeled probes are added together with target nucleic acid, and a labeled probe is ligated to a fixed probe only if the two probes hybridize contiguously on the target nucleic acid.
  • probes can be scored in the form of informative pools with minimal loss of information, as described in U.S. Application Serial No. 60/115,284 entitled "Enhanced Sequencing by Hybridization Using Informative Pools of Probes" filed January 6, 1999, incorporated herein by reference. Other types of pools may be used.
  • the present invention utilizes hybridization information obtainable not only from probes that are perfectly matched to the target nucleic acid sequence, but also from probes that have single, double or more mismatches compared to the target.
  • the setting of a single threshold may have the undesirable effects of creating false positive probes (i.e., probes that form strongly hybridizing single mismatch probe:target hybrids) and false negative probes (i.e., probes that form weakly hybridizing full match probe:target hybrids).
  • a system that scores only full match probe:target hybrids ignores the useful sequence information that can be provided by mismatched probe:target hybrids, particularly single mismatches. For example, if 10-mer probes are being used, for all of the single mismatch probes, 9 of the 10 bases will be a correct identification of the true base at that position. The numerical scoring of all probes according to the relative level or strength of hybridization allows the informational content of all probes to be taken into account.
  • the present invention provides methods for analyzing the sequence of a target nucleic acid, comprising the steps of: (a) contacting a target nucleic acid with a plurality of oligonucleotide probes of predetermined length and predetermined sequence, wherein each probe comprises an information region, under conditions which produce, on average, relatively more probe:target hybrids per probe for probes that are perfectly complementary in the information region of the probe than for probes that are substantially perfectly complementary in the information region of the probe, and relatively fewer probe:target hybrids that are significantly mismatched in the information region of the probe;
  • step (d) determining the sequence of the target nucleic acid, comprising the step of analyzing the numerical voting scores of the probes in relation to their sequences.
  • In step (a), the hybridization and/or wash conditions are selected to produce, on average, a relatively higher number of probe:target hybrids for fully matched (perfectly complementary) probes than for probes with a single mismatch compared to target, and a relatively higher number of hybrids for single mismatch probes compared to double mismatch probes, and a relatively higher number of hybrids for double mismatch probes compared to triple mismatch probes, etc.
  • In step (b), the hybridization signal of the probe:target hybrids is measured and the relative level (or strength) of the signal is determined.
  • a numerical voting score (also referred to herein as "voting power") is assigned to each probe or to a pool of probes based on the strength of the hybridization signal.
  • the assignment of scores is described below in more detail in the section entitled “Assigning a Numerical Voting Score to Probes and Sequence Analysis.”
  • the sequence may be analyzed by aligning all possible sequences, summing the numerical voting scores of all probes voting for a particular hypothetical base identity at a particular position, and determining which of the hypothesized bases is correct (i.e., has the most votes).
  • the numerical voting score of the probes as assigned in step (c) may be further modified, or weighted, by a voting factor as described in the section entitled "Assigning a Numerical Voting Score to Probes and Sequence Analysis," wherein the modification of the score for a probe depends on the relationship of that probe to the hypothesized sequence.
  • Target nucleic acid refers to the nucleic acid of interest, typically the nucleic acid that is sequenced in the SBH assay.
  • Potential target polynucleotides include naturally occurring or artificially created DNA (e.g., genomic DNA and cDNA) and RNA (e.g., mRNA), including nucleic acids used as part of DNA computing.
  • the target nucleic acid may be composed of ribonucleotides, deoxyribonucleotides or mixtures thereof.
  • the target nucleic acid is a DNA. While the target nucleic acid can be double-stranded, it is preferably single stranded.
  • the "read length" of the target nucleic acid can be any number of nucleotides, depending on the length of the probes, but is typically on the order of 100, 200, 400, 800, 1600, 3200, 6400, or even more nucleotides in length, up to the entire human genome.
  • target nucleic acid can be obtained from virtually any source and can be prepared using methods known in the art.
  • target nucleic acids can be isolated by PCR methodology, or by cloning into plasmids (for a convenient target nucleic acid fragment length of 500 to 5,000 base pairs), or by cloning into yeast or bacterial artificial chromosomes (for a convenient target nucleic acid fragment length of up to 100 kb).
  • the target nucleic acid may be sheared into fragments prior to use in an SBH assay. Fragmentation may be accomplished by nonspecific endonuclease digestion, restriction enzyme digestion (e.g., by Cvi JI), physical shearing (e.g., by ultrasound) or NaOH treatment. Fragments may be separated by size (e.g., by gel electrophoresis) to obtain the desired fragment length. Fragmentation of the target nucleic acid also may avoid hindrance to hybridization from secondary structure in the sample.
  • the sizes of the target nucleic acid fragments used in the hybridization reaction optimally range in length from slightly longer than the probe length to twice the probe length, e.g., 10-100 or 10-40 bases.
  • Probes refers to relatively short pieces of nucleic acids, preferably DNA.
  • Probes are preferably shorter than the target DNA by at least one base, and more preferably they are 25 bases or fewer in length, still more preferably 20 bases or fewer in length. Of course, the optimal length of a probe will depend on the length of the target nucleic acid being analyzed.
  • the probes are at least 7-mers; for a target of about 100-200 bases, the probes are at least 8-mers; for a target nucleic acid of about 200-400 bases, the probes are at least 9-mers; for a target nucleic acid of about 400-800 bases, the probes are at least 10-mers; for a target nucleic acid of about 800-1600 bases, the probes are at least 11-mers; for a target of about 1600-3200 bases, the probes are at least 12-mers, for a target of about 3200-6400 bases, the probes are at least 13-mers; and for a target of about 6400-12,800 bases, the probes are at least 14-mers.
  • the optimal probe length is one additional base.
  • the above-delineated probe lengths are post-ligation.
  • specific probe lengths refer to the actual length of the probes for Format 1 and 2 SBH applications and the lengths of ligated probes for Format 3 or Format 3-like SBH applications.
  • Probes are normally single stranded, although double-stranded probes may be used in some applications.
  • Probes may be prepared using standard chemistry procedures known in the art.
  • the length of the probes described above refers to the length of the informational content of the probes, not necessarily the actual physical length of the probes.
  • the probes used in SBH frequently contain degenerate ends [e.g., one to three non-specified (mixed A,T,C and G) or universal (e.g. M base or inosine) bases at the ends] that aid hybridization but do not contribute to the information content of the probes.
  • Hybridization discrimination of mismatches in these degenerate probe mixtures refers only to the length of the informational content, not the full physical length.
  • in the formula NxByNz, N represents any of the four bases and varies for the polynucleotides in a given mixture
  • B represents any of the four bases but is the same for each of the polynucleotides in a given mixture
  • x, y, and z are all integers.
  • Nx and Nz represent the degenerate ends of the probe and By represents the information content of the probe (e.g., a uniquely arrayed probe in conventional SBH).
  • the probes may consist solely of naturally-occurring nucleotides and native phosphodiester backbones, or the probes may be modified or tagged to enhance specificity of detection.
  • the probes may be composed of one or more modified bases, such as 7-deazaguanosine, or one or more modified backbone interlinkages, such as a phosphorothioate. The only requirement is that the probes be able to hybridize to the target nucleic acid.
  • modified bases and backbone interlinkages that can be used in conjunction with the present invention are known, and will be apparent to those of skill in the art.
  • oligonucleotides to increase specificity or efficiency
  • cycling hybridizations to increase the hybridization signal, for example by performing a hybridization cycle under conditions (e.g. temperature) optimally selected for a first set of labeled probes followed by hybridization under conditions optimally selected for a second set of labeled probes.
  • Shifts in reading frame may be determined by using mixtures (preferably mixtures of equimolar amounts) of probes ending in each of the four nucleotide bases A, T, C and G.
  • the oligonucleotide probes are preferably labeled with identification tags to enhance detection or discrimination.
  • Suitable labels include fluorescent dyes, chemiluminescent systems, radioactive labels (e.g., 35S, 3H, 32P or 33P), or isotopes detectable by mass spectrometry, nanobeads, polymers or molecules of different size, shape, electrical, magnetic or other properties, attached by any of a variety of methods that are well known in the art.
  • a complete set of all possible probes of a given length (4^N, where N is the length) or a subset of this complete set may be used in the hybridization step. Probes of differing lengths may also be used. A large number of probes may be synthesized in a small number of reactions. For example, a complete set of all possible 10-mers (about 1 million probes) may be synthesized as follows. 1000 5-mers, each uniquely associated with a 10-digit DNA bar code, are synthesized in 1000 reactions and mixed.
  • the mixture is divided into 1000 aliquots which then undergo 1000 reactions, during which the informational length of the probe is extended by an additional 5 nucleotides and the 10-digit barcode is extended by a further 10 digits, to form 1 million uniquely tagged 10-mers synthesized in only
  • the number and type of probes that are used in each hybridization reaction depends on the detection power of the reader, the statistics of numerical scoring, including use of informative pools (or other pools), the length of target nucleic acid sequence, and the SBH application (e.g., whether de novo sequencing, resequencing or genotyping is desired).
  • a complete set of all possible probe sequences of the same length may be used, or only a portion of this complete set may be used. Alternatively, probes of differing length may be used.
  • Hybridization and washing conditions are selected to provide a range of hybridization signals such that a gradation of signals is provided wherein full match probes have higher signals than single mismatch probes, which in turn have higher signals than double mismatch probes, which in turn have higher signals than triple mismatch probes, etc.
  • Conditions may be selected so as to detect substantially perfect match hybrids (such as those wherein the fragment and probe hybridize at six out of seven positions). Alternatively, slightly less stringent conditions than those that permit detection only of perfect match hybrids may be used.
  • Suitable hybridization conditions may be routinely determined by optimization procedures or pilot studies. Such procedures and studies are routinely conducted by those skilled in the art to establish protocols for use in a laboratory. See e.g., Ausubel et al., Current Protocols in Molecular Biology, Vol.
  • the probes which have hybridized to the target polynucleotide during the hybridization reaction step can be assigned a numerical voting score (voting power) based on the strength of the hybridization signal.
  • Data may be obtained by scoring each probe individually or by scoring pools of probes, including informative pools as described in U.S. Application Serial No. 60/115,284 entitled "Enhanced Sequencing by Hybridization Using Informative Pools of Probes" filed January 6, 1999, incorporated herein by reference.
  • Probes can be numerically scored and their sequences analyzed as follows. Probes are sorted by descending hybridization signal value, and a numerical voting score (voting power) is assigned to each probe based on the hybridization signal and a converting function.
  • One possible converting function involves dividing the signal range into several segments (by setting a certain number of threshold steps) and defining the voting power for probes in each segment as the inverse of the number of probes in that segment.
  • the lower boundary of the signal range that is taken into account is set at least at the background level of signal, but may be set higher as desired in order to simplify or speed computation without losing significant information.
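  • One way to read the converting function described above is sketched below. The function name, the `thresholds` and `background` parameters, and the choice to give zero voting power to probes below the lower boundary are assumptions made for illustration, not details taken from the patent.

```python
from bisect import bisect_right
from collections import Counter


def voting_powers(signals, thresholds, background=0.0):
    """Convert hybridization signals into voting powers.

    signals:    dict mapping probe -> hybridization signal
    thresholds: ascending cut points that divide the signal range into segments
    background: lower boundary; probes below it are ignored (assumed to get 0)

    Every probe in a segment receives 1 / (number of probes in that segment).
    """
    kept = {p: s for p, s in signals.items() if s >= background}
    # Segment index = number of thresholds at or below the signal.
    segment = {p: bisect_right(thresholds, s) for p, s in kept.items()}
    counts = Counter(segment.values())
    powers = {p: 1.0 / counts[seg] for p, seg in segment.items()}
    for p in signals:
        powers.setdefault(p, 0.0)
    return powers


signals = {"a": 950, "b": 400, "c": 380, "d": 120, "e": 110, "f": 105, "g": 5}
print(voting_powers(signals, thresholds=[200, 500], background=50))
```

  • In this toy input the single strongest probe is alone in its segment and gets a voting power of 1, the two mid-range probes get 1/2 each, and the three weak probes get 1/3 each, so sparsely populated high-signal segments carry more weight per probe.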
  • Probe sequences are aligned allowing for mismatches.
  • identity of a base at a particular position is voted on by the number of occurrences of a base in aligned probes in combination with the numerical voting score (or voting power) of each probe.
  • the numerical voting scores of all probes voting for a particular hypothetical base identity at a particular position may be summed, and the base identity may be confirmed by determining which of the hypothesized bases has the most votes. For example, because the voting power of a strongly hybridizing probe is higher, its vote as to the identity of the base is given more weight.
  • the assigned numerical voting score of the probes may be further modified, or weighted, by a voting factor, wherein the modification of the score for a probe depends on the relationship of that probe to the hypothesized sequence. For example, when 10-mers are used in resequencing, if the base at position 100 is hypothesized to be an adenine (A), there will be 10 full match (perfectly complementary) probes for the A at that position, and 270 (9 x 3 x 10) single mismatch probes for the A at that position.
  • the voting factor by which the numerical voting score is modified may be set to a multiplier of 100 for full match probes, a multiplier of 20 for single mismatch probes, and a multiplier of 2 for all other probes.
  • the votes are then summed after modification by the appropriate voting factor, and the voting process is repeated for each of the four hypothesized bases at position 100 (A, T, C and G).
  • the hypothesized base that has the highest number of votes may be declared the correct base.
  • This voting process can be used to solve single-position base problems or can be used to determine the identity of two consecutive base positions (in which case there are sixteen, rather than four, hypothesized two-base combinations).
  • a correct solution requires an absolute minimum number of votes. If none of the hypothetical base candidates receives the minimum number of votes, then the process needs to be repeated. In addition, a correct solution requires the "winning" candidate to have a sufficiently high number of votes in comparison to the other candidates. In the case of a heterozygous position (two genes are present, and each gene has a different base at that position), there can be two “winning” candidates, but each of the “winners” must still have a sufficiently high number of votes compared to the other candidates.
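  • The weighted voting described above can be sketched for a single position as follows. This is a simplified illustration under several assumptions: probe sequences are written in the same orientation as the target rather than as complements, a probe votes for a hypothesized base only when some alignment places that base at the queried position, and `min_votes` and `min_ratio` are hypothetical acceptance parameters. Heterozygous calls are not handled.

```python
VOTING_FACTORS = {0: 100, 1: 20}  # multipliers from the text: full match, single mismatch
DEFAULT_FACTOR = 2                # multiplier for all other probes


def count_supporting_probes(k):
    """Full match and single mismatch probe counts for one hypothesized base."""
    return k, (k - 1) * 3 * k


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def window_starts(n, pos, k):
    """Start indices of every k-long window that covers index pos."""
    return range(max(0, pos - k + 1), min(pos, n - k) + 1)


def vote_for_base(reference, pos, scores, k):
    """Sum factor-weighted voting scores for each hypothesized base at pos."""
    totals = {}
    for base in "ACGT":
        hypo = reference[:pos] + base + reference[pos + 1:]
        total = 0.0
        for probe, score in scores.items():
            best = None
            for start in window_starts(len(hypo), pos, k):
                if probe[pos - start] != base:
                    continue  # this alignment does not vote for the hypothesized base
                mm = hamming(probe, hypo[start:start + k])
                best = mm if best is None else min(best, mm)
            if best is not None:
                total += score * VOTING_FACTORS.get(best, DEFAULT_FACTOR)
        totals[base] = total
    return totals


def call_base(totals, min_votes, min_ratio):
    """Return the winning base, or None if the vote is too weak or too close."""
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_votes), (_, second_votes) = ranked[0], ranked[1]
    if best_votes < min_votes:
        return None
    if second_votes > 0 and best_votes / second_votes < min_ratio:
        return None
    return best


scores = {"TACG": 0.9, "ACGG": 0.8, "TACC": 0.3, "ATGG": 0.4}  # toy 4-mer data
totals = vote_for_base("ACGTACGGTCA", pos=5, scores=scores, k=4)
print(totals, call_base(totals, min_votes=50, min_ratio=2.0))
```

  • For 10-mers, `count_supporting_probes(10)` reproduces the 10 full match and 270 single mismatch probes given in the example above. In the toy data the two strongly hybridizing full match probes dominate the vote, so the reference base C wins at the queried position.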
  • the voting power of probes can be used to determine whether a probe is a full match probe, for example, in de novo sequencing.
  • the selected probe in question is aligned with all probes that have a single or more mismatches (mismatches when compared to the probe in question).
  • Statistics are applied that take into account the voting power of each of the mismatched probes (and may include the vote of the selected probe itself). If the selected probe is actually a full match probe, probes that have a single mismatch compared to the selected probe should still hybridize strongly.
  • Conversely, if the selected probe is not a full match probe, probes that have a single mismatch compared to the selected probe will generally be double mismatched probes compared to the target nucleic acid sequence and thus will hybridize relatively more weakly. For example, summing the numerical voting score (or the numerical voting score as modified by voting factor) of all probes having a single mismatch in comparison to the selected probe thus will indicate whether the selected probe is a full match.
  • Probes with end mismatches typically have a stronger hybridization signal than probes with internal mismatches and thus it may be desirable to set voting factors so that these probes are given relatively more voting power.
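The full match test described above can be illustrated by summing the scores of a probe's single-mismatch neighbors. This is a sketch under stated assumptions: the function names are hypothetical, and the separate `end_factor` and `internal_factor` weights are one possible way to give end-mismatch probes a different voting power; the text does not specify values.

```python
def single_mismatch_neighbors(probe):
    """Yield (neighbor, position) for every probe differing at one position."""
    for i, original in enumerate(probe):
        for base in "ACGT":
            if base != original:
                yield probe[:i] + base + probe[i + 1:], i


def neighbor_support(probe, scores, end_factor=1.0, internal_factor=1.0):
    """Sum the weighted voting scores of all single-mismatch neighbors.

    A high total suggests the selected probe is a full match; probes absent
    from `scores` contribute zero.
    """
    last = len(probe) - 1
    total = 0.0
    for neighbor, i in single_mismatch_neighbors(probe):
        factor = end_factor if i in (0, last) else internal_factor
        total += factor * scores.get(neighbor, 0.0)
    return total
```

Comparing `neighbor_support` for a candidate against the same quantity for known negative probes gives a relative measure; the cutoff would need to be chosen empirically.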
  • bridging across a branching point of a repeated 6-mer sequence can be done by sorting all probes by the central 6-mer and determining which of the positive probes are full match probes. If, for example, there are 6 positive probes sharing the same middle 6-mer sequence, and one assumes that only two probes are true full match probes, then single and double mismatches may be taken into account when voting for the full match probes. This approach has the advantages of eliminating false positives and reducing the occurrence of false sequence assembly.
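Sorting probes by their central 6-mer, as described above, amounts to grouping on a substring. The sketch below assumes probes are longer than the core (10-mers in the test) with equal flanks on both sides; the function names are hypothetical.

```python
from collections import defaultdict


def group_by_central(probes, core_len=6):
    """Group probes by the core_len-long sequence in their middle."""
    groups = defaultdict(list)
    for probe in probes:
        flank = (len(probe) - core_len) // 2
        groups[probe[flank:flank + core_len]].append(probe)
    return dict(groups)


def candidate_branch_cores(probes, core_len=6):
    """Cores shared by more than one positive probe, i.e. possible repeats."""
    groups = group_by_central(probes, core_len)
    return sorted(core for core, members in groups.items() if len(members) > 1)
```

Each group returned for a repeated core is the set of positive probes among which the true full match probes would then be chosen by voting.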
  • DNA was prepared by PCR with one phosphorylated primer. Lambda exonuclease was used to degrade the phosphorylated strand and the remaining single stranded DNA was randomly fragmented by endonuclease DNAse I.
  • the target DNA was mixed with 16 pools containing 64 TAMRA labeled 5-mer probes and hybridized to 4 HyChips each containing four 5-mer arrays.
  • the hybridization image was detected using a fluorescent scanner and a hybridization score for each of about 16,000 test dots was determined using an image analysis program. Probes were sorted by descending hybridization signal value, and a numerical voting score was assigned to each probe based on its hybridization signal and a converting function. In this case, for the top 2000 dots (each corresponding to a pool of 64 pentamers scoring 64 10-mers), the probes were assigned an initial numerical voting score that was equal to their hybridization signal, while all other probes were assigned a numerical voting score of zero.
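The figure of about 16,000 test dots is consistent with the following count, offered as a plausibility check rather than a statement of the experimental layout. It assumes that each of the 16 arrays (4 HyChips x 4 arrays) carries a complete set of 5-mers and receives one of the 16 pools; the constant and function names are hypothetical.

```python
BASES = 4
ARRAY_PROBE_LEN = 5   # length of the immobilized probes
POOL_SIZE = 64        # labeled 5-mers per pool
N_POOLS = 16          # one pool per array (assumption)


def experiment_counts():
    """Return (dots per array, total dots, total 10-mers scored)."""
    dots_per_array = BASES ** ARRAY_PROBE_LEN
    total_dots = N_POOLS * dots_per_array
    # Each dot pairs one immobilized 5-mer with a pool of 64 labeled 5-mers,
    # so it scores 64 different 10-mers.
    total_ten_mers = total_dots * POOL_SIZE
    return dots_per_array, total_dots, total_ten_mers
```

Under these assumptions the dots together score every possible 10-mer exactly once, since 16 x 64 = 1,024 covers all labeled 5-mers.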
  • the assigned numerical voting score of each probe was further weighted by a voting factor.
  • the voting factor was set to a multiplier of 100 for full match probes and a multiplier of 1 for all other probes.
  • the votes were then summed after modification by the appropriate voting factor.
  • the modified numerical voting scores (modified by the appropriate voting factor) of all probes voting for a particular hypothetical base identity at a particular position were summed, and the base identity was confirmed by determining which of the hypothesized bases had the most votes.
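The converting function used in this example (signal kept for the top 2000 dots, zero for the rest) can be sketched as below. The function name and the dictionary representation of signals are assumptions; ties at the cutoff are broken by sort order, which the text does not address.

```python
def assign_voting_scores(signals, top_n=2000):
    """Map hybridization signals to initial numerical voting scores.

    The top_n strongest dots keep their signal as the score; all other
    dots are assigned a score of zero.
    """
    ranked = sorted(signals, key=signals.get, reverse=True)
    keep = set(ranked[:top_n])
    return {dot: (signals[dot] if dot in keep else 0.0) for dot in signals}
```

The resulting scores would then be multiplied by the voting factor (100 for full match probes, 1 for all others in this example) before being summed per hypothesized base.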
  • The results of the sequence analysis according to this voting schema are shown in Figure 2.
  • the figure depicts a 400-500 base segment of the 700 base pair fragment sequenced. For each base position, all four nucleotide options were tested. The sum of the votes is plotted for each nucleotide at each position, and the points in the graph are marked with corresponding nucleotide letters. The capital letters denote the apo-B reference sequence.
  • At positions 485 and 486 the sequence is undeterminable because none of the four bases received the minimum number of votes (each base received a total score of approximately 2000 votes).
  • Format 1 SBH is appropriate for the simultaneous analysis of a large set of samples. Parallel scoring of thousands of samples on large arrays may be performed in thousands of independent hybridization reactions using small pieces of membranes.
  • the identification of DNA may involve 1-20 probes per reaction and the identification of mutations may in some cases involve more than 1000 probes specifically selected or designed for each sample. For identification of the nature of the mutated DNA segments, specific probes may be synthesized or selected for each mutation detected in the first round of hybridizations.
  • DNA samples may be prepared in small arrays which may be separated by appropriate spacers, and which may be simultaneously tested with probes selected from a set of oligonucleotides which may be arrayed in multiwell plates.
  • Small arrays may consist of one or more samples. DNA samples in each small array may include mutants or individual samples of a sequence. Consecutive small arrays may be organized into larger arrays. Such larger arrays may include replication of the same small array or may include arrays of samples of different DNA fragments.
  • a universal set of probes includes sufficient probes to analyze a DNA fragment with prespecified precision, e.g. with respect to the redundancy of reading each base pair ("bp"). These sets may include more probes than are necessary for one specific fragment, but may include fewer probes than are necessary for testing thousands of DNA samples of different sequence.
  • DNA or allele identification and a diagnostic sequencing process may include the steps of: 1) Selection of a subset of probes from a dedicated, representative or universal set to be hybridized with each of a plurality of small arrays;
  • This approach provides fast identification and sequencing of a small number of nucleic acid samples of one type (e.g. DNA, RNA), and also provides parallel analysis of many sample types in the form of subarrays by using a presynthesized set of probes of manageable size.
  • Two approaches have been combined to produce an efficient and versatile process for the determination of DNA identity, for DNA diagnostics, and for identification of mutations.
  • a small set of shorter probes may be used in place of a longer unique probe.
  • a universal set of probes may be synthesized to cover any type of sequence.
  • a full set of 6-mers includes only 4,096 probes, and a complete set of 7-mers includes only 16,384 probes.
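The set sizes quoted above are 4^k for probes of length k. A short sketch that enumerates a complete set, with a hypothetical function name:

```python
from itertools import product


def universal_set(k):
    """Return every possible DNA probe of length k (4**k sequences)."""
    return ["".join(bases) for bases in product("ACGT", repeat=k)]


if __name__ == "__main__":
    for k in (6, 7):
        print(k, len(universal_set(k)))
```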
  • Full sequencing of a DNA fragment may be performed with two levels of hybridization. One level is hybridization of a sufficient set of probes that cover every base at least once. For this purpose, a specific set of probes may be synthesized for a standard sample. The results of hybridization with such a set of probes reveal whether and where mutations (differences) occur in non-standard samples. To determine the identity of the changes, additional specific probes may be hybridized to the sample.
  • all probes from a universal set may be scored.
  • a universal set of probes allows scoring of a relatively small number of probes per sample in a two step process without an undesirable expenditure of time.
  • the hybridization process may involve successive probings, in a first step of computing an optimal subset of probes to be hybridized first and, then, on the basis of the obtained results, a second step of determining additional probes to be scored from among those in a universal set.
  • K-1 oligonucleotides which occur repeatedly in analyzed DNA fragments due to chance or biological reasons may be subject to special consideration. If there is no additional information, relatively small fragments of DNA may be fully assembled inasmuch as every base pair is read several times.
  • Ambiguities may arise due to the repeated occurrence in a set of positively-scored probes of a K-1 sequence (i.e., a sequence shorter than the length of the probe). This difficulty does not exist if mutated or similar sequences have to be determined. Knowledge of one sequence may be used as a template to correctly assemble a sequence known to be similar.
  • the location of certain probes may be interchangeable when determined by overlapping the sequence data, resulting in an ambiguity as to the position of the partial sequence.
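A branch point of this kind can be detected by looking for a (K-1)-mer that can be extended in more than one way among the positive probes. The sketch below is illustrative only; the function names are hypothetical and it considers rightward extensions of error-free positive probes.

```python
from collections import defaultdict


def spectrum(sequence, k):
    """All k-long probes that fully match the sequence (error-free case)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def branch_points(probes):
    """Return (K-1)-mers that more than one positive probe extends rightward."""
    extensions = defaultdict(set)
    for probe in probes:
        extensions[probe[:-1]].add(probe)
    return sorted(prefix for prefix, group in extensions.items()
                  if len(group) > 1)
```

In the toy sequence `AACGTCCGTA` probed with 4-mers, the 3-mer `CGT` occurs twice and is followed by `C` once and `A` once, so overlap alone cannot decide the order of the segments after it.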
  • Where sequence information is determined by SBH, either: (i) long read length, single-pass gel sequencing at a fraction of the cost of complete gel sequencing; or (ii) comparison to related sequences, may be used to order hybridization data where such ambiguities ("branch points") occur.
  • segments in junk DNA (which is not found in genes) may be repeated many times in tandem.
  • single-pass gel sequencing may be used to determine the number of tandem repeats where tandemly-repeated segments occur. As tandem repeats occur rarely in protein-encoding portions of a gene, the gel-sequencing step will be performed only when a commercial value for the sequence is determined.
  • an array of sample arrays avoids consecutive scoring of many oligonucleotides on a single sample or on a small set of samples. This approach allows the scoring of more probes in parallel by manipulation of only one physical object.
  • Subarrays of DNA samples 1000 bp in length may be sequenced in a relatively short period of time. If the samples are spotted at 50 subarrays in an array and the array is reprobed 10 times, 500 probes may be scored. In screening for the occurrence of a mutation, approximately 335 probes may be used to cover each base three times. If a mutation is present, several covering probes will be affected. The use of information about the identity of negative probes may map the mutation with a two base precision.
  • an additional 15 probes may be employed. These probes cover any base combination for two questionable positions (assuming that deletions and insertions are not involved). These probes may be scored in one cycle on 50 subarrays which contain a given sample. In the implementation of a multiple label color scheme (i.e., multiplexing), two to six probes, each having a different label such as a different fluorescent dye, may be used as a pool, thereby reducing the number of hybridization cycles and shortening the sequencing process.
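The counts in this passage can be checked with a few lines of arithmetic. Reading the 15 additional probes as the 16 possible two-base combinations minus the one already tested is an interpretation, not something the text states; the function names are hypothetical.

```python
def probes_scored(subarrays, reprobings):
    """Probes scored when every subarray receives one probe per reprobing."""
    return subarrays * reprobings


def probes_for_questionable_positions(n_positions):
    """Base combinations at n positions, excluding the one already tested
    (assumed interpretation of the figure of 15 for two positions)."""
    return 4 ** n_positions - 1


if __name__ == "__main__":
    print(probes_scored(50, 10))
    print(probes_for_questionable_positions(2))
```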
  • If the subarrays to be analyzed include tens or hundreds of samples of one type, then several of them may be found to contain one or more changes (mutations, insertions, or deletions). For each segment where a mutation occurs, a specific set of probes may be scored. The total number of probes to be scored for a type of sample may be several hundred. The scoring of replica arrays in parallel facilitates scoring of hundreds of probes in a relatively small number of cycles. In addition, compatible probes may be pooled. Positive hybridizations may be assigned to the probes selected to check particular DNA segments because these segments usually differ in 75% of their constituent bases.
  • targets may be conveniently analyzed. These targets may represent pools of shorter fragments such as pools of exon clones.
  • a specific hybridization scoring method may be employed to define the presence of heterozygotes (sequence variants) in a genomic segment to be sequenced from a diploid chromosomal set.
  • Two variations are where: i) the sequence from one chromosome represents a basic type and the sequence from the other represents a new variant; or, ii) both chromosomes contain new, but different variants.
  • In the first case, the scanning step designed to map changes gives a maximal signal difference of two-fold at the heterozygotic position.
  • In the second case, there is no masking but a more complicated selection of the probes for the subsequent rounds of hybridizations may be indicated.
  • Scoring two-fold signal differences required in the first case may be achieved efficiently by comparing corresponding signals with controls containing only the basic sequence type and with the signals from other analyzed samples. This approach allows determination of a relative reduction in the hybridization signal for each particular probe in a given sample. This is significant because hybridization efficiency may vary more than two-fold for a particular probe hybridized with different DNA fragments having the same full match target.
  • heterozygotic sites may affect more than one probe depending upon the number of oligonucleotide probes. Decrease of the signal for two to four consecutive probes produces a more significant indication of heterozygotic sites.
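The comparison against controls described above can be sketched as a scan for runs of consecutive probes whose signal drops relative to a control carrying only the basic sequence type. The 0.6 ratio cutoff and the minimum run length of 2 are illustrative assumptions; the text only states that the expected difference is about two-fold and that two to four consecutive reduced probes are a stronger indication. The function name is hypothetical.

```python
def heterozygous_runs(sample, control, max_ratio=0.6, min_run=2):
    """Return (start, end) index ranges of consecutive reduced-signal probes.

    `sample` and `control` are signals for the same probes in tiling order.
    A probe counts as reduced when sample / control <= max_ratio.
    """
    n = min(len(sample), len(control))
    runs, start = [], None
    for i in range(n):
        reduced = control[i] > 0 and sample[i] / control[i] <= max_ratio
        if reduced and start is None:
            start = i
        elif not reduced and start is not None:
            if i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    if start is not None and n - start >= min_run:
        runs.append((start, n - 1))
    return runs
```

Isolated single-probe drops are ignored here because hybridization efficiency for one probe may itself vary more than two-fold between fragments.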
  • Results may be checked by testing with small sets of selected probes, among which one or a few probes are selected to give a full match signal which is on average eight-fold stronger than the signals coming from mismatch-containing duplexes.
  • Partitioned membranes allow a very flexible organization of experiments to accommodate relatively larger numbers of samples representing a given sequence type, or many different types of samples represented with relatively small numbers of samples.
  • a range of 4-256 samples can be handled with particular efficiency.
  • Subarrays within this range of numbers of dots may be designed to match the configuration and size of standard multiwell plates used for storing and labeling oligonucleotides. The size of the subarrays may be adjusted for different number of samples, or a few standard subarray sizes may be used.
  • Obtaining information about the degree of hybridization exhibited for a set of only about 200 oligonucleotide probes defines a unique signature of each gene and may be used for sorting the cDNAs from a library to determine if the library contains multiple copies of the same gene.
  • By such signatures, identical, similar and different cDNAs can be distinguished and inventoried.
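One simple way to compare such signatures is to reduce each to a vector of positive/negative calls and measure the fraction of probes on which two clones agree. This representation, the 0.9 cutoff for "similar", and the function names are assumptions made for illustration; the text does not specify how signatures are compared.

```python
def signature_similarity(a, b):
    """Fraction of probes with the same positive/negative call in both clones."""
    if len(a) != len(b) or not a:
        raise ValueError("signatures must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def classify_pair(a, b, similar_at=0.9):
    """Label two clone signatures as identical, similar or different."""
    similarity = signature_similarity(a, b)
    if similarity == 1.0:
        return "identical"
    return "similar" if similarity >= similar_at else "different"
```

In practice signatures would span about 200 probes and graded hybridization signals rather than binary calls, so a correlation-based measure may be more appropriate.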
  • Format 3 Sequencing by Hybridization. In Format 3, a first set of oligonucleotide probes of known sequence is immobilized on a solid support under conditions which permit them to hybridize with nucleic acids having respectively complementary sequences. A labeled, second set of oligonucleotide probes is provided in solution. Both within the sets and between the sets the probes may be of the same length or of different lengths.
  • a nucleic acid to be sequenced or intermediate fragments thereof may be applied to the first set of probes in double-stranded form (especially where a recA protein is present to permit hybridization under non-denaturing conditions), or in single-stranded form and under conditions which permit hybrids of different degrees of complementarity (for example, under conditions which discriminate between full match and one base pair mismatch hybrids).
  • the nucleic acid to be sequenced or intermediate fragments thereof may be applied to the first set of probes before, after or simultaneously with the second set of probes.
  • a ligase or other means of causing chemical bond formation between adjacent, but not between nonadjacent, probes may be applied before, after or simultaneously with the second set of probes.
  • Fragments and probes which are not immobilized to the surface by chemical bonding to a member of the first set of probes are washed away, for example, using a high temperature (up to 100 degrees C) wash solution which melts hybrids.
  • the bound probes from the second set may then be detected using means appropriate to the label employed (which may be chemiluminescent, fluorescent, radioactive, enzymatic or densitometric, for example).
  • Nucleotide bases "match" or are "complementary" if they form a stable duplex by hydrogen bonding under specified conditions.
  • Adenine ("A") matches thymine ("T"), but not guanine ("G") or cytosine ("C").
  • G matches C, but not A or T.
  • Other bases which will hydrogen bond in less specific fashion, such as inosine or the Universal Base ("M" base, Nichols et al 1994), or other modified bases, such as methylated bases, are complementary to those bases with which they form a stable duplex under specified conditions.
  • A probe is said to be "perfectly complementary" or is said to be a "perfect match" if each base in the probe forms a duplex by hydrogen bonding to a base in the nucleic acid to be sequenced.
  • Each base in a probe that does not form a stable duplex is said to be a "mismatch" under the specified hybridization conditions.
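These definitions translate directly into a mismatch count between a probe and a target site. The sketch below covers only the four standard bases (inosine and other modified bases are left out), assumes both sequences are written 5' to 3' so that pairing is antiparallel, and uses hypothetical function names.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(sequence):
    """Return the perfectly complementary strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence))


def count_mismatches(probe, target_site):
    """Count probe bases that do not pair with the facing target base.

    Both arguments are written 5' to 3'; the probe's first base faces the
    target site's last base.
    """
    if len(probe) != len(target_site):
        raise ValueError("probe and target site must have equal length")
    return sum(1 for p, t in zip(probe, reversed(target_site))
               if COMPLEMENT[p] != t)
```

A probe is a perfect match when `count_mismatches` returns 0, and a single mismatch probe when it returns 1.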
  • a list of probes may be assembled wherein each probe is a perfect match to the nucleic acid to be sequenced.
  • the probes on this list may then be analyzed to order them in maximal overlap fashion. Such ordering may be accomplished by comparing a first probe to each of the other probes on the list to determine which probe has a 3' end which has the longest sequence of bases identical to the sequence of bases at the 5' end of a second probe.
  • The first and second probes may then be overlapped, and the process may be repeated by comparing the 5' end of the second probe to the 3' end of all of the remaining probes and by comparing the 3' end of the first probe with the 5' end of all of the remaining probes.
  • the process may be continued until there are no probes on the list which have not been overlapped with other probes.
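The maximal-overlap ordering described above can be sketched as a greedy extension in both directions. This is a simplified illustration with hypothetical function names: it assumes an error-free list of perfect match probes, and where several probes tie for the longest overlap (a branch point) it simply takes the first, which is exactly the situation where additional information would be needed.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for n in range(min(len(left), len(right)) - 1, 0, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def assemble(probes):
    """Greedily merge probes by maximal overlap.

    Returns (assembled sequence, probes left unplaced).
    """
    remaining = list(dict.fromkeys(probes))  # drop duplicates, keep order
    contig = remaining.pop(0)
    while remaining:
        right = max(remaining, key=lambda p: overlap(contig, p))
        left = max(remaining, key=lambda p: overlap(p, contig))
        r, l = overlap(contig, right), overlap(left, contig)
        if r == 0 and l == 0:
            break  # nothing overlaps the current sequence nucleus
        if r >= l:
            contig += right[r:]
            remaining.remove(right)
        else:
            contig = left[:len(left) - l] + contig
            remaining.remove(left)
    return contig, remaining
```

Starting from any probe in the list, the sequence nucleus grows at whichever end offers the longer overlap until no probes remain or none overlap.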
  • more than one probe may be selected from the list of positive probes, and more than one set of overlapped probes ("sequence nucleus") may be generated in parallel.
  • the list of probes for either such process of sequence assembly may be the list of all probes which are perfectly complementary to the nucleic acid to be sequenced or may be any subset thereof.
  • sequence nuclei may be overlapped to generate longer stretches of sequence. Where ambiguities arise in sequence assembly due to the availability of alternative proper overlaps with probes or sequence nuclei, hybridization with longer probes spanning the site of overlap alternatives, competitive hybridization, ligation of alternative end to end pairs of probes spanning the site of ambiguity or single pass gel analysis (to provide an unambiguous framework for sequence assembly) may be used.
  • a pattern of hybridization which may be correlated with the identity of a nucleic acid sample to serve as a signature for identifying the nucleic acid sample
  • overlapping or non-overlapping probes up through assembled sequence nuclei and on to complete sequence for an intermediate fragment or an entire source DNA molecule (e.g. a chromosome).
  • Sequencing may generally comprise the following steps:
  • ligation may be implemented by a chemical ligating agent (e.g. water-soluble carbodiimide or cyanogen bromide).
  • A ligase enzyme, such as the commercially available T4 DNA ligase from T4 bacteriophage, may be employed.
  • the washing conditions which are selected to distinguish between adjacent versus nonadjacent labeled and immobilized probes are selected to make use of the difference in stability of continuously stacked or ligated adjacent probes.
  • $baseVote = getVotes($pos2, $mut, $Full2, \%Tens);
  • my ($refScore, $bestSol, $arrRef) = sortVotes(substr($mut, $pos2, 1), $baseVote);
  • my $weight = 1 - abs($pos2 - $pos) * .1;
  • $max = length($seq) - 10 if ($max > (length($seq) - 10));
  • my $theChar = substr($seq, $pos, 1); foreach my $ch (qw(A C G T a c g t d))

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Chemical & Material Sciences (AREA)
  • Organic Chemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Biotechnology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to methods of nucleic acid sequence analysis by hybridization in which sequence information obtained not only from perfectly complementary oligonucleotide probes, but also from oligonucleotide probes that are imperfectly complementary to the target nucleic acid, is taken into account.
PCT/US2000/016899 1999-06-19 2000-06-19 Improved methods of sequence assembly in sequencing by hybridization WO2000079007A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU54971/00A AU5497100A (en) 1999-06-19 2000-06-19 Improved methods of sequence assembly in sequencing by hybridization

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US33655899A 1999-06-19 1999-06-19
US09/336,558 1999-06-19

Publications (2)

Publication Number Publication Date
WO2000079007A1 WO2000079007A1 (fr) 2000-12-28
WO2000079007A9 WO2000079007A9 (fr) 2002-05-02

Family

ID=23316634

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/016899 WO2000079007A1 (fr) 1999-06-19 2000-06-19 Improved methods of sequence assembly in sequencing by hybridization

Country Status (2)

Country Link
AU (1) AU5497100A (fr)
WO (1) WO2000079007A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9012144B2 (en) 2003-11-12 2015-04-21 Fluidigm Corporation Short cycle methods for sequencing polynucleotides
US9540689B2 (en) 1998-05-01 2017-01-10 Life Technologies Corporation Method of determining the nucleotide sequence of oligonucleotides and DNA molecules

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6780591B2 (en) 1998-05-01 2004-08-24 Arizona Board Of Regents Method of determining the nucleotide sequence of oligonucleotides and DNA molecules
CA2557177A1 (fr) 2004-02-19 2005-09-01 Stephen Quake Methods and kits for analyzing polynucleotide sequences
US7666593B2 (en) 2005-08-26 2010-02-23 Helicos Biosciences Corporation Single molecule sequencing of captured nucleic acids

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5525464A (en) * 1987-04-01 1996-06-11 Hyseq, Inc. Method of sequencing by hybridization of oligonucleotide probes
EP0723598B1 (fr) * 1993-09-27 2004-01-14 Arch Development Corporation Methods and compositions for efficient nucleic acid sequencing
US5795716A (en) * 1994-10-21 1998-08-18 Chee; Mark S. Computer-aided visualization and analysis system for sequence evaluation
US20020042048A1 (en) * 1997-01-16 2002-04-11 Radoje Drmanac Methods and compositions for detection or quantification of nucleic acid species
AU2496900A (en) * 1999-01-06 2000-07-24 Hyseq, Inc. Enhanced sequencing by hybridization using pools of probes

Also Published As

Publication number Publication date
AU5497100A (en) 2001-01-09
WO2000079007A1 (fr) 2000-12-28

Similar Documents

Publication Publication Date Title
US6537755B1 (en) Solution-based methods and materials for sequence analysis by hybridization
US6864052B1 (en) Enhanced sequencing by hybridization using pools of probes
US6270961B1 (en) Methods and apparatus for DNA sequencing and DNA identification
Chetverin et al. Oligonucleotide arrays: New concepts and possibilities
US5503980A (en) Positional sequencing by hybridization
AU745201B2 (en) Methods and compositions for detection or quantification of nucleic acid species
US7501253B2 (en) DNA fingerprinting using a branch migration assay
US20020034737A1 (en) Methods and compositions for detection or quantification of nucleic acid species
Drmanac et al. Sequencing by hybridization
EP1141399A1 (fr) Sequencing method using magnifying tags
WO1999031272A1 (fr) Methods for detecting cleaved, amplified and modified polymorphic sequences
US6692915B1 (en) Sequencing a polynucleotide on a generic chip
Maldonado-Rodriguez et al. Mutation detection by stacking hybridization on genosensor arrays
EP0967291A1 (fr) Method for parallel screening of allelic variation
WO2000039333A1 (fr) Sequencing method using magnifying tags
WO2000079007A9 (fr) Improved methods of sequence assembly in sequencing by hybridization
WO1999028494A1 (fr) Methods of using probes for analyzing a polynucleotide sequence
US20050176007A1 (en) Discriminative analysis of clone signature
CN118667932A (zh) Sequencing method compatible with libraries having different adapters, based on the MGI platform
Uitterlinden et al. TWO-DIMENSIONAL DNA TYPING OF HUMAN INDIVIDUALS FOR MAPPING GENETIC TRAITS
CN117625763A (zh) High-sensitivity method for accurate parallel quantification of variant nucleic acids

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

COP Corrected version of pamphlet

Free format text: PAGES 31-35, DESCRIPTION, REPLACED BY NEW PAGES 31-35D; PAGES 1/2-2/2, DRAWINGS, REPLACED BY NEW PAGES 1/2-2/2; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP