Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.
In embodiments of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or an implicit indication of the number of features being indicated; features defining "first", "second" may include one or more of the stated features, either explicitly or implicitly. Unless otherwise indicated, "a set" or "a plurality" refers to two or more.
It should be noted that the term "connected" is to be construed broadly and, unless otherwise specified, may be, for example, a fixed connection, a detachable connection, or an integral connection; may be a mechanical connection, an electrical connection, or a communicative connection; and may be a direct connection or an indirect connection through an intermediate medium, including internal communication between two elements or an interaction relationship between two elements. The specific meaning of the above term in the respective examples will be understood by those skilled in the art according to the specific circumstances.
The present invention may repeat reference numerals and/or letters in the various examples, and this repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
In embodiments of the invention, the terms "sequencing," "nucleic acid sequencing," and "gene sequencing" are interchangeable and refer to nucleic acid sequencing. This includes sequencing by synthesis (SBS) and/or sequencing by ligation (SBL); DNA sequencing and/or RNA sequencing; and long fragment sequencing and/or short fragment sequencing, where "long" and "short" are relative terms, e.g., a nucleic acid molecule longer than 1 kb, 2 kb, 5 kb or 10 kb may be referred to as a long fragment, and one shorter than 1 kb or 800 bp may be referred to as a short fragment. It also includes double-ended sequencing, single-ended sequencing, and/or paired-end sequencing, etc., where double-ended or paired-end sequencing may refer to the readout of any two segments or portions of the same nucleic acid molecule that do not overlap completely. The term sequencing includes the process of binding nucleotides (including nucleotide analogs) to a template and collecting the corresponding emitted signals.
Sequencing generally involves multiple rounds of sequencing to determine the order of the multiple nucleotides/bases on the template. A "round of sequencing" (cycle), also known as a "sequencing round," can be defined as one base extension of the four nucleotides/bases; in other words, it can be defined as the completion of a determination of the base type at any one specified position on a template. For a sequencing platform that implements sequencing based on a polymerization or ligation reaction, one round of sequencing includes the process of binding the four nucleotides (including nucleotide analogs) to the template once and collecting the corresponding emitted signals. For a platform that implements sequencing based on a polymerization reaction, the reaction system comprises reaction substrate nucleotides, a polymerase and a template; a sequence (sequencing primer) is bound to the template, and, under the catalysis of the polymerase and based on the base pairing principle and the polymerization reaction principle, an added reaction substrate nucleotide is connected to the sequencing primer so that the nucleotide binds to a specific position of the template. Typically, a round of sequencing may include one or more base extensions (repeats). For example, when the four nucleotides are added sequentially to the reaction system, each with its own base extension and corresponding collection of reaction signals, a round of sequencing includes four base extensions. For another example, when the four nucleotides are added to the reaction system in any combination, such as two combinations of two nucleotides each or one combination of one nucleotide and another of three, and each of the two combinations undergoes base extension and corresponding reaction signal acquisition, a round of sequencing includes two base extensions. For another example, when the four nucleotides are added simultaneously to the reaction system for base extension and collection of reaction signals, a round of sequencing includes one base extension.
What is referred to as a "bright spot" or "spot" on an image is a location on the image where the signal is relatively strong, e.g., stronger than its surroundings, so that it appears as a relatively bright point or patch; a bright spot occupies one or more pixels. The signal of a bright spot may come from a target molecule or from a non-target substance. Detection of bright spots includes detection of optical signals of target molecules such as extended bases or base clusters.
The term "chromatic aberration" (CA) refers to the phenomenon in which an optical lens cannot focus light of different wavelengths on the same point [Max Born; Emil Wolf. Principles of Optics: Electromagnetic Theory of Propagation, Interference and Diffraction of Light (7th Edition). Cambridge University Press. October 13, 1999: 334. ISBN 0521642221.]. In imaging, chromatic aberration is the inability of each color in the spectrum to focus on the same point on the optical axis. For sequencing platforms that image the same object (e.g., one or more nucleic acid molecules) with light of multiple wavelengths, chromatic aberration can at least cause an object acquired at different wavelengths to have different positions/coordinates; that is, the object does not actually move but appears to move across the multiple images acquired at different wavelengths.
The term "cross color" (crosstalk, laser crosstalk or spectral crosstalk), also known as "spectral cross color" or "spectral crosstalk", refers to the phenomenon in which the signal corresponding to one base spreads into the signal of another base. For sequencing platforms that use different fluorescent molecule labels to identify different bases, if the emission spectra of two or more of the selected fluorescent molecules overlap, the signal of one fluorescent molecule may be detected spreading into another fluorescence channel in one round of sequencing.
The terms "phase loss", "phase imbalance" and "phase difference" refer to the phenomenon in which the nucleic acid molecules in a population, such as a cluster of nucleic acid molecules, fall out of synchrony in a chemical reaction, including lag (phasing or sequence lag) and lead (prephasing or sequence lead). On a sequencing platform that uses different fluorescent molecules to identify different bases, the phenomenon manifests as a non-zero signal of the fluorescent molecule corresponding to the base at a specific position in more than one round of sequencing. Typically, sequencing is performed using nucleotides carrying a fluorescent molecular label and a blocking group that prevents other nucleotides from binding to the next position of the template, such as an azide group attached to the 3' position of the sugar of the nucleotide; if the blocking group has come off, or cannot be removed, before the extension of the next base, phase loss can result.
In an embodiment of the invention, the images are from a platform that implements nucleic acid sequencing based on optical imaging detection of a chip, which includes, but is not limited to, one or more series of sequencing platforms from companies or institutions such as BGI/CG (Complete Genomics), Illumina/Solexa, Thermo Fisher/Life Technologies/ABI SOLiD, and Roche 454.
In some platforms, a solid support such as a chip has a plurality of sequences (probes or sequencing primers) immobilized on it, and a template (the nucleic acid molecule to be tested) is attached to the chip by binding to the probes, for example by hybridization; the template is optionally amplified on the chip. The chip carrying the template is then loaded into a sequencing device comprising an imaging system and a liquid path system, and a controlled polymerase chain reaction is carried out on the chip under suitable conditions by controlling the liquid path system. For example, the introduced nucleotide solution comprises modified nucleotides carrying blocking groups and fluorescent molecules; under the catalysis of the polymerase and according to the base complementation principle, a modified nucleotide binds to a specific position of a given template, and its blocking group prevents other nucleotides (including modified nucleotides) from binding to the next position of the template. The imaging system is then used to excite the fluorescent molecules so that they emit fluorescent signals, and the signals are collected, for example by photographing a reaction area on the chip to obtain an image. Finally, a cleavage reagent is introduced by controlling the liquid path system to remove the blocking group and the fluorescent molecule of the modified nucleotide bound to the template. One base extension is thus completed; a solution containing polymerase and nucleotides is again introduced onto the chip, and the above base extension reaction is repeated. Based on the captured images and the time sequence of each shot and/or the type of base added, the type of nucleotide/base bound at the specific position of each template is determined, i.e., the nucleotide/base at that position of the template is determined.
The efficiency of each step based on a biochemical reaction is less than one hundred percent. For example, even if modified nucleotides not bound to the template are cleared before signal acquisition, e.g., by washing the reaction area on the chip with a buffer that does not affect base extension, it will be appreciated that a position appearing as a bright spot on the acquired image may correspond to a modified nucleotide bound to the template, to a modified nucleotide or fluorescent molecule that is not bound to the template but was not removed, or to another non-target substance emitting a signal in the detection area of the chip.
In one embodiment of the invention, the images are from a second-generation sequencing platform, such as the Illumina HiSeq/MiSeq series or the BGI MGISeq series; the input raw data are parameters such as the position and intensity of the acquired signals, including pixel-related information of the images, and the detection of "bright spots" on the images comprises the detection of optical signals corresponding to clusters of nucleic acid molecules.
Referring to FIG. 1, a method of identifying one or more bases in a nucleic acid by detecting an image obtained from sequencing according to an embodiment of the present invention comprises: S11, mapping the coordinates of each bright spot in the bright spot set corresponding to the template onto an image to be detected, and determining the position of the corresponding coordinates on the image to be detected; S21, determining the intensity of the signal at the position of the corresponding coordinates on the image to be detected, the intensity being a corrected intensity; and S31, comparing the intensity of the signal at the position of the corresponding coordinates on the image to be detected with a first preset value, and judging the base type corresponding to the position based on the comparison result, thereby realizing base identification.
The set of bright spots corresponding to the template is constructed based on a group of images, each of which contains a plurality of bright spots; the group of images and the image to be detected both come from sequencing and correspond to the same field of view (FOV); the group of images comes from at least one round of sequencing, and at least a portion of the signals appear as at least a portion of the bright spots on the group of images.
The method can rapidly and accurately identify bases, and thereby rapidly and accurately determine the nucleotide/base order of at least a portion of the sequence of the template.
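The three steps S11, S21 and S31 can be illustrated with a minimal sketch. It is not the claimed implementation: the function name call_positions, the pure translation used as the coordinate mapping, the nearest-pixel lookup, the caller-supplied correct function standing in for the intensity correction, and the "not less than" comparison are all assumptions made only for illustration.

```python
def call_positions(template_spots, image, offset, first_preset_value,
                   correct=lambda value: value):
    """For each template bright spot (x, y), report whether the image to be
    detected shows a signal there (True/False), or None if it maps outside.

    Assumptions (not from the source): the mapping is a pure translation
    `offset` = (dx, dy), intensities are read at the nearest pixel, and
    `correct` is a hypothetical placeholder for the intensity correction.
    """
    dx, dy = offset
    height, width = len(image), len(image[0])
    calls = []
    for x, y in template_spots:
        # S11: map the template coordinate onto the image to be detected.
        col, row = int(round(x + dx)), int(round(y + dy))
        if not (0 <= row < height and 0 <= col < width):
            calls.append(None)  # mapped outside the field of view
            continue
        # S21: determine the (corrected) intensity at the mapped position.
        intensity = correct(image[row][col])
        # S31: compare the intensity with the first preset value.
        calls.append(intensity >= first_preset_value)
    return calls


image = [[0, 10, 0],
         [0, 0, 50],
         [0, 0, 0]]
print(call_positions([(0, 0), (1, 1), (5, 5)], image, (1, 0), 20))
```

Applied to the four images of one round, one such result per image would indicate which of the four signals is present at each template position.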
Specifically, in S11, the set of bright spots corresponding to the template includes a plurality of bright spots corresponding to the template, including intensity and coordinate information of each bright spot.
The coordinate mapping is to establish a mapping relation between an original image such as a bright spot set corresponding to a template and a target image such as an image to be detected, wherein the mapping relation comprises the step of determining the coordinate position of any bright spot of the original image after mapping.
This embodiment limits neither the method for determining the coordinates nor the method for implementing the coordinate mapping. The coordinate mapping may be implemented, for example, by the remap function of OpenCV. As for determining the coordinates of a bright spot, one bright spot on the image typically occupies one or more pixels; the coordinates of one of these pixels may be used as the coordinates of the bright spot, or the sub-pixel center coordinates of the bright spot may be determined, for example by quadratic function interpolation, and used as its coordinates.
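As an illustration of quadratic function interpolation, the following sketch fits a parabola through the brightest pixel and its two neighbors along each axis and takes the vertex as the sub-pixel center. The function names, and the choice to fit the two axes independently, are assumptions for illustration; the source does not specify the interpolation scheme.

```python
def parabolic_offset(left, center, right):
    """Vertex position, relative to the middle sample, of the parabola
    through (-1, left), (0, center), (1, right)."""
    denom = left - 2.0 * center + right
    if denom == 0:
        return 0.0  # flat neighborhood: keep the integer position
    return 0.5 * (left - right) / denom


def subpixel_center(image, row, col):
    """Sub-pixel (x, y) of a bright spot whose brightest pixel is at
    image[row][col]; the pixel must not lie on the image border."""
    dx = parabolic_offset(image[row][col - 1], image[row][col], image[row][col + 1])
    dy = parabolic_offset(image[row - 1][col], image[row][col], image[row + 1][col])
    return col + dx, row + dy


spot = [[1, 2, 1],
        [2, 5, 4],
        [1, 2, 1]]
print(subpixel_center(spot, 1, 1))  # x is shifted toward the brighter right neighbor
```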
Specifically, in some embodiments, the input image to be detected may be a 512×512 or 2048×2048 16-bit TIFF image, and the TIFF image may be a grayscale image. For a grayscale image, the pixel value is the same as the gray value. The input image may also be a color image, in which each pixel has three pixel values; the color image can be converted into a grayscale image before the subsequent processing and detection, which reduces the amount of computation and the complexity of the image processing. The non-grayscale image may be converted into a grayscale image using, without limitation, a floating point algorithm, an integer method, a shift method, an average method, or the like.
The set of bright spots corresponding to the template may be constructed at the time of base recognition, or may be constructed and stored in advance. Here, the set of bright spots corresponding to the template is constructed in advance using a set of images acquired from at least one round of sequencing, and saved for later use.
In some examples, the four nucleotides bear different labels that, when excited during sequencing, emit signals of different colors corresponding to the different types of nucleotides/bases. The set of bright spots corresponding to the template then consists of four sets of bright spots corresponding to the four nucleotides, respectively.
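The four conversion methods named above can be sketched as follows. The source gives no coefficients; the weights used here (0.299, 0.587, 0.114 and their integer and shift approximations) are the commonly used luma weights and are an assumption, as are the function names.

```python
def gray_float(r, g, b):
    # Floating point algorithm: weighted sum with fractional weights.
    return 0.299 * r + 0.587 * g + 0.114 * b


def gray_integer(r, g, b):
    # Integer method: the same weights scaled by 1000, with rounding.
    return (299 * r + 587 * g + 114 * b + 500) // 1000


def gray_shift(r, g, b):
    # Shift method: weights scaled to sum to 256, division done by a shift.
    return (77 * r + 150 * g + 29 * b) >> 8


def gray_average(r, g, b):
    # Average method: unweighted mean of the three channels.
    return (r + g + b) // 3


def to_gray(image, method=gray_integer):
    """Convert a color image given as rows of (r, g, b) tuples."""
    return [[method(r, g, b) for (r, g, b) in row] for row in image]


print(to_gray([[(255, 255, 255), (255, 0, 0), (0, 0, 0)]]))
```

The integer and shift variants avoid floating point arithmetic, which is one way such a conversion can reduce the amount of computation.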
In one example, constructing the set of bright spots corresponding to the template using a set of images from one round of sequencing includes: adding the four nucleotides sequentially or simultaneously to a reaction system to perform one round of sequencing and obtain a group of images, wherein the group of images comprises a first image, a second image, a third image and a fourth image, which respectively capture the signals emitted by the reactions of the four nucleotides, and the reaction system comprises a template and a polymerase; performing bright spot detection on the first image, the second image, the third image and the fourth image respectively to determine the bright spots of each image, including determining the coordinates of the bright spots; aligning the group of images so that the bright spots of the group of images are in the same coordinate system; merging the bright spots on the aligned group of images to obtain a first-level bright spot set; and, according to the first-level bright spot set, establishing the bright spot sets corresponding to the four nucleotides respectively, i.e., establishing the templates of the four nucleotides/bases.
When the set of bright spots corresponding to the template is constructed, bright spot detection is performed on the set of images and the set of images is aligned, without limitation of the order. The alignment of the set of images may or may not be performed using the bright spots on the set of images, for example, marking specific locations of the detection area, and aligning the set of images according to the information of the marks of each image.
The sequencing may comprise four base extensions, for example, the four nucleotides are added sequentially to the reaction system and each independently completes base extension together with acquisition of the corresponding reaction signal; two base extensions, for example, the four nucleotides are combined in pairs and the nucleotides in each combination enter the reaction system for base extension at the same time; or only one base extension, for example, the four nucleotides undergo base extension in the reaction system simultaneously.
In one example, the four nucleotides are added simultaneously to the reaction system, and the corresponding reaction signals are acquired by an imaging system to obtain the set of images and/or the image to be detected, wherein the imaging system comprises a first laser, a second laser, a first camera and a second camera.
Further, the template is DNA, and the four nucleotides carry a first label, a second label, a third label and a fourth label respectively, for example, four fluorescent molecules whose emission spectra differ or do not completely overlap. In one round of sequencing, two of the four nucleotides emit a first signal and a second signal respectively, and the first camera and the second camera operate synchronously to acquire the first signal and the second signal respectively, obtaining a first image and a second image; the other two of the four nucleotides emit a third signal and a fourth signal respectively, and the first camera and the second camera operate synchronously to acquire the third signal and the fourth signal respectively, obtaining a third image and a fourth image. The first laser and the second laser may be two lasers emitting different wavelengths, or may come from one laser capable of emitting multiple wavelengths.
Specifically, for example, four deoxyribonucleotides dATP (sometimes abbreviated as A), dTTP (sometimes abbreviated as T), dGTP (sometimes abbreviated as G) and dCTP (sometimes abbreviated as C) carry four fluorescent dyes of ATTO-532, ROX, CY5 and IF700, respectively, whose spectral curves are shown in FIG. 2, the absorption spectra of ATTO-532, ROX, CY5 and IF700 are shown as dotted curves from left to right, the peak wavelengths of the absorption spectra are 531nm, 577nm, 651nm and 692nm, respectively, the radiation spectra/emission spectra of ATTO-532, ROX, CY5 and IF700 are shown as solid curves from left to right, and the peak wavelengths of the radiation spectra are 551nm, 602nm, 670nm and 712nm, respectively. When the light path structure of the imaging system is designed, the four dyes are excited in a two-by-two mode by adopting at least two wavelengths of lasers in consideration of the excitation efficiency of the dyes, and two cameras are used for collecting time-sharing fluorescent signals through a beam-splitting dichroic mirror and a double-band-pass filter; in other words, the first laser and the second laser can operate asynchronously, and the first camera and the second camera can operate synchronously, so that excitation of four dyes and collection of corresponding signals can be realized efficiently.
The identification and detection of bright spots on the image refers to detecting signals that may come from target molecules. This embodiment of the present invention does not limit the bright spot detection method; detection may be performed, for example, by the method disclosed in CN107918931A.
In some embodiments, detecting the bright spots includes detecting each image in the set of images using a k1×k2 matrix, including: determining that a matrix whose relation midS between center intensity and edge intensity meets a first preset condition corresponds to a bright spot, where the center intensity reflects the intensity of the center region of the matrix, the edge intensity reflects the intensity of the edge region of the matrix, the center region and the edge region together form the k1×k2 matrix, k1 and k2 are natural numbers greater than 1, and the k1×k2 matrix comprises k1×k2 pixels.
The values of k1 and k2 are related to the density and distribution of template molecules on a solid phase matrix and imaging resolution, and generally, the k1 x k2 matrix is expected to be not smaller than the size of a target bright spot, and the target bright spot corresponds to a target signal or corresponds to a target molecule/molecule cluster; preferably, it is also generally desirable that the k1 x k2 matrix is smaller than the size of two separate bright spots on the image.
In the k1×k2 matrix, k1 and k2 may be equal or unequal. Generally, k1 and k2 are each greater than 1 and less than 10.
In one example, the imaging system related parameters are: the objective lens has a magnification of 60×, the pixel size of the electronic sensor is 6.5 μm, and the minimum size (resolution) that can be resolved in the image formed by the microscope through the electronic sensor is about 0.1 μm; the obtained image or the input image may be a 16-bit grayscale or color image of 512×512, 1024×1024 or 2048×2048; one target bright spot corresponds to a single molecule, whose size is usually less than 10 nm, where the single molecule comprises one or a few molecules/nucleic acid fragments, usually fewer than 10, for example 1, 2, 3, 4 or 5 molecules; and one target bright spot occupies approximately 3×3 pixels on the image.
In another example, the imaging system related parameters are: the objective lens has a magnification of 20×, and the resolution of the image formed by the microscope through the electronic sensor is about 0.3 μm; the obtained image or the input image may be a grayscale or color image of 512×512, 1024×1024, 2048×2048 or 2560×2048; one target bright spot corresponds to one molecular cluster; and one target bright spot occupies approximately 5×5 pixels on the image.
k1 and k2 may be odd or even, and in some embodiments, both k1 and k2 are odd. In this way, the setting of the central area and the edge area of the matrix is facilitated, and the subsequent calculation is facilitated.
In one example, k1=k2=3.
The term center region and edge region are defined relatively, and for example, a region of a certain size centered on the center pixel or center subpixel of the matrix may be used as the center region, while the other regions constitute the edge regions of the matrix.
The intensity, or signal intensity, including the center intensity and the edge intensity herein, is reflected on the image and is generally related to the magnitude of pixel values; e.g., it may be the pixel value of one pixel, the average or median of multiple pixel values, the sum of multiple pixel values, or a quantity positively correlated with pixel values.
In some examples, the first preset condition is midS ≥ S1, where midS = midInts − sumInts(1:n)/n; midInts denotes the center intensity, sumInts(1:n)/n denotes the edge intensity, sumInts(1:n) denotes the sum of the pixel values of the 1st to n-th pixels of the edge region, n is a natural number not less than 4, and S1 is any value in [2,4]. The first preset condition was obtained by the inventors by training on and summarizing a large amount of image data, and is suitable for detecting bright spots in images with different signal intensities, bright spot densities and distributions from various sequencing platforms.
Specifically, k1 and k2 are each an odd number greater than 3, and the central region is a 3*3 region centered on the central pixel of the matrix. In one example, referring to fig. 3, k1=k2=5, fig. 3 illustrates a 5*5 matrix, the central area is a 3*3 area centered on the pixel labeled midS in the figure, the pixel value of any one pixel in the central area is the intensity (central intensity) of the central area, for example, the pixel value of the pixel labeled midS in the figure is the central intensity, n is 12, such as the pixel points labeled 1-12 in the figure, and S1 is 2. Therefore, the method can rapidly and effectively detect the bright spots corresponding to the target molecules, is beneficial to the construction of a bright spot set corresponding to the template, and is beneficial to the accurate identification of subsequent bases.
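The 5×5 example can be sketched as below, assuming the first preset condition reads midS = midInts − sumInts(1:n)/n ≥ S1, with the center pixel value as the center intensity. Fig. 3 is not reproduced here, so the choice of the 12 edge pixels (the outermost ring of the 5×5 matrix without its four corners) is an assumption, as are the function names.

```python
# Assumed positions (dx, dy) of the 12 edge pixels relative to the center:
# the outer ring of the 5x5 matrix with the four corner pixels left out.
EDGE_OFFSETS = [(-2, -1), (-2, 0), (-2, 1), (2, -1), (2, 0), (2, 1),
                (-1, -2), (0, -2), (1, -2), (-1, 2), (0, 2), (1, 2)]


def mid_s(image, row, col):
    """midS = center intensity minus the mean of the n = 12 edge pixels."""
    center = image[row][col]
    edge_sum = sum(image[row + dy][col + dx] for dx, dy in EDGE_OFFSETS)
    return center - edge_sum / len(EDGE_OFFSETS)


def detect_bright_spots(image, s1=2.0):
    """Return (x, y) of every 5x5 matrix center satisfying midS >= S1."""
    height, width = len(image), len(image[0])
    return [(col, row)
            for row in range(2, height - 2)
            for col in range(2, width - 2)
            if mid_s(image, row, col) >= s1]


background = [[10] * 7 for _ in range(7)]
background[3][3] = 20  # one pixel brighter than its surroundings
print(detect_bright_spots(background))
```

On real images a bright spot covering several pixels may satisfy the condition at neighboring centers as well, so some merging or peak selection step would be needed in practice; that step is outside this sketch.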
In other embodiments, the bright spot detection comprises: convolving each image in the group of images to obtain convolved images; searching the convolved image for all pixels that are the peak of a k3×k4 region, where k3 and k4 are natural numbers greater than 1 and the k3×k4 region contains k3×k4 pixels of the convolved image; and determining that a k5×k6 region centered on a peak pixel and meeting a second preset condition corresponds to a bright spot, where the second preset condition is that the pixel value of the peak pixel of the k5×k6 region is not less than S2, k5 and k6 are both natural numbers greater than 1, and S2 can be determined from the pixels of the convolved image.
The image is convolved with a convolution kernel, which is also referred to as a convolution template, a filter, a filtering template, or a scanning window. This embodiment does not limit the way in which the convolution is performed; for example, it may be performed with a related function in Matlab after the convolution kernel is set. Convolving the image typically involves optionally flipping the convolution template, then sliding the convolution template over the original image, multiplying the elements at corresponding positions, and summing the products to obtain the final result. For example, what is commonly referred to as filtering may be implemented using a Gaussian template.
In some examples, the target molecule is a cluster of nucleic acid molecules, for example, a cluster formed by amplification of a nucleic acid molecule, such as strand displacement amplification or bridge amplification; the resolution of the imaging system used for image acquisition is about 0.3 μm, and k3=k4=k5=k6=5. Further, after studying the patterns of morphology and/or intensity variation of a large number of such target molecules on images, the inventors set a convolution kernel of 5*5 size to perform the convolution. The 5*5 convolution kernel is shown in fig. 4; each mark on the kernel gives the coordinates/position of the marked pixel relative to the center pixel, with x along the horizontal direction and y along the vertical direction, in units of pixels. The image is convolved using this 5*5 convolution kernel, which includes reassigning the value of each pixel in the image. In this way, the difference between the center pixel and the edge pixels (e.g., the outermost peripheral pixels) of a 5*5 region in the image can be enhanced.
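The flip, slide, multiply and sum steps just described can be sketched in plain Python. The function name convolve2d, the zero padding at the image border, and the 3×3 Gaussian template used in the example are assumptions for illustration, not choices stated in the source.

```python
def convolve2d(image, kernel, flip=True):
    """Slide `kernel` over `image`, multiply overlapping elements and sum.

    With flip=True the kernel is first rotated by 180 degrees (convolution);
    with flip=False it is used as is (correlation). Pixels outside the image
    are treated as 0, which is an assumption of this sketch.
    """
    if flip:
        kernel = [row[::-1] for row in kernel[::-1]]
    kh, kw = len(kernel), len(kernel[0])
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + i - kh // 2, c + j - kw // 2
                    if 0 <= rr < height and 0 <= cc < width:
                        acc += image[rr][cc] * kernel[i][j]
            out[r][c] = acc
    return out


# A small Gaussian template (weights sum to 1) applied to a single bright pixel.
gaussian = [[1 / 16, 2 / 16, 1 / 16],
            [2 / 16, 4 / 16, 2 / 16],
            [1 / 16, 2 / 16, 1 / 16]]
impulse = [[0, 0, 0],
           [0, 16, 0],
           [0, 0, 0]]
print(convolve2d(impulse, gaussian))  # [[1.0, 2.0, 1.0], [2.0, 4.0, 2.0], [1.0, 2.0, 1.0]]
```

For a symmetric template such as the Gaussian one, flipping makes no difference, which is why the flip is described as optional.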
Specifically, in one example, based on a large amount of training data, the inventors set the intensity value/pixel value of every position/pixel without a coordinate mark on the convolution kernel shown in fig. 4 to 0. After the convolution operation defined by this convolution kernel, the intensity/pixel value Ints(x, y) of a pixel in the image, for example the pixel with coordinates (x, y), becomes newInts(x, y):
newInts(x, y) = (12 × Ints(x, y) − Edge8Ints(x, y, 2)) × 200 / (Ints(x, y) + Edge8Ints(x, y, 2)),
where Ints(x, y) represents the pixel value/intensity value of the pixel/position with coordinates (x, y) before convolution. To facilitate fast computation, the range of newInts(x, y) can further be limited to [0, 255]: newInts(x, y) takes the value 0 when it is smaller than 0, and the value 255 when it is larger than 255.
Edge8Ints(x, y, 2) represents the sum of the pixel values/intensity values of the 12 pixels located not less than 2 pixels from the center coordinate (x, y) in the 8 directions (8-neighborhood) of the center coordinate. In this example, Edge8Ints(x, y, 2) may be expressed as
Edge8Ints(x, y, 2) = Ints(x-2, y-1) + Ints(x-2, y) + Ints(x-2, y+1) + Ints(x+2, y-1) + Ints(x+2, y) + Ints(x+2, y+1) + Ints(x-1, y-2) + Ints(x, y-2) + Ints(x+1, y-2) + Ints(x-1, y+2) + Ints(x, y+2) + Ints(x+1, y+2),
where the twelve terms represent the intensity values/pixel values before convolution of the positions/pixels with coordinates (x-2, y-1), (x-2, y), (x-2, y+1), (x+2, y-1), (x+2, y), (x+2, y+1), (x-1, y-2), (x, y-2), (x+1, y-2), (x-1, y+2), (x, y+2), and (x+1, y+2), respectively.
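The reassignment newInts(x, y) = (12 × Ints(x, y) − Edge8Ints(x, y, 2)) × 200 / (Ints(x, y) + Edge8Ints(x, y, 2)), limited to [0, 255], can be sketched as follows. The sketch assumes that x indexes columns and y indexes rows, that the 12 edge pixels are the outer ring of the 5×5 neighborhood without its corners, and that a zero denominator yields 0; the last point is not addressed in the source, and the function names are illustrative.

```python
# (dx, dy) offsets of the 12 pixels summed by Edge8Ints(x, y, 2).
EDGE8_OFFSETS = [(-2, -1), (-2, 0), (-2, 1), (2, -1), (2, 0), (2, 1),
                 (-1, -2), (0, -2), (1, -2), (-1, 2), (0, 2), (1, 2)]


def edge8_ints(image, x, y):
    """Sum of the 12 edge pixel values around (x, y), before convolution."""
    return sum(image[y + dy][x + dx] for dx, dy in EDGE8_OFFSETS)


def new_ints(image, x, y):
    """Reassigned value of pixel (x, y), limited to the range [0, 255]."""
    center = image[y][x]
    edge = edge8_ints(image, x, y)
    denom = center + edge
    if denom == 0:
        return 0  # assumption: an all-zero neighborhood maps to 0
    value = (12 * center - edge) * 200 / denom
    return max(0, min(255, value))


flat = [[10] * 5 for _ in range(5)]
print(new_ints(flat, 2, 2))   # uniform region: 12*10 - 120 = 0, so the result is 0
flat[2][2] = 100
print(new_ints(flat, 2, 2))   # strong center: the raw value exceeds 255 and is clamped to 255
```

A uniform region maps to 0 and a pixel much brighter than its 5×5 periphery saturates at 255, which is how the operation enhances the difference between center and edge pixels.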
Optionally, the image is Gaussian filtered before the convolution, and the convolution operation is carried out on the Gaussian-filtered image.
Fig. 5 shows a comparison of images before and after convolution in the manner described above, with the upper image before convolution and the lower image after convolution, and the boxes in the figures illustrate changes in signal strength and/or morphology over the same region in the images before and after convolution.
It will be appreciated that the size of the convolution kernel, the values in the convolution kernel, and the value of n in Edge8Ints(x, y, n) may be adjusted as needed for target molecules with different characteristics, e.g., different morphology and/or intensity variation on the image. For adjusting n, typically, if the size of the ideal bright spot is known to be m×m, n may be set to m/2 rounded down.
The values of k3 and k4, or of k5 and k6, are likewise related to the density and distribution of template molecules on the solid phase matrix and to the imaging resolution. It is generally desirable that k3×k4 or k5×k6 be no smaller than the size of one target bright spot, a target bright spot being one that corresponds to a target signal or to a target molecule/molecular cluster; preferably, it is also generally desirable that k3×k4 or k5×k6 be smaller than the size of two separate bright spots on the image.
k3 and k4, or k5 and k6 may or may not be equal. Generally, the values of k3, k4, k5 and k6 are all in the range of more than 1 and less than 10.
In some embodiments, k3 is equal to k4, and/or k5 is equal to k6.
In some embodiments, k3 and k4 are each an odd number greater than 1, and/or k5 and k6 are each an odd number greater than 1. Further, for a target bright spot corresponding to a platform of one molecular cluster, for example, one template is amplified to form one molecular cluster, the molecular cluster is fixed on a microsphere or a chip surface, typically, the size of the one molecular cluster is hundreds of nanometers, k3 and k4 can be both odd numbers greater than 3 and/or k5 and k6 can be both odd numbers greater than 3 under an imaging light path of 20 times magnification. Thus, the method is convenient for calculation, is beneficial to the construction of a bright spot set corresponding to the template, and is also beneficial to the accurate identification of subsequent bases.
S2 is related to the pixels of the transformed image; e.g., S2 may be determined from all pixels of the transformed image. In some embodiments, with the pixels ordered in ascending pixel value, S2 is not less than the median of all pixels of the convolved image and/or is not greater than the eighteenth median of all pixels of the transformed image. In one example, the input image is converted to a 256 color map (16 bitmap), and S2 may be set to any value from 19 to 25. Thus, the bright spots can be detected effectively.
In one example, the original image is subjected to Gaussian filtering and then to a convolution operation to obtain a convolved image. All points having a peak (bright spots) are then found on the convolved image, the peak being required to be larger than a specific value, for example any value from 19 to 25; in general, the larger the peak, the brighter the point and the better its morphology. Specifically, each pixel on the transformed image has a value midS, and the more convex the position, the higher the corresponding midS; here 25 is set as the filtering threshold, and a position with midS larger than 25 can be considered a convex point. Further, for all points satisfying the above conditions, the sub-pixel coordinates are determined on the original image using the 3*3 area barycenter method.
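The peak search and the 3*3 barycenter step of this example can be sketched as follows. This is only an illustrative sketch: the function names find_peaks and centroid_3x3 are hypothetical, the Gaussian filtering and convolution are assumed to have been performed already (their kernels are not reproduced here), and a peak is assumed to mean a pixel that exceeds the threshold and is the maximum of its 3*3 neighbourhood.

```python
import numpy as np

def find_peaks(transformed, threshold):
    """Return (row, col) of interior pixels whose value exceeds `threshold`
    and equals the maximum of their 3x3 neighbourhood (assumed peak rule)."""
    h, w = transformed.shape
    peaks = []
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            window = transformed[r - 1:r + 2, c - 1:c + 2]
            if transformed[r, c] > threshold and transformed[r, c] == window.max():
                peaks.append((r, c))
    return peaks

def centroid_3x3(original, r, c):
    """Intensity-weighted centre (barycenter) of the 3x3 area around (r, c),
    computed on the original image to give sub-pixel coordinates."""
    window = original[r - 1:r + 2, c - 1:c + 2].astype(float)
    rows, cols = np.mgrid[r - 1:r + 2, c - 1:c + 2]
    total = window.sum()  # assumed to be positive for a real bright spot
    return (rows * window).sum() / total, (cols * window).sum() / total

# Toy image: one spot whose intensity is shared by two neighbouring pixels.
image = np.zeros((7, 7))
image[3, 3] = 100.0
image[3, 4] = 50.0
peaks = find_peaks(image, threshold=25)   # the toy image stands in for the transformed image
subpixel = [centroid_3x3(image, r, c) for r, c in peaks]
```

With a flat plateau of equal values this rule would report every plateau pixel, so a real implementation would probably need a tie-breaking step.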
In some embodiments, after the bright spots on the image are detected using the method of any of the examples above, the method further comprises screening the detected bright spots based on the intensity of the area on the original image where each bright spot is located. Removing relatively dark or particularly bright spots, i.e., signals that most likely are not, or are not solely, from target molecules, helps reduce the amount of calculation and increase the proportion of high-quality off-machine (sequencer output) data.
In still other embodiments, the so-called bright spot detection includes detecting each image of the so-called set of images using a k7 x k8 matrix, including: judging that a k7 x k8 matrix in which the pixel values of a plurality of pixels in a designated direction fluctuate monotonically corresponds to one candidate bright spot; and screening the candidate bright spots using the pixels of at least a part of the area in the corresponding k7 x k8 matrix to determine the so-called bright spots, wherein k7 and k8 are natural numbers larger than 1, and the k7 x k8 matrix comprises k7 x k8 pixels.
Similarly, the values of k7 and k8 generally relate to the density and distribution of the template molecules on the solid phase matrix and to the imaging resolution. It is generally desirable that k7 x k8 be no smaller than the size of a target bright spot, which corresponds to the target signal, i.e., to a target molecule or molecular cluster; preferably, it is also generally desirable that k7 x k8 be smaller than the size of two separate bright spots on the image.
In some examples, k7 is equal to k8, and/or k7 and k8 are each an odd number greater than 1. Thus, the method is convenient for calculation, construction of a bright spot set corresponding to the template and subsequent base recognition.
The designated direction may be any direction through the center of the k7 x k8 matrix, e.g., through the center pixel or sub-pixel; the so-called monotonic fluctuation means that, about the center of the k7 x k8 matrix, the pixel values of the plurality of pixels in the designated direction vary without fluctuation, symmetrically or approximately symmetrically.
In one example, fig. 6 shows a 5*5 matrix in which the pixel values of a plurality of pixels in a designated direction fluctuate monotonically; the designated direction may be the direction indicated by any of the arrows, a0, a1, a2, and a3 indicate pixel values of the pixels, and the matrix corresponds to one candidate bright spot.
Screening the candidate bright spots using the pixels of at least a part of the area in the corresponding k7 x k8 matrix further removes relatively dark or particularly bright spots, i.e., signals that most likely are not, or are not solely, from target molecules, which reduces the amount of calculation and increases the proportion of high-quality off-machine data.
For example, the average value or the high frequency (most frequent) value of all pixels, or of any row or column of pixels, in the k7 x k8 matrix corresponding to a candidate bright spot is taken as the background, and the candidate bright spots are screened by comparing the intensity at the center of the candidate bright spot with the background. For example, the screening condition is set so that the intensity at the center of the candidate bright spot is not less than 3 times the background, and a candidate bright spot satisfying the condition is called a bright spot. Thus, the proportion of high-quality data in the off-machine data can be increased.
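The background comparison described above can be sketched as follows. The function name passes_background_screen and its parameters are hypothetical; the sketch takes the background over all pixels of the matrix (the row or column variant is omitted) and interprets the high frequency value as the most frequent pixel value, which is an assumption.

```python
import numpy as np

def passes_background_screen(patch, ratio=3.0, use_mode=False):
    """Keep a candidate bright spot if its centre intensity is at least
    `ratio` times the background of its k7 x k8 patch."""
    patch = np.asarray(patch)
    if use_mode:
        # Assumed reading of "high frequency value": the most frequent pixel value.
        values, counts = np.unique(patch, return_counts=True)
        background = float(values[np.argmax(counts)])
    else:
        background = float(patch.mean())
    center = float(patch[patch.shape[0] // 2, patch.shape[1] // 2])
    return center >= ratio * background

bright = np.full((5, 5), 10)
bright[2, 2] = 100   # centre well above 3x the background
dim = np.full((5, 5), 10)
dim[2, 2] = 20       # centre only 2x the background
```

Here passes_background_screen(bright) keeps the candidate, while passes_background_screen(dim) rejects it.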
In some examples, the detection of the bright spots further includes determining sub-pixel coordinates of the detected bright spots using a barycenter method. Thus, coordinate information of the bright spots is obtained.
A system for realizing imaging based on an optical path system generally inevitably has chromatic aberration, and the chromatic aberration generally causes a static signal to have different positions in a plurality of images acquired at different time points; in addition, if the sequencing platform is used to perform multiple image acquisitions on one field of view on the chip based on the relative motion of the imaging system and the chip, the image acquisitions of the same field of view for different rounds of sequencing involve the mechanical motion of the related structures, and generally cause the same field of view to have different positions in the images acquired at different time points. The alignment of a set of images may at least to some extent correct for positional deviations due to the reasons mentioned above.
In some embodiments, aligning the set of images includes transforming the coordinate systems of the second, third, and fourth images, respectively, with respect to any of the set of images, such as with respect to the coordinate system of the first image, such that the coordinate systems of the set of images are the same.
The embodiments of the present invention do not limit the method of converting the coordinate system; the conversion may be performed, for example, using relevant MatLab functions.
Specifically, in one round of sequencing, the four images of one field of view come from four wavebands of two cameras. Although optical adjustment is performed as far as possible, a pixel offset (chromatic aberration) still exists among the four images; in general, with the optical setting unchanged, the offset caused by the chromatic aberration can be considered fixed. If the set of images is from the first round of sequencing (cycle 1) or the first few rounds of sequencing, there is typically no or insignificant crosstalk between specified pairs of the four signals corresponding to the four bases. For example, ATGC carry the four fluorescent dyes ATTO-532, ROX, CY5 and IF700, respectively, and in any of the first few rounds of sequencing the first and second cameras acquire signals simultaneously at one time point and again at another time point. In the images/signals acquired in such a round, crosstalk will typically occur between the A and T signals or between the G and C signals, whereas crosstalk between the C and T signals or between the A and G signals is not significant; insignificant crosstalk between the C and T signals manifests as no C signal being acquired at a position where a T signal is acquired (C is not bright there). It is therefore typically difficult to determine the offset from the several images acquired in the first round or the first few rounds of sequencing.
Thus, in some examples, aligning the so-called set of images includes using images from an Mth round of sequencing, M being, for example, greater than 20, 30 or 50. One round of sequencing generally determines the base type at one position on the template. When sequencing proceeds to round M (cycle M), for example round 20, 50, 80, 100 or 150, cross-color due to partial overlap of the emission spectra of the fluorescent dyes and/or phase loss due to asynchronous chemical reactions has accumulated or superposed, so that crosstalk among the signals representing the four bases is generally obvious, and the images acquired in round M can be used to determine the offset and align the set of images.
In some examples, the images of the Mth round of sequencing include a fifth image, a sixth image, a seventh image, and an eighth image, the so-called fifth image, sixth image, seventh image, and eighth image corresponding to the same nucleotides as the first image, the second image, the third image, and the fourth image of the set of images, respectively.
In one example, the set of images used to construct the bright spot set corresponding to the template also comes from the Mth round, and the fifth, sixth, seventh, and eighth images are identical to the first, second, third, and fourth images in the set of images, respectively.
In one example, the so-called set of images is from cycle 1, and the offset is determined using the images of the same field of view from the 100th round (cycle 100) to align the set of images. Specifically, for example, converting the coordinate systems of the sixth image, the seventh image, and the eighth image with reference to the coordinate system of the fifth image, respectively, may include: dividing the fifth image and the sixth image in the same way into a group of blocks of size k9 x k10, wherein k9 and k10 are natural numbers greater than 30, and a k9 x k10 block comprises k9 x k10 pixels; determining the offset of each block of the sixth image relative to the corresponding block of the fifth image; and aligning the second image and the first image based on the offset. Similarly, the third image and the first image, and the fourth image and the first image, are aligned, so that alignment of the set of images is achieved quickly and accurately.
k9 and k10 may be equal or unequal. The values of k9 and k10 are limited by the distribution, density and imaging resolution of the target molecules/clusters over the detection area, and it is desirable that the number of bright spots present on a block of k9 x k10 is statistically significant, e.g. greater than 30, 50, 100 or 500.
Assuming that the offset among the multiple images of a particular field of view in a round of sequencing due to chromatic aberration is fixed, it will be appreciated that, as long as there is signal crosstalk between every two images of a round of sequencing of that field of view, the images of that round can be used to determine the so-called fixed offset, regardless of the morphology of the crosstalk signals on the images, and thus to align the set of images used to construct the bright spot set corresponding to the template. In some cases, grid signals regularly distributed in the transverse and longitudinal directions can be set on different detection areas, such as different chips, or on the same detection area as characteristic information (information sources). The characteristic information can be imaged in different channels or wavebands, i.e., it is acquired together with the base signals, and the images can be easily aligned using the distribution rule of the characteristic information. After the image is divided into a plurality of blocks, the offsets of a group of corresponding blocks can be determined by aligning the characteristic information.
The image is divided into blocks, and adjacent blocks may or may not overlap. In one example, adjacent blocks do not overlap and there is a common edge or vertex between adjacent blocks.
Fig. 7 illustrates, in one embodiment, a process of dividing an image into a plurality of blocks and determining the offsets between blocks to align the images of one round for one field of view. The black squares in the figure represent the so-called blocks. Specifically, with the coordinate system of the blocks on the image corresponding to base G (abbreviated as the G image) as reference, the offsets of the corresponding blocks on the A, C, and T images relative to the blocks on the G image are determined.
In testing it was found that the offset between corresponding blocks is not fixed, i.e., block pairs located at different positions on the images do not have the same offset; e.g., the offset of two blocks located in the central region of the two images is 5 pixels and the offset of two blocks located in the edge region of the two images is 10 pixels. Moreover, the difference between the offsets of adjacent block pairs is small. For example, for an image of 4112 x 2176, the offset along the long side (4112) is 4-5 pixels and the offset along the short side (2176) is approximately 2-3 pixels. In one example, k9=k10=100, and in general the offset can be considered constant within one block of size 100×100. Fig. 8 illustrates the offsets between at least a portion of the corresponding block pairs after the two images are divided into blocks of size 100×100; the offset table illustrated in fig. 8 may represent the relationship between the coordinate systems of the two images.
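One possible way to build such a block-wise offset table is sketched below. The text does not specify how the offset of a block is computed; this sketch assumes FFT-based cross-correlation of the raw pixel blocks and integer-pixel offsets, and the function names block_offset and offset_table are hypothetical.

```python
import numpy as np

def block_offset(ref_block, mov_block):
    """Integer (row, col) shift of mov_block relative to ref_block, taken
    from the peak of their circular cross-correlation (assumed method)."""
    a = mov_block - mov_block.mean()
    b = ref_block - ref_block.mean()
    corr = np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))).real
    dr, dc = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = corr.shape
    # Indices past the midpoint correspond to negative shifts.
    if dr > h // 2:
        dr -= h
    if dc > w // 2:
        dc -= w
    return int(dr), int(dc)

def offset_table(ref, mov, k9=100, k10=100):
    """Divide both images into k9 x k10 blocks in the same way and return
    an array of shape (n_block_rows, n_block_cols, 2) of per-block offsets."""
    rows, cols = ref.shape[0] // k9, ref.shape[1] // k10
    table = np.zeros((rows, cols, 2), dtype=int)
    for i in range(rows):
        for j in range(cols):
            sl = (slice(i * k9, (i + 1) * k9), slice(j * k10, (j + 1) * k10))
            table[i, j] = block_offset(ref[sl], mov[sl])
    return table

# Synthetic check: the second image is the first shifted by (3, -2) pixels.
rng = np.random.default_rng(0)
ref = rng.random((200, 300))
mov = np.roll(ref, (3, -2), axis=(0, 1))
table = offset_table(ref, mov)
```

In this synthetic case every block reports the same offset; on real images the table would be expected to vary smoothly from the center to the edges, as described above.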
In some embodiments, the combining the bright spots on the aligned set of images includes combining a plurality of bright spots within a predetermined range k11 by k12 into one bright spot, where k11 and k12 are natural numbers greater than 1, and k11 by k12 includes k11 by k12 pixels.
Typically, k11 x k12 is set to be no greater than the size of two separate target bright spots, and preferably k11 x k12 is set to be no greater than the size of one target bright spot.
In some imaging systems, for example, the pixel size of the electronic sensor is 6.5 μm, the magnification of the microscope is 60 times, and the resolution is 0.1 μm; the size of a bright spot corresponding to a target molecule, including a molecular cluster, is then generally less than 10×10 or 5×5 pixels. k11=k12=3 can be set, i.e., the preset range is set to 3×3 for merging bright spots, so that the bright spot set corresponding to the template can be accurately constructed.
Specifically, when merging the bright spots within the preset range, a blank set/blank image/blank template (template vec) may be set first, and the bright spots on the first image, the second image, the third image and the fourth image (abbreviated as the A image, C image, G image and T image) are then marked on the blank image in sequence. When a bright spot is being marked, if another bright spot is already present at a nearby position (within the preset range), the position of the new bright spot obtained by merging the two can be determined according to the intensities of the two bright spots. For example, bright spot 1 has an intensity of 350 and coordinates (3.0, 5.0), and bright spot 2 has an intensity of 150 and coordinates (4.0, 7.0); the two bright spots are marked as one new bright spot with an intensity of 290 and coordinates (3.3, 5.6). In this way, the bright spots on the set of images that meet the preset condition are merged, so that the bright spot set corresponding to the template is conveniently obtained.
Fig. 9 illustrates a process of constructing the bright spot set corresponding to the template using a set of images from one round of sequencing, including: detecting and identifying the bright spots on the A image, C image, G image, and T image in the set of images to obtain the bright spot set of each image; aligning the set of images with the coordinate system of the G image as the reference coordinate system; merging the bright spot sets of the images to obtain a first bright spot set; and converting the coordinate system of the first bright spot set into the original coordinate systems of the A image, C image, G image, and T image to obtain the bright spot sets corresponding to the four nucleotides/bases, i.e., the templates of the four nucleotides/bases.
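The numbers in this example are reproduced by an intensity-weighted average of both the coordinates and the intensities, as in the sketch below. This weighting is inferred from the example values rather than stated in the text, and the function name merge_spots and the (intensity, x, y) tuple layout are hypothetical.

```python
def merge_spots(spot1, spot2):
    """Merge two bright spots given as (intensity, x, y) tuples.

    Assumed rule: the new coordinates and the new intensity are averages
    weighted by the intensities of the two original spots."""
    (i1, x1, y1), (i2, x2, y2) = spot1, spot2
    total = i1 + i2
    x = (i1 * x1 + i2 * x2) / total
    y = (i1 * y1 + i2 * y2) / total
    intensity = (i1 * i1 + i2 * i2) / total
    return intensity, x, y

# Values from the example: spot 1 = 350 at (3.0, 5.0), spot 2 = 150 at (4.0, 7.0).
merged = merge_spots((350, 3.0, 5.0), (150, 4.0, 7.0))
```

Under this assumption the merged spot has intensity 290 and coordinates (3.3, 5.6), matching the example; the brighter spot pulls the merged position toward itself.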
In this embodiment, the intensity in S21 is the corrected intensity. In some embodiments, correcting the intensity includes cross-color correction and/or phase correction.
Specifically, before the intensity at the positions of the corresponding coordinates on the image to be inspected is corrected, the image to be inspected is aligned to the so-called bright spot set corresponding to the template. This facilitates the subsequent steps.
In one example, ATGC carry the four fluorescent dyes ATTO-532, ROX, CY5, and IF700, respectively, which are excited in sequencing with lasers of two wavebands, and two cameras collect the fluorescent signals simultaneously after each excitation. FIG. 10 shows, according to this example, the crosstalk plots between every two of the four images of one field of view in round 50, from top to bottom: the base A-C crosstalk plot (abscissa is the relative intensity of the A signal, ordinate is the relative intensity of the C signal), the base A-G crosstalk plot (abscissa A, ordinate G), the base A-T crosstalk plot (abscissa A, ordinate T), the base C-G crosstalk plot (abscissa C, ordinate G), the base C-T crosstalk plot (abscissa C, ordinate T), and the base G-T crosstalk plot (abscissa G, ordinate T), one point on each plot representing the position of one corresponding coordinate on the image. As can be seen from the two arms of each crosstalk plot and the dispersion of the points on the plot, the A signal (A image) suffers more significant crosstalk from the T signal and the C signal (C image) suffers more significant crosstalk from the G signal; that is, many positions of corresponding coordinates on the A image have a significant T signal, and many positions of corresponding coordinates on the C image have a significant G signal.
In some examples, the correction for intensity includes cross color (cross talk) correction based on at least one of images from the same round of sequencing, the same field of view, and corresponding to different types of nucleotides/bases.
Correcting the crosstalk facilitates accurate identification of the bases. In some examples, the image Xi and the image to be detected come from the same round of sequencing and correspond to the same field of view, the image to be detected suffers crosstalk from the signal of the nucleotide corresponding to the image Xi, and performing cross-color correction on the image to be detected includes: fitting the signals at the positions of a plurality of corresponding coordinates in a specific area of the image to be detected to obtain a fitting result; and correcting the signals at the positions of the corresponding coordinates on the image to be detected based on the fitting result. In this way, the crosstalk on the image to be detected from the signal of the base corresponding to the image Xi can be eliminated, so that the signals in the image to be detected correspond to only one base as far as possible, which facilitates accurate identification of the bases and accurate determination of the nucleotide sequence.
Unless otherwise stated, "AC correction", "A->C" or "A-C" denotes correcting the crosstalk of the A signal received at the positions of the corresponding coordinates on the C image (i.e., correcting the crosstalk of the A signal to the C signal); similarly, "TA correction" or "T->A" denotes correcting the crosstalk of the T signal received at the positions of the corresponding coordinates on the A image (i.e., correcting the crosstalk of the T signal to the A signal), "CG correction" or "C->G" denotes correcting the crosstalk of the C signal received at the positions of the corresponding coordinates on the G image (i.e., correcting the crosstalk of the C signal to the G signal), and so on.
The four-dimensional data are corrected two by two, giving 12 cases. The correction process can be expressed as a matrix relation between a vector of observed values and a vector of true values (corrected values) through a matrix of crosstalk coefficients. Each value in the matrix represents the fitting result (correction coefficient) for two signals; e.g., $R_{AC}$ represents the fitting result/correction coefficient with which the crosstalk of the A signal received at the position of the corresponding coordinate on the C image is corrected.
The specific area may be the whole image to be inspected or a part of it. Preferably, the specific area is selected from at least a part of the central region of the image to be inspected. The central region of an image has its ordinary meaning; for example, for an image of size 4000×2000, the central region may be 3000×1500, 2056×1024, 2000×1500, 1024×1024, 1024×512, 1000×500, 1000×1000, 512×512 or 512×256, and the other regions of the image may be referred to as edge regions. In general, the intensity values at the positions of the corresponding coordinates within the central region of the image fluctuate less and appear more convergent on the crosstalk plot, as shown by the dots within the black circle in the A-G crosstalk scatter plot illustrated in fig. 11. Correcting with the fitting result/correction coefficient determined by fitting the intensity values of at least a part of the positions in this region allows the cross-color correction to be achieved quickly and accurately.
This embodiment does not limit the fitting method; for example, the fitting can be performed using software such as the MatLab cftool curve fitting toolbox, aTool, or CurveExpert, and the fitting may be linear or nonlinear. There is no particular limitation on the amount of data or samples used for the fitting, i.e., on how many intensities at positions of corresponding coordinates in the specific area of the image are selected, as long as in principle the Y coefficients of the equation in Y unknowns to be fitted can be solved; for example, 2, 5, 10, 20, 30 or 50 may be taken for a linear fit. Preferably, the sample size is statistically significant, such as not less than 20, 30, or 50; alternatively, the sample size may be limited to less than 200 or less than 100 so that the amount of calculation is not too large. In this way, the correction can be accurately achieved using the corresponding fitting result (correction coefficient).
In some examples, a linear fit is performed. Therefore, the method is convenient to calculate, takes less time and is beneficial to quick correction.
Specifically, referring to fig. 12 and 13, in one example, the image to be inspected is the A image and the image Xi is the T image, and the intensities of the signals at the positions of 20 corresponding coordinates in the central region of the image to be inspected are selected for linear fitting. Fig. 12 shows the fitting result, in which the abscissa is the relative signal intensity of A and the ordinate is the relative signal intensity of T. The fitting result determines the slope k of the fitted straight line, and the slope is used as the correction coefficient to correct the intensity of the signal at the position of each corresponding coordinate on the image to be inspected, for example $I_{T'} = I_T - I_A \times k$, where $I_{T'}$ is the corrected T signal intensity at the position, $I_T$ is the observed T signal intensity (observed value) at the position, and $I_A$ is the observed A signal intensity (observed value) at the position. Fig. 13 illustrates the signals at the positions of the 20 corresponding coordinates on the image to be inspected before and after correction by this means. In this way, the contribution of the T signal to the fluctuation of the signal intensity at the positions of the corresponding coordinates on the image to be inspected can be eliminated or reduced, and the corrected image to be inspected is obtained.
In comparison with fig. 10, fig. 14 shows the crosstalk plots between every two of the four images of the same field of view in the same round of sequencing after cross-color correction in the manner of the above example. It can be seen that, through the cross-color correction, the signal crosstalk between images corresponding to different bases in the same field of view and the same round is obviously reduced, which facilitates accurate identification of the bases and the reading of longer sequences.
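A minimal sketch of this linear-fit correction is given below, using a least-squares straight-line fit from NumPy in place of the fitting software named earlier. The function names fit_crosstalk_slope and correct_t_signal are hypothetical, the synthetic intensities are illustrative only, and all selected points are assumed to lie on the crosstalk arm being fitted.

```python
import numpy as np

def fit_crosstalk_slope(i_a, i_t):
    """Fit a straight line with A intensity as abscissa and T intensity as
    ordinate, and return its slope k (the correction coefficient)."""
    k, _intercept = np.polyfit(i_a, i_t, deg=1)
    return k

def correct_t_signal(i_t, i_a, k):
    """Apply I_T' = I_T - I_A * k at every position."""
    return np.asarray(i_t, dtype=float) - np.asarray(i_a, dtype=float) * k

# Synthetic data for 20 positions: T readings rise linearly with A readings.
i_a = np.linspace(100.0, 1000.0, 20)
i_t = 0.3 * i_a + 5.0
k = fit_crosstalk_slope(i_a, i_t)
i_t_corrected = correct_t_signal(i_t, i_a, k)
```

After correction the synthetic T values no longer depend on the A values; only the constant offset of 5 remains, since the formula subtracts the slope term and leaves the intercept untouched.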
Fig. 15-18 show, in one example, the signal crosstalk between two images of adjacent rounds that correspond to the same base and the same field of view, where a point on a plot represents the position of a so-called corresponding coordinate, and the abscissa and ordinate are relative signal intensities. From top to bottom and from left to right, the four phasing scatter plots in fig. 15 are the signal intensity relation plots of the two A images, the two C images, the two G images and the two T images of cycle 1 and cycle 2, respectively; the four phasing scatter plots in fig. 16 are those of the two A images, the two C images, the two G images and the two T images of cycle 30 and cycle 31; the four phasing scatter plots in fig. 17 are those of the two A images, the two C images, the two G images and the two T images of cycle 60 and cycle 61; and the four phasing scatter plots in fig. 18 are those of the two A images, the two C images, the two G images and the two T images of cycle 90 and cycle 91.
It can be seen that in this example, the phase loss phenomenon (phasing or prephasing) is more pronounced for C or T than for A or G. As the number of sequencing rounds increases, the signal crosstalk due to the phase imbalance of the chemical reactions of the various bases becomes more severe; as can be seen from fig. 18, by the 91st round of sequencing the phase loss has made it difficult to accurately distinguish whether the signal at a certain position in the T image is from the 90th round or from the 91st round of sequencing. In general, when sequencing proceeds to the end, all the positions of the corresponding coordinates are bright with uniform brightness; in this case the correct base cannot be identified, i.e., sequencing cannot be continued, and phase loss is a main factor limiting the read length of sequencing by synthesis.
The upper and lower graphs in FIG. 19 show, respectively, the relationship between the phasing ratio and the prephasing ratio of the four bases and the number of sequencing rounds in the sequencing of a nucleic acid sample; the phasing ratio and the prephasing ratio of each base increase as the number of sequencing rounds increases.
Correcting the phasing or prephasing facilitates correct identification of the bases and the reading of longer sequences. The phase correction may be performed before or after the crosstalk correction.
In some examples, the correction for intensity includes a phase correction based on at least one of the images from adjacent rounds of sequencing and corresponding to the same kind of nucleotide.
Specifically, in one example, the image Yj and the image to be tested are from two adjacent rounds of sequencing, e.g., the image Yj is from round 31 of sequencing, the image to be tested is from round 30 of sequencing, the image Yj and the image to be tested correspond to the same field of view, the image Yj and the image to be tested correspond to the same kind of nucleotide/base, e.g., a, the so-called phase correction comprises: fitting signals of positions of a plurality of corresponding coordinates in a specific area of the image to be detected to obtain a fitting result; and correcting signals of positions of corresponding coordinates on the image to be detected based on the fitting result.
Similarly, the specific area referred to here may be the whole image to be inspected or a part of it. Preferably, the specific area is selected from at least a part of the central region of the image to be inspected. The central region of an image has its ordinary meaning; for example, for an image of size 4000×2000, the central region may be 3000×1500, 2056×1024, 2000×1500, 1024×1024, 1024×512, 1000×500, 1000×1000, 512×512 or 512×256, and the other regions of the image may be referred to as edge regions. In general, the intensity values at the positions of the corresponding coordinates within the central region of the image fluctuate less and appear more convergent on the phasing scatter plot, as shown by the dots within the black circle in the phasing scatter plot of the A images of cycles 30 and 31 illustrated in fig. 20. Correcting with the fitting result/correction coefficient determined by fitting the intensity values of at least a part of the positions in this region allows the phase correction to be achieved quickly and accurately.
Similarly, this embodiment does not limit the fitting method; the fitting may be linear or nonlinear. There is no particular limitation on the amount of data or samples used for the fitting, i.e., on how many intensities at positions of corresponding coordinates in the specific area of the image are selected, as long as in principle the Y coefficients of the equation in Y unknowns to be fitted can be solved; for example, 2, 5, 10, 20, 30 or 50 may be taken for a linear fit. Preferably, the sample size is statistically significant, such as not less than 20, 30, or 50; alternatively, the sample size may be limited to less than 200 or less than 100 so that the amount of calculation is not too large. In this way, the correction can be accurately achieved using the corresponding fitting result (correction coefficient).
In some examples, a linear fit is performed to correct phasing before crosstalk correction according to the method of the above example, with $R^2 = 0.97$ for the linear fit; in other examples, signals at the same plurality of positions are fitted, and a linear fit is performed to correct phasing after crosstalk correction according to the method of the above example, with $R^2 = 0.93$ for the linear fit.
For S31, in some examples, the intensity of the signal at the position of the corresponding coordinate on the image to be detected is an array (four-dimensional data) containing four values, the signal intensity of the four nucleotides/bases corresponding to the position may be represented, for example, as { Ints a, ints T, ints G, ints C }, where Ints a, ints T, ints G, and Ints C represent the signal intensity values of bases A, T, G and C, respectively, after correction, generally, the Ints a, ints T, ints G, and Ints C have the same reference, a maximum value (max) in the array may be compared with a first preset value, and the first preset value is greater than or equal to the maximum value, and it may be determined that the base type corresponding to the position on the image is the base corresponding to the maximum value, that is the base corresponding to the maximum value is identified; if the maximum value (max) in the array is smaller than the first preset value, judging that the base type corresponding to the position on the image cannot be accurately identified, and marking the base at the position of the corresponding nucleic acid molecule as N or leaving a gap, wherein N is any one of ATGC; in some examples, reads containing N or gaps after base recognition may be further processed, e.g., to further infer the base type represented by N or gaps in the reads based on information from other reads, e.g., neighboring reads, or to be partially filtered out, etc., to improve utilization of or quality of the yield data.
In some examples, each of the values in {Ints A, Ints T, Ints G, Ints C} is a processed value, e.g., a normalized value.
In one example, a quality score (QScore) is calculated from the four-dimensional data. The QScore is in essence an a priori probability and may be calculated using known methods, for example with reference to [Ewing et al., Base-calling of automated sequencer traces using phred. I. Accuracy assessment, Genome Res. 1998 Mar; 8(3): 175-85]. Here, the inventors calculate the QScore from the ratio of the maximum value to the sum of the corrected four-dimensional data, so that the QScore lies in the range [0, 40]. Specifically, QScore = (1.0 × max/sumInts − 0.25)/0.75 × 40, where max is the maximum of Ints A, Ints T, Ints G and Ints C, and sumInts is the sum of Ints A, Ints T, Ints G and Ints C. Accordingly, the first preset value is set to 0.1, and if the QScore is greater than 0.1, the base type at the position is determined to be the base corresponding to max. Thus, base recognition can be performed efficiently.
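The QScore calculation and the comparison with the first preset value can be sketched in Python as follows. The function name `call_base`, the dictionary input, the clamping of the score to [0, 40], and the handling of a non-positive sum are assumptions for illustration; the sketch also assumes the corrected intensities are non-negative.

```python
def call_base(ints, first_preset=0.1):
    """Return (base, qscore) for corrected intensities of one position.

    ints: dict mapping 'A', 'T', 'G', 'C' to corrected, non-negative
    intensities (assumed input layout).
    """
    total = sum(ints.values())
    if total <= 0:
        # No usable signal: the base cannot be identified (assumption).
        return "N", 0.0
    base = max(ints, key=ints.get)
    # QScore = (1.0 * max / sumInts - 0.25) / 0.75 * 40
    qscore = (1.0 * ints[base] / total - 0.25) / 0.75 * 40
    # Clamp to the stated range [0, 40] to guard against rounding error.
    qscore = max(0.0, min(40.0, qscore))
    if qscore > first_preset:
        return base, qscore
    return "N", qscore
```

When all four intensities are equal, the ratio max/sumInts is 0.25 and the QScore is 0, so the position is marked N; when a single channel carries all of the signal, the QScore reaches 40.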
The logic and/or steps represented in the flowcharts or otherwise described herein may be regarded as an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable storage medium for use by, or in connection with, an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For example, a computer-readable storage medium according to an embodiment of the present invention stores a program for execution by a computer, and executing the program includes performing the method of any one of the above embodiments. The computer-readable storage medium may be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device, including but not limited to read-only memory, magnetic disks, optical disks, and the like. More specific examples of the computer-readable storage medium include the following (a non-exhaustive list): an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable storage medium may even be paper or another suitable medium on which the program is printed, since the program can be captured electronically, for instance by optically scanning the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
The above description of technical features and advantages of the base recognition method in any of the embodiments is equally applicable to the computer-readable storage medium, and will not be repeated here.
Further, embodiments of the present invention provide a computer product comprising a computer-readable storage medium of any of the above embodiments.
For example, embodiments of the present invention provide a system comprising a computer product as provided in any of the above embodiments and at least one processor for executing a program stored in the computer readable storage medium.
For example, embodiments of the present invention provide a computer program product comprising instructions for implementing the method of identifying one or more bases in a nucleic acid, which instructions, when executed by a computer, cause the computer to perform the method of identifying one or more bases in a nucleic acid as in any of the embodiments described above.
Embodiments of the present invention provide a system configured to perform the method of identifying one or more bases in a nucleic acid of any of the embodiments described above.
Referring to FIG. 21, an embodiment of the present invention provides a system 100 comprising a plurality of modules for performing the steps of the method of identifying one or more bases in a nucleic acid of any of the embodiments described above. The system 100 includes a mapping module 110, a signal determination module 120, and a comparison module 130. The mapping module 110 is configured to map the coordinates corresponding to each bright spot in the bright spot set corresponding to the template onto the image to be detected, and to determine the position of the corresponding coordinates on the image to be detected. The bright spot set corresponding to the template is constructed based on a set of images, each of which contains a plurality of bright spots; the set of images and the image to be detected both come from sequencing and correspond to the same field of view; the sequencing includes multiple rounds of sequencing in which nucleotides are added, the set of images comes from at least one round of sequencing, and at least a portion of the signals appear as at least a portion of the bright spots on the set of images.
The signal determination module 120 is configured to determine the intensity of the signal at the position, from the mapping module 110, of the corresponding coordinates on the image to be detected, the intensity being a corrected intensity. The comparison module 130 is configured to compare the intensity of the signal, from the signal determination module 120, at the position of the corresponding coordinates on the image to be detected with the first preset value, and to determine the base type corresponding to the position based on the comparison result, thereby achieving base identification.
Those skilled in the art will appreciate that, in addition to implementing the controller/processor purely as computer-readable program code, the same functions can be achieved entirely by logically programming the method steps so that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Thus, such a controller/processor may be regarded as a hardware component, and the means included in it for performing various functions may also be regarded as structures within the hardware component. Alternatively, the means for performing the various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The technical features and some descriptions of the method for identifying one or more bases in a nucleic acid in the above embodiments also apply to the system, and are not repeated here. It will be appreciated that additional features of the method of identifying one or more bases in a nucleic acid of any of the embodiments described above, including sub-steps, additional steps, alternative or preferred arrangements or treatments, etc., may be implemented by having the system or module of the system further comprise units/modules or sub-units/sub-modules.
In some examples, the system 100 further includes a bright spot set construction module 140 for constructing the bright spot set corresponding to the template, the bright spot set construction module 140 being connected to the mapping module 110.
In other examples, the mapping module 110 includes a bright spot set construction sub-module 111 for constructing the bright spot set corresponding to the template, the bright spot set construction sub-module 111 including: an image acquisition unit 1111 for obtaining a set of images from one round of sequencing in which four nucleotides are added to the reaction system sequentially or simultaneously, where the four nucleotides carry different labels and emit signals of different colors when excited, the set of images includes a first image, a second image, a third image and a fourth image acquired from the reaction signals of the four nucleotides, respectively, in the same field of view, and the reaction system includes the template and a polymerase; a bright spot detection unit 1113 for performing bright spot detection on the first image, the second image, the third image and the fourth image from the image acquisition unit 1111, respectively, to determine the bright spots of each image; an alignment unit 1115 for aligning the set of images; a merging unit 1117 for merging the bright spots on the aligned set of images from the alignment unit 1115 to obtain a first-order bright spot set; and a bright spot set establishing unit 1119 for establishing, from the first-order bright spot set from the merging unit 1117, the bright spot sets respectively corresponding to the four nucleotides.
In some examples, when the four nucleotides are added to the reaction system simultaneously, the image acquisition unit 1111 acquires the signals, to obtain the set of images and/or the image to be detected, using an imaging system that includes a first laser, a second laser, a first camera, and a second camera.
In some examples, the four nucleotides added by the image acquisition unit 1111 carry a first label, a second label, a third label and a fourth label, respectively. In the round of sequencing, the first laser is turned on to excite the nucleotides so that two of the four nucleotides emit a first signal and a second signal, respectively, and the first camera and the second camera operate synchronously to acquire the first signal and the second signal, respectively, obtaining the first image and the second image; and the second laser is turned on to excite the nucleotides so that the other two of the four nucleotides emit a third signal and a fourth signal, respectively, and the first camera and the second camera operate synchronously to acquire the third signal and the fourth signal, respectively, obtaining the third image and the fourth image.
In some examples, the bright spot detection unit 1113 detects each image in the set of images using a k1×k2 matrix, and determines that a matrix in which the relationship midS between the center intensity and the edge intensity satisfies a first preset condition corresponds to one bright spot, where the center intensity reflects the intensity of the center region of the matrix, the edge intensity reflects the intensity of the edge region of the matrix, the center region and the edge region together make up the matrix, k1 and k2 are natural numbers greater than 1, and the k1×k2 matrix contains k1×k2 pixels.
In some examples, the bright spot detection unit 1113 detects each image in the set of images using a k1×k2 matrix, k1 being equal to k2.
In some examples, the bright spot detection unit 1113 detects each image in the set of images using a k1×k2 matrix, where k1 and k2 are each an odd number greater than 1.
In some examples, when the bright spot detection unit 1113 detects each image in the set of images using a k1×k2 matrix, k1 and k2 are each an odd number greater than 3, and the center region is a 3×3 region centered on the center pixel of the matrix.
In some examples, the first preset condition when detecting each image in the set of images using the k1×k2 matrix is midS ≥ S1, where midS = midInt − sumInts(1:n)/n, midInt represents the center intensity, sumInts(1:n)/n represents the edge intensity, sumInts(1:n) represents the sum of the pixel values of the 1st to nth pixels of the edge region, n is a natural number not less than 4, and S1 is any value in [2, 4].
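The matrix-based detection described above can be sketched in Python as follows. The sketch assumes that the center intensity is the mean of the central 3×3 region, that the edge region is every other pixel of the window, and that the image is a list of rows of numeric pixel values; the function name `detect_bright_spots` and the default S1 = 3 are illustrative choices only.

```python
def detect_bright_spots(image, k1=5, k2=5, s1=3.0):
    """Return (row, col) centers of k1 x k2 windows with midS >= s1.

    Assumptions: k1 and k2 are odd and greater than 3, the center region
    is the central 3x3 block, and midInt is the mean of that block.
    """
    if k1 % 2 == 0 or k2 % 2 == 0 or k1 <= 3 or k2 <= 3:
        raise ValueError("k1 and k2 must be odd and greater than 3")
    h, w = len(image), len(image[0])
    ry, rx = k1 // 2, k2 // 2
    spots = []
    for y in range(ry, h - ry):
        for x in range(rx, w - rx):
            center, edge = [], []
            for dy in range(-ry, ry + 1):
                for dx in range(-rx, rx + 1):
                    v = image[y + dy][x + dx]
                    if abs(dy) <= 1 and abs(dx) <= 1:
                        center.append(v)
                    else:
                        edge.append(v)
            mid_int = sum(center) / len(center)     # center intensity
            mid_s = mid_int - sum(edge) / len(edge)  # midInt - sumInts(1:n)/n
            if mid_s >= s1:
                spots.append((y, x))
    return spots
```

Overlapping windows around the same bright spot may all satisfy the condition; this sketch does not deduplicate them.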
In some examples, the bright spot detection performed by the bright spot detection unit 1113 includes: convolving each image in the set of images to obtain convolved images; searching the convolved image for all pixels that are peaks within a k3×k4 region, where k3 and k4 are natural numbers greater than 1 and the k3×k4 region contains k3×k4 pixels of the convolved image; and determining that a k5×k6 region centered on a peak pixel and satisfying a second preset condition corresponds to one bright spot, where the second preset condition is that the pixel value of the peak pixel of the k5×k6 region is not less than S2, k5 and k6 are natural numbers greater than 1, and S2 can be determined from the pixel values of the convolved image.
In some examples, k3 is equal to k4, and/or k5 is equal to k6.
In some examples, k3 and k4 are each odd numbers greater than 1, and/or k5 and k6 are each odd numbers greater than 1.
In some examples, k3 and k4 are each odd numbers greater than 3, and/or k5 and k6 are each odd numbers greater than 3.
In some examples, S2 is not less than the median of all pixel values of the convolved image, and/or not greater than the eightieth percentile of all pixel values of the convolved image sorted in ascending order.
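The convolution-based detection can be sketched in Python as follows. The embodiments do not specify the convolution kernel, so the 3×3 smoothing kernel, the replicated border, the requirement that a peak be the unique maximum of its k3×k4 neighborhood, and the default of taking S2 as the median of the convolved image are all assumptions for illustration, as are the function names.

```python
import statistics

# Assumed symmetric smoothing kernel; not specified by the embodiments.
KERNEL = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]


def convolve(image, kernel=KERNEL):
    """Normalized 2D convolution with replicated borders."""
    h, w = len(image), len(image[0])
    kr = len(kernel) // 2
    norm = sum(sum(row) for row in kernel)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-kr, kr + 1):
                for dx in range(-kr, kr + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += kernel[dy + kr][dx + kr] * image[yy][xx]
            out[y][x] = acc / norm
    return out


def find_peaks(conv, k3=3, k4=3, s2=None):
    """Return (row, col) of pixels that are the unique maximum of their
    k3 x k4 neighborhood and whose value is not less than s2."""
    h, w = len(conv), len(conv[0])
    ry, rx = k3 // 2, k4 // 2
    if s2 is None:
        s2 = statistics.median(v for row in conv for v in row)
    peaks = []
    for y in range(ry, h - ry):
        for x in range(rx, w - rx):
            v = conv[y][x]
            neigh = [conv[y + dy][x + dx]
                     for dy in range(-ry, ry + 1)
                     for dx in range(-rx, rx + 1)]
            if v >= s2 and v == max(neigh) and neigh.count(v) == 1:
                peaks.append((y, x))
    return peaks
```

Each returned peak would then be taken as the center of a k5×k6 region corresponding to one bright spot.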
In some examples, the bright spot detection unit 1113 further filters the bright spots of the original image based on the intensity of the image region in which the bright spots are located.
In some examples, the bright spot detection unit 1113 detects each image in the set of images using a k7×k8 matrix, which includes determining that a k7×k8 matrix in which a plurality of pixel values along a specified direction vary monotonically corresponds to one candidate bright spot, and screening the candidate bright spots using the pixels in at least a part of the corresponding k7×k8 matrix to determine the bright spots, where k7 and k8 are natural numbers greater than 1 and the k7×k8 matrix contains k7×k8 pixels.
In some examples, k7 is equal to k8, and/or k7 and k8 are each an odd number greater than 1.
In some examples, the mapping module 110 further includes a sub-pixel coordinate determination sub-module 112 for determining the sub-pixel coordinates of the bright spots using the center-of-gravity (centroid) method.
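A center-of-gravity calculation of sub-pixel coordinates can be sketched in Python as follows. The window radius, the fallback to the integer coordinates when the window has no positive signal, and the function name `subpixel_centroid` are assumptions for illustration.

```python
def subpixel_centroid(image, row, col, radius=1):
    """Intensity-weighted centroid of the window centered on (row, col).

    image is a list of rows of pixel values; the window must lie
    entirely inside the image.
    """
    h, w = len(image), len(image[0])
    if row - radius < 0 or col - radius < 0 or row + radius >= h or col + radius >= w:
        raise ValueError("window extends beyond the image")
    total = wy = wx = 0.0
    for y in range(row - radius, row + radius + 1):
        for x in range(col - radius, col + radius + 1):
            v = image[y][x]
            total += v
            wy += v * y
            wx += v * x
    if total <= 0:
        # No positive signal: fall back to the pixel coordinates (assumption).
        return float(row), float(col)
    return wy / total, wx / total
```

For a spot whose signal is split equally between pixels (2, 2) and (2, 3), the sketch returns the coordinates (2.0, 2.5).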
In some examples, the alignment unit 1115 performs the alignment using images from the Mth round of sequencing.
In some examples, aligning the set of images includes using images from the Mth round of sequencing, where M is, for example, greater than 20, 30 or 50.
In some examples, when the alignment unit 1115 performs the alignment using the images from the Mth round of sequencing, the images of the Mth round of sequencing include a fifth image, a sixth image, a seventh image and an eighth image, which correspond to the reaction signals of the same kinds of nucleotides as the first image, the second image, the third image and the fourth image, respectively. The coordinate systems of the sixth image, the seventh image and the eighth image are each converted with the coordinate system of the fifth image as the reference, which includes: dividing the fifth image and the sixth image in the same manner into a set of blocks of size k9×k10, where k9 and k10 are natural numbers greater than 30 and each k9×k10 block contains k9×k10 pixels; determining the offset of each block of the sixth image relative to the corresponding block of the fifth image; and aligning the second image with the first image based on the offsets.
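The embodiments do not state how the offset of each block is determined. One possible approach, shown here only as a sketch, is an exhaustive search over small integer shifts that minimizes the mean squared difference between a block of the reference image and the shifted other image. The function names, the search range `max_shift`, and the matching criterion are all assumptions.

```python
def block_offset(ref, mov, top, left, k9, k10, max_shift=3):
    """Return (dy, dx) such that mov[y + dy][x + dx] best matches
    ref[y][x] over the k9 x k10 block whose corner is (top, left)."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            ssd, n = 0.0, 0
            for y in range(top, top + k9):
                for x in range(left, left + k10):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        d = ref[y][x] - mov[yy][xx]
                        ssd += d * d
                        n += 1
            if n == 0:
                continue
            score = ssd / n  # mean squared difference over the overlap
            if best is None or score < best[0]:
                best = (score, dy, dx)
    return best[1], best[2]


def block_offsets(ref, mov, k9, k10, max_shift=3):
    """Offsets for every full k9 x k10 block, keyed by (top, left)."""
    h, w = len(ref), len(ref[0])
    return {
        (top, left): block_offset(ref, mov, top, left, k9, k10, max_shift)
        for top in range(0, h - k9 + 1, k9)
        for left in range(0, w - k10 + 1, k10)
    }
```

Here `ref` would play the role of the fifth image and `mov` the sixth image; the resulting per-block offsets could then be applied to bring the second image into the coordinate system of the first image.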
In some examples, when merging the bright spots on the aligned set of images, the merging unit 1117 merges a plurality of bright spots within a preset range k11×k12 into one bright spot, where k11 and k12 are natural numbers greater than 1 and the k11×k12 range contains k11×k12 pixels.
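The merging step can be sketched in Python as a simple greedy clustering. The embodiments do not define how the k11×k12 range is applied, so the sketch assumes that a spot joins an existing cluster when it lies within half of k11 (rows) and half of k12 (columns) of the cluster's current mean position, and that the merged spot is the mean of the cluster; the function name `merge_spots` is hypothetical.

```python
def merge_spots(spots, k11=3, k12=3):
    """Merge (row, col) spots lying within a k11 x k12 range.

    Returns one averaged coordinate per cluster.
    """
    clusters = []  # each entry: [sum_row, sum_col, count]
    for y, x in sorted(spots):
        for c in clusters:
            cy, cx = c[0] / c[2], c[1] / c[2]
            if abs(y - cy) <= k11 / 2 and abs(x - cx) <= k12 / 2:
                c[0] += y
                c[1] += x
                c[2] += 1
                break
        else:
            # No nearby cluster: start a new one.
            clusters.append([y, x, 1])
    return [(c[0] / c[2], c[1] / c[2]) for c in clusters]
```

With the default 3×3 range, the same bright spot detected at (10.0, 10.0) in one image and (10.5, 10.5) in another is merged into a single spot at (10.25, 10.25).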
In some examples, the signal determination module 120 is configured to determine the intensity of the signal at the position of the corresponding coordinates on the image to be detected, the intensity being a corrected intensity, where the correction includes cross color correction and/or phase correction.
In some examples, the signal determination module 120 aligns the image to be detected with the bright spot set corresponding to the template before correcting the intensity.
In some examples, when correcting the intensity, the signal determination module 120 applies cross color correction based on at least one image that comes from the same round of sequencing and corresponds to a different kind of nucleotide.
In some examples, when correcting the intensity, the signal determination module 120 applies cross color correction that includes:
fitting the signals at the positions of a plurality of corresponding coordinates in a specific region of the image to be detected and of an image Xi to obtain a fitting result; and correcting the signal at the position of the corresponding coordinates on the image to be detected based on the fitting result. The image Xi and the image to be detected come from the same round of sequencing and correspond to the same field of view, and the image to be detected contains signals from the nucleotide corresponding to the image Xi.
In some examples, the fit is a linear fit.
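A linear-fit cross color correction can be sketched in Python as follows. The embodiments do not state how the fitting result is applied, so this sketch assumes that the fitted slope is treated as the fraction of the image Xi signal leaking into the image to be detected and that this fraction is subtracted at every position. The function name `cross_color_correct` and the argument `fit_idx` (indices of the positions used for the fit) are hypothetical.

```python
def cross_color_correct(target, other, fit_idx):
    """Correct intensities of the image to be detected.

    target: intensities at the mapped positions on the image to be detected.
    other:  intensities at the same positions on image Xi.
    fit_idx: indices of the positions used for the linear fit.
    Returns (corrected_intensities, slope).
    """
    xs = [other[i] for i in fit_idx]
    ys = [target[i] for i in fit_idx]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two positions for the fit")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("image Xi intensities are all identical")
    # Least-squares slope of target against other (assumed leakage coefficient).
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    corrected = [t - slope * o for t, o in zip(target, other)]
    return corrected, slope
```

A phase correction could follow the same pattern, with the image Yj from the adjacent round taking the place of the image Xi.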
In some examples, when correcting the intensity, the signal determination module 120 applies phase correction based on at least one image that comes from an adjacent round of sequencing and corresponds to the same kind of nucleotide.
In some examples, when correcting the intensity, the signal determination module 120 applies phase correction that includes: fitting the signals at the positions of a plurality of corresponding coordinates in a specific region of the image to be detected and of an image Yj to obtain a fitting result; and correcting the signal at the position of the corresponding coordinates on the image to be detected based on the fitting result. The image Yj and the image to be detected come from two adjacent rounds of sequencing, correspond to the same field of view, and correspond to the same kind of nucleotide.
By using the method, product and/or system of any of the embodiments of the present invention to recognize bases, bases can be recognized quickly and accurately, and determination of the nucleotide/base sequence of at least a portion of the sequence of the template can be achieved.
In the description of the present specification, reference to the terms "one embodiment," "some embodiments," "illustrative embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the invention, and that changes, modifications, substitutions and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.