CN115620809A - Nanopore sequencing data analysis method and device, storage medium and application - Google Patents

Nanopore sequencing data analysis method and device, storage medium and application

Info

Publication number
CN115620809A
Authority
CN
China
Prior art keywords
analysis
data
fragment
sequencing
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211621058.9A
Other languages
Chinese (zh)
Other versions
CN115620809B (en)
Inventor
郎继东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qitan Technology Ltd Beijing
Original Assignee
Qitan Technology Ltd Beijing
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qitan Technology Ltd Beijing
Priority to CN202211621058.9A
Publication of CN115620809A
Application granted
Publication of CN115620809B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10 Ploidy or copy number detection
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B20/50 Mutagenesis
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/106 Pharmacogenomics, i.e. genetic variability in individual responses to drugs and drug metabolism
    • C12Q2600/118 Prognosis of disease development
    • C12Q2600/154 Methylation markers
    • C12Q2600/156 Polymorphic or mutational markers

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Physics & Mathematics (AREA)
  • Analytical Chemistry (AREA)
  • Organic Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Immunology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Microbiology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Pathology (AREA)
  • Hospice & Palliative Care (AREA)
  • Oncology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The invention discloses a nanopore sequencing data analysis method, a nanopore sequencing data analysis device, a storage medium and an application thereof. The invention obtains multi-dimensional biological information from a single assay and, by exploiting the long read lengths of nanopore sequencing and its ability to read modification sites directly, can effectively distinguish biological samples, thereby providing a basis for interpreting subsequent clinical reports and for medication guidance. Compared with methylation detection and tissue-of-origin tracing solutions based on second-generation sequencing, the method greatly simplifies the experimental procedure and reduces experimental complexity, sequencing data volume and sequencing cost, which brings greater economic benefit. It also reduces the difficulty of data analysis and shortens the detection period, making it better suited to actual clinical testing needs.

Description

Nanopore sequencing data analysis method and device, storage medium and application
Technical Field
The invention relates to the field of bioinformatics, and in particular to a nanopore sequencing data analysis method, a nanopore sequencing data analysis device, a storage medium, and a method for acquiring genetic information from a biological sample and classifying the biological sample.
Background
DNA methylation is a form of chemical modification of DNA in which, under the action of a DNA methyltransferase, a methyl group is covalently bonded to the carbon-5 position of the cytosine in a genomic CpG dinucleotide. Numerous studies have shown that DNA methylation can change chromatin structure, DNA conformation, DNA stability, and the way DNA interacts with proteins, thereby controlling gene expression. DNA methylation has therefore become an increasingly important research topic.
In recent years, liquid biopsy techniques that combine next-generation sequencing (NGS) with cell-free DNA (cfDNA) and circulating tumor DNA (ctDNA) in blood have developed rapidly. They make it possible to learn about organs and tissues from a blood sample, and to trace the tissue or organ of origin of cfDNA using information such as the nucleosome-associated degradation pattern of cfDNA, organ-specific methylation sites, and the correlation between adjacent methylation sites. Such research shows the potential of cfDNA/ctDNA methylation detection for accurately detecting disease in different parts of the body and lays a foundation for clinical application. Research on methylation in tumor diagnosis and treatment is currently progressing rapidly.
Nanopore sequencing technology (NST), also known as fourth-generation sequencing or single-molecule real-time DNA sequencing, sequences each DNA molecule individually and requires no PCR amplification. Whereas NGS can only produce short reads of a few hundred bases, nanopore reads can reach thousands to tens of thousands of bases, and even ultra-long reads of several megabases, which facilitates analysis of the characteristics and length distribution of the original cfDNA/ctDNA fragments. In addition, modification information can be read directly from the sequenced molecule: the system records the change in ionic current as single-stranded DNA passes through the nanopore, and because methylated and unmethylated DNA produce different currents, the methylation level of the DNA at different sites can be measured.
Currently, one commonly used technique for detecting DNA methylation in liquid biopsy samples (e.g., cfDNA) is to obtain the methylation level of each site across the whole genome by second-generation high-throughput sequencing. The DNA is processed experimentally by one of three main approaches: bisulfite conversion; affinity enrichment with methylation-specific antibodies or MBD (methyl-CpG-binding domain) proteins; and restriction enzyme digestion combined with bisulfite treatment (RRBS). Although some panel designs that target methylation sites reduce the data volume and sequencing cost, the experimental workflow itself is not optimized and remains cumbersome, and treatments such as bisulfite degrade DNA to varying degrees, which loses part of the methylation information and obscures DNA fragmentation features. Analysis of DNA fragment characteristics and length distribution runs directly into the read-length limitation of NGS: a complete analysis is not possible, and even a partial one requires high-depth whole-genome sequencing, which generates large volumes of data and demands complicated analysis. In particular, because their experimental principles differ, assays for DNA methylation detection, for analysis of DNA fragment characteristics and length distribution, and for targeted detection of cancer hotspots in liquid biopsy samples are carried out independently. Such multi-dimensional information cannot be obtained from a single assay, which increases the amount of input DNA required and the difficulty and complexity of the experiment, and greatly increases the cost of sequencing and data analysis.
The information in this background is only for the purpose of illustrating the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art that is known to a person skilled in the art.
Disclosure of Invention
To solve at least some of the technical problems of the prior art, the present invention provides a nanopore sequencing data analysis method, apparatus, storage medium and use. The invention utilizes the nanopore sequencing technology to better solve at least part of problems in the prior art in both experiments and data analysis. Specifically, the present invention includes the following.
In a first aspect of the invention, there is provided a nanopore sequencing data analysis method, comprising:
acquiring current signal data from nanopore sequencing, wherein the current signal data comprises at least a time-series current signal Ion-A carrying information in at least two dimensions: time (horizontal axis) and signal intensity (vertical axis);
performing base calling on Ion-A to obtain sequencing data Data-A, and analyzing fragment features based on Data-A;
performing methylation detection at targeted sites based on Ion-A to obtain methylation information for the targeted sites; and
classifying the biological sample according to the fragment features and the methylation information.
In certain embodiments of the nanopore sequencing data analysis method according to the first aspect, the fragment features comprise at least one of a length distribution feature, a motif feature, and a tissue feature.
In certain embodiments of the nanopore sequencing data analysis method according to the first aspect, analysis of the length distribution feature comprises screening the reads in the sequencing data to retain those that align uniquely to the human reference genome and are not soft-clipped, counting the lengths of the retained reads, and plotting the length distribution to obtain the length distribution feature.
In certain embodiments of the nanopore sequencing data analysis method according to the first aspect, analysis of the motif feature comprises screening the reads in the sequencing data to retain those that align uniquely to the human reference genome and are not soft-clipped, and counting the frequency or relative abundance of the k-mer motif at the start of each read, where 4 <= k <= 10, to obtain the motif feature.
In certain embodiments of the nanopore sequencing data analysis method according to the first aspect, analysis of the tissue feature comprises screening the reads in the sequencing data to retain those that align uniquely to the human reference genome and are not soft-clipped, selecting fragments within a specified length range, comparing them with expression profile data from reference samples of cell lines and primary tissues and calculating the correlation, and thereby performing tissue-of-origin tracing to obtain the tissue feature.
In certain embodiments of the nanopore sequencing data analysis method according to the first aspect, the methylation detection comprises slicing Ion-A with a sliding window along the time axis at a specified step size to obtain a set DST of current signal fragments, and performing a similarity alignment of each current signal fragment in DST against a reference signal fragment set DSR, wherein DSR comprises a subset of methylated signal fragments and a subset of unmethylated signal fragments.
In certain embodiments of the nanopore sequencing data analysis method according to the first aspect, the methylation detection further comprises methylation calling based on the alignment similarity: the targeted site is called methylated if the ratio of the number of current signal fragments in DST that align to the methylated subset to the number that align to the unmethylated subset is greater than 1, and unmethylated if that ratio is less than 1.
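The ratio rule above can be sketched as a small function. This is an illustrative sketch only: the function name and the "undetermined" label are assumptions introduced here, since the text does not say how a ratio of exactly 1 (or a site with no alignments at all) is handled.

```python
def call_methylation(methylated_hits: int, unmethylated_hits: int) -> str:
    """Call a targeted site from the two alignment counts (hypothetical helper)."""
    if methylated_hits < 0 or unmethylated_hits < 0:
        raise ValueError("hit counts must be non-negative")
    # ratio > 1 is equivalent to methylated_hits > unmethylated_hits; comparing
    # the counts directly avoids dividing by zero when one count is 0.
    if methylated_hits > unmethylated_hits:
        return "methylated"
    if methylated_hits < unmethylated_hits:
        return "unmethylated"
    # A ratio of exactly 1 is not covered by the text; this label is an assumption.
    return "undetermined"


print(call_methylation(7, 3))
```

Comparing the two counts rather than computing the quotient gives the same decision for every case in which the quotient is defined.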
In certain embodiments of the nanopore sequencing data analysis method according to the first aspect, construction of the reference signal fragment set DSR comprises synthesizing a first sequence fragment containing the targeted site in methylated form and a corresponding second sequence fragment containing the targeted site in unmethylated form; performing nanopore sequencing to obtain a first reference signal fragment corresponding to the first sequence fragment and a second reference signal fragment corresponding to the second sequence fragment; and assembling the subset of methylated signal fragments from a plurality of the first reference signal fragments and the subset of unmethylated signal fragments from a plurality of the second reference signal fragments.
In a second aspect of the present invention, there is provided a nanopore sequencing data analysis device, comprising:
a. a data acquisition module configured to acquire current signal data from nanopore sequencing, the current signal data comprising a time-series current signal Ion-A;
b. a data processing module configured to perform fragment feature analysis and targeted-site methylation detection based on the current signal Ion-A. Preferably, the fragment feature analysis comprises base calling on Ion-A to obtain sequencing data Data-A. Preferably, the targeted-site methylation detection comprises slicing Ion-A with a sliding window along the time axis at a specified step size to obtain a set DST of current signal fragments, and performing a similarity alignment of each current signal fragment in DST against a reference signal fragment set DSR, wherein DSR comprises a subset of methylated signal fragments and a subset of unmethylated signal fragments. Preferably, methylation calling is further performed based on the alignment similarity: the targeted site is called methylated if the ratio of the number of current signal fragments in DST that align to the methylated subset to the number that align to the unmethylated subset is greater than 1, and unmethylated if that ratio is less than 1;
c. a data storage module configured to store at least the reference signal fragment set DSR;
preferably, further comprising:
d. a display module configured to display the calls produced by the data processing module.
In a third aspect of the present invention, there is provided a computer storage medium having a computer program stored therein, the computer program, when executed by a computer, implementing the method of the first aspect.
In a fourth aspect of the present invention, there is provided a method of obtaining genetic information from a biological sample, comprising sequencing DNA in the biological sample using nanopore technology and analyzing the sequencing data using the method according to the first aspect;
preferably, the biological sample is at least one selected from blood, saliva and urine;
preferably, the DNA is cell-free DNA.
The method of the invention, which identifies biological samples and detects targeted hotspot mutations based on nanopore sequencing, obtains multi-dimensional biological information from a single assay and, by exploiting the long read lengths of nanopore sequencing and its ability to read modification sites directly, can effectively distinguish biological samples, thereby providing a basis for interpreting subsequent clinical reports and for medication guidance. After further analysis, the results obtained by the method can be applied to the detection and postoperative monitoring of cancer in liquid biopsy samples, such as monitoring of minimal residual disease (MRD), and, to a certain extent, to early cancer screening and tissue-of-origin analysis.
In summary, compared with NGS-based methylation detection and tissue-of-origin tracing solutions, the method greatly simplifies the experimental procedure and reduces experimental complexity, sequencing data volume and sequencing cost, which brings greater economic benefit. It also reduces the difficulty of data analysis and shortens the detection period, making it better suited to actual clinical testing needs.
Drawings
FIG. 1 is a flow diagram of an exemplary biological sample analysis.
FIG. 2 is a flow diagram of another exemplary biological sample analysis.
FIG. 3 is a schematic diagram of the identification of methylation sites in the current signal.
FIG. 4 shows the sample fragment length distributions of the mutant type and the wild type of EGFR T790M.
FIG. 5 is a schematic diagram of an exemplary nanopore sequencing data analysis device.
Description of reference numerals:
100: nanopore sequencing data analysis device; 110: data acquisition module; 120: data storage module; 130: data processing module; 140: data display module; 210: internet or cloud; 220: nanopore sequencer.
Detailed Description
Reference will now be made in detail to various exemplary embodiments of the invention. The detailed description should not be construed as limiting the invention, but rather as a more detailed description of certain aspects, features and embodiments of the invention.
It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. Further, for numerical ranges in this disclosure, it is understood that the upper and lower limits of the range, and each intervening value therebetween, is specifically disclosed. Every smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in a stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although only preferred methods and materials are described herein, any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention. All documents mentioned in this specification are incorporated by reference herein for the purpose of disclosing and describing the methods and/or materials associated with the documents. In case of conflict with any incorporated document, the present specification will control. Unless otherwise indicated, "%" is percent by weight.
Data analysis method
In a first aspect of the invention, there is provided a nanopore sequencing data analysis method, comprising:
(1) Acquiring current signal data from nanopore sequencing, wherein the current signal data comprises at least a time-series current signal Ion-A carrying information in at least two dimensions: time (horizontal axis) and signal intensity (vertical axis);
(2) Performing base calling on Ion-A to obtain sequencing base data Data-A, and analyzing fragment features based on Data-A;
(3) Performing methylation detection at targeted sites based on Ion-A to obtain methylation information for the targeted sites;
(4) Classifying the biological sample according to the fragment features and the methylation information.
It will be understood by those skilled in the art that the numbers (1), (2), etc. are for the purpose of distinguishing between different steps and do not indicate the order of the steps. The order of the above steps is not particularly limited as long as the object of the present invention can be achieved. In addition, two or more of the above steps may be combined and performed simultaneously. It will be appreciated by those skilled in the art that additional steps or operations may be included before or after steps (1) - (4) above, or between any of these steps, for example to further optimize and/or improve the methods of the present invention.
Step (1)
In the present invention, step (1) is the data acquisition step. The current signal data may be obtained directly from a sequencer as it generates data from a biological sample, or retrieved from a store holding such data, such as a hard disk, a computer, the internet, or a cloud service. There may be one such store or cloud, or several. The nanopore sequencing current signal data of the invention are the current signals generated directly as bases pass through the nanopore during sequencing, and include the time-series current signal Ion-A, among others. Ion-A comprises data in at least two dimensions: time (horizontal axis) and signal intensity (vertical axis).
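By way of illustration, the two dimensions of Ion-A can be held in a simple container. The sketch below is not part of the described method: the class name, the fixed sampling rate and the picoampere unit are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CurrentSignal:
    """Hypothetical container for one raw current trace such as Ion-A."""
    sample_rate_hz: float      # assumed fixed sampling rate
    intensities: List[float]   # signal intensity per sample (assumed pA)

    def time_axis(self) -> List[float]:
        # Time dimension: sample index divided by the sampling rate.
        return [i / self.sample_rate_hz for i in range(len(self.intensities))]

    def points(self) -> List[Tuple[float, float]]:
        # (time, intensity) pairs, i.e. the two dimensions named in the text.
        return list(zip(self.time_axis(), self.intensities))


# Made-up values, for illustration only.
ion_a = CurrentSignal(sample_rate_hz=4000.0, intensities=[82.1, 80.7, 95.3, 96.0])
print(ion_a.points()[0])
```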
Step (2)
In the present invention, step (2) is the fragment feature analysis step, which comprises performing base calling on Ion-A to obtain sequencing base data Data-A and analyzing fragment features based on Data-A.
In the present invention, base calling refers to recognition based on a deep learning model or algorithm, and may be performed with a known model or algorithm. For example, the open-source basecaller Nanocall uses a hidden Markov model (HMM), a traditional machine learning algorithm, to describe the relationship between nucleotide sequence information and the electrical signal. In this algorithm, the observation is the electrical signal of each event, and the hidden state is the DNA sequence of k nucleotides corresponding to the current event. Here k is a hyperparameter: it expresses the model's assumption about how many nucleotides inside the pore determine the electrical signal of the current event. DeepNano instead applies deep learning to the basecalling problem, translating the electrical signal with a recurrent neural network (RNN), a model suited to sequence problems. The open-source basecaller Chiron improves translation accuracy with a more complex deep learning architecture consisting of a CNN, an RNN and a CTC decoder. All three basecallers translate sequences based on the event segmentation provided by ONT; in practical applications, however, that segmentation is not necessarily accurate, because real DNA sequences carry a large amount of modification information.
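The HMM idea described above (observations are per-event signal levels, hidden states are k-mers) can be illustrated with a toy Viterbi decoder. This is a simplified sketch and not Nanocall's actual implementation: it assumes exactly one base shift per event, a Gaussian emission around a per-k-mer mean, and equal transition probabilities among overlapping k-mers. The function name and the table of k-mer means are made up for the example.

```python
from itertools import product
from typing import Dict, List, Sequence


def viterbi_basecall(events: Sequence[float],
                     kmer_means: Dict[str, float],
                     sigma: float = 1.0) -> str:
    """Decode the most likely base sequence from event means (toy model).

    kmer_means must contain every k-mer of one fixed length k.
    """
    if not events:
        return ""
    states = list(kmer_means)

    def log_emission(value: float, state: str) -> float:
        # Gaussian log-density up to an additive constant.
        return -((value - kmer_means[state]) ** 2) / (2.0 * sigma ** 2)

    score = {s: log_emission(events[0], s) for s in states}
    backpointers: List[Dict[str, str]] = []
    for value in events[1:]:
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for target in states:
            # Allowed predecessors overlap the target by k-1 bases.
            predecessors = [s for s in states if s[1:] == target[:-1]]
            best = max(predecessors, key=score.__getitem__)
            # Equal transition probabilities add the same constant to every
            # path, so they are omitted from the score.
            new_score[target] = score[best] + log_emission(value, target)
            pointers[target] = best
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    path = [max(states, key=score.__getitem__)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path[0] + "".join(state[-1] for state in path[1:])


# Made-up emission table for k = 2: AA -> 0, AC -> 10, AG -> 20, ...
TOY_MEANS = {"".join(p): 10.0 * i for i, p in enumerate(product("ACGT", repeat=2))}
print(viterbi_basecall([10.2, 59.5, 110.4], TOY_MEANS))  # ACGT
```

Real basecallers must also handle events in which the strand stays in place or skips bases, which this sketch deliberately leaves out.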
In the present invention, the fragment feature analysis covers at least one of a length distribution feature, a motif feature, and a tissue feature. Preferably, the purpose of fragment feature analysis is to determine whether the DNA in the liquid sample has the characteristics of cfDNA or of ctDNA.
In the present invention, the sequencing data are preferably screened before the fragment feature analysis, for example by screening the sequencing reads to retain only those that align uniquely to the human reference genome and are not soft-clipped.
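A minimal sketch of such a screening step is given below. It works on a hypothetical in-memory record rather than on any particular alignment library, and it makes two assumptions not stated in the text: that "unique alignment" is approximated by a mapping-quality threshold together with the absence of secondary and supplementary flags, and that soft-clipping is detected from an `S` operation in the CIGAR string.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


@dataclass
class AlignedRead:
    """Hypothetical alignment record (field names are illustrative)."""
    name: str
    sequence: str
    cigar: str
    mapq: int
    is_secondary: bool = False
    is_supplementary: bool = False


def is_soft_clipped(cigar: str) -> bool:
    # A read is soft-clipped if any CIGAR operation is 'S'.
    return any(op == "S" for _, op in CIGAR_OP.findall(cigar))


def screen_reads(reads: Iterable[AlignedRead], min_mapq: int = 30) -> List[AlignedRead]:
    """Keep reads that look uniquely aligned and carry no soft-clipping."""
    kept = []
    for read in reads:
        # min_mapq is an assumed proxy for uniqueness, not a value from the text.
        unique = (read.mapq >= min_mapq
                  and not read.is_secondary
                  and not read.is_supplementary)
        if unique and not is_soft_clipped(read.cigar):
            kept.append(read)
    return kept


example = [
    AlignedRead("r1", "ACGT", "4M", 60),
    AlignedRead("r2", "ACGT", "2S2M", 60),   # soft-clipped
    AlignedRead("r3", "ACGT", "4M", 0),      # ambiguous placement
    AlignedRead("r4", "ACGT", "4M", 60, is_supplementary=True),
]
print([r.name for r in screen_reads(example)])  # ['r1']
```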
In an exemplary embodiment, the fragment feature analysis of the present invention includes length distribution feature analysis. The specific analysis method is not particularly limited; illustratively, it comprises counting the lengths of the screened reads and plotting the length distribution to obtain the length distribution feature.
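The length statistics can be sketched as follows. The 10 bp bin width, the function names and the text rendering are assumptions chosen for illustration; any plotting tool could be used in place of the text histogram.

```python
from collections import Counter
from typing import Dict, Iterable


def length_distribution(lengths: Iterable[int], bin_width: int = 10) -> Dict[int, float]:
    """Fraction of reads per length bin, keyed by the start of each bin."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {start: counts[start] / total for start in sorted(counts)}


def render(distribution: Dict[int, float], width: int = 40) -> str:
    """Draw the distribution as a plain-text histogram."""
    lines = []
    for start, fraction in distribution.items():
        lines.append(f"{start:>5} bp | " + "#" * round(fraction * width))
    return "\n".join(lines)


# Made-up read lengths, for illustration only.
dist = length_distribution([165, 167, 168, 142, 331])
print(render(dist))
```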
In an exemplary embodiment, the fragment feature analysis of the present invention includes tissue feature analysis. The specific analysis method is not particularly limited. Illustratively, it comprises further selecting, from the screened reads, fragments within a certain length range, comparing them with expression profile data from reference samples of cell lines and primary tissues and calculating the correlation, thereby performing tissue-of-origin tracing and obtaining the tissue feature. The length range is preferably 120-180 bp, for example 120-160 bp, 130-180 bp or 140-150 bp.
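The length selection and correlation steps might be sketched as below. The text does not specify how the selected fragments are turned into a quantity comparable with an expression profile, so the per-gene score vector used here is purely an assumption, as are the function names, the use of Pearson correlation, and the ranking by absolute correlation.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def select_by_length(lengths: Sequence[int], low: int = 120, high: int = 180) -> List[int]:
    """Keep fragment lengths inside the inclusive range [low, high]."""
    return [n for n in lengths if low <= n <= high]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


def rank_references(sample_scores: Sequence[float],
                    reference_profiles: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Rank reference samples by |correlation| with the sample (assumed criterion)."""
    scored = [(name, pearson(sample_scores, profile))
              for name, profile in reference_profiles.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Made-up per-gene scores and reference profiles, for illustration only.
ranking = rank_references(
    [1.0, 2.0, 3.0, 4.0],
    {"reference_a": [2.0, 4.0, 6.0, 8.0], "reference_b": [1.0, 3.0, 2.0, 4.0]},
)
print(ranking[0][0])
```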
In an exemplary embodiment, the fragment feature analysis of the present invention includes motif feature analysis. The specific analysis method is not particularly limited; illustratively, it comprises counting, for the screened reads, the frequency or relative abundance of the k-mer motif at the start of each read, where k is a natural number of 4 or more, such as 5, 7 or 9. More preferably, k is 10 or less, for example 8 or less.
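A short sketch of the motif count follows. It takes the motif to be the first k bases of each read and reports relative abundance; skipping reads that are shorter than k or that contain ambiguous bases is an assumption of this example, as is the function name.

```python
from collections import Counter
from typing import Dict, Iterable


def leading_kmer_frequencies(sequences: Iterable[str], k: int = 4) -> Dict[str, float]:
    """Relative abundance of the first k bases of each read."""
    if k < 1:
        raise ValueError("k must be at least 1")
    counts: Counter = Counter()
    for sequence in sequences:
        kmer = sequence[:k].upper()
        # Skip reads shorter than k or with non-ACGT characters (assumption).
        if len(kmer) == k and set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Made-up reads, for illustration only.
freqs = leading_kmer_frequencies(["CCCAGT", "cccatt", "ACGTAA", "NNNNAC", "AC"], k=4)
print(sorted(freqs))
```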
In certain embodiments, the fragment feature analysis of the invention proceeds as follows. First, the reads are screened to retain those that align uniquely to the human reference genome and are not soft-clipped. Second, the lengths of the retained reads are counted and a length distribution is plotted to obtain the length distribution feature. Next, the frequency or relative abundance of the k-mer motif at the start of each read is counted to obtain the motif feature. Finally, fragments within a certain length range, preferably 120-180 bp, are selected and compared with expression profile data from reference samples of cell lines and primary tissues, and the correlation is calculated, in order to perform tissue-of-origin tracing and obtain the tissue feature. The characteristics of cfDNA include, but are not limited to, fragment lengths enriched at about 167 bp (length distribution feature), strong correlation with lymphocyte cell lines, myeloid cell lines or bone marrow tissue (tissue feature), and relative abundance of cancer-associated motifs at the level seen in normal samples (motif feature). The characteristics of ctDNA include, but are not limited to, fragment lengths enriched at about 100-160 bp (length distribution feature), strong correlation with cancer cell lines (tissue feature), and lower relative abundance of cancer-associated motifs than in normal samples (motif feature).
Step (3)
Step (3) of the present invention is targeted site methylation detection based on Ion-A. It generally comprises cutting Ion-A with a sliding window in the time direction by a prescribed step length to obtain a set DST consisting of different current signal fragments, and performing similarity alignment analysis between each current signal fragment in the set DST and a reference signal fragment set DSR, wherein the reference signal fragment set DSR comprises a methylated signal fragment subset and an unmethylated signal fragment subset.
In the present invention, a current signal fragment refers to a fragment corresponding to a continuous portion of the current signal Ion-A of the full-length DNA sequence. Its length is not particularly limited and can be freely selected by a person skilled in the art according to the length of the methylation target sequence. In general, the corresponding length is, for example, 10 to 150 bp, preferably 20 to 90 bp, more preferably 25 to 80 bp, for example 20, 25, 30, 35, 40, 45, 50, 60, or 70 bp. The gene sequence corresponding to a current signal fragment in the invention is generally not shorter than 10 bp.
In the present invention, the reference signal fragment set DSR is a set comprising a subset of methylated signal fragments and a subset of unmethylated signal fragments. The subset of methylated signal fragments is generally composed of at least one methylated signal fragment, which is not particularly limited as long as it includes a current signal corresponding to methylation of the targeted site; it may be one signal fragment corresponding to a given sequence or a plurality of different signal fragments corresponding to the same sequence. The position of the methylated site within the DNA sequence fragment corresponding to a methylated signal fragment is not particularly limited, so that different positions of the methylation site in the DNA sequence fragment correspond to different methylated signal fragments. Similarly, the subset of unmethylated signal fragments is typically composed of at least one unmethylated signal fragment, which is not particularly limited as long as it includes a current signal corresponding to the unmethylated targeted site; it may be one signal fragment corresponding to a given sequence or a plurality of different signal fragments corresponding to the same sequence.
In the present invention, the reference signal segment set DSR is typically a pre-constructed set of standard reference signal segments. In an exemplary embodiment, construction of the reference signal fragment set DSR of the invention includes synthesizing a first base sequence fragment containing a methylated targeting site and a second base sequence fragment containing an unmethylated targeting site, and performing nanopore sequencing to obtain a methylated signal fragment corresponding to the first base sequence fragment and an unmethylated signal fragment corresponding to the second base sequence fragment. A subset of methylated signal fragments consisting of a plurality of such methylated signal fragments and a subset of unmethylated signal fragments consisting of a plurality of such unmethylated signal fragments are thereby obtained, and together they constitute the reference signal segment set DSR. The number of reference signal segments in the reference signal segment set DSR is not limited, and may be 1 or more, 5 or more, 10 or more, 20 or more, 50 or more, 100 or more, and the like.
In an exemplary embodiment, the construction of the reference signal segment set DSR of the present invention comprises: synthesizing an unmethylated target fragment of the targeted methylated gene, sequencing it on a nanopore sequencing platform to obtain the corresponding current signal values, and repeating this step at least 5 times to obtain the subset of current signal values for the unmethylated target fragment; and synthesizing a methylated target fragment of the targeted methylated gene, sequencing it on the same nanopore platform to obtain the corresponding current signal values, and repeating this step at least 5 times to obtain the subset of current signal values for the methylated target fragment. The reference signal segment set DSR is composed of these two subsets.
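The sliding cut that produces the set DST can be sketched as a fixed-width window moved along the sampled current signal. In this hedged example the window is expressed in signal samples; the text gives fragment lengths in bp, and the conversion from bp to samples is not specified, so `window` is an assumed parameter and the current values are invented.

```python
def sliding_cut(signal, window, step=1):
    """Return the list of contiguous windows (the set DST) cut from `signal`."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    # A signal shorter than the window yields an empty list.
    return [signal[i:i + window]
            for i in range(0, len(signal) - window + 1, step)]

# Made-up current values for illustration only.
ion_a = [80.1, 82.4, 79.9, 95.2, 96.0, 81.3, 80.7]
dst = sliding_cut(ion_a, window=4, step=1)
print(len(dst), dst[0])
```

A step of 1 sample gives the densest set of candidate fragments; larger steps trade coverage for fewer comparisons against the reference set DSR.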
Recent research has found that, contrary to the traditional understanding, cfDNA includes long fragments exceeding 600 bp and even reaching 23 kb (Yu SCY, Jiang P, Peng W, et al. Single-molecule sequencing reveals a large population of long cell-free DNA molecules in maternal plasma. Proc Natl Acad Sci U S A. 2021;118(50):e2114937118. doi: 10.1073/pnas.2117149318), and the methylation information in long-fragment cfDNA is more important. Conventional methylation detection of free DNA can be carried out by NGS sequencing technology. Whereas NGS sequencing technology can only detect short reads of a few hundred bases, the reads of nanopore sequencing technology can reach thousands to tens of thousands of bases, and even ultra-long reads of several megabases. Nanopore sequencing therefore facilitates analysis of the features and length distribution of the original cfDNA/ctDNA fragments, in particular long-fragment free DNA, and can provide DNA methylation information at sites different from those accessible by the traditional NGS method.
In certain embodiments, the methylation detection of the invention further comprises performing methylation discrimination based on the similarity of the alignments: the targeted site is interpreted as methylated if the ratio of the number of current signal fragments in the set DST whose alignment result is consistent with the methylated signal fragment subset to the number whose alignment result is consistent with the unmethylated signal fragment subset is greater than 1, and as unmethylated if this ratio is less than 1.
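The ratio rule can be sketched as a simple vote over the fragments in DST. How a fragment is judged consistent with one subset or the other is not fixed here, so the example takes one pre-computed label per fragment as input; the function name and the "undetermined" outcome for a ratio of exactly 1 (a case the text leaves open) are assumptions.

```python
def call_methylation(matches):
    """`matches` holds one label per DST fragment:
    'methylated', 'unmethylated', or None for no consistent result."""
    n_meth = sum(1 for m in matches if m == "methylated")
    n_unmeth = sum(1 for m in matches if m == "unmethylated")
    # Comparing the counts is equivalent to testing the ratio against 1
    # and avoids dividing by zero when n_unmeth is 0.
    if n_meth > n_unmeth:
        return "methylated"
    if n_meth < n_unmeth:
        return "unmethylated"
    return "undetermined"  # assumption: the text does not cover ratio == 1

labels = ["methylated", "methylated", "unmethylated", None]
print(call_methylation(labels))
```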
Step (4)
Step (4) of the present invention is to classify the biological sample based on the fragment features and the methylation information. According to the invention, both the targeted site methylation detection and the fragment feature analysis can be carried out from a single sequencing run of a sample, and the sample is classified based on the fragment features and the methylation information. For example, the sample is considered normal or healthy when the targeted site is detected as unmethylated and the fragment feature analysis shows the features of cfDNA; if the targeted site is detected as methylated, or the fragment feature analysis shows the features of ctDNA, the sample is considered a potential cancer sample.
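The example rule in this step can be written as a small decision function. This is only a sketch of the stated example: the function name, the string labels, and the reduction of the fragment feature analysis to a single "cfDNA" or "ctDNA" profile are assumptions.

```python
def classify_sample(site_methylated, fragment_profile):
    """Classify a sample from the methylation call and the fragment profile.

    site_methylated: True if the targeted site was called methylated.
    fragment_profile: 'cfDNA' or 'ctDNA', the outcome of fragment feature analysis.
    """
    if fragment_profile not in ("cfDNA", "ctDNA"):
        raise ValueError("fragment_profile must be 'cfDNA' or 'ctDNA'")
    # Potential cancer if either signal is positive; normal only if both are negative.
    if site_methylated or fragment_profile == "ctDNA":
        return "potential cancer"
    return "normal/healthy"

print(classify_sample(False, "cfDNA"))
print(classify_sample(True, "cfDNA"))
```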
Analysis device
In a second aspect of the invention, a nanopore sequencing data analysis device is provided. The analysis device of the present invention may be, for example, an electronic device, such as a computer, a processor, etc., which includes at least one data acquisition module, a data processing module, and a data storage module, optionally further includes other modules, such as a display module, or further includes a bus connecting different modules, components, or assemblies (including a storage unit and a processing unit). The "module" and "unit" have the same meaning in the present invention.
By way of example, the analysis device of the invention comprises the following modules a-c and, optionally, module d:
a. a data acquisition module configured to acquire at least nanopore sequencing current signal data, including a time-series current signal Ion-A, and optionally further configured to acquire sequencing data Data-B of mutation hotspots;
b. a data processing module configured to perform fragment feature analysis and targeted site methylation detection analysis based on the current signal Ion-A, and optionally further configured to perform targeted mutation site detection analysis on Data-B;
c. a data storage module for storing at least the set of reference signal segments DSR;
d. a display module for displaying the interpretation result obtained after analysis by the data processing module.
In the present invention, the data storage module stores at least the set of reference signal segments DSR and program code executable by the data processing module to cause the data processing module to perform the method of the present invention. Optionally, the data storage module also stores the data acquired by the data acquisition module. The data storage module may also include programs/utilities having a set (at least one) of program modules including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which or some combination thereof may comprise an implementation of a network environment.
The bus may be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device may also communicate with one or more external devices (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface. Also, the electronic device may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via a network adapter. The network adapter communicates with other modules of the electronic device over the bus.
The analysis of the fragment characteristics and the detection and analysis of the methylation of the target site in the data processing module of the analysis device based on the current signal Ion-A are the same as those described in the first aspect of the invention, and are not repeated here. The following is a further description of the analysis of Data-B only.
The data processing module of the present invention is optionally further configured to perform detection analysis of targeted mutation sites on Data-B, where Data-B is sequencing data obtained from a nanopore sequencing library Lib-B directed at mutation hotspots. Lib-B may be a second nanopore sequencing library, capturing the targeted mutation hotspots, that is constructed after the biological sample has been judged to be a potential cancer sample; the sequencing base data are obtained after Lib-B is sequenced on a nanopore sequencer and subjected to base recognition (basecalling) analysis. Further, the data processing module performs filtering, determination, and medical report interpretation on the obtained analysis result, wherein the medical report interpretation includes, but is not limited to, interpretation of the resistance conferred by the mutation hotspot to tumor drugs, and the like. Optionally, a length distribution analysis of the sequencing reads carrying the hotspot mutation is further performed on Data-B, and the features of the cfDNA obtained by the fragment feature analysis of the present application are thereby verified, increasing the credibility of the cancer discrimination.
Computer storage medium
In a third aspect of the invention, a computer storage medium is provided, storing at least a computer program which, when executed by a computer, implements the method of the first aspect of the invention. The storage medium of the present invention may be a readable medium in the form of a magnetic disk, an optical disk, a volatile memory unit, such as a random access memory unit (RAM) and/or a cache memory unit, and may further include a read only memory unit (ROM).
Method for obtaining gene information in biological sample
In a fourth aspect of the invention, there is provided a method of obtaining genetic information from a biological sample, comprising the steps of sequencing DNA in the biological sample using nanopore sequencing technology, and analysing the sequencing data using the method of the first aspect of the invention.
In the invention, nanopore sequencing can be carried out on a currently known platform, including the MinION nanopore sequencer of ONT, the QNOME-3841 sequencer of Qitan Technology Ltd. (Beijing), and the like. Nanopore sequencing can sequence long fragments, such as fragments of 700 bp or more, 1 kbp, 2 kbp, 5 kbp, 6 kbp, 8 kbp, 1 Mbp, 2 Mbp or more.
In an exemplary embodiment, sequencing using the nanopore sequencing technology of the present invention further comprises the steps of extracting DNA from the biological sample and preparing a sequencing library. The biological sample is not limited, and examples thereof include, but are not limited to, blood or components thereof (e.g., serum, plasma), saliva, and urine. The DNA of the present invention is not limited, and examples thereof include free DNA such as cfDNA or ctDNA. The sequencing library preparation method of the present invention is not limited and can be performed using a known kit.
In certain embodiments, the sequencing library is a 1D library, in which the plus and minus strands are separated and sequenced separately. In an exemplary embodiment, the 1D library preparation step comprises end repair of both ends of the DNA, adding an A at the ends, ligating the adapter, and then adding the tether protein to adsorb the DNA strands to the membrane of the sequencing chip. In another exemplary embodiment, the preparation of the 1D library comprises mixing an adapter-loaded transposase with long-chain DNA so that the enzyme cleaves the long-chain DNA and adds the adapter at the breakpoint, and then adding the motor protein and the tether protein for sequencing.
In certain embodiments, the sequencing library is a 1D² library, which is prepared by ligating 1D² adapters to both ends of the DNA and then attaching the sequencing adapter, the motor protein, and the tether protein. The 1D² adapter allows the negative strand to be sequenced immediately after the positive strand; because the two strands are complementary, the two sequences can be aligned with each other, improving the accuracy of base calling.
In certain embodiments, library construction further comprises a step of enriching the library, such as probe capture and the like.
In certain embodiments, the invention includes the steps of constructing a first nanopore sequencing library from the DNA extracted from the biological sample, and constructing a second nanopore sequencing library that captures targeted mutation hotspots.
Example 1
1. Blood samples of 1 non-small cell lung cancer sample, 1 liver cancer sample, 1 small cell lung cancer sample and 1 normal sample were collected in EDTA vacuum tubes. The following description takes the non-small cell lung cancer sample as the example; the same steps 1 to 7 were repeated for the remaining samples. The sample was centrifuged at low temperature and low speed. Plasma was collected by pipette without disturbing the pelleted blood cells. cfDNA was extracted from 10 ml of plasma using the QIAamp Circulating Nucleic Acid Kit. Quantification was performed using Qubit, and quality control of the DNA fragments was performed on an Agilent 2100 Bioanalyzer. The extracted cfDNA was stored at -80 ℃.
2. From the cfDNA extracted in step 1, a first nanopore sequencing library was prepared using a commercial library construction kit, QLK-V1.1.1 (Qitan Technology Ltd., Beijing) or SQK-LSK109 (Oxford Nanopore Technologies), following the kit instructions. This first library is designated Lib-A.
3. Lib-A was sequenced on a QNOME-3841 sequencer (Qitan Technology Ltd., Beijing) or an Oxford Nanopore Technologies (ONT) sequencer such as the MinION to obtain the data storing the sequencing current signal, which comprise the current signal together with metadata such as sequencer chip information and channel information; the current signal is denoted Ion-A. Base recognition (basecalling) analysis was performed on the sequencing current signal Ion-A using the QNOME-3841 high-accuracy basecalling model and algorithm or the HAC (high-accuracy) model and algorithm of ONT, with at least 2 million sequencing reads required per library, to obtain the corresponding sequencing base data, denoted Data-A.
4. The Ion-A in step 3 was subjected to the targeted methylation gene detection of HOXA7, HOXA9, SHOX2 and RASSF1A (as in Table 1), and the Data-A was subjected to fragment feature analysis.
TABLE 1 methylation site information to differentiate lung cancer from normal/healthy samples
[Table 1 is provided as an image in the original publication.]
The detection of targeted methylated genes on Ion-A comprises the following steps. First, Ion-A is cut by a sliding window with a step size of 1 sampling point, the cutting length being the length of the target sequence of the targeted methylated gene; the target sequences of the four targeted methylated genes HOXA7, HOXA9, SHOX2 and RASSF1A are 114 bp, 89 bp, 108 bp and 74 bp long, respectively. This yields the set of cut Ion-A fragments. Second, the fragments in the set of cut Ion-A fragments are compared for signal similarity, using a dynamic time warping algorithm, against the current signal set of the methylated gene target sequence (denoted Ion-methyl) and the current signal set of the unmethylated gene target sequence (denoted Ion-unmethyl), giving an average distance for each comparison. Finally, methylation is judged: a ratio of the average distance for the Ion-methyl set to the average distance for the Ion-unmethyl set greater than 1 is scored as methylated (+). The principle of methylation identification based on current signals is shown in FIG. 3.
The fragment feature analysis of Data-A comprises the following steps. First, Data-A was aligned to the human reference genome hg19 (or hg38) using minimap2 and sorted using sambamba to produce a BAM file, and only uniquely aligned, non-soft-clipped sequencing reads were retained. Second, the read lengths were counted and the length distribution was plotted (as shown in FIG. 4). Overall, the fragment length was enriched at 165 bp (the first peak of the read length distribution); the second peak was at 144 bp, the third at 146 bp and the fourth at 158 bp, i.e., below the 165 bp enrichment length of cfDNA. Then, the relative abundance of the leading 4-mer motif of each read was counted; the relative abundance of the CCCA motif was found to be 1.56%, lower than the average relative abundance of 2.00% for the CCCA motif in normal samples. Finally, sequence fragments 120-180 bp in length were selected, processed with a fast Fourier transform algorithm, and compared with expression profile data of reference samples of cell lines and primary tissues to calculate the correlation; a rank difference analysis was performed and the results were sorted by rank difference from high to low. The most strongly correlated cell line was A549 (a lung cancer-associated cell line), with a rank difference of 23.
5. Judging from the results of step 4: all 4 targeted methylated genes of the sample were methylated, the relative abundance of the CCCA motif was lower than in normal samples, the sample was strongly correlated with the lung cancer cell line A549, and the fragment lengths conformed to the length distribution feature of ctDNA. The sample is therefore suggested to have the fragment features of ctDNA and is considered a potential lung cancer sample.
6. Furthermore, step 2 may further include establishing a second nanopore sequencing library that captures targeted mutation hotspots, denoted Lib-B. In step 3, Lib-B is sequenced on a nanopore sequencer such as the QNOME-3841 or MinION, and base recognition (basecalling) analysis is performed with the QNOME-3841 high-accuracy basecalling model and algorithm or the HAC model and algorithm of ONT, yielding sequencing base data (see Table 2), denoted Data-B.
TABLE 2 sequencing data information for samples
[Table 2 is provided as an image in the original publication.]
The specific steps for establishing the second nanopore sequencing library capturing the targeted mutation hotspots are as follows. A pre-library was constructed from the cfDNA using the VAHTS® Universal DNA Library Prep Kit for Illumina V3 (Vazyme), following the kit instructions. The pre-library was then subjected to hybrid capture, elution and PCR enrichment using the capture kit of the lung cancer 11-gene targeted hotspot panel (panel size 18,452 bp; see Table 3) to prepare a capture library. The process can be briefly described as follows: 500-1000 ng of pre-library and 7.5 μl of Cot-1 DNA were concentrated with 1.8× VAHTS DNA Clean Beads and eluted with 17 μl of mix (2× hybridization buffer 9.5 μl, Universal Blockers-ILMN-TS (Du) 2 μl, hybridization enhancer 3 μl, panel 4.5 μl) at room temperature for 5 min; the mixture was placed in a PCR instrument at 95 ℃ for 30 s and then hybridized overnight at a 65 ℃ hold. 50 μl of MSB was rinsed twice with 150 μl of 1× Beads Wash Buffer, the supernatant was discarded, and the beads were resuspended in a new tube in 17 μl of 1× hybridization buffer (2× hybridization buffer 8.5 μl, hybridization enhancer 2.7 μl, NFW 5.8 μl). After preheating on a PCR instrument at 65 ℃, the beads were added to the tube containing the 17 μl of hybridization mixture, mixed by pipetting, and incubated at 65 ℃ for 45 min for capture. The captured product was washed sequentially with 100 μl of 1× Wash Buffer I once, 150 μl of 1× Wash Buffer S twice (65 ℃), 150 μl of 1× Wash Buffer I once at room temperature, 150 μl of 1× Wash Buffer II once at room temperature, and 150 μl of 1× Wash Buffer III once at room temperature. After the wash with Wash Buffer III, the supernatant was discarded and the magnetic beads were resuspended in 20 μl of NFW.
PCR amplification was performed with the resuspended magnetic beads as template, using the DNA polymerase supplied with the kit. The reaction system was: resuspended beads 20 μl, 2× PCR ReadyMix 25 μl, 10× PCR PrimerMix 5 μl. The reaction conditions were: 98 ℃ for 1 min; 23 cycles of 98 ℃ for 15 s, 60 ℃ for 30 s and 72 ℃ for 30 s; 72 ℃ for 1 min; hold at 4 ℃. PCR products were purified and recovered using 1.8× VAHTS DNA Clean Beads, followed by library quality control and quantification. A nanopore sequencing library was prepared from 300 fmol of Lib-B using a commercial library construction kit, QLK-V1.1.1 (Qitan Technology Ltd., Beijing) or SQK-LSK109 (Oxford Nanopore Technologies).
TABLE 3 Gene List and information for cancer-targeting hotspot detection panel
[Table 3 is provided as an image in the original publication.]
7. For the sample judged in step 5 to be a potential lung cancer sample, detection analysis of targeted mutation sites was carried out on Data-B, and the T790M mutation of the EGFR gene was found. The T790M mutation of the EGFR gene was interpreted medically according to the NCCN guidelines, databases of FDA-approved drug information, and other clinical trial results; for example, the anti-tumor drug suggested for reference for this sample is osimertinib (Tagrisso) and the like. Optionally, step 7 may further include a length distribution analysis of the sequencing reads carrying the hotspot mutation in Data-B; the length of the reads carrying the T790M mutation was 158 bp, which coincides with the fourth peak in the length distribution map obtained in step 4.
8. The remaining samples were repeated through steps 1-7 and the results are shown in Table 4.
TABLE 4 statistical information of the analysis results of the samples of the examples
[Table 4 is provided as an image in the original publication.]
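The dynamic time warping comparison used in step 4 of Example 1 can be illustrated with a textbook implementation. This is a generic sketch, not the code used in the example: the absolute difference as local cost, the function names, and the toy signals are assumptions, and the decision rule that consumes the two mean distances is left to the text of step 4.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    prev = [inf] * (m + 1)
    prev[0] = 0.0  # empty prefix vs empty prefix
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])  # assumed local cost
            # Extend the cheapest of the three admissible warping moves.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]

def mean_dtw(fragment, references):
    """Average DTW distance from one fragment to every reference signal."""
    return sum(dtw_distance(fragment, r) for r in references) / len(references)

# Toy signals, invented for illustration only.
fragment = [1, 2, 3]
ion_methyl = [[1, 2, 3], [2, 3, 4]]
ion_unmethyl = [[1, 2, 2, 3]]
print(mean_dtw(fragment, ion_methyl), mean_dtw(fragment, ion_unmethyl))
```

Because DTW allows one sample to match several samples in the other sequence, two signals that differ only in dwell time (such as `[1, 2, 3]` and `[1, 2, 2, 3]`) have distance 0, which is why it suits nanopore current traces with variable translocation speed.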
Example 2
This example is an exemplary nanopore sequencing data analysis device. As shown in FIG. 5, the nanopore sequencing data analysis device 100 includes a data acquisition module 110, a data storage module 120, a data processing module 130, and a data display module 140. The data acquisition module 110 acquires the sequencing data by communicating with the internet or cloud 210 or with the nanopore sequencer 220. The acquired data are stored in the data storage module 120, which also stores the reference signal segment set DSR. The data processing module 130 retrieves the data from the data acquisition module 110 and performs fragment feature analysis and targeted site methylation detection analysis based on the current signal Ion-A. The fragment feature analysis comprises base recognition analysis of Ion-A to obtain the sequencing base data Data-A. The targeted site methylation detection analysis includes cutting Ion-A with a sliding window in the time direction by a prescribed step size to obtain a set DST composed of different current signal fragments, performing similarity comparison analysis between each current signal fragment in the set DST and the reference signal fragment set DSR in the data storage module 120, and making a judgment according to the similarity of the comparison. The judgment result is passed to the data display module 140, which displays the interpretation result obtained after analysis by the data processing module.
While the present invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. Many modifications and variations may be made to the exemplary embodiments of the present description without departing from the scope or spirit of the present invention. The scope of the claims is to be accorded the broadest interpretation so as to encompass all modifications and equivalent structures and functions.

Claims (16)

1. A method of nanopore sequencing data analysis, comprising:
acquiring current signal data of a biological sample obtained by nanopore sequencing, wherein the current signal data at least comprises a time sequence current signal Ion-A which comprises information of at least two dimensions of a transverse time dimension and a longitudinal signal intensity dimension;
carrying out base recognition analysis on the Ion-A to obtain sequencing Data-A, and analyzing segment characteristics based on the Data-A;
carrying out methylation detection on the target site based on the Ion-A to obtain methylation information of the target site; and
classifying the biological sample according to the fragment features and the methylation information.
2. The method of nanopore sequencing data analysis according to claim 1, wherein said fragment characteristics comprise at least one of length distribution characteristics, motif characteristics, and tissue characteristics.
3. The method of claim 2, wherein analyzing the length distribution features comprises screening sequences in the sequencing data to retain sequencing read sequence results that are uniquely aligned and not soft cut in a human reference genome, performing length statistics and mapping the length of the screened read sequences to obtain the length distribution features.
4. The method of nanopore sequencing data analysis according to claim 2, wherein said analysis of motif features comprises screening sequences in said sequencing data to retain sequencing read-length sequence results with unique alignment and non-soft-cut in a human reference genome, and counting the frequency or relative abundance of motifs of k-mers before each read-length sequence, wherein 4 ≤ k ≤ 10, resulting in motif features.
5. The method according to claim 2, wherein the analyzing of the tissue characteristics comprises screening the sequence in the sequencing data to retain the uniquely aligned and non-soft cut sequencing read sequence result in the human reference genome, screening out sequence fragments with a specified length range, performing comparative analysis and calculating correlation with the expression profile data of the reference sample of the cell line and the primary tissue, and performing tissue tracing analysis to obtain the tissue characteristics.
6. The method of claim 1, wherein the methylation detection comprises sliding the Ion-a in a time direction by a predetermined step size to obtain a set DST consisting of different current signal fragments, and performing similarity alignment analysis on each current signal fragment in the set DST with a reference signal fragment set DSR, wherein the reference signal fragment set DSR comprises a subset of methylated signal fragments and a subset of unmethylated signal fragments.
7. The method of claim 6, wherein the methylation detection further comprises methylation discrimination based on similarity of the alignments, comprising interpreting the targeted site as methylated if the number of results of each current signal segment in the set of DSTs aligned with the subset of methylated signal segments/the number of results of each current signal segment in the set of DSTs aligned with the subset of unmethylated signal segments >1, and interpreting the targeted site as unmethylated if the number of results of each current signal segment in the set of DSTs aligned with the subset of methylated signal segments/the number of results of each current signal segment in the set of DSTs aligned with the subset of unmethylated signal segments is < 1.
8. The method of claim 7, wherein constructing the set of reference signal fragments DSR comprises synthesizing a first sequence fragment comprising a methylated targeting site and a second sequence fragment comprising a non-methylated targeting site, and obtaining by nanopore sequencing a first reference signal fragment corresponding to the first sequence fragment and a second reference signal fragment corresponding to the second sequence fragment, wherein a subset of methylated signal fragments is formed from a plurality of the first reference signal fragments and a subset of non-methylated signal fragments is formed from a plurality of the second reference signal fragments.
9. A nanopore sequencing data analysis device, comprising:
a. the data acquisition module is arranged to acquire current signal data for nanopore sequencing, and comprises a time sequence current signal Ion-A;
b. the Data processing module is configured to perform base recognition analysis on the Ion-A to obtain sequencing Data-A, perform fragment feature analysis based on the Data-A, and perform result analysis of targeted site methylation detection based on the Ion-A.
10. The nanopore sequencing data analysis device of claim 9, wherein said result analysis of targeted site methylation detection comprises sliding along the Ion-A in the time direction by a prescribed step size to obtain a set DST of different current signal fragments, and performing similarity alignment analysis of each current signal fragment in the set DST against a set of reference signal fragments DSR, wherein the set of reference signal fragments DSR comprises a subset of methylated signal fragments and a subset of unmethylated signal fragments,
and wherein the nanopore sequencing data analysis device further comprises:
c. a data storage module configured to store at least the set of reference signal fragments DSR.
11. The nanopore sequencing data analysis device of claim 10, wherein said result analysis of targeted site methylation detection further comprises methylation discrimination based on the similarity alignment, comprising: taking as a ratio the number of alignment results of the current signal fragments in the set DST against the subset of methylated signal fragments divided by the number of alignment results of the current signal fragments in the set DST against the subset of unmethylated signal fragments; interpreting the targeted site as methylated if the ratio is > 1; and interpreting the targeted site as unmethylated if the ratio is < 1.
12. The nanopore sequencing data analysis device of claim 9, wherein the data acquisition module is further configured to acquire sequencing data Data-B of a mutation hotspot, and the data processing module is further configured to perform detection analysis of a targeted mutation site on the Data-B.
13. The nanopore sequencing data analysis device of claim 12, wherein the detection analysis of the targeted mutation site on the Data-B comprises analyzing the length distribution of the sequencing sequences in the Data-B that carry the hotspot mutation, and performing verification based on the length distribution characteristics obtained by the fragment feature analysis.
14. A computer storage medium, in which a computer program is stored which, when executed by a computer, implements the method of any one of claims 1 to 8.
15. A method of obtaining genetic information from a biological sample, comprising the steps of sequencing DNA in the biological sample using nanopore technology, and analysing the sequencing data using the method according to any one of claims 1 to 8.
16. The method of claim 15, wherein the biological sample is at least one selected from the group consisting of blood, saliva, and urine.
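The sliding-window alignment and ratio-based discrimination recited in claims 7, 10 and 11 can be illustrated with a minimal sketch. The claims name neither a similarity measure nor a window length, so the mean squared difference, the `max_distance` threshold, and every function and parameter name below are assumptions made only for illustration; this is not the patented implementation. Reference fragments are assumed to have the same length as the window.

```python
from typing import List, Sequence


def sliding_segments(signal: Sequence[float], window: int, step: int) -> List[Sequence[float]]:
    """Slide along the time axis by `step` and cut fragments of length `window` (the set DST)."""
    return [signal[i:i + window] for i in range(0, len(signal) - window + 1, step)]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared difference; a stand-in for the unspecified similarity measure."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def call_methylation(
    ion_a: Sequence[float],
    methylated_refs: List[Sequence[float]],
    unmethylated_refs: List[Sequence[float]],
    window: int,
    step: int,
    max_distance: float,
) -> str:
    """Count alignment hits against each DSR subset and compare the two counts."""
    n_meth = 0
    n_unmeth = 0
    for seg in sliding_segments(ion_a, window, step):
        # A fragment "aligns" with a reference when the distance is within the threshold.
        n_meth += sum(1 for ref in methylated_refs if distance(seg, ref) <= max_distance)
        n_unmeth += sum(1 for ref in unmethylated_refs if distance(seg, ref) <= max_distance)
    # n_meth / n_unmeth > 1 is the same test as n_meth > n_unmeth when n_unmeth > 0;
    # comparing the counts directly avoids dividing by zero.
    if n_meth > n_unmeth:
        return "methylated"
    if n_meth < n_unmeth:
        return "unmethylated"
    return "undetermined"  # ratio equal to 1 or no hits; the claims do not cover this case


if __name__ == "__main__":
    meth = [[1.0, 5.0, 1.0]]    # toy methylated reference fragment
    unmeth = [[1.0, 2.0, 1.0]]  # toy unmethylated reference fragment
    print(call_methylation([0.0, 1.0, 5.0, 1.0, 0.0], meth, unmeth, 3, 1, 0.5))
```

Comparing the two counts rather than computing the quotient means a site with methylated hits and zero unmethylated hits is called methylated, which is one possible reading of the ratio test; a real pipeline would more likely use a signal-alignment measure such as dynamic time warping in place of the toy distance.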
CN202211621058.9A 2022-12-16 2022-12-16 Nanopore sequencing data analysis method and device, storage medium and application Active CN115620809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211621058.9A CN115620809B (en) 2022-12-16 2022-12-16 Nanopore sequencing data analysis method and device, storage medium and application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211621058.9A CN115620809B (en) 2022-12-16 2022-12-16 Nanopore sequencing data analysis method and device, storage medium and application

Publications (2)

Publication Number Publication Date
CN115620809A (en) 2023-01-17
CN115620809B (en) 2023-04-07

Family

ID=84880901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211621058.9A Active CN115620809B (en) 2022-12-16 2022-12-16 Nanopore sequencing data analysis method and device, storage medium and application

Country Status (1)

Country Link
CN (1) CN115620809B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117935909A (en) * 2024-01-26 2024-04-26 哈尔滨工业大学 Third generation sequencing DNA methylation detection method based on fusion of electric signals and sequences

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130186758A1 (en) * 2011-12-09 2013-07-25 University Of Delaware Current-carrying nanowire having a nanopore for high-sensitivity detection and analysis of biomolecules
CN108885649A (en) * 2015-11-12 2018-11-23 塞缪尔·威廉姆斯 Rapid sequencing of short DNA fragments using nanopore technology
CN112309503A (en) * 2020-10-19 2021-02-02 深圳市儒翰基因科技有限公司 Base interpretation method, interpretation equipment and storage medium based on nanopore electric signal
US20220328135A1 (en) * 2021-04-12 2022-10-13 The Chinese University Of Hong Kong Base modification analysis using electrical signals
CN115198035A (en) * 2022-08-04 2022-10-18 北京元码医学检验实验室有限公司 Detection method for simultaneously obtaining virus integration transcript and RNA modification based on nanopore sequencing and application
CN115404275A (en) * 2022-08-17 2022-11-29 中山大学·深圳 Method for evaluating tumor purity based on nanopore sequencing technology

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
汪荣亮; 谷德健; 刘全俊: "Research progress in the application of nanopore sensing technology to early tumor diagnosis" *

Also Published As

Publication number Publication date
CN115620809B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
Logsdon et al. Long-read human genome sequencing and its applications
CN110800063B (en) Detection of tumor-associated variants using cell-free DNA fragment size
Bock Analysing and interpreting DNA methylation data
CN104254618B (en) The analysis based on size of foetal DNA fraction in Maternal plasma
ES2688458T3 (en) Varietal nucleic acid count to obtain information on the number of genomic copies
AU2023202572A1 (en) Single-molecule sequencing of plasma DNA
KR101858344B1 (en) Method of next generation sequencing using adapter comprising barcode sequence
CN107267613B (en) Sequencing data processing system and SMN gene detection system
He et al. Assessing the impact of data preprocessing on analyzing next generation sequencing data
CN104745679A (en) Method and kit for non-invasive detection of EGFR (epidermal growth factor receptor) gene mutation
CN113308540B (en) Thyroid nodule-related rDNA methylation marker and application thereof
CN108866192A (en) Tumor marker STAMP-EP1 based on methylation modification
CN111321209A (en) Method for double-end correction of circulating tumor DNA sequencing data
CN113889187B (en) Single-sample allele copy number variation detection method, probe set and kit
CN115620809B (en) Nanopore sequencing data analysis method and device, storage medium and application
CN105528532B (en) A kind of characteristic analysis method in rna editing site
CN107988385B (en) Method for detecting marker of PLAG1 gene Indel of beef cattle and special kit thereof
WO2023142625A1 (en) Methylation sequencing data filtering method and application
CN115948521B (en) Method for detecting aneuploidy deletion chromosome information
Stackpole Multi-feature ensemble learning on cell-free dna for accurately detecting and locating cancer
EP4131274A1 (en) Method for characterization of cancer
CN115011695A (en) Multiple cancer species identification marker based on free circular DNA gene, kit and application
CN113948150B (en) JMML related gene methylation level evaluation method, model and construction method
CN112980931A (en) Kit and method for detecting human music epigenotype
Green et al. Modern Diagnostic Methods in the 21st Century

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant