CN117965725A - Method, device and kit for distinguishing liver cancer from liver non-cancer disease samples
- Publication number
- CN117965725A CN202311830003.3A
- Authority
- CN
- China
- Prior art keywords
- sample
- liver
- dmr
- repeat
- classification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6883—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
- C12Q1/6886—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6883—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2600/00—Oligonucleotides characterized by their use
- C12Q2600/112—Disease subtyping, staging or classification
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2600/00—Oligonucleotides characterized by their use
- C12Q2600/154—Methylation markers
Abstract
The invention provides a detection method, a detection device and a detection kit for distinguishing a liver cancer sample from a liver non-cancer disease sample. In particular, it relates to a biomarker combination for distinguishing the liver cancer sample from the liver non-cancer disease sample, wherein the biomarker combination comprises at least any 10 of the differentially methylated regions (DMRs) shown in Table 1 and/or Table 5, and the reference genome used for the DMRs in Table 1 and/or Table 5 is the GRCh37/hg19 human reference genome. Samples to be tested can thereby be classified at low cost and with high accuracy.
Description
Technical Field
The invention relates to the field of biotechnology, in particular to a method, a device and a kit for distinguishing liver cancer samples from liver non-cancer disease samples.
Background
Intervention before distant metastasis offers the greatest opportunity to improve prognosis, so sensitive, reliable and minimally invasive assays that detect cancer before symptoms appear are highly desirable. Among the many cancer types, liver cancer (hepatocellular carcinoma, HCC) seriously endangers human health: it has a high incidence, an insidious onset, rapid progression, and high recurrence and mortality rates, and is known as the "king of cancers". Most liver cancer patients are already at a middle or late stage when they first visit a hospital, and the natural course of the disease without treatment is only 3-6 months.
One very important cause of liver cancer is chronic hepatitis B virus (HBV) infection. Chronic HBV infection may lead to chronic hepatitis and cirrhosis, and may progress further to liver cancer. Currently, liver cancer is detected mainly by two means: serum marker detection and imaging. However, neither can reliably distinguish liver cancer from benign liver disease.
Existing liver cancer serum marker assays include the serum alpha-fetoprotein (AFP) assay and blood enzymology and other tumor marker assays. Of these, the serum AFP assay is relatively specific for diagnosing liver cancer. A serum AFP level of at least 400 μg/L measured by radioimmunoassay, with pregnancy and active liver disease excluded, can support a diagnosis of liver cancer; however, chronic hepatitis or liver cirrhosis can also produce elevated AFP levels. Meanwhile, about 30% of liver cancer patients are AFP-negative in clinical practice, so the specificity of the AFP test used alone is not high, and it is difficult to distinguish liver cancer from other liver non-cancer diseases. Blood enzymology and other tumor marker tests rely on the observation that gamma-glutamyl transpeptidase and its isoenzymes, abnormal prothrombin, alkaline phosphatase and lactate dehydrogenase isoenzymes may be higher than normal in the serum of liver cancer patients, but these tests also lack specificity.
Imaging examinations typically include ultrasound, computed tomography (CT), magnetic resonance imaging (MRI), selective celiac or hepatic angiography, and liver needle aspiration cytology. In more complex cases, however, imaging has difficulty distinguishing liver cancer from benign liver disease, and it can diagnose liver cancer only after a tumor has formed and reached a certain size, so it cannot achieve the goals of early detection or early screening.
DNA methylation sequencing is increasingly recognized as a high-resolution, high-throughput technique that is useful in cancer screening, diagnosis and monitoring. Most regions of the human genome are not active during the development of cancer, and cancer-related variation tends to concentrate in specific regions such as CpG islands, which provides a good opportunity for targeted sequencing. Although a large number of scientific articles report DNA methylation biomarkers and their clinical relevance in cancer and various non-cancerous diseases, only a few dozen biomarkers have been translated into commercial clinical test products, and products for liver cancer are scarcer still. A kit and a corresponding detection method for distinguishing liver cancer from other liver non-cancer diseases are therefore urgently needed.
Disclosure of Invention
The invention provides a method, a device and a kit for distinguishing liver cancer samples from liver non-cancer disease samples. They use DNA or RNA oligonucleotide sequences to capture regions whose methylation varies in malignant or benign liver disease and to determine whether tumor-derived components (ctDNA) are present in a sample to be tested, thereby providing a low-cost, high-accuracy way to classify the sample as a liver cancer sample or a liver non-cancer disease sample.
In one aspect, the invention provides a biomarker combination for distinguishing liver cancer samples from liver non-cancer disease samples, wherein the biomarker combination comprises at least any 10 of the differentially methylated regions (DMRs) shown in Table 1 and/or Table 5, and wherein the reference genome used for the DMRs in Table 1 and/or Table 5 is the GRCh37/hg19 human reference genome.
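As a rough illustration of the constraint just stated (at least 10 distinct DMRs, all expressed on GRCh37/hg19), a panel could be checked as in the following minimal sketch. The `DMR` class, the `is_valid_panel` function and the coordinates are hypothetical and introduced only for illustration; the coordinates are placeholders and are not taken from Table 1 or Table 5.

```python
from dataclasses import dataclass

REQUIRED_BUILD = "GRCh37/hg19"  # reference genome named in the text
MIN_DMRS = 10                   # minimum panel size named in the text


@dataclass(frozen=True)
class DMR:
    """Hypothetical container for one region: chromosome, start, end, build."""
    chrom: str
    start: int
    end: int
    build: str = REQUIRED_BUILD


def is_valid_panel(dmrs):
    """True if the panel has at least MIN_DMRS distinct, well-formed DMRs on the required build."""
    distinct = set(dmrs)  # frozen dataclasses are hashable, so duplicates collapse
    on_build = all(d.build == REQUIRED_BUILD for d in distinct)
    well_formed = all(d.start < d.end for d in distinct)
    return len(distinct) >= MIN_DMRS and on_build and well_formed


# Placeholder coordinates for illustration only (NOT from Table 1 or Table 5).
panel = [DMR("chr1", 1000 * i, 1000 * i + 200) for i in range(1, 11)]
print(is_valid_panel(panel))
```

Counting distinct regions rather than list entries means that a panel padded with repeated DMRs would not satisfy the check.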
In another aspect, the invention provides a kit comprising reagents for detecting a biomarker combination as described above.
In another aspect, the invention provides the use of a reagent for detecting the above biomarker combination in the preparation of a kit for distinguishing liver cancer samples from liver non-cancer disease samples.
In another aspect, the invention provides a method of classifying methylation data, comprising: acquiring first methylation data corresponding to the biomarker combination described above in a sample to be tested; correcting the first methylation data according to the confounding factors corresponding to the sample to be tested to obtain second methylation data; and classifying the second methylation data against a classification threshold based on a preset rule, and generating indication information indicating the class to which the sample to be tested belongs, wherein the preset rule comprises comparing the value of the second methylation data with the classification threshold and generating the indication information according to the comparison result. Preferably, first indication information indicating the class to which the sample to be tested belongs is generated in response to the value of the second methylation data being less than or equal to the classification threshold, and second indication information indicating the class to which the sample to be tested belongs is generated in response to the value of the second methylation data being greater than the classification threshold.
In another aspect, the invention provides a methylation data classification apparatus, comprising: an acquisition unit configured to acquire first methylation data corresponding to the biomarker combination described above in a sample to be tested; a correction unit configured to correct the first methylation data according to the confounding factors corresponding to the sample to be tested to obtain second methylation data; and a classification unit configured to classify the second methylation data against a classification threshold based on a preset rule and to generate indication information indicating the class to which the sample to be tested belongs, wherein the preset rule comprises comparing the value of the second methylation data with the classification threshold and generating the indication information according to the comparison result. Preferably, first indication information indicating the class to which the sample to be tested belongs is generated in response to the value of the second methylation data being less than or equal to the classification threshold, and second indication information indicating the class to which the sample to be tested belongs is generated in response to the value of the second methylation data being greater than the classification threshold.
In another aspect, the invention provides an electronic device, comprising: one or more processors; and a storage device having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement the method described above.
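The correct-then-threshold flow of the method can be sketched as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the linear, weight-based correction, the function names, the confounder name `age_decades` and all numeric values are hypothetical. The text here specifies only that the data are corrected for confounding factors and that the corrected value is compared with a classification threshold.

```python
from dataclasses import dataclass


@dataclass
class Indication:
    label: str    # which indication information was generated
    score: float  # the corrected ("second") methylation value


def correct_for_confounders(raw_score, confounders, weights):
    """Assumed linear correction: subtract a weighted sum of confounder values."""
    adjustment = sum(weights.get(name, 0.0) * value
                     for name, value in confounders.items())
    return raw_score - adjustment


def classify(raw_score, confounders, weights, threshold):
    """Compare the corrected value with the threshold, as in the preset rule."""
    corrected = correct_for_confounders(raw_score, confounders, weights)
    # <= threshold gives the first indication, > threshold the second.
    # The labels are placeholders; this passage does not say which class
    # each indication corresponds to.
    label = "first indication" if corrected <= threshold else "second indication"
    return Indication(label, corrected)


# Illustrative values only.
result = classify(raw_score=1.0,
                  confounders={"age_decades": 2.0},
                  weights={"age_decades": 0.125},
                  threshold=0.5)
print(result.label, result.score)
```

Note that a corrected value exactly equal to the threshold falls on the "first indication" side, matching the "less than or equal to" wording of the preferred rule.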
In another aspect, the present invention provides a computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by one or more processors, implements the method described above.
The biomarker combination, kit, method, use, device, electronic device and storage medium are suitable for risk assessment of cancer and offer low cost and high accuracy.
Specifically, the liver non-cancer disease includes one or more of the following: cirrhosis, chronic viral hepatitis B, liver metastasis, high cholesterol, alcoholic liver disease, cysts, fatty liver disease (NAFLD), fibrosis, jaundice, primary sclerosing cholangitis (PSC), hemochromatosis, primary biliary cirrhosis, and alpha-1 antitrypsin deficiency.
The biomarker combination, kit, method, use, device, electronic device and storage medium provided by the invention use DNA or RNA oligonucleotide sequences to capture regions whose methylation varies in malignant or benign liver disease and to determine whether tumor-derived components (ctDNA) are present in a sample to be tested. They are suitable for risk assessment of cancer, distinguish liver cancer from other liver non-cancer diseases, and classify methylation data and generate the corresponding indication information. They fill a gap in the related art and offer high accessibility in clinical application, convenient implementation, low cost and high accuracy.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings. The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention. In the drawings:
Fig. 1 shows an exemplary case where CpG sites cannot be classified into the same DMR.
Fig. 2 shows an exemplary case where CpG sites are partitioned into the same DMR.
Fig. 3 illustrates an exemplary case for explaining the principle of judging whether the DMR is valid or not in the present invention.
Fig. 4 shows the control results of the weight configuration of the confounding variables in the DOC model of the present application.
Fig. 5 shows that the DOC model established by the present invention remains balanced across the age groups.
Detailed Description
I. Definitions
In the present invention, unless otherwise indicated, scientific and technical terms used herein have the meanings commonly understood by one of ordinary skill in the art. Also, protein and nucleic acid chemistry, molecular biology, cell and tissue culture, microbiology, immunology-related terms and laboratory procedures as used herein are terms and conventional procedures that are widely used in the corresponding arts. Meanwhile, in order to better understand the present invention, definitions and explanations of related terms are provided below.
As used herein, the term "differential methylation region" (DIFFERENTIALLY METHYLATED region, DMR) generally refers to a region of DNA that contains one or more differential methylation sites. For example, a DMR that includes a greater number or frequency of methylation sites under selected conditions of interest, such as a cancer state, may be referred to as a hypermethylated DMR. For example, a DMR that includes a lesser number or frequency of methylation sites under selected conditions of interest, such as a cancer state, may be referred to as a hypomethylated DMR.
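The hypermethylated and hypomethylated labels in this definition can be illustrated with a deliberately simple sketch that compares the mean methylation fraction of a region under the condition of interest with that of a reference group. The function name and the input values are hypothetical. Practical DMR calling generally relies on statistical testing across CpG sites and samples, which this sketch omits.

```python
def dmr_direction(condition_fractions, reference_fractions):
    """Label a region by comparing mean methylation fractions (values in 0..1)."""
    def mean(values):
        return sum(values) / len(values)

    difference = mean(condition_fractions) - mean(reference_fractions)
    if difference > 0:
        return "hypermethylated"   # more methylation under the condition
    if difference < 0:
        return "hypomethylated"    # less methylation under the condition
    return "no difference"


# Illustrative per-sample methylation fractions for one region (made-up values).
print(dmr_direction([0.75, 0.5], [0.25, 0.25]))
```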
As used herein, the term "methylation" generally refers to the methylation state of a gene fragment, nucleotide, or base thereof of the present application. For example, a DNA fragment in which a gene of the application is located may have methylation on one or more strands. For example, a DNA fragment in which a gene of the application resides may have methylation at one site or DMR or at multiple sites or DMR.
As used herein, the term "next generation sequencing" (NGS) refers to any sequencing method that determines the nucleotide sequence of individual nucleic acid molecules (e.g., in single-molecule sequencing) or of clonally amplified surrogates of individual nucleic acid molecules in a high-throughput mode (e.g., sequencing more than 10^3, 10^4, 10^5 or more molecules simultaneously). Next generation sequencing platforms include, but are not limited to, existing platforms such as Illumina. With the continued development of sequencing technology, one skilled in the art will appreciate that other sequencing methods and devices may also be employed for the present method. Next generation sequencing methods include, for example, sequencing by synthesis (Illumina), massively parallel signature sequencing (MPSS), polony sequencing, pyrosequencing (454), ion semiconductor sequencing, DNA nanoball sequencing, the DNA nanoarray and combinatorial probe-anchor ligation sequencing of Complete Genomics, single-molecule real-time sequencing (Pacific Biosciences), and sequencing by ligation (SOLiD sequencing). Because next generation sequencing enables detailed analysis of the transcriptome and genome of a species, it is also referred to as deep sequencing. The methods of the invention are equally applicable to first-generation, second-generation, and third-generation gene sequencing, or to single-molecule sequencing (SMS).
As used herein, the term "human reference genome" generally refers to a human genome that can serve as a reference in gene sequencing. Information on the human reference genome is available from UCSC. The human reference genome may be in different versions, for example, hg19, hg38, GRCh37, GRCh38, GCA_000001405, GCF_000001405, or Ensembl75.
As used herein, the terms "polynucleotide," "nucleotide," "nucleic acid," and "oligonucleotide" are used interchangeably. They represent polymeric forms of nucleotides (deoxyribonucleotides or ribonucleotides) of any length, or analogues thereof. Polynucleotides may have any three-dimensional structure and may perform any function, whether known or unknown. The following are non-limiting examples of polynucleotides: coding or non-coding regions of a gene or gene fragment, loci defined by linkage analysis, exons, introns, messenger RNAs (mRNA), transfer RNAs (tRNA), ribosomal RNAs (rRNA), short interfering RNAs (siRNA), short-hairpin RNAs (shRNA), microRNAs (miRNA), ribozymes, cDNAs, recombinant polynucleotides, branched polynucleotides, plasmids, vectors, isolated DNA of any sequence, isolated RNA of any sequence, nucleic acid probes, primers, and adaptors. Polynucleotides may include one or more modified nucleotides, such as methylated nucleotides and nucleotide analogs.
As used herein, the term "sample to be tested" generally refers to a sample that is to be tested. For example, the presence or absence of a modification in one or more gene regions on a test sample can be detected. In embodiments of the present invention, the sample to be tested includes, but is not limited to, a tissue sample, a blood sample, saliva, sputum, pleural effusion, pulmonary lavage, peritoneal effusion, peritoneal lavage, and cerebrospinal fluid.
As used herein, the term "Youden index", also known as the correct index, is a measure for evaluating the validity of a screening test; it can be applied when false negatives (missed diagnoses) and false positives (misdiagnoses) are considered equally harmful. The Youden index is the sum of sensitivity and specificity minus 1, and indicates the overall ability of a screening method to identify true patients and true non-patients. The larger the index, the better the performance and the greater the validity of the screening test. The term "optimal Youden index" refers to the case in which the sum of sensitivity and specificity minus 1 is largest.
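The index defined above (sensitivity plus specificity minus 1, commonly called the Youden index) can be illustrated with a minimal Python sketch. The function names, the toy scores, and the convention that a score above the cut-off counts as positive are assumptions made only for this illustration and are not taken from the specification.

```python
def youden_index(sensitivity: float, specificity: float) -> float:
    """Return J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


def best_threshold(scores, labels, thresholds):
    """Return the (threshold, J) pair with the largest J.

    Assumption: a sample is called positive when its score exceeds the threshold.
    labels use 1 for diseased and 0 for non-diseased samples.
    """
    best = None
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > t)
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s <= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > t)
        j = youden_index(tp / (tp + fn), tn / (tn + fp))
        if best is None or j > best[1]:
            best = (t, j)
    return best


# Hypothetical toy data: three non-diseased and four diseased samples.
scores = [0.1, 0.2, 0.4, 0.3, 0.6, 0.7, 0.9]
labels = [0, 0, 0, 1, 1, 1, 1]
print(best_threshold(scores, labels, thresholds=[0.25, 0.5]))
```

Scanning candidate cut-offs and keeping the one with the largest J corresponds to the "optimal" case described in the definition.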
As used herein, the term "non-cancerous disease" (noncancer disease) refers to a disease of the body other than cancer, including benign proliferative conditions of tumors.
Detailed description of the preferred embodiments
In one aspect, the present invention provides a biomarker combination for distinguishing liver cancer samples from liver non-cancer disease samples, wherein the biomarker combination comprises any at least 10 of the differentially methylated regions (DMRs) shown in table 1 and/or table 5, and wherein the reference genome used for the DMRs in table 1 and/or table 5 is the GRCh37/hg19 human reference genome.
In some preferred embodiments, the biomarker combinations comprise any at least 10 DMR shown in table 1, and/or any at least 10 DMR shown in table 5.
In some preferred embodiments, the biomarker combinations comprise all 195 DMRs shown in table 1, and/or all 230 DMRs shown in table 5.
In some preferred embodiments, the biomarker combinations described above comprise any of the at least 10 DMR shown in table 1.
In some alternative embodiments, the biomarker combinations comprise 10 DMR selected from any of the group of repeat one, repeat two, repeat three, repeat four, or repeat five shown in table 3.
In some preferred embodiments, the biomarker combinations described above comprise any of the at least 10 DMR shown in table 5.
In some alternative embodiments, the biomarker combinations comprise 10 DMR selected from any of the group of repeat one, repeat two, repeat three, repeat four, or repeat five shown in table 7.
In some embodiments, the liver non-cancerous disease comprises: cirrhosis, chronic viral hepatitis B, liver metastasis, high cholesterol, alcoholic liver disease, cyst, fatty liver disease (NAFLD), fibrosis, jaundice, primary sclerosing cholangitis (PSC), hemochromatosis, primary biliary cirrhosis, or alpha-1 antitrypsin deficiency.
In some embodiments, the present invention provides a biomarker combination for distinguishing a liver cancer sample from a chronic viral hepatitis B sample, the biomarker combination comprising any at least 10 DMRs shown in table 1.
In some preferred embodiments, the biomarker combination provided by the present invention is used to distinguish liver cancer samples from chronic viral hepatitis B samples, the biomarker combination comprising all 195 DMRs shown in table 1.
In some embodiments, the present invention provides a biomarker combination for distinguishing a liver cancer sample from a liver cirrhosis sample, the biomarker combination comprising any of at least 10 DMR as shown in table 5.
In some preferred embodiments, the biomarker combinations provided by the present invention are used to distinguish liver cancer samples from liver cirrhosis samples, the biomarker combinations comprising all 230 DMR as shown in table 5.
In another aspect, the invention provides a kit, wherein the kit comprises reagents for detecting the biomarker combination.
In some embodiments, the above-described kits comprise next-generation sequencing reagents.
In some preferred embodiments, the next generation sequencing reagents described above include hybridization capture probes or primers covering any at least 10 DMR in table 1 and/or table 5.
In some preferred embodiments, the next generation sequencing reagents described above include hybridization capture probes or primers covering any at least 10 DMRs shown in table 1, and/or any at least 10 DMRs shown in table 5.
In some preferred embodiments, the next generation sequencing reagents described above include hybridization capture probes or primers that cover all 195 DMRs shown in table 1, and/or hybridization capture probes or primers for all 230 DMRs shown in table 5.
In some preferred embodiments, the next generation sequencing reagents described above include hybridization capture probes or primers covering any of at least 10 DMR shown in table 1.
In some alternative embodiments, the next generation sequencing reagents described above comprise hybridization capture probes or primers covering 10 DMR selected from any one of the sets of repeat one, repeat two, repeat three, repeat four, or repeat five shown in table 3.
In some preferred embodiments, the next generation sequencing reagents described above include hybridization capture probes or primers covering any of at least 10 DMR shown in table 5.
In some alternative embodiments, the next generation sequencing reagents described above comprise hybridization capture probes or primers covering 10 DMR selected from any one of the sets of repeat one, repeat two, repeat three, repeat four, or repeat five shown in table 7.
In some embodiments, the above-described kit is used to distinguish liver cancer samples from liver non-cancer disease samples.
In some preferred embodiments, the liver non-cancerous disease described above comprises: cirrhosis, chronic viral hepatitis B, liver metastasis, high cholesterol, alcoholic liver disease, cyst, fatty liver disease (NAFLD), fibrosis, jaundice, primary sclerosing cholangitis (PSC), hemochromatosis, primary biliary cirrhosis, or alpha-1 antitrypsin deficiency.
In some embodiments, the kit provided by the invention is used for distinguishing liver cancer samples from chronic viral hepatitis B samples, and the next generation sequencing reagents included in the kit comprise hybridization capture probes or primers covering any at least 10 DMRs shown in table 1.
In some preferred embodiments, the kit provided by the invention is used for distinguishing liver cancer samples from chronic viral hepatitis B samples, and the next generation sequencing reagents included in the kit comprise hybridization capture probes or primers covering all 195 DMRs shown in table 1.
In some embodiments, the invention provides a kit for distinguishing liver cancer samples from liver cirrhosis samples, the next generation sequencing reagents included in the kit comprising hybridization capture probes or primers covering any at least 10 DMRs shown in table 5.
In some preferred embodiments, the invention provides a kit for distinguishing liver cancer samples from liver cirrhosis samples, the next generation sequencing reagents included in the kit comprising hybridization capture probes or primers covering all 230 DMRs shown in table 5.
In another aspect, the present disclosure provides use of reagents for detecting the biomarker combination described above in the preparation of a kit for distinguishing a liver cancer sample from a liver non-cancer disease sample.
In some preferred embodiments, the liver non-cancerous disease described above comprises: cirrhosis, chronic viral hepatitis B, liver metastasis, high cholesterol, alcoholic liver disease, cyst, fatty liver disease (NAFLD), fibrosis, jaundice, primary sclerosing cholangitis (PSC), hemochromatosis, primary biliary cirrhosis, or alpha-1 antitrypsin deficiency.
In another aspect, the present disclosure provides a method of classifying methylation data, comprising: acquiring first methylation data corresponding to the biomarker combination according to claim 1 in a sample to be tested; correcting the first methylation data according to a confounding factor corresponding to the sample to be tested to obtain second methylation data; and classifying, based on a preset rule, according to the second methylation data and a classification threshold, and generating indication information indicating the classification to which the sample to be tested belongs, wherein the preset rule comprises comparing the value of the second methylation data with the classification threshold and generating the indication information according to the comparison result; preferably, first indication information indicating the classification to which the sample to be tested belongs is generated in response to the value of the second methylation data being less than or equal to the classification threshold, and second indication information indicating the classification to which the sample to be tested belongs is generated in response to the value of the second methylation data being greater than the classification threshold.
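The preset rule described above can be sketched in Python as follows. This is only an illustration of the comparison step: the correction function, its offset, and the numeric values are hypothetical placeholders, since the text does not specify how the correction is computed at this point.

```python
from typing import Callable


def classify_sample(first_value: float,
                    correct: Callable[[float], float],
                    threshold: float) -> str:
    """Apply a correction to the first methylation value, then compare with the threshold.

    Values less than or equal to the threshold yield the first indication;
    values greater than the threshold yield the second indication.
    """
    second_value = correct(first_value)
    if second_value <= threshold:
        return "first indication"
    return "second indication"


def example_correction(value: float) -> float:
    """Hypothetical correction: subtract a fixed offset attributed to a confounding factor."""
    return value - 0.05


print(classify_sample(0.20, example_correction, threshold=0.25))  # first indication
print(classify_sample(0.40, example_correction, threshold=0.25))  # second indication
```

Passing the correction in as a function keeps the threshold comparison separate from whichever confounder adjustment is actually used.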
In some preferred embodiments, the classification of the sample to be tested includes: liver cancer, liver cirrhosis, chronic viral hepatitis B, liver metastasis, high cholesterol, alcoholic liver disease, cyst, fatty liver disease (NAFLD), fibrosis, jaundice, primary sclerosing cholangitis (PSC), hemochromatosis, primary biliary cirrhosis, or alpha-1 antitrypsin deficiency.
In some embodiments, in the method provided by the present disclosure, the first indication information indicating the classification to which the sample to be tested belongs may be a prompt indicating that the sample to be tested is a liver cancer sample, and the second indication information may be a prompt indicating that the sample to be tested is one or more kinds of liver non-cancer disease sample.
In some embodiments, for example, the method provided by the present disclosure may classify whether the sample to be tested is a liver cancer sample or a chronic viral hepatitis B sample; in this application scenario, the first indication information may be a prompt indicating that the sample to be tested is a liver cancer sample, and the second indication information may be a prompt indicating that the sample to be tested is a chronic viral hepatitis B sample.
In some embodiments, for example, the method provided by the present disclosure may classify whether the sample to be tested is a liver cancer sample or a liver cirrhosis sample; in this application scenario, the first indication information may be a prompt indicating that the sample to be tested is a liver cancer sample, and the second indication information may be a prompt indicating that the sample to be tested is a liver cirrhosis sample.
In some preferred embodiments, the sample to be tested is selected from any one or more of the following: tissue samples, blood samples, saliva, sputum, pleural effusion, lung lavage, peritoneal effusion, peritoneal lavage, enema, and cerebrospinal fluid.
In another aspect, the present disclosure provides a methylation data classification apparatus comprising: an acquisition unit configured to acquire first methylation data corresponding to the biomarker combination according to claim 1 in a sample to be tested; a correction unit configured to correct the first methylation data according to a confounding factor corresponding to the sample to be tested to obtain second methylation data; and a classification unit configured to classify, based on a preset rule, according to the second methylation data and a classification threshold, and to generate indication information indicating the classification to which the sample to be tested belongs, wherein the preset rule comprises comparing the value of the second methylation data with the classification threshold and generating the indication information according to the comparison result; preferably, first indication information indicating the classification to which the sample to be tested belongs is generated in response to the value of the second methylation data being less than or equal to the classification threshold, and second indication information indicating the classification to which the sample to be tested belongs is generated in response to the value of the second methylation data being greater than the classification threshold.
In some preferred embodiments, the classification of the sample to be tested includes: liver cancer, liver cirrhosis, chronic viral hepatitis B, liver metastasis, high cholesterol, alcoholic liver disease, cyst, fatty liver disease (NAFLD), fibrosis, jaundice, primary sclerosing cholangitis (PSC), hemochromatosis, primary biliary cirrhosis, or alpha-1 antitrypsin deficiency.
In another aspect, the present disclosure provides an electronic device, comprising: one or more processors; and a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement the method.
The implementation environment of the invention comprises electronic equipment, and the method for distinguishing the liver cancer sample from the liver non-cancer disease sample in the embodiment of the invention can be executed by the terminal equipment. By way of example, the electronic device may comprise at least one of a terminal device or a server.
The terminal device may be hardware or software. When the terminal device is hardware, it may be a variety of electronic devices having a display screen and supporting information input (e.g., text input and/or voice input, etc.), including but not limited to smart phones, tablet computers, laptop and desktop computers, and the like. When the terminal device is software, it can be installed in the above-listed terminal device. It may be implemented as a plurality of software or software modules (e.g., to provide a service for distinguishing liver cancer samples from liver non-cancerous disease samples), or as a single software or software module. The present invention is not particularly limited herein.
In another aspect, the present invention provides a computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by one or more processors, implements the method described above.
The computer readable medium of the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the method of assessing the correlation of a sample under test with risk of cancer formation shown in the above-described embodiments and alternative embodiments thereof.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The above description is only illustrative of the preferred embodiments of the present invention and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to in the present invention is not limited to the specific combinations of technical features described above, but also covers other technical solutions formed by any combination of the technical features described above or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present invention.
Examples
Example 1: division of DMR regions
1. Hypothesis testing
Obtaining a sample to be tested (for example, a blood sample), wherein the sample to be tested is divided into a liver cancer group (C group) and a normal group (N group), and the bisulfite methylation sequencing of the sample to be tested can comprise the following steps:
S1: cell-free DNA (cfDNA) extraction: for example, the QIAamp Circulating Nucleic Acid Kit (Qiagen, 55114) and its corresponding platform can be used;
S2: bisulfite conversion: for example, the bisulfite conversion (BC) step is performed using a modified protocol based on the EZ-96 DNA Methylation-Lightning MagPrep kit (Zymo, D5047);
S3: pre-library preparation: this comprises a first tailing and ligation step, in which a splint (splinter) adapter carrying several randomly synthesized G or A bases may be used; the adapter anneals to the poly-C/T tail at the 3' end of the single-stranded DNA substrate, and ligation is completed after the adapter overhang hybridizes with the first tail. The DNA substrate with an adapter at one end is then denatured into single strands and subjected to 5-15 rounds of linear amplification; a second tailing and ligation step, similar to the first, ligates a second adapter to the A tail at the other end of the DNA substrate; and several rounds of PCR amplification complete the pre-library (see, for example, Chinese patent publication CN110892097A);
s4: pre-library hybridization: hybridizing a pre-library with a hybridization capture probe covering the target DMR region;
S5: capturing and eluting: the non-specific fragments are eluted through the combination of the magnetic beads and the probes, the magnetic beads are removed, and the final library is formed through PCR amplification;
S6: sequencing: the final library is sequenced on an NGS sequencer to generate sequencing data containing the target DMR regions.
In this embodiment, the step of noise reduction treatment for genomic methylation signal CpG and noise region CHH/CHG sites may be optionally included, for example, see Chinese patent publication CN114974417A.
For each CpG site, a hypothesis test is performed to determine whether the difference between group C and group N is statistically significant, and a P value is calculated for each CpG site. The calculation uses weighted logistic regression (weighted LR): the weight is determined by the coverage depth of each CpG site, the methylation level of each CpG site serves as the explanatory variable, and the binary outcome (0, 1) corresponds to groups C and N.
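A per-site test of this kind might be sketched in pure Python as shown below. This is a minimal illustration under several assumptions that are not stated in the text: a single explanatory variable with an intercept, Newton's method for fitting, weights rescaled so that their mean is 1, and a Wald test on the slope to obtain the P value. The function name and the toy data are hypothetical.

```python
import math


def weighted_logit_pvalue(x, y, depth, iters=25):
    """Fit y ~ sigmoid(a + b*x) with coverage-based weights; return (b, Wald P value)."""
    n = len(x)
    total = sum(depth)
    w = [d * n / total for d in depth]  # assumption: weights rescaled to mean 1
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, wi in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            r = wi * (yi - p)        # weighted residual (gradient contribution)
            v = wi * p * (1.0 - p)   # weighted variance (Hessian contribution)
            g0 += r
            g1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g explicitly.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    # Standard error of the slope from the Hessian of the last iteration.
    se_b = math.sqrt(h00 / det)
    z = b / se_b
    return b, math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical data for one CpG site: methylation level, group (1 = C, 0 = N), depth.
levels = [0.10, 0.20, 0.30, 0.50, 0.25, 0.45, 0.60, 0.70]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
depths = [500, 400, 300, 200, 250, 500, 450, 350]
slope, p_value = weighted_logit_pvalue(levels, groups, depths)
print(slope, p_value)
```

In practice a statistics library would normally be used for the fit; the hand-written solver is shown only to keep the example self-contained.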
2. Partitioning of DMRs
The similarity of methylation levels at spatially contiguous genomic sites is evaluated according to the following formula, using the methylation level and the sequencing coverage depth of each methylated CpG site as parameters. The deeper the coverage, the larger the value of the parameter P in the formula and the higher the similarity of methylation levels between adjacent CpG sites within the same group (liver cancer group or normal group); DMRs are then partitioned accordingly:
The subscript ij of each parameter represents the j-th site of the i-th sample, the parameter d is used for representing the effective coverage depth of the CpG sites in the liver cancer group, and the parameter M is used for representing the methylation level of the CpG sites in the liver cancer group.
After the calculation, the β value is used as the decision index, with β = 0.25 as the preset threshold. When β is less than the preset threshold, the j-th and (j+1)-th sites are included in the calculation of the region statistic B (the B value indicates whether a partitioned DMR is an effective DMR) and may be assigned to one DMR; when β is greater than or equal to the preset threshold, the j-th and (j+1)-th sites are not included in the calculation of the region statistic B and are not assigned to one DMR.
In this embodiment, an exemplary case in which CpG sites cannot be assigned to the same DMR is given (as shown in Fig. 1) to explain the principle of partitioning DMRs in the present invention.
Wherein the colored dots characterize a methylated CpG site, sample A, sample B, and sample C are from the same sample group (e.g., tumor group or normal group as described above), wherein sample A and sample B each obtain coverage of 500 effective sequences, and sample C obtains coverage of 200 effective sequences. The dots of each column correspond to the same CpG site, with the methylation level of the first CpG site in the region being 0.2 and the methylation level of the second CpG site being 0 in sample A.
The coverage depth parameter value P for the first CpG site within the region was calculated to be 0.617 for sample A, sample B and sample C above. Substituting the above parameters into the above formula gives β11 = 0.29; based on the preset threshold of 0.25, the methylation level difference between the first CpG site and the second CpG site in the region is greater than 0.25, so the two adjacent CpG sites are not assigned to the same DMR.
Another exemplary case of dividing into the same DMR is given in this embodiment (as shown in fig. 2) to explain the principle of dividing the DMR in the present invention.
Wherein the colored dots characterize a methylated CpG site, sample A, sample B and sample D are from the same sample group (e.g., tumor group or normal group) and wherein sample A and sample B each obtain coverage of 500 effective sequences and sample D obtains coverage of 400 effective sequences (the coverage depth of sample D is increased compared to sample C in the previous example, and thus the P value in the present example is also increased accordingly). Also, in sample a, the methylation level of the first CpG site in this region is 0.2 and the methylation level of the second CpG site is 0.
The coverage depth parameter value P for the first CpG site within the region was calculated to be 0.962 for sample A, sample B and sample D above. Substituting the above parameters into the above formula gives β11 = 0.21; based on the preset threshold of 0.25, the methylation level difference between the first CpG site and the second CpG site in the region is less than 0.25, so the two adjacent CpG sites are assigned to the same DMR.
For the above method, see Chinese patent publication CN115132273A.
Therefore, the coverage depth of CpG sites is introduced in the DMR division process by the method, so that the accuracy of DMR region division can be remarkably improved.
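The thresholding step in the two worked examples above (β = 0.29 keeps the sites apart, β = 0.21 joins them) can be sketched in Python. The β values are taken as precomputed inputs, because the formula that produces them is not reproduced in this text; the function name and the zero-based site indices are assumptions for illustration only, and the regions returned are candidates that would still need the B-value check.

```python
def partition_dmrs(betas, threshold=0.25):
    """Group adjacent CpG sites into candidate regions.

    betas[j] is the precomputed beta value between site j and site j + 1.
    Adjacent sites are joined only when beta is strictly below the threshold.
    Returns a list of regions, each a list of zero-based site indices.
    """
    regions = [[0]]
    for j, beta in enumerate(betas):
        if beta < threshold:
            regions[-1].append(j + 1)   # similar enough: extend the current region
        else:
            regions.append([j + 1])     # too different: start a new region
    return regions


print(partition_dmrs([0.29]))  # [[0], [1]]  (the Fig. 1 case)
print(partition_dmrs([0.21]))  # [[0, 1]]    (the Fig. 2 case)
```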
3. Calculation of region statistics B value
In some optional embodiments, based on the above calculated β value, a region statistic B value of CpG sites in the region is further calculated according to the following formula to represent whether the DMR obtained by the division is a valid DMR.
The calculation formula of the value B is as follows:
Wherein the parameter k is the number of CpG sites in the region, and the subscript ij of each parameter denotes the j-th site of the i-th sample. With β = 0.25 as the preset threshold, when β is less than the preset threshold, the j-th and (j+1)-th sites are included in the calculation of the region statistic B and may be assigned to one DMR; when β is greater than or equal to the preset threshold, the j-th and (j+1)-th sites are not included in the calculation of the region statistic B and are not assigned to one DMR. With B = 1 as the preset threshold, when the B value is less than the preset threshold, the DMR corresponding to the j-th and (j+1)-th sites can be used as an effective DMR; when the B value is greater than or equal to the preset threshold, the DMR corresponding to the j-th and (j+1)-th sites is not used as an effective DMR.
An exemplary case (as shown in fig. 3) is given in this embodiment to explain the principle of judging whether the DMR is effective in the present invention.
When the DMRs obtained for groups A, B and C each contain 10 CpG sites, the Bij values of all samples are pooled when calculating the B value of each DMR, and their mean is taken as the score of that DMR.
Wherein the calculation steps of the B value in the DMR shown in the group A are shown in the following table:
The B-value score of the DMR corresponding to group A is less than the preset threshold of 1; therefore, the DMR may be an effective DMR.
Similarly, the B-value score of the DMR shown in group B allows that DMR to be used as an effective DMR, whereas the B-value score of the DMR shown in group C means that the DMR corresponding to group C cannot be an effective DMR.
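The scoring rule described above (pool the per-sample, per-site B values of a DMR, take their mean as the DMR score, and compare the score with the preset threshold of 1) can be sketched as follows. The per-site values are hypothetical inputs, since the formula that produces them is not reproduced in this text, and the function names are illustrative only.

```python
def dmr_score(b_values_by_sample):
    """Pool the per-site B values of all samples and return their mean."""
    pooled = [b for sample in b_values_by_sample for b in sample]
    return sum(pooled) / len(pooled)


def is_effective(score, threshold=1.0):
    """A DMR is treated as effective only when its score is strictly below the threshold."""
    return score < threshold


# Hypothetical per-site values for two samples within one DMR.
example = [[0.5, 0.7], [0.9, 0.3]]
score = dmr_score(example)
print(score, is_effective(score))
```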
Example 2: cancer detection (Detection of Cancer, DOC) model building
The invention quantifies the bias introduced by confounding variables that may affect the accuracy of the classification model, thereby improving the accuracy and generalization capability of the DOC model. In the application scenario of the present invention, the ctDNA content in a patient's blood differs greatly across the developmental stages of liver cancer and is easily affected by experimental batch effects, and methylation is related to the age and race of the subject from whom the sample to be tested is obtained and to whether the subject has other diseases; all of these conditions may constitute confounding variables in the present embodiment.
The parameters involved in the formulas shown in this embodiment are defined in accordance with the definitions known in the art, except for the parameters specifically defined and explained.
To quantify the bias caused by confounding variables, the invention adopts a Salmon model construction method; an exemplary quantification approach in this embodiment is the Hilbert-Schmidt independence criterion (HSIC). After the bias is quantified, a regularization term is embedded in the model for correction.
Quantification using the Hilbert-Schmidt independence criterion is given by the following formula:
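$$\lVert C_{h(x)h(z)} \rVert^{2} = \left(E[h(x)h(z)] - E[h(x)]\,E[h(z)]\right)^{2} = \left(E[h(x)h(z)]\right)^{2} + \left(E[h(x)]\,E[h(z)]\right)^{2} - 2\,E[h(x)h(z)]\,E[h(x)]\,E[h(z)]$$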
where $L_H$ (the Hilbert-Schmidt independence criterion) calculated by the formula represents the degree of independence between the variables X and Z. In the present invention, the feature vectors are $X = (x_1, \dots, x_m)$, where $x_i$ is an n-dimensional vector representing the methylation features of sample i; the classification labels are $Y = (y_1, \dots, y_m)$, where $y_i \in \{-1, +1\}$ is the classification label of $x_i$ (positive when $y_i = +1$, negative when $y_i = -1$); the confounding variables are $Z = (z_1, \dots, z_m)$, where $z_i$ is the confounding variable of sample i; and m is the number of samples.
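As a small numerical illustration of the formula above, the following sketch estimates the squared covariance between h(x) and h(z) from paired observations and checks that the direct and expanded forms agree. It assumes scalar inputs, sample means in place of expectations, and an identity map for h by default; these choices and the function names are illustrative assumptions, not the invention's implementation.

```python
def sample_mean(values):
    """Plain sample mean, used here as the estimate of an expectation."""
    return sum(values) / len(values)


def squared_covariance(x, z, h=lambda v: v):
    """Return (direct, expanded) estimates of (E[h(x)h(z)] - E[h(x)]E[h(z)])^2."""
    hx = [h(v) for v in x]
    hz = [h(v) for v in z]
    e_xz = sample_mean([a * b for a, b in zip(hx, hz)])
    e_x = sample_mean(hx)
    e_z = sample_mean(hz)
    direct = (e_xz - e_x * e_z) ** 2
    expanded = e_xz ** 2 + (e_x * e_z) ** 2 - 2 * e_xz * e_x * e_z
    return direct, expanded


# Toy data: z is a linear function of x, so the dependence measure is non-zero.
x = [1.0, 2.0, 3.0, 4.0]
z = [2.0, 4.0, 6.0, 8.0]
print(squared_covariance(x, z))

# A constant confounder carries no information about x, so the measure is zero.
print(squared_covariance(x, [5.0, 5.0, 5.0, 5.0]))
```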
A support vector machine (SVM) is used as the main classifier for binary classification. To control the confounding variables, a regularization term is added to the objective function solved by the SVM. The objective function is solved subject to the following constraints:
$$\text{s.t.}\quad y_i\left(w^{T}x_i + b\right) \ge 1 - \xi_i,\qquad \xi_i \ge 0$$
where $\xi_i$ is the degree to which sample $x_i$ violates the margin constraint, and C and λ are coefficients that control the balance among minimizing the training error, minimizing the correlation between the confounding variables and the explanatory variables, and maximizing the classification margin.
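The objective function itself is not reproduced in the text above. Purely as a sketch, one common way to write a confounder-penalized soft-margin objective is a margin term, plus C times the total slack, plus λ times a squared-covariance penalty between the decision score and the confounder. That form, the function name and the toy data below are assumptions made for illustration and should not be read as the patent's formula.

```python
def penalized_svm_objective(w, b, X, y, z, C=1.0, lam=1.0):
    """Evaluate an assumed objective: 0.5*||w||^2 + C*sum(xi) + lam*cov(score, z)^2."""
    m = len(X)
    # Decision scores w^T x_i + b for every sample.
    scores = [sum(wj * xj for wj, xj in zip(w, x)) + b for x in X]
    # Smallest slack satisfying y_i * score_i >= 1 - xi_i and xi_i >= 0.
    slacks = [max(0.0, 1.0 - yi * si) for yi, si in zip(y, scores)]
    # Squared sample covariance between the decision score and the confounder.
    mean_s = sum(scores) / m
    mean_z = sum(z) / m
    cov = sum(s * zi for s, zi in zip(scores, z)) / m - mean_s * mean_z
    margin_term = 0.5 * sum(wj * wj for wj in w)
    return margin_term + C * sum(slacks) + lam * cov ** 2


# Toy one-dimensional data (hypothetical values).
X = [[2.0], [-2.0]]
y = [1, -1]
print(penalized_svm_objective([1.0], 0.0, X, y, z=[0.0, 0.0]))   # no confounder signal
print(penalized_svm_objective([1.0], 0.0, X, y, z=[1.0, -1.0]))  # score tracks confounder
```

In this sketch a larger λ penalizes weight vectors whose scores co-vary with the confounder, while a larger C penalizes margin violations more heavily.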
In this embodiment, Fig. 4 shows how the DOC model of the present application controls the weight given to the confounding variables.
Each data point represents a blood sample used for DOC model construction; the horizontal axis represents the confounding variable of the corresponding sample, and the vertical axis represents the original, uncorrected explanatory variable (left graph) and the corrected explanatory variable (right graph), respectively. Comparing the results before and after correction shows that the weight of the confounding variables is controlled in the DOC model established by the invention.
In this example, Fig. 5 shows that the DOC model established in the present invention overcomes a weakness of earlier methylation assays, whose false-positive rate in healthy populations increases with age, and remains balanced across all age groups (the horizontal axis represents age, and the vertical axis represents the model's liver cancer probability score).
Example 3: detection of chronic viral hepatitis B based on DMR by DOC model
Based on the differences between liver cancer patients and chronic viral hepatitis B patients across DMRs, 30 chronic viral hepatitis B samples and 82 liver cancer samples were randomly split at a ratio of 7:3 into a training set (21 chronic viral hepatitis B samples and 57 liver cancer samples) and a validation set (9 chronic viral hepatitis B samples and 25 liver cancer samples). Using the training set samples, 195 DMRs with significant methylation-level differences (shown in Table 1) were screened out and used to construct the DOC model and determine the threshold; the discriminative performance of the model and threshold was then confirmed using the validation set samples.
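The 7:3 split described above can be sketched as a per-class random split, which reproduces the stated set sizes (21/9 and 57/25). The sample identifiers, the seed and the rounding rule are assumptions for illustration; the description does not state how the split was implemented.

```python
import random


def split_7_3(samples, train_fraction=0.7, seed=0):
    """Shuffle one class of samples and split it into training and validation parts."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical sample identifiers.
hbv_samples = [f"HBV_{i}" for i in range(30)]
hcc_samples = [f"HCC_{i}" for i in range(82)]

hbv_train, hbv_val = split_7_3(hbv_samples)
hcc_train, hcc_val = split_7_3(hcc_samples)
print(len(hbv_train), len(hbv_val), len(hcc_train), len(hcc_val))
```

Splitting each class separately keeps the class proportions similar in both sets.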
TABLE 1 195 DMRs screened against chronic viral hepatitis B according to the present invention
Ten-fold cross-validation was performed on the 21 chronic viral hepatitis B samples and 57 liver cancer samples of the training set, and the mean of the thresholds at which the Youden index was optimal in each fold was taken as the classification threshold for negative/positive classification of the training set and validation set samples. The results were as follows: the overall sensitivity of the training set was 96.5% (55/57), the overall specificity was 90.5% (19/21), and the AUC was 0.991; the overall sensitivity of the validation set was 88.0% (22/25), the overall specificity was 88.9% (8/9), and the AUC was 0.915. The sensitivity for each liver cancer stage is shown in Table 2.
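As a quick arithmetic check, the percentages quoted above follow directly from the counts in parentheses. The helper name below is illustrative.

```python
def percent(numerator, denominator):
    """Return a proportion as a percentage rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)


# Counts quoted in the text: (correctly classified, total).
training = {"sensitivity": (55, 57), "specificity": (19, 21)}
validation = {"sensitivity": (22, 25), "specificity": (8, 9)}

for name, counts in (("training", training), ("validation", validation)):
    for metric, (hits, total) in counts.items():
        print(f"{name} {metric}: {percent(hits, total)}%")
```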
The specific steps of ten-fold cross-validation on the training set are as follows:
1. The chronic viral hepatitis B samples in the training set are randomly split into 10 parts; likewise, the liver cancer samples are randomly split into 10 parts;
2. A DOC model is built using 9/10 of the chronic viral hepatitis B samples and 9/10 of the liver cancer samples;
3. The DOC model is used to predict the remaining 1/10 of the chronic viral hepatitis B samples and 1/10 of the liver cancer samples, and the optimal threshold for that fold is obtained by maximizing the Youden index;
4. Steps 2 and 3 are repeated in turn until all samples have been traversed, giving 10 optimal thresholds;
5. The mean of the 10 optimal thresholds is taken as the threshold of the DOC model; in this embodiment, DOC model threshold = -0.03;
6. The DOC model and the corresponding threshold are used to classify the validation set samples as negative or positive: a sample is judged negative when its DOC model score is less than the threshold and positive when its score is greater than the threshold.
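The six steps above can be sketched as follows. This is an illustration under stated assumptions: `fit_stand_in_model` is a hypothetical one-feature placeholder for the DOC model (which the description builds with an SVM), the candidate cut-offs are midpoints between consecutive scores, and the data are synthetic.

```python
import random


def youden_threshold(scores, labels):
    """Return the cut-off that maximizes sensitivity + specificity - 1."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == -1]
    ordered = sorted(set(scores))
    # Candidate cut-offs: midpoints between consecutive distinct scores.
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])] or ordered
    best_j, best_t = -1.0, candidates[0]
    for t in candidates:
        sensitivity = sum(s > t for s in pos) / len(pos)
        specificity = sum(s < t for s in neg) / len(neg)
        j = sensitivity + specificity - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t


def make_folds(items, k, rng):
    """Step 1: randomly split one class into k parts."""
    items = list(items)
    rng.shuffle(items)
    return [items[i::k] for i in range(k)]


def fit_stand_in_model(train):
    """Hypothetical placeholder model: score = x minus the midpoint of the class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == -1]
    midpoint = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x - midpoint


def cross_validated_threshold(negatives, positives, k=10, seed=0):
    rng = random.Random(seed)
    neg_folds = make_folds([(x, -1) for x in negatives], k, rng)
    pos_folds = make_folds([(x, 1) for x in positives], k, rng)
    thresholds = []
    for i in range(k):
        held_out = neg_folds[i] + pos_folds[i]
        train = [s for j in range(k) if j != i for s in neg_folds[j] + pos_folds[j]]
        model = fit_stand_in_model(train)                     # step 2
        scores = [model(x) for x, _ in held_out]              # step 3
        labels = [y for _, y in held_out]
        thresholds.append(youden_threshold(scores, labels))   # steps 3 and 4
    return sum(thresholds) / k, thresholds                    # step 5


# Synthetic one-feature data with the training-set sizes from the text (21 and 57).
data_rng = random.Random(1)
negatives = [data_rng.gauss(-1.0, 0.5) for _ in range(21)]
positives = [data_rng.gauss(1.0, 0.5) for _ in range(57)]
mean_threshold, fold_thresholds = cross_validated_threshold(negatives, positives)
print(round(mean_threshold, 3), len(fold_thresholds))
```

Step 6 then amounts to comparing each validation sample's score with `mean_threshold`.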
TABLE 2 sensitivity of liver cancer stages
In practice, owing to cost and efficiency constraints, the DOC model can also be built with a smaller number of DMRs rather than all 195 DMRs in Table 1 above. In five replicates, 10 of the 195 DMRs were randomly selected each time to construct the DOC model and the corresponding threshold; the 10 DMRs selected in each replicate are shown in Table 3:
TABLE 3 five randomly selected DMR combinations for chronic viral hepatitis B
With the training set specificity controlled at the same level (85.7%), the sensitivity and specificity results of the five replicates are shown in Table 4:
TABLE 4 sensitivity (with each stage), specificity results of five randomly selected 10 DMR repeats
It can thus be seen that any 10 of the 195 DMRs provided by the invention achieve good specificity and sensitivity in classifying the training set and validation set, including positives at each stage of liver cancer, and meet the expected use.
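Holding the training set specificity at a fixed level, as in the comparison above, can be sketched by choosing the cut-off from the sorted scores of the negative (non-cancer) training samples. With 21 negatives, requiring 18 true negatives gives 18/21 = 85.7%. The function names and the scores below are hypothetical, and the rule "score less than or equal to the cut-off is negative" is an assumption of this sketch.

```python
def threshold_for_true_negatives(negative_scores, n_true_negatives):
    """Cut-off at which the n lowest-scoring negatives are called negative (score <= cut-off)."""
    ordered = sorted(negative_scores)
    return ordered[n_true_negatives - 1]


def specificity(negative_scores, threshold):
    """Share of negative samples scoring at or below the cut-off."""
    return sum(s <= threshold for s in negative_scores) / len(negative_scores)


def sensitivity(positive_scores, threshold):
    """Share of positive samples scoring above the cut-off."""
    return sum(s > threshold for s in positive_scores) / len(positive_scores)


# Hypothetical model scores for 21 negative and 5 positive training samples.
neg_scores = [i / 10 for i in range(21)]   # 0.0, 0.1, ..., 2.0
pos_scores = [1.5, 1.9, 2.4, 2.8, 3.1]

cutoff = threshold_for_true_negatives(neg_scores, 18)
print(cutoff, round(100 * specificity(neg_scores, cutoff), 1), sensitivity(pos_scores, cutoff))
```

If several negatives share the cut-off score, the achieved specificity can exceed the target; the sketch does not handle ties specially.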
Example 4: detection of cirrhosis based on DMR using DOC model
Based on the differences between liver cancer patients and liver cirrhosis patients across DMRs, 22 liver cirrhosis samples and 82 liver cancer samples were split at a ratio of 7:3 into a training set (15 liver cirrhosis samples and 57 liver cancer samples) and a validation set (7 liver cirrhosis samples and 25 liver cancer samples). Using the training set samples, 230 DMRs with significant methylation-level differences (shown in Table 5) were screened out and used to construct the DOC model and determine the threshold; the discriminative performance of the model and threshold was then confirmed using the validation set samples.
TABLE 5 230 DMR for cirrhosis selected according to the invention
Ten-fold cross-validation was performed on the 15 liver cirrhosis samples and 57 liver cancer samples of the training set, and the mean of the thresholds at which the Youden index was optimal in each fold was taken as the classification threshold for negative/positive classification of the training set and validation set samples. The specific procedure for ten-fold cross-validation is the same as in Example 3, and the resulting threshold was 0.5. The results were as follows: the overall sensitivity of the training set was 86.0% (49/57), the overall specificity was 86.7% (13/15), and the AUC was 0.929; the overall sensitivity of the validation set was 80.0% (20/25), the overall specificity was 85.7% (6/7), and the AUC was 0.940. The sensitivity for each liver cancer stage is shown in Table 6.
TABLE 6 sensitivity of liver cancer stages
In practice, owing to cost and efficiency constraints, the DOC model can also be built with a smaller number of DMRs rather than all 230 DMRs in Table 5 above. In five replicates, 10 of the 230 DMRs were randomly selected each time to construct the DOC model and the corresponding threshold; the 10 DMRs selected in each replicate are shown in Table 7:
TABLE 7 five randomly selected DMR combinations for liver cirrhosis
With the training set specificity controlled at the same level (73.3%), the sensitivity and specificity results of the five replicates are shown in Tables 8 and 9:
TABLE 8 sensitivity (without each stage), specificity results of five randomly selected 10 DMR repeats
TABLE 9 sensitivity (with each stage), specificity results of five randomly selected 10 DMR repeats
It can thus be seen that any 10 of the 230 DMRs provided by the invention achieve good specificity and sensitivity in classifying the training set and validation set, including positives at each stage of liver cancer, and meet the expected use.
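Drawing five random combinations of 10 DMRs from a panel of 230 can be sketched as below. DMRs are represented only by their 1-based row index in the table; the seed and function name are assumptions, and the combinations produced are not those listed in Table 7.

```python
import random


def draw_replicates(n_dmrs, subset_size=10, n_replicates=5, seed=0):
    """Draw n_replicates random subsets of DMR indices without replacement within a subset."""
    rng = random.Random(seed)
    indices = list(range(1, n_dmrs + 1))
    return [sorted(rng.sample(indices, subset_size)) for _ in range(n_replicates)]


replicates = draw_replicates(230)
for number, subset in enumerate(replicates, start=1):
    print(f"replicate {number}: {subset}")
```

Each subset would then be used to rebuild the DOC model and its threshold as described in Example 3.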
Furthermore, although the description only provides a method of constructing a DOC model for methylation detection and determining a threshold based on the differences between liver cancer patients and chronic viral hepatitis B patients or liver cirrhosis patients, the method can equally be applied to patients with other non-cancerous liver conditions, including: liver metastasis, hypercholesterolemia, alcoholic liver disease, cyst, fatty liver disease (NAFLD), fibrosis, jaundice, primary sclerosing cholangitis (PSC), hemochromatosis, primary biliary cirrhosis, or alpha-1 antitrypsin deficiency.
The foregoing detailed description is provided by way of explanation and example and is not intended to limit the scope of the appended claims. Numerous variations of the presently illustrated embodiments of the application will be apparent to those of ordinary skill in the art and are intended to be within the scope of the appended claims and equivalents thereof.
Claims (10)
1. A biomarker combination for distinguishing liver cancer samples from liver non-cancer disease samples, wherein the biomarker combination comprises any at least 10 of the differentially methylated regions (DMRs) set forth in Table 1 and/or Table 5, wherein the reference genome used for the DMRs in Table 1 and/or Table 5 is the GRCh37/hg19 human reference genome;
preferably, the biomarker combination comprises any at least 10 DMR as set forth in table 1, and/or any at least 10 DMR as set forth in table 5;
Preferably, the biomarker combination comprises all 195 DMRs shown in table 1, and/or all 230 DMRs shown in table 5;
more preferably, the biomarker combination comprises any of at least 10 DMR as shown in table 1; alternatively, the biomarker combination comprises 10 DMR selected from any of the group of repeat one, repeat two, repeat three, repeat four, or repeat five shown in table 3;
More preferably, the biomarker combination comprises any of at least 10 DMR as shown in table 5; alternatively, the biomarker combination comprises 10 DMR selected from any of the group of repeat one, repeat two, repeat three, repeat four, or repeat five shown in table 7.
2. A kit, wherein the kit comprises reagents for detecting the biomarker combination of claim 1.
3. The kit of claim 2, wherein the kit comprises next generation sequencing reagents;
Preferably, the next generation sequencing reagents comprise hybridization capture probes or primers covering any at least 10 DMR in table 1 and/or table 5;
preferably, the next generation sequencing reagent comprises a primer covering any at least 10 DMR as set forth in table 1, and/or any at least 10 DMR as set forth in table 5;
Preferably, the next generation sequencing reagent comprises a hybridization capture probe or primer covering all 195 DMRs shown in table 1, and/or all 230 DMRs shown in table 5;
More preferably, the next generation sequencing reagents comprise hybridization capture probes or primers covering any of at least 10 DMR shown in table 1; alternatively, the next generation sequencing reagents comprise hybridization capture probes or primers covering 10 DMR selected from any of the sets of repeat one, repeat two, repeat three, repeat four, or repeat five shown in table 3;
more preferably, the next generation sequencing reagents comprise hybridization capture probes or primers covering any of at least 10 DMR shown in table 5; alternatively, the next generation sequencing reagents comprise hybridization capture probes or primers covering 10 DMR selected from any one of the sets of repeat one, repeat two, repeat three, repeat four, or repeat five shown in table 7.
4. A kit according to claim 2 or 3, wherein the kit is for distinguishing liver cancer samples from liver non-cancerous disease samples;
Preferably, the liver non-cancerous disease comprises: cirrhosis, chronic viral hepatitis B, liver metastasis, high cholesterol, alcoholic liver disease, cyst, fatty liver disease (NAFLD), fibrosis, jaundice, primary Sclerosing Cholangitis (PSC), hemochromatosis, primary biliary cirrhosis, or alpha-1 antitrypsin deficiency.
5. Use of a reagent for detecting the biomarker combination of claim 1 in the preparation of a kit for distinguishing a liver cancer sample from a liver non-cancer disease sample;
Preferably, the liver non-cancerous disease comprises: cirrhosis, chronic viral hepatitis B, liver metastasis, high cholesterol, alcoholic liver disease, cyst, fatty liver disease (NAFLD), fibrosis, jaundice, primary Sclerosing Cholangitis (PSC), hemochromatosis, primary biliary cirrhosis, or alpha-1 antitrypsin deficiency.
6. A methylation data classification method comprising:
acquiring first methylation data corresponding to the biomarker combination of claim 1 in a sample to be tested;
correcting the first methylation data according to the confounding factor corresponding to the sample to be tested to obtain second methylation data;
classifying based on a preset rule according to the second methylation data and a classification threshold, and generating indication information for indicating the classification to which the sample to be tested belongs;
The preset rule comprises the steps of comparing the numerical value of the second methylation data with a classification threshold value, and generating the indication information according to a comparison result; preferably, in response to the value of the second methylation data being smaller than or equal to the classification threshold, first indication information for indicating the classification to which the sample to be tested belongs is generated, and in response to the value of the second methylation data being larger than the classification threshold, second indication information for indicating the classification to which the sample to be tested belongs is generated;
Preferably, the classification to which the sample to be tested belongs includes: liver cancer, liver cirrhosis, chronic viral hepatitis B, liver metastasis, high cholesterol, alcoholic liver disease, cyst, fatty liver disease (NAFLD), fibrosis, jaundice, primary Sclerosing Cholangitis (PSC), hemochromatosis, primary biliary cirrhosis, or alpha-1 antitrypsin deficiency.
7. The method of claim 6, wherein the sample to be tested is selected from any one or more of the following: tissue samples, blood samples, saliva, sputum, pleural effusion, lung lavage, peritoneal effusion, peritoneal lavage, enema, and cerebrospinal fluid.
8. An apparatus for distinguishing a liver cancer sample from a liver non-cancer disease sample, comprising:
An acquisition unit configured to acquire first methylation data corresponding to the biomarker combination of claim 1 in a sample to be tested;
A correction unit configured to correct the first methylation data according to the confounding factor corresponding to the sample to be tested to obtain second methylation data;
A classification unit configured to generate, based on a preset rule and according to the second methylation data and a classification threshold, indication information for indicating the classification to which the sample to be tested belongs, wherein the preset rule comprises comparing the value of the second methylation data with the classification threshold and generating the indication information according to the comparison result; preferably, in response to the value of the second methylation data being smaller than or equal to the classification threshold, first indication information for indicating the classification to which the sample to be tested belongs is generated, and in response to the value of the second methylation data being larger than the classification threshold, second indication information for indicating the classification to which the sample to be tested belongs is generated;
Preferably, the classification to which the sample to be tested belongs includes: liver cancer, liver cirrhosis, chronic viral hepatitis B, liver metastasis, high cholesterol, alcoholic liver disease, cyst, fatty liver disease (NAFLD), fibrosis, jaundice, primary Sclerosing Cholangitis (PSC), hemochromatosis, primary biliary cirrhosis, or alpha-1 antitrypsin deficiency.
9. An electronic device, comprising:
one or more processors;
A storage device having one or more programs stored thereon,
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of claim 6 or 7.
10. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by one or more processors implements the method of claim 6 or 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311830003.3A CN117965725A (en) | 2023-12-28 | 2023-12-28 | Method, device and kit for distinguishing liver cancer from liver non-cancer disease samples |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117965725A true CN117965725A (en) | 2024-05-03 |
Family
ID=90848689
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||