CN117965725A

CN117965725A - Method, device and kit for distinguishing liver cancer from liver non-cancer disease samples

Info

Publication number: CN117965725A
Application number: CN202311830003.3A
Authority: CN
Inventors: 高强; 樊嘉; 顾建英; 江孙芳; 纪元; 许佳悦
Original assignee: Guangzhou Burning Rock Dx Co ltd; Zhongshan Hospital Fudan University
Current assignee: Guangzhou Burning Rock Dx Co ltd; Zhongshan Hospital Fudan University
Priority date: 2023-12-28
Filing date: 2023-12-28
Publication date: 2024-05-03

Abstract

The invention provides a detection method, a detection device and a detection kit for distinguishing liver cancer samples from liver non-cancer disease samples, and in particular relates to a biomarker combination for distinguishing liver cancer samples from liver non-cancer disease samples. The biomarker combination comprises at least 10 of the differentially methylated regions (DMRs) shown in table 1 and/or table 5, wherein the reference genome used for the DMRs in table 1 and/or table 5 is the GRCh37/hg19 human reference genome. The invention allows samples to be tested to be classified at low cost and with high accuracy.

Description

Method, device and kit for distinguishing liver cancer from liver non-cancer disease samples

Technical Field

The invention relates to the field of biotechnology, in particular to a method, a device and a kit for distinguishing liver cancer samples from liver non-cancer disease samples.

Background

To date, intervention before distant metastasis offers the greatest opportunity to improve prognosis, so there is a strong need for sensitive, reliable and minimally invasive assays that detect cancer before symptoms appear. Among the many cancer types, liver cancer (hepatocellular carcinoma, HCC) is a serious threat to human health: it has a high incidence, an insidious onset, rapid progression, and high recurrence and mortality rates, and is known as the "king of cancers". Most liver cancer patients are already at an intermediate or advanced stage when they first present at hospital, and without treatment the natural course of the disease is only 3-6 months.

One very important cause of liver cancer is chronic hepatitis B virus (HBV) infection. Chronic HBV infection may lead to chronic hepatitis and cirrhosis, and may progress further to liver cancer. Currently, liver cancer is detected mainly by two means: serum marker testing and imaging. However, neither can accurately distinguish liver cancer from benign liver disease.

Existing serum marker tests for liver cancer include the serum alpha-fetoprotein (AFP) assay and assays for blood enzymes and other tumor markers. Of these, the serum AFP assay is relatively specific for liver cancer. A serum AFP level of not less than 400 μg/L measured by radioimmunoassay, with pregnancy and active liver disease excluded, can support a diagnosis of liver cancer; however, chronic hepatitis or liver cirrhosis can also produce high AFP levels. Meanwhile, about 30% of liver cancer patients are AFP-negative in clinical practice, so the specificity of the AFP test used alone is not high, and it can hardly distinguish liver cancer from other liver non-cancer diseases. Blood enzyme and other tumor marker tests rely on the observation that gamma-glutamyl transpeptidase and its isozymes, abnormal prothrombin, alkaline phosphatase and lactate dehydrogenase isozymes may be higher than normal in the serum of liver cancer patients, but these tests also lack specificity.

Imaging examinations typically include ultrasound, computed tomography (CT), magnetic resonance imaging (MRI), selective celiac or hepatic angiography, and liver needle aspiration cytology. However, in more complex cases imaging struggles to distinguish liver cancer from benign liver disease, and it can diagnose liver cancer only after a tumor has formed and reached a certain size, so it cannot serve the goal of early detection or early screening.

Currently, DNA methylation sequencing is increasingly recognized as a high-resolution, high-throughput technique that is useful in cancer screening, diagnosis and monitoring. Most regions of the human genome are not active during the development of cancer, and cancer-related variations tend to concentrate in certain specific regions, such as CpG islands, which provides a good opportunity for targeted sequencing. Although a large number of scientific articles report DNA methylation-based biomarkers and their clinical relevance in cancer and various non-cancerous diseases, only a few dozen biomarkers have been translated into commercial clinical test products, and products for liver cancer are scarcer still. There is therefore an urgent need for a kit and a corresponding detection method for distinguishing liver cancer from other liver non-cancer diseases.

Disclosure of Invention

The invention provides a method, a device and a kit for distinguishing liver cancer samples from liver non-cancer disease samples, which adopt DNA or RNA oligonucleotide sequences to capture methylation variation regions of malignant or benign liver diseases and judge the existence of tumor components (ctDNA) in samples to be detected, thereby providing a low-cost and high-precision method for distinguishing the samples to be detected into liver cancer samples or liver non-cancer disease samples.

In one aspect, the invention provides a biomarker combination for distinguishing liver cancer samples from liver non-cancer disease samples, wherein the biomarker combination comprises at least 10 of the differentially methylated regions (DMRs) shown in table 1 and/or table 5, wherein the reference genome used for the DMRs in table 1 and/or table 5 is the GRCh37/hg19 human reference genome.

In another aspect, the invention provides a kit comprising reagents for detecting a biomarker combination as described above.

In another aspect, the invention provides the use of a reagent for detecting the above biomarker combination in the preparation of a kit for distinguishing liver cancer samples from liver non-cancer disease samples.

In another aspect, the invention provides a method of classifying methylation data, comprising: acquiring first methylation data corresponding to the biomarker combination of claim 1 in a sample to be tested; correcting the first methylation data according to a confounding factor corresponding to the sample to be tested to obtain second methylation data; and classifying, based on a preset rule, according to the second methylation data and a classification threshold, and generating indication information indicating the classification to which the sample to be tested belongs, wherein the preset rule comprises comparing the value of the second methylation data with the classification threshold and generating the indication information according to the comparison result. Preferably, first indication information indicating the classification to which the sample to be tested belongs is generated in response to the value of the second methylation data being less than or equal to the classification threshold, and second indication information indicating the classification to which the sample to be tested belongs is generated in response to the value of the second methylation data being greater than the classification threshold.

In another aspect, the present invention provides a methylation data classification apparatus comprising: an acquisition unit configured to acquire first methylation data corresponding to the biomarker combination of claim 1 in a sample to be tested; a correction unit configured to correct the first methylation data according to a confounding factor corresponding to the sample to be tested to obtain second methylation data; and a classification unit configured to classify, based on a preset rule, according to the second methylation data and a classification threshold, and to generate indication information indicating the classification to which the sample to be tested belongs, wherein the preset rule comprises comparing the value of the second methylation data with the classification threshold and generating the indication information according to the comparison result. Preferably, first indication information indicating the classification to which the sample to be tested belongs is generated in response to the value of the second methylation data being less than or equal to the classification threshold, and second indication information indicating the classification to which the sample to be tested belongs is generated in response to the value of the second methylation data being greater than the classification threshold.

In another aspect, the present invention provides an electronic device, including: one or more processors; and a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement the method.

In another aspect, the present invention provides a computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by one or more processors, implements the method described above.

The biomarker combination, the kit, the method, the application, the device, the electronic equipment and the storage medium can be suitable for risk assessment of cancers, and have the advantages of low cost and high accuracy.

Specifically, the liver non-cancerous disease includes one or more of the following: cirrhosis, chronic viral hepatitis B, liver metastasis, high cholesterol, alcoholic liver disease, cyst, fatty liver disease (NAFLD), fibrosis, jaundice, primary sclerosing cholangitis (PSC), hemochromatosis, primary biliary cirrhosis, or alpha-1 antitrypsin deficiency.

The biomarker combination, the kit, the method, the application, the device, the electronic equipment and the storage medium provided by the invention adopt DNA or RNA oligonucleotide sequences to capture methylation variation regions of malignant or benign liver diseases, judge the existence of tumor components (ctDNA) in a sample to be tested, can be suitable for risk assessment of cancers, are used for distinguishing liver cancer from other liver non-cancer diseases, classify methylation data and generate corresponding indication information, fill the blank of related technologies in the field, and have the advantages of high accessibility in clinical application, convenient implementation, low cost and high accuracy.

Drawings

Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings. The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention. In the drawings:

Fig. 1 shows an exemplary case where CpG sites cannot be classified into the same DMR.

Fig. 2 shows an exemplary case where CpG sites are partitioned into the same DMR.

Fig. 3 illustrates an exemplary case for explaining the principle of judging whether the DMR is valid or not in the present invention.

Fig. 4 shows the control results of the weight configuration of the confounding variables in the DOC model of the present application.

Fig. 5 shows that the DOC model established by the present invention remains balanced across the age groups.

Detailed Description

I. Definitions

In the present invention, unless otherwise indicated, scientific and technical terms used herein have the meanings commonly understood by one of ordinary skill in the art. Also, protein and nucleic acid chemistry, molecular biology, cell and tissue culture, microbiology, immunology-related terms and laboratory procedures as used herein are terms and conventional procedures that are widely used in the corresponding arts. Meanwhile, in order to better understand the present invention, definitions and explanations of related terms are provided below.

As used herein, the term "differentially methylated region" (DMR) generally refers to a region of DNA that contains one or more differentially methylated sites. For example, a DMR that includes a greater number or frequency of methylated sites under a selected condition of interest, such as a cancer state, may be referred to as a hypermethylated DMR. Conversely, a DMR that includes a smaller number or frequency of methylated sites under a selected condition of interest, such as a cancer state, may be referred to as a hypomethylated DMR.
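The hypermethylated/hypomethylated distinction can be illustrated with a minimal sketch. It assumes, purely for illustration, that each group is summarized as a list of per-site methylation calls (True for methylated) within one DMR; the function names `methylation_frequency` and `dmr_direction` are hypothetical and are not part of the disclosure.

```python
from typing import Sequence


def methylation_frequency(calls: Sequence[bool]) -> float:
    """Fraction of methylated calls among all calls observed in one DMR."""
    if not calls:
        raise ValueError("at least one methylation call is required")
    return sum(calls) / len(calls)


def dmr_direction(condition_calls: Sequence[bool],
                  reference_calls: Sequence[bool]) -> str:
    """Label a DMR by comparing the condition of interest with a reference.

    Illustrative only: a real analysis would also test whether the
    difference is statistically meaningful.
    """
    condition = methylation_frequency(condition_calls)
    reference = methylation_frequency(reference_calls)
    if condition > reference:
        return "hypermethylated"
    if condition < reference:
        return "hypomethylated"
    return "no difference"
```

For instance, a region with three of four methylated calls in the cancer state and one of four in the reference state would be labeled hypermethylated under this sketch.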

As used herein, the term "methylation" generally refers to the methylation state of a gene fragment, nucleotide, or base thereof of the present application. For example, a DNA fragment in which a gene of the application is located may have methylation on one or more strands. For example, a DNA fragment in which a gene of the application resides may have methylation at one site or DMR or at multiple sites or DMR.

As used herein, the term "next generation sequencing" (NGS) refers to any sequencing method that determines the nucleotide sequence of individual nucleic acid molecules (e.g., in single-molecule sequencing) or of clonally amplified surrogates of individual nucleic acid molecules in a high-throughput manner (e.g., sequencing 10³, 10⁴, 10⁵ or more molecules simultaneously). Next generation sequencing platforms include, but are not limited to, existing platforms such as Illumina. With the continued development of sequencing technology, one skilled in the art will appreciate that other sequencing methods and devices may also be employed for the present method. Next generation sequencing methods include, for example, sequencing by synthesis (Illumina), massively parallel signature sequencing (MPSS), polony sequencing, pyrosequencing (454), ion semiconductor sequencing (Ion Torrent), DNA nanoball sequencing, the DNA nanoarray and combinatorial probe-anchor ligation sequencing of Complete Genomics, single-molecule real-time sequencing (Pacific Biosciences), and sequencing by ligation (SOLiD sequencing). Because next generation sequencing enables detailed analysis of the transcriptome and genome of a species, it is also referred to as deep sequencing. The methods of the invention are equally applicable to first-generation, second-generation and third-generation gene sequencing, or single-molecule sequencing (SMS).

As used herein, the term "human reference genome" generally refers to a human genome that can serve as a reference in gene sequencing. Information on the human reference genome is available from UCSC. The human reference genome exists in different versions, for example hg19, hg38, GRCh37, GRCh38, GCA_000001405, GCF_000001405, or Ensembl 75.

As used herein, the terms "polynucleotide," "nucleotide," "nucleic acid," and "oligonucleotide" are used interchangeably. They denote polymeric forms of nucleotides (deoxyribonucleotides or ribonucleotides) of any length, or analogs thereof. Polynucleotides may have any three-dimensional structure and may perform any function, whether known or unknown. The following are non-limiting examples of polynucleotides: coding or non-coding regions of a gene or gene fragment, loci defined by linkage analysis, exons, introns, messenger RNAs (mRNA), transfer RNAs (tRNA), ribosomal RNAs (rRNA), short interfering RNAs (siRNA), short-hairpin RNAs (shRNA), microRNAs (miRNA), ribozymes, cDNAs, recombinant polynucleotides, branched polynucleotides, plasmids, vectors, isolated DNA of any sequence, isolated RNA of any sequence, nucleic acid probes, primers and adaptors. Polynucleotides may include one or more modified nucleotides, such as methylated nucleotides and nucleotide analogs.

As used herein, the term "sample to be tested" generally refers to a sample that is to be tested. For example, the presence or absence of a modification in one or more gene regions on a test sample can be detected. In embodiments of the present invention, the sample to be tested includes, but is not limited to, a tissue sample, a blood sample, saliva, sputum, pleural effusion, pulmonary lavage, peritoneal effusion, peritoneal lavage, and cerebrospinal fluid.

As used herein, the term "Youden index", also known as the correctness index, is a measure for evaluating the validity of a screening test; it is applicable when the harms of false negatives (missed diagnoses) and false positives (misdiagnoses) are considered equally important. The Youden index is the sum of sensitivity and specificity minus 1, and indicates the overall ability of the screening method to identify true patients and true non-patients. The larger the index, the better the performance of the screening test and the greater its validity. The term "optimal Youden index" refers to the case in which the sum of sensitivity and specificity minus 1 is at its maximum.
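The index defined above (sensitivity plus specificity minus 1, commonly called the Youden index or Youden's J) can be computed directly from confusion counts, and its optimum can be found by scanning candidate thresholds. The following is a minimal sketch under stated assumptions: a sample is called positive when its score is less than or equal to the threshold, both classes are present in the data, and the function names `youden_index` and `optimal_threshold` are hypothetical.

```python
from typing import Sequence, Tuple


def youden_index(tp: int, fn: int, tn: int, fp: int) -> float:
    """Return sensitivity + specificity - 1 from confusion counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity + specificity - 1.0


def optimal_threshold(scores: Sequence[float],
                      is_positive: Sequence[bool],
                      candidates: Sequence[float]) -> Tuple[float, float]:
    """Return (threshold, index) for the candidate with the largest index.

    Assumption: score <= threshold is a positive call. Ties keep the
    first candidate encountered.
    """
    best_threshold, best_index = candidates[0], float("-inf")
    for threshold in candidates:
        pairs = list(zip(scores, is_positive))
        tp = sum(1 for s, y in pairs if y and s <= threshold)
        fn = sum(1 for s, y in pairs if y and s > threshold)
        tn = sum(1 for s, y in pairs if not y and s > threshold)
        fp = sum(1 for s, y in pairs if not y and s <= threshold)
        index = youden_index(tp, fn, tn, fp)
        if index > best_index:
            best_threshold, best_index = threshold, index
    return best_threshold, best_index
```

Because the index weights sensitivity and specificity equally, the threshold it selects is appropriate only when missed diagnoses and misdiagnoses are regarded as equally harmful, as the definition states.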

As used herein, the term "non-cancerous disease" (noncancer disease) refers to a disease of the body other than cancer, including benign proliferative conditions of tumors.

Detailed description of the preferred embodiments

In one aspect, the present invention provides a biomarker combination for distinguishing liver cancer samples from liver non-cancer disease samples, wherein the biomarker combination comprises at least 10 of the differentially methylated regions (DMRs) shown in table 1 and/or table 5, wherein the reference genome used for the DMRs in table 1 and/or table 5 is the GRCh37/hg19 human reference genome.

In some preferred embodiments, the biomarker combinations comprise any at least 10 DMR shown in table 1, and/or any at least 10 DMR shown in table 5.

In some preferred embodiments, the biomarker combinations comprise all 195 DMRs shown in table 1, and/or all 230 DMRs shown in table 5.

In some preferred embodiments, the biomarker combinations described above comprise any of the at least 10 DMR shown in table 1.

In some alternative embodiments, the biomarker combinations comprise 10 DMR selected from any of the group of repeat one, repeat two, repeat three, repeat four, or repeat five shown in table 3.

In some preferred embodiments, the biomarker combinations described above comprise any of the at least 10 DMR shown in table 5.

In some alternative embodiments, the biomarker combinations comprise 10 DMR selected from any of the group of repeat one, repeat two, repeat three, repeat four, or repeat five shown in table 7.

In some embodiments, the liver non-cancerous disease comprises: cirrhosis, chronic viral hepatitis B, liver metastasis, high cholesterol, alcoholic liver disease, cyst, fatty liver disease (NAFLD), fibrosis, jaundice, primary sclerosing cholangitis (PSC), hemochromatosis, primary biliary cirrhosis, or alpha-1 antitrypsin deficiency.

In some embodiments, the present invention provides a biomarker combination for distinguishing a liver cancer sample from a chronic viral hepatitis b sample, the biomarker combination comprising any of at least 10 DMR as shown in table 1.

In some preferred embodiments, the biomarker combinations provided by the present invention are used to distinguish liver cancer samples from chronic viral hepatitis b samples, the biomarker combinations comprising all 195 DMRs shown in table 1.

In some embodiments, the present invention provides a biomarker combination for distinguishing a liver cancer sample from a liver cirrhosis sample, the biomarker combination comprising any of at least 10 DMR as shown in table 5.

In some preferred embodiments, the biomarker combinations provided by the present invention are used to distinguish liver cancer samples from liver cirrhosis samples, the biomarker combinations comprising all 230 DMR as shown in table 5.

In another aspect, the invention provides a kit, wherein the kit comprises reagents for detecting the biomarker combination.

In some embodiments, the above-described kits comprise next-generation sequencing reagents.

In some preferred embodiments, the next generation sequencing reagents described above include hybridization capture probes or primers covering any at least 10 DMR in table 1 and/or table 5.

In some preferred embodiments, the next generation sequencing reagents described above include hybridization capture probes or primers covering at least 10 of the DMRs shown in table 1, and/or at least 10 of the DMRs shown in table 5.

In some preferred embodiments, the next generation sequencing reagents described above include hybridization capture probes or primers that cover all 195 DMRs shown in table 1, and/or hybridization capture probes or primers for all 230 DMRs shown in table 5.

In some preferred embodiments, the next generation sequencing reagents described above include hybridization capture probes or primers covering any of at least 10 DMR shown in table 1.

In some alternative embodiments, the next generation sequencing reagents described above comprise hybridization capture probes or primers covering 10 DMR selected from any one of the sets of repeat one, repeat two, repeat three, repeat four, or repeat five shown in table 3.

In some preferred embodiments, the next generation sequencing reagents described above include hybridization capture probes or primers covering any of at least 10 DMR shown in table 5.

In some alternative embodiments, the next generation sequencing reagents described above comprise hybridization capture probes or primers covering 10 DMR selected from any one of the sets of repeat one, repeat two, repeat three, repeat four, or repeat five shown in table 7.

In some embodiments, the above-described kit is used to distinguish liver cancer samples from liver non-cancer disease samples.

In some preferred embodiments, the liver non-cancerous disease described above comprises: cirrhosis, chronic viral hepatitis B, liver metastasis, high cholesterol, alcoholic liver disease, cyst, fatty liver disease (NAFLD), fibrosis, jaundice, primary sclerosing cholangitis (PSC), hemochromatosis, primary biliary cirrhosis, or alpha-1 antitrypsin deficiency.

In some embodiments, the kit provided by the invention is used for distinguishing liver cancer samples from chronic viral hepatitis B samples, and the next generation sequencing reagents included in the kit comprise hybridization capture probes or primers covering at least 10 of the DMRs shown in table 1.

In some preferred embodiments, the kit provided by the invention is used for distinguishing liver cancer samples from chronic viral hepatitis B samples, and the next generation sequencing reagents included in the kit comprise hybridization capture probes or primers covering all 195 DMRs shown in table 1.

In some embodiments, the kit provided by the invention is used for distinguishing liver cancer samples from liver cirrhosis samples, and the next generation sequencing reagents included in the kit comprise hybridization capture probes or primers covering at least 10 of the DMRs shown in table 5.

In some preferred embodiments, the kit provided by the invention is used for distinguishing liver cancer samples from liver cirrhosis samples, and the next generation sequencing reagents included in the kit comprise hybridization capture probes or primers covering all 230 DMRs shown in table 5.

In another aspect, the present disclosure provides a method for detecting a biomarker combination as described above, comprising administering to a subject in need thereof a kit for distinguishing a liver cancer sample from a liver non-cancer disease sample.

In another aspect, the present disclosure provides a method of classifying methylation data, comprising: acquiring first methylation data corresponding to the biomarker combination according to claim 1 in a sample to be tested; correcting the first methylation data according to a confounding factor corresponding to the sample to be tested to obtain second methylation data; and classifying, based on a preset rule, according to the second methylation data and a classification threshold, and generating indication information indicating the classification to which the sample to be tested belongs, wherein the preset rule comprises comparing the value of the second methylation data with the classification threshold and generating the indication information according to the comparison result. Preferably, first indication information indicating the classification to which the sample to be tested belongs is generated in response to the value of the second methylation data being less than or equal to the classification threshold, and second indication information indicating the classification to which the sample to be tested belongs is generated in response to the value of the second methylation data being greater than the classification threshold.
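The acquire, correct and compare steps of this method can be sketched as follows. This is an illustration under explicit assumptions, not the disclosed model: the correction is assumed here to be a simple subtraction of a weighted sum of confounding factors, the "value" of the second methylation data is assumed to be its mean, and all names (`Indication`, `correct`, `classify`, the confounder weights) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Mapping, Sequence


@dataclass(frozen=True)
class Indication:
    label: str    # "first" or "second" indication information
    value: float  # value of the second (corrected) methylation data


def correct(first: Sequence[float],
            confounders: Mapping[str, float],
            weights: Mapping[str, float]) -> List[float]:
    """Return second methylation data from first methylation data.

    Assumed correction: subtract one offset, the weighted sum of the
    sample's confounding factors, from every per-DMR value.
    """
    offset = sum(weights[name] * level for name, level in confounders.items())
    return [x - offset for x in first]


def classify(second: Sequence[float], threshold: float) -> Indication:
    """Apply the preset rule: compare the value with the threshold."""
    value = sum(second) / len(second)  # assumed summary: the mean
    # value <= threshold yields the first indication, otherwise the second
    label = "first" if value <= threshold else "second"
    return Indication(label=label, value=value)
```

A caller would map the "first" and "second" labels to the classes relevant to its scenario, for example liver cancer versus a liver non-cancer disease; that mapping is kept outside the comparison rule on purpose.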

In some preferred embodiments, the classification of the sample to be tested includes: liver cancer, liver cirrhosis, chronic viral hepatitis B, liver metastasis, high cholesterol, alcoholic liver disease, cyst, fatty liver disease (NAFLD), fibrosis, jaundice, primary sclerosing cholangitis (PSC), hemochromatosis, primary biliary cirrhosis, or alpha-1 antitrypsin deficiency.

In some embodiments, the first indication information used for representing the classification to which the sample to be measured belongs in the method provided by the present disclosure may be a prompt information indicating that the sample to be measured belongs to a liver cancer sample, and the second indication information may be a prompt information indicating that the sample to be measured belongs to one or more liver non-cancer disease samples.

In some embodiments, for example, the method provided by the present disclosure may classify whether the sample to be tested belongs to a liver cancer sample or a chronic viral hepatitis b sample, and in this application scenario, the first indication information used for characterizing the classification of the sample to be tested may be a prompt information indicating that the sample to be tested belongs to a liver cancer sample, and the second indication information may be a prompt information indicating that the sample to be tested belongs to a chronic viral hepatitis b sample.

In some embodiments, for example, the method provided by the present disclosure may classify whether the sample to be measured belongs to a liver cancer sample or a liver cirrhosis sample, and in this application scenario, the first indication information used for characterizing the classification to which the sample to be measured belongs may be a prompt information indicating that the sample to be measured belongs to a liver cancer sample, and the second indication information may be a prompt information indicating that the sample to be measured belongs to a liver cirrhosis sample.

In some preferred embodiments, the sample to be tested is selected from any one or more of the following: tissue samples, blood samples, saliva, sputum, pleural effusion, lung lavage, peritoneal effusion, peritoneal lavage, enema, and cerebrospinal fluid.

In another aspect, the present disclosure provides a methylation data classification apparatus comprising: an acquisition unit configured to acquire first methylation data corresponding to the biomarker combination according to claim 1 in a sample to be tested; a correction unit configured to correct the first methylation data according to a confounding factor corresponding to the sample to be tested to obtain second methylation data; and a classification unit configured to classify, based on a preset rule, according to the second methylation data and a classification threshold, and to generate indication information indicating the classification to which the sample to be tested belongs, wherein the preset rule comprises comparing the value of the second methylation data with the classification threshold and generating the indication information according to the comparison result. Preferably, first indication information indicating the classification to which the sample to be tested belongs is generated in response to the value of the second methylation data being less than or equal to the classification threshold, and second indication information indicating the classification to which the sample to be tested belongs is generated in response to the value of the second methylation data being greater than the classification threshold.
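The three-unit structure of the apparatus can be pictured as a small pipeline in which each unit is a replaceable component. The sketch below is only one possible arrangement; the class name `ClassificationApparatus`, the use of a sample identifier as input, and the callable signatures are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Methylation = Sequence[float]


@dataclass
class ClassificationApparatus:
    # Acquisition unit: sample identifier -> first methylation data
    acquire: Callable[[str], Methylation]
    # Correction unit: (sample identifier, first data) -> second data
    correct: Callable[[str, Methylation], Methylation]
    # Classification unit: second data -> indication information
    classify: Callable[[Methylation], str]

    def run(self, sample_id: str) -> str:
        """Pass one sample through the three units in order."""
        first = self.acquire(sample_id)
        second = self.correct(sample_id, first)
        return self.classify(second)
```

Keeping the units as injected callables lets the correction model or the threshold rule be swapped without touching the acquisition step, which matches the way the units are described as separately configured.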

In another aspect, the present disclosure provides an electronic device, comprising: one or more processors; and a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement the method.

The implementation environment of the invention comprises electronic equipment, and the method for distinguishing the liver cancer sample from the liver non-cancer disease sample in the embodiment of the invention can be executed by the terminal equipment. By way of example, the electronic device may comprise at least one of a terminal device or a server.

The terminal device may be hardware or software. When the terminal device is hardware, it may be a variety of electronic devices having a display screen and supporting information input (e.g., text input and/or voice input, etc.), including but not limited to smart phones, tablet computers, laptop and desktop computers, and the like. When the terminal device is software, it can be installed in the above-listed terminal device. It may be implemented as a plurality of software or software modules (e.g., to provide a service for distinguishing liver cancer samples from liver non-cancerous disease samples), or as a single software or software module. The present invention is not particularly limited herein.

The computer readable medium of the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.

The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the method of assessing the correlation of a sample under test with risk of cancer formation shown in the above-described embodiments and alternative embodiments thereof.

Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The above description is only illustrative of the preferred embodiments of the present invention and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to in the present invention is not limited to the specific combinations of technical features described above, but also covers other technical solutions formed by any combination of the technical features described above or their equivalents without departing from the spirit of the disclosure. For example, it covers technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present invention.

Examples

Example 1: division of DMR regions

1. Hypothesis testing

Samples to be tested (for example, blood samples) are obtained and divided into a liver cancer group (C group) and a normal group (N group). Bisulfite methylation sequencing of the samples may comprise the following steps:

S1: cell-free DNA (cfDNA) extraction: for example, the QIAamp Circulating Nucleic Acid Kit (Qiagen, 55114) and its corresponding platform can be used;

S2: bisulfite conversion: for example, the bisulfite conversion (BC) step is performed using a modified protocol based on the EZ-96 DNA Methylation-Lightning MagPrep kit (Zymo, D5047);

S3: pre-library preparation: this comprises a first tailing and ligation step, in which a splint adapter carrying several randomly synthesized G or A bases may be used; the overhang of the adapter anneals to the poly-C/T tail at the 3' end of the single-stranded DNA substrate, and ligation is completed once the overhang has hybridized with this first tail. The DNA substrate carrying the adapter at one end is then converted into single strands and subjected to 5-15 rounds of linear amplification. A second tailing and ligation step, similar to the first, ligates the second adapter to the A tail at the other end of the DNA substrate, and several rounds of PCR amplification complete the preparation of the pre-library (see, for example, Chinese patent publication CN110892097A);

S4: pre-library hybridization: the pre-library is hybridized with hybridization capture probes covering the target DMRs;

S5: capture and elution: the probes are bound to magnetic beads and the non-specific fragments are eluted; the magnetic beads are then removed and the final library is formed by PCR amplification;

S6: sequencing: the final library is sequenced on an NGS sequencer to generate sequencing data containing the target DMRs.

In this embodiment, the step of noise reduction treatment for genomic methylation signal CpG and noise region CHH/CHG sites may be optionally included, for example, see Chinese patent publication CN114974417A.

For each CpG site, a hypothesis test is performed to determine whether the difference between the C group and the N group is statistically significant, and a P value is calculated for each CpG site. The calculation uses weighted logistic regression (weighted LR), in which the weight is determined by the coverage depth of each CpG site, the methylation level of each CpG site is the explanatory variable, and the binary outcome (0, 1) corresponds to the C and N groups.
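The per-site test described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the invention: the function name `cpg_wald_pvalue`, the coding of the liver cancer group as 1, the normalisation of the depth weights to a mean of 1, the small ridge term and the use of a Wald test for the P value are all assumptions made for the example.

```python
import math
import numpy as np

def cpg_wald_pvalue(methylation, labels, depth, n_iter=50, ridge=1e-6):
    """Two-sided Wald P value for one CpG site in a depth-weighted logistic regression.

    methylation: methylation level of the site in each sample (explanatory variable)
    labels:      1 for the liver cancer group, 0 for the normal group (assumed coding)
    depth:       coverage depth of the site in each sample (used as the weight)
    """
    x = np.asarray(methylation, dtype=float)
    y = np.asarray(labels, dtype=float)
    w = np.asarray(depth, dtype=float)
    w = w / w.mean()                      # assumption: weights rescaled to average 1
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)

    def fitted_and_hessian(b):
        # Fitted probabilities and the weighted information matrix at coefficients b
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ b, -30.0, 30.0)))
        info = X.T @ (X * (w * p * (1.0 - p))[:, None]) + ridge * np.eye(2)
        return p, info

    # Newton-Raphson (IRLS) iterations on the weighted log-likelihood
    for _ in range(n_iter):
        p, info = fitted_and_hessian(beta)
        step = np.linalg.solve(info, X.T @ (w * (y - p)) - ridge * beta)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break

    _, info = fitted_and_hessian(beta)
    se = math.sqrt(np.linalg.inv(info)[1, 1])   # standard error of the slope
    z = beta[1] / se
    return math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal tail probability
```

Sites with small P values would then be carried forward as candidate differentially methylated sites. Completely separated groups make the Wald test unreliable, so a production implementation would need to handle that case separately.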

2. Partitioning of DMRs

The similarity of methylation levels at spatially contiguous genomic sites is evaluated according to the following formula, taking the methylation level and the sequencing coverage depth of each methylated CpG site as parameters. The greater the coverage depth, the larger the value of the parameter P in the formula and the higher the similarity of methylation levels between adjacent CpG sites within the same group (liver cancer group or normal group); the DMRs are then divided accordingly:

The subscript ij of each parameter represents the j-th site of the i-th sample, the parameter d is used for representing the effective coverage depth of the CpG sites in the liver cancer group, and the parameter M is used for representing the methylation level of the CpG sites in the liver cancer group.

After the calculation, the β value is used as the decision index, with β = 0.25 as the preset threshold. When β is less than the preset threshold, the j-th and (j+1)-th sites can be included in the calculation of the region statistic B (the B value indicates whether a DMR obtained by the division is a valid DMR) and may be divided into one DMR; when β is greater than or equal to the preset threshold, the j-th and (j+1)-th sites cannot be included in the calculation of the region statistic B and are not divided into one DMR.

This embodiment gives an exemplary case (shown in fig. 1) in which adjacent sites cannot be divided into the same DMR, to explain the DMR division principle of the present invention.

Wherein the colored dots characterize a methylated CpG site, sample A, sample B, and sample C are from the same sample group (e.g., tumor group or normal group as described above), wherein sample A and sample B each obtain coverage of 500 effective sequences, and sample C obtains coverage of 200 effective sequences. The dots of each column correspond to the same CpG site, with the methylation level of the first CpG site in the region being 0.2 and the methylation level of the second CpG site being 0 in sample A.

For samples A, B and C above, the coverage depth parameter P of the first CpG site in the region is calculated to be 0.617. Substituting these parameters into the above formula gives β₁₁ = 0.29. Because β₁₁ is greater than the preset threshold of 0.25, the two adjacent CpG sites are not assigned to the same DMR.

Another exemplary case of dividing into the same DMR is given in this embodiment (as shown in fig. 2) to explain the principle of dividing the DMR in the present invention.

Here the colored dots represent methylated CpG sites. Sample A, sample B and sample D are from the same sample group (e.g., the tumor group or the normal group); sample A and sample B each have a coverage of 500 effective sequences, and sample D has a coverage of 400 effective sequences (the coverage depth of sample D is higher than that of sample C in the previous example, so the P value in this example increases accordingly). As before, in sample A the methylation level of the first CpG site in the region is 0.2 and that of the second CpG site is 0.

For samples A, B and D above, the coverage depth parameter P of the first CpG site in the region is calculated to be 0.962. Substituting these parameters into the above formula gives β₁₁ = 0.21. Because β₁₁ is less than the preset threshold of 0.25, the two adjacent CpG sites are assigned to the same DMR.

For the above method, see Chinese patent publication CN115132273A.

Therefore, the coverage depth of CpG sites is introduced in the DMR division process by the method, so that the accuracy of DMR region division can be remarkably improved.
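The threshold rule of this section can be sketched as follows, as an illustration only. The β values are taken as precomputed inputs because the formula itself is not reproduced here; the function name `partition_sites` and the convention that `betas[j]` compares site j with site j+1 are assumptions made for the example.

```python
def partition_sites(betas, threshold=0.25):
    """Group consecutive CpG sites into candidate DMRs.

    betas[j] is the precomputed beta value between site j and site j+1
    (0-based). Adjacent sites stay in the same candidate region only when
    beta is strictly below the threshold.
    """
    regions = [[0]]                       # the first site opens the first region
    for j, beta in enumerate(betas):
        if beta < threshold:
            regions[-1].append(j + 1)     # similar enough: extend the current region
        else:
            regions.append([j + 1])       # too different: start a new region
    return regions

# The two worked cases above: beta = 0.29 splits the sites, beta = 0.21 joins them.
print(partition_sites([0.29]))
print(partition_sites([0.21]))
```

Each candidate region returned here would still have to pass the region statistic B check before being treated as a valid DMR.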

3. Calculation of region statistics B value

In some optional embodiments, based on the above calculated β value, a region statistic B value of CpG sites in the region is further calculated according to the following formula to represent whether the DMR obtained by the division is a valid DMR.

The calculation formula of the value B is as follows:

where the parameter k is the number of CpG sites in the region, and the subscript ij of each parameter denotes the j-th site of the i-th sample. With β = 0.25 as the preset threshold, when β is less than the threshold the j-th and (j+1)-th sites can be included in the calculation of the region statistic B and may be divided into one DMR; when β is greater than or equal to the threshold, the j-th and (j+1)-th sites cannot be included in the calculation of the region statistic B and are not divided into one DMR. With B = 1 as the preset threshold, when the B value is less than the threshold the DMR corresponding to the j-th and (j+1)-th sites can be used as a valid DMR; when the B value is greater than or equal to the threshold, that DMR is not used as a valid DMR.

An exemplary case (as shown in fig. 3) is given in this embodiment to explain the principle of judging whether the DMR is effective in the present invention.

When the DMRs divided for groups A, B and C each contain 10 CpG sites, the B_ij values of all samples are pooled when calculating the B value of each DMR, and their average is taken as the score of that DMR.

Wherein the calculation steps of the B value in the DMR shown in the group A are shown in the following table:

The B value score of the DMR corresponding to group A is less than the preset threshold of 1; therefore, the DMR can be used as a valid DMR.

Similarly, the B value score of the DMR shown in group B is less than the preset threshold, so that DMR can also be used as a valid DMR; the B value score of the DMR shown in group C is not less than the preset threshold, and therefore the DMR corresponding to group C cannot be used as a valid DMR.
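The scoring step described in this section (pool the B_ij values of all samples, average them, and compare the average with the preset threshold of 1) can be sketched as follows. The per-site B_ij values are treated as precomputed inputs, and the function names and the numbers in the usage lines are hypothetical.

```python
import numpy as np

def dmr_b_score(b_ij):
    """Average of the pooled B_ij values (rows: samples, columns: CpG sites)."""
    return float(np.mean(np.asarray(b_ij, dtype=float)))

def is_valid_dmr(b_ij, threshold=1.0):
    """A DMR is kept only when its B score is strictly below the threshold."""
    return dmr_b_score(b_ij) < threshold

# Hypothetical values for a DMR with 3 samples and 10 CpG sites each
example = np.full((3, 10), 0.4)
print(dmr_b_score(example), is_valid_dmr(example))
```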

Example 2: cancer detection (Detection of Cancer, DOC) model building

To improve the accuracy and generalization ability of the DOC model, the invention quantifies the bias introduced by confounding variables that may affect the accuracy of the classification model. In the application scenario of the present invention, the ctDNA content in a patient's blood differs greatly between stages of liver cancer and is easily affected by experimental batch effects, and methylation is related to the age and race of the sample donor and to whether the donor suffers from other diseases; all of these may constitute confounding variables in this embodiment.

The parameters involved in the formulas shown in this embodiment are defined in accordance with the definitions known in the art, except for the parameters specifically defined and explained.

To quantify the bias caused by confounding variables, the invention adopts a Salmon model construction method; an exemplary quantification in this embodiment uses the Hilbert-Schmidt Independence Criterion (HSIC). Once the bias has been quantified, a regularization term is embedded in the model for correction.

Quantification using the Hilbert-Schmidt independence criterion is given by the following formula:

$$\|C_{h(x)h(z)}\|^2 = \left(E_{h(x)h(z)} - E_{h(x)}E_{h(z)}\right)^2 = \left(E_{h(x)h(z)}\right)^2 + \left(E_{h(x)}E_{h(z)}\right)^2 - 2\,E_{h(x)h(z)}\,E_{h(x)}E_{h(z)}$$

where L_H (the Hilbert-Schmidt independence criterion, HSIC) calculated by the formula represents the degree of independence of the variables X and Z. In the invention, the following are set: feature vectors X = (x_1, …, x_m), where x_i is an n-dimensional vector representing the methylation features of sample i; classification labels Y = (y_1, …, y_m), where y_i ∈ {-1, +1} is the classification label of x_i (positive when y_i is +1 and negative when y_i is -1); and confounding variables Z = (z_1, …, z_m), where z_i is the confounding variable of sample i and m is the number of samples.
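As an illustration of the quantity above, the following sketch computes an empirical HSIC with a linear kernel, i.e. taking h as the identity map, which is an assumption made for the example. With that choice and a 1/m² normalisation (also an assumption), the value for scalar x and z reduces to the squared difference between the mean of the product and the product of the means, matching the form of the formula. The function name `hsic_linear` is hypothetical.

```python
import numpy as np

def hsic_linear(x, z):
    """Biased empirical HSIC with linear kernels: trace(K H L H) / m^2."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)   # m samples by features
    z = np.asarray(z, dtype=float).reshape(len(z), -1)   # m samples by confounders
    m = x.shape[0]
    K = x @ x.T                                # linear kernel on the features
    L = z @ z.T                                # linear kernel on the confounder
    H = np.eye(m) - np.ones((m, m)) / m        # centring matrix
    return float(np.trace(K @ H @ L @ H)) / m ** 2

# Scalar case: agrees with (E[xz] - E[x]E[z])^2 computed from sample means
x = np.array([0.1, 0.4, 0.35, 0.8, 0.6])
z = np.array([30.0, 45.0, 50.0, 70.0, 62.0])
print(hsic_linear(x, z), (np.mean(x * z) - np.mean(x) * np.mean(z)) ** 2)
```

A value near zero indicates that, under the chosen kernel, the explanatory variable carries little linear dependence on the confounding variable.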

A support vector machine (SVM) is used as the main classifier for binary classification. To control the confounding variables, a regularization term is added to the objective equation solved by the SVM. The objective equation is:

$$\text{s.t.}\quad y_i\left(w^{T}x_i + b\right) \ge 1 - \xi_i$$

$$\xi_i \ge 0$$

where ξ_i is the degree to which sample x_i violates the margin constraint, and C and λ are coefficients that control the balance among minimizing the training error, minimizing the correlation between the confounding variables and the explanatory variables, and maximizing the classification margin.
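The idea of adding a confounder penalty to a soft-margin SVM can be sketched as follows. Because the objective equation is not reproduced here, the form used in the sketch is an assumption: a linear SVM with averaged hinge loss weighted by C, plus λ times the squared covariance between the decision score and the confounding variable (the linear-kernel HSIC), minimised by plain subgradient descent. The function names, the learning rate and the iteration count are hypothetical.

```python
import numpy as np

def confound_covariance(X, z):
    """Covariance of each feature with the confounder; cov(w.x + b, z) equals w @ c."""
    X = np.asarray(X, dtype=float)
    z = np.asarray(z, dtype=float)
    return (X - X.mean(axis=0)).T @ (z - z.mean()) / X.shape[0]

def fit_confound_penalised_svm(X, y, z, C=1.0, lam=0.0, lr=0.01, n_iter=2000):
    """Subgradient descent on the assumed objective
    0.5*||w||^2 + (C/m) * sum_i max(0, 1 - y_i (w.x_i + b)) + lam * (w @ c)^2.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)        # labels in {-1, +1}
    m, n = X.shape
    c = confound_covariance(X, z)
    w, b = np.zeros(n), 0.0
    for _ in range(n_iter):
        active = y * (X @ w + b) < 1.0    # samples violating the margin
        grad_w = w - C * (X[active].T @ y[active]) / m + 2.0 * lam * (w @ c) * c
        grad_b = -C * y[active].sum() / m
        w = w - lr * grad_w
        b = b - lr * grad_b
    return w, b
```

Setting `lam=0.0` recovers an ordinary linear soft-margin SVM, so fitting the same data with and without the penalty shows how much of the decision score is driven by the confounder.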

In this embodiment, fig. 4 shows the control result of the weight configuration of the DOC model of the present application for the confounding variables.

Each data point represents a blood sample used for DOC model construction; the horizontal axis shows the confounding variable of the sample, and the vertical axis shows the original, uncorrected explanatory variable (left panel) or the corrected explanatory variable (right panel). Comparing the panels before and after correction shows that the weight of the confounding variable is controlled in the DOC model established by the invention.

In this example, fig. 5 shows that the DOC model established in the present invention overcomes a weakness of earlier methylation-based approaches, namely that false positives in healthy populations increase with age, and remains balanced across age groups (the horizontal axis represents age, and the vertical axis represents the model's liver cancer probability score).

Example 3: detection of chronic viral hepatitis B based on DMR by DOC model

Based on the differentiation of liver cancer patients and chronic viral hepatitis B patients in different DMRs, 30 chronic viral hepatitis B samples and 82 liver cancer samples are randomly split into a training set (comprising 21 chronic viral hepatitis B samples and 57 liver cancer samples) and a verification set (comprising 9 chronic viral hepatitis B samples and 25 liver cancer samples) according to a ratio of 7:3. 195 DMRs (shown in table 1) with obvious methylation level differences are screened out by using a training set sample and used for constructing a DOC model and determining a threshold value, and the distinguishing performance of the model and the threshold value is further confirmed by using a verification set sample.
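The 7:3 split above can be illustrated with a per-class (stratified) random split, which reproduces the stated counts of 21/9 and 57/25. This is a sketch under assumptions: the function name, the sample identifiers, the fixed seed and the rounding rule are hypothetical and not taken from the invention.

```python
import random

def stratified_split(sample_ids, labels, train_fraction=0.7, seed=0):
    """Randomly split each class separately into training and validation parts."""
    rng = random.Random(seed)
    train, valid = [], []
    for cls in sorted(set(labels)):
        members = [s for s, lab in zip(sample_ids, labels) if lab == cls]
        rng.shuffle(members)
        n_train = round(train_fraction * len(members))   # 30 -> 21, 82 -> 57
        train.extend(members[:n_train])
        valid.extend(members[n_train:])
    return train, valid

ids = [f"HBV_{i}" for i in range(30)] + [f"HCC_{i}" for i in range(82)]
labels = ["HBV"] * 30 + ["HCC"] * 82
train, valid = stratified_split(ids, labels)
print(len(train), len(valid))
```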

TABLE 1 195 DMRs screened against chronic viral hepatitis B according to the present invention

Ten-fold cross-validation is performed on the 21 chronic viral hepatitis B samples and 57 liver cancer samples of the training set, and the average of the thresholds at which the Youden index is optimal in each fold is used as the classification threshold for assigning the training set and validation set samples as negative or positive. The results are as follows: the overall sensitivity of the training set was 96.5% (55/57), the overall specificity was 90.5% (19/21), and the AUC was 0.991; the overall sensitivity of the validation set was 88.0% (22/25), the overall specificity was 88.9% (8/9), and the AUC was 0.915. The sensitivity for each specific liver cancer stage is shown in Table 2.

The specific steps of ten-fold cross validation for the training set are as follows:

1. The chronic viral hepatitis B samples in the training set are randomly split into 10 parts, and the liver cancer samples are likewise randomly split into 10 parts;

2. A DOC model is built using 9/10 of the chronic viral hepatitis B samples and 9/10 of the liver cancer samples;

3. The remaining 1/10 of the chronic viral hepatitis B samples and 1/10 of the liver cancer samples are predicted with this DOC model, and the optimal threshold of the fold is obtained by maximizing the Youden index;

4. The procedure is repeated until all samples have been traversed, giving 10 optimal thresholds;

5. The average of the 10 optimal thresholds is taken as the threshold of the DOC model, namely DOC model threshold = -0.03 in this embodiment;

6. The validation set samples are classified as negative or positive using the DOC model and the corresponding threshold: a sample is negative when its DOC model score is smaller than the threshold and positive when its score is larger than the threshold.
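Steps 3 to 5 above can be sketched as follows. The sketch assumes that the held-out DOC scores of each fold are already available, that liver cancer is coded as 1 and the non-cancer disease as 0, that each fold contains both classes, and that candidate thresholds are the midpoints between consecutive distinct scores; the function names are hypothetical.

```python
import numpy as np

def youden_threshold(scores, labels):
    """Cut-off maximising sensitivity + specificity - 1 on one held-out fold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    uniq = np.unique(scores)
    # Candidate cut-offs: midpoints between consecutive distinct scores
    candidates = (uniq[:-1] + uniq[1:]) / 2.0 if len(uniq) > 1 else uniq
    best_j, best_t = -np.inf, float(candidates[0])
    for t in candidates:
        positive = scores > t                        # positive when score > threshold
        sensitivity = np.mean(positive[labels == 1])
        specificity = np.mean(~positive[labels == 0])
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_j, best_t = j, float(t)
    return best_t

def mean_fold_threshold(fold_scores, fold_labels):
    """Average of the per-fold Youden-optimal thresholds (step 5)."""
    thresholds = [youden_threshold(s, l) for s, l in zip(fold_scores, fold_labels)]
    return float(np.mean(thresholds))
```

The averaged value would then be applied unchanged to the validation set, as in step 6.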

TABLE 2 sensitivity of liver cancer stages

In practice, owing to cost and efficiency constraints, the DOC model can also be built with a smaller number of DMRs and is not limited to all 195 DMRs in Table 1 above. In five replicates, 10 of the 195 DMRs were randomly selected each time to construct a DOC model and the corresponding threshold; the 10 DMRs selected in each replicate are shown in Table 3:
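The random selection described above can be sketched as follows. The DMR identifiers, the function name and the seed are hypothetical placeholders; the combinations actually drawn in the five replicates are those listed in Table 3.

```python
import random

def draw_dmr_subsets(dmr_ids, subset_size=10, n_repeats=5, seed=0):
    """Draw n_repeats independent random subsets of subset_size DMRs each."""
    rng = random.Random(seed)
    return [rng.sample(dmr_ids, subset_size) for _ in range(n_repeats)]

# Placeholder identifiers standing in for the 195 DMRs of Table 1
dmr_ids = [f"DMR_{i}" for i in range(1, 196)]
for repeat, subset in enumerate(draw_dmr_subsets(dmr_ids), start=1):
    print(repeat, subset)
```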

TABLE 3 five randomly selected DMR combinations for chronic viral hepatitis B

The sensitivity, specificity results of five replicates with training set specificity controlled at the same level (85.7%) are shown in table 4:

TABLE 4 sensitivity (with each stage), specificity results of five randomly selected 10 DMR repeats

It can be seen that any 10 of the 195 DMRs provided by the invention achieve good specificity and sensitivity when classifying the training set and validation set samples, including positive samples at each stage of liver cancer, which meets the expected use.

Example 4: detection of cirrhosis based on DMR using DOC model

Based on the differentiation of liver cancer patients and liver cirrhosis patients in different DMRs, 22 liver cirrhosis samples and 82 liver cancer samples are split into a training set (containing 15 liver cirrhosis samples and 57 liver cancer samples) and a verification set (containing 7 liver cirrhosis samples and 25 liver cancer samples) according to a ratio of 7:3. 230 DMRs with obvious methylation level differences (shown in table 5) are screened out by adopting a training set sample and used for constructing a DOC model and determining a threshold value, and the distinguishing performance of the model and the threshold value is further confirmed by utilizing a verification set sample.

TABLE 5 230 DMR for cirrhosis selected according to the invention

Ten-fold cross-validation is performed on the 15 liver cirrhosis samples and 57 liver cancer samples of the training set, and the average of the thresholds at which the Youden index is optimal in each fold is used as the classification threshold for assigning the training set and validation set samples as negative or positive. The specific procedure for ten-fold cross-validation is as in Example 3, and the resulting threshold is 0.5. The results are as follows: the overall sensitivity of the training set was 86.0% (49/57), the overall specificity was 86.7% (13/15), and the AUC was 0.929; the overall sensitivity of the validation set was 80.0% (20/25), the overall specificity was 85.7% (6/7), and the AUC was 0.940. The sensitivity for each specific liver cancer stage is shown in Table 6.

TABLE 6 sensitivity of liver cancer stages

In practice, owing to cost and efficiency constraints, the DOC model can also be built with a smaller number of DMRs and is not limited to all 230 DMRs in Table 5 above. In five replicates, 10 of the 230 DMRs were randomly selected each time to construct a DOC model and the corresponding threshold; the 10 DMRs selected in each replicate are shown in Table 7:

TABLE 7 five randomly selected DMR combinations for liver cirrhosis

The sensitivity, specificity results of five replicates with training set specificity controlled at the same level (73.3%) are shown in tables 8 and 9:

TABLE 8 sensitivity (without each stage), specificity results of five randomly selected 10 DMR repeats

TABLE 9 sensitivity (with each stage), specificity results of five randomly selected 10 DMR repeats

It can be seen that any 10 of the 230 DMRs provided by the invention achieve good specificity and sensitivity when classifying the training set and validation set samples, including positive samples at each stage of liver cancer, which meets the expected use.

Furthermore, although the description only provides a method of constructing a DOC model for methylation detection and determining a threshold based on the differences between liver cancer patients and chronic viral hepatitis B patients or liver cirrhosis patients, the method can equally be applied to patients with other liver non-cancer diseases, including: liver metastasis, hypercholesterolemia, alcoholic liver disease, cyst, fatty liver disease (NAFLD), fibrosis, jaundice, primary sclerosing cholangitis (PSC), hemochromatosis, primary biliary cirrhosis, or alpha-1 antitrypsin deficiency.

The foregoing detailed description is provided by way of explanation and example and is not intended to limit the scope of the appended claims. Numerous variations of the presently illustrated embodiments of the application will be apparent to those of ordinary skill in the art and are intended to be within the scope of the appended claims and equivalents thereof.

Claims

1. A biomarker combination for distinguishing liver cancer samples from liver non-cancer disease samples, wherein the biomarker combination comprises any of at least 10 differential methylation regions DMR as set forth in table 1 and/or table 5, wherein the reference genome employed by the DMR in table 1 and/or table 5 is a GRCh37/hg19 human reference genome;

preferably, the biomarker combination comprises any at least 10 DMR as set forth in table 1, and/or any at least 10 DMR as set forth in table 5;

Preferably, the biomarker combination comprises all 195 DMRs shown in table 1, and/or all 230 DMRs shown in table 5;

more preferably, the biomarker combination comprises any of at least 10 DMR as shown in table 1; alternatively, the biomarker combination comprises 10 DMR selected from any of the group of repeat one, repeat two, repeat three, repeat four, or repeat five shown in table 3;

More preferably, the biomarker combination comprises any of at least 10 DMR as shown in table 5; alternatively, the biomarker combination comprises 10 DMR selected from any of the group of repeat one, repeat two, repeat three, repeat four, or repeat five shown in table 7.

2. A kit, wherein the kit comprises reagents for detecting the biomarker combination of claim 1.

3. The kit of claim 2, wherein the kit comprises next generation sequencing reagents;

Preferably, the next generation sequencing reagents comprise hybridization capture probes or primers covering any at least 10 DMR in table 1 and/or table 5;

preferably, the next generation sequencing reagent comprises a primer covering any at least 10 DMR as set forth in table 1, and/or any at least 10 DMR as set forth in table 5;

Preferably, the next generation sequencing reagent comprises a hybridization capture probe or primer covering all 195 DMRs shown in table 1, and/or all 230 DMRs shown in table 5;

More preferably, the next generation sequencing reagents comprise hybridization capture probes or primers covering any of at least 10 DMR shown in table 1; alternatively, the next generation sequencing reagents comprise hybridization capture probes or primers covering 10 DMR selected from any of the sets of repeat one, repeat two, repeat three, repeat four, or repeat five shown in table 3;

more preferably, the next generation sequencing reagents comprise hybridization capture probes or primers covering any of at least 10 DMR shown in table 5; alternatively, the next generation sequencing reagents comprise hybridization capture probes or primers covering 10 DMR selected from any one of the sets of repeat one, repeat two, repeat three, repeat four, or repeat five shown in table 7.

4. A kit according to claim 2 or 3, wherein the kit is for distinguishing liver cancer samples from liver non-cancerous disease samples;

Preferably, the liver non-cancerous disease comprises: cirrhosis, chronic viral hepatitis B, liver metastasis, high cholesterol, alcoholic liver disease, cyst, fatty liver disease (NAFLD), fibrosis, jaundice, primary Sclerosing Cholangitis (PSC), hemochromatosis, primary biliary cirrhosis, or alpha-1 antitrypsin deficiency.

5. Use of a reagent for detecting the biomarker combination of claim 1 in the preparation of a kit for distinguishing a liver cancer sample from a liver non-cancer disease sample.

6. A methylation data classification method comprising:

acquiring first methylation data corresponding to the biomarker combination of claim 1 in a sample to be tested;

correcting the first methylation data according to the confusion factor corresponding to the sample to be detected to obtain second methylation data;

Classifying based on a preset rule according to the second methylation data and a classification threshold value, and generating indication information for indicating the classification to which the sample to be detected belongs;

The preset rule comprises the steps of comparing the numerical value of the second methylation data with a classification threshold value, and generating the indication information according to a comparison result; preferably, in response to the value of the second methylation data being smaller than or equal to the classification threshold, first indication information for indicating the classification to which the sample to be tested belongs is generated, and in response to the value of the second methylation data being larger than the classification threshold, second indication information for indicating the classification to which the sample to be tested belongs is generated;

Preferably, the classification to which the sample to be tested belongs includes: liver cancer, liver cirrhosis, chronic viral hepatitis B, liver metastasis, high cholesterol, alcoholic liver disease, cyst, fatty liver disease (NAFLD), fibrosis, jaundice, primary Sclerosing Cholangitis (PSC), hemochromatosis, primary biliary cirrhosis, or alpha-1 antitrypsin deficiency.

7. The method of claim 6, wherein the sample to be tested is selected from any one or more of the following: tissue samples, blood samples, saliva, sputum, pleural effusion, lung lavage, peritoneal effusion, peritoneal lavage, enema, and cerebrospinal fluid.

8. An apparatus for distinguishing a liver cancer sample from a liver non-cancer disease sample, comprising:

An acquisition unit configured to acquire first methylation data corresponding to the biomarker combination of claim 1 in a sample to be tested;

The correction unit is configured to correct the first methylation data according to the confusion factor corresponding to the sample to be detected to obtain second methylation data;

The classifying unit is configured to generate indicating information for indicating the class to which the sample to be detected belongs based on a preset rule according to the second methylation data and a classifying threshold, wherein the preset rule comprises comparing the value of the second methylation data with the classifying threshold, and generating the indicating information according to a comparison result; preferably, in response to the value of the second methylation data being smaller than or equal to the classification threshold, first indication information for indicating the classification to which the sample to be tested belongs is generated, and in response to the value of the second methylation data being larger than the classification threshold, second indication information for indicating the classification to which the sample to be tested belongs is generated.

9. An electronic device, comprising:

one or more processors;

A storage device having one or more programs stored thereon,

The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of claim 6 or 7.

10. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by one or more processors implements the method of claim 6 or 7.