CN111613268B - Method for determining gene expression regulation mechanism based on single cell transcriptome data - Google Patents
- Publication number
- CN111613268B (CN202010464757.1A)
- Authority
- CN
- China
- Prior art keywords
- cell
- gene
- expression
- determining
- central
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/10—Gene or protein expression profiling; Expression-ratio estimation or normalisation
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/30—Unsupervised data analysis
Abstract
The application provides a method for determining a gene expression regulation mechanism based on single-cell transcriptome data, which comprises the following steps: determining the specific high-expression genes of the central cell and the specific high-expression genes of the neighbor cell according to the single-cell transcriptome data; determining a first sub-network between the central cell and the neighbor cell according to the specific high-expression genes of the central cell and the neighbor cell and the pairing information between the ligand and the receptor; determining a second sub-network of the central cell according to the specific high-expression genes of the central cell and the interaction information between the transcription factor and the target gene; determining a third sub-network of the central cell according to the specific high-expression genes of the central cell and the interaction information between the receptor and the transcription factor; and determining an intercellular multilayer signal network of the central cell and the neighbor cells according to the first sub-network, the second sub-network and the third sub-network, so as to reveal how the neighbor cells regulate gene expression in the central cell.
Description
Technical Field
The application relates to the field of bioinformatics, in particular to a method for determining a gene expression regulation mechanism based on single-cell transcriptome data.
Background
Gene expression is the foundation of complex life phenomena. It is a coordinated process with multi-level, multi-factor and spatiotemporal characteristics, and understanding how it is regulated helps to explain mechanisms such as growth and development and the onset and progression of disease.
Changes in gene expression levels affect cell function and fate, so studying gene expression regulation mechanisms requires establishing the corresponding cell signaling networks. The signal networks involved in the regulation of gene expression include intercellular signaling as well as intracellular signaling and gene activation. Therefore, a systematic, multi-layered network covering both intercellular and intracellular signaling is needed to elucidate the regulatory mechanisms of gene expression.
Currently, there are two main methods for studying gene expression regulation mechanisms: the first is based on traditional experimental studies and the second is based on high-throughput technology (RNA-Seq). Both approaches focus primarily on one or several linear signaling pathways at the molecular level.
Both methods have a number of disadvantages in their application:
1) Gene expression is regulated by a very complex signal network formed by interwoven signal pathways composed of functional molecules such as ligand-receptor-transcription factor-target gene. A single linear pathway, or a few of them, is not enough to clarify the regulation mechanism of gene expression.
2) The influence of the cellular microenvironment on the regulation of gene expression levels is neglected.
3) Traditional high-throughput sequencing techniques ignore differences in cell specificity. Conventional transcriptome sequencing usually reflects the average expression level of genes across all cells in a given region, so a cell-specific functional molecule with a particular regulatory role may be mistaken for a molecule without regulatory significance because its expression level is lower than that of functional molecules widely expressed across other cells. Thus, subtle, specific biological effects caused by cell-type differences are easily overlooked when using traditional transcriptome sequencing.
4) Traditional experimental research is inefficient. Although experimental research is accurate and its results are relatively reliable, the high resource consumption of experiments and the complexity of the cell signal network mean that research can be carried out only on specific signal pathways. Therefore, the regulation mechanism of gene expression cannot be comprehensively and systematically elucidated by conventional experimental studies.
Therefore, current methods for studying the regulation mechanism of gene expression have difficulty revealing it comprehensively and systematically.
Disclosure of Invention
The embodiments of the present application are directed to a method for determining a gene expression regulation mechanism based on single-cell transcriptome data, so as to comprehensively and systematically reveal the gene expression regulation mechanism.
In order to achieve the above object, the embodiments of the present application are implemented as follows:
In a first aspect, the embodiments of the present application provide a method for determining a gene expression regulation mechanism based on single-cell transcriptome data, comprising: determining specific high-expression genes of a central cell and specific high-expression genes of a neighbor cell, wherein the central cell represents the cell type whose gene expression regulation mechanism is to be studied, and the neighbor cell represents a cell type that may influence the gene expression of the central cell; determining a first sub-network between the central cell and the neighbor cell according to the specific high-expression genes of the central cell, the specific high-expression genes of the neighbor cell, and pairing information between a ligand and a receptor; determining a second sub-network of the central cell according to the specific high-expression genes of the central cell and the interaction information between the transcription factors and the target genes; determining a third sub-network of the central cell according to the specific high-expression genes of the central cell and the interaction information between the receptor and the transcription factor; and determining an intercellular multilayer signal network of the central cell and the neighbor cells according to the first sub-network, the second sub-network and the third sub-network, so as to reveal how the neighbor cells regulate gene expression in the central cell.
In the embodiment of the application, the specific high-expression genes of the central cell and of the neighbor cell are determined and combined with the pairing information between ligands and receptors, the interaction information between transcription factors and target genes, and the interaction information between receptors and transcription factors to construct a first sub-network, a second sub-network and a third sub-network, respectively; integrating them yields the intercellular multilayer signal network of the central cell and the neighbor cell. The gene regulation mechanism can therefore be displayed more comprehensively and systematically, the influence of cell specificity on the gene expression regulation mechanism can be accurately reflected, and a new tool is provided for analyzing how the cell microenvironment mediates the regulation of a gene of interest.
With reference to the first aspect, in a first possible implementation manner of the first aspect, before the determining the specific high-expression genes of the central cell and the specific high-expression genes of the neighbor cells, the method further includes: determining a single-cell expression matrix according to the single-cell transcriptome data, wherein each row represents a gene and each column represents a cell; filtering the single-cell expression matrix, and normalizing the filtered single-cell expression matrix, so as to preprocess the single-cell transcriptome data; performing dimension reduction, clustering and cell type identification on the preprocessed single-cell transcriptome data to determine a data set comprising multiple cell types; and determining the central cell and the neighbor cells from the data set.
In this implementation, the single-cell expression matrix is determined according to the single-cell transcriptome data and is then filtered and normalized, which provides data quality control, reduces the influence of sequencing depth/library size and of abnormal or extreme values on the sequencing result, and corrects possible technical noise. Performing dimension reduction, clustering and cell type identification on the preprocessed single-cell transcriptome data to determine a data set comprising multiple cell types helps to classify the cell types accurately, which facilitates the study of the gene expression regulation mechanism.
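The filtering and normalization step described above can be sketched as follows. This is a minimal illustration only, assuming a genes-by-cells count matrix held in a NumPy array; the function name `preprocess`, the thresholds `min_genes_per_cell` and `min_cells_per_gene`, and the use of library-size scaling followed by a log transform are assumptions chosen for the example and are not specified by the application.

```python
import numpy as np

def preprocess(counts, min_genes_per_cell=200, min_cells_per_gene=3, scale=1e4):
    """Filter and normalize a genes x cells count matrix (rows: genes, columns: cells).

    All thresholds and the normalization scheme are hypothetical defaults.
    """
    counts = np.asarray(counts, dtype=float)

    # Quality control on cells: drop cells in which too few genes are detected.
    cell_mask = (counts > 0).sum(axis=0) >= min_genes_per_cell
    kept = counts[:, cell_mask]

    # Quality control on genes: drop genes detected in too few remaining cells.
    gene_mask = (kept > 0).sum(axis=1) >= min_cells_per_gene
    kept = kept[gene_mask, :]

    # Library-size normalization reduces the effect of sequencing depth;
    # the log transform dampens extreme values.
    lib_size = kept.sum(axis=0, keepdims=True)
    lib_size = np.where(lib_size == 0, 1.0, lib_size)  # guard against empty cells
    normalized = np.log1p(kept / lib_size * scale)
    return normalized, gene_mask, cell_mask
```

The returned masks index the original rows and columns, so gene and cell labels can be subset consistently before dimension reduction and clustering.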
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the determining the specific high-expression gene of the central cell and the specific high-expression gene of the neighboring cell includes: determining a specific expression matrix of the central cell and a specific expression matrix of the neighbor cell according to the clustering result of the single-cell transcriptome data; and determining the specific high-expression genes of the central cells and the specific high-expression genes of the neighbor cells according to the specific expression matrix of the central cells and the specific expression matrix of the neighbor cells and preset screening conditions.
In this implementation, the specific expression matrix of the central cell and the specific expression matrix of the neighbor cell are determined and combined with the preset screening conditions, so that the specific high-expression genes of the central cell and of the neighbor cell can be accurately determined.
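One way to picture the screening step is the sketch below. The application only says that "preset screening conditions" are applied; the two conditions used here (a minimum difference in mean normalized expression between the cell type and all other cells, and a minimum fraction of expressing cells) and the names `specific_high_genes`, `min_diff` and `min_frac` are assumptions for illustration.

```python
import numpy as np

def specific_high_genes(expr, labels, genes, cell_type, min_diff=1.0, min_frac=0.25):
    """Return genes that are specifically highly expressed in `cell_type`.

    expr   : genes x cells matrix of normalized expression
    labels : cluster / cell-type label of each column
    genes  : gene name of each row
    The screening conditions are hypothetical stand-ins for the preset ones.
    """
    expr = np.asarray(expr, dtype=float)
    in_type = np.asarray(labels) == cell_type

    # Cell-type-specific expression matrix versus all remaining cells.
    mean_in = expr[:, in_type].mean(axis=1)
    mean_out = expr[:, ~in_type].mean(axis=1)
    frac_in = (expr[:, in_type] > 0).mean(axis=1)

    keep = (mean_in - mean_out >= min_diff) & (frac_in >= min_frac)
    return [g for g, k in zip(genes, keep) if k]
```

Calling the function once with the central cell type and once with each neighbor cell type gives the gene lists used to build the sub-networks.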
With reference to the first aspect, in a third possible implementation manner of the first aspect, the determining a first subnetwork between the central cell and the neighbor cell according to the specific high-expression gene of the central cell and the specific high-expression gene of the neighbor cell, and the pairing information between the ligand and the receptor includes: obtaining a first relationship list comprising pairing information between a ligand and a receptor; according to the first relation list, determining a high-expression receptor from the specific high-expression genes of the central cell, and determining a high-expression ligand from the specific high-expression genes of the neighbor cells; establishing a first sub-network between the central cell and the neighbor cells based on the highly expressed receptor and the highly expressed ligand.
In this implementation, a first relationship list including information on pairs of ligands and receptors is determined, a high-expression receptor is determined from the specific high-expression genes of the central cell, and a high-expression ligand is determined from the specific high-expression genes of the neighbor cells, so that a first subnetwork between the central cell and the neighbor cells is established, and a signal network formed by the neighbor cells through the pairs of ligands and receptors of the central cell can be accurately reflected.
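The construction of the first sub-network can be sketched as a simple filter over the ligand-receptor list. The names below (`build_first_subnetwork`, `LIG_A`, `REC_X` and so on) are placeholders invented for the example; the source of the first relationship list is not shown here.

```python
def build_first_subnetwork(lr_pairs, neighbor_high_genes, central_high_genes):
    """Keep ligand-receptor pairs whose ligand is highly expressed in the
    neighbor cell and whose receptor is highly expressed in the central cell.

    lr_pairs : iterable of (ligand, receptor) tuples (the first relationship list)
    Returns a list of (ligand, receptor) edges from neighbor cell to central cell.
    """
    ligands = set(neighbor_high_genes)
    receptors = set(central_high_genes)
    return [(lig, rec) for lig, rec in lr_pairs if lig in ligands and rec in receptors]


# Toy data with placeholder gene names.
pairs = [("LIG_A", "REC_X"), ("LIG_B", "REC_Y"), ("LIG_C", "REC_X")]
edges = build_first_subnetwork(pairs, ["LIG_A", "LIG_B"], ["REC_X", "GENE_1"])
print(edges)  # [('LIG_A', 'REC_X')]
```

Only the pair whose ligand and receptor both pass the screening survives; `LIG_B` is dropped because `REC_Y` is not highly expressed in the central cell, and `LIG_C` because it is not highly expressed in the neighbor cell.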
With reference to the first aspect, in a fourth possible implementation manner of the first aspect, the determining a second subnetwork of the central cell according to the specific highly expressed gene of the central cell and the interaction information between the transcription factor and the target gene includes: acquiring a second relation list containing interaction information between the transcription factor and the target gene; determining a first gene set containing all target genes in the central cell, a second gene set containing high-expression target genes in the central cell and a third gene set containing target genes corresponding to the specified transcription factors; determining a fourth gene set comprising significantly activated transcription factors according to the first gene set, the second gene set and the third gene set; establishing a second subnetwork of the central cells according to the second relationship list, the second gene set, and the fourth gene set.
In this implementation, by determining a second relationship list including interaction information of the transcription factor and the target genes, and determining a fourth gene set including a significantly activated transcription factor according to a first gene set including all target genes in the central cell, a second gene set including target genes highly expressed in the central cell, and a third gene set including target genes corresponding to the specified transcription factor, a second subnetwork of the central cell can be established by combining the second relationship list, the second gene set, and the fourth gene set, so as to comprehensively and accurately reflect a signal network between the transcription factor and the target genes in the central cell.
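The passage above does not state which statistical test turns the first, second and third gene sets into the fourth set of significantly activated transcription factors. The sketch below assumes a one-sided hypergeometric enrichment test, which is one common choice for this kind of set overlap; the function names, the `alpha` cutoff and the toy identifiers are all hypothetical.

```python
from math import comb

def hypergeom_pvalue(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws)."""
    total = comb(population, draws)
    upper = min(successes, draws)
    tail = sum(comb(successes, i) * comb(population - successes, draws - i)
               for i in range(k, upper + 1))
    return tail / total

def activated_tfs(tf_targets, all_targets, high_targets, alpha=0.05):
    """Return {tf: (p_value, highly expressed targets)} for enriched TFs.

    tf_targets   : dict mapping each transcription factor to its target genes
                   (from the second relationship list)
    all_targets  : first gene set, all target genes in the central cell
    high_targets : second gene set, highly expressed target genes
    """
    universe = set(all_targets)
    high = set(high_targets) & universe
    result = {}
    for tf, targets in tf_targets.items():
        third_set = set(targets) & universe   # targets of this specific TF
        overlap = third_set & high
        if not overlap:
            continue
        p = hypergeom_pvalue(len(overlap), len(universe), len(third_set), len(high))
        if p <= alpha:
            result[tf] = (p, sorted(overlap))
    return result
```

The keys of the returned dictionary play the role of the fourth gene set, and the retained transcription factor to target pairs form the edges of the second sub-network. The same test could be reused for the receptor layer by swapping in the corresponding receptor and transcription factor sets.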
With reference to the first aspect, in a fifth possible implementation manner of the first aspect, the determining a third subnetwork of the central cell according to the specific highly expressed gene of the central cell and the interaction information between the receptor and the transcription factor includes: obtaining a third relation list containing interaction information between the receptor and the transcription factor; determining a fifth gene set containing all transcription factors in the central cells, a sixth gene set containing activated transcription factors in the central cells and a seventh gene set containing the transcription factors corresponding to the specified receptors; determining an eighth gene set comprising significantly activated receptors according to the fifth gene set, the sixth gene set and the seventh gene set; establishing a third subnetwork of the central cells according to the third relational list, the sixth gene set and the eighth gene set.
In this implementation, a third subnetwork of the central cell can be established by determining a third relational list containing information about interaction between the receptors and the transcription factors, and determining an eighth gene set containing significantly activated receptors from a fifth gene set containing all the transcription factors in the central cell, a sixth gene set containing the transcription factors activated in the central cell, and a seventh gene set containing the transcription factors corresponding to the specified receptors, so as to comprehensively and accurately reflect a signal network between the receptors and the transcription factors in the central cell.
With reference to the first aspect or any one of the first to the fifth possible implementation manners of the first aspect, in a sixth possible implementation manner of the first aspect, the determining an intercellular multilayer signal network of the central cell and the neighboring cells according to the first sub-network, the second sub-network, and the third sub-network includes: updating the first, second, and third sub-networks according to an upstream-downstream relationship among the first, second, and third sub-networks; and integrating the updated first sub-network, the updated second sub-network and the updated third sub-network to determine the intercellular multilayer signal network of the central cell and the neighbor cells.
Updating each sub-network according to the upstream and downstream relations among the first sub-network, the second sub-network and the third sub-network, and integrating the updated sub-networks to determine the intercellular multilayer signal network of the central cell and the neighbor cells, so that the signal network relation between the central cell and the neighbor cells can be systematically and comprehensively reflected, and the method is favorable for systematically and deeply researching the regulation and control path and mechanism of gene expression in the central cell.
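The update-and-integrate step can be illustrated with the sketch below. The passage does not spell out how the sub-networks are updated, so the rule used here is an assumption: an edge is kept only if it lies on a complete ligand to receptor to transcription factor to target gene chain, and pruning is repeated until nothing changes. The function name `integrate_subnetworks` and the edge-list representation are also illustrative.

```python
def integrate_subnetworks(lr_edges, rec_tf_edges, tf_target_edges):
    """Prune three edge lists so that the layers connect, then merge them.

    lr_edges        : (ligand, receptor) edges, first sub-network
    rec_tf_edges    : (receptor, transcription factor) edges, third sub-network
    tf_target_edges : (transcription factor, target gene) edges, second sub-network
    """
    lr, rt, tt = set(lr_edges), set(rec_tf_edges), set(tf_target_edges)
    while True:
        # Nodes that are present on both sides of each layer boundary.
        receptors = {r for _, r in lr} & {r for r, _ in rt}
        tfs = {t for _, t in rt} & {t for t, _ in tt}

        new_lr = {(l, r) for l, r in lr if r in receptors}
        new_rt = {(r, t) for r, t in rt if r in receptors and t in tfs}
        new_tt = {(t, g) for t, g in tt if t in tfs}

        if (new_lr, new_rt, new_tt) == (lr, rt, tt):
            break  # stable: every remaining edge is part of a full chain
        lr, rt, tt = new_lr, new_rt, new_tt

    return {
        "ligand_receptor": sorted(lr),
        "receptor_tf": sorted(rt),
        "tf_target": sorted(tt),
    }
```

Repeating the pruning matters because removing a receptor to transcription factor edge can leave a receptor with no downstream path, which in turn invalidates its ligand edges.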
In a second aspect, the present application provides an apparatus for determining a gene expression regulation mechanism based on single-cell transcriptome data, comprising: a high-expression gene screening unit, configured to determine specific high-expression genes of a central cell and specific high-expression genes of a neighbor cell, wherein the central cell represents the cell type whose gene expression regulation mechanism is to be studied, and the neighbor cell represents a cell type that may influence the gene expression of the central cell; a first sub-network construction unit, configured to determine a first sub-network between the central cell and the neighbor cell according to the specific high-expression genes of the central cell and the neighbor cell, and pairing information between a ligand and a receptor; a second sub-network construction unit, configured to determine a second sub-network of the central cell according to the specific high-expression genes of the central cell and the interaction information between the transcription factors and the target genes; a third sub-network construction unit, configured to determine a third sub-network of the central cell according to the specific high-expression genes of the central cell and the interaction information between the receptor and the transcription factor; and a multilayer network model construction unit, configured to determine an intercellular multilayer signal network of the central cell and the neighbor cells according to the first sub-network, the second sub-network and the third sub-network, so as to reveal how the neighbor cells regulate gene expression in the central cell.
In a third aspect, embodiments of the present application provide a storage medium storing one or more programs, which are executable by one or more processors to implement the method for determining a gene expression regulation mechanism based on single-cell transcriptome data according to any one of the first aspect or possible implementations of the first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, where the memory is configured to store information including program instructions, and the processor is configured to control execution of the program instructions, where the program instructions are loaded and executed by the processor, to implement the method for determining a single-cell transcriptome-data-based gene expression regulation mechanism according to any one of the first aspect or possible implementations of the first aspect.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
FIG. 1 is a flow chart of a method for determining a gene expression control mechanism based on single-cell transcriptome data according to an embodiment of the present disclosure.
FIG. 2 is a schematic diagram of the overall process of the method for determining the gene expression control mechanism based on single-cell transcriptome data provided in the embodiment of the present application.
FIG. 3 is a violin plot of cell-specific gene expression.
FIG. 4 is a schematic diagram of the cell-specific expression of the gene of interest ACE2 and the multi-layer signal network between the central cell and the neighboring cells.
FIG. 5 is a block diagram illustrating a structure of a device for determining a gene expression regulation mechanism based on single-cell transcriptome data according to an embodiment of the present application.
FIG. 6 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Reference numerals: 10-a device for determining a gene expression regulation mechanism based on single-cell transcriptome data; 11-a high-expression gene screening unit; 12-a first sub-network construction unit; 13-a second sub-network construction unit; 14-a third sub-network construction unit; 15-a multilayer network model construction unit; 20-an electronic device; 21-a memory; 22-a communication module; 23-a bus; 24-a processor.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Gene expression is regulated by a very complex signaling network. The signal network is usually formed by interlacing signal paths composed of functional molecules such as ligands, receptors, transcription factors and target genes. Since there is also crosstalk between signal paths, there is no strict one-to-one correspondence between functional molecules. For example, the same receptor can bind to a plurality of ligands with similar structures, an activated receptor can change the conformation of a plurality of downstream molecules, and a transcription factor can activate the expression of a plurality of target genes. Likewise, the same functional molecule may appear in different pathways and perform different biological functions, and different ligands acting on the same receptor may produce different or even opposite effects. Therefore, a comprehensive signaling network integrating multiple pathways is needed to elucidate gene expression regulation mechanisms more systematically.
Furthermore, cells are not independent individuals in multicellular tissues or organisms. Multiple cell types often coexist and interact in the microenvironment, and their function and fate are often coordinated by their local environment and neighboring cells. The cellular microenvironment includes various types of cells and chemical molecules, and the intercellular secretion of signal molecules (indirect information exchange) is one of the information transmission modes in the cellular microenvironment. This mode is not limited to the interaction between receptors and ligands of different cells; it also includes signal transduction on the cell surface, the cascade amplification of signal molecules inside the cell, and the interaction between downstream transcription factors and target genes. Therefore, the regulation mechanism of gene expression can be studied comprehensively and systematically by studying the cell microenvironment.
Moreover, conventional transcriptome sequencing usually reflects the average expression level of each gene across all cells in a region. Cell-specific functional molecules with particular regulatory effects may therefore be mistaken for molecules without regulatory significance, because their average expression level is lower than that of functional molecules expressed by other cells. Subtle, specific biological effects caused by cell type differences are thus easily overlooked when traditional transcriptome sequencing is used. Single-cell transcriptome data can be used to identify cell types and to quantify cell type-specific gene expression in mixed cell populations, so that interactions within the microenvironment can be resolved and the intracellular and intercellular signal pathways mediated by the microenvironment can be clarified.
Therefore, in order to comprehensively and systematically explore the regulation mechanism of gene expression, the embodiment of the application provides a method for determining the regulation mechanism of gene expression based on the single-cell transcriptome data.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for determining a gene expression control mechanism based on single-cell transcriptome data according to an embodiment of the present disclosure. The method for determining a gene expression regulatory mechanism based on single-cell transcriptome data may include step S10, step S20, step S30, step S40, and step S50.
In order to take into account the subtle, specific biological effects caused by cell type differences in the intercellular signaling network for comprehensive and systematic study of the gene expression regulation mechanism, the single-cell transcriptome data approach can be used in this embodiment.
Referring to FIG. 2, FIG. 2 is a schematic diagram of the overall process of determining the gene expression control mechanism based on single-cell transcriptome data according to the embodiment of the present application. The process mainly comprises: a single-cell transcriptome data analysis process, a process of constructing the signal path sub-networks of the multilayer structure, and a process of constructing the multilayer signal network of the cell microenvironment. The analysis of the single-cell transcriptome data may include quality control, normalization, dimensionality reduction, cell clustering, marker identification, and the like. The construction of the signal path sub-networks of the multilayer structure may include: construction of the ligand-receptor sub-network (first sub-network); construction of the TF-target gene sub-network, i.e., the transcription factor-target gene sub-network (second sub-network); and construction of the receptor-TF sub-network, i.e., the receptor-transcription factor sub-network (third sub-network). The multilayer signal network is then constructed from these sub-networks.
Before step S10 is executed, data preprocessing may be performed on the single-cell transcriptome data, and dimension reduction, clustering and cell type identification may be performed on the single-cell transcriptome data after data preprocessing, so as to determine a data set including multiple cell types.
In this embodiment, the R package Seurat can be used to analyze the single-cell transcriptome data, and the analysis process may include two parts: data preprocessing and downstream analysis.
Illustratively, the data preprocessing process may include the steps of: data quality control, normalization/scaling, and data correction and integration.
For example, assuming a total of n cells, m genes, in the single-cell transcriptome data, the single-cell transcriptome data may be converted into an m × n matrix, where rows represent each gene and columns represent each cell.
In order to realize data quality control, the single-cell expression matrix can be filtered. Illustratively, data quality control can be based on the number of genes detected per cell (e.g., filtering out cells that express more than 2500 or fewer than 200 genes), the number of cells expressing each gene (e.g., filtering out genes expressed in fewer than 3 cells), and the proportion of mitochondrial/ribosomal genes (e.g., filtering out cells in which mitochondrial/ribosomal genes account for more than 10% of expression). It should be noted that these data quality control methods and criteria are only exemplary and should not be considered as limiting the present application; the methods and criteria may be selected according to actual needs.
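The filtering criteria above can be sketched in plain Python as follows. This is only an illustration on an m × n count matrix stored as a list of gene rows; the function name `quality_control`, the `mito_genes` argument and the toy data are assumptions made for this sketch, the default thresholds are the example values from the text, and the code is not the Seurat implementation.

```python
def quality_control(counts, gene_names, mito_genes,
                    min_genes=200, max_genes=2500,
                    min_cells=3, max_mito_frac=0.10):
    """Filter an m x n count matrix (rows = genes, columns = cells).

    Returns (kept_gene_indices, kept_cell_indices).
    """
    n_genes, n_cells = len(counts), len(counts[0])
    mito_rows = [g for g in range(n_genes) if gene_names[g] in mito_genes]

    kept_cells = []
    for c in range(n_cells):
        detected = sum(1 for g in range(n_genes) if counts[g][c] > 0)
        total = sum(counts[g][c] for g in range(n_genes))
        mito = sum(counts[g][c] for g in mito_rows)
        mito_frac = mito / total if total > 0 else 0.0
        # Keep cells with a plausible number of detected genes and low mitochondrial share.
        if min_genes <= detected <= max_genes and mito_frac <= max_mito_frac:
            kept_cells.append(c)

    # Keep genes detected in at least `min_cells` cells (counted over all input cells).
    kept_genes = [g for g in range(n_genes)
                  if sum(1 for c in range(n_cells) if counts[g][c] > 0) >= min_cells]
    return kept_genes, kept_cells


# Toy example with deliberately small, hypothetical thresholds.
toy = [[1, 0, 5, 2],   # geneA
       [0, 0, 3, 1],   # geneB
       [4, 0, 0, 0],   # MT-X (hypothetical mitochondrial gene)
       [2, 1, 1, 1]]   # geneC
genes, cells = quality_control(toy, ["geneA", "geneB", "MT-X", "geneC"], {"MT-X"},
                               min_genes=2, max_genes=3,
                               min_cells=3, max_mito_frac=0.5)
```

In this toy run the first cell is dropped for its high mitochondrial share and the second for expressing too few genes.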
In this example, each value in the expression matrix may represent the number of mRNA molecules of a gene that were successfully captured, reverse-transcribed and sequenced in a cell. In practice, even if the same cell is sequenced twice, the depth of the counts obtained may vary. Thus, to reduce the impact of sequencing depth and/or library size on sequencing results, the expression matrix can be normalized. For example, the expression matrix is normalized by the LogNormalize algorithm to obtain relative abundances of gene expression that are comparable between cells. In order to facilitate the subsequent downstream analysis, highly variable genes may be identified by using the vst method (of course, the highly variable genes may also be identified during the subsequent downstream analysis, which is not limited herein). In addition, in order to reduce the influence of outliers and extreme values of gene expression, the expression matrix may be z-score transformed (scaled) so that, for each gene, the mean expression across all cells is 0 and the variance is 1. It should be noted that this normalization method is only exemplary and should not be considered as limiting the application; different methods may be selected according to actual needs. Whether to perform scaling may also be decided according to actual needs.
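A minimal sketch of the two transformations described above, written in plain Python. It assumes that log-normalization divides each count by the cell's total count, multiplies by a scale factor (10,000 here, an assumed default) and applies log(1 + x), and that scaling uses the sample standard deviation; the function names and toy matrix are hypothetical.

```python
import math
import statistics


def log_normalize(counts, scale_factor=10_000):
    """Library-size normalization per cell followed by log1p (rows = genes, columns = cells)."""
    n_genes, n_cells = len(counts), len(counts[0])
    totals = [sum(counts[g][c] for g in range(n_genes)) for c in range(n_cells)]
    return [[math.log1p(counts[g][c] / totals[c] * scale_factor) if totals[c] > 0 else 0.0
             for c in range(n_cells)]
            for g in range(n_genes)]


def z_score_rows(matrix):
    """Scale each gene (row) to mean 0 and unit variance across cells."""
    scaled = []
    for row in matrix:
        mu = statistics.fmean(row)
        sd = statistics.stdev(row) if len(row) > 1 else 0.0
        # Genes with zero variance are set to 0 to avoid division by zero.
        scaled.append([(v - mu) / sd if sd > 0 else 0.0 for v in row])
    return scaled


counts = [[10, 0, 5],
          [90, 50, 5]]
norm = log_normalize(counts)
scaled = z_score_rows(norm)
```

After scaling, every gene with non-constant expression has mean 0 and variance 1 across cells, so genes with very different absolute abundance contribute on a comparable scale to the downstream dimensionality reduction.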
When single-cell transcriptome data contains multiple data sets, the effect of batch effects also needs to be considered in the process of merging the data sets. In addition, single cell sequencing may present various technical and biological noises, such as cell stress state, sequencing capture failure (dropout), and the like. Corresponding measures (ComBat, LIGER and other software) can be taken to reduce the influence of batch effect and other noises so as to ensure the accuracy of the data as much as possible.
After data preprocessing is performed on the single-cell transcriptome data, downstream analysis can be performed on the single-cell transcriptome data after data preprocessing. In this embodiment, the downstream analysis may include dimension reduction, clustering, and cell type identification.
For example, the normalized expression matrix may be subjected to linear dimensionality reduction by using a Principal Component Analysis (PCA), or may be subjected to other manners, which is not limited herein.
After linear dimensionality reduction, a subset of the principal components can be selected as required to represent the entire data set for cluster analysis. For example, the selected principal components are clustered by using the Louvain algorithm, and the clustering result can be visualized through t-SNE. Of course, the cluster analysis or the visualization of the clustering result may be implemented in other manners, which is not limited herein.
Then, the clustering result is annotated, either with cell identity information collected from sources such as the literature and databases or with automatic annotation software, to obtain the different cell types. The cell types are thereby identified, and a data set comprising a plurality of cell types is determined.
Preprocessing the single-cell transcriptome data achieves data quality control, reduces the influence of sequencing depth/library size and of outliers/extreme values on the sequencing results, and corrects possible technical noise. Performing dimensionality reduction, clustering and cell type identification on the preprocessed single-cell transcriptome data determines a data set comprising multiple cell types, which facilitates accurate classification of the cell types and thus the study of the gene expression regulation mechanism.
After data pre-processing and downstream analysis (including dimension reduction, clustering, and cell type identification) of the single-cell transcriptome data is completed, central cells and neighbor cells can be determined from the data set.
Generally, when the regulation mechanism of a gene of interest needs to be investigated because of its specific expression, the cell type that specifically expresses the gene can be determined as the central cell, and a cell type that may affect the gene expression of the central cell can be regarded as a neighbor cell. Thus, the central cell and neighbor cells can be determined from the data set. However, this manner should not be considered as limiting the present application, and there are other ways of determining the central and neighbor cells. For example, when the effect of the cellular microenvironment on gene expression is to be studied, one particular cell type may be designated as the central cell and another particular cell type may be designated as the neighbor cell.
After the central cell and the neighbor cells are determined, step S10 may be performed.
Step S10: determining the specific high-expression genes of the central cell and the specific high-expression genes of the neighbor cells, wherein the central cell represents the cell type whose gene expression regulation mechanism is to be studied, and the neighbor cells represent cell types that may affect the gene expression of the central cell.
In this embodiment, the specific expression matrix of the central cell and the specific expression matrix of the neighbor cells can be determined according to the clustering result of the single-cell transcriptome data. Then, the specific high expression genes of the central cells and the specific high expression genes of the neighbor cells can be determined according to the specific expression matrix of the central cells and the specific expression matrix of the neighbor cells and preset screening conditions.
Illustratively, a cell type-specific expression matrix can be obtained according to the clustering result of the single-cell transcriptome data, and the expression ratio of a certain gene (e.g., a gene of interest or other genes) in a specified type of cell (i.e., the cell proportion of the specified type of cell in which the gene expression exceeds a set threshold) can be calculated.
For example, the preset screening conditions may be: for the two types of cells being compared (e.g., the central cell and a neighbor cell of the central cell), the larger of the two expression ratios of the gene needs to exceed a certain threshold (e.g., 0.1); the difference between the expression ratios of the gene in the two types of cells needs to exceed a certain threshold (e.g., 0.1); and the difference between the mean expression values of the gene in the two types of cells needs to exceed a certain threshold (e.g., 0.25). Of course, such screening conditions are merely exemplary and should not be construed as limiting the present application; they may be set based on a combination of factors such as actual requirements, types of genes, and types of cells.
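The three example conditions can be sketched for a single gene as follows. This is an illustrative sketch only: the function name `passes_screening`, the expression threshold of 0 used to define "expressed", and the use of absolute differences are assumptions, and the direction of the difference would still need to be checked when assigning the gene to a cell type.

```python
def passes_screening(expr_a, expr_b, expr_threshold=0.0,
                     min_ratio=0.1, min_ratio_diff=0.1, min_mean_diff=0.25):
    """Check one gene against the example screening conditions.

    expr_a / expr_b: per-cell expression values of the gene in the two cell types.
    """
    # Expression ratio: fraction of cells in which expression exceeds the set threshold.
    ratio_a = sum(v > expr_threshold for v in expr_a) / len(expr_a)
    ratio_b = sum(v > expr_threshold for v in expr_b) / len(expr_b)
    mean_a = sum(expr_a) / len(expr_a)
    mean_b = sum(expr_b) / len(expr_b)
    return (max(ratio_a, ratio_b) > min_ratio
            and abs(ratio_a - ratio_b) > min_ratio_diff
            and abs(mean_a - mean_b) > min_mean_diff)


# Hypothetical expression values of one gene in central and neighbor cells.
central = [2.0, 1.5, 0.0, 3.0]
neighbor = [0.0, 0.0, 0.5, 0.0]
selected = passes_screening(central, neighbor)
unchanged = passes_screening([1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0])
```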
Illustratively, genes in a cell type-specific expression matrix may be screened according to the screening conditions. Then, the expression matrix corresponding to the screened genes can be normalized (for example, by using the LogNormalize algorithm), and a t-test can be performed to examine the reliability of the differential expression of each gene. For example, if the p-value of the t-test is less than 0.05, it can be determined that the gene is highly expressed in the cell type. Thus, it can be determined that the gene is a specific highly expressed gene of the cell type.
In this way, the specific high-expression gene of the central cell and the specific high-expression gene of the neighbor cell can be accurately determined.
Of course, the method for determining the specific highly expressed genes of the central cell and of the neighbor cells is not limited to this method and may be implemented in other ways. For example, the fold-change method analyzes differences in gene expression level by means of the expression fold: the ratio of the expression levels of a gene under two conditions is calculated, a threshold for the ratio is set, and genes whose ratio exceeds the threshold are judged to be differentially expressed. In addition, statistical methods such as the Wilcoxon rank-sum test and SAM can be used, and an appropriate method can be selected according to actual needs to determine the specific highly expressed genes of the central cell and of the neighbor cells, which is not limited herein.
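As an illustration of the fold-change alternative, the following sketch compares mean expression levels under two conditions. The ratio threshold of 2, the pseudocount of 1 (added to avoid division by zero) and all names are assumptions chosen for this example, not values given in the application.

```python
def fold_change_genes(mean_expr_a, mean_expr_b, ratio_threshold=2.0, pseudocount=1.0):
    """Return genes whose mean expression ratio (condition A over B) exceeds the threshold."""
    selected = []
    for gene, a in mean_expr_a.items():
        b = mean_expr_b.get(gene, 0.0)
        ratio = (a + pseudocount) / (b + pseudocount)
        if ratio > ratio_threshold:
            selected.append(gene)
    return sorted(selected)


# Hypothetical mean expression values per gene.
central_means = {"G1": 9.0, "G2": 1.0, "G3": 3.0}
neighbor_means = {"G1": 1.0, "G2": 1.0, "G3": 0.0}
high_in_central = fold_change_genes(central_means, neighbor_means)
```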
After the specific high expression genes of the central cell and the specific high expression genes of the neighbor cells are determined, step S20 may be performed.
Step S20: determining a first sub-network between the central cell and the neighbor cells according to the specific high-expression genes of the central cell, the specific high-expression genes of the neighbor cells, and the pairing information between ligands and receptors.
In this embodiment, a first relationship list including pairing information between the ligand and the receptor may be obtained, and the high-expression receptor may be determined from the specific high-expression gene of the central cell and the high-expression ligand may be determined from the specific high-expression gene of the neighbor cell according to the first relationship list, so as to establish a first subnetwork between the central cell and the neighbor cell according to the high-expression receptor and the high-expression ligand.
Illustratively, pairing information about ligands and receptors can be collected from databases such as DLRP, IUPHAR, HPMR, HPRD, and STRING, and the collected pairing information (for example, 2557 pairs) is collated to obtain a first relationship list containing the interaction relationships between the 2557 ligand-receptor pairs, which can be expressed as: $E_{LR} = \{(\mathrm{Ligand}_i, \mathrm{Receptor}_i)\}$.
It should be noted that the 2557 ligand-receptor pairs are only exemplary. The collected pairing information may differ depending on the database, and the number of ligand-receptor pairs may change as the pairing information in the database is updated, which is not limited herein. In addition, the first relationship list may be one that has already been stored, which is not limited herein.
According to the first relationship list $E_{LR}$, the highly expressed ligands can be determined from the specific highly expressed genes of the neighbor cells, and the highly expressed receptors can be determined from the specific highly expressed genes of the central cell. A first sub-network between the central cell and the neighbor cells can thus be established, consisting of the ligand-receptor pairs in $E_{LR}$ whose ligand is highly expressed in the neighbor cells and whose receptor is highly expressed in the central cell.
By determining a first relationship list containing the pairing information of ligands and receptors, determining the highly expressed receptors from the specific highly expressed genes of the central cell, and determining the highly expressed ligands from the specific highly expressed genes of the neighbor cells, a first sub-network between the central cell and the neighbor cells is established, which can accurately reflect the signal network formed by the pairing of the ligands of the neighbor cells with the receptors of the central cell.
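The construction of the first sub-network amounts to filtering the ligand-receptor list by two gene sets. A minimal sketch follows, assuming the relationship list is held as a set of (ligand, receptor) tuples; the function name and the placeholder identifiers (L1, R1, and so on) are hypothetical and do not refer to real genes.

```python
def build_ligand_receptor_subnetwork(lr_pairs, neighbor_high_genes, central_high_genes):
    """Keep pairs whose ligand is highly expressed in the neighbor cells
    and whose receptor is highly expressed in the central cell."""
    return {(ligand, receptor) for ligand, receptor in lr_pairs
            if ligand in neighbor_high_genes and receptor in central_high_genes}


# Hypothetical first relationship list and gene sets.
E_LR = {("L1", "R1"), ("L2", "R1"), ("L3", "R2"), ("L4", "R3")}
neighbor_high = {"L1", "L3", "L4", "G9"}
central_high = {"R1", "R2", "T5"}
first_subnetwork = build_ligand_receptor_subnetwork(E_LR, neighbor_high, central_high)
```

With several neighbor cell types, the same function would be called once per neighbor type, using that type's highly expressed genes.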
After the specific high expression genes of the central cell and the specific high expression genes of the neighbor cells are determined, step S30 may be performed.
Step S30: determining a second sub-network of the central cell according to the specific high-expression genes of the central cell and the interaction information between transcription factors and target genes.
In this embodiment, a second relationship list containing information on the interaction between the transcription factor and the target gene may be acquired. And a first gene set comprising all target genes in the central cell, a second gene set comprising target genes highly expressed in the central cell, and a third gene set comprising target genes corresponding to the specified transcription factors can be determined. Then, a fourth gene set containing a transcription factor which is significantly activated is determined according to the first gene set, the second gene set and the third gene set, so that a second sub-network of the central cell is established according to the second relation list, the second gene set and the fourth gene set.
Illustratively, the interaction information between transcription factors and target genes can be collected from databases such as TRED and KEGG, and the collected interaction information (e.g., 8869 pairs) is collated to obtain a second relationship list containing the interaction relationships between the 8869 transcription factor-target gene pairs, which can be expressed as: $E_{TT} = \{(TF_i, TG_i)\}$.
It should be noted that the 8869 transcription factor-target gene pairs are merely exemplary and are not limiting. In addition, the second relationship list may be one that has already been stored, which is not limited herein.
According to the second relationship list $E_{TT}$, a first gene set comprising all target genes in the central cell can be determined, denoted $TG_{all}$; a second gene set comprising the target genes highly expressed in the central cell (i.e., the set of all highly expressed target genes in the central cell) can be determined, denoted $TG_{up}$; and a third gene set comprising the target genes corresponding to a specified transcription factor $TF_i$ can be determined.
Then, the activation of the transcription factor can be verified based on the first gene set $TG_{all}$, the second gene set $TG_{up}$ and the third gene set, so as to determine a fourth gene set comprising the significantly activated transcription factors. Illustratively, Fisher's exact test can be used to verify the activation of the transcription factor. Specifically, the activation probability of the transcription factor is calculated as follows:
$$P = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{a+b+c+d}{a+c}}$$
where $\binom{\cdot}{\cdot}$ denotes a binomial coefficient; $a$ is the number of highly expressed target genes regulated by the specified transcription factor in the central cell (the genes common to $TG_{up}$ and the third gene set); $b = |TG_{up}| - a$ is the number of highly expressed target genes regulated by non-specified transcription factors in the central cell; $c$ is the number of non-highly expressed target genes regulated by the specified transcription factor in the central cell (the size of the third gene set minus $a$); and $d = |TG_{all}| - a - b - c$ is the number of target genes in the central cell that are neither highly expressed nor regulated by the specified transcription factor.
Illustratively, a transcription factor can be determined to be a significantly activated transcription factor when its activation probability $P$ is less than 0.05. From this, a fourth gene set comprising all significantly activated transcription factors in the central cell can be determined.
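The activation test can be sketched in Python with the standard library. The sketch assumes that the activation probability is the hypergeometric probability of the observed 2 × 2 table with cells a, b, c, d, i.e. C(a+b, a) · C(c+d, c) / C(a+b+c+d, a+c), and that the highly expressed target genes are a subset of all target genes; the function names and gene identifiers are hypothetical.

```python
from math import comb


def fisher_point_probability(a, b, c, d):
    """Hypergeometric probability of a 2x2 contingency table with cells a, b, c, d."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)


def tf_activation_probability(tg_all, tg_up, tf_targets):
    """Activation probability of one transcription factor from three gene sets."""
    targets = tf_targets & tg_all      # third gene set, restricted to the central cell
    a = len(tg_up & targets)           # highly expressed and regulated by the TF
    b = len(tg_up) - a                 # highly expressed, regulated by other TFs
    c = len(targets) - a               # regulated by the TF, not highly expressed
    d = len(tg_all) - a - b - c        # neither
    return fisher_point_probability(a, b, c, d)


# Hypothetical gene sets: ten target genes, three of them highly expressed.
tg_all = {f"G{i}" for i in range(1, 11)}
tg_up = {"G1", "G2", "G3"}
p_enriched = tf_activation_probability(tg_all, tg_up, {"G1", "G2", "G3"})
p_weak = tf_activation_probability(tg_all, tg_up, {"G1", "G4"})
```

Here `p_enriched` is 1/120, below the 0.05 cutoff, while `p_weak` is 21/45. Note that a conventional Fisher's exact test sums the probabilities of all tables at least as extreme as the observed one (as `scipy.stats.fisher_exact` does), so this single-table probability is a simplification.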
It should be noted that the manner of determining the fourth gene set including the significantly activated transcription factor may also be another manner, for example, the manner of verifying the activation state of the transcription factor by using the chi-square test to determine the fourth gene set including all the significantly activated transcription factors in the central cell, and therefore, the manner of determining the fourth gene set should not be considered as a limitation of the present application.
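After the fourth gene set is determined, a second sub-network of the central cell can be established according to the second relationship list $E_{TT}$, the second gene set $TG_{up}$ and the fourth gene set; that is, the second sub-network consists of the transcription factor-target gene pairs in $E_{TT}$ whose transcription factor belongs to the fourth gene set and whose target gene belongs to $TG_{up}$.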
By determining a second relationship list containing the interaction information of transcription factors and target genes, determining a fourth gene set containing the significantly activated transcription factors according to a first gene set containing all target genes in the central cell, a second gene set containing the target genes highly expressed in the central cell and a third gene set containing the target genes corresponding to a specified transcription factor, and combining the second relationship list, the second gene set and the fourth gene set, a second sub-network of the central cell can be established, so as to comprehensively and accurately reflect the signal network between transcription factors and target genes in the central cell.
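A brief sketch of how the third gene sets and the second sub-network could be derived from the transcription factor-target gene list. It assumes the list is a set of (TF, target) tuples and that the set of significantly activated transcription factors has already been obtained; all names and identifiers are hypothetical.

```python
from collections import defaultdict


def targets_by_tf(tf_tg_pairs):
    """Group the relationship list so that each TF maps to its target genes (third gene sets)."""
    grouped = defaultdict(set)
    for tf, tg in tf_tg_pairs:
        grouped[tf].add(tg)
    return dict(grouped)


def build_tf_target_subnetwork(tf_tg_pairs, activated_tfs, tg_up):
    """Keep pairs whose TF is significantly activated and whose target gene is highly expressed."""
    return {(tf, tg) for tf, tg in tf_tg_pairs
            if tf in activated_tfs and tg in tg_up}


# Hypothetical second relationship list and gene sets.
E_TT = {("T1", "G1"), ("T1", "G2"), ("T2", "G3"), ("T3", "G1")}
third_sets = targets_by_tf(E_TT)
second_subnetwork = build_tf_target_subnetwork(E_TT, {"T1", "T2"}, {"G1", "G3"})
```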
After the specific high expression genes of the central cell and the specific high expression genes of the neighbor cells are determined, step S40 may be performed.
Step S40: determining a third sub-network of the central cell according to the specific high-expression genes of the central cell and the interaction information between receptors and transcription factors.
In this embodiment, a third relationship list including information on interaction between receptors and transcription factors may be obtained, a fifth gene set including all transcription factors in the central cell, a sixth gene set including transcription factors activated in the central cell, a seventh gene set including transcription factors corresponding to the designated receptors may be determined, and an eighth gene set including significantly activated receptors may be determined according to the fifth gene set, the sixth gene set, and the seventh gene set. Then, a third subnetwork of central cells is established according to the third relational list, the sixth gene set and the eighth gene set.
Illustratively, information about the interaction between receptors and transcription factors can be collected from the STRING database, and the collected information is screened to retain, for each receptor, the downstream transcription factors at the shortest distance (for example, 39141 pairs), so as to obtain a third relationship list containing the interaction relationships between the 39141 receptor-transcription factor pairs, which can be written as: $E_{RT} = \{(\mathrm{Receptor}_i, TF_i)\}$.
It should be noted that the 39141 receptor-transcription factor pairs are merely exemplary and are not limiting. In addition, the third relationship list may be one that has already been stored, which is not limited herein.
According to the third relationship list $E_{RT}$, a fifth gene set comprising all transcription factors in the central cell can be determined, denoted $TF_{all}$; a sixth gene set comprising the transcription factors activated in the central cell (i.e., all significantly activated transcription factors in the central cell) can be determined, denoted $TF_A$; and a seventh gene set comprising the transcription factors corresponding to a specified receptor $R_i$ can be determined. The sixth gene set $TF_A$ is the fourth gene set determined in step S30.
Then, the activation of the receptor can be verified based on the fifth gene set $TF_{all}$, the sixth gene set $TF_A$ (i.e., the fourth gene set) and the seventh gene set, so as to determine an eighth gene set comprising the significantly activated receptors.
Illustratively, Fisher's exact test can also be used to verify receptor activation. Specifically, the receptor activation probability is calculated as follows:
$$P = \frac{\binom{x+y}{x}\binom{m+n}{m}}{\binom{x+y+m+n}{x+m}}$$
where $x$ is the number of activated transcription factors regulated by the specified receptor in the central cell (the transcription factors common to $TF_A$ and the seventh gene set); $y = |TF_A| - x$ is the number of activated transcription factors regulated by non-specified receptors in the central cell; $m$ is the number of non-activated transcription factors regulated by the specified receptor in the central cell (the size of the seventh gene set minus $x$); and $n = |TF_{all}| - x - y - m$ is the number of transcription factors in the central cell that are neither activated nor regulated by the specified receptor.
Illustratively, a receptor can be determined to be a significantly activated receptor when its activation probability $P$ is less than 0.05. From this, an eighth gene set comprising all significantly activated receptors in the central cell can be determined.
It should be noted that the activation status of the receptor can also be verified in other ways, such as chi-square test, to determine the eighth gene set comprising all significantly activated receptors in the central cell, which is not limited herein.
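After the eighth gene set is determined, a third sub-network of the central cell can be established according to the third relationship list $E_{RT}$, the sixth gene set $TF_A$ (i.e., the fourth gene set) and the eighth gene set; that is, the third sub-network consists of the receptor-transcription factor pairs in $E_{RT}$ whose receptor belongs to the eighth gene set and whose transcription factor belongs to the sixth gene set.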
By determining a third relation list containing interaction information of the receptors and the transcription factors, and determining an eighth gene set containing significantly activated receptors according to a fifth gene set containing all the transcription factors in the central cells, a sixth gene set containing the transcription factors activated in the central cells and a seventh gene set containing the transcription factors corresponding to the specified receptors, a third subnetwork of the central cells can be established so as to comprehensively and accurately reflect a signal network between the receptors and the transcription factors in the central cells.
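Step S40 as a whole can be sketched end to end as follows. The sketch assumes the same single-table hypergeometric probability as the transcription factor test, a 0.05 cutoff, and that the activated transcription factors are a subset of all transcription factors in the central cell; the function names and identifiers are hypothetical.

```python
from collections import defaultdict
from math import comb


def screen_receptors(rt_pairs, tf_all, tf_active, alpha=0.05):
    """Return receptors whose downstream TFs are enriched among the activated TFs."""
    downstream = defaultdict(set)        # receptor -> its TFs (the seventh gene sets)
    for receptor, tf in rt_pairs:
        if tf in tf_all:
            downstream[receptor].add(tf)

    activated = set()
    for receptor, tfs in downstream.items():
        x = len(tf_active & tfs)         # activated TFs regulated by this receptor
        y = len(tf_active) - x           # activated TFs regulated by other receptors
        m = len(tfs) - x                 # non-activated TFs regulated by this receptor
        n = len(tf_all) - x - y - m      # neither
        p = comb(x + y, x) * comb(m + n, m) / comb(x + y + m + n, x + m)
        if p < alpha:
            activated.add(receptor)
    return activated


def build_receptor_tf_subnetwork(rt_pairs, activated_receptors, tf_active):
    """Keep pairs whose receptor is significantly activated and whose TF is activated."""
    return {(r, tf) for r, tf in rt_pairs
            if r in activated_receptors and tf in tf_active}


# Hypothetical third relationship list and gene sets.
tf_all = {f"T{i}" for i in range(1, 11)}
tf_active = {"T1", "T2", "T3"}
E_RT = {("R1", "T1"), ("R1", "T2"), ("R1", "T3"), ("R2", "T4"), ("R2", "T1")}
activated_receptors = screen_receptors(E_RT, tf_all, tf_active)
third_subnetwork = build_receptor_tf_subnetwork(E_RT, activated_receptors, tf_active)
```

In this toy example R1 reaches a probability of 1/120 and is retained, whereas R2 (21/45) is not.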
It should be noted that there is no strict order between step S20 and steps S30 and S40, but step S30 must be executed before step S40. That is, step S20 may be executed first, followed by step S30 and step S40; or step S30 may be executed first, then step S20, and then step S40; or step S30 may be executed first, then step S40, and then step S20. Step S20 may also be executed simultaneously with step S30 or with step S40, provided that step S40 is executed after step S30. The order described here should therefore not be considered as limiting the present application.
After establishing the first, second and third sub-networks, step S50 may be performed.
Step S50: determining an intercellular multilayer signal network of the central cell and the neighbor cells according to the first sub-network, the second sub-network and the third sub-network, so as to reveal how the neighbor cells regulate the gene expression of the central cell.
In this embodiment, the first sub-network, the second sub-network and the third sub-network may be updated according to the upstream and downstream relationship among the first sub-network, the second sub-network and the third sub-network. And then integrating the updated first sub-network, the updated second sub-network and the updated third sub-network to determine an intercellular multilayer signal network of the central cell and the neighbor cells.
Illustratively, the first sub-network may reveal the intercellular signaling pathway between ligands and receptors, the second sub-network may reveal the intracellular signaling pathway between transcription factors and target genes, and the third sub-network may reveal the intracellular signaling pathway between receptors and transcription factors. When a gene expression regulation mechanism is explored, the ligand of the neighbor cell binds to the receptor of the central cell, the receptor interacts with the transcription factor, and the transcription factor acts on the target gene, so the upstream and downstream relationship is determined as: first sub-network, then third sub-network, then second sub-network. Thus, the first, second and third sub-networks may be updated according to the determined upstream and downstream relationship.
For example, the update process of the first sub-network, the second sub-network, and the third sub-network may specifically be:
The set of receptors that are both significantly activated and highly expressed in the central cell is determined. The first sub-network (ligand and receptor) is then updated to retain only the ligand-receptor pairs whose receptor belongs to this set, and the third sub-network (receptor and transcription factor), denoted $N_{RT}$, is updated to retain only the receptor-transcription factor pairs whose receptor belongs to this set. The transcription factors in $N_{RT}$ are denoted $TF$; the second sub-network (transcription factor and target gene) can then be updated to $N_{TT} = E_{TT} \cap (TF \times TG_{up})$, where $TG_{up}$ represents the target genes highly expressed in the central cell, i.e., the second gene set.
After updating the first, second, and third subnetworks, the updated subnetworks may be integrated to determine an intercellular multi-layered signal network of the central cell and the neighboring cells.
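The update and integration can be illustrated with set operations. This is a sketch under the assumption that each sub-network is held as a set of edge tuples and that the highly expressed receptors and the significantly activated receptors are available as sets; the function name, dictionary keys and identifiers are hypothetical.

```python
def integrate_multilayer_network(n_lr, n_rt, e_tt, receptor_high, receptor_activated, tg_up):
    """Update the three sub-networks along ligand -> receptor -> TF -> target gene and combine them."""
    # Receptors that are both highly expressed and significantly activated in the central cell.
    receptors = receptor_high & receptor_activated
    n_lr_upd = {(lig, rec) for lig, rec in n_lr if rec in receptors}
    n_rt_upd = {(rec, tf) for rec, tf in n_rt if rec in receptors}
    # Transcription factors that remain in the updated receptor-TF sub-network.
    tfs = {tf for _, tf in n_rt_upd}
    # Corresponds to E_TT intersected with (TF x TG_up).
    n_tt_upd = {(tf, tg) for tf, tg in e_tt if tf in tfs and tg in tg_up}
    return {"ligand_receptor": n_lr_upd, "receptor_tf": n_rt_upd, "tf_target": n_tt_upd}


# Hypothetical sub-networks and gene sets.
network = integrate_multilayer_network(
    n_lr={("L1", "R1"), ("L3", "R2")},
    n_rt={("R1", "T1"), ("R1", "T2"), ("R5", "T3")},
    e_tt={("T1", "G1"), ("T1", "G2"), ("T2", "G3"), ("T3", "G1")},
    receptor_high={"R1", "R2"},
    receptor_activated={"R1", "R5"},
    tg_up={"G1", "G3"},
)
```

Each layer of the returned dictionary is an edge list, so the result can be handed directly to a graph visualization tool.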
In addition, in order to facilitate observation and research, the intercellular multilayer signal network of the central cell and the neighbor cells can be visualized and visually presented.
Each sub-network is updated according to the upstream and downstream relationship among the first, second and third sub-networks, and the updated sub-networks are integrated to determine the intercellular multilayer signal network of the central cell and the neighbor cells. The resulting network reflects the signaling relationship between the central cell and the neighbor cells systematically and comprehensively, so that the regulation pathways and mechanisms of gene expression in the central cell can be studied in depth.
It should be noted that this embodiment uses a central cell and neighbor cells as an example, and the number of neighbor cells (which provide ligands) is not limited. In practice, a study of a gene expression regulation mechanism usually involves many neighbor cells, and the network can reflect which cells regulate gene expression in the central cell and by what mechanism. In addition, the interactions among ligands, receptors, transcription factors and target genes are expressed as a multilayer, multi-path network structure, so that the gene regulation mechanism can be displayed comprehensively and systematically, the influence of cell specificity on the gene expression regulation mechanism can be accurately reflected, and a new tool is provided for analyzing how the cell microenvironment mediates the regulation of a gene of interest.
Referring to fig. 3 and 4, fig. 3 is a violin plot of gene expression, and fig. 4 is a schematic diagram of the cell-specific expression of the gene of interest ACE2 and of the multilayer signal network between a central cell and neighbor cells.
As can be seen from fig. 3, the method for determining a gene expression regulation mechanism based on single-cell transcriptome data provided in the embodiment of the present application can identify specific cell types, which provides a new perspective for studying the gene expression regulation mechanism. The cell types can be annotated by marker genes (AGER, SFTPC, etc. in fig. 3).
Part A of FIG. 4 shows that the present method can map gene expression levels to specific cell types, excluding interference from differences in overall expression levels of a variety of cells. Compared with the traditional analysis method, the method can capture specific biological effects at the resolution of a single cell so as to explore the influence of cell heterogeneity on gene expression.
Part B of fig. 4 shows the mechanism by which the microenvironment signal network centered on AT2 cells regulates the ACE2 expression level. The intercellular multilayer signal network shows not only the influence of multiple signaling pathways on the ACE2 expression level, but also the specific effect of each cell type on it. In the figure, SARS-CoV-2 denotes the novel coronavirus; the labels "Transcription factor", "Ligand" and "Receptor" denote a transcription factor, a ligand and a receptor, respectively; the nucleotide chain denotes a gene; Mast cells, AT1 cells, etc. are neighbor cells, and AT2 cells are the central cells.
Referring to fig. 5, based on the same inventive concept, the embodiment of the present application further provides an apparatus 10 for determining a gene expression regulation mechanism based on single-cell transcriptome data, comprising:
a high-expression gene screening unit 11 for determining a specific high-expression gene of a central cell and a specific high-expression gene of a neighbor cell, wherein the central cell represents the cell type whose gene expression regulation mechanism is to be studied, and the neighbor cell represents a cell type that may influence the gene expression of the central cell;
a first sub-network constructing unit 12, configured to determine a first sub-network between the central cell and the neighbor cell according to the specific high-expression genes of the central cell and the specific high-expression genes of the neighbor cell, and pairing information between a ligand and a receptor;
a second subnetwork constructing unit 13 for determining a second subnetwork of the central cell according to the specific highly expressed gene of the central cell and the interaction information between the transcription factor and the target gene;
a third sub-network constructing unit 14 for determining a third sub-network of the central cell according to the specific highly expressed gene of the central cell and the interaction information between the receptor and the transcription factor;
and a multilayer network model constructing unit 15, configured to determine an intercellular multilayer signal network of the central cell and the neighbor cells according to the first sub-network, the second sub-network, and the third sub-network, so as to reveal the regulation of the neighbor cells on the gene expression of the central cell.
In this embodiment, the apparatus 10 for determining a gene expression regulation mechanism based on single-cell transcriptome data further includes a data processing and cell determining unit. Before the high-expression gene screening unit 11 determines the specific high-expression gene of the central cell and the specific high-expression gene of the neighbor cell, this unit is configured to: preprocess the single-cell transcriptome data; perform dimensionality reduction, clustering and cell type identification on the preprocessed single-cell transcriptome data to determine a data set comprising multiple cell types; and determine the central cell and the neighbor cells from the data set.
In this embodiment, the data processing and cell determining unit is further configured to determine a single-cell expression matrix from the single-cell transcriptome data, in which each row represents a gene and each column represents a cell; and to filter the single-cell expression matrix and normalize the filtered matrix, thereby preprocessing the single-cell transcriptome data.
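The preprocessing step can be sketched as follows. This is an illustrative sketch only: the thresholds `min_genes` and `min_cells`, the library-size scaling with a log transform, and the function name `preprocess` are assumptions, since this passage does not specify the filtering criteria or the normalization formula.

```python
import math


def preprocess(matrix, genes, min_genes=1, min_cells=1, scale=10_000):
    """Filter and normalize a count matrix (rows = genes, columns = cells)."""
    n_cells = len(matrix[0])
    # Keep cells in which at least `min_genes` genes are detected.
    keep_cells = [j for j in range(n_cells)
                  if sum(1 for row in matrix if row[j] > 0) >= min_genes]
    # Keep genes detected in at least `min_cells` of the retained cells.
    keep_genes = [i for i, row in enumerate(matrix)
                  if sum(1 for j in keep_cells if row[j] > 0) >= min_cells]
    # Total counts per retained cell over the retained genes.
    totals = {j: sum(matrix[i][j] for i in keep_genes) for j in keep_cells}
    # Assumed normalization: scale to a common library size, then log1p.
    normalized = [
        [math.log1p(matrix[i][j] / totals[j] * scale) if totals[j] else 0.0
         for j in keep_cells]
        for i in keep_genes
    ]
    return [genes[i] for i in keep_genes], keep_cells, normalized


# Toy counts with hypothetical gene names: 3 genes x 3 cells.
genes = ["G1", "G2", "G3"]
counts = [
    [0, 2, 0],
    [5, 8, 0],
    [0, 0, 0],
]
kept_genes, kept_cells, norm = preprocess(counts, genes)
```

Here the third cell and the gene G3 are removed because they contain no counts, and the remaining values are normalized per cell.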
In this embodiment, the high-expression gene screening unit 11 is further configured to determine a specific expression matrix of the central cell and a specific expression matrix of the neighbor cells according to the clustering result of the single-cell transcriptome data; and to determine the specific high-expression genes of the central cell and of the neighbor cells from these two matrices according to preset screening conditions.
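A simple way to picture this screening is shown below. The screening conditions used here (a minimum difference in mean normalized expression against all other cells, and a minimum fraction of expressing cells) are assumptions chosen for illustration; this passage only states that preset screening conditions are applied. The function name and the toy values are hypothetical.

```python
def specific_high_genes(expr, genes, labels, cell_type,
                        min_diff=0.5, min_frac=0.25):
    """Select genes specifically highly expressed in `cell_type`.

    expr: normalized matrix, rows = genes, columns = cells
    labels: cluster or cell-type label of each column
    """
    inside = [j for j, lab in enumerate(labels) if lab == cell_type]
    outside = [j for j, lab in enumerate(labels) if lab != cell_type]
    if not inside or not outside:
        raise ValueError("need cells both inside and outside the cell type")
    selected = []
    for gene, row in zip(genes, expr):
        mean_in = sum(row[j] for j in inside) / len(inside)
        mean_out = sum(row[j] for j in outside) / len(outside)
        frac_in = sum(1 for j in inside if row[j] > 0) / len(inside)
        # Assumed screening conditions; real thresholds are study-specific.
        if mean_in - mean_out >= min_diff and frac_in >= min_frac:
            selected.append(gene)
    return selected


# Toy data: two "center" cells followed by two "neighbor" cells.
genes = ["G1", "G2", "G3"]
labels = ["center", "center", "neighbor", "neighbor"]
expr = [
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 3.0, 0.5, 0.5],
    [0.0, 0.0, 2.0, 2.5],
]
center_high = specific_high_genes(expr, genes, labels, "center")
neighbor_high = specific_high_genes(expr, genes, labels, "neighbor")
```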
In this embodiment, the first sub-network constructing unit 12 is further configured to obtain a first relationship list including pairing information between the ligand and the receptor; according to the first relation list, determining a high-expression receptor from the specific high-expression genes of the central cell, and determining a high-expression ligand from the specific high-expression genes of the neighbor cells; establishing a first subnetwork between the central cell and the neighbor cells based on the highly expressed receptor and the highly expressed ligand.
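The construction of the first sub-network amounts to filtering a ligand-receptor pair list against the two sets of specific high-expression genes. The sketch below assumes the first relationship list is a list of (ligand, receptor) tuples and supports several neighbor cell types at once; the function name and all identifiers are hypothetical.

```python
def build_first_subnetwork(pairs, center_high, neighbor_high_by_type):
    """Return (neighbor_type, ligand, receptor) edges.

    An edge is kept when the ligand is highly expressed in the neighbor
    cell type and the receptor is highly expressed in the central cell.
    """
    edges = []
    for cell_type, high in sorted(neighbor_high_by_type.items()):
        for ligand, receptor in pairs:
            if ligand in high and receptor in center_high:
                edges.append((cell_type, ligand, receptor))
    return edges


# Hypothetical relationship list and high-expression gene sets.
pairs = [("LIG_A", "REC_X"), ("LIG_B", "REC_Y"), ("LIG_C", "REC_X")]
center_high = {"REC_X", "GENE_1"}
neighbor_high_by_type = {
    "neighbor_1": {"LIG_A", "LIG_B"},
    "neighbor_2": {"LIG_C"},
}
first_subnetwork = build_first_subnetwork(pairs, center_high, neighbor_high_by_type)
```

The pair (LIG_B, REC_Y) is dropped because REC_Y is not highly expressed in the central cell.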
In this embodiment, the second sub-network constructing unit 13 is further configured to obtain a second relationship list including information on interaction between the transcription factor and the target gene; determining a first gene set containing all target genes in the central cell, a second gene set containing high-expression target genes in the central cell and a third gene set containing target genes corresponding to the specified transcription factors; determining a fourth gene set comprising significantly activated transcription factors according to the first gene set, the second gene set and the third gene set; establishing a second subnetwork of the central cells according to the second relationship list, the second gene set, and the fourth gene set.
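One way to decide which transcription factors are significantly activated from the first, second and third gene sets is an over-representation test. The sketch below assumes a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test) and a significance level of 0.05; this passage does not itself name the test or the threshold, and the function names and identifiers are hypothetical.

```python
from math import comb


def enrichment_p(background, high, tf_targets):
    """P(overlap >= observed) under the hypergeometric distribution.

    background: all target genes in the central cell   (first gene set)
    high:       highly expressed target genes          (second gene set)
    tf_targets: target genes of one transcription factor (third gene set)
    """
    n_bg = len(background)
    n_high = len(high & background)
    n_tf = len(tf_targets & background)
    k_obs = len(high & tf_targets & background)
    tail = sum(comb(n_high, i) * comb(n_bg - n_high, n_tf - i)
               for i in range(k_obs, min(n_high, n_tf) + 1))
    return tail / comb(n_bg, n_tf)


def build_second_subnetwork(tf_to_targets, background, high, alpha=0.05):
    """Return activated TFs (fourth gene set) and TF -> target edges."""
    activated = sorted(tf for tf, targets in tf_to_targets.items()
                       if enrichment_p(background, high, targets) < alpha)
    edges = [(tf, gene) for tf in activated
             for gene in sorted(tf_to_targets[tf] & high)]
    return activated, edges


# Toy input: 10 background genes, 3 of them highly expressed.
background = {f"G{i}" for i in range(1, 11)}
high = {"G1", "G2", "G3"}
tf_to_targets = {"TF_A": {"G1", "G2", "G3"}, "TF_B": {"G4", "G5", "G6"}}
activated, second_subnetwork = build_second_subnetwork(tf_to_targets, background, high)
```

In this toy case all three targets of TF_A are highly expressed, giving a p-value of 1/120, while none of the targets of TF_B are, so only TF_A is retained.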
In this embodiment, the third subnetwork constructing unit 14 is further configured to obtain a third relation list including information about interaction between the receptor and the transcription factor; determining a fifth gene set containing all transcription factors in the central cells, a sixth gene set containing activated transcription factors in the central cells and a seventh gene set containing the transcription factors corresponding to the specified receptors; determining an eighth gene set comprising significantly activated receptors according to the fifth gene set, the sixth gene set and the seventh gene set; establishing a third subnetwork of the central cells according to the third relational list, the sixth gene set and the eighth gene set.
In this embodiment, the multi-layer network model constructing unit 15 is further configured to update the first sub-network, the second sub-network, and the third sub-network according to an upstream-downstream relationship among the first sub-network, the second sub-network, and the third sub-network; and integrating the updated first sub-network, the updated second sub-network and the updated third sub-network to determine the intercellular multilayer signal network of the central cell and the neighbor cells.
Referring to fig. 6, fig. 6 is a block diagram of an electronic device 20 according to an embodiment of the present disclosure.
In this embodiment, the electronic device 20 may be a terminal device, such as a personal computer, a notebook computer, etc., and is not limited herein. Of course, the electronic device 20 may also be a server, such as a network server, a cloud server, a server cluster, and the like, which is not limited herein.
Illustratively, the electronic device 20 may include: a communication module 22 connected to an external network, one or more processors 24 for executing program instructions, a bus 23, and a memory 21 in one of several forms, such as a magnetic disk, a ROM (Read-Only Memory), a RAM (Random Access Memory), or any combination thereof. The memory 21, the communication module 22 and the processor 24 are connected by the bus 23.
Illustratively, the memory 21 stores programs. The processor 24 may invoke these programs from the memory 21 and run them to perform the method for determining a gene expression regulation mechanism based on single-cell transcriptome data, thereby comprehensively and systematically revealing the regulation mechanism of gene expression.
Also, the present embodiments provide a storage medium storing one or more programs, which are executable by one or more processors to implement the method for determining a gene expression regulation mechanism based on single-cell transcriptome data as described in the present embodiments.
In summary, the embodiments of the present application provide a method for determining a gene expression regulation mechanism based on single-cell transcriptome data. The method determines the specific high-expression genes of a central cell and of a neighbor cell; constructs a first, a second and a third sub-network by combining these genes with pairing information between ligands and receptors, interaction information between transcription factors and target genes, and interaction information between receptors and transcription factors, respectively; and integrates the sub-networks to obtain an intercellular multilayer signal network of the central cell and the neighbor cell. The gene regulation mechanism can therefore be displayed more comprehensively and systematically, the influence of cell specificity on the gene expression regulation mechanism can be accurately reflected, and a new tool is provided for analyzing how the cell microenvironment mediates the regulation of a gene of interest.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (10)
1. A method for determining a gene expression control mechanism based on single cell transcriptome data, comprising:
determining a specific high-expression gene of a central cell and a specific high-expression gene of a neighbor cell, wherein the central cell represents a cell type of a regulation mechanism of gene expression to be researched, and the neighbor cell represents a cell type having influence possibility on the gene expression of the central cell;
determining a first sub-network between the central cell and the neighbor cell according to the specific high-expression genes of the central cell and the specific high-expression genes of the neighbor cell and pairing information between a ligand and a receptor;
determining a second sub-network of the central cell according to the specific high-expression genes of the central cell and the interaction information between the transcription factors and the target genes;
determining a third sub-network of the central cell according to the specific high-expression genes of the central cell and the interaction information between the receptor and the transcription factor;
and determining an intercellular multilayer signal network of the central cell and the neighbor cells according to the first sub-network, the second sub-network and the third sub-network, so as to reveal the regulation of the gene expression of the central cell by the neighbor cells.
2. The method for determining the gene expression regulatory mechanism based on single-cell transcriptome data of claim 1, wherein before said determining the specific highly expressed genes of the central cell and the specific highly expressed genes of the neighbor cells, said method further comprises:
determining an expression matrix of the single cell according to the single cell transcriptome data, wherein a row represents each gene and a column represents each cell;
filtering the expression matrix of the single cell, and carrying out normalization processing on the filtered expression matrix of the single cell so as to realize the preprocessing of the single-cell transcriptome data;
performing dimensionality reduction, clustering and cell type identification on the single-cell transcriptome data subjected to data preprocessing to determine a data set comprising multiple cell types;
determining the central cell and the neighbor cells from the data set.
3. The method for determining the gene expression regulation mechanism based on single-cell transcriptome data of claim 2, wherein said determining the specific high-expression gene of the central cell and the specific high-expression gene of the neighbor cell comprises:
determining a specific expression matrix of the central cell and a specific expression matrix of the neighbor cell according to the clustering result of the single-cell transcriptome data;
and determining the specific high-expression genes of the central cells and the specific high-expression genes of the neighbor cells according to the specific expression matrix of the central cells and the specific expression matrix of the neighbor cells and preset screening conditions.
4. The method for determining the gene expression control mechanism based on single-cell transcriptome data of claim 1, wherein said determining the first subnetwork between the central cell and the neighbor cell according to the specific high-expression gene of the central cell and the specific high-expression gene of the neighbor cell, and the pairing information between the ligand and the receptor comprises:
obtaining a first relationship list comprising pairing information between a ligand and a receptor;
according to the first relation list, determining a high-expression receptor from the specific high-expression genes of the central cell, and determining a high-expression ligand from the specific high-expression genes of the neighbor cells;
establishing a first subnetwork between the central cell and the neighbor cells based on the highly expressed receptor and the highly expressed ligand.
5. The method for determining the gene expression control mechanism based on single-cell transcriptome data of claim 1, wherein determining the second subnetwork of the central cell according to the specific highly expressed gene of the central cell and the interaction information between the transcription factor and the target gene comprises:
acquiring a second relation list containing interaction information between the transcription factor and the target gene;
determining a first gene set containing all target genes in the central cell, a second gene set containing high-expression target genes in the central cell and a third gene set containing target genes corresponding to the specified transcription factors;
determining a fourth gene set comprising significantly activated transcription factors according to the first gene set, the second gene set and the third gene set;
establishing a second subnetwork of the central cells according to the second relationship list, the second gene set, and the fourth gene set.
6. The method for determining the gene expression control mechanism based on single-cell transcriptome data of claim 1, wherein said determining the third subnetwork of said central cell according to the specific highly expressed gene of said central cell and the interaction information between the receptor and the transcription factor comprises:
obtaining a third relation list containing interaction information between the receptor and the transcription factor;
determining a fifth gene set containing all transcription factors in the central cells, a sixth gene set containing activated transcription factors in the central cells and a seventh gene set containing the transcription factors corresponding to the specified receptors;
determining an eighth gene set comprising significantly activated receptors according to the fifth gene set, the sixth gene set and the seventh gene set;
establishing a third subnetwork of the central cells according to the third relational list, the sixth gene set and the eighth gene set.
7. The method for determining the gene expression regulatory mechanism based on single-cell transcriptome data of any one of claims 1 to 6, wherein said determining the intercellular multilayer signal network of the central cell and the neighbor cells according to the first sub-network, the second sub-network and the third sub-network comprises:
updating the first, second, and third sub-networks according to an upstream-downstream relationship among the first, second, and third sub-networks;
and integrating the updated first sub-network, the updated second sub-network and the updated third sub-network to determine the intercellular multilayer signal network of the central cell and the neighbor cells.
8. An apparatus for determining a gene expression control mechanism based on single-cell transcriptome data, comprising:
a high-expression gene screening unit for determining a specific high-expression gene of a central cell and a specific high-expression gene of a neighbor cell, wherein the central cell represents the cell type whose gene expression regulation mechanism is to be studied, and the neighbor cell represents a cell type that may influence the gene expression of the central cell;
a first sub-network construction unit, configured to determine a first sub-network between the central cell and the neighbor cell according to the specific high-expression genes of the central cell and the neighbor cell, and pairing information between a ligand and a receptor;
a second sub-network construction unit, which is used for determining a second sub-network of the central cell according to the specific high-expression genes of the central cell and the interaction information between the transcription factors and the target genes;
a third subnetwork construction unit, configured to determine a third subnetwork of the central cell according to the specific highly expressed gene of the central cell and the interaction information between the receptor and the transcription factor;
and the multilayer network model construction unit is used for determining an intercellular multilayer signal network of the central cell and the neighbor cells according to the first sub-network, the second sub-network and the third sub-network so as to reveal the regulation and control of the neighbor cells on the gene expression of the central cell.
9. A storage medium storing one or more programs executable by one or more processors to implement the method for determining the gene expression regulatory mechanism based on single-cell transcriptome of any one of claims 1 to 7.
10. An electronic device comprising a memory for storing information including program instructions and a processor for controlling execution of the program instructions, characterized in that: the program instructions, when loaded and executed by a processor, implement the method for determining a gene expression regulation mechanism based on single-cell transcriptome data of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010464757.1A CN111613268B (en) | 2020-05-27 | 2020-05-27 | Method for determining gene expression regulation mechanism based on single cell transcriptome data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111613268A (en) | 2020-09-01 |
CN111613268B (en) | 2023-02-24 |
Family
ID=72203129
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010464757.1A Active CN111613268B (en) | 2020-05-27 | 2020-05-27 | Method for determining gene expression regulation mechanism based on single cell transcriptome data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111613268B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112466403B (en) * | 2020-12-31 | 2022-06-14 | 广州基迪奥生物科技有限公司 | Cell communication analysis method and system |
CN112820353B (en) * | 2021-01-22 | 2023-10-03 | 中山大学 | Method and system for analyzing cell fate conversion key transcription factors |
CN113178233B (en) * | 2021-04-27 | 2023-04-28 | 西安电子科技大学 | Large-scale single-cell transcriptome data efficient clustering method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106874704A (en) * | 2017-01-04 | 2017-06-20 | 湖南大学 | The sub- recognition methods of key regulatory in a kind of common regulated and control network of gene based on linear model |
CN109467596A (en) * | 2018-11-12 | 2019-03-15 | 湖北省农业科学院畜牧兽医研究所 | Application of the transcription factor SP 1 in regulation pig RTL1 gene expression |
CN109637588A (en) * | 2018-12-29 | 2019-04-16 | 北京百迈客生物科技有限公司 | A method of gene regulatory network is constructed based on full transcript profile high-flux sequence |
CN109726352A (en) * | 2018-12-12 | 2019-05-07 | 青岛大学 | A kind of construction method of the gene regulatory network based on Differential Equation Model |
CN109979538A (en) * | 2019-03-28 | 2019-07-05 | 广州基迪奥生物科技有限公司 | A kind of analysis method based on the unicellular transcript profile sequencing data of 10X |
Non-Patent Citations (1)
Title |
---|
Effect of Dynamic Interaction between microRNA and Transcription Factor on Gene Expression;Xiaoqiang Sun et al.;《Research Article》;20161110;第1-2页 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||