
CN111370055A - A method for establishing an intron retention prediction model and its prediction method - Google Patents


Info

Publication number
CN111370055A
CN111370055A
Authority
CN
China
Prior art keywords
intron
prediction model
intron retention
sequence
splice site
Prior art date
Legal status
Granted
Application number
CN202010146731.2A
Other languages
Chinese (zh)
Other versions
CN111370055B
Inventor
李洪东
郑剑涛
林翠香
Current Assignee
Central South University
Original Assignee
Central South University
Priority date
Filing date
Publication date
Application filed by Central South University
Priority to CN202010146731.2A
Publication of CN111370055A
Application granted
Publication of CN111370055B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00: ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation


Abstract

The invention discloses a method for establishing an intron retention prediction model, comprising: collecting simulated data and real data related to intron retention; defining the set of all independent introns in the genome as a standard template; generating, from the simulated data, the specified image data set of intron sequence read-distribution patterns and preprocessing it to obtain the processed data set; dividing the processed data set into a training set and a test set according to a set ratio; and training a neural network model on the training set to obtain the final neural network intron retention prediction model. The invention also discloses a prediction method that includes the method for establishing the intron retention prediction model. The invention can visualize and predict introns based on intron-retention read-distribution patterns, with high reliability and good accuracy.

Description

A method for establishing an intron retention prediction model and its prediction method

Technical Field

The present invention relates specifically to a method for establishing an intron retention prediction model and to a corresponding prediction method.

Background Art

Intron retention is a form of alternative splicing in which an intron in the precursor mRNA is not spliced out and remains in the mature mRNA. Intron retention was long regarded as a by-product of mis-splicing and received little attention. Many recent studies, however, have shown that intron retention is associated with gene expression regulation and with complex diseases such as Alzheimer's disease. With the development of high-throughput sequencing technology, many methods for detecting intron retention have been proposed, among which iREAD and IRFinder are the most prominent. iREAD detects intron retention by assuming that the reads over a retained intron are uniformly distributed and computing an entropy value; its filtering criteria are relatively strict. IRFinder detects intron retention by computing an IR-ratio, which indicates the proportion of transcripts in which the intron is present.

Although the above methods have been applied successfully in real settings, analyses based on sequence features are more or less limited by biases that intron retention may introduce, so these methods lack robustness. As a result, current methods are not highly reliable, which restricts the development of the related technologies.

Summary of the Invention

One objective of the present invention is to provide a method for establishing an intron retention prediction model with high reliability and good accuracy.

Another objective of the present invention is to provide a prediction method that includes the method for establishing the intron retention prediction model.

The method for establishing an intron retention prediction model provided by the present invention comprises the following steps:

S1. Collect simulated data and real data related to intron retention.

S2. Define the set of all independent introns in the genome as a standard template.

S3. From the simulated data obtained in step S1, generate the specified image data set of intron sequence read-distribution patterns, and preprocess it to obtain the processed data set.

S4. Divide the processed data set obtained in step S3 into a training set and a test set according to a set ratio.

S5. Train a neural network model with the training set obtained in step S4 to obtain the final neural network intron retention prediction model.

The method for establishing an intron retention prediction model further comprises the following steps:

S6. Using the neural network intron retention prediction model obtained in step S5, compute the model's evaluation parameters on the test set obtained in step S4.

S7. Generate the image test set of intron sequence read-distribution patterns from the real data obtained in step S1.

S8. Using the neural network intron retention prediction model obtained in step S5, predict intron retention on the test set obtained in step S7 to obtain the predicted intron retention set.

S9. For each intron in the predicted intron retention set obtained in step S8, extract the 5'-end sequence spanning W1 bases on the exon side and N1 bases on the intron side of the 5' splice-site coordinate, W1+N1 bases in total.

S10. For each intron in the predicted intron retention set obtained in step S8, extract the 3'-end sequence spanning W2 bases on the exon side and N2 bases on the intron side of the 3' splice-site coordinate, W2+N2 bases in total.

S11. From the W1+N1-base 5'-end sequences obtained in step S9 and the W2+N2-base 3'-end sequences obtained in step S10, compute splice-site strengths to obtain the average 5' splice-site strength and the average 3' splice-site strength.

S12. Evaluate the neural network intron retention prediction model established in step S5 according to the average 5' and 3' splice-site strengths obtained in step S11.

The collection of simulated and real data related to intron retention in step S1 is specifically as follows: use the BEER algorithm to generate a simulated data sequence file SIMU30 containing a known number of introns; SIMU30 has a sequencing depth of thirty million reads and a read length of 100 bases, and is set to generate 15,000 genes and 69,338 introns. The real data is a sequence file APP from the Tau and APP mouse model study of the Accelerating Medicines Partnership for Alzheimer's Disease, with a sequencing depth of one hundred million reads and a read length of 101 bases.

Defining the set of all independent introns in the genome as a standard template in step S2 specifically comprises the following steps:

A. From the release-75 annotation GTF file of the GRCm38 mouse genome, extract the set of all independent introns, Independent_intron; an independent intron is defined as an intron that does not overlap any homotypic exon.

B. Within the independent intron set Independent_intron obtained in step A, merge introns whose coordinate intervals overlap, gene by gene, to obtain the final independent intron set, intron cluster.

Extracting the set of all independent introns Independent_intron in step A specifically means merging all exons within a chromosome and then deleting all exons from the gene region, which leaves all the independent introns.
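The exon-merging and subtraction procedure of steps A and B can be sketched as follows (an illustrative Python sketch, not the patent's actual implementation; half-open [start, end) coordinates are assumed):

```python
def merge_intervals(intervals):
    """Merge overlapping (or touching) coordinate intervals and return them sorted."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:            # overlaps the previous block
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def independent_introns(gene_region, exons):
    """Subtract the merged exons from the gene region; the gaps are independent introns."""
    gene_start, gene_end = gene_region
    introns, cursor = [], gene_start
    for ex_start, ex_end in merge_intervals(exons):
        if ex_start > cursor:
            introns.append((cursor, ex_start))
        cursor = max(cursor, ex_end)
    if cursor < gene_end:
        introns.append((cursor, gene_end))
    return introns
```

The same `merge_intervals` helper also covers step B: merging introns whose coordinate intervals overlap within a gene.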

Obtaining the image data set of intron sequence read-distribution patterns from the simulated data of step S1, and preprocessing it to obtain the processed data set, as described in step S3, specifically uses the following steps:

a. Visualize each intron in the simulated data sequence file SIMU30 obtained in step S1 with IGV to obtain preliminary visualization images.

b. For each intron, save two sequence visualization images, one covering 20 bases to the left and right of the 5' end and one covering 20 bases to the left and right of the 3' end, 40 bases per image; the height of each image is 100 mm, and the heights of the bars representing base abundance are normalized.

c. From each image obtained in step b, crop the region spanning 131 to 231 pixels vertically and 280 to 1070 pixels horizontally.

d. Merge the images cropped in step c horizontally to obtain the final processed data set.
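The cropping and merging of steps c and d might look like the following Pillow sketch (the use of Pillow and the file-path arguments are assumptions; the crop box encodes the 131-231 pixel vertical and 280-1070 pixel horizontal region named above):

```python
from PIL import Image

# (left, top, right, bottom): horizontal 280-1070 px, vertical 131-231 px
CROP_BOX = (280, 131, 1070, 231)

def crop_and_merge(path_5p, path_3p, out_path):
    """Crop the 5'-end and 3'-end IGV snapshots and paste them side by side."""
    left = Image.open(path_5p).crop(CROP_BOX)
    right = Image.open(path_3p).crop(CROP_BOX)
    merged = Image.new("RGB", (left.width + right.width, left.height))
    merged.paste(left, (0, 0))
    merged.paste(right, (left.width, 0))
    merged.save(out_path)
    return merged.size
```

Each cropped strip is 790 by 100 pixels, so the merged training image is 1580 by 100 pixels.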

Dividing the processed data set obtained in step S3 into a training set and a test set according to a set ratio, as described in step S4, is specifically as follows: in the simulated data sequence file SIMU30 obtained in step S1, define as positive samples the introns whose total read count is greater than a first set value, whose FPKM (Fragments Per Kilobase of transcript per Million mapped reads) is greater than a second set value, and whose count of consecutive reads is greater than a third set value; the remaining introns are negative samples. Then randomly draw X2 positive samples and X2 negative samples to form the final data set, and divide the data set into a training set and a test set according to the set ratio; X2 is a positive integer.

The neural network model in step S5 is specifically a VGG16 network structure model.

Training the neural network model with the training set obtained in step S4 to obtain the final neural network intron retention prediction model, as described in step S5, specifically uses the following steps:

(1) Obtain a VGG16 network structure model already trained on the ImageNet task, together with the corresponding weight parameter file; the network structure model contains 13 convolutional layers.

(2) Load the network and weights obtained in step (1) as a pre-trained network, but freeze this network so that it does not participate in training.

(3) Define a binary classification network and train it on the training set obtained in step S4. The binary classification network has 3 layers in total: the first 2 are fully connected layers with 256 and 64 neurons respectively, each followed by a Dropout layer to prevent overfitting, with neuron-dropping probabilities of 0.5 and 0.3 respectively; the last layer is a sigmoid layer for binary classification.
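Steps (1) to (3) correspond to a standard transfer-learning setup. A minimal sketch is shown below, under the assumption of a Keras/TensorFlow implementation (the patent does not name the framework) and an assumed input image size:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_ir_model(input_shape=(100, 1580, 3), weights="imagenet"):
    """Frozen VGG16 base plus the 3-layer classification head described in step (3).

    input_shape is an assumption (cropped strips merged to 1580x100 pixels).
    """
    conv_base = keras.applications.VGG16(weights=weights, include_top=False,
                                         input_shape=input_shape)
    conv_base.trainable = False          # step (2): freeze the pre-trained network
    model = keras.Sequential([
        conv_base,
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),   # binary: retained / not retained
    ])
    model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=2e-5),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

For step (4), one would later set `conv_base.trainable = True`, re-freeze all but the last 3 convolutional layers, and recompile before continuing training.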

(4) After the classification network has been trained, unfreeze the last 3 convolutional layers of the pre-trained network, train the classification network and the pre-trained network together again with the training set from step S4, and adjust the weights.

(5) Set the parameters of the model training process as follows:

The total number of model parameters is 33 million, of which 26 million are trainable and 7 million are non-trainable.

The loss function is the binary cross-entropy loss, computed as

loss = -(1/N) Σᵢ [ tᵢ · log(yᵢ) + (1 − tᵢ) · log(1 − yᵢ) ]

where i indexes the samples, N is the total number of samples, tᵢ is the true label of sample i, and yᵢ is the predicted label of sample i.
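Assuming the standard (mean) form of the binary cross-entropy written above, a plain-Python version is:

```python
import math

def binary_cross_entropy(t, y, eps=1e-12):
    """Mean binary cross-entropy: -(1/N) * sum(t_i*log(y_i) + (1-t_i)*log(1-y_i))."""
    total = 0.0
    for ti, yi in zip(t, y):
        yi = min(max(yi, eps), 1 - eps)       # clip predictions to avoid log(0)
        total += ti * math.log(yi) + (1 - ti) * math.log(1 - yi)
    return -total / len(t)
```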

The optimizer is RMSprop, the learning rate is 2e-5, and the number of iterations is 30.

The evaluation metric is accuracy, computed as:

accuracy = (TruePositive + TrueNegative) / AllSamples

where TruePositive is the number of samples predicted positive that are truly positive, TrueNegative is the number of samples predicted negative that are truly negative, and AllSamples is the total number of samples.

Set ReduceLROnPlateau to check every 2 iterations; if the monitored quantity has not improved, reduce the learning rate by 50%.

Set training to stop early if the evaluation metric accuracy has not improved within 10 iterations.
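In Keras terms, the two scheduling rules above map naturally onto the ReduceLROnPlateau and EarlyStopping callbacks. A sketch follows; the monitored quantities are assumptions, since the text does not name them precisely:

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

# factor=0.5 halves the learning rate; patience values follow the text
# (check after 2 stalled iterations; stop after 10 stalled iterations).
callbacks = [
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2),
    EarlyStopping(monitor="val_accuracy", patience=10),
]
# Typical use: model.fit(x_train, y_train, epochs=30,
#                        validation_split=0.1, callbacks=callbacks)
```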

Computing the evaluation parameters of the neural network intron retention prediction model on the test set obtained in step S4, as described in step S6, specifically means computing the model's AUC value on that test set.

Obtaining the image test set of intron sequence read-distribution patterns of the real data from step S1, as described in step S7, is specifically as follows: input the real-data sequence file APP obtained in step S1 into the prediction tools iREAD and IRFinder to obtain two intron retention prediction sets, IR1 and IR2; map IR1 and IR2 onto the independent intron set intron cluster according to the rule of maximum matched coordinate-interval length, and take the intersection of the two to obtain the intersection set IC; then apply IGV visualization, image cropping, and merging to the coordinates of each intron in IC to obtain real_test, the image test set of intron sequence read-distribution patterns of the real data.

Computing splice-site strengths from the W1+N1-base 5'-end sequences obtained in step S9 and the W2+N2-base 3'-end sequences obtained in step S10, as described in step S11, is specifically as follows: input the 5'-end sequence set score5ss obtained in step S9 and the 3'-end sequence set score3ss obtained in step S10 into the MaxEntScan model, which scores them with a maximum-entropy model to give a strength value for each splice site; then average the strengths of the splice sites corresponding to the 5'-end sequences and to the 3'-end sequences, respectively, to obtain the final average 5' splice-site strength and average 3' splice-site strength.
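The averaging in step S11 is straightforward once MaxEntScan's per-sequence scores are available. The sketch below assumes MaxEntScan's tab-separated "sequence, score" output format (an assumption about the tool's output; verify against the score5.pl/score3.pl scripts actually used):

```python
def parse_maxentscan_output(lines):
    """Parse 'SEQUENCE<TAB>SCORE' lines (assumed MaxEntScan output layout)."""
    return [float(line.split("\t")[-1]) for line in lines if line.strip()]

def mean_splice_strength(scores):
    """Average splice-site strength over all predicted retained introns."""
    return sum(scores) / len(scores)
```

One would call this separately on the 5'-end and 3'-end score files to obtain the two averages used in step S12.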

Evaluating the neural network intron retention prediction model established in step S5 according to the average 5' and 3' splice-site strengths obtained in step S11, as described in step S12, specifically means: the smaller the average 5' splice-site strength and the average 3' splice-site strength of the model's predictions, the better the model's prediction performance.

The present invention also provides a prediction method that includes the above method for establishing an intron retention prediction model, which specifically further comprises the following step:

S13. Use the neural network intron retention prediction model obtained in step S5 to predict intron retention.

With the method for establishing an intron retention prediction model and the prediction method provided by the present invention, the deep-learning prediction approach based on intron-retention read-distribution patterns predicts intron retention in a more general and more interpretable way. By combining the read-distribution patterns with deep-learning model construction and transfer learning, the knowledge learned on a large image-classification task is transferred, completing and improving the learning of the intron retention prediction task. Prediction performance is also evaluated on a real data set without a gold standard: the average splice-site strength of the 5'- and 3'-end sequences of the predicted retained introns is computed to measure overall prediction quality. The method of the present invention can therefore visualize and predict introns based on intron-retention read-distribution patterns, with high reliability and good accuracy.

Description of Drawings

Figure 1 is a flowchart of the method for establishing an intron retention prediction model of the present invention.

Figure 2 is a schematic diagram of the visualization of intron-retention read-distribution patterns of the present invention.

Figure 3 is a schematic diagram of the structure of the deep learning model VGG16 of the present invention.

Figure 4 is a flowchart of the prediction method of the present invention.

Detailed Description

Figure 1 is a flowchart of the method for establishing an intron retention prediction model of the present invention. The method provided by the present invention comprises the following steps:

S1. Collect simulated data and real data related to intron retention. Specifically, use the BEER algorithm to generate a simulated data sequence file SIMU30 containing a known number of introns; SIMU30 has a sequencing depth of thirty million reads and a read length of 100 bases, and is set to generate 15,000 genes and 69,338 introns. The real data is a sequence file APP from the Tau and APP mouse model study of the Accelerating Medicines Partnership for Alzheimer's Disease, with a sequencing depth of one hundred million reads and a read length of 101 bases.

S2. Define the set of all independent introns in the genome as a standard template. The present invention can be applied specifically to mice, so the genome here may be the mouse genome. The definition specifically uses the following steps:

A. From the release-75 annotation GTF file of the GRCm38 mouse genome, extract the set of all independent introns, Independent_intron; an independent intron is defined as an intron that does not overlap any homotypic exon.

Here, extracting the set of all independent introns Independent_intron specifically means merging all exons within a chromosome and then deleting all exons from the gene region, which leaves all the independent introns.

B. Within the independent intron set Independent_intron obtained in step A, merge introns whose coordinate intervals overlap, gene by gene, to obtain the final independent intron set, intron cluster.

S3. From the simulated data obtained in step S1, generate the image data set of intron sequence read-distribution patterns and preprocess it to obtain the processed data set, specifically as follows:

a. Visualize each intron in the simulated data sequence file SIMU30 obtained in step S1 with IGV to obtain preliminary visualization images.

b. Since intron lengths vary and differ greatly, save for each intron two sequence visualization images, one covering 20 bases to the left and right of the 5' end and one covering 20 bases to the left and right of the 3' end, 40 bases per image; the height of each image is 100 mm, and the heights of the bars representing base abundance are normalized.

c. The original visualization image of a single sequence segment is 621 pixels high and 1150 pixels wide; from each image obtained in step b, crop the region spanning 131 to 231 pixels vertically and 280 to 1070 pixels horizontally.

d. Merge the images cropped in step c horizontally to obtain the final processed data set; the visualization result is shown in Figure 2.

S4. Divide the processed data set obtained in step S3 into a training set and a test set according to a set ratio. Specifically, in the simulated data sequence file SIMU30 obtained in step S1, define as positive samples the introns whose total read count is greater than a first set value (for example 10), whose FPKM (Fragments Per Kilobase of transcript per Million mapped reads) is greater than a second set value (for example 0.3), and whose count of consecutive reads is greater than a third set value (for example 1); the remaining introns are negative samples. Then randomly draw X2 (for example 5000) positive samples and X2 negative samples to form the final data set, and divide the data set into a training set and a test set according to the set ratio (for example 7:3); X2 is a positive integer.
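With the example thresholds above (reads greater than 10, FPKM greater than 0.3, consecutive reads greater than 1, X2 = 5000, 7:3 split), the sampling and splitting could be sketched as follows (the per-intron dict fields are assumptions about how the counts are stored):

```python
import random

def is_positive(intron, min_reads=10, min_fpkm=0.3, min_consecutive=1):
    """Positive sample per the example thresholds in step S4."""
    return (intron["reads"] > min_reads and intron["fpkm"] > min_fpkm
            and intron["consecutive"] > min_consecutive)

def build_dataset(introns, n_per_class=5000, train_frac=0.7, seed=0):
    """Draw X2 positives and X2 negatives, shuffle, and split train/test."""
    rng = random.Random(seed)
    pos = [i for i in introns if is_positive(i)]
    neg = [i for i in introns if not is_positive(i)]
    pos = rng.sample(pos, min(n_per_class, len(pos)))
    neg = rng.sample(neg, min(n_per_class, len(neg)))
    data = [(i, 1) for i in pos] + [(i, 0) for i in neg]
    rng.shuffle(data)
    split = int(train_frac * len(data))
    return data[:split], data[split:]
```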

S5. Train a neural network model with the training set obtained in step S4 to obtain the final neural network intron retention prediction model. In a specific implementation, the prediction model is preferably the VGG16 model; when VGG16 is chosen as the prediction model, the model can be trained with the following steps:

(1) Obtain a VGG16 network structure model already trained on the ImageNet task (shown in Figure 3), together with the corresponding weight parameter file; the network structure model contains 13 convolutional layers.

(2) Load the network and weights obtained in step (1) as a pre-trained network, but freeze this network so that it does not participate in training.

(3) Define a binary classification network and train it on the training set obtained in step S4. The binary classification network has 3 layers in total: the first 2 are fully connected layers with 256 and 64 neurons respectively, each followed by a Dropout layer to prevent overfitting, with neuron-dropping probabilities of 0.5 and 0.3 respectively; the last layer is a sigmoid layer for binary classification.

(4) After the classification network has been trained, unfreeze the last 3 convolutional layers of the pre-trained network, train the classification network and the pre-trained network together again with the training set from step S4, and adjust the weights.

(5) Set the parameters of the model training process as follows:

The total number of model parameters is 33 million, of which 26 million are trainable and 7 million are non-trainable.

损失函数为二分类交叉熵损失,计算公式为The loss function is the two-category cross entropy loss, and the calculation formula is

Loss = -∑i [ti·log(yi) + (1-ti)·log(1-yi)]

其中i为每个样本,ti为样本i的真实标签;yi为样本i的预测标签;where i indexes the samples, ti is the true label of sample i and yi is the predicted label of sample i;
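As an illustrative aside (not part of the patent text), the binary cross-entropy loss defined above can be checked with a short plain-Python sketch; the function name `binary_cross_entropy` is our own choice:

```python
import math

def binary_cross_entropy(t, y, eps=1e-12):
    """Binary cross-entropy summed over samples i:
    Loss = -sum_i [t_i*log(y_i) + (1-t_i)*log(1-y_i)]."""
    total = 0.0
    for ti, yi in zip(t, y):
        yi = min(max(yi, eps), 1.0 - eps)  # clip to avoid log(0)
        total += ti * math.log(yi) + (1.0 - ti) * math.log(1.0 - yi)
    return -total

# A confident, correct prediction gives a loss near 0; a maximally
# uncertain prediction on a single sample gives log(2).
```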

优化器为RMSprop,学习率为2e-5,迭代次数为30;The optimizer is RMSprop, the learning rate is 2e-5, and the number of iterations is 30;

评价指标为accuracy,计算公式为:The evaluation index is accuracy, and the calculation formula is:

accuracy = (Truepositive + Truenegative) / Allsamples

其中Truepositive为预测为正且真实为正的样本数;Truenegative为预测为负且真实为负的样本数;Allsamples为总样本数;Among them, Truepositive is the number of samples predicted positive and actually positive; Truenegative is the number of samples predicted negative and actually negative; Allsamples is the total number of samples;
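For illustration only, the accuracy metric above reduces to a one-line computation over paired label lists:

```python
def accuracy(true_labels, pred_labels):
    """accuracy = (Truepositive + Truenegative) / Allsamples, i.e. the
    fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(true_labels, pred_labels) if t == p)
    return correct / len(true_labels)
```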

设置ReduceLROnPlateau每2次迭代监测一次验证指标,若指标未改善,则将学习率降低50%;Set ReduceLROnPlateau to monitor the validation metric every 2 iterations; if the metric has not improved, the learning rate is reduced by 50%;

设置若评价指标accuracy在10次迭代中均未提升,则提前停止迭代;If the evaluation metric accuracy does not improve within 10 iterations, training is stopped early;
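The learning-rate schedule and early-stopping rules described above (in Keras, the ReduceLROnPlateau and EarlyStopping callbacks) can be simulated with a small stand-alone sketch; this is a simplified model of the behavior, assuming a metric that is maximized, and is not the patent's implementation:

```python
def run_schedule(metric_history, lr=2e-5, patience=2, factor=0.5,
                 stop_patience=10):
    """Simulate the schedule: after every `patience` consecutive iterations
    without improvement, multiply the learning rate by `factor` (a 50%
    reduction); after `stop_patience` iterations without improvement, stop
    early. Returns (final_lr, last_epoch_run)."""
    best = float('-inf')
    stale = 0  # iterations since the metric last improved
    for epoch, m in enumerate(metric_history):
        if m > best:
            best, stale = m, 0
        else:
            stale += 1
            if stale % patience == 0:
                lr *= factor
            if stale >= stop_patience:
                return lr, epoch  # early stop
    return lr, len(metric_history) - 1
```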

S6.根据步骤S5得到的神经网络内含子保留预测模型,在步骤S4得到的测试集上计算神经网络内含子保留预测模型的评价参数(优选为AUC值);S6. according to the neural network intron retention prediction model obtained in step S5, calculate the evaluation parameter (preferably AUC value) of the neural network intron retention prediction model on the test set obtained in step S4;
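As a sketch of the AUC evaluation in step S6 (not the patent's own code), the AUC equals the Mann-Whitney statistic: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counting one half:

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.
    labels: 0/1 true labels; scores: predicted probabilities."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count half
    return wins / (len(pos) * len(neg))
```

In practice a library routine such as scikit-learn's `roc_auc_score` would be used; the quadratic pairwise loop here is only for clarity.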

S7.获取步骤S1得到的真实数据的内含子序列读数分布模式图片测试集;具体为将步骤S1得到的真实数据的序列文件APP输入到预测工具iREAD和预测工具IRFinder中,分别得到两组内含子保留预测集合IR1和IR2;将IR1和IR2根据匹配坐标区间长度最大的规则映射到独立内含子集合intron cluster上,再取两者交集,得到交集IC;然后,将交集IC中的各内含子坐标进行IGV可视化、图片裁剪和合并等操作,从而得到真实数据的内含子序列读数分布模式图片测试集real_test;S7. Obtain the intron sequence reading distribution pattern picture test set of the real data obtained in step S1; specifically, input the sequence file APP of the real data obtained in step S1 into the prediction tool iREAD and the prediction tool IRFinder, and obtain two groups of introns respectively. Intron retention prediction sets IR1 and IR2; map IR1 and IR2 to the independent intron set intron cluster according to the rule with the largest matching coordinate interval length, and then take the intersection of the two to obtain the intersection IC; The intron coordinates are subjected to IGV visualization, image cropping and merging, etc., to obtain the image test set real_test of the intron sequence read distribution pattern of the real data;
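The mapping and intersection in step S7 can be sketched as interval arithmetic; this is an illustrative reading of the "largest matching coordinate interval" rule, assuming half-open (start, end) tuples, not the tools' actual code:

```python
def map_to_clusters(predicted, clusters):
    """Map each predicted intron (start, end) to the cluster interval with
    the largest coordinate overlap; predictions with no overlap are dropped."""
    mapped = set()
    for ps, pe in predicted:
        best, best_ov = None, 0
        for cs, ce in clusters:
            ov = min(pe, ce) - max(ps, cs)  # overlap length (<=0 if disjoint)
            if ov > best_ov:
                best, best_ov = (cs, ce), ov
        if best is not None:
            mapped.add(best)
    return mapped

def intersection_ic(ir1, ir2, clusters):
    """Intersection IC of the two tools' mapped prediction sets."""
    return map_to_clusters(ir1, clusters) & map_to_clusters(ir2, clusters)
```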

S8.根据步骤S5得到的神经网络内含子保留预测模型,在步骤S7得到的测试集上预测内含子保留结果,从而得到预测内含子保留集合;S8. according to the neural network intron retention prediction model obtained in step S5, predict the intron retention result on the test set obtained in step S7, thereby obtaining the predicted intron retention set;

S9.获取步骤S8得到的预测内含子保留集合中,起始坐标外显子侧W1个碱基、内含子侧N1个碱基,共W1+N1个碱基的5'端序列;S9. For each intron in the predicted intron retention set obtained in step S8, obtain the 5'-end sequence of W1 bases on the exon side and N1 bases on the intron side of the start coordinate, W1+N1 bases in total;

S10.获取步骤S8得到的预测内含子保留集合中,终止坐标外显子侧W2个碱基、内含子侧N2个碱基,共W2+N2个碱基的3'端序列;S10. For each intron in the predicted intron retention set obtained in step S8, obtain the 3'-end sequence of W2 bases on the exon side and N2 bases on the intron side of the end coordinate, W2+N2 bases in total;
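Steps S9 and S10 amount to slicing two windows around the intron boundaries. A minimal sketch, assuming 0-based half-open coordinates on the + strand (for MaxEntScan scoring, W1+N1 = 3+6 = 9 bases and N2+W2 = 20+3 = 23 bases are the commonly used window sizes):

```python
def splice_windows(chrom_seq, intron_start, intron_end, w1, n1, w2, n2):
    """For a 0-based, half-open intron [intron_start, intron_end) on the
    + strand, return:
      - the 5'-end window: W1 exon bases upstream of the start coordinate
        plus N1 intron bases (W1+N1 bases, step S9);
      - the 3'-end window: N2 intron bases before the end coordinate plus
        W2 exon bases downstream (W2+N2 bases, step S10)."""
    five = chrom_seq[intron_start - w1:intron_start + n1]
    three = chrom_seq[intron_end - n2:intron_end + w2]
    return five, three
```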

S11.根据步骤S9获得的W1+N1个碱基的5’端序列和步骤S10获得的W2+N2个碱基的3’端序列,计算剪接位点强度,从而得到5’端平均剪接位点强度值和3’端平均剪接位点强度值;具体为将步骤S9得到的5’端序列score5ss序列集合和步骤S10得到的3’端序列score3ss序列集合输入到MaxEntScan模型中,采用最大熵模型进行打分,从而得到给定的剪接位点强度值;然后对5’端序列和3’端序列所对应的剪接位点强度取平均值,从而得到最终的5’端平均剪接位点强度值和3’端平均剪接位点强度值;S11. According to the 5'-end sequence of W1+N1 bases obtained in step S9 and the 3'-end sequence of W2+N2 bases obtained in step S10, calculate the splice site intensity, thereby obtaining the average splice site at the 5' end The intensity value and the 3'-end average splice site intensity value; specifically, the 5'-end sequence score5ss sequence set obtained in step S9 and the 3'-end sequence score3ss sequence set obtained in step S10 are input into the MaxEntScan model, and the maximum entropy model is used to carry out Score to obtain a given splice site intensity value; then average the splice site intensities corresponding to the 5'-end sequence and the 3'-end sequence to obtain the final 5'-end average splice site intensity value and 3 ' end average splice site strength value;

S12.根据步骤S11得到的5’端平均剪接位点强度值和3’端平均剪接位点强度值,对步骤S5建立的神经网络内含子保留预测模型进行评价;具体为若神经网络内含子保留预测模型的5’端平均剪接位点强度值和3’端平均剪接位点强度值越小,则神经网络内含子保留预测模型的预测效果越好。S12. Evaluate the neural network intron retention prediction model established in step S5 according to the average splice site intensity value at the 5' end and the average splice site intensity value at the 3' end obtained in step S11; specifically, if the neural network contains The smaller the 5'-end average splice site intensity value and the 3'-end average splice site intensity value of the intron retention prediction model, the better the prediction effect of the neural network intron retention prediction model.

以下对本发明方法进行验证:The method of the present invention is verified as follows:

在模拟数据SIMU30和真实数据集APP上对本发明进行评价,同时与本发明相比较的工具有iREAD和IRFinder。The invention is evaluated on simulated data SIMU30 and real data set APP, and the tools compared with the invention are iREAD and IRFinder.

1)SIMU30模拟数据集实验分析1) Experimental analysis of SIMU30 simulation data set

对于SIMU30模拟数据的3000个测试集样本,本发明在其上的预测Accuracy达到0.925,AUC达到0.975;For 3000 test set samples of SIMU30 simulation data, the prediction Accuracy of the present invention on it reaches 0.925, and the AUC reaches 0.975;

2)APP真实数据集实验分析2) APP real data set experimental analysis

由于真实数据缺乏金标准,一方面只能以其他方法的预测标签为真实标签,测试本发明VGG16模型的AUC与其他方法的差距;另一方面可以自定义其他的评价指标,来验证本发明的有效性。AUC评价方面,本发明VGG16模型在预测真实数据图片测试集real_test后,与iREAD和IRFinder的比较见表1。real_test共68326个样本,在以iREAD为金标准时,正样本数为2816,负样本数为65510,此时本发明VGG16模型的AUC优于IRFinder。在以IRFinder为金标准时,正样本数为19044,负样本数为49282,此时本发明也优于iREAD。Due to the lack of gold standards for real data, on the one hand, the predicted labels of other methods can only be used as real labels to test the gap between the AUC of the VGG16 model of the present invention and other methods; on the other hand, other evaluation indicators can be customized to verify the performance of the present invention effectiveness. In terms of AUC evaluation, after predicting the real data picture test set real_test, the VGG16 model of the present invention is compared with iREAD and IRFinder in Table 1. The real_test has a total of 68326 samples. When iREAD is the gold standard, the number of positive samples is 2816 and the number of negative samples is 65510. At this time, the AUC of the VGG16 model of the present invention is better than that of IRFinder. When taking IRFinder as the gold standard, the number of positive samples is 19044 and the number of negative samples is 49282. At this time, the present invention is also better than iREAD.

表1本发明与iREAD和IRFinder的AUC评价结果示意表Table 1 Schematic representation of the AUC evaluation results of the present invention, iREAD and IRFinder

(表1的具体数值以附图形式给出)(The contents of Table 1 are provided as an image in the original document.)

另外,本发明还定义了5’端和3’端剪接位点强度来衡量VGG16模型预测效果,平均剪接位点强度越低,模型整体预测效果更好。平均剪接位点强度评价结果见表2。In addition, the present invention also defines the 5'-end and 3'-end splice site strengths to measure the prediction effect of the VGG16 model. The lower the average splice site intensity, the better the overall prediction effect of the model. The average splice site strength evaluation results are shown in Table 2.

表2本发明与iREAD和IRFinder的平均剪接位点强度评价结果示意表Table 2 Schematic table of the average splice site strength evaluation results between the present invention and iREAD and IRFinder

(表2的具体数值以附图形式给出)(The contents of Table 2 are provided as an image in the original document.)

从表2中结果来看,虽然本发明的结果在平均剪接位点强度方面略差于IRFinder和iREAD,但是注意到,随着参与计算平均剪接位点强度的内含子数增加,IRFinder和iREAD的平均剪接位点强度是随之增加的,而本发明是降低的。由此反映了本发明设计的VGG16模型在鲁棒性方面优于IRFinder和iREAD。From the results in Table 2, although the results of the present invention are slightly worse than those of IRFinder and iREAD in terms of average splice site strength, it is noted that as the number of introns involved in calculating the average splice site strength increases, the average splice site strength of IRFinder and iREAD increases accordingly, while that of the present invention decreases. This shows that the VGG16 model designed by the present invention is superior to IRFinder and iREAD in terms of robustness.

如图4所示为本发明的预测方法流程示意图:本发明提供的这种包括上述内含子保留预测模型建立方法的预测方法,具体包括如下步骤:Figure 4 is a schematic flow chart of the prediction method of the present invention: the prediction method provided by the present invention including the above-mentioned intron retention prediction model establishment method specifically includes the following steps:

S1.收集内含子保留相关的模拟数据和真实数据;具体为采用BEER算法生成含有确定内含子数目的模拟数据序列文件SIMU30;所述模拟数据序列文件SIMU30的测序深度为三千万,读数长度为100个碱基,设定生成基因15000个,内含子69338个;以及从阿尔茨海默病加速药物合作项目的Tau和APP小鼠模型研究中的一个真实数据序列文件APP,测序深度为一亿,读数长度为101个碱基;S1. Collect simulated and real data related to intron retention; specifically, use the BEER algorithm to generate a simulated data sequence file SIMU30 containing a known number of introns; the sequencing depth of SIMU30 is 30 million, the read length is 100 bases, and it is set to generate 15,000 genes and 69,338 introns; also obtain a real data sequence file APP from the Tau and APP mouse model study of the Accelerating Medicines Partnership for Alzheimer's Disease, with a sequencing depth of 100 million and a read length of 101 bases;

S2.定义基因组中所有独立内含子集合并作为标准模板;具体为采用如下步骤进行定义:S2. Define all independent intron sets in the genome and use them as standard templates; specifically, the following steps are used to define:

A.从GRCm38小鼠基因组的release-75版本的注释gtf文件,提取所有的独立内含子集合Independent_intron;所述独立内含子的定义为不与任何同型外显子重叠的内含子;A. From the annotated gtf file of the release-75 version of the GRCm38 mouse genome, extract all independent intron sets Independent_intron; the independent introns are defined as introns that do not overlap with any homotypic exons;

其中,提取所有的独立内含子集合Independent_intron,具体为合并一个染色体中的所有外显子,然后从基因区域删除所有外显子,从而得到所有的独立内含子;Among them, extract all independent intron sets Independent_intron, specifically, merge all exons in a chromosome, and then delete all exons from the gene region to obtain all independent introns;

B.在步骤A得到的独立内含子集合Independent_intron中,以基因为单位,合并坐标区间有重叠的内含子,得到最终的独立内含子集合intron cluster;B. In the independent intron set Independent_intron obtained in step A, take the gene as the unit, merge the overlapping introns in the coordinate interval, and obtain the final independent intron set intron cluster;
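Steps A and B above are both interval arithmetic: merge exons, subtract them from the gene region to get independent introns, then merge overlapping introns into clusters. A minimal self-contained sketch (our own function names):

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals; used both to merge exons in
    step A and to merge overlapping introns into clusters in step B."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def independent_introns(gene_start, gene_end, exons):
    """Delete all (merged) exons from the gene region; the remaining gaps
    are the independent introns of step A."""
    introns, cursor = [], gene_start
    for start, end in merge_intervals(exons):
        if start > cursor:
            introns.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < gene_end:
        introns.append((cursor, gene_end))
    return introns
```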

S3.获取步骤S1得到的模拟数据中所设定的内含子序列读数分布模式图片数据集,并进行预处理得到处理后的数据集;具体为采用如下步骤获取数据集并进行数据:S3. Obtain the intron sequence read distribution pattern picture data set set in the simulation data obtained in step S1, and perform preprocessing to obtain the processed data set; specifically, the following steps are used to obtain the data set and carry out the data:

a.将步骤S1得到的模拟数据序列文件SIMU30中的每个内含子进行IGV可视化,得到初步的可视化图像;a. Perform IGV visualization on each intron in the simulation data sequence file SIMU30 obtained in step S1 to obtain a preliminary visualization image;

b.由于每个内含子长度不定,且差异极大,因此分别保存每个内含子5’端和3’端左、右各20个碱基,长度一共为40个碱基的两段序列可视化图像;可视化图像的高度为100mm,同时对代表碱基丰度的条形图高度进行标准化处理;b. Since the length of each intron is indeterminate and very different, two sections of 20 bases on the left and right of the 5' end and 3' end of each intron are respectively saved, with a total length of 40 bases. Sequence visualization image; the height of the visualization image is 100mm, and the height of the bar graph representing base abundance is normalized;

c.对于步骤b得到的图像,单段序列的可视化图像原始纵长621像素,横长1150像素,因此裁剪整张图像的纵长为131~231像素的部分,以及横长280~1070像素的部分;c. For the image obtained in step b, the original visual image of a single-segment sequence is 621 pixels long and 1150 pixels horizontally, so the part of the entire image with a vertical length of 131 to 231 pixels and a horizontal length of 280 to 1070 pixels are cropped. part;

d.将步骤c裁剪得到的图像进行横向合并,从而得到最终的处理后的数据集;可视化结果如图2所示;d. Merge the images cropped in step c horizontally to obtain the final processed data set; the visualization result is shown in Figure 2;
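Steps c and d above reduce to array slicing and horizontal concatenation. A sketch with images stored as plain lists of pixel rows (in practice a library such as Pillow or NumPy would be used):

```python
def crop(image, top=131, bottom=231, left=280, right=1070):
    """Keep rows [top, bottom) and columns [left, right), matching the
    crop window of step c (131-231 px vertically, 280-1070 px horizontally)."""
    return [row[left:right] for row in image[top:bottom]]

def hmerge(img_a, img_b):
    """Concatenate two equally tall images side by side (step d)."""
    return [ra + rb for ra, rb in zip(img_a, img_b)]
```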

S4.将步骤S3得到的处理后的数据集按照设定比例划分为训练集和测试集;具体为在步骤S1得到的模拟数据序列文件SIMU30中,定义序列总读数大于第一设定值(比如10)、FPKM(每百万读数中匹配到基因中每千个碱基的片段数,Fragments Per KilobaseMillion)大于第二设定值(比如0.3)且连续读数大于第三设定值(比如1)的内含子为正样本,剩余的内含子为负样本;然后在正负样本中,随机抽取X2(比如5000)个正样本和X2个负样本,构成最终的数据集;然后按照设定的比例(比如7:3)将数据集划分为训练集和测试集;X2为正整数。S4. Divide the processed data set obtained in step S3 into a training set and a test set according to a set ratio; specifically, in the simulated data sequence file SIMU30 obtained in step S1, the total reading of the defined sequence is greater than the first set value (such as 10), FPKM (the number of fragments matching each kilobase in the gene per million reads, Fragments Per KilobaseMillion) is greater than the second set value (such as 0.3) and consecutive reads are greater than the third set value (such as 1) The introns are positive samples, and the remaining introns are negative samples; then, in the positive and negative samples, randomly select X2 (such as 5000) positive samples and X2 negative samples to form the final data set; then follow the setting The ratio (such as 7:3) divides the dataset into training set and test set; X2 is a positive integer.
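The labeling and splitting rules of step S4 can be sketched as follows; the dict keys `reads`, `fpkm` and `contig` are hypothetical field names chosen for illustration:

```python
import random

def build_dataset(introns, x2, train_frac=0.7, seed=0,
                  min_reads=10, min_fpkm=0.3, min_contig=1):
    """Label each intron positive when total reads > min_reads, FPKM >
    min_fpkm and contiguous reads > min_contig (the rest are negative);
    sample x2 positives and x2 negatives, shuffle, and split
    train_frac : (1 - train_frac), e.g. 7:3, into train/test sets."""
    pos, neg = [], []
    for it in introns:
        is_pos = (it['reads'] > min_reads and it['fpkm'] > min_fpkm
                  and it['contig'] > min_contig)
        (pos if is_pos else neg).append(it)
    rng = random.Random(seed)
    sample = rng.sample(pos, x2) + rng.sample(neg, x2)
    rng.shuffle(sample)
    cut = int(train_frac * len(sample))
    return sample[:cut], sample[cut:]
```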

S5.采用步骤S4得到的训练集训练神经网络模型,从而得到最终建立的神经网络内含子保留预测模型;在具体实施时,预测模型优选为VGG16模型;且在选用VGG16为预测模型时,可以采用如下步骤训练模型:S5. use the training set obtained in step S4 to train the neural network model, thereby obtaining the finally established neural network intron retention prediction model; during specific implementation, the prediction model is preferably the VGG16 model; and when VGG16 is selected as the prediction model, you can Use the following steps to train the model:

(1)获得在ImageNet任务上已经训练好的VGG16网络结构模型(如图3所示)以及对应的权重参数文件;所述网络结构模型共包括13个卷积层;(1) Obtain the VGG16 network structure model pre-trained on the ImageNet task (as shown in Figure 3) and the corresponding weight parameter file; the network structure model comprises 13 convolutional layers;

(2)加载步骤(1)得到的网络及权重作为预训练网络,但冻结该网络从而保证该网络不参与训练;(2) Load the network and weight obtained in step (1) as a pre-training network, but freeze the network to ensure that the network does not participate in training;

(3)定义一个二分类网络,在步骤S4得到的训练集上进行训练;所述二分类网络共有3层,前2层为全连接层,神经元个数分别为256和64,每层后面接一个Dropout层防止过拟合,随机丢弃神经元的概率分别设为0.5和0.3;最后一层为sigmoid层,用于二分类;(3) Define a binary classification network and train it on the training set obtained in step S4; the network has 3 layers in total: the first 2 are fully connected layers with 256 and 64 neurons respectively, each followed by a Dropout layer (drop probabilities 0.5 and 0.3) to prevent overfitting; the last layer is a sigmoid layer for binary classification;

(4)分类网络训练好后,解冻预训练网络的后3层卷积层,再次用步骤S4所得训练集对分类网络和预训练网络一起训练,并调整权重;(4) After the classification network is trained, unfreeze the last three convolution layers of the pre-training network, use the training set obtained in step S4 again to train the classification network and the pre-training network together, and adjust the weights;

(5)设定模型训练过程的参数如下:(5) The parameters of the model training process are set as follows:

模型训练总的参数数目为3300万,其中可训练参数数目为2600万,不可训练参数数目为700万;The total number of parameters for model training is 33 million, of which the number of trainable parameters is 26 million, and the number of non-trainable parameters is 7 million;

损失函数为二分类交叉熵损失,计算公式为The loss function is the two-category cross entropy loss, and the calculation formula is

Loss = -∑i [ti·log(yi) + (1-ti)·log(1-yi)]

其中i为每个样本,ti为样本i的真实标签;yi为样本i的预测标签;where i indexes the samples, ti is the true label of sample i and yi is the predicted label of sample i;

优化器为RMSprop,学习率为2e-5,迭代次数为30;The optimizer is RMSprop, the learning rate is 2e-5, and the number of iterations is 30;

评价指标为accuracy,计算公式为:The evaluation index is accuracy, and the calculation formula is:

accuracy = (Truepositive + Truenegative) / Allsamples

其中Truepositive为预测为正且真实为正的样本数;Truenegative为预测为负且真实为负的样本数;Allsamples为总样本数;Among them, Truepositive is the number of samples predicted positive and actually positive; Truenegative is the number of samples predicted negative and actually negative; Allsamples is the total number of samples;

设置ReduceLROnPlateau每2次迭代监测一次验证指标,若指标未改善,则将学习率降低50%;Set ReduceLROnPlateau to monitor the validation metric every 2 iterations; if the metric has not improved, the learning rate is reduced by 50%;

设置若评价指标accuracy在10次迭代中均未提升,则提前停止迭代;If the evaluation metric accuracy does not improve within 10 iterations, training is stopped early;

S6.根据步骤S5得到的神经网络内含子保留预测模型,在步骤S4得到的测试集上计算神经网络内含子保留预测模型的评价参数(优选为AUC值);S6. according to the neural network intron retention prediction model obtained in step S5, calculate the evaluation parameter (preferably AUC value) of the neural network intron retention prediction model on the test set obtained in step S4;

S7.获取步骤S1得到的真实数据的内含子序列读数分布模式图片测试集;具体为将步骤S1得到的真实数据的序列文件APP输入到预测工具iREAD和预测工具IRFinder中,分别得到两组内含子保留预测集合IR1和IR2;将IR1和IR2根据匹配坐标区间长度最大的规则映射到独立内含子集合intron cluster上,再取两者交集,得到交集IC;然后,将交集IC中的各内含子坐标进行IGV可视化、图片裁剪和合并等操作,从而得到真实数据的内含子序列读数分布模式图片测试集real_test;S7. Obtain the intron sequence reading distribution pattern picture test set of the real data obtained in step S1; specifically, input the sequence file APP of the real data obtained in step S1 into the prediction tool iREAD and the prediction tool IRFinder, and obtain two groups of introns respectively. Intron retention prediction sets IR1 and IR2; map IR1 and IR2 to the independent intron set intron cluster according to the rule with the largest matching coordinate interval length, and then take the intersection of the two to obtain the intersection IC; The intron coordinates are subjected to IGV visualization, image cropping and merging, etc., to obtain the image test set real_test of the intron sequence read distribution pattern of the real data;

S8.根据步骤S5得到的神经网络内含子保留预测模型,在步骤S7得到的测试集上预测内含子保留结果,从而得到预测内含子保留集合;S8. according to the neural network intron retention prediction model obtained in step S5, predict the intron retention result on the test set obtained in step S7, thereby obtaining the predicted intron retention set;

S9.获取步骤S8得到的预测内含子保留集合中,起始坐标外显子侧W1个碱基、内含子侧N1个碱基,共W1+N1个碱基的5'端序列;S9. For each intron in the predicted intron retention set obtained in step S8, obtain the 5'-end sequence of W1 bases on the exon side and N1 bases on the intron side of the start coordinate, W1+N1 bases in total;

S10.获取步骤S8得到的预测内含子保留集合中,终止坐标外显子侧W2个碱基、内含子侧N2个碱基,共W2+N2个碱基的3'端序列;S10. For each intron in the predicted intron retention set obtained in step S8, obtain the 3'-end sequence of W2 bases on the exon side and N2 bases on the intron side of the end coordinate, W2+N2 bases in total;

S11.根据步骤S9获得的W1+N1个碱基的5’端序列和步骤S10获得的W2+N2个碱基的3’端序列,计算剪接位点强度,从而得到5’端平均剪接位点强度值和3’端平均剪接位点强度值;具体为将步骤S9得到的5’端序列score5ss序列集合和步骤S10得到的3’端序列score3ss序列集合输入到MaxEntScan模型中,采用最大熵模型进行打分,从而得到给定的剪接位点强度值;然后对5’端序列和3’端序列所对应的剪接位点强度取平均值,从而得到最终的5’端平均剪接位点强度值和3’端平均剪接位点强度值;S11. According to the 5'-end sequence of W1+N1 bases obtained in step S9 and the 3'-end sequence of W2+N2 bases obtained in step S10, calculate the splice site intensity, thereby obtaining the average splice site at the 5' end The intensity value and the 3'-end average splice site intensity value; specifically, the 5'-end sequence score5ss sequence set obtained in step S9 and the 3'-end sequence score3ss sequence set obtained in step S10 are input into the MaxEntScan model, and the maximum entropy model is used to carry out Score to obtain a given splice site intensity value; then average the splice site intensities corresponding to the 5'-end sequence and the 3'-end sequence to obtain the final 5'-end average splice site intensity value and 3 ' end average splice site strength value;

S12.根据步骤S11得到的5’端平均剪接位点强度值和3’端平均剪接位点强度值,对步骤S5建立的神经网络内含子保留预测模型进行评价;具体为若神经网络内含子保留预测模型的5’端平均剪接位点强度值和3’端平均剪接位点强度值越小,则神经网络内含子保留预测模型的预测效果越好;S12. Evaluate the neural network intron retention prediction model established in step S5 according to the average splice site intensity value at the 5' end and the average splice site intensity value at the 3' end obtained in step S11; specifically, if the neural network contains The smaller the 5'-end average splice site intensity value and the 3'-end average splice site intensity value of the intron retention prediction model, the better the prediction effect of the neural network intron retention prediction model;

S13.采用步骤S5得到的神经网络内含子保留预测模型,对内含子保留结果进行预测。S13. Use the neural network intron retention prediction model obtained in step S5 to predict the intron retention result.

Claims (14)

1.一种内含子保留预测模型建立方法,包括如下步骤:1. a method for establishing an intron retention prediction model, comprising the steps: S1.收集内含子保留相关的模拟数据和真实数据;S1. Collect simulated data and real data related to intron retention; S2.定义基因组中所有独立内含子集合并作为标准模板;S2. Define all independent intron sets in the genome and use them as standard templates; S3.获取步骤S1得到的模拟数据中所设定的内含子序列读数分布模式图片数据集,并进行预处理得到处理后的数据集;S3. Obtain the intron sequence reading distribution pattern picture data set set in the simulation data obtained in step S1, and perform preprocessing to obtain the processed data set; S4.将步骤S3得到的处理后的数据集按照设定比例划分为训练集和测试集;S4. Divide the processed data set obtained in step S3 into a training set and a test set according to a set ratio; S5.采用步骤S4得到的训练集训练神经网络模型,从而得到最终建立的神经网络内含子保留预测模型。S5. Use the training set obtained in step S4 to train the neural network model, thereby obtaining the finally established neural network intron retention prediction model. 2.根据权利要求1所述的内含子保留预测模型建立方法,其特征在于还包括如下步骤:2. intron retention prediction model establishment method according to claim 1 is characterized in that also comprising the steps: S6.根据步骤S5得到的神经网络内含子保留预测模型,在步骤S4得到的测试集上计算神经网络内含子保留预测模型的评价参数;S6. According to the neural network intron retention prediction model obtained in step S5, the evaluation parameters of the neural network intron retention prediction model are calculated on the test set obtained in step S4; S7.获取步骤S1得到的真实数据的内含子序列读数分布模式图片测试集;S7. Obtain the intron sequence reading distribution pattern picture test set of the real data obtained in step S1; S8.根据步骤S5得到的神经网络内含子保留预测模型,在步骤S7得到的测试集上预测内含子保留结果,从而得到预测内含子保留集合;S8. according to the neural network intron retention prediction model obtained in step S5, predict the intron retention result on the test set obtained in step S7, thereby obtaining the predicted intron retention set; S9.获取步骤S8得到的预测内含子保留集合中,启示坐标外显子侧W1个碱基、内含子侧N1个碱基,共W1+N1个碱基的5’端序列;S9. 
Obtain the predicted intron retention set obtained in step S8, revealing the coordinates of W1 bases on the exon side and N1 bases on the intron side, a total of W1+N1 bases 5'-end sequence; S10.获取步骤S8得到的预测内含子保留集合中,启示坐标外显子侧W2个碱基、内含子侧N2个碱基,共W2+N2个碱基的3’端序列;S10. Obtain the predicted intron retention set obtained in step S8, revealing the coordinates of the exon side W2 bases, the intron side N2 bases, a total of W2+N2 bases 3'-end sequences; S11.根据步骤S9获得的W1+N1个碱基的5’端序列和步骤S10获得的W2+N2个碱基的3’端序列,计算剪接位点强度,从而得到5’端平均剪接位点强度值和3’端平均剪接位点强度值;S11. According to the 5'-end sequence of W1+N1 bases obtained in step S9 and the 3'-end sequence of W2+N2 bases obtained in step S10, calculate the splice site intensity, thereby obtaining the average splice site at the 5' end Intensity value and average splice site intensity value at 3' end; S12.根据步骤S11得到的5’端平均剪接位点强度值和3’端平均剪接位点强度值,对步骤S5建立的神经网络内含子保留预测模型进行评价。S12. According to the average splice site intensity value at the 5' end and the average splice site intensity value at the 3' end obtained in step S11, evaluate the neural network intron retention prediction model established in step S5. 3.根据权利要求2所述的内含子保留预测模型建立方法,其特征在于步骤S1所述的收集内含子保留相关的模拟数据和真实数据,具体为采用BEER算法生成含有确定内含子数目的模拟数据序列文件SIMU30;所述模拟数据序列文件SIMU30的测序深度为三千万,读数长度为100个碱基,设定生成基因15000个,内含子69338个;以及从阿尔茨海默病加速药物合作项目的Tau和APP小鼠模型研究中的一个真实数据序列文件APP,测序深度为一亿,读数长度为101个碱基。3. 
intron retention prediction model establishment method according to claim 2 is characterized in that the described collection of step S1 retains relevant simulated data and real data, and is specially for adopting BEER algorithm to generate and contain certain introns The number of simulated data sequence files SIMU30; the sequencing depth of the simulated data sequence file SIMU30 is 30 million, the read length is 100 bases, and it is set to generate 15,000 genes and 69,338 introns; A real data sequence file APP in the Tau and APP mouse model research of the Disease Accelerating Drug Cooperation Project, with a sequencing depth of 100 million and a read length of 101 bases. 4.根据权利要求3所述的内含子保留预测模型建立方法,其特征在于步骤S2所述的定义基因组中所有独立内含子集合并作为标准模板,具体为采用如下步骤进行定义:4. intron retention prediction model establishment method according to claim 3 is characterized in that all independent intron sets in the defined genome described in step S2 are combined as standard template, specifically adopt the following steps to define: A.从GRCm38小鼠基因组的release-75版本的注释gtf文件,提取所有的独立内含子集合Independent_intron;所述独立内含子的定义为不与任何同型外显子重叠的内含子;A. From the annotated gtf file of the release-75 version of the GRCm38 mouse genome, extract all independent intron sets Independent_intron; the independent introns are defined as introns that do not overlap with any homotypic exons; B.在步骤A得到的独立内含子集合Independent_intron中,以基因为单位,合并坐标区间有重叠的内含子,得到最终的独立内含子集合intron cluster。B. In the independent intron set Independent_intron obtained in step A, take the gene as the unit, merge the overlapping introns in the coordinate interval, and obtain the final independent intron set intron cluster. 5.根据权利要求4所述的内含子保留预测模型建立方法,其特征在于步骤A所述的提取所有的独立内含子集合Independent_intron,具体为合并一个染色体中的所有外显子,然后从基因区域删除所有外显子,从而得到所有的独立内含子。5. 
The method for establishing an intron retention prediction model according to claim 4, wherein the extraction of all independent intron sets Independent_intron described in step A is specifically merging all exons in a chromosome, and then extracting all the exons from the chromosome. The gene region deletes all exons, resulting in all independent introns. 6.根据权利要求5所述的内含子保留预测模型建立方法,其特征在于步骤S3所述的获取步骤S1得到的模拟数据中所设定的内含子序列读数分布模式图片数据集,并进行预处理得到处理后的数据集,具体为采用如下步骤获取数据集并进行数据:6. The method for establishing an intron retention prediction model according to claim 5, characterized in that the intron sequence reading distribution pattern picture data set set in the simulation data obtained by the obtaining step S1 of the step S3, and The processed data set is obtained by preprocessing, specifically, the following steps are used to obtain the data set and perform the data: a.将步骤S1得到的模拟数据序列文件SIMU30中的每个内含子进行IGV可视化,得到初步的可视化图像;a. Perform IGV visualization on each intron in the simulation data sequence file SIMU30 obtained in step S1 to obtain a preliminary visualization image; b.分别保存每个内含子5’端和3’端左、右各20个碱基,长度一共为40个碱基的两段序列可视化图像;可视化图像的高度为100mm,同时对代表碱基丰度的条形图高度进行标准化处理;b. Save the visual images of two sequences of 20 bases on the left and right sides of the 5' end and 3' end of each intron, with a total length of 40 bases; the height of the visual image is 100mm, and the representative base The bar graph height of base abundance is normalized; c.对于步骤b得到的图像,裁剪整张图像的纵长为131~231像素的部分,以及横长280~1070像素的部分;c. For the image obtained in step b, crop the part with a vertical length of 131-231 pixels and a part with a horizontal length of 280-1070 pixels; d.将步骤c裁剪得到的图像进行横向合并,从而得到最终的处理后的数据集。d. Merge the images cropped in step c horizontally to obtain the final processed data set. 7.根据权利要求6所述的内含子保留预测模型建立方法,其特征在于步骤S4所述的将步骤S3得到的处理后的数据集按照设定比例划分为训练集和测试集,具体为在步骤S1得到的模拟数据序列文件SIMU30中,定义序列总读数大于第一设定值、FPKM大于第二设定值且连续读数大于第三设定值的内含子为正样本,剩余的内含子为负样本;然后在正负样本中,随机抽取X2个正样本和X2个负样本,构成最终的数据集;然后按照设定的比例将数据集划分为训练集和测试集;X2为正整数。7. 
intron retention prediction model establishment method according to claim 6 is characterized in that described in step S4, the processed data set obtained in step S3 is divided into training set and test set according to a set ratio, and is specifically In the simulated data sequence file SIMU30 obtained in step S1, define the introns whose total sequence readings are greater than the first set value, the FPKM is greater than the second set value, and the continuous readings are greater than the third set value as positive samples, and the remaining introns are defined as positive samples. Contain is a negative sample; then, in the positive and negative samples, randomly select X2 positive samples and X2 negative samples to form the final data set; then divide the data set into training set and test set according to the set ratio; X2 is positive integer. 8.根据权利要求7所述的内含子保留预测模型建立方法,其特征在于步骤S5所述的神经网络模型,具体为VGG16网络结构模型。8 . The method for establishing an intron retention prediction model according to claim 7 , wherein the neural network model described in step S5 is specifically a VGG16 network structure model. 9 . 9.根据权利要求8所述的内含子保留预测模型建立方法,其特征在于步骤S5所述的采用步骤S4得到的训练集训练神经网络模型,从而得到最终建立的神经网络内含子保留预测模型,具体为采用如下步骤训练模型:9. 
intron retention prediction model establishment method according to claim 8 is characterized in that the training set training neural network model that adopts step S4 to obtain described in step S5, thereby obtains the neural network intron retention prediction finally established The model is specifically trained by the following steps: (1)获得在ImageNet任务上已经训练好的VGG16网络结构模型以及对应的权重参数文件;所述网络结构模型工包括13个卷积层;(1) Obtain the VGG16 network structure model that has been trained on the ImageNet task and the corresponding weight parameter file; the network structure model includes 13 convolutional layers; (2)加载步骤(1)得到的网络及权重作为预训练网络,但冻结该网络从而保证该网络不参与训练;(2) Load the network and weight obtained in step (1) as a pre-training network, but freeze the network to ensure that the network does not participate in training; (3)定义一个二分类网络,在步骤S4得到的训练集上进行训练;所述二分类网络共有3层,前2层为全连接层,神经元个数分别为256和64,每层后面接一个Dropout层防止过拟合,随机丢弃神经元的概率分别设为0.5和0.3;最后一层为sigmoid层,用于二分类;(3) Define a two-class network, and perform training on the training set obtained in step S4; the two-class network has three layers in total, the first two layers are fully connected layers, and the number of neurons is 256 and 64 respectively. 
A Dropout layer is followed to prevent overfitting, and the probability of randomly discarding neurons is set to 0.5 and 0.3 respectively; the last layer is a sigmoid layer for binary classification; (4)分类网络训练好后,解冻预训练网络的后3层卷积层,再次用步骤S4所得训练集对分类网络和预训练网络一起训练,并调整权重;(4) After the classification network is trained, unfreeze the last three convolution layers of the pre-training network, use the training set obtained in step S4 again to train the classification network and the pre-training network together, and adjust the weights; (5)设定模型训练过程的参数如下:(5) The parameters of the model training process are set as follows: 模型训练总的参数数目为3300万,其中可训练参数数目为2600万,不可训练参数数目为700万;The total number of parameters for model training is 33 million, of which the number of trainable parameters is 26 million, and the number of non-trainable parameters is 7 million; 损失函数为二分类交叉熵损失,计算公式为The loss function is the two-category cross entropy loss, and the calculation formula is
Loss = -∑i [ti·log(yi) + (1-ti)·log(1-yi)]
其中i为每个样本,ti为样本i的真实标签;yi为样本i的预测标签;where i indexes the samples, ti is the true label of sample i and yi is the predicted label of sample i; 优化器为RMSprop,学习率为2e-5,迭代次数为30;The optimizer is RMSprop, the learning rate is 2e-5, and the number of iterations is 30; 评价指标为accuracy,计算公式为:The evaluation metric is accuracy, calculated as:
accuracy = (Truepositive + Truenegative) / Allsamples
其中Truepositive为预测为正且真实为正的样本数;Truenegative为预测为负且真实为负的样本数;Allsamples为总样本数;Among them, Truepositive is the number of samples predicted positive and actually positive; Truenegative is the number of samples predicted negative and actually negative; Allsamples is the total number of samples; 设置ReduceLROnPlateau每2次迭代监测一次验证指标,若指标未改善,则将学习率降低50%;Set ReduceLROnPlateau to monitor the validation metric every 2 iterations; if the metric has not improved, the learning rate is reduced by 50%; 设置若评价指标accuracy在10次迭代中均未提升,则提前停止迭代。If the evaluation metric accuracy does not improve within 10 iterations, training is stopped early.
10. The method for establishing an intron retention prediction model according to claim 9, wherein computing the evaluation parameter of the neural network intron retention prediction model on the test set obtained in step S4, described in step S6, specifically comprises computing the AUC value of the neural network intron retention prediction model on the test set obtained in step S4.
11. The method for establishing an intron retention prediction model according to claim 10, wherein obtaining the intron sequence read distribution pattern image test set of the real data from step S1, described in step S7, specifically comprises: inputting the sequence file APP of the real data obtained in step S1 into the prediction tool iREAD and the prediction tool IRFinder to obtain two intron retention prediction sets, IR1 and IR2, respectively; mapping IR1 and IR2 onto the independent intron set (intron cluster) according to the rule of the largest matching coordinate interval length, and taking the intersection of the two to obtain the intersection IC; and then performing IGV visualization, image cropping, and image merging on the coordinates of each intron in the intersection IC, thereby obtaining the intron sequence read distribution pattern image test set real_test of the real data.
12.
The method for establishing an intron retention prediction model according to claim 11, wherein computing the splice site strength from the 5'-end sequence of W1+N1 bases obtained in step S9 and the 3'-end sequence of W2+N2 bases obtained in step S10, described in step S11, to obtain the 5'-end average splice site strength value and the 3'-end average splice site strength value, specifically comprises: inputting the 5'-end sequence set score5ss obtained in step S9 and the 3'-end sequence set score3ss obtained in step S10 into the MaxEntScan model, and scoring with the maximum entropy model to obtain the strength value of each given splice site; and then averaging the splice site strengths corresponding to the 5'-end sequences and to the 3'-end sequences, thereby obtaining the final 5'-end average splice site strength value and 3'-end average splice site strength value.
13. The method for establishing an intron retention prediction model according to claim 12, wherein evaluating the neural network intron retention prediction model established in step S5 according to the 5'-end average splice site strength value and the 3'-end average splice site strength value obtained in step S11, described in step S12, specifically comprises: the smaller the 5'-end average splice site strength value and the 3'-end average splice site strength value obtained for the neural network intron retention prediction model, the better its prediction performance.
14. A prediction method comprising the method for establishing an intron retention prediction model according to any one of claims 1 to 13, further comprising the following step:
S13. using the neural network intron retention prediction model obtained in step S5 to predict the intron retention result.
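The mapping-and-intersection step of claim 11 can be sketched in a few lines of Python: each interval predicted by iREAD (IR1) or IRFinder (IR2) is assigned to the independent intron (intron cluster) with which it shares the longest coordinate overlap, and the two mapped sets are then intersected to give IC. The cluster names and coordinates below are invented for illustration, and parsing of the two tools' actual output is omitted:

```python
def overlap(a, b):
    """Length of the overlap between two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def map_to_clusters(predictions, clusters):
    """Map each predicted interval to the cluster with the largest overlap."""
    mapped = set()
    for interval in predictions:
        best = max(clusters, key=lambda name: overlap(interval, clusters[name]))
        if overlap(interval, clusters[best]) > 0:
            mapped.add(best)
    return mapped

# Hypothetical independent intron set (intron cluster): name -> (start, end)
clusters = {"intron_A": (100, 200), "intron_B": (300, 450), "intron_C": (500, 650)}

IR1 = [(105, 195), (310, 440)]  # e.g. iREAD predictions
IR2 = [(300, 455), (505, 640)]  # e.g. IRFinder predictions

# Consensus set IC used to build the real_test image set
IC = map_to_clusters(IR1, clusters) & map_to_clusters(IR2, clusters)
```

With these invented coordinates, IR1 maps to {intron_A, intron_B} and IR2 maps to {intron_B, intron_C}, so the intersection IC contains only intron_B.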
CN202010146731.2A 2020-03-05 2020-03-05 Method for establishing intron retention prediction model and its prediction method Active CN111370055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010146731.2A CN111370055B (en) 2020-03-05 2020-03-05 Method for establishing intron retention prediction model and its prediction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010146731.2A CN111370055B (en) 2020-03-05 2020-03-05 Method for establishing intron retention prediction model and its prediction method

Publications (2)

Publication Number Publication Date
CN111370055A true CN111370055A (en) 2020-07-03
CN111370055B CN111370055B (en) 2023-05-23

Family ID: 71208615

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010146731.2A Active CN111370055B (en) 2020-03-05 2020-03-05 Method for establishing intron retention prediction model and its prediction method

Country Status (1)

Country Link
CN (1) CN111370055B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220082545A (en) * 2020-12-10 2022-06-17 중앙대학교 산학협력단 Method for diagnosing degenerative brain disease by dete cting intron retention using transcriptome analysis
WO2023238973A1 (en) * 2022-06-10 2023-12-14 중앙대학교 산학협력단 Method for diagnosing degenerative brain disease by detecting intron retention using transcriptome analysis

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999066302A2 (en) * 1998-06-17 1999-12-23 Musc Foundation For Research Development Recognition of protein coding regions in genomic dna sequences
US20030077586A1 (en) * 2001-08-30 2003-04-24 Compaq Computer Corporation Method and apparatus for combining gene predictions using bayesian networks
WO2008097632A2 (en) * 2007-02-08 2008-08-14 Jiv An Biologics, Inc. Methods for determining splice variant types and amounts
US20120185172A1 (en) * 2011-01-18 2012-07-19 Barash Joseph Method, system and apparatus for data processing
CN105975809A (en) * 2016-05-13 2016-09-28 万康源(天津)基因科技有限公司 SNV detection method affecting RNA splicing
CN107849547A (en) * 2015-05-16 2018-03-27 建新公司 The gene editing of deep intragenic mutation
CN110010201A (en) * 2019-04-16 2019-07-12 山东农业大学 A kind of RNA alternative splicing site identification method and system
WO2019226804A1 (en) * 2018-05-23 2019-11-28 Envisagenics, Inc. Systems and methods for analysis of alternative splicing
CN110800062A (en) * 2017-10-16 2020-02-14 因美纳有限公司 Deep convolutional neural network for variant classification

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999066302A2 (en) * 1998-06-17 1999-12-23 Musc Foundation For Research Development Recognition of protein coding regions in genomic dna sequences
US20030077586A1 (en) * 2001-08-30 2003-04-24 Compaq Computer Corporation Method and apparatus for combining gene predictions using bayesian networks
WO2008097632A2 (en) * 2007-02-08 2008-08-14 Jiv An Biologics, Inc. Methods for determining splice variant types and amounts
US20120185172A1 (en) * 2011-01-18 2012-07-19 Barash Joseph Method, system and apparatus for data processing
CN107849547A (en) * 2015-05-16 2018-03-27 建新公司 The gene editing of deep intragenic mutation
CN105975809A (en) * 2016-05-13 2016-09-28 万康源(天津)基因科技有限公司 SNV detection method affecting RNA splicing
CN110800062A (en) * 2017-10-16 2020-02-14 因美纳有限公司 Deep convolutional neural network for variant classification
WO2019226804A1 (en) * 2018-05-23 2019-11-28 Envisagenics, Inc. Systems and methods for analysis of alternative splicing
CN112912961A (en) * 2018-05-23 2021-06-04 恩维萨基因学公司 Systems and methods for analyzing alternative splicing
CN110010201A (en) * 2019-04-16 2019-07-12 山东农业大学 A kind of RNA alternative splicing site identification method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HONG-DONG LI et al.: "iREAD: a tool for intron retention detection from RNA-seq data" *
XING Yongqiang; ZHANG Lirong; LUO Liaofu; CHEN Wei: "Prediction of alternative splice sites of cassette exons and intron retention in the mouse genome" *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220082545A (en) * 2020-12-10 2022-06-17 중앙대학교 산학협력단 Method for diagnosing degenerative brain disease by dete cting intron retention using transcriptome analysis
KR102605084B1 (en) 2020-12-10 2023-11-24 중앙대학교 산학협력단 Method for diagnosing degenerative brain disease by dete cting intron retention using transcriptome analysis
WO2023238973A1 (en) * 2022-06-10 2023-12-14 중앙대학교 산학협력단 Method for diagnosing degenerative brain disease by detecting intron retention using transcriptome analysis

Also Published As

Publication number Publication date
CN111370055B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN104762402B (en) Method for rapidly detecting human genome single base mutation and micro-insertion deletion
CN108763865B (en) Integrated learning method for predicting DNA protein binding site
CN107833463A (en) Traffic signals Time segments division method and system based on two dimension cluster
WO2023217290A1 (en) Genophenotypic prediction based on graph neural network
CN103605970A (en) Drawing architectural element identification method and system based on machine learning
CN111582350A (en) An AdaBoost method and system for filter factor optimization based on distance weighted LSSVM
CN114446389B (en) Tumor neoantigen feature analysis and immunogenicity prediction tool and application thereof
CN111370055B (en) Method for establishing intron retention prediction model and its prediction method
WO2024187890A1 (en) Snp data-based prediction method, apparatus and device and readable storage medium
CN112182247B (en) A kind of genetic population map construction method, system, storage medium and electronic device
CN111180013B (en) Device for detecting blood disease fusion gene
CN116137061B (en) Training method and device for quantity statistical model, electronic equipment and storage medium
CN117992913A (en) Multimode data classification method based on bimodal attention fusion network
CN111863135A (en) False positive structural variation filtering method, storage medium and computing device
CN117238515A (en) Screening system for turner syndrome
CN115034812B (en) Steel industry sales volume prediction method and device based on big data
CN114092813B (en) An industrial park image extraction method, system, electronic device and storage medium
CN116957161A (en) Work order early warning method, equipment and computer readable storage medium
CN110889103B (en) Method and system for verifying sliding block and model training method thereof
CN115064270A (en) A method for predicting recurrence of liver cancer based on radiomics image features
CN115239733A (en) Crack detection method, crack detection device, terminal equipment and storage medium
Cai et al. Application and research progress of machine learning in bioinformatics
CN107194918A (en) Data analysing method and device
CN113469053A (en) Eye movement track identification method and system
CN113782092A (en) Method and device for generating life prediction model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant