CN101477089A - Discovery method for protein post-translational modification - Google Patents
- Publication number: CN101477089A (application numbers CNA2009100765888A, CN200910076588A)
- Authority: CN (China)
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention provides a method for discovering protein post-translational modifications. The method comprises the following steps: calculating spectrogram difference vectors between all spectrograms from the peptide chromatographic retention time and the peptide mass in the experimental tandem mass spectrum data of a protein sample; establishing candidate modification mass intervals that may contain a modification mass; estimating, on each candidate modification mass interval, the mixed distribution of the spectrogram difference vectors and calculating the standard deviation of each distribution within the mixed distribution, so as to determine from the standard deviations which distributions in the interval are caused by protein post-translational modification; calculating the mean of each distribution caused by protein post-translational modification; obtaining an accurate experimental mass value of the modification from the mass component of the mean; and obtaining the effect of the modification on the peptide chromatographic retention time from the retention time component of the mean. The method is efficient, accurate and robust.
Description
Technical Field
The present invention relates to post-translational modification of proteins in biogenetics, and in particular to a method for discovering post-translational modification of proteins.
Background
It is well known that the genetic information of most organisms is stored in DNA. DNA is transcribed into messenger RNA, and messenger RNA is translated into protein, so that genetic information passes from DNA to RNA to protein; this flow is known as the central dogma of molecular biology. During translation, the 20 amino acids are joined by peptide bonds into chain molecules called peptides, and a peptide whose molecular weight reaches a certain level is called a protein. After translation, most proteins have functional groups added to certain amino acids (e.g., an acetyl group added to the N-terminus of the protein), have other proteins or peptides attached, or undergo changes in the chemical properties or structure of their amino acids. Such changes are called chemical modifications and, because they occur after translation, they are also known as post-translational modifications of the protein. Post-translational modifications can alter the chemical nature of amino acids, change protein structure, and extend protein function. Many proteins carry out their important biological activities only after post-translational modification has taken place. In addition, some chemical modifications are introduced, intentionally or unintentionally, during the in vitro processing of protein samples.
There are hundreds of known types of protein modification, and determining which post-translational modifications have occurred in a protein sample is an important and difficult problem in protein identification. Liquid chromatography coupled with mass spectrometry, combined with database search, is currently a common method for identifying proteins and their post-translational modifications in proteomics. In this method, experimental tandem mass spectra of a protein sample are acquired as follows: the protein sample is first hydrolyzed by a selected protease to form a peptide mixture; the peptide mixture is separated by liquid chromatography, and peptides with different physicochemical properties elute from the chromatographic column at different times (the time a peptide spends in the column is called its retention time); the eluting peptides continuously enter a mass spectrometer; in the mass spectrometer the peptides are ionized, peptide ions with a specific mass-to-charge ratio are fragmented under applied energy to form fragment ions, and the fragment ions are separated and detected to form a peptide fragment ion spectrum. This process yields the experimental tandem mass spectra of the protein. The amino acid sequence of a peptide can then be identified from its experimental tandem mass spectrum, and the protein identified in turn. The amino acid sequence of a peptide is identified from experimental tandem mass spectra by database search.
In the database search, the protein sequences stored in the database are hydrolyzed in silico into peptides, and the peptides are then theoretically fragmented to generate a theoretical tandem mass spectrum for each peptide. The theoretical tandem mass spectra are compared one by one with the experimental tandem mass spectra obtained by liquid chromatography-mass spectrometry; if the peptide sequence that produced an experimental tandem mass spectrum exists in the database, it can be found in this way. Because the number of theoretical tandem mass spectra is usually very large, the candidate peptides are generally filtered by the mass-to-charge ratio of the peptide ion in the experimental tandem mass spectrum in order to speed up the calculation: a theoretical tandem mass spectrum is compared with an experimental one only when the theoretical mass-to-charge ratio of the candidate peptide matches the experimentally measured mass-to-charge ratio of the peptide ion. It follows that if a post-translational modification carried by the actual peptide is not taken into account when candidate peptides are generated, the correct candidate peptide is likely never to enter the search space, and the peptide, the protein and its modifications cannot be identified. Even if a modified candidate peptide does enter the search space, the peptide sequence is difficult to identify correctly unless the type of modification and the site at which it occurs are specified correctly.
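The effect of the precursor-mass filter described above can be illustrated with a minimal Python sketch. The function name `matching_candidates`, the tolerance `tol`, the `mod_shifts` parameter and the toy peptide masses are all assumptions made for illustration; they are not taken from any search engine.

```python
def matching_candidates(candidates, observed_mass, mod_shifts=(0.0,), tol=0.02):
    """Return the candidate peptides whose theoretical mass, plus any of the
    declared modification mass shifts, matches the observed peptide mass.

    candidates  : list of (sequence, theoretical_mass_in_Da) pairs
    mod_shifts  : mass shifts (Da) the search is told to consider; 0.0 = unmodified
    tol         : hypothetical mass tolerance in Da
    """
    hits = []
    for seq, mass in candidates:
        if any(abs(mass + shift - observed_mass) <= tol for shift in mod_shifts):
            hits.append(seq)
    return hits


# Toy database: sequence labels and masses are placeholders, not real peptides.
database = [("ABCD", 1000.50), ("EFG", 800.40)]
```

With this sketch, a spectrum from a modified copy of ABCD (observed mass shifted by about 57.02 Da) passes the filter only when that shift is included in `mod_shifts`; otherwise the correct candidate never reaches the spectrum comparison stage.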
In current proteomics experiments, most of the spectra generated by the mass spectrometer cannot be interpreted: the spectrum identification rate is only 10% to 30%. One important reason is that the proteins carry unknown or unexpected modifications, so the correct candidate peptides cannot be found and the subsequent identification process is affected.
To identify proteins that have undergone post-translational modification, a common tandem mass spectrometry-based approach is to specify a number of variable modification types during the database search. Candidate peptides are then generated for both the modified and the unmodified state of each specified modification and, when a candidate peptide has several possible modification sites, for all possible combinations. This approach accounts for the dynamic nature of post-translational modifications (the same amino acid position may or may not be modified). However, hundreds of modification types occur naturally or are introduced artificially (the Unimod database contained 563 modification entries as of 28 July 2008), and most modifications have several specific sites. It is therefore impractical to consider many modification types in a database search: doing so causes a combinatorial explosion of the search space, greatly reduces search speed, and at the same time increases the false positive rate of the search results. Existing search engines such as SEQUEST and Mascot typically allow no more than 10 variable modification types to be specified, which clearly does not meet practical requirements. In general, the experimenter knows little about the modification types present in a protein sample and relies mainly on empirical guesses. In most cases, oxidation of methionine is the only variable modification specified in the database search. Other modification types present in the sample may thus be missed, and much of the mass spectral data generated by modified peptides cannot be interpreted.
To address these computational difficulties, researchers have proposed unrestrictive modification identification methods that aim to discover unknown or unexpected modifications present in a protein sample. MS-Alignment is currently the best known of these methods (ref: Tsur D., Tanner S., Zandi E., Bafna V., Pevzner P.A. Identification of post-translational modifications by blind search of mass spectra. Nature Biotechnology, 2005, 23(12): 1562-1567). MS-Alignment aligns theoretical mass spectra with experimental mass spectra in a manner similar to genomic sequence alignment, allowing arbitrary modifications to appear. However, MS-Alignment is computationally very expensive, because it lifts the peptide mass constraint when searching the database and uses a dynamic programming algorithm to compare the fragment ion spectra in tandem mass spectra. In addition, MS-Alignment requires that the experimental tandem spectra have a good signal-to-noise ratio and sufficient similarity to the theoretical tandem spectra. In practice, however, modified peptides often produce irregular or even incomplete fragmentation patterns under the influence of the modification. MS-Alignment is therefore limited in both speed and accuracy.
Disclosure of Invention
The invention aims to overcome the drawbacks of conventional methods, which rely on fragment ion spectrum information to detect protein post-translational modifications and therefore suffer from high computational complexity and from an accuracy that depends on the quality of the experimental tandem spectra, and thereby to provide an efficient and accurate method for discovering protein post-translational modifications.
In order to achieve the above object, the present invention provides a method for discovering a post-translational modification of a protein, comprising:
step 1), calculating spectrogram difference vectors between all spectrograms by using the peptide chromatogram retention time and the peptide mass in the experimental tandem mass spectrum data of the protein sample, wherein the spectrogram difference vectors represent the peptide mass difference and the peptide chromatogram retention time difference between two experimental tandem mass spectrograms;
step 2), establishing candidate modification mass intervals that may contain a modification mass;
step 3), on each candidate modification mass interval, estimating the mixed distribution of the spectrogram difference vectors, calculating the standard deviation of each distribution in the mixed distribution, and determining the distribution caused by the protein posttranslational modification in the candidate modification mass interval according to the standard deviation;
step 4), calculating the exact mass value of the post-translational modification of the protein and the effect of the post-translational modification of the protein on the retention time of the peptide chromatogram based on the property of the distribution comprising the post-translational modification of the protein.
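As a non-limiting illustration, the spectrogram difference vectors of step 1) can be sketched in Python as follows. The names `Spectrum`, `difference_vectors` and `max_dm`, and the sign convention (heavier peptide minus lighter peptide), are assumptions introduced for this sketch and are not specified by the method.

```python
from itertools import combinations
from typing import NamedTuple


class Spectrum(NamedTuple):
    peptide_mass: float      # peptide mass in Da
    retention_time: float    # chromatographic retention time


def difference_vectors(spectra, max_dm=None):
    """Return (delta_m, delta_rt) for every unordered pair of spectra.

    Each vector is computed as heavier minus lighter, so delta_m >= 0
    (an assumed convention). Pairs whose mass difference exceeds max_dm
    are skipped when max_dm is given.
    """
    ordered = sorted(spectra, key=lambda s: s.peptide_mass)
    diffs = []
    # combinations() on the sorted list yields (lighter, heavier) pairs.
    for light, heavy in combinations(ordered, 2):
        dm = heavy.peptide_mass - light.peptide_mass
        if max_dm is not None and dm > max_dm:
            continue
        diffs.append((dm, heavy.retention_time - light.retention_time))
    return diffs
```

The number of pairs grows quadratically with the number of spectra, which is one reason an optional cap such as `max_dm` may be useful in practice.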
In the above technical solution, before the step 1), redundant spectrogram data is removed from the experimental tandem mass spectrometry data set.
In the above technical solution, removing the redundant spectrogram data includes: comparing the peptide masses in the spectrogram data, treating spectrogram data with approximately equal peptide masses as similar spectrogram data, and keeping only one spectrogram from each set of similar spectrogram data.
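A minimal sketch of this redundancy removal is given below. The tolerance `mass_tol` and the choice of keeping the lowest-mass spectrum of each group are assumptions; the text only requires that one spectrum per group of similar masses be kept.

```python
def remove_redundant(spectra, mass_tol=0.01):
    """Keep one spectrum per group of approximately equal peptide masses.

    spectra  : list of (peptide_mass, retention_time) tuples
    mass_tol : hypothetical tolerance in Da for "approximately equal"
    """
    kept = []
    for mass, rt in sorted(spectra):
        # Compare against the last kept representative, not the previous
        # spectrum, so that long chains of near-equal masses do not merge.
        if kept and mass - kept[-1][0] <= mass_tol:
            continue
        kept.append((mass, rt))
    return kept
```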
In the above technical solution, further comprising:
step 5), inferring the type of post-translational modification of the protein from the exact mass value of the post-translational modification of the protein and the effect on the retention time of the peptide chromatogram.
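Step 5) can be illustrated by a simple lookup of the measured mass shift against known monoisotopic modification masses. The table below lists a few well-known values; the function name `infer_modification` and the tolerance are assumptions, and this sketch uses only the mass component, leaving the retention time effect as an additional criterion for distinguishing candidates of similar mass.

```python
# Monoisotopic mass shifts (Da) of a few common modifications.
KNOWN_MODS = {
    "Methyl": 14.015650,
    "Oxidation": 15.994915,
    "Acetyl": 42.010565,
    "Carbamyl": 43.005814,
    "Carbamidomethyl": 57.021464,
    "Phospho": 79.966331,
}


def infer_modification(mass_shift, tol=0.01):
    """Return names of known modifications within tol Da of mass_shift,
    closest match first. tol is a hypothetical tolerance."""
    candidates = [
        (abs(mass_shift - mass), name)
        for name, mass in KNOWN_MODS.items()
        if abs(mass_shift - mass) <= tol
    ]
    return [name for _, name in sorted(candidates)]
```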
In the above technical solution, in the step 2), a distribution histogram of peptide mass differences is established according to the spectrogram difference vector, peptide mass differences with high occurrence frequency are screened from the distribution histogram of peptide mass differences, and the candidate modification mass interval is established on the distribution histogram of peptide mass differences by using the peptide mass differences with high occurrence frequency.
In the above technical solution, the screening of peptide mass differences with high frequency of occurrence from the distribution histogram of peptide mass differences includes:
step 2-1-1), establishing mass windows centred on integer mass values on the distribution histogram of the peptide mass difference;
step 2-1-2), extracting the peptide mass difference $\Delta m_f$ with the highest frequency of occurrence in each mass window;
step 2-1-3), for the most frequent peptide mass differences $\Delta m_f$ of all windows, establishing a distribution histogram of their occurrence counts $\mathrm{counts}(\Delta m_f)$, estimating the random distribution of that histogram, and calculating from the estimate the p-value with which the occurrence count $\mathrm{counts}(\Delta m_f)$ of the most frequent peptide mass difference in a given window comes from the random distribution;
step 2-1-4), considering a peptide mass difference to occur with high frequency when the p-value of its $\mathrm{counts}(\Delta m_f)$ is less than a first threshold value.
In the above technical solution, the first threshold includes 0.01.
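Steps 2-1-1) to 2-1-4) can be sketched as follows. The text does not fix the form of the random distribution, so this sketch assumes, purely for illustration, a normal distribution whose centre and spread are estimated robustly from the median and the median absolute deviation of the per-window counts. The bin width and the function names are likewise assumptions.

```python
import math
from collections import Counter
from statistics import median


def most_frequent_per_window(delta_masses, bin_width=0.01):
    """For each window centred on an integer mass, return
    {centre: (delta_m_f, counts)} for its fullest histogram bin."""
    bins = Counter(round(dm / bin_width) for dm in delta_masses)
    best = {}
    for b, count in bins.items():
        dm = b * bin_width
        centre = round(dm)               # nearest integer mass
        if centre not in best or count > best[centre][1]:
            best[centre] = (dm, count)
    return best


def frequent_differences(best, threshold=0.01):
    """Return the windows whose top count is unlikely under the assumed
    random (normal) distribution of counts: {centre: (dm, counts, p)}."""
    counts = [c for _, c in best.values()]
    mu = median(counts)
    mad = median([abs(c - mu) for c in counts])
    sigma = max(1.4826 * mad, 1.0)       # guard against a zero spread
    hits = {}
    for centre, (dm, c) in best.items():
        # One-sided upper-tail p-value of a normal distribution.
        p = 0.5 * math.erfc((c - mu) / (sigma * math.sqrt(2.0)))
        if p < threshold:
            hits[centre] = (dm, c, p)
    return hits
```

The default `threshold=0.01` mirrors the first threshold value stated above; any other estimate of the random distribution could be substituted without changing the overall screening logic.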
In the above technical solution, the establishing the candidate modification mass interval on the distribution histogram of the peptide mass difference by using the peptide mass difference with the high frequency of occurrence comprises:
step 2-2-1), finding the integer mass value nearest to each peptide mass difference with high frequency of occurrence;
step 2-2-2), taking a range of ±ε Da around that integer mass value as a candidate modification mass interval; ε may take any value between 0.3 and 0.5.
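Steps 2-2-1) and 2-2-2) amount to the following short sketch; the function name `candidate_intervals` and the default ε of 0.4 (one value inside the stated 0.3 to 0.5 range) are assumptions.

```python
def candidate_intervals(frequent_dms, eps=0.4):
    """Build one interval [k - eps, k + eps] per distinct nearest integer
    mass k of the high-frequency peptide mass differences (in Da)."""
    centres = sorted({round(dm) for dm in frequent_dms})
    return [(c - eps, c + eps) for c in centres]
```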
In the above technical solution, in the step 2), a mass window is established on the whole distribution interval of the peptide mass difference, and all the established mass windows are used as candidate modified mass intervals.
In the above technical solution, the establishing of the mass window includes establishing a mass window with a width of 2 epsilon and taking an integer mass as a center; the epsilon may comprise any value between 0.3 and 0.5.
In the above technical solution, in the step 3), the mixed distribution includes a random distribution and n distributions caused by modification, and the calculating of a standard deviation of each distribution in the mixed distribution and the determining, from the standard deviation, of the distribution caused by the post-translational modification of the protein in the candidate modification mass interval includes:
step 3-1-1), setting the value of n to 1;
step 3-1-2), estimating the parameters of the probability density function of the spectrogram difference vectors in the current candidate modification mass interval, and selecting the n distributions with the smallest standard deviations as the candidate distributions caused by modification; the probability density function is:
$$f(\Delta)=\alpha_{Rand}\,f_{Rand}(\Delta)+\sum_{j=1}^{n}\alpha_{Mod,j}\,f_{Mod,j}(\Delta),\qquad \alpha_{Rand}+\sum_{j=1}^{n}\alpha_{Mod,j}=1$$
wherein $f_{Rand}(\Delta)$ represents the probability density function of the random distribution within the candidate modification mass interval, $f_{Mod,j}(\Delta)$ represents the probability density function of the distribution caused by the j-th modification in the candidate modification mass interval, and the $\alpha$ are mixing coefficients;
step 3-1-3), examining, in the estimated parameters, the standard deviations $\sigma_{m,j}$ and $\sigma_{Rt,j}$ of the components $\Delta m$ and $\Delta Rt$ of the j-th modification distribution; if $\sigma_{m,j}<T_m$ and $\sigma_{Rt,j}<T_{Rt}$ for all $j=1,2,\ldots,n$, the mass interval is considered to contain at least n modifications, the value of n is increased by 1, and step 3-1-2) is executed again, wherein $T_m$ and $T_{Rt}$ are two thresholds; when there exists a j-th modification distribution such that $\sigma_{m,j}\ge T_m$ or $\sigma_{Rt,j}\ge T_{Rt}$, the mass interval is confirmed to contain only n-1 modifications, the value of n is decreased by 1, and the operation ends after the parameter estimation of step 3-1-2) has been executed once more.
In the above technical solution, in the step 3-1-2), an expectation-maximization algorithm is adopted to estimate the parameters of the probability density function of the spectrogram difference vectors in the current candidate modification mass interval.
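The control flow of steps 3-1-1) to 3-1-3) can be sketched independently of the estimator. In the sketch below, `fit_mixture` is a hypothetical caller-supplied function (for example an expectation-maximization fit) that returns, for a given n, the standard deviations of the n narrowest components; `max_n` is an added safety limit and not part of the described method.

```python
def count_modifications(fit_mixture, data, t_m, t_rt, max_n=10):
    """Determine how many modification distributions an interval contains.

    fit_mixture(data, n) -> list of n (sigma_m, sigma_rt) pairs, one per
                            candidate modification distribution (assumed API)
    t_m, t_rt            -> thresholds T_m and T_Rt
    Returns (n, final_fit), where final_fit is the re-estimated parameter
    list for the accepted n (empty when no modification is found).
    """
    n = 1
    while n <= max_n:
        sigmas = fit_mixture(data, n)
        if all(sm < t_m and srt < t_rt for sm, srt in sigmas):
            n += 1      # at least n modifications: try one more component
        else:
            break       # some component is too broad to be a modification
    n -= 1              # the last attempted n was one too many
    final_fit = fit_mixture(data, n) if n > 0 else []
    return n, final_fit
```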
In the above technical solution, in the step 3), the mixed distribution includes a random Gaussian distribution and a Gaussian distribution caused by modification, and the calculating of a standard deviation of each distribution in the mixed distribution and the determining, from the standard deviation, of the distribution caused by the post-translational modification of the protein in the candidate modification mass interval includes:
step 3-2-1), estimating, with an expectation-maximization algorithm, the parameters of the Gaussian probability density function of the spectrogram difference vectors in the current candidate modification mass interval; the probability density function is
$$f(\Delta)=\alpha_{Rand}\,f(\Delta\mid\mu_{Rand},\Sigma_{Rand})+\alpha_{Mod}\,f(\Delta\mid\mu_{Mod},\Sigma_{Mod})$$
$$\alpha_{Rand}+\alpha_{Mod}=1$$
wherein $\alpha_{Rand}$ and $\alpha_{Mod}$ are mixing coefficients, and $f(\Delta\mid\mu,\Sigma)$ is the probability density function of a two-dimensional Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$:
$$f(\Delta\mid\mu,\Sigma)=\frac{1}{2\pi\,|\Sigma|^{1/2}}\exp\!\left(-\tfrac{1}{2}(\Delta-\mu)^{T}\Sigma^{-1}(\Delta-\mu)\right)$$
step 3-2-2), when the standard deviation of one distribution in the estimated parameters is much smaller than that of the other distribution, the distribution with the smaller standard deviation is the distribution caused by the post-translational modification of the protein.
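The two-component case of steps 3-2-1) and 3-2-2) can be illustrated with a small pure-Python expectation-maximization sketch. The function names, the fixed iteration count, the small variance floor, the initial parameters and the synthetic data are all assumptions made for this illustration; a real interval would use the measured difference vectors and an initial mean taken, for example, from the most frequent mass difference.

```python
import math
import random


def gauss2d(x, mu, cov):
    """Density f(x | mu, Sigma) of a 2-D Gaussian; cov = (s_mm, s_mr, s_rr)."""
    s_mm, s_mr, s_rr = cov
    det = s_mm * s_rr - s_mr * s_mr
    dm, dr = x[0] - mu[0], x[1] - mu[1]
    q = (s_rr * dm * dm - 2.0 * s_mr * dm * dr + s_mm * dr * dr) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))


def em_two_gaussians(data, init, iters=60):
    """Fit a two-component 2-D Gaussian mixture.
    init and the result are lists of (alpha, mu, cov) tuples."""
    comps = list(init)
    for _ in range(iters):
        # E-step: responsibility of each component for each difference vector.
        resp = []
        for x in data:
            w = [a * gauss2d(x, mu, cov) for a, mu, cov in comps]
            total = sum(w) or 1e-300
            resp.append([wi / total for wi in w])
        # M-step: re-estimate mixing coefficient, mean and covariance.
        comps = []
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu_m = sum(r[k] * x[0] for r, x in zip(resp, data)) / nk
            mu_r = sum(r[k] * x[1] for r, x in zip(resp, data)) / nk
            s_mm = sum(r[k] * (x[0] - mu_m) ** 2 for r, x in zip(resp, data)) / nk + 1e-9
            s_rr = sum(r[k] * (x[1] - mu_r) ** 2 for r, x in zip(resp, data)) / nk + 1e-9
            s_mr = sum(r[k] * (x[0] - mu_m) * (x[1] - mu_r) for r, x in zip(resp, data)) / nk
            comps.append((nk / len(data), (mu_m, mu_r), (s_mm, s_mr, s_rr)))
    return comps


# Synthetic interval around 57 Da: broad background plus one tight cluster.
random.seed(0)
background = [(random.uniform(56.6, 57.4), random.gauss(0.0, 1000.0)) for _ in range(400)]
modified = [(random.gauss(57.0215, 0.003), random.gauss(20.0, 10.0)) for _ in range(100)]
init = [(0.5, (57.0, 0.0), (0.05, 0.0, 1.0e6)),      # broad "random" start
        (0.5, (57.02, 0.0), (1.0e-4, 0.0, 1.0e4))]   # narrow "modification" start
fitted = em_two_gaussians(background + modified, init)

# Step 3-2-2): the component with the much smaller standard deviation is
# taken as the modification; its mean gives the mass and retention shifts.
narrow = min(fitted, key=lambda c: c[2][0])
mass_shift, rt_shift = narrow[1]
```

The mean of the narrow component is the quantity from which the accurate mass value and the retention time effect are read off; the broad component absorbs unrelated spectrum pairs.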
The invention also provides a discovery device for protein posttranslational modification, which comprises: the device comprises a spectrogram difference vector calculation module, a candidate modification mass interval establishment module, a protein post-translation modification distribution discovery module and an accurate mass experimental value calculation module; wherein,
the spectrogram difference vector calculation module calculates spectrogram difference vectors among all spectrograms by using the peptide chromatogram retention time and the peptide mass in the experimental tandem mass spectrum data of the protein sample, wherein the spectrogram difference vectors represent the peptide mass difference and the peptide chromatogram retention time difference between two experimental tandem mass spectrograms;
the candidate modification quality interval establishing module establishes a candidate modification quality interval possibly containing modification quality;
the protein posttranslational modification distribution finding module estimates mixed distribution of the spectrogram difference vectors on each candidate modification mass interval, calculates standard deviation of each distribution in the mixed distribution, and determines the distribution caused by the protein posttranslational modification in the candidate modification mass interval according to the standard deviation;
the accurate mass experiment value calculation module calculates a mean value of distribution caused by the protein post-translational modification, obtains an accurate mass experiment value of the protein post-translational modification from a mass component of the mean value, and obtains an influence of the protein post-translational modification on a peptide chromatogram retention time from a retention time component of the mean value.
In the technical scheme, the device further comprises a redundant data removing module, and the redundant data removing module removes redundant spectrogram data in the experimental tandem mass spectrometry data set.
In the technical scheme, the device further comprises a protein post-translational modification type inference module, and the module infers the type of the protein post-translational modification according to the accurate mass experimental value of the protein post-translational modification and the influence on the retention time of the peptide chromatogram.
The invention also provides a protein identification method, which comprises the following steps:
step 1), determining the mass and type of protein post-translational modification by adopting the discovery method of protein post-translational modification;
step 2), in database search, the type of protein posttranslational modification found is designated as variable modification parameters, and the identification of modified peptides and proteins is realized.
The invention further provides a method for detecting modification-related spectrogram pairs, which comprises the following steps:
step 1), calculating parameter estimation in mixed distribution by adopting the discovery method of protein posttranslational modification;
step 2), calculating a difference vector of a pair of spectrograms;
step 3), calculating the posterior probability that the spectrogram pair is associated with the k-th modification by using the difference vector of the spectrogram pair and the parameter estimation in the mixed distribution.
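Under a mixture model, one natural way to obtain such a posterior probability is Bayes' rule applied to the estimated components; the sketch below assumes that reading. The function name and the representation of each component as a `(mixing_coefficient, density_function)` pair are illustrative choices only.

```python
def modification_posteriors(delta, alpha_rand, f_rand, mods):
    """Posterior probability of each modification component given one
    difference vector delta = (delta_m, delta_rt).

    alpha_rand, f_rand : mixing coefficient and density of the random part
    mods               : list of (alpha_k, f_k) pairs, one per modification
    Returns a list with P(k-th modification | delta) for every k.
    """
    weights = [alpha * f(delta) for alpha, f in mods]
    total = alpha_rand * f_rand(delta) + sum(weights)
    if total <= 0.0:
        return [0.0] * len(mods)
    return [w / total for w in weights]
```

A spectrogram pair could then be reported as related by the k-th modification when its posterior exceeds a chosen cut-off, which would itself be a user-defined parameter.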
The method of the invention has the following advantages:
1. Computational speed: the method clusters the spectrograms using only the two-dimensional information of peptide mass and peptide chromatographic retention time and does not use the complex fragment ion spectrum information, so it is fast to compute.
2. Computational accuracy: the method uses the peptide chromatographic retention time information and is therefore more accurate than methods that use the peptide mass information alone.
3. Computational robustness: modifications often have effects on the peptide fragmentation pattern that are difficult to predict, which lowers the accuracy of spectrogram clustering based on fragment ions. Because the method of the invention does not rely on fragment ion spectra, it is not affected by this. Experiments on real data show that the method can effectively find the modification types present in a sample and provide important guidance for peptide identification and spectrogram analysis.
Drawings
FIG. 1 is a schematic representation of a primary mass spectrum containing peptide ABCD, obtained by separating a peptide mixture according to mass-to-charge ratio in a mass spectrometer;
FIG. 2 is a schematic representation of a primary mass spectrum containing the modified peptide ABC'D, obtained by separating a peptide mixture according to mass-to-charge ratio in a mass spectrometer;
FIG. 3 is a flow chart of a method of discovering a post-translational modification of a protein of the invention;
FIG. 4 is an exemplary plot of a Δ m distribution histogram employed in the present invention;
FIG. 5 is an exemplary plot of the distribution histogram of $\mathrm{counts}(\Delta m_f)$ used in the present invention;
FIG. 6 is an exemplary diagram of a two-dimensional histogram of spectrogram difference vectors;
FIG. 7 is an exemplary plot of a scatter-histogram of spectral difference vectors.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
Given a protein sample, the sample is first enzymatically hydrolyzed using existing biochemical techniques to form a peptide mixture. The peptide mixture is then separated by liquid chromatography (peptides with different physicochemical properties have different chromatographic retention times), the eluted peptides are separated by a mass spectrometer according to their mass-to-charge ratio, and the separated peptides are broken into fragment ions and detected to form a peptide fragment ion spectrum. As a result, each experimental tandem mass spectrum generated by liquid chromatography-mass spectrometry carries three kinds of information: the peptide chromatographic retention time, the peptide mass, and the fragment ion spectrum. It should be noted that, since the charge number of a peptide ion can be determined in high-precision mass spectrum data, the peptide mass can be calculated from the mass-to-charge ratio of the peptide ion.
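The conversion from mass-to-charge ratio to peptide mass can be written as a one-line calculation. The sketch assumes positive-mode ions of the form [M + zH]^z+, in which each charge is carried by one proton; the function name is illustrative.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da


def peptide_mass(mz, charge):
    """Neutral peptide mass (Da) from the m/z of an [M + zH]^z+ ion."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return charge * (mz - PROTON_MASS)
```

For example, under this assumption a doubly charged ion observed at m/z 501.007276 corresponds to a neutral peptide mass of about 1000.0 Da.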
For example, a protein sample usually contains many different proteins (e.g., tens, hundreds, or thousands), and each protein is generally present in a large number of copies. If one of the proteins is assumed to have the amino acid sequence ABCDEFG......HIJK, the many copies of this protein in the sample can be represented as follows:
ABCDEFG......HIJK
ABCDEFG......HIJK
... ...
ABCDEFG......HIJK.
If a post-translational modification (hereinafter also simply called a modification) occurs at C in some copies of this protein, the protein sample also contains proteins of the following form (where the symbol ' denotes a modification):
ABC'DEFG......HIJK
ABC'DEFG......HIJK
... ...
ABC'DEFG......HIJK.
After enzymatic digestion, various peptides such as ABCD, EFG, and HIJK are produced. The peptide mixture formed by these peptides together with peptides from the digestion of other proteins is separated by liquid chromatography, and the separated peptides enter the mass spectrometer continuously, where they are ionized, scanned, and detected to generate primary mass spectra. Peptides with different physicochemical properties elute from the liquid chromatograph at different times, so different peptides may appear in different primary mass spectra. For example, peptide ABCD appears in one primary mass spectrum, as shown in FIG. 1, whereas the modified peptide ABC'D has different physicochemical properties because of the modification and therefore appears in another, later primary mass spectrum, as shown in FIG. 2. The masses of the two peptides differ by 57 daltons (Da) and their retention times differ by 0.01 seconds (s). The peptide ions can be further fragmented to generate fragment ion spectra; because the present invention does not use fragment ion spectrum information, this is not described further here. Through this combined chromatography-mass spectrometry process, the experimental tandem mass spectrum of peptide ABCD contains information such as the chromatographic retention time of ABCD, the peptide mass obtained when ABCD produces its primary mass spectrum, and the fragment ion spectrum of ABCD. The same holds for other peptides. Note that when a peptide is highly abundant in the sample, the mass spectrometer scans it many times and generates multiple repeated experimental tandem mass spectra for that peptide, which causes redundancy in the experimental tandem mass spectrometry data.
It should be noted that the protein sequences in the above examples are only hypothetical protein sequences constructed, and do not represent actual protein sequences.
Because modification is dynamic, modified and unmodified peptides often coexist, and the mass spectra generated by a peptide in the two states are related in peptide ion mass, chromatographic retention time, and fragment ion spectrum. For high-precision mass spectrometry data, the invention provides a method that clusters spectra using peptide mass and peptide chromatographic retention time information in order to discover the high-abundance modification types in a sample. In this method, each pair of spectra is represented as a two-dimensional vector consisting of the peptide mass difference and the peptide retention time difference. A pair of spectra generated from peptides that have the same sequence and differ only by a certain modification has a nearly fixed peptide mass difference and very close chromatographic retention times. Modification-related spectrogram pairs can be distinguished from random spectrogram pairs using a mixture distribution probability model. With this method, the modification mass can be determined accurately, the influence of the modification on retention time can be characterized, and modification-related spectrogram pairs can be found. Because the method uses only two kinds of information from the experimental tandem mass spectrometry data, peptide mass and peptide chromatographic retention time, and because the fragment ion spectrum contains far more data than these two, discarding the fragment ion spectrum data also improves the computational efficiency of the whole method. In this respect the method differs substantially from the prior-art MS-Alignment algorithm, which uses fragment ion spectrum information. Furthermore, the modification discovery method of the present invention does not require searching a protein sequence database and is therefore much more efficient than MS-Alignment.
The following describes a specific implementation of the method of the present invention with reference to fig. 3 in conjunction with the aforementioned example.
As described above, the experimental tandem mass spectrometry dataset generated from a protein sample contains redundant data. Data redundancy has two disadvantages. First, redundant spectra increase the amount of computation: as the subsequent description shows, the method must compute and process the peptide mass difference and the peptide chromatographic retention time difference for all spectrogram pairs, so the scale of the computation is proportional to the square of the number of spectra. Second, redundant spectra may adversely affect the distribution of the peptide mass differences and retention time differences. For these reasons, in a preferred embodiment of the invention, redundant data are removed before the peptide chromatographic retention times and peptide masses in the experimental tandem mass spectra are processed, which improves the efficiency and accuracy of the subsequent calculation. To remove redundant data, a function measuring the similarity between spectra can be defined, and among spectra whose similarity is sufficiently high only one is retained as the representative spectrum. A simple measure of spectral similarity compares peptide masses: two spectra are considered similar if their peptide masses are close enough, i.e., if the difference between their peptide masses lies within a specified range. For high-precision mass spectrometers (e.g., LTQ-FT and LTQ-Orbitrap), a mass difference range of 1-20 ppm is a suitable choice. The larger the mass difference range, the stronger the redundancy removal and the greater the data reduction.
Taking the above protein sample as an example, if the modified peptide ABC'D produces 5000 spectra and the unmodified peptide ABCD produces 10000 spectra, then after redundant data are removed one spectrum remains for each of ABCD and ABC'D. This clearly reduces the amount of computation significantly and thereby improves computational efficiency. Note that although removing redundant data by peptide mass is simple, it may also remove spectra that do not come from the same peptide. The modification discovery method of the invention is not affected by this, because a representative subset of the spectral data is sufficient for discovering the high-abundance modification types. Removing redundant data reduces the amount of computation and improves efficiency, but the method of the invention can still be carried out if this operation is skipped.
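The mass-based redundancy removal described above can be illustrated with a minimal Python sketch. The function name `remove_redundant`, the tuple layout, and the default tolerance of 10 ppm (a value inside the 1-20 ppm range mentioned above) are assumptions made for illustration, and the greedy comparison against the last retained representative is only one possible way to group similar spectra.

```python
def remove_redundant(spectra, tol_ppm=10.0):
    """Keep one representative per group of spectra with near-identical peptide mass.

    spectra: list of (peptide_mass_da, retention_time_s) tuples (assumed layout).
    tol_ppm: hypothetical mass tolerance in parts per million.
    """
    kept = []
    for mass, rt in sorted(spectra):
        # Masses are visited in ascending order, so it is enough to compare
        # with the most recently kept representative.
        if kept and (mass - kept[-1][0]) / kept[-1][0] * 1e6 <= tol_ppm:
            continue  # similar enough to the representative: treat as redundant
        kept.append((mass, rt))
    return kept
```

A larger `tol_ppm` merges more spectra into each representative, which matches the observation that a wider mass difference range gives a stronger reduction.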
In implementing the method of the present invention, the spectrogram difference vectors (denoted Δ) between all pairs of spectra are first calculated from the peptide mass and the peptide chromatographic retention time of each spectrum in the experimental tandem mass spectrometry dataset. A spectrogram difference vector is a two-dimensional vector with two components: the peptide mass difference (denoted Δm) and the chromatographic retention time difference (denoted ΔRt). Each such vector represents a spectrogram pair consisting of any two spectra in the dataset. A spectrogram pair represented by a two-dimensional spectrogram difference vector has the following expression:
Δ=<Δm,ΔRt> (1)
where Δ m is measured in daltons (Da) and Δ Rt is measured in seconds(s). Since in the prior art, the peptide chromatogram retention time can also be approximated by the mass spectrometry scan number, the unit of measure for Δ Rt in this case can also be the number of scans (scans).
For example, if the experimental tandem mass spectrometry dataset contains 100000 spectra, a total of $\binom{100000}{2} = 100000 \times 99999 / 2 = 4999950000$ spectrogram difference vectors will be generated. These vectors constitute the set of spectrogram difference vectors.
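As a hedged illustration of how the set of spectrogram difference vectors could be built, the sketch below enumerates every pair of spectra once. The name `difference_vectors` and the sign convention (heavier spectrum minus lighter spectrum, so that Δm is non-negative) are assumptions; the text above does not fix the order of subtraction.

```python
from itertools import combinations

def difference_vectors(spectra):
    """Yield (delta_m, delta_rt) for every unordered pair of spectra.

    spectra: list of (peptide_mass_da, retention_time_s) tuples (assumed layout).
    Assumed convention: heavier minus lighter, so delta_m >= 0.
    """
    for (m1, rt1), (m2, rt2) in combinations(sorted(spectra), 2):
        yield (m2 - m1, rt2 - rt1)
```

Because the number of pairs grows as n(n-1)/2, a generator is used here so that the vectors need not all be held in memory at once.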
After the set of spectrogram difference vectors is obtained, the peptide mass difference component Δm of the vectors is used to screen for candidate modification mass intervals, i.e., mass intervals that may contain a modification mass. The method first locates a candidate modification mass interval and then determines the accurate modification mass within it.
Counting all Δm components in the set of spectrogram difference vectors gives a Δm distribution histogram as shown in FIG. 4, in which the abscissa represents the value of Δm and the ordinate represents the number of occurrences of that value. It is known from biology that the modified and unmodified forms of the same peptide tend to occur together in a sample. For a high-abundance modification, the mass difference between the modified and unmodified forms of the same peptide lies within a specific range and appears many times, whereas the mass differences between different peptides are usually randomly distributed. Therefore, a Δm value that occurs with high frequency in the Δm distribution histogram is an approximate candidate for a modification mass, and a candidate modification mass interval can be located from it.
A probabilistic method may be used to search for the Δm values that occur with high frequency. In one embodiment of the invention, a simple and effective scheme is employed. For each 1 Da mass window centered on an integer mass value, e.g., (0.5, 1.5), (1.5, 2.5), (2.5, 3.5), ..., the Δm with the highest frequency within the window is extracted and denoted $\Delta m_f$. Expressed as a formula:
$$\Delta m_f^{(i)} = \underset{\Delta m \in (i-0.5,\; i+0.5)}{\arg\max}\ \mathrm{counts}(\Delta m), \quad i = 1, 2, \ldots, n \qquad (2)$$
where counts(Δm) is the number of times Δm occurs and n is the number of mass windows. For a typical protein sample, most of the $\Delta m_f^{(i)}$ values arise at random, but some may be caused by post-translational modifications of proteins; for example, the highest-frequency value of 57.02 Da in FIG. 4 is caused by carbamidomethylation. After the $\Delta m_f$ values are obtained, further examination of the distribution histogram of counts($\Delta m_f$) shows that the random part of the counts($\Delta m_f$) distribution approximates a Gaussian distribution, as in the example shown in FIG. 5. In one embodiment of the invention, a heuristic approach is used to estimate this random distribution. First, let $c_{min}$ and $c_{med}$ denote the minimum and the median, respectively, of the data set $\{\mathrm{counts}(\Delta m_f^{(i)})\}$, $i = 1, 2, \ldots, n$. Then the data values smaller than $2c_{med} - c_{min}$ are used to estimate the parameters of the random Gaussian distribution of counts($\Delta m_f$), namely its mean and standard deviation. Finally, for each $\mathrm{counts}(\Delta m_f^{(i)})$, the p-value with respect to the estimated random Gaussian distribution can be calculated (i.e., the probability, under the assumption that $\mathrm{counts}(\Delta m_f^{(i)})$ is generated at random, that its value is equal to or greater than the actually observed value). Each $\Delta m_f^{(i)}$ whose $\mathrm{counts}(\Delta m_f^{(i)})$ has a p-value below a certain threshold (e.g., 0.01) is regarded as a high-frequency Δm.
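The window-wise extraction and the heuristic p-value screen described above can be sketched as follows. This is a minimal illustration under stated assumptions: counts(Δm) is obtained by binning Δm with a hypothetical bin width of 0.01 Da, the function names `window_peaks` and `high_frequency` are invented here, and the sketch assumes that at least two background values with non-zero spread exist.

```python
import math
from collections import Counter
from statistics import NormalDist, mean, median, stdev

def window_peaks(delta_ms, bin_width=0.01):
    """Return {window_center: (delta_m_f, count)} for 1 Da windows centered on integers.

    bin_width is an assumed histogram resolution used to define counts(delta_m).
    """
    bins = Counter(round(dm / bin_width) for dm in delta_ms)
    peaks = {}
    for b, count in bins.items():
        dm = b * bin_width
        window = math.floor(dm + 0.5)          # window (i - 0.5, i + 0.5) -> i
        if window < 1:
            continue                           # windows start at (0.5, 1.5)
        if window not in peaks or count > peaks[window][1]:
            peaks[window] = (dm, count)
    return peaks

def high_frequency(peaks, p_threshold=0.01):
    """Return {window_center: (delta_m_f, p_value)} for windows with p_value < p_threshold."""
    counts = [c for _, c in peaks.values()]
    c_min, c_med = min(counts), median(counts)
    # Values below 2*c_med - c_min are taken as the random (background) part.
    background = [c for c in counts if c < 2 * c_med - c_min]
    null = NormalDist(mean(background), stdev(background))
    selected = {}
    for window, (dm, c) in peaks.items():
        p_value = 1.0 - null.cdf(c)            # P(count >= observed) under the random model
        if p_value < p_threshold:
            selected[window] = (dm, p_value)
    return selected
```

Calling `high_frequency(window_peaks(delta_ms))` on the Δm components of the difference vectors would return the windows whose most frequent Δm is unlikely to be random under the estimated Gaussian.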
After a high-frequency Δm is obtained, the candidate modification mass interval can be determined from it. In general, the width of a candidate modification mass interval does not exceed 1 Da, and the interval can be set to a range of ±ε Da around the integer mass value closest to the high-frequency Δm. For example, for a high-frequency Δm of 57.02 Da, the corresponding candidate modification mass interval is (57-ε, 57+ε) Da. Values of ε between 0.3 and 0.5 work well; in general, ε = 0.5 is sufficient.
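A short Python sketch of this interval construction is given below, together with a helper that selects the difference vectors whose Δm falls inside the interval. Both function names are hypothetical, and the default ε = 0.5 follows the value suggested above.

```python
def candidate_interval(high_freq_dm, eps=0.5):
    """Return (low, high) in Da: +/- eps around the integer mass nearest to high_freq_dm."""
    center = round(high_freq_dm)
    return (center - eps, center + eps)

def vectors_in_interval(vectors, interval):
    """Keep the (delta_m, delta_rt) vectors whose delta_m lies inside the open interval."""
    low, high = interval
    return [(dm, drt) for dm, drt in vectors if low < dm < high]
```

For a high-frequency Δm of 57.02 Da, `candidate_interval(57.02)` gives the interval from 56.5 Da to 57.5 Da.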
After the candidate modification mass interval is obtained, a more accurate mass value of the post-translational modification can be determined by combining the peptide mass information with the other component of the spectrogram difference vector, the peptide chromatographic retention time difference. Peptide chromatographic retention time is a further dimension of information that is relatively independent of peptide mass. The modified and unmodified forms of the same peptide have the same amino acid sequence and differ only by one modification group, so their physicochemical properties are similar and their retention times are close. Moreover, the effect of a given modification on peptide retention time tends to be relatively constant. Therefore, the retention time differences of spectrogram pairs related by a given modification should be relatively concentrated and show a consistent trend. For example, a modification may tend to increase or decrease the retention time of a peptide, or have no significant effect. FIG. 6 gives an example of a two-dimensional histogram of the Δ vectors; it shows that combining the peptide mass information and the retention time information in the two-dimensional vector yields a sharp peak, which is caused by carbamidomethylation.
To obtain an accurate experimental mass value of a post-translational modification from the spectrogram difference vectors, the distribution of the spectrogram difference vectors Δ within each candidate modification mass interval is assumed to be a mixture of several components: one random distribution and n distributions caused by modifications. That is, the probability density function of Δ is:
$$f(\Delta) = \alpha_{Rand}\, f_{Rand}(\Delta) + \sum_{j=1}^{n} \alpha_{Mod,j}\, f_{Mod,j}(\Delta) \qquad (3)$$
$$\alpha_{Rand} + \sum_{j=1}^{n} \alpha_{Mod,j} = 1 \qquad (4)$$
where $f_{Rand}(\Delta)$ is the probability density function of the random distribution within the candidate modification mass interval, $f_{Mod,j}(\Delta)$ is the probability density function of the j-th distribution caused by modification within the interval, the integer n is the number of distributions caused by modification, and the α values are the mixing coefficients. To obtain the specific modification information contained in the distributions caused by modification, it is first necessary to determine, on the basis of the above formulas, how many such distributions a mixed distribution contains; then, provided the mixed distribution contains modification distributions, to determine which distributions are caused by modification; and finally to calculate the accurate experimental mass value of each modification and related quantities from the attributes of those distributions.
A mixed distribution may contain no distribution caused by modification, exactly one, or several. To determine how many modifications a candidate modification mass interval contains, i.e., the value of n, the present invention adopts a stepwise trial strategy. Let n take the values 1, 2, ..., N in turn, where N is a relatively large integer (e.g., 10), and for each value of n estimate the parameters of the above mixed distribution using the expectation-maximization (EM) algorithm. For each estimation result, examine the estimated standard deviations of the Δm and ΔRt components of the j-th modification distribution (denoted $\sigma_{m,j}$ and $\sigma_{Rt,j}$, respectively). If they are sufficiently small for all j = 1, 2, ..., n, i.e., $\sigma_{m,j} < T_m$ and $\sigma_{Rt,j} < T_{Rt}$, where $T_m$ and $T_{Rt}$ are two thresholds, the mass interval is considered to contain at least n modifications, n is increased by 1, and the mixed distribution is re-estimated. When there exists a j-th modification distribution for which $\sigma_{m,j} \ge T_m$ or $\sigma_{Rt,j} \ge T_{Rt}$, the mass interval is considered to contain only n-1 modifications; n is decreased by 1, the parameters of the mixed distribution are re-estimated, and the stepwise trial stops.
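The control flow of this stepwise trial can be sketched in Python as shown below. The EM fit itself is not implemented here: `fit_sigmas` is a hypothetical callable, supplied by the caller, that fits the mixture with n modification components and returns their estimated standard deviations. The threshold values passed as `t_m` and `t_rt` are likewise left to the caller, since the text does not specify them.

```python
def count_modifications(fit_sigmas, n_max, t_m, t_rt):
    """Return the number of modification distributions found by stepwise trial.

    fit_sigmas(n): hypothetical callable returning a list of n tuples
                   (sigma_m, sigma_rt), one per modification component.
    n_max:         upper bound N on the number of components tried.
    t_m, t_rt:     thresholds on the mass and retention time standard deviations.
    """
    n = 1
    while n <= n_max:
        sigmas = fit_sigmas(n)
        if all(s_m < t_m and s_rt < t_rt for s_m, s_rt in sigmas):
            n += 1   # every component is narrow: at least n modifications, try n + 1
        else:
            break    # some component is too broad: only n - 1 modifications
    return n - 1
```

After the function returns, the mixture would be re-estimated once more with the returned number of components, as described above.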
After the number of modification distributions and the parameters of the mixed distribution are obtained by the method described above, the standard deviation of each component of the mixed distribution indicates which components are caused by modification and which are random distributions unrelated to modification. FIG. 7 gives an example of a Δ distribution in which the mass interval contains only one modification (carbamidomethylation). As can be seen from FIG. 7, in both the Δm and ΔRt dimensions of the Δ vector, the main (random) part of the data approximates a Gaussian distribution with a large standard deviation, while the modification-related part (indicated by squares in the scatter plot and ellipses in the histogram) approximates a Gaussian distribution with a small standard deviation.
After the distributions caused by modification are identified, the experimental value of the corresponding modification mass can be calculated from the attributes of each distribution. For example, for the j-th modification (j = 1, 2, ..., n), attributes such as the mean and standard deviation of the corresponding peptide mass difference and retention time difference can be obtained. The mean of the peptide mass difference can be used as the experimental mass value of the modification, and the mean of the peptide retention time difference characterizes the influence of the modification on retention time.
Since the number of modification types contained in a protein sample is not usually too large, in order to simplify the calculation, it is also possible in a preferred embodiment of the invention to assume that at most one modification mass is contained within each of the candidate modification mass intervals, and further to assume that the components of the Δ distribution resulting from the modification conform to one gaussian distribution and the random Δ distribution conforms to another gaussian distribution. On the assumption of the above, the Δ mixture distribution formulas expressed by the foregoing formulas (3) and (4) can be simplified to the following two formulas:
$$f(\Delta) = \alpha_{Rand}\, f(\Delta \mid \mu_{Rand}, \Sigma_{Rand}) + \alpha_{Mod}\, f(\Delta \mid \mu_{Mod}, \Sigma_{Mod}) \qquad (5)$$
$$\alpha_{Rand} + \alpha_{Mod} = 1 \qquad (6)$$
where $f(\Delta \mid \mu, \Sigma)$ is the probability density function of a two-dimensional Gaussian distribution with mean μ and covariance matrix Σ:
$$f(\Delta \mid \mu, \Sigma) = \frac{1}{2\pi\, |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\Delta - \mu)^{T}\, \Sigma^{-1} (\Delta - \mu)\right) \qquad (7)$$
The subscripts Rand and Mod denote the random and the modification-related component, respectively, and $\alpha_{Rand}$ and $\alpha_{Mod}$ are the mixing coefficients. The parameters of the Gaussian mixture distribution can be estimated using the expectation-maximization (EM) algorithm. The mass component of the estimated $\mu_{Mod}$ can be used as the estimate of the modification mass, and its retention time component can be used as a measure of the effect of the modification on retention time.
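To make the simplified model of formulas (5) to (7) concrete, the following pure-Python sketch fits a two-component, two-dimensional Gaussian mixture by EM. It is an illustration under stated assumptions rather than the exact procedure of the invention: the function names, the initialization (random component fitted to all data, modification component started as a narrow Gaussian at a user-supplied guess `mod_guess`), the fixed iteration count, and the small diagonal regularization term are all choices made here. The sketch also assumes the modification component keeps a non-zero weight throughout.

```python
import math

def gauss2d(x, mu, cov):
    """Density of a 2-D Gaussian at x; cov is ((a, b), (b, c))."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    quad = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def weighted_fit(points, weights):
    """Weighted mean and covariance of the points (M-step for one component)."""
    total = sum(weights)
    mx = sum(w * p[0] for p, w in zip(points, weights)) / total
    my = sum(w * p[1] for p, w in zip(points, weights)) / total
    # 1e-9 on the diagonal keeps the covariance invertible (assumed regularization).
    a = sum(w * (p[0] - mx) ** 2 for p, w in zip(points, weights)) / total + 1e-9
    c = sum(w * (p[1] - my) ** 2 for p, w in zip(points, weights)) / total + 1e-9
    b = sum(w * (p[0] - mx) * (p[1] - my) for p, w in zip(points, weights)) / total
    return (mx, my), ((a, b), (b, c))

def em_two_gaussians(points, mod_guess, iters=50):
    """Fit alpha_Rand*N(mu_Rand, S_Rand) + alpha_Mod*N(mu_Mod, S_Mod) to 2-D points."""
    # Assumed initialization: random part from all data, modification part narrow.
    mu_r, cov_r = weighted_fit(points, [1.0] * len(points))
    mu_m, cov_m = mod_guess, ((1e-4, 0.0), (0.0, 1e4))
    a_r = a_m = 0.5
    for _ in range(iters):
        # E-step: responsibility of the modification component for each point.
        resp = []
        for p in points:
            pm = a_m * gauss2d(p, mu_m, cov_m)
            pr = a_r * gauss2d(p, mu_r, cov_r)
            resp.append(pm / (pm + pr) if pm + pr > 0.0 else 0.0)
        # M-step: update mixing coefficients, means, and covariances.
        a_m = sum(resp) / len(points)
        a_r = 1.0 - a_m
        mu_m, cov_m = weighted_fit(points, resp)
        mu_r, cov_r = weighted_fit(points, [1.0 - r for r in resp])
    return {"alpha_mod": a_m, "mu_mod": mu_m, "cov_mod": cov_m,
            "alpha_rand": a_r, "mu_rand": mu_r, "cov_rand": cov_r}
```

In this sketch, `mod_guess` could be the high-frequency Δm of the interval paired with a retention time difference of zero; the first element of the returned `mu_mod` then plays the role of the estimated modification mass and the second the retention time shift.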
In the process described above, the modification mass values are determined within a small number of candidate modification mass intervals that have first been selected by screening. In other embodiments of the method of the invention, the screening of candidate modification mass intervals may be omitted, and all possible candidate modification mass intervals are established over the entire distribution range of the peptide mass differences. Specifically, before the modification mass values are determined, all mass windows centered on an integer mass with a width of 2ε are established over the entire distribution range of the peptide mass differences, and each of these mass windows is then treated as a candidate modification mass interval and processed according to the operations for determining the modification mass value. Here ε can be 0.3 to 0.5, and in general 0.5.
From the above estimates of the mixture distribution, a very accurate modification mass value and the quantitative influence of the modification on peptide retention time can be obtained, and the type of modification can be inferred from them. First, a post-translational modification database (e.g., Unimod, http://www.unimod.com) is searched with the modification mass value to infer the specific type of modification. Second, because different types of modification affect peptide retention time differently, the retention time shift, as an independent source of information in another dimension, can help to infer the modification type; for example, oxidation reduces the retention time of a peptide.
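The mass-based lookup can be illustrated with a small Python sketch. The table below is only a tiny hand-written subset of well-known monoisotopic modification masses and is not an interface to the Unimod database; the names `KNOWN_MODS` and `match_modifications` and the default tolerance of 0.005 Da are assumptions made for illustration.

```python
# Illustrative subset of monoisotopic modification masses in Da (not a database API).
KNOWN_MODS = {
    "Carbamidomethyl": 57.021464,
    "Oxidation": 15.994915,
    "Acetyl": 42.010565,
    "Phospho": 79.966331,
    "Methyl": 14.015650,
}

def match_modifications(mass, tol_da=0.005):
    """Return names of known modifications within tol_da of mass, closest first."""
    hits = [(abs(ref - mass), name) for name, ref in KNOWN_MODS.items()
            if abs(ref - mass) <= tol_da]
    return [name for _, name in sorted(hits)]
```

An estimated modification mass of 57.0215 Da would match only the carbamidomethyl entry in this table; a real search would use the full database and could additionally take the retention time shift into account.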
The above operation of the method of the present invention has realized calculation of mass size of protein posttranslational modification and estimation of modification type, and on the basis of the calculation result, various applications such as identification of protein, detection of modification-related spectrogram pair, etc. can be made.
In protein identification, once the mass and type of the modifications have been determined, the discovered modifications can be taken into account in protein identification algorithms and software. For example, in a method that identifies proteins by database search, modified peptides and proteins can be identified by specifying the discovered modification types as variable modification parameters.
In the method for detecting modification-related spectrogram pairs, after the parameter estimates of the mixed distribution are obtained according to formulas (3) and (4), the posterior probability that a pair of spectra with a given difference vector Δ is related by the k-th modification can be calculated as:
$$P(Mod_k \mid \Delta) = \frac{\alpha_{Mod,k}\, f_{Mod,k}(\Delta)}{\alpha_{Rand}\, f_{Rand}(\Delta) + \sum_{j=1}^{n} \alpha_{Mod,j}\, f_{Mod,j}(\Delta)} \qquad (8)$$
where $Mod_k$ indicates that the spectrogram pair is related by the k-th modification, i.e., the peptides of the two spectra have the same sequence but one carries one more k-th modification than the other.
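For the simplified Gaussian mixture model of formulas (5), (6), and (7), the posterior probability is calculated as:
$$P(Mod \mid \Delta) = \frac{\alpha_{Mod}\, f(\Delta \mid \mu_{Mod}, \Sigma_{Mod})}{\alpha_{Rand}\, f(\Delta \mid \mu_{Rand}, \Sigma_{Rand}) + \alpha_{Mod}\, f(\Delta \mid \mu_{Mod}, \Sigma_{Mod})} \qquad (9)$$
Spectrogram pairs with a posterior probability greater than a given threshold are regarded as modification-related spectrogram pairs and can be analyzed further by the user, for example by de novo peptide sequencing or by propagating peptide sequences between the spectra of a pair.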
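The posterior probability for the simplified two-component model can be computed as in the following sketch, which applies Bayes' rule to the two weighted Gaussian densities. The function names and the argument layout are assumptions made for illustration; the parameter values would come from a previously fitted mixture.

```python
import math

def gauss2d(x, mu, cov):
    """Density of a 2-D Gaussian at x; cov is ((a, b), (b, c))."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    quad = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def posterior_mod(delta, alpha_rand, mu_rand, cov_rand, alpha_mod, mu_mod, cov_mod):
    """Posterior probability that the pair with difference vector delta is modification-related."""
    pm = alpha_mod * gauss2d(delta, mu_mod, cov_mod)
    pr = alpha_rand * gauss2d(delta, mu_rand, cov_rand)
    return pm / (pm + pr)

def related_pairs(vectors, threshold, **params):
    """Keep the difference vectors whose posterior exceeds the (user-chosen) threshold."""
    return [v for v in vectors if posterior_mod(v, **params) > threshold]
```

A vector lying at the center of a narrow modification component receives a posterior close to 1, while a vector far from it is attributed almost entirely to the random component.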
According to the present invention, there may also be provided a corresponding apparatus for discovering a post-translational modification of a protein, comprising: the system comprises a spectrogram difference vector calculation module, a candidate modification mass interval establishment module, a protein post-translation modification distribution discovery module and an accurate mass value calculation module; wherein,
the spectrogram difference vector calculation module calculates spectrogram difference vectors by using the peptide chromatogram retention time and the peptide mass in spectrogram data in an experimental tandem mass spectrum data set of the protein sample, wherein the spectrogram difference vectors represent the peptide mass difference and the peptide chromatogram retention time difference between two experimental tandem mass spectrograms;
the candidate modification mass interval establishing module establishes candidate modification mass intervals that may contain a modification mass;
the protein posttranslational modification distribution finding module estimates mixed distribution of the spectrogram difference vectors on each candidate modification mass interval, calculates standard deviation of each distribution in the mixed distribution, and determines the distribution caused by the protein posttranslational modification in the candidate modification mass interval according to the standard deviation;
the accurate mass value calculation module calculates a mean of the distribution resulting from the post-translational modification of the protein, obtains an accurate mass value of the post-translational modification of the protein from a mass component of the mean, and obtains an influence of the post-translational modification of the protein on a retention time of the peptide chromatogram from a retention time component of the mean.
The device for discovering the protein post-translational modification also comprises a redundant data removing module, wherein the redundant data removing module removes redundant spectrogram data in the experimental tandem mass spectrum data set.
The device for discovering the posttranslational modification of the protein further comprises a module for discovering the type of the posttranslational modification of the protein, which module infers the type of the posttranslational modification of the protein according to the precise mass value of the posttranslational modification of the protein and the influence on the retention time of the peptide chromatogram.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the embodiments, those skilled in the art will understand that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (18)
1. A method for discovering post-translational modifications of a protein, comprising:
step 1), calculating spectrogram difference vectors between all spectrograms by using the peptide chromatogram retention time and the peptide mass in the experimental tandem mass spectrum data of the protein sample, wherein the spectrogram difference vectors represent the peptide mass difference and the peptide chromatogram retention time difference between two experimental tandem mass spectrograms;
step 2), establishing candidate modification mass intervals that may contain a modification mass;
step 3), on each candidate modification mass interval, estimating the mixed distribution of the spectrogram difference vectors, calculating the standard deviation of each distribution in the mixed distribution, and determining the distribution caused by the protein posttranslational modification in the candidate modification mass interval according to the standard deviation;
step 4), calculating the exact mass value of the post-translational modification of the protein and the effect of the post-translational modification of the protein on the retention time of the peptide chromatogram based on the property of the distribution comprising the post-translational modification of the protein.
2. The method for discovering post-translational modifications of a protein according to claim 1, further comprising removing redundant spectral data from the set of experimental tandem mass spectrometry data prior to step 1).
3. The method of claim 2, wherein the removing of redundant spectral data comprises: comparing the peptide masses in the spectral data, treating spectral data with approximately equal peptide masses as similar spectral data, and keeping only one item of spectral data in each set of similar spectral data.
4. The method for finding a post-translational modification of a protein according to claim 1 or 2, further comprising:
step 5), inferring the type of post-translational modification of the protein from the exact mass value of the post-translational modification of the protein and the effect on the retention time of the peptide chromatogram.
5. The method for finding post-translational modifications of protein according to claim 1, 2 or 4, wherein in the step 2), a distribution histogram of peptide mass differences is created according to the spectrogram difference vector, peptide mass differences with high occurrence frequency are screened from the distribution histogram of peptide mass differences, and the candidate modification mass interval is created on the distribution histogram of peptide mass differences by using the peptide mass differences with high occurrence frequency.
6. The method for finding post-translational modifications of proteins according to claim 5, wherein the step of screening the distribution histogram of peptide mass differences for the peptide mass differences with high frequency of occurrence comprises:
step 2-1-1), establishing a mass window with integer mass values as the center on the distribution histogram of the peptide mass difference;
step 2-1-2), extracting the peptide mass difference $\Delta m_f$ with the highest occurrence frequency in each mass window;
step 2-1-3), establishing a distribution histogram of the occurrence counts of the peptide mass differences $\Delta m_f$ with the highest occurrence frequency in each window, estimating the random distribution of this histogram, and calculating, according to the estimation result, the p-value with which the occurrence count counts($\Delta m_f$) of the most frequent peptide mass difference in a given window comes from the random distribution;
7. The method of claim 6, wherein the first threshold comprises 0.01.
8. The method of claim 5, wherein said using said peptide mass differences with high frequency of occurrence to establish said candidate modification mass interval on said histogram of distribution of peptide mass differences comprises:
step 2-2-1), searching the nearest integer mass value in the vicinity of the peptide mass difference value with high occurrence frequency;
step 2-2-2), selecting a range of ±ε Da around the integer mass value to obtain the candidate modification mass interval, wherein ε is any value between 0.3 and 0.5.
9. The method for finding the posttranslational modification of protein according to claim 1, 2 or 4, wherein in the step 2), a mass window is established over the entire distribution interval of the peptide mass difference, and all the established mass windows are used as candidate modification mass intervals.
10. The method of claim 9, wherein said establishing a mass window comprises establishing mass windows centered at integer masses with a width of 2ε, wherein ε is any value between 0.3 and 0.5.
11. The method for finding post-translational modifications of proteins as claimed in claim 1, 2 or 4, wherein in step 3), said mixed distribution comprises a random distribution and n distributions resulting from modifications, and said calculating a standard deviation of each distribution in said mixed distribution, and determining the distribution resulting from post-translational modifications of proteins in said candidate modification mass interval from said standard deviation comprises:
step 3-1-1), setting the value of n to 1;
step 3-1-2), estimating the parameters of the probability density function of the spectrogram difference vectors in the current candidate modification mass interval, and selecting the n distributions with the smallest standard deviations as candidate distributions caused by modification; the probability density function is:
$$f(\Delta)=\alpha_{Rand}\,f_{Rand}(\Delta)+\sum_{j=1}^{n}\alpha_{Mod,j}\,f_{Mod,j}(\Delta)$$
wherein $f_{Rand}(\Delta)$ represents the probability density function of the random distribution within the candidate modification mass interval, $f_{Mod,j}(\Delta)$ represents the probability density function of the distribution caused by the j-th modification in the candidate modification mass interval, and each α is a mixing coefficient;
step 3-1-3), examining, among the estimated parameters, the standard deviations $\sigma_{m,j}$ and $\sigma_{Rt,j}$ of the Δm and ΔRt components of the j-th modification distribution; if $\sigma_{m,j}<T_m$ and $\sigma_{Rt,j}<T_{Rt}$ for all j = 1, 2, ..., n, the mass interval is considered to contain at least n modifications, the value of n is increased by 1, and step 3-1-2) is re-executed, wherein $T_m$ and $T_{Rt}$ are two thresholds; when there exists a j-th modification distribution for which $\sigma_{m,j}\ge T_m$ or $\sigma_{Rt,j}\ge T_{Rt}$, the mass interval is confirmed to contain only n-1 modifications, the value of n is decreased by 1, and the operation ends after the parameter estimation of step 3-1-2) is re-executed.
12. The method for discovering post-translational modification of protein according to claim 11, wherein in the step 3-1-2), parameters of probability density function of spectrogram difference vector in the current candidate modification mass interval are estimated by expectation-maximization algorithm.
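The control flow of steps 3-1-1) to 3-1-3) can be sketched as a loop around a mixture-fitting routine. In this sketch `fit_mixture` is a hypothetical callback, not something defined by the source: it is assumed to run the estimation of step 3-1-2) (for example by expectation-maximization) and to return the `(sd_mass, sd_rt)` pairs of the n narrowest components. The `max_n` guard is also an assumption, added so that the loop always terminates.

```python
def count_modifications(vectors, fit_mixture, t_m, t_rt, max_n=10):
    """Estimate how many modifications a candidate mass interval contains.

    vectors: spectrogram difference vectors falling in the interval.
    fit_mixture(vectors, n): hypothetical fitter returning a list of n
        (sd_mass, sd_rt) pairs for the modification components.
    t_m, t_rt: thresholds on the mass and retention-time standard deviations.
    """
    n = 1                                              # step 3-1-1
    while n <= max_n:
        sds = fit_mixture(vectors, n)                  # step 3-1-2
        if all(sm < t_m and srt < t_rt for sm, srt in sds):
            n += 1       # at least n modifications; try one more component
        else:
            break        # some component is too broad to be a modification
    n -= 1               # the last value of n that passed the test
    final = fit_mixture(vectors, n) if n > 0 else []   # re-estimate, then stop
    return n, final
```

When even the single-component fit fails the thresholds, the function returns `n = 0`, meaning the interval is treated as containing no modification.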
13. The method for finding post-translational modifications of proteins of claim 1, 2 or 4, wherein in step 3), said mixture distribution comprises a random gaussian distribution and a gaussian distribution resulting from modifications, and said calculating the standard deviation of each distribution in said mixture distribution, and determining the distribution resulting from post-translational modifications of proteins in said candidate modification mass interval from said standard deviation comprises:
step 3-2-1), estimating, by an expectation-maximization algorithm, the parameters of the Gaussian mixture probability density function of the spectrogram difference vectors in the current candidate modification mass interval; the probability density function is
$$f(\Delta)=\alpha_{Rand}\,f(\Delta\mid\mu_{Rand},\Sigma_{Rand})+\alpha_{Mod}\,f(\Delta\mid\mu_{Mod},\Sigma_{Mod})$$
$$\alpha_{Rand}+\alpha_{Mod}=1$$
wherein $\alpha_{Rand}$ and $\alpha_{Mod}$ are the mixing coefficients, and $f(\Delta\mid\mu,\Sigma)$ is the probability density function of a two-dimensional Gaussian distribution with mean μ and covariance matrix Σ:
$$f(\Delta\mid\mu,\Sigma)=\frac{1}{2\pi|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\Delta-\mu)^{T}\Sigma^{-1}(\Delta-\mu)\right)$$
step 3-2-2), when the standard deviation of one distribution in the estimated parameters is much smaller than that of the other distribution, the distribution with the smaller standard deviation is the distribution caused by the post-translational modification of the protein.
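A minimal pure-Python sketch of the two-component case in steps 3-2-1) and 3-2-2) follows. It is an illustration under stated assumptions, not the patented implementation: the initialisation (narrow component at the coordinate-wise median with a shrunken covariance), the fixed iteration count, and all function names are choices made for this example.

```python
import math
from statistics import median


def gauss2d(x, mu, cov):
    """Density of a 2-D Gaussian at x; cov is ((a, b), (b, d))."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # (x - mu)^T cov^-1 (x - mu), written out for a 2x2 matrix
    q = (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))


def weighted_fit(points, weights):
    """M-step for one component: weighted mean and covariance."""
    total = sum(weights)
    mx = sum(w * p[0] for w, p in zip(weights, points)) / total
    my = sum(w * p[1] for w, p in zip(weights, points)) / total
    a = sum(w * (p[0] - mx) ** 2 for w, p in zip(weights, points)) / total
    d = sum(w * (p[1] - my) ** 2 for w, p in zip(weights, points)) / total
    b = sum(w * (p[0] - mx) * (p[1] - my) for w, p in zip(weights, points)) / total
    # A tiny ridge keeps the matrix invertible if a component collapses.
    return (mx, my), ((a + 1e-12, b), (b, d + 1e-12))


def fit_two_gaussians(points, iterations=50):
    """EM for a random plus a modification Gaussian over (delta_m, delta_rt).

    Returns the two components sorted by mass standard deviation, so the
    first one is the candidate modification distribution of step 3-2-2).
    """
    mu_r, cov_r = weighted_fit(points, [1.0] * len(points))   # broad start
    # Assumed initialisation: narrow component at the coordinate-wise median.
    mu_m = (median(p[0] for p in points), median(p[1] for p in points))
    cov_m = tuple(tuple(v / 25 for v in row) for row in cov_r)
    alpha_m = 0.5
    for _ in range(iterations):
        resp = []                                   # E-step: responsibilities
        for p in points:
            pm = alpha_m * gauss2d(p, mu_m, cov_m)
            pr = (1 - alpha_m) * gauss2d(p, mu_r, cov_r)
            resp.append(pm / (pm + pr))
        mu_m, cov_m = weighted_fit(points, resp)    # M-step: update parameters
        mu_r, cov_r = weighted_fit(points, [1 - r for r in resp])
        alpha_m = sum(resp) / len(points)
    comps = [
        {"alpha": alpha_m, "mean": mu_m,
         "sd_mass": math.sqrt(cov_m[0][0]), "sd_rt": math.sqrt(cov_m[1][1])},
        {"alpha": 1 - alpha_m, "mean": mu_r,
         "sd_mass": math.sqrt(cov_r[0][0]), "sd_rt": math.sqrt(cov_r[1][1])},
    ]
    return sorted(comps, key=lambda c: c["sd_mass"])
```

The caller would compare the `sd_mass` and `sd_rt` values of the two returned components; how much smaller counts as "much smaller" is not quantified in the claim and would have to be chosen by the user.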
14. A device for discovering post-translational modifications of proteins, comprising: a spectrogram difference vector calculation module, a candidate modification mass interval establishment module, a protein post-translational modification distribution discovery module, and an accurate mass experimental value calculation module; wherein,
the spectrogram difference vector calculation module calculates spectrogram difference vectors among all spectrograms by using the peptide chromatogram retention time and the peptide mass in the experimental tandem mass spectrum data of the protein sample, wherein the spectrogram difference vectors represent the peptide mass difference and the peptide chromatogram retention time difference between two experimental tandem mass spectrograms;
the candidate modification mass interval establishment module establishes candidate modification mass intervals that may contain a modification mass;
the protein posttranslational modification distribution finding module estimates mixed distribution of the spectrogram difference vectors on each candidate modification mass interval, calculates standard deviation of each distribution in the mixed distribution, and determines the distribution caused by the protein posttranslational modification in the candidate modification mass interval according to the standard deviation;
the accurate mass experiment value calculation module calculates a mean value of distribution caused by the protein post-translational modification, obtains an accurate mass experiment value of the protein post-translational modification from a mass component of the mean value, and obtains an influence of the protein post-translational modification on a peptide chromatogram retention time from a retention time component of the mean value.
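The first module's computation can be sketched as follows. Each spectrum is reduced to a `(peptide_mass, retention_time)` pair, and one difference vector is produced per pair of spectra. The function name, the tuple layout, and the "heavier minus lighter" sign convention are assumptions made for this example; the source does not state the sign convention.

```python
from itertools import combinations


def difference_vectors(spectra):
    """Compute (delta_mass, delta_rt) for every pair of spectra.

    spectra: list of (peptide_mass_Da, retention_time) tuples.
    Differences are taken as heavier peptide minus lighter peptide
    (an assumed convention), so delta_mass is never negative.
    """
    ordered = sorted(spectra)    # tuples sort by mass first
    return [(m2 - m1, rt2 - rt1)
            for (m1, rt1), (m2, rt2) in combinations(ordered, 2)]


pairs = difference_vectors([(1016.0, 18.5), (1000.0, 20.0), (1080.0, 21.0)])
print(pairs)
```

The number of vectors grows quadratically with the number of spectra, which is one reason a redundant-data removal step before this stage can be useful.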
15. The apparatus for discovering post-translational modification of protein according to claim 14, further comprising a redundant data removal module, wherein the redundant data removal module removes redundant spectral data from the set of experimental tandem mass spectrometry data.
16. The apparatus for discovering post-translational modification of protein according to claim 14 or 15, further comprising a module for inferring the type of post-translational modification of protein based on the precise mass experiment value of the post-translational modification of protein and the influence on the retention time of peptide chromatogram.
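One simple way to picture the inference module of claim 16 is a lookup of the measured mass against a table of known modification masses. The sketch below is an assumption-laden illustration: the table is a small hand-picked set of standard monoisotopic mass shifts, the 0.01 Da tolerance is arbitrary, and the retention-time component is not used because the source does not state the rule for it.

```python
# Standard monoisotopic mass shifts in Da (small illustrative subset).
KNOWN_MODIFICATIONS = {
    "Methylation": 14.01565,
    "Oxidation": 15.99491,
    "Acetylation": 42.01056,
    "Carbamidomethylation": 57.02146,
    "Phosphorylation": 79.96633,
}


def infer_modification(mass, tolerance=0.01):
    """Return the names of known modifications whose reference mass lies
    within `tolerance` Da of the measured modification mass."""
    return sorted(name for name, ref in KNOWN_MODIFICATIONS.items()
                  if abs(mass - ref) <= tolerance)


print(infer_modification(15.9951))
print(infer_modification(79.9660))
```

Modifications with nearly identical masses, such as acetylation (42.01056 Da) and trimethylation (42.04695 Da), are cases where the retention-time shift could supply additional evidence once the measured mass alone is not decisive.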
17. A method of protein identification comprising:
step 1), determining the mass and type of the protein post-translational modification using the method for discovering post-translational modifications of proteins according to any one of claims 1 to 13;
step 2), in the database search, specifying the discovered types of protein post-translational modification as variable modification parameters, thereby identifying the modified peptides and proteins.
18. A method for detecting modification-related spectrum pairs, comprising:
step 1) calculating an estimate of parameters in the mixture distribution using the method of discovery of post-translational modifications of proteins according to any one of claims 1 to 13;
step 2), calculating a difference vector of a pair of spectrograms;
step 3), calculating, from the difference vector of the spectrogram pair and the parameter estimates of the mixed distribution, the posterior probability that the spectrogram pair is related by the k-th modification.
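The posterior in step 3) follows from Bayes' rule: the weighted density of the k-th component divided by the sum of the weighted densities of all components. The sketch below assumes that every component, including the random one, is a two-dimensional Gaussian; the function names and the `(alpha, mean, cov)` tuple layout are illustrative choices.

```python
import math


def gauss2d(x, mu, cov):
    """Density of a 2-D Gaussian at x; cov is ((a, b), (b, d))."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    q = (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))


def posteriors(delta, components):
    """Posterior probability of each mixture component given one
    difference vector delta = (delta_mass, delta_rt).

    components: list of (alpha, mean, cov) tuples; by the convention of
    this sketch, index 0 is the random component and index k >= 1 is the
    k-th modification.
    """
    joint = [alpha * gauss2d(delta, mu, cov) for alpha, mu, cov in components]
    total = sum(joint)
    return [j / total for j in joint]


mixture = [
    (0.7, (16.0, 0.0), ((0.05, 0.0), (0.0, 25.0))),      # random component
    (0.3, (15.995, -2.0), ((1e-5, 0.0), (0.0, 0.01))),   # one modification
]
print(posteriors((15.995, -2.0), mixture))
```

A spectrum pair would be reported as related by the k-th modification when the corresponding posterior exceeds a cutoff chosen by the user; the source does not specify that cutoff.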
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2009100765888A CN101477089B (en) | 2009-01-09 | 2009-01-09 | Discovery method for protein post-translational modification |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2009100765888A CN101477089B (en) | 2009-01-09 | 2009-01-09 | Discovery method for protein post-translational modification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101477089A (en) | 2009-07-08 |
CN101477089B (en) | 2012-06-13 |
Family
ID=40837834
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2009100765888A Active CN101477089B (en) | 2009-01-09 | 2009-01-09 | Discovery method for protein post-translational modification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101477089B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102411679A (en) * | 2010-09-26 | 2012-04-11 | Institute of Computing Technology, Chinese Academy of Sciences | Large-scale distributed parallel acceleration method and system for protein identification |
CN102472732A (en) * | 2009-07-31 | 2012-05-23 | Siemens AG | Chromatographic filtering method |
CN104053989A (en) * | 2012-01-16 | 2014-09-17 | Leco Corporation | Systems and methods to process and group chromatographic peaks |
CN104182658A (en) * | 2014-08-06 | 2014-12-03 | Institute of Computing Technology, Chinese Academy of Sciences | Tandem mass spectrogram identification method |
CN107991411A (en) * | 2014-05-21 | 2018-05-04 | Thermo Finnigan LLC | Method for mass spectrometry biopolymer analysis using optimized oligomer scheduling |
CN108052801A (en) * | 2017-11-30 | 2018-05-18 | Institute of Computing Technology, Chinese Academy of Sciences | Regular expression-based N-sugar structure library construction method and system |
WO2018176808A1 (en) * | 2017-03-29 | 2018-10-04 | Shandong University | Screening and use of biomarker related to severe oligoasthenospermia |
CN110033822A (en) * | 2019-03-29 | 2019-07-19 | Huazhong University of Science and Technology | Protein coding method and protein post-translational modification site estimation method and system |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102472732A (en) * | 2009-07-31 | 2012-05-23 | Siemens AG | Chromatographic filtering method |
CN102472732B (en) * | 2009-07-31 | 2014-05-07 | Siemens AG | Method for filtering a chromatogram |
CN102411679A (en) * | 2010-09-26 | 2012-04-11 | Institute of Computing Technology, Chinese Academy of Sciences | Large-scale distributed parallel acceleration method and system for protein identification |
CN104053989A (en) * | 2012-01-16 | 2014-09-17 | Leco Corporation | Systems and methods to process and group chromatographic peaks |
CN104053989B (en) * | 2012-01-16 | 2016-10-19 | Leco Corporation | Systems and methods to process and group chromatographic peaks |
CN107991411A (en) * | 2014-05-21 | 2018-05-04 | Thermo Finnigan LLC | Method for mass spectrometry biopolymer analysis using optimized oligomer scheduling |
CN107991411B (en) * | 2014-05-21 | 2020-10-16 | Thermo Finnigan LLC | Method for mass spectrometry biopolymer analysis using optimized oligomer scheduling |
CN104182658B (en) * | 2014-08-06 | 2017-05-03 | Institute of Computing Technology, Chinese Academy of Sciences | Tandem mass spectrogram identification method |
CN104182658A (en) * | 2014-08-06 | 2014-12-03 | Institute of Computing Technology, Chinese Academy of Sciences | Tandem mass spectrogram identification method |
WO2018176808A1 (en) * | 2017-03-29 | 2018-10-04 | Shandong University | Screening and use of biomarker related to severe oligoasthenospermia |
CN108052801A (en) * | 2017-11-30 | 2018-05-18 | Institute of Computing Technology, Chinese Academy of Sciences | Regular expression-based N-sugar structure library construction method and system |
CN108052801B (en) * | 2017-11-30 | 2020-06-26 | Institute of Computing Technology, Chinese Academy of Sciences | Regular expression-based N-sugar structure library construction method and system |
CN110033822A (en) * | 2019-03-29 | 2019-07-19 | Huazhong University of Science and Technology | Protein coding method and protein post-translational modification site estimation method and system |
Also Published As
Publication number | Publication date |
---|---|
CN101477089B (en) | 2012-06-13 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN101477089B (en) | Discovery method for protein post-translational modification | |
Colinge et al. | OLAV: Towards high‐throughput tandem mass spectrometry data identification | |
US6489608B1 (en) | Method of determining peptide sequences by mass spectrometry | |
US6489121B1 (en) | Methods of identifying peptides and proteins by mass spectrometry | |
US7409296B2 (en) | System and method for scoring peptide matches | |
CA2495378C (en) | Method for characterizing biomolecules utilizing a result driven strategy | |
US8987662B2 (en) | System and method for performing tandem mass spectrometry analysis | |
EP1722315B1 (en) | Method and apparatus for classifying ionized molecular fragments | |
EP2032238A2 (en) | Analyzing mass spectral data | |
US20100299076A1 (en) | Mass Spectrometry System | |
JP2019505780A (en) | Structure determination method of biopolymer based on mass spectrometry | |
Salmi et al. | Filtering strategies for improving protein identification in high‐throughput MS/MS studies | |
CN114639445B (en) | Peptidomics identification method based on Bayesian evaluation and sequence database search | |
WO2019175568A1 (en) | Methods and systems for analysis | |
Yu et al. | Statistical methods in proteomics | |
EP3002696B1 (en) | Methods for generating, searching and statistically validating a peptide fragment ion library | |
CN115436347A (en) | Physicochemical property scoring for structure identification in ion spectroscopy | |
US11600359B2 (en) | Methods and systems for analysis of mass spectrometry data | |
Liu et al. | PRIMA: peptide robust identification from MS/MS spectra | |
Wu et al. | Peptide identification via tandem mass spectrometry | |
CN116741280A (en) | Depth mass spectrum prediction method, system, equipment and medium based on data fine tuning | |
CN118230813A (en) | Method and system for evaluating false positive rate of de novo sequencing results of polypeptide mass spectrum data | |
Liu et al. | A novel approach to speed up peptide sequencing via MS/MS spectra analysis | |
Finney | Tools and Analyses for Differential Label-Free Proteomics Using Mass Spectrometry | |
Hundertmark et al. | Fuzzy clustering of likelihood curves for finding interesting patterns in expression profiles |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |