Reproducibility of computational studies is a hallmark of scientific methodology. It enables rese... more Reproducibility of computational studies is a hallmark of scientific methodology. It enables researchers to build with confidence on the methods and findings of others, reuse and extend computational pipelines, and thereby drive scientific progress. Since many experimental studies rely on computational analyses, biologists need guidance on how to set up and document reproducible data analyses or simulations. In this paper, we address several questions about reproducibility. For example, what are the technical and non-technical barriers to reproducible computational studies? What opportunities and challenges do computational notebooks offer to overcome some of these barriers? What tools are available and how can they be used effectively? We have developed a set of rules to serve as a guide to scientists with a specific focus on computational notebook systems, such as Jupyter Notebooks, which have become a tool of choice for many applications. Notebooks combine detailed workflows with...
Knowledge representation and reasoning (KR&R) has been successfully implemented in many fields to... more Knowledge representation and reasoning (KR&R) has been successfully implemented in many fields to enable computers to solve complex problems with AI methods. However, its application to biomedicine has been lagging in part due to the daunting complexity of molecular and cellular pathways that govern human physiology and pathology. In this article, we describe concrete uses of Scalable PrecisiOn Medicine Knowledge Engine (SPOKE), an open knowledge network that connects curated information from thirty-seven specialized and human-curated databases into a single property graph, with 3 million nodes and 15 million edges to date. Applications discussed in this article include drug discovery, COVID-19 research and chronic disease diagnosis, and management.
Background: Biological data have traditionally been stored and made publicly available through a ... more Background: Biological data have traditionally been stored and made publicly available through a variety of on-line databases, whereas biological knowledge has traditionally been found in the printed literature. With journals now online and providing an increasing amount of open access content, often free of copyright restriction, this distinction between database and literature is blurring. To exploit this opportunity we present the integration of open access literature with the RCSB Protein Data Bank (PDB). Results: BioLit provides an enhanced view of articles with markup of semantic data and links to biological databases, based on the content of the article. For example, words matching to existing biological ontologies are highlighted and database identifiers are linked to their database of origin. Among other functions, it identifies PDB IDs that are mentioned in the open access literature, by parsing the full text for all research articles in PubMed Central (PMC) and exposing t...
The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven e... more The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Here we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aureginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility (P. aureginosa only). We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. We conclude that, w...
The Protein Data Bank (PDB) was established in 1971 as a repository for the three dimensional str... more The Protein Data Bank (PDB) was established in 1971 as a repository for the three dimensional structures of biological macromolecules. Since then, more than 85 000 biological macromolecule structures have been determined and made available in the PDB archive. Through analysis of the corpus of data, it is possible to identify trends that can be used to inform us abou the future of structural biology and to plan the best ways to improve the management of the ever‐growing amount of PDB data.
Summary We developed a new software tool, BioJava-ModFinder, for identifying protein modification... more Summary We developed a new software tool, BioJava-ModFinder, for identifying protein modifications observed in 3D structures archived in the Protein Data Bank (PDB). Information on more than 400 types of protein modifications were collected and curated from annotations in PDB, RESID, and PSI-MOD. We divided these modifications into three categories: modified residues, attachment modifications, and cross-links. We have developed a systematic method to identify these modifications in 3D protein structures. We have integrated this package with the RCSB PDB web application and added protein modification annotations to the sequence diagram and structure display. By scanning all 3D structures in the PDB using BioJava-ModFinder, we identified more than 30 000 structures with protein modifications, which can be searched, browsed, and visualized on the RCSB PDB website. Availability and Implementation BioJava-ModFinder is available as open source (LGPL license) at (https://github.com/biojava...
NATO Science for Peace and Security Series A: Chemistry and Biology, 2015
The increasing size and complexity of the three dimensional (3D) structures of biomacromolecules ... more The increasing size and complexity of the three dimensional (3D) structures of biomacromolecules in the Protein Data Bank (PDB) is a reflection of the growth in the field of structural biology. Although the PDB archive was initially used only in the field of structural biology, it has grown to become a valuable resource for understanding biology at a molecular level and is critical for designing new therapeutic options for various diseases. The many uses of the PDB archive depend upon on the tools and resources for both data management and for data access and analysis.
Reproducibility of computational studies is a hallmark of scientific methodology. It enables rese... more Reproducibility of computational studies is a hallmark of scientific methodology. It enables researchers to build with confidence on the methods and findings of others, reuse and extend computational pipelines, and thereby drive scientific progress. Since many experimental studies rely on computational analyses, biologists need guidance on how to set up and document reproducible data analyses or simulations. In this paper, we address several questions about reproducibility. For example, what are the technical and non-technical barriers to reproducible computational studies? What opportunities and challenges do computational notebooks offer to overcome some of these barriers? What tools are available and how can they be used effectively? We have developed a set of rules to serve as a guide to scientists with a specific focus on computational notebook systems, such as Jupyter Notebooks, which have become a tool of choice for many applications. Notebooks combine detailed workflows with...
Knowledge representation and reasoning (KR&R) has been successfully implemented in many fields to... more Knowledge representation and reasoning (KR&R) has been successfully implemented in many fields to enable computers to solve complex problems with AI methods. However, its application to biomedicine has been lagging in part due to the daunting complexity of molecular and cellular pathways that govern human physiology and pathology. In this article, we describe concrete uses of Scalable PrecisiOn Medicine Knowledge Engine (SPOKE), an open knowledge network that connects curated information from thirty-seven specialized and human-curated databases into a single property graph, with 3 million nodes and 15 million edges to date. Applications discussed in this article include drug discovery, COVID-19 research and chronic disease diagnosis, and management.
Background: Biological data have traditionally been stored and made publicly available through a ... more Background: Biological data have traditionally been stored and made publicly available through a variety of on-line databases, whereas biological knowledge has traditionally been found in the printed literature. With journals now online and providing an increasing amount of open access content, often free of copyright restriction, this distinction between database and literature is blurring. To exploit this opportunity we present the integration of open access literature with the RCSB Protein Data Bank (PDB). Results: BioLit provides an enhanced view of articles with markup of semantic data and links to biological databases, based on the content of the article. For example, words matching to existing biological ontologies are highlighted and database identifiers are linked to their database of origin. Among other functions, it identifies PDB IDs that are mentioned in the open access literature, by parsing the full text for all research articles in PubMed Central (PMC) and exposing t...
The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven e... more The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Here we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aureginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility (P. aureginosa only). We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. We conclude that, w...
The Protein Data Bank (PDB) was established in 1971 as a repository for the three dimensional str... more The Protein Data Bank (PDB) was established in 1971 as a repository for the three dimensional structures of biological macromolecules. Since then, more than 85 000 biological macromolecule structures have been determined and made available in the PDB archive. Through analysis of the corpus of data, it is possible to identify trends that can be used to inform us abou the future of structural biology and to plan the best ways to improve the management of the ever‐growing amount of PDB data.
Summary We developed a new software tool, BioJava-ModFinder, for identifying protein modification... more Summary We developed a new software tool, BioJava-ModFinder, for identifying protein modifications observed in 3D structures archived in the Protein Data Bank (PDB). Information on more than 400 types of protein modifications were collected and curated from annotations in PDB, RESID, and PSI-MOD. We divided these modifications into three categories: modified residues, attachment modifications, and cross-links. We have developed a systematic method to identify these modifications in 3D protein structures. We have integrated this package with the RCSB PDB web application and added protein modification annotations to the sequence diagram and structure display. By scanning all 3D structures in the PDB using BioJava-ModFinder, we identified more than 30 000 structures with protein modifications, which can be searched, browsed, and visualized on the RCSB PDB website. Availability and Implementation BioJava-ModFinder is available as open source (LGPL license) at (https://github.com/biojava...
NATO Science for Peace and Security Series A: Chemistry and Biology, 2015
The increasing size and complexity of the three dimensional (3D) structures of biomacromolecules ... more The increasing size and complexity of the three dimensional (3D) structures of biomacromolecules in the Protein Data Bank (PDB) is a reflection of the growth in the field of structural biology. Although the PDB archive was initially used only in the field of structural biology, it has grown to become a valuable resource for understanding biology at a molecular level and is critical for designing new therapeutic options for various diseases. The many uses of the PDB archive depend upon on the tools and resources for both data management and for data access and analysis.
Uploads
Papers by Peter W Rose