HCLSIG/LODD/Data

LODD-related datasets that the LODD group already made available as Linked Data

NOTE: WORK IN PROGRESS. The fu-berlin datasets are being hosted by Bio2RDF; several are already there. Updates to this page and to the CKAN Datahub are pending.

Name	Topic	Short Description	Size and coverage	Status / Activity	Example Instances	SPARQL Endpoint
DrugBank	Drugs	Drugbank.ca provides drug (i.e., chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e., sequence, structure, and pathway) information (doi:10.1093/nar/gkj067)	766,920 triples; 4,800 drugs, 2,500 protein sequences	updated regularly	Varenicline via Marbles, via OpenLink Data Explorer	http://www4.wiwiss.fu-berlin.de/drugbank/sparql
LinkedCT	Clinical Trials	Linked data source of trials from ClinicalTrials.gov	~25 million triples, 106,000 trials (as of April 2011)	Updated automatically and continuously; refer to the FAQ for more details.	Breast Cancer (Condition), NCT00999557 (Trial), Toronto (City).	http://data.linkedct.org/sparql
DailyMed	All FDA-approved Structured Product Labels (SPLs) for currently marketed drugs enhanced with indexing to pharmacogenomics information and NDF-RT drug class assignments	Data available via a D2R server (sample data), as an RDF dump (full data, N-Triples), or from Virtuoso RDF Store (contact maintainer)	1,604,893 triples, 36,000+ product labels	Updated every Thursday using information from the DailyMed RSS feed	SPL for Venlafaxine Hydrochloride (American Health Packaging)	http://purl.org/net/nlprepository/linkedSPLs
DBpedia	Drugs/ Diseases/ Proteins	RDF data about 2.49 million things that have been extracted from Wikipedia	218 million RDF triples; 2,300 drugs, 2,200 proteins	updated every 3 months	Aspirin, HIV	http://dbpedia.org/sparql
Diseasome	Diseases / Genes	Diseasome describes characteristics of disorders and disease genes linked by known disorder–gene associations	91,182 triples; 2,600 genes	updated 2006	Alzheimer's via Marbles, via OpenLink Data Explorer	http://www4.wiwiss.fu-berlin.de/diseasome/sparql
The Drug Interaction Knowledge Base	Drugs / Metabolic Inhibition Drug-drug Interactions (DDIs) / Claims and Evidence for drug mechanisms and DDIs	A D2R server of more than 60 drugs currently in the DIKB	>41K	Updated 12/21/2012	paroxetine, atorvastatin	http://dbmi-icode-01.dbmi.pitt.edu:2020/
RDF-TCM	Genes / Diseases / Medicine / Ingredients	Traditional Chinese medicine, gene and disease association dataset and a linkset mapping TCM gene symbols to Entrez Gene IDs created by Neurocommons	117,643	updated August 2009 (stable)	Ginkgo biloba	http://www.open-biomed.org.uk/sparql/endpoint/tcm
RxNorm	Drugs	A linked version of the NLM's RxNorm database that connects prescription drugs, ingredients, and NDCs through the RXCUI, a concept unique identifier. RxNorm is a product developed by NIH’s National Library of Medicine. It currently interlinks 12 different drug vocabularies around a unique concept identifier. Due to licensing, only six of the drug vocabularies are made available as part of the LODD cloud: Medical Subject Headings, Metathesaurus FDA National Drug Code Directory, Metathesaurus FDA Structured Product Labels, National Drug File, RxNorm Vocabulary, and Veterans Health Administration National Drug File. Links are provided connecting RxNorm to DrugBank and to the UMLS.	over 7.7 million triples; 165,806 RXCUI (Concept Unique Identifiers) unique drugs and ingredients; 332,754 RXAUI (Atomic Unique Identifiers) sourced terms	Based on 3/2010 RxNorm release; last updated 5/2010	Singulair from the Metathesaurus FDA Structured Product Labels	http://link.informatics.stonybrook.edu/sparql/
SIDER	Diseases / Side Effects	SIDER contains information on marketed drugs and their adverse effects (doi:10.1038/msb.2009.98)	192,515 triples; 63,000 adverse effect reports, 1,737 genes	updated 2009	Confusion via Marbles	http://www4.wiwiss.fu-berlin.de/sider/sparql
STITCH	Chemicals / Proteins	STITCH contains information on chemicals, proteins, and their interactions (doi:10.1093/nar/gkm795)	7,500,000 chemicals; 500,000 proteins; 370 organisms	updated July 2009	Lactose via Marbles	http://www4.wiwiss.fu-berlin.de/stitch/sparql
Medicare	Medicare Formulary	xxx	xxx	xxx	xxx	http://www4.wiwiss.fu-berlin.de/medicare/sparql
ChEMBL	Chemical / Assays (Proteins, Organisms) / Papers	ChEMBL contains information on trial drugs, including their activity against targets such as, but not limited to, proteins. All data is backed by and linked to the literature. Includes links to Bio2RDF for ChEBI and UniProt. License: CC-BY-SA.	~130M triples	Updated 2010-01	An IC50 activity.	http://rdf.farmbio.uu.se/chembl/sparql
WHO's Global Health Observatory (GHO)	Infectious Diseases /Demography / Socioeconomic Conditions / Environmental Factors	Data and statistics for infectious diseases at country, regional, and global levels	~3M triples	Updated 2012-05	xxx	http://gho.aksw.org
University of Pittsburgh NLP Repository	Drugs / Procedures / Diagnoses	A semantic index of concepts present in 800 full-text clinical notes from the University of Pittsburgh NLP Repository	38,664	Proof of concept -- Updated 02/25/2011	Concepts from a sample radiology report	http://dbmi-icode-01.dbmi.pitt.edu:8080/sparql
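
The endpoints listed in the table can, in principle, be queried over HTTP using the SPARQL Protocol. The following is a minimal sketch, not a tested client for any specific endpoint: it assumes the endpoint accepts a GET request with a `query` parameter and can return SPARQL JSON results, and several of the listed endpoints may no longer be online. The helper names (`build_query_url`, `parse_bindings`, `run_query`) and the sample query are illustrative and not part of any LODD tooling.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint URL copied from the table; it may no longer be reachable.
DRUGBANK_ENDPOINT = "http://www4.wiwiss.fu-berlin.de/drugbank/sparql"

# Vocabulary-neutral sample query: fetch a handful of arbitrary triples.
SAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_query_url(endpoint, query):
    """Return a SPARQL Protocol GET URL for the endpoint and query."""
    separator = "&" if "?" in endpoint else "?"
    return endpoint + separator + urlencode({"query": query})


def parse_bindings(json_text):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    document = json.loads(json_text)
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in document["results"]["bindings"]
    ]


def run_query(endpoint, query, timeout=30):
    """Send the query and return the parsed rows (requires network access)."""
    request = Request(
        build_query_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return parse_bindings(response.read().decode("utf-8"))
```

Calling `run_query(DRUGBANK_ENDPOINT, SAMPLE_QUERY)` would return one dict per result row, keyed by variable name, provided the endpoint is still available and supports JSON results.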

A graph of some of the LODD datasets (dark grey), related biomedical datasets (light grey), related general-purpose datasets (white) and their interconnections. Line weights correspond to the number of links. The direction of an arrow indicates the dataset that contains the links, e.g., an arrow from A to B means that dataset A contains RDF triples that use identifiers from B. Bidirectional arrows usually indicate that the links are mirrored in both datasets. More on the interlinking methodology and statistics can be found on the Interlinking page.
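
The arrow convention in the graph can be made concrete with a small sketch that counts links per direction. It assumes that the namespace of a triple's subject identifies the dataset that contains the triple, and that a link is any triple whose object URI belongs to a different dataset. The namespace prefixes and the sample triples below are made up for illustration and are not taken from the actual dumps.

```python
from collections import Counter

# Assumed namespace prefix per dataset (illustrative only).
NAMESPACES = {
    "DrugBank": "http://www4.wiwiss.fu-berlin.de/drugbank/",
    "SIDER": "http://www4.wiwiss.fu-berlin.de/sider/",
    "DBpedia": "http://dbpedia.org/resource/",
}

# Hypothetical sample triples as (subject, predicate, object) strings.
SAMPLE_TRIPLES = [
    ("http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB01273",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/Varenicline"),
    ("http://www4.wiwiss.fu-berlin.de/sider/resource/drugs/1",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB01273"),
    ("http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB01273",
     "http://www.w3.org/2000/01/rdf-schema#label",
     "Varenicline"),  # literal object: not a link
]


def dataset_of(uri, namespaces=NAMESPACES):
    """Return the dataset whose namespace prefixes the URI, or None."""
    for name, prefix in namespaces.items():
        if uri.startswith(prefix):
            return name
    return None


def count_links(triples, namespaces=NAMESPACES):
    """Count links per (source, target) pair; an arrow runs source -> target."""
    counts = Counter()
    for subject, _predicate, obj in triples:
        source = dataset_of(subject, namespaces)
        target = dataset_of(obj, namespaces)
        if source and target and source != target:
            counts[(source, target)] += 1
    return counts
```

For the sample triples, `count_links(SAMPLE_TRIPLES)` yields one link from DrugBank to DBpedia and one from SIDER to DrugBank; in a graph like the one described, such counts would determine the line weights.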

The LODD datasets have been crawled by the SWSE Semantic Web search engine and can be accessed via a faceted browsing interface at [1] (Example query: Varenicline).

Most of the LODD datasets have also been integrated into the SPARQL endpoint of the HCLS Knowledge Base, see the wiki page of the HCLS KB for further information.

Bio2RDF Data Sets

The Bio2RDF project has published 40 biology-, gene- and medical-related datasets (altogether 2.3 billion triples). The datasets are available via SPARQL endpoints and as Linked Data. It is recommended that you use the Bio2RDF Java Servlet, and optionally download the databases for efficient personal use. Running your own instance of the OpenLink Virtuoso AMI for EC2 is also an option. Basic URI resolution then does not require the Java Servlet, but for advanced queries you should still download it and configure it to query your EC2 SPARQL endpoint.

Bio2RDF SPARQL endpoint list; SPARQL endpoint list in RDF
Identification of an autoimmune enteropathy-related 75-kilodalton antigen, via an OpenLink hosted edition of Bio2RDF
Structure of the gene encoding the human cyclin-dependent kinase inhibitor p18 and mutational analysis in breast cancer, via an OpenLink hosted edition of Bio2RDF
PubMed article viewed using the Marbles Linked Data browser.
PubMed author viewed using the Marbles Linked Data browser.
OMIM Killer Cell Lectin-Like Receptor viewed using the Marbles Linked Data browser.
Falcons Search for KILLER CELL. The Bio2RDF data has been crawled by the Falcons Semantic Web search engine. This is an example of how humans can access the data through the search engine. Falcons also offers an API that can be used by applications to access the data.

Chem2bio2RDF

Information about the chem2bio2rdf data sets

Data Sets for the LODD Task

To complement the drug-related Web of Data built by the LODD effort, the following data sets could/should also be published as Linked Data.

The LODD effort is currently gathering more information about relevant datasets. See also Evaluation of LODD Data Sets for current evaluation results.

Adis R&D Insight
ChEBI
ChemBlast
ChemSpider
ClinicalTrials.gov
Citeline TrialTrove
DailyMed
DBpedia
Diseasome
DrugBank
DrugDB
Drugome
Drug Ontology
Investigational Drug Database - Proprietary
IMS
KEGG Drug
LillyTrials
MedWatch
National Drug Code
OMIM
Orange Book
Pharmaprojects - Proprietary
PubChem
RxNorm
VA NDF-RT
Other data sources could include blogs, discussion boards, wikis, etc.
and also:
- World Health Organization's Global Health Atlas
- EpiSPIDER
- Drugs@FDA - FDA Approved Drug Products
- DrugDigest
- HumanCyc: Encyclopedia of Homo sapiens Genes and Metabolism
- Alzheimer Research Forum
- RxTerms
- HuDiNe
- Medpedia
- TCMGeneDIT and RDF dump
- List of other possible data sources from page 66 onwards

Alternative Herbal Medicine use case

TCMGeneDIT dataset

Identifier-Based Linkage Points

InChIs
PubChem Compound ID (CID)
PubChem NSC
Chemical Abstracts Service (CAS) Registry Number
New Drug Application (NDA)
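
As an illustration of how one of these linkage points could be used, the sketch below joins records from two sources on a shared CAS Registry Number, keeping only numbers that pass the standard CAS check-digit test. The record layout (a `cas` key per record) and the sample records are assumptions made for this example; they do not reflect the actual schema of any LODD dataset.

```python
import re

# CAS Registry Numbers look like 50-78-2: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas):
    """Check the format and the check digit of a CAS Registry Number."""
    match = CAS_PATTERN.match(cas.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight digits 1, 2, 3, ... from right to left; the sum mod 10
    # must equal the check digit.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def join_on_cas(left, right):
    """Pair records from two sources that share a valid CAS number."""
    index = {}
    for record in right:
        cas = record["cas"].strip()
        if is_valid_cas(cas):
            index.setdefault(cas, []).append(record)
    pairs = []
    for record in left:
        cas = record["cas"].strip()
        if is_valid_cas(cas):
            pairs.extend((record, other) for other in index.get(cas, []))
    return pairs


# Hypothetical sample records for illustration.
SOURCE_A = [
    {"id": "A1", "name": "Aspirin", "cas": "50-78-2"},
    {"id": "A2", "name": "Bad entry", "cas": "50-78-3"},  # wrong check digit
]
SOURCE_B = [
    {"id": "B7", "cas": " 50-78-2 "},  # same compound, stray whitespace
    {"id": "B8", "cas": "7732-18-5"},  # water; no match in SOURCE_A
]
```

Validating the check digit before joining is one simple way to avoid creating links from mistyped identifiers; the same pattern would apply to the other identifiers listed above, each with its own normalization rules.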

Data Set Attributes

Licensing
Data Format
Identifiers