Abstract
Cistrome Data Browser: expanded datasets and new tools for gene regulatory analysis
The Cistrome Data Browser (DB) is a resource of human and mouse cis-regulatory information derived from ChIP-seq, DNase-seq and ATAC-seq chromatin profiling assays, which map the genome-wide locations of transcription factor binding sites, histone post-translational modifications and regions of chromatin accessible to endonuclease activity. Currently, the Cistrome DB contains approximately 47,000 human and mouse samples with about 24,000 newly collected datasets compared to the previous release two years ago. Furthermore, the Cistrome DB has a new Toolkit module with several features that allow users to better utilize the large-scale ChIP-seq, DNase-seq, and ATAC-seq data. First, users can query the factors which are likely to regulate a specific gene of interest. Second, the Cistrome DB Toolkit facilitates searches for factor binding, histone modifications, and chromatin accessibility in any given genomic interval shorter than 2Mb. Third, the Toolkit can determine the most similar ChIP-seq, DNase-seq, and ATAC-seq samples in terms of genomic interval overlaps with user-provided genomic interval sets. The Cistrome DB is a user-friendly, up-to-date, and well maintained resource, and the new tools will greatly benefit the biomedical research community. The database is freely available at http://cistrome.org/db, and the Toolkit is at http://dbtoolkit.cistrome.org.
Transcription factors (TFs) bind to cis-regulatory elements and regulate the transcription rates of genes through complex mechanisms, which involve the disruption of nucleosomes, the alteration of histone post-translational modifications, the recruitment or eviction of protein complexes, etc. (1). Cistromes, defined as genome-wide maps of the cis-regulatory binding sites of trans-acting factors, are invaluable for understanding the complex biology of gene regulation (2,3). Chromatin immunoprecipitation and DNA sequencing (ChIP-seq) experiments (4–7) targeting histones in particular post-translational modification states have revealed that histone marks can be used to identify promoters and enhancers (8,9), discriminate between repressive and activating regulatory states (8,10), and distinguish genes that are actively transcribed from silent ones (11). It has been estimated that over 1,600 TFs exist in human and mouse (12,13). As these are expressed in different combinations according to cell type and state, comprehensive mapping of these cistromes by ChIP-seq is an enormous challenge. DNase-seq (14) and ATAC-seq (15) are technologies developed to comprehensively map most of the TF binding sites in a biological sample through the characterization of regions that are accessible to DNase I or Tn5 transposase enzymatic activity. The raw sequencing data from tens of thousands of ChIP-seq, DNase-seq and ATAC-seq experiments, carried out by consortia such as ENCODE (16) and the Epigenomics Roadmap Project (17), as well as by individual research groups are publicly available in repositories such as GEO (18). The Cistrome Data Browser (DB) is a platform that extracts useful cis-regulatory information from these datasets and provides features that allow the biomedical research community to readily find and re-use this information (19).
Although there are other ChIP-seq databases, including ChIP-Atlas (BioRxiv: https://doi.org/10.1101/262899), ChIPBase (20) and ReMap (21), the Cistrome DB differs from these in terms of sample coverage, comprehensive quality control metrics, data browsing and querying capabilities, and downstream analysis functions. We reported the first version of the Cistrome DB in the 2017 Nucleic Acids Research database issue (19). Here, we present an updated version which doubles the size of the original collection and, for data released before 1 February 2018, includes ~25,000 human and ~22,000 mouse samples. To increase the utility of these resources we have also implemented several Toolkit features for querying the Cistrome DB data. These new features allow users to find the predicted regulators of a specific gene, determine factors that bind to a specific genomic interval, and identify factors with similar cistromes to a user-provided cistrome.
Data collection
ChIP-seq, DNase-seq, and ATAC-seq samples were identified in three public databases: NCBI Gene Expression Omnibus (GEO), Encyclopedia of DNA Elements (ENCODE), and the Roadmap Epigenomics Project. In the case of GEO, all sample identifiers (GSM IDs) were obtained from the SRA database using the query ‘(homo sapiens[Organism] OR mus musculus[Organism])’. Sample XML files were downloaded from GEO and parsed to determine the species (‘Organism’) and the data type (‘Library Strategy’), based on the ‘ChIP-Seq’ and ‘DNase-seq’ labels. Since ATAC-seq data are usually labeled as ‘OTHER’ in the library strategy field, the Cistrome DB parser identified ATAC-seq data by matching keywords in the GEO sample description text. Single-cell ATAC-seq data were excluded if the sample description matched terms such as ‘scATAC-seq’ or ‘single cell ATAC’.
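The classification rules described above can be illustrated with a minimal sketch. The regular expressions and the function name `classify_sample` are hypothetical; the text does not give the actual keyword list used by the Cistrome DB parser, so this shows only the general shape of the logic.

```python
import re

# Hypothetical keyword patterns; the real parser's keyword list is not given in the text.
ATAC_PATTERN = re.compile(r"atac[-\s]?seq", re.IGNORECASE)
SINGLE_CELL_PATTERN = re.compile(r"scatac|single[-\s]?cell\s+atac", re.IGNORECASE)


def classify_sample(library_strategy, description):
    """Return 'ChIP-seq', 'DNase-seq', 'ATAC-seq', or None if the sample is skipped."""
    strategy = library_strategy.strip().lower()
    if strategy == "chip-seq":
        return "ChIP-seq"
    if strategy == "dnase-seq":
        return "DNase-seq"
    # ATAC-seq is usually filed under 'OTHER', so fall back to the description text.
    if strategy == "other" and ATAC_PATTERN.search(description):
        # Exclude single-cell ATAC-seq samples.
        if SINGLE_CELL_PATTERN.search(description):
            return None
        return "ATAC-seq"
    return None
```

A sample with library strategy ‘OTHER’ is kept only when its description mentions ATAC-seq and does not match a single-cell term; every other unrecognised strategy is skipped.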
Data processing and quality control
The data in these public databases were produced by numerous laboratories, and the processed results were derived using a variety of algorithms. To improve the consistency of Cistrome DB data, the raw DNA sequence data for each sample were downloaded and uniformly processed by the ChiLin pipeline (22), which uses BWA (23) to map reads to the hg38 or mm10 genomes and MACS2 (24) to identify statistically significant peaks. Raw SRA files were downloaded from NCBI at ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/. We obtained FASTQ files from the SRA files using the fastq-dump software (https://ncbi.github.io/sra-tools/fastq-dump.html). Motif scanning was also performed on transcription factor or chromatin regulator ChIP-seq samples based on enrichment of the motif sequence relative to the center of the peaks (25). Target genes were predicted from ChIP-seq peaks using the regulatory potential model, which weighs the impact of each peak by an exponential decay of its distance to the gene transcription start site (TSS) (26). Additional information about these data can be found on the Cistrome DB document page at http://cistrome.org/db/#/documents.
Cistrome DB data quality controls include six metrics, representing DNA sequencing quality, ChIP quality, and genomic distribution characteristics. Read quality is based on the median FASTQ read quality, mapping quality is measured by the percentage of reads that map to a unique genomic locus, and the PCR bottleneck coefficient (PBC) is used to estimate the rate of read duplication through PCR amplification (27,28). The fraction of non-mitochondrial reads in peak regions (FRiP) and the number of peaks with 10-fold enrichment are used to reflect the quality of the ChIP experiment (27,28). A union of DNase hypersensitive sites (Union DHS) was summarized using a large collection of DNase-seq samples from the Cistrome DB (19,29). The percentage of peaks that overlap with the union of DHS sites is used to characterize the data quality based on the genomic distribution of the peaks. Although most TFs and chromatin associated factors tend to bind at DHS sites, some histone marks and factors do not follow this trend. Cutoffs were determined based on the distribution of these quality control metrics in the Cistrome DB (22); a red dot indicates lower quality on a given metric, while a green dot indicates higher quality (Figure 1). These QC measures are meant to guide users in their appraisal of data, instead of being used strictly to categorize samples as pass or fail. Although the Cistrome DB includes some samples which appear to be of poor quality by several metrics, these samples may nevertheless hold valuable clues to some aspect of regulatory biology not represented by other samples in the database.
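Two of these metrics can be made concrete with a short sketch. The PBC below follows the commonly used definition (genomic locations with exactly one mapped read divided by all distinct locations with at least one read); that the Cistrome DB pipeline computes it in exactly this way is an assumption. The function names, the input representations and the mitochondrial chromosome labels are likewise illustrative.

```python
from collections import Counter


def pbc(read_positions):
    """PCR bottleneck coefficient: locations with exactly one read / distinct locations.

    read_positions: iterable of hashable locations, e.g. (chrom, start, strand).
    """
    counts = Counter(read_positions)
    if not counts:
        return 0.0
    single = sum(1 for c in counts.values() if c == 1)
    return single / len(counts)


def frip(reads, peaks, mito_names=("chrM", "MT")):
    """Fraction of non-mitochondrial reads that fall inside any peak.

    reads: list of (chrom, position); peaks: list of (chrom, start, end), half-open.
    """
    nuclear = [(c, p) for c, p in reads if c not in mito_names]
    if not nuclear:
        return 0.0
    in_peaks = sum(
        1 for c, p in nuclear
        if any(pc == c and s <= p < e for pc, s, e in peaks)
    )
    return in_peaks / len(nuclear)
```

The linear scan over peaks keeps the example short; a production pipeline would use sorted intervals or an interval index instead.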
Toolkit development
To enhance the usage of Cistrome DB data, three new ‘Toolkit’ functionalities have been developed. These can be accessed through a link on the Cistrome DB webpage or at the URL: http://dbtoolkit.cistrome.org. The first function addresses the question: ‘What factors regulate your gene of interest?’ The assignment of TFs to genes is based on regulatory potential (RP) scores that reflect the collective influence of the binding sites of a given TF on genes near these sites (30), and assume that TF binding sites near the TSS are more likely to regulate the gene than those further away. As different TFs may regulate genes over different ranges of genomic influence, short (1 kb), mid-range (10 kb) and long-range (100 kb) influence scores are calculated for each TF. Each of these distances serves as the exponential decay parameter used to estimate the impact of a TF binding site from its distance to the TSS. To focus on high quality and high confidence peaks, only peaks with 5-fold enrichment over background were used in these RP score calculations. As the total number of peaks varies between samples and this number influences the RP scores, the RP scores for each sample were standardized to fit into a range between 0 and 1 to enable cross-sample comparison. Through the interactive web interface, users can input a coding gene name and select the required parameters (species, distance). The Cistrome DB Toolkit queries RP scores across all the samples and returns the samples ranked by their RP score for this gene.
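The RP calculation can be sketched as follows. Two details here are assumptions rather than statements from the text: the decay is written in half-life form, so that a peak at the decay distance contributes a weight of 0.5, and the scaling into the 0 to 1 range is done by min-max normalization across the genes of one sample. The function names are hypothetical.

```python
def rp_score(tss, peaks, decay_bp=10_000, min_fold=5.0):
    """Sum distance-decayed weights over sufficiently enriched peaks.

    tss: TSS coordinate of the gene.
    peaks: iterable of (summit_position, fold_enrichment) on the same chromosome.
    decay_bp: decay distance (1 kb, 10 kb or 100 kb in the text).
    """
    return sum(
        2.0 ** (-abs(summit - tss) / decay_bp)  # assumed half-life form of the decay
        for summit, fold in peaks
        if fold >= min_fold  # keep only peaks with at least 5-fold enrichment
    )


def scale_to_unit_range(scores):
    """Min-max scale a {gene: score} mapping for one sample into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {gene: 0.0 for gene in scores}
    return {gene: (s - lo) / (hi - lo) for gene, s in scores.items()}
```

With a 10 kb decay distance, a peak at the TSS contributes 1.0 and a peak 10 kb away contributes 0.5, so nearby binding sites dominate the score, as the model intends.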
Two additional Cistrome DB Toolkit functions were developed to address the questions: ‘What factors bind in your interval?’ and ‘What factors have a significant binding overlap with your peak set?’ The GIGGLE algorithm (31), which is fast and accurate, is used to search and compare Cistrome samples with the user-defined intervals or peak sets. Only samples which have >1000 five-fold enriched peaks were used to build the GIGGLE search index. Further details about the Cistrome DB Toolkit can be found at http://dbtoolkit.cistrome.org/document.
Design of the Cistrome DB
The Cistrome DB concentrates on collecting publicly available ChIP-seq, DNase-seq and ATAC-seq data in human and mouse and providing functionalities to yield useful insights from the collected data (Figure 1). Cistrome DB users can search published ChIP-seq or chromatin accessibility data by factor, biological source (cell line, cell type and tissue type), and species. Sample quality control reports are available and the quality of multiple samples can be assessed simultaneously by green and red dots which indicate high and low quality control metrics, respectively. Visualization of multiple samples is provided through the UCSC Genome Browser (32,33) and the WashU Epigenome Browser (34). In addition, users can conveniently download peaks from one particular sample or from a bulk collection. In terms of downstream analysis, Cistrome DB predicts target genes and evaluates motif enrichments for transcription factor ChIP-seq data. The Cistrome DB Toolkit is a new module which enables better re-use of the data collection.
Integration of data sources
The total number of human and mouse samples in the Cistrome DB has grown steadily since 2008 (Figure 2A). In the current collection (February 2018), the Cistrome DB incorporates ~25,000 human and ~22,000 mouse samples, which doubles the number of samples in the last release (19). This collection not only increases the sample size for trans-factor/histone mark ChIP-seq and chromatin accessibility data in human and mouse, but also increases the types of factors and histone marks (Figure 2B and C). The current Cistrome DB contains ~1,700 factors and 132 histone marks/variants in human, and 965 factors and 120 histone marks/variants in mouse (Figure 2B). Examples of new factors include ZBTB48 (35,36) and ZMYM3 (37) in human, and SPEN and TERF2IP (38) in mouse; examples of new histone modifications/variants include H3F3A (37) and H2AFZ (37) in human, and H3K9BHB (39) and H2BK5me1 (40) in mouse (Figure 2D). The new data in the Cistrome DB are of similarly high quality to the previous collection, as evident from the number of highly enriched peaks and the overlap with the union of DHS sites (Figure 2E).
Query, visualization, and download
The Cistrome DB provides a drop-down menu to find samples with certain annotations, such as TF name, histone modification, cell line, cell type, and tissue type. Alternatively, users can directly search for Cistrome DB data by typing keywords. After finding relevant samples and filtering using quality control metrics, users can visualize sample batches on the WashU Epigenome Browser and on the UCSC Genome Browser. The Cistrome DB also displays the enrichment levels of known and de novo motifs with a sequence logo for each transcription factor and chromatin regulator ChIP-seq sample in the collection. A list of genes that are predicted to be directly regulated by the factor is provided for ChIP-seq samples, and users can further search by gene name to check whether a given gene can be targeted by the factor. Bulk download of peak files of many samples is supported, which could be a useful resource for computational groups.
Cistrome DB Toolkit
The Cistrome DB Toolkit was designed to help users easily extract useful cis-regulatory information from the large collection of Cistrome DB data. In this module, we provide tools to address three questions that are likely to be of interest to many users. The first tool addresses the question: ‘What factors regulate your gene of interest?’ This function returns a list of the transcription factors in the Cistrome DB that are the most likely regulators of a query gene based on the positions of transcription factor ChIP-seq peaks relative to the transcription start site. As an example, we asked what regulators target the human Androgen Receptor (AR) gene. To include long-range enhancer effects in this case, we set the distance influence parameter to 100 kb. The top factors returned by the Toolkit function are GATA2, AR, ERG, FOXA1, PIAS1, consistent with the known regulators of AR (41–43) (Figure 3A).
The second tool answers the question: ‘What factors bind in your interval?’ This function identifies TF binding, histone modifications, and chromatin accessibility in any query genomic interval shorter than 2Mb. As an example, we queried an interval with known distal enhancers of the AR gene (chrX:66,897,958–66,908,958 hg38) in human prostate cancer cells (44). Since the number of peaks varies between different ChIP-seq samples, the number of peaks in this interval divided by the total number of peaks for the factor is used to rank the result. The top factors returned by the Toolkit function are PIAS1, FOXA1, AR, ERG, POLR2A, etc (Figure 3B). The WashU Epigenome Browser view (45,46) (Figure 3C) shows the binding peaks within this enhancer, which can help determine the functional sequence and the factors bound to this sequence.
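The ranking rule for the interval query (peaks inside the interval divided by the total number of peaks in the sample) can be sketched as below. The Toolkit itself searches a GIGGLE index; the linear scan here is a stand-in chosen for brevity, and the function name and data layout are hypothetical.

```python
def rank_samples_by_interval(samples, chrom, start, end, max_len=2_000_000):
    """Rank samples by the fraction of their peaks that overlap a query interval.

    samples: {sample_name: list of (chrom, start, end)} with half-open coordinates.
    Returns a list of (sample_name, fraction) sorted from highest to lowest.
    """
    if end <= start:
        raise ValueError("interval end must be greater than start")
    if end - start >= max_len:
        raise ValueError("query interval must be shorter than 2 Mb")
    ranked = []
    for name, peaks in samples.items():
        if not peaks:
            continue  # a sample without peaks cannot be scored
        hits = sum(1 for c, s, e in peaks if c == chrom and s < end and e > start)
        ranked.append((name, hits / len(peaks)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

Dividing by the total peak count prevents samples with very large peak sets from ranking highly merely because they cover more of the genome.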
The third tool answers the question: ‘What factors have a significant binding overlap with your peak set?’ This function compares the strongest peaks in each cistrome with the peak set provided by the user. Users can upload their own set of genomic intervals, such as a ChIP-seq peak set in BED file format. The function then identifies the samples in the Cistrome DB that have the most significant peak overlaps with the input, which might be cofactors, histone marks, or chromatin accessibility profiles associated with the input sample. We tested this function using ChIP-seq peaks of BATF (GSM1370277) (47), and compared the results using either the top one thousand peaks or all the peaks in each Cistrome DB sample. The top 200 hits in the results using the two options share 150 common samples (Figure 3D), including ChIP-seq samples of BATF, JUND, IRF4, JUNB, BATF3 and other factors that are known to co-bind with BATF (48,49) (Figure 3E).
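The comparison between the two options (top peaks versus all peaks per sample) can be illustrated with a simplified sketch. It ranks samples by a raw overlap count rather than by the significance score that GIGGLE reports, so it is a stand-in for the idea and not the Toolkit's scoring; all function names and the peak tuple layout are hypothetical.

```python
def top_peaks(peaks, k):
    """Keep the k highest-scoring peaks; peaks are (chrom, start, end, score)."""
    return sorted(peaks, key=lambda p: p[3], reverse=True)[:k]


def count_overlaps(query, peaks):
    """Number of query intervals (chrom, start, end) overlapping at least one peak."""
    return sum(
        1 for qc, qs, qe in query
        if any(pc == qc and ps < qe and pe > qs for pc, ps, pe, _ in peaks)
    )


def rank_by_overlap(query, samples, k=None):
    """Rank samples by overlap count, using all peaks (k=None) or only the top k."""
    scored = []
    for name, peaks in samples.items():
        chosen = peaks if k is None else top_peaks(peaks, k)
        scored.append((name, count_overlaps(query, chosen)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


def shared_hits(ranking_a, ranking_b, n):
    """Sample names that appear in the top n of both rankings."""
    names_a = {name for name, _ in ranking_a[:n]}
    names_b = {name for name, _ in ranking_b[:n]}
    return names_a & names_b
```

Calling `shared_hits` on the rankings obtained with `k=1000` and `k=None` and `n=200` mirrors the comparison reported above for the BATF peak set.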
We report an update of the Cistrome DB which includes an expanded data collection and new functionalities. Users can search by keyword or by drop-down menu for any factor they are interested in, and evaluate the quality of the data and the characteristics of the resulting cistromes. In addition, users can find informative data using the new Toolkit functions which are based on genomic binding patterns rather than metadata annotations. This way of finding data can lead to new hypotheses regarding cis-elements or trans-factors that might be functionally associated with the user input on gene regulation. The Cistrome DB is currently the most comprehensive resource for searching, visualizing, and exploring publicly available ChIP-seq and chromatin accessibility data of human and mouse. Because it is based on the collection of public data and relies on the automatic parsing of sample metadata from the data sources, occasional mis-annotation, incompleteness or ambiguity in the system is unavoidable. Correction of these types of error will require involvement from the community, especially the data contributors, and we are developing a web interface for users to conveniently correct metadata errors. In the future, the Cistrome DB team will continue to collect all newly produced ChIP-seq and chromatin accessibility data, but will prioritize factors and histone modifications that are less well represented in the existing collection. In addition, we will explore the use of long-range chromatin interaction data, such as those available at The 3D Genome Browser (50), to improve TF target predictions. We hope that an awareness of the available data in the Cistrome DB will lead data producers to explore factors and cell types that are not well represented and thereby enrich the diversity and utility of cistromes.
We will continue to maintain the database, incorporate new data, and develop new features into the Cistrome DB, to help accelerate the investigations and understanding of gene regulatory mechanisms in biological processes and diseases.
The authors would like to acknowledge Dr Zhiping Weng for providing backup of the Cistrome DB and Dr Ting Wang for the WashU Epigenome Gateway Browser.
National Key Research and Development Program of China [2016YFC1303200 to X.Z., 2017YFC0908500 to X.S.L.]; National Natural Science Foundation of China [31801110 to S.M., 81573023 to X.Z.]; National Institutes of Health of US [U24 HG009446 and U01 CA180980 to X.S.L.]. Funding for open access charge: National Institutes of Health of US [U24 HG009446 and U01 CA180980 to X.S.L.].
Conflict of interest statement. None declared.
Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press