. Author manuscript; available in PMC: 2013 Mar 6.

Published in final edited form as: Nature. 2012 Sep 6;489(7414):57–74. doi: 10.1038/nature11247

An Integrated Encyclopedia of DNA Elements in the Human Genome

The ENCODE Project Consortium

PMCID: PMC3439153 NIHMSID: NIHMS381381 PMID: 22955616

Summary

The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure, and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall the project provides new insights into the organization and regulation of our genes and genome, and an expansive resource of functional annotations for biomedical research.

The human genome sequence provides the underlying code for human biology. Despite intensive study, especially in identifying protein coding genes, our understanding of the genome is far from complete, particularly with regard to noncoding RNAs, alternatively spliced transcripts, and regulatory sequences. Systematic analyses of transcripts and regulatory information are essential to the identification of genes and regulatory regions and are an important resource for the study of human biology and disease. Such analyses can also provide comprehensive views of the organization and variability of genes and regulatory information across cellular contexts, species and individuals.

The Encyclopedia of DNA Elements (ENCODE) Project aims to delineate all functional elements encoded in the human genome^1–3. Operationally, we define a functional element as a discrete genome segment that encodes a defined product (e.g., protein or non-coding RNA) or displays a reproducible biochemical signature (e.g., protein-binding, or a specific chromatin structure). Comparative genomic studies suggest that 3–8% of bases are under purifying (negative) selection ^4–8 and therefore may be functional, although other analyses have suggested much higher estimates ^9–11. In a pilot phase covering 1% of the genome, the ENCODE project annotated 60% of mammalian evolutionarily constrained bases, but also identified many additional putative functional elements without evidence of constraint². The advent of more powerful DNA sequencing technologies now enables whole genome and more precise analyses with a broad repertoire of functional assays.

Here, we describe production and initial analysis of 1,640 datasets designed to annotate functional elements in the entire human genome. We integrate results from diverse experiments within cell types, related experiments involving 147 different cell types, and all ENCODE data with other resources, such as candidate regions from genome-wide association studies (GWAS) and evolutionarily constrained regions. Together, these efforts reveal important features about the organization and function of the human genome, including:

The vast majority (80.4%) of the human genome participates in at least one biochemical RNA and/or chromatin associated event in at least one cell type. Much of the genome lies close to a regulatory event: 95% of the genome lies within 8kb of a DNA-protein interaction (as assayed by bound ChIP-seq motifs or DNaseI footprints), and 99% is within 1.7kb of at least one of the biochemical events measured by ENCODE.
Primate-specific elements as well as elements without detectable mammalian constraint show, in aggregate, evidence of negative selection; thus some of them are expected to be functional.
Classifying the genome into seven chromatin states suggests an initial set of 399,124 regions with enhancer-like features and 70,292 regions with promoter-like features, as well hundreds of thousands of quiescent regions. High-resolution analyses further subdivide the genome into thousands of narrow states with distinct functional properties.
It is possible to quantitatively correlate RNA sequence production and processing with both chromatin marks and transcription factor (TF) binding at promoters, indicating that promoter functionality can explain the majority of RNA expression variation.
Many non-coding variants in individual genome sequences lie in ENCODE-annotated functional regions; this number is at least as large as those that lie in protein coding genes.
SNPs associated with disease by GWAS are enriched within non-coding functional elements, with a majority residing in or near ENCODE-defined regions that are outside of protein coding genes. In many cases, the disease phenotypes can be associated with a specific cell type or TF.

ENCODE data production and initial analyses

Since 2007, ENCODE has developed methods and performed a large number of sequence-based studies to map functional elements across the human genome³. The elements mapped (and approaches used) include RNA transcribed regions (RNA-seq, CAGE, RNA-PET, and manual annotation), protein-coding regions (mass spectrometry), TF-binding sites (ChIP-seq and DNase-seq), chromatin structure (DNase-seq, FAIRE-seq, histone ChIP-seq and MNase-seq), and DNA methylation sites (RRBS assay) (Box 1 itemizes methods and abbreviations, Supplementary Table P1 details production statistics)³. To compare and integrate results across the different laboratories, data production efforts focused on two selected sets of cell lines, designated “Tier 1” and “Tier 2” (Box 1). To capture a broader spectrum of biological diversity, selected assays were also executed on a third tier comprising more than 100 cell types including primary cells. All data and protocol descriptions are available at http://www.encodeproject.org/, and a “User’s Guide” including details of cell type choice and limitations was recently published³.

Box 1.

Abbreviation	Description
RNA-seq	Isolation of RNA sequences, often with different purification techniques to isolate different fractions of RNA followed by high-throughput sequencing
CAGE	Capture of the methylated cap at the 5′ end of RNA, followed by high- throughput sequencing of a small tag adjacent to the 5′ methylated caps. 5′ methylated caps are formed at the initiationof transcription, though other mechanisms also methylate 5′ ends of RNA
RNA-PET	Simultaneous capture of RNAs with both a 5′ methyl cap and a poly-A tail, which is indicative of a full-length RNA. This is then followed by sequencing a short tag from each end by high- throughput sequencing
ChIP-seq	Chromatin Immunoprecipitation followed by sequencing. Specific regions of cross-linked chromatin, which is genomic DNA complexed with its bound proteins, are selected by using an antibody to a specific epitope. The enriched sample is then subjected to high-throughput sequencing to determine the regions in the genome most often bound by the protein to which the antibody was directed. Most often used are antibodies to any chromatin-associated epitope, including transcription factors, chromatin binding proteins, and specific chemical modifications on histone proteins.
DNaseI-seq	Adaption of established regulatory sequence assay to modern techniques. The DNaseI enzyme will preferentially cut live chromatin preparations at sites where nearby there are specific (non-histone) proteins. The resulting cut points are then sequenced using high throughput sequencing to determine those sites “hypersensitive” to DNaseI, corresponding to open chromatin.
FAIRE-seq	Formaldehyde Assisted Isolation of Regulatory Elements. FAIRE isolates nucleosome-depleted genomic regions by exploiting the difference in crosslinking efficiency between nucleosomes (high) and sequence-specific regulatory factors (low). FAIRE consists of crosslinking, phenol extraction, and sequencing the DNA fragments in the aqueous phase.
RRBS	Reduced Representation Bisulfite Sequencing. Bisulfite treatment of DNA sequence converts methylated cytosines to uracil. In order to focus the assay and save costs, specific restriction enzymes that cut around CpG dinucleotides can reduce the genome to a portion specifically enriched in CpGs. This enriched sample is then sequenced to quantitatively determine the methylation status of individual cytosines.
Tier 1	Tier 1 cell types were the highest-priority set and comprised three widely-studied cell lines: K562 erythroleukemia cells; GM12878, a B-lymphoblastoid cell line that is also part of the 1,000 Genomes project (http://1000genomes.org)⁵⁵; and the H1 embryonic stem cell (H1 hESC) line.
Tier 2		The second-priority set of cell types in the ENCODE project which included HeLa-S3 cervical carcinoma cells, HepG2 hepatoblastoma cells, and primary (non-transformed) human umbilical vein endothelial cells (HUVEC).
Tier 3		Any other ENCODE cell types not in Tier 1 or Tier 2.

Open in a new tab

Integration methodology

For consistency, data were generated and processed using standardized guidelines, and for some assays, new quality-control measures were designed (see refs ^3,12, http://encodeproject.org/ENCODE/dataStandards.html and Kundaje, A. Personal Communication). Uniform data-processing methods were developed for each assay (see Supplementary Information and Kundaje, A. Personal Communication), and most assay results can be represented both as signal information, a per-base estimate across the genome and as discrete elements, regions computationally identified as enriched for signal. Extensive processing pipelines were developed to generate each representation (M.M. Hoffman et al., manuscript in preparation, Kundaje, A. Personal Communication). In addition we developed the irreproducible discovery rate (IDR)¹³ measure to provide a robust and conservative estimate of the threshold where two ranked lists of results from biological replicates no longer agree (i.e., are irreproducible) and we applied this to defining sets of discrete elements. We identified, and excluded from most analyses, regions yielding untrustworthy signals likely to be artifactual (e.g., multi-copy regions). Together, these regions comprise 0.39% of the genome (see Supplementary Information). The accompanying poster represents different ENCODE-identified elements and their genome coverage.

Transcribed and protein-coding regions

We used manual and automated annotation to produce a comprehensive catalogue of human protein-coding and non-coding RNAs as well as pseudogenes, referred to as the GENCODE reference gene set^14,15 (Supplementary Table U1). This includes 20,687 protein-coding genes (GENCODE annotation, V7), with on average 6.3 alternatively spliced transcripts (3.9 different protein-coding transcripts) per locus. In total GENCODE annotated exons of protein coding genes cover 2.94% of the genome or 1.22% for protein-coding exons. Protein-coding genes span 33.45% from the outermost start to stop codons, or 39.54% from promoter to poly A site. Analysis of mass spectrometry (MS) data from K562 and GM12878 cell lines yielded 57 confidently-identified unique peptide sequences intergenic relative to GENCODE annotation. Taken together with evidence of pervasive genome transcription¹⁶, these data indicate that additional protein–coding genes remain to be found.

In addition, we annotated 8,801 automatically derived small RNAs and 9,640 manually curated long non-coding RNA (lncRNA) loci ¹⁷. Comparing lncRNAs to other ENCODE data indicates that lncRNAs are generated through a pathway similar to that for protein coding genes¹⁷. The GENCODE project also annotated 11,224 pseudogenes, of which 863 were transcribed and associated with active chromatin¹⁸.

RNA

We sequenced RNA¹⁶ from different cell lines and multiple subcellular fractions to develop an extensive RNA expression catalogue. Using a conservative threshold to identify regions of RNA activity, 62% of genomic bases are reproducibly represented in sequenced long (>200 nucleotides) RNA molecules or GENCODE exons. Of these bases, only 5.5% are explained by GENCODE exons. The majority of transcribed bases are within or overlapping annotated genes boundaries (i.e. intronic) and only31% of bases in sequenced transcripts were intergenic¹⁶.

We used CAGE-seq (5′ cap-targeted RNA isolation and sequencing) to identify 62,403 transcription start sites (TSSs) at high confidence (IDR of 0.01) in Tier 1 and 2 cell types. Of these, 27,362 (44%) are within 100 bp of the 5′ end of a GENCODE-annotated transcript or previously reported full-length mRNA. The remaining regions predominantly lie across exons and 3′ UTRs, and some exhibit cell type restricted expression; these may represent the start sites of novel, cell type-specific transcripts.

Finally, we saw a significant proportion of coding and non-coding transcripts processed into steady state stable RNAs shorter than 200 nucleotides. These precursors include t-, mi-, sn- and sno-RNAs and the 5′ termini of these processed products align with the capped 5′ end tags¹⁶.

Regions bound by transcription factors, transcriptional machinery, and other proteins

To directly identify regulatory regions, we mapped the binding locations of 119 different DNA-binding proteins and a number of RNA polymerase components in 72 cell types using ChIP-seq (Table 1, Supplementary Table N1, ref ¹⁹); 87 (73%) were sequence-specific TFs (TFSS). Overall, 636,336 binding regions covering 231Mb (8.1%) of the genome are enriched for regions bound by DNA-binding proteins across all cell types. We assessed each protein-binding site for enrichment of known DNA-binding motifs and the presence of novel motifs. Overall, 86% of the DNA segments occupied by TFSS contained a strong DNA-binding motif and in most (55%) cases, the known motif was most enriched (Pouya Kheradpour and Manolis Kellis, personal communication).

Table 1.

Summary of TF classes analysed in ENCODE.

Acronym	Description	Factors Analysed

ChromRem	ATP-dependent chromatin complexes	5
DNARep	DNA repair	3
HISase	Histone acetylation, deacetylation, or methylation complexes	8
Other	Cyclin kinase associated with transcription.	1
Pol2	Pol II subunit	1 (2 forms)
Pol3	Pol III-associated	6
TFNS	General Pol II-associated factor, not site-specific	8
TFSS	Pol II TF with sequence-specific DNA binding	87

Open in a new tab

Protein-binding regions lacking high or moderate affinity cognate recognition sites have 21% lower median scores by rank than regions with recognition sequences (Wilcoxon rank sum p-value < 10⁻¹⁶). 82% of the low-signal regions have high-affinity recognition sequences for other factors. In addition, when ChIP-seq peaks are ranked by their concordance with their known recognition sequence, the median DNase I accessibility is two-fold higher in the bottom 20% of peaks than in the upper 80% (Genome Structure Correction²⁰, GSC p-value <10⁻¹⁶) consistent with previous observations^21–24. We speculate that low signal regions are either lower-affinity sites²¹ or indirect TF target regions associated through interactions with other factors (see also refs ^25,26).

We organized all the information associated with each TF, including the ChIP-seq peaks, discovered motifs, and associated histone modification patterns, in FactorBook (http://www.factorbook.org, ²⁶), a public resource which will be updated as the project proceeds.

DNaseI hypersensitive sites, footprints and nucleosome-depleted regions

Chromatin accessibility characterized by DNaseI hypersensitivity is the hallmark of regulatory DNA regions^27,28. We mapped 2.89 million unique, non-overlapping DNaseI hypersensitive sites (DHSs) by DNase-seq in 125 cell types, the overwhelming majority of which lie distal to TSSs ²⁹. We also mapped 4.8 million sites across 25 cell types that displayed reduced nucleosomal crosslinking by FAIRE, many of which coincide with DHSs. In addition, we used micrococcal nuclease to map nucleosome occupancy in GM12878 and K562 cells ³⁰.

In Tier 1 and Tier 2 cell types, we identified a mean of 205,109 DHSs per cell type (at FDR 1%), encompassing an average of 1.0% of the genomic sequence in each cell type, and 3.9% in aggregate. On average, 98.5% of the occupancy sites of TFs mapped by ENCODE ChIP-seq (and, collectively, 94.4% of all 1.1 million TF ChIP-seq peaks in K562) lay within accessible chromatin defined by DNaseI hotspots²⁹. However, a small number of factors, most prominently heterochromatin-bound repressive complexes (e.g., the Kap1-SetDB1-Znf274 complex^31,32 encoded by the TRIM28, SETDB1 and ZNF274 genes), appear to occupy a significant fraction of nucleosomal sites.

Using genomic DNaseI footprinting^33,34 on 41 cell types we identified 8.4 million distinct DNaseI footprints (FDR 1%)²⁵. Our de novo motif discovery on DNaseI footprints recovered ~90% of known TF motifs, together with hundreds of novel evolutionarily conserved motifs, many displaying highly cell-selective occupancy patterns similar to major developmental and tissue-specific regulators.

Regions of histone modifications

We assayed chromosomal locations for up to 12 histone modifications and variants in 46 cell types, including a complete matrix of eight modifications across Tier 1 and Tier 2. Because modification states may span multiple nucleosomes, which themselves can vary in position across cell populations, we used a continuous signal measure of histone modifications in downstream analysis, rather than calling regions (M.M. Hoffman et al., manuscript in preparation, http://code.google.com/p/align2rawsignal/). For the strongest, “peak-like” histone modifications, we used MACS ³⁵ to characterize enriched sites. Table 2 describes the different histone modifications, their peak characteristics, and a summary of their known roles (reviewed in refs^36–39).

Table 2.

Summary of histone modifications and variants studied in ENCODE, their peak characteristics, and putative functions.

Histone modification or variant	Signal characteristics	Putative functions
H2A.Z	Peak	Histone protein variant (H2A.Z) associated with regulatory elements with dynamic chromatin
H3K4me1	Peak/Region	Mark of regulatory elements associated with enhancers and other distal elements, but also enriched downstream of transcription starts
H3K4me2	Peak	Mark of regulatory elements associated with promoters and enhancers
H3K4me3	Peak	Mark of regulatory elements primarily associated with promoters/transcription starts
H3K9ac	Peak	Mark of active regulatory elements with preference for promoters
H3K9me1	Region	Preference for 5′ end of genes
H3K9me3	Peak/Region	Repressive mark associated with constitutive heterochromatin, and repetitive elements
H3K27ac	Peak	Mark of active regulatory elements; may distinguish active enhancers and promoters from their inactive counterparts
H3K27me3	Region	Repressive mark established by polycomb complex activity associated with repressive domains and silent developmental genes
H3K36me3	Region	Elongation mark associated with transcribed portions of genes, with preference for 3′ regions after intron 1
H3K79me2	Region	Transcription-associated mark, with preference for 5′ end of genes
H4K20me1	Region	Preference for 5′ end of genes

Open in a new tab

Our data show that global patterns of modification are highly variable across cell types, in accordance with changes in transcriptional activity. Consistent with prior studies^40,41, we find that integration of the different histone modification information can be used systematically to assign functional attributes to genomic regions (see below).

DNA methylation

Methylation of cytosine, usually at CpG dinucleotides, is involved in epigenetic regulation of gene expression. Promoter methylation is typically associated with repression, whereas genic methylation correlates with transcriptional activity⁴². We used reduced representation bisulfite sequencing (RRBS) to quantitatively profile DNA methylation for an average of 1.2 million CpGs in each of 82 cell lines and tissues (8.6% of non-repetitive genomic CpGs), including CpGs in intergenic regions, proximal promoters, and in intragenic regions (gene bodies)⁴³, although it should be noted that the RRBS method preferentially targets CpG rich islands. We found 96% of CpGs exhibited differential methylation in at least one cell type or tissue assayed (K. Varley et al. Personal Communication), and levels of DNA methylation correlated with chromatin accessibility. The most variably methylated CpGs are found more often in gene bodies and intergenic regions, rather than in promoters and upstream regulatory regions. In addition, we identified an unexpected correspondence between unmethylated genic CpG islands and binding by P300, a histone acetyltransferase linked to enhancer activity⁴⁴.

Because RRBS is a sequence-based assay with single-base resolution, we were able to identify CpGs with allele-specific methylation consistent with genomic imprinting, and determined that these loci exhibit aberrant methylation in cancer cell lines (K. Varley et al. Personal Communication). Furthermore, we detected reproducible cytosine methylation outside CpG dinucleotides in adult tissues⁴⁵, providing further support that this non-canonical methylation event may play important roles in human biology (K. Varley et al. Personal Communication).

Chromosome-interacting regions

Physical interaction between distinct chromosome regions that can be separated by hundreds of kb is thought to be important in the regulation of gene expression ⁴⁶. We used two complementary chromosome conformation capture (3C)-based technologies to probe these long-range physical interactions.

A 3C-carbon copy (5C) approach^47,48 provided unbiased detection of long-range interactions with TSSs in a targeted 1% of the genome (the 44 ENCODE pilot regions) in four cell types (GM12878, K562, HeLa-S3, and H1hESC)⁴⁹. We discovered hundreds of statistically significant long-range interactions in each cell type after accounting for chromatin polymer behavior and experimental variation. Pairs of interacting loci showed strong correlation between the gene expression level of the TSS and the presence of specific functional element classes such as enhancers. The average number of distal elements interacting with a TSS was 3.9, and the average number of TSSs interacting with a distal element was 2.5, indicating a complex network of interconnected chromatin. Such interwoven long-range architecture was also uncovered genome-wide using chromatin interaction analysis with paired-end tag sequencing (ChIA-PET)⁵⁰ applied to identify interactions in chromatin enriched by RNA polymerase II (PolII) ChIP from five cell types⁵¹. In K562 cells, we identified 127,417 promoter-centered chromatin interactions using ChIA-PET, 98% of which were intra-chromosomal. While promoter regions of 2,324 genes were involved in “single-gene” enhancer-promoter interactions, those of 19,813 genes were involved in “multi-gene” interaction complexes spanning up to several megabases, including promoter-promoter and enhancer-promoter interactions⁵¹.

These analyses portraya complex landscape of long-range gene-element connectivity across ranges of hundreds of kb to several Mb, including interactions among unrelated genes (Supplementary Figure Y1). Furthermore, in the 5C results, 50–60% of long-range interactions occurred in only one of the four cell lines, indicative of a high degree of tissue specificity for gene-element connectivity ⁴⁹.

Summary of ENCODE-identified elements

Accounting for all these elements, a surprisingly large amount of the human genome, 80.4%, is covered by at least one ENCODE-identified element (detailed in Supplementary Table Q1). The broadest element class represents the different RNA types covering 62% of the genome (although the majority is inside of introns or near genes). Regions highly enriched for histone modifications form the next largest class (56.1%). Excluding RNA elements and broad histone elements 44.2 % of the genome is covered. Smaller proportions of the genome are occupied by regions of open chromatin (15.2%) or sites of TF binding (8.1%), with 19.4% covered by at least one DHS or TF ChIP-seq peak across all cell lines. Using our most conservative assessment, 8.5% of bases are covered by either a TF binding site motif (4.6%) or a DHS footprint (5.7%). This however is still about 4.5-fold higher than the amount of protein coding exons, and about 2-fold higher than the estimated amount of pan-mammalian constraint.

Given that ENCODE did not assay all cell types, or all TFs, and in particular has sampled few specialized or developmentally restricted cell lineages, these proportions must be underestimates of the total amount of functional bases. However, many assays were performed on more than one cell type, allowing assessment of the rate of discovery of new elements. For both DHSs and CTCF sites, the number of new elements initially increases rapidly with a steep gradient for the saturation curve and then slows with increasing numbers of cell types (Supplementary Figure R1 and R2). With the current data, at the flattest part of the saturation curve, each new cell type adds on average 9,500 DHS elements (across 106 cell types) and 500 CTCF-binding elements (across 49 cell types), representing 0.45% of the total element number. We modelled saturation for the DHSs and CTCF-binding sites using a Weibull distribution (r2 > 0.999) and predict saturation at approximately 4.1 million (S.E. = 108,000) and 185,100 (S.E. = 18,020) sites, respectively, suggesting that we have discovered around half of the estimated total DHSs. These estimates represent a lower bound, but reinforce the observation that there is more non-coding functional DNA than either coding sequence or pan-mammalian constraint.

The impact of selection on functional elements

From comparative genomic studies, at least 3–8% of bases are under purifying (negative) selection ^4–11 indicating that these bases may potentially be functional. We previously found that 60% of mammalian evolutionarily constrained bases were annotated in the ENCODE pilot project, but also observed that many functional elements lacked evidence of constraint², a conclusion substantiated by others^52–54. The diversity and genome-wide occurrence of functional elements now identified provides an unprecedented opportunity to further examine the forces of negative selection on human functional sequences.

We examined negative selection using two measures that highlight different periods of selection in the human genome. The first measure, inter-species, pan-mammalian constraint (GERP-based scores; 24 mammals⁸) addresses selection during mammalian evolution. The second measure is intra-species constraint estimated from the numbers of variants discovered in human populations using data from the 1000 Genomes project⁵⁵ and covers selection over human evolution. In Figure 1, we plot both these measures of constraint for different classes of identified functional elements, excluding features overlapping exons and promoters that are known to be constrained. Each graph also shows genomic background levels and measures of coding-gene constraint for comparison. Since we plot human population diversity on an inverted scale, elements that are more constrained by negative selection will tend to lie in the upper and right hand regions of the plot.

Panel A shows the levels of pan-mammalian constraint (mean GERP score; 24 mammals⁸, x-axis) compared to diversity, a measure of negative selection in the human population (mean expected heterozygosity, inverted scale, y-axis) for ENCODE datasets. Each point is an average for a single dataset. The top right corners have the strongest evolutionary constraint and lowest diversity. Coding (C), UTR (U), genomic (G), intergenic (IG) and intronic (IN) averages are shown as filled squares. In each case the vertical and horizontal cross hairs show representative levels for the neutral expectation for mammalian conservation and human population diversity respectively. Panel A shows the spread over all non-exonic ENCODE elements greater than 2.5 kb from TSSs. The inner dashed box indicates that parts of the plot have been magnified for the surrounding outer panels, although the scales in the outer plots provide the exact regions and dimensions magnified. The spread for DHS sites (B) and RNA elements (D) are shown in the plots on the left. RNA elements are either long novel intronic (dark green) or long intergenic (light green) RNAs. The horizontal cross hairs are colour coded to the relevant dataset in panel D. Panel C shows the spread of TF motif instances either in regions bound by the TF (orange points) or the corresponding unbound motif matches in grey, with bound and unbound points connected with an arrow in each case showing that bound sites are generally more constrained and less diverse. Panel E shows the derived allele frequency spectrum for primate specific elements with variations outside ENCODE elements in black and variations covered by ENCODE elements in red. The increase in low frequency alleles compared to background is indicative of negative selection occurring in the set of variants annotated by the ENCODE data. Panel F shows aggregation of mammalian constraint scores over the glucocorticoid receptor (GR) TF motif in bound sites, showing the expected correlation with the information content of bases in the motif.

For DNaseI elements (Figure 1B) and bound motifs (Figure 1C) most sets of elements show enrichment in pan mammalian constraint and decreased human population diversity, though for some cell types the DNaseI sites do not appear overall to be subject to pan-mammalian constraint. Bound TF motifs have a natural control from the set of TF motif with equal sequence potential for binding but without binding evidence from ChIP-seq experiments; in all cases, the bound motifs show both more mammalian constraint and higher suppression of human diversity.

Consistent with previous findings, we do not observe genome-wide evidence for pan-mammalian selection of novel RNA sequences (Panel D). There are also a large number of elements without mammalian constraint, between 17–90% for TF-binding regions as well as DHSs and FAIRE regions. Previous studies could not determine whether these sequences are either biochemically active, but with little overall impact on the organism, or are under lineage specific selection. By isolating sequences preferentially inserted into the primate lineage, which is only feasible given the genome-wide scale of this data, we are able to specifically examine this issue. The majority of primate-specific sequence is due to retrotransposon activity, but an appreciable proportion is non-repetitive primate-specific sequence. Of 104,343,413 primate-specific bases (excluding repetitive elements), 67,769,372 (65%) are found within ENCODE-identified elements. Examination of 227,688 variants segregating in these primate specific regions revealed that all classes of elements (RNA and regulatory) show depressed derived allele frequencies, consistent with recent negative selection occurring in at least some of these regions (Figure 1E). An alternative approach examining sequences that are not clearly under pan-mammalian constraint showed a similar result (Luke Ward and Manolis Kellis, personal communication). This suggests that an appreciable proportion of the unconstrained elements are lineage specific elements required for organismal function, consistent with long standing views of recent evolution⁵⁶, and the remainder are likely to be “neutral” elements² which are not currently under selection, but may still affect cellular or larger scale phenotypes without an effect on fitness.

The binding patterns of TFs are not uniform, and we can correlate both inter-and intra-species measures of negative selection with the overall information content of motif positions. The selection on some motif positions is as high as protein coding exons (Figure 1F, Luke Ward and Manolis Kellis, personal communication). These aggregate measures across motifs show that the binding preferences found in the population of sites are also relevant to the per-site behavior. By developing a per-site metric of population effect on bound motifs, we found that highly constrained bound instances across mammals are able to buffer the impact of individual variation⁵⁷.

Integration of ENCODE data with known genomic features

Promoter-anchored integration

Many of the ENCODE assays directly or indirectly provide information about the action of promoters. Focusing on the TSSs of protein-coding transcripts, we investigated the relationships among different ENCODE assays, in particular testing the hypothesis that RNA expression (“output”) can be effectively predicted from patterns of chromatin modifications or TF binding (“input”). Consistent with previous reports⁵⁸, we observe two relatively distinct types of promoters: (1) broad, mainly C+G rich, TATA-less promoters; and (2) narrow, TATA-box-containing promoters. These promoters have distinct patterns of histone modifications, and TF-binding sites are selectively enriched in each class (Supplementary Figure Z1).

We developed predictive models to explore the interaction between histone modifications and measures of transcription at promoters, distinguishing between modifications known to be added as a consequence of transcription (such as H3K36me3 and H3K79me2) and other categories of histone marks⁵⁹. In our analyses, the best models had two components: an initial classification component (on/off) and a second quantitative model component. Our models showed activating acetylation marks (H3K27ac and H3K9ac) are roughly as informative as activating methylation marks (H3K4me3 and H3K4me2) (Figure 2A). Although repressive marks, such as H3K27me3 or H3K9me3, show negative correlation both individually and in the model, removing these marks produces only a small reduction in model performance. However, for a subset of promoters in each cell line repressive histone marks (H3K27me3 or H3K9me3) must be used to accurately predict their expression. We also examined the interplay between the H3K79me2 and H3K36me3 marks, both of which mark gene bodies, likely reflecting recruitment of modification enzymes by polymerase isoforms. As described previously, H3K79me2 occurs preferentially at the 5′ ends of gene bodies and H3K36me3 occurs more 3′, and our analyses support the previous model in which the H3K79me2 to H3K36me3 transition occurs at the first 3′ splice site⁶⁰.

Panels A and B show the correlative models between either histone modifications or TFs, respectively, and RNA production as measured by CAGE tag density at TSSs in K562. In each case the scatter plot shows the output of the correlation models (x-axis) compared to observed values (y-axis). The bar graphs show the most important histone modifications (A) or TFs (B) in both the initial classification phase (upper bar graph) or the quantitative regression phase (lower bar graph), with larger values indicating increasing importance of the variable in the model. Further analysis of other cell lines and RNA measurement types are reported elsewhere^59,79.

Few previous studies have attempted to build qualitative or quantitative models of transcription genome-wide from TF levels because of the paucity of documented TF-binding regions and the lack of coordination around a single cell line. We thus examined the predictive capacity of TF-binding signals for the expression levels of promoters (Figure 2B). In contrast to the profiles of histone modifications, most TFs show enriched binding signals in a narrow DNA region near the TSS, with relatively higher binding signals in promoters with higher CpG content. Most of this correlation could be recapitulated by looking at the aggregate binding of TFs without specific TF terms. Together, these correlation models suggest both that a limited set of chromatin marks are sufficient to “explain” transcription and that a variety of TFs might have broad roles in general transcription levels across many genes. It is important to note that this is an inherently observational study of correlation patterns, and is consistent with a variety of mechanistic models with different causal links between the chromatin, TF and RNA assays. However it does indicate that there is enough information present at the promoter regions of genes to explain the majority of variation in RNA expression.

We developed predictive models similar to those used to model transcriptional activity to explore the relationship between levels of histone modifications and inclusion of exons in alternately spliced transcripts. Even accounting for expression level, H3K36me3 has a positive contribution to exon inclusion, while H3K79me2 has a negative contribution⁶¹. By monitoring the RNA populations in the subcellular fractions of K562 cells, we found that essentially all splicing is co-transcriptional ⁶², further supporting a link between chromatin structure and splicing.

Transcription factor-binding site-anchored integration

TF binding sites provide a natural focus around which to explore chromatin properties. TFs are often multi-functional and can bind a variety of genomic loci with different combinations and patterns of chromatin marks and nucleosome organization. Hence, rather than averaging chromatin mark profiles across all binding sites of a TF, we developed a clustering procedure, termed the Clustered Aggregation Tool (CAGT), to identify subsets of binding sites sharing similar but distinct patterns of chromatin mark signal magnitude, shape, and hidden directionality ³⁰. For example, the average profile of the repressive histone mark, H3K27me3, over all 55,782 CTCF-binding sites in K562 shows poor signal enrichment (Figure 3A). However, after grouping profiles by signal magnitude, we found a subset of 9,840 (17.6%) CTCF-binding sites that exhibit significant flanking H3K27me3 signal. Shape and orientation analysis further revealed that the predominant signal profile for H3K27me3 around CTCF peak summits is asymmetric, consistent with a boundary role for some CTCF sites between active and polycomb-silenced domains. Further examples are provided in Supplementary Figures E5 and E6. For TAF1, predominantly found near TSSs, the asymmetric sites are orientated with the direction of transcription. However, for distal sites, such as those bound by GATA1 and CTCF, we also observed a high proportion of asymmetric histone patterns, although independent of motif directionality. In fact, all TF-binding datasets in all cell lines show predominantly asymmetric patterns (asymmetry ratio >0.6) for all chromatin marks but not DNaseI (Figure 3B). This suggests that most TF bound chromatin events correlate with structured, directional patterns of histone modifications, and that promoter directionality is not the only source of orientation at these sites.

Panel A shows the results of clustered aggregation of H3K27me3 modification signal around CTCF binding sites (a multi-functional protein involved with chromatin structure). The first three left-most plots show the signal behaviour of the histone modification over all sites (top) and then split into the high and low signal components. The high signal component is then decomposed further into six different shape classes on the right (see ref ³⁰ for details). The shape decomposition process is strand aware. Panel B summarises shape asymmetry for DNase1, nucleosome and histone modification signals by plotting an asymmetry ratio for each signal over all TF binding sites. All histone modifications measured in this study show predominantly asymmetric patterns at TF binding sites.

We also examined nucleosome occupancy relative to the symmetry properties of chromatin marks around TF-binding sites. Around TSSs, there is usually strong asymmetric nucleosome occupancy, often accounting for the majority of the histone modification signal (for instance, see Supplementary Figure E4). However, away from TSSs, there is far less concordance. For example, CTCF-binding sites typically show arrays of well-positioned nucleosomes on either side of the peak summit (Supplementary Figure E1)⁶³. Where the flanking chromatin mark signal is high, the signals are often asymmetric, indicating differential marking with histone modifications (Supplementary Figure E2 and E3). Thus, we confirm on a genome-wide scale that TFs can form barriers around which nucleosomes and histone modifications are arranged in a variety of configurations^63–66. Further detail is explored in refs^25,26,30.

Transcription factor co-associations

TF-binding regions are non-randomly distributed across the genome, with respect to both other features (e.g., promoters) and other TF-binding regions. Within the Tier 1 and 2 cell lines, we found 3,307 pairs of statistically co-associated factors (P value < 1E-16, GSC) involving 114 out of a possible 117 factors (97%) (Figure 4A). These include expected associations, such as Jun and Fos, and some more novel associations, such as TCF7L2 with HNF4alpha and FoxA2⁶⁷ (a full listing is given in Supplementary Table F1). When one considers promoter and intergenic regions separately, this changes to 3,201 pairs (116 factors, 99%) for promoters and 1,564 pairs (108 factors, 92%) for intergenic regions, with some associations more specific to these genomic contexts (e.g., the cluster of HDAC2, GABPA, CHD2, GTF2F1, MXI1, and MYC in promoter regions and SP1, EP300, HDAC2, and NANOG in intergenic regions (Figure 4B)). These general and context-dependent associations lead to a network representation of the co-binding with many interesting properties, explored in refs ^19,25,26. In addition we also identified a set of regions bound by multiple factors representing “High Occupancy of TFs” (HOT) regions⁶⁸.

Panel A shows significant co-associations of TF pairs using the GSC statistic across the entire genome in K562 cells. The colour strength represents the extent of association (red (strongest) through orange to yellow (weakest)), whereas the depth of colour represents the fit to the GSC²⁰ model (white meaning that the statistical model is not appropriate) as indicated by the key. The majority of TFs have a non-random association to other TFs, and these associations are dependent on the genomic context, meaning that once the genome is separated into promoter proximal and distal regions, the overall levels of co-association decrease, but more specific relationships are uncovered. Panel B illustrates three classes of behaviour. The first column shows a set of associations whose strength is independent of location in promoter and distal regions while the second shows a set of TFs which have stronger associations in promoter-proximal regions. Both these examples are from data in K562 cells and are highlighted on the genome wide coassociation matrix (panel A) by the labelled boxes A and B, respectively. The third column shows a set of TFs that show stronger association in distal regions (in the H1 hESC cell line).

Genome-wide integration

To identify functional regions genome-wide, we next integrated elements independent of genomic landmarks using either discriminative training methods, where a subset of known elements of a particular class were used to train a model that was then used to discover more instances of this class, or using methods in which only data from ENCODE assays were employed without explicit knowledge of any annotation.

For discriminative training, we used a three-step process to predict potential enhancers, described in Supplementary Info and ref ⁶⁸. Two alternative discriminative models converged on a set of ~13,000 putative enhancers in K562 cells⁶⁸. In the second approach, two methodologically distinct unbiased approaches (see ref ^40,69 and M.M. Hoffman et al., manuscript in preparation) converged on a concordant set of histone modification and chromatin-accessibility patterns that can be used to segment the genome in each of the Tier 1 and Tier 2 cell lines, although the individual loci in each state in each cell line are different. With the exception of RNA polymerase II and CTCF, the addition of TF data did not substantially alter these patterns. At this stage, we deliberately excluded RNA and methylation assays, reserving these data as a means to validate the segmentations.

Our integration of the two segmentation methods (M.M. Hoffman et al., manuscript in preparation) established a consensus set of seven major classes of genome states, described in Table 3. The standard view of active promoters, with a distinct core promoter region (TSS and PF states), leading to active gene bodies (T, transcribed state) is rediscovered in this model (Figure 5A and B). There are three “active” distal states. We tentatively labelled two as enhancers (predicted enhancers, E, and predicted weak enhancers, WE) due to their occurrence in regions of open chromatin with high H3K4me1, although they differ in the levels of marks such as H3K27ac, currently thought to distinguish active from inactive enhancers. The other active state (CTCF) has high CTCF binding and includes sequences that function as insulators in a transfection assay. The remaining repressed state (R) summarises sequences split between different classes of actively repressed or inactive, quiescent chromatin. We found that the CTCF-binding associated state is relatively invariant across cell types, with individual regions frequently occupying the CTCF state across all six cell types (Figure 5C). Conversely, the E and T states have substantial cell-specific behaviour, whereas the TSS state has a bimodal behaviour with similar numbers of cell-invariant and cell-specific occurrences. It is important to note that the consensus summary classes do not capture all the detail discovered in the individual segmentations containing more states.

Table 3.

Summary of the combined state types.

Label	Description	Details^§	Colour
CTCF	CTCF-enriched element	Sites of CTCF signal lacking histone modifications, often associated with open chromatin. Many are likely to function in insulators assays, but because of the multifunctional nature of CTCF, we are conservative in our description. Also enriched for the cohesion components RAD21 and SMC3; CTCF is known to recruit the cohesion complex.	Turquoise
E	Predicted enhancer	Regions of open chromatin associated with H3K4me1 signal. Enriched for other enhancer-associated marks, including TFs known to act at enhancers. In enhancer assays, many of these (>50%) function as enhancers. A more conservative alternative would be cis-regulatory regions. Enriched for sites for the proteins encoded by EP300, FOS, FOSL1, GATA2, HDAC8, JUNB, JUND, NFE2, SMARCA4, SMARCB1, SIRT6, and TAL1 in K562. Have nuclear and whole-cell RNA signal, particularly poly A minus fraction.	Orange
PF	Predicted promoter flanking region	Regions that generally surrounding TSS segments (see below).	Light Red
R	Predicted Repressed or Low Activity region	This is a merged state that includes H3K27me3 polycomb-enriched regions, along with regions that are silent in terms of observed signal for the input assays to the segmentations (low or no signal). They may have other signals (e.g., RNA, not in the segmentation input data). Enriched for sites for the proteins encoded by REST and some other factors (e.g. proteins encoded by BRF2, CEBPB, MAFK, TRIM28, ZNF274, and SETDB1 genes in K562)	Gray
TSS	Predicted promoter region including TSS	Found close to or overlapping Gencode TSS sites. High precision/recall for TSSs. Enriched for H3K4me3. Sites of open chromatin. Enriched for TF known to act close to promoters and polymerases Pol II and Pol III. Short RNAs are most enriched in these segments.	Bright Red
T	Predicted transcribed region	Overlap gene bodies with H3K36me3 transcriptional elongation signal. Enriched for phosphorylated form of PolII signal (elongating polymerase) and poly A-plus RNA, especially cytoplasmic.	Dark Green
WE	Predicted weak enhancer or open chromatin cis regulatory element	Similar to the E state, but weaker signals and weaker enrichments.	Yellow

Open in a new tab

^§

Where specific enrichments or overlaps are identified, these are derived from analysis in Gm12878 and/or K562 cells where the data for comparison is richest.

Panel A shows an illustrative region with the two segmentations methods (ChromHMM and Segway) in a dense view and the combined segmentation expanded to show each state in GM12878, beneath a compressed view of the GENCODE gene annotations. Note that at this level of zoom and genome browser resolution, some segments appear to overlap although they do not. Segmentation classes are named and coloured according to the scheme in Table 3. Beneath the segmentations are shown each of the normalised signals that were used as the input data for the segmentations. Open Chromatin signals from the DNase 1-seq and FAIRE assays are shown in blue, signal from histone modification ChIP-seq in red and TF ChIP-seq signal for Pol II and CTCF in green. The mauve ChIP-seq control signal (“Input control”) at the bottom was also included as an input to the segmentation. Panel B shows the association of selected TF (left) and RNA (right) elements in the combined segmentation states (x-axis) expressed as an observed/expected ratio for each combination of TF or RNA element and segmentation class using the heatmap scale shown in the keybesides each heatmap. Panel C shows the variability of states between cell lines, showing the distribution of occurrences of the state in the 6 cell lines at specific genome locations — from unique to one cell line to ubiquitous in all six cell lines for five states (CTCF, E, T, TSS, and R). Panel D shows the distribution of the level of methylation at individual sites from RRBS analysis in GM12878 across the different states, showing the expecting hypomethylation at TSSs and hypermethylation of genes bodies (T state) and repressed (R) regions.

The distribution of RNA species across segments is quite distinct, indicating that underlying biological activities are captured in the segmentation. Polyadenylated RNA is heavily enriched in gene bodies. Around promoters, there are short RNA species previously identified as promoter-associated short RNAs (PASRs) (Figure 5B)^16,70. Similarly, DNA methylation shows marked distinctions between segments, recapitulating the known biology of predominantly unmethylated active promoters (TSS states) followed by methylated gene bodies⁴² (T state, Figure 5D). The two enhancer-enriched states show distinct patterns of DNA methylation, with the less active enhancer state (by H3K27ac/H3K4me1 levels) showing higher methylation. These states also have an excess of RNA elements without poly-A tails and methyl-cap RNA as assayed by CAGE sequences compared to matched intergenic controls, suggesting a specific transcriptional mode associated with active enhancers⁷¹. TFs also showed distinct distributions across the segments (Figure 5B). A striking pattern is the concentration of TFs in the TSS-associated state. The enhancers contain a different set of TFs. For example, in K562, the E state is enriched for binding by the proteins encoded by the EP300, FOS, FOSL1, GATA2, HDAC8, JUNB, JUND, NFE2, SMARCA4, SMARCB1, SIRT6, and TAL1 genes. We tested a subset of these predicted enhancers in both Mouse and Fish transgenic models (examples in Figure 6), with over half of the elements showing activity, often in the corresponding tissue type.

Randomly sampled E state segments (see table 3) from the K562 segmentation were cloned for mouse- and fish-based transgenic enhancer assays. Panel A shows a representative LacZ-stained transgenic e11.5 mouse embryo obtained with construct hs2065 (EN167, chr10:46,052,882-46,055,670, GRCh37). Highly reproducible staining in the blood vessels was observed in 9 out of 9 embryos resulting from independent transgenic integration events. Panel B shows a representative green fluorescent protein reporter transgenic medaka fish obtained from a construct with a basal hsp70 promoter on meganuclease based transfection. Reproducible transgenic expression in the circulating nucleated blood cells and the endothelial cell walls was seen in 81 out of 100 transgenic tests of this construct.

The segmentation provides a linear determination of functional state across the genome, but not an association of particular distal regions with genes. By using the variation of DNaseI across cell lines, 39% of E (enhancer associated) states could be linked to a proposed regulated gene ²⁹ concordant with physical proximity patterns determined by 5C⁴⁹ or ChIA-PET.

To provide a fine-grained regional classification, we turned to a Self Organizing Map (SOM) to cluster genome segmentation regions based on their assay signal characteristics (Figure 7). The segmentation regions were initially randomly assigned to a 1,350-state map in a two-dimensional toroidal space (Figure 7A). This map can be visualised as a two dimensional rectangular plane onto which the various signal distributions can be plotted. For instance, the rectangle at the bottom left of Figure 7A shows the distribution of the genome in the initial randomised map. The SOM was then trained using the 12 different ChIP-seq and DNase-seq assays in the six cell types previously analyzed in the large-scale segmentations (i.e. over 72-dimensional space). After training, the SOM clustering was again visualised in two dimensions, now showing the organized distribution of genome segments (lower right hand, Figure 7A). Individual data sets associated with the genome segments in each SOM map unit (hexagonal cells) can then be visualised in the same framework to learn how each additional kind of data is distributed on the chromatin state map. Figure 7B shows CAGE/TSS expression data overlaid on the randomly initialised (left) and trained map (right) panels. In this way the trained SOM highlighted cell type-specific TSS clusters (bottom panels of Figure 7B), indicating that there are sets of tissue specific TSSs that are distinguished from each other by subtle combinations of ENCODE chromatin data. Many of the ultra-fine-grained state classifications revealed in the SOM are associated with specific gene ontology (GO) terms (right panel of Figure 7C). For instance, the left panel of Figure 7C, identifies 10 SOM map units enriched with genomic regions associated with genes associated with the GO term ‘immune response’. The central panel identifies a different set of map units enriched for the GO term “sequence-specific TF activity”. The two map units most enriched for this GO term, indicated by the darkest green colouring, contain genes with segments that are high in H3K27me3 in H1 hESC cells, but that differ in H3K27me3 levels in HUVEC cells. Gene function analysis with the GO ontology tool (GREAT⁷²) reveals that the map unit with high H3K27me3 in both cell types is enriched in TF genes with known neuronal functions, whereas the neighbouring map unit is enriched in genes involved in body patterning. The genome browser shots at the bottom of Figure 7C pick out an example region for each of the two SOM map units illustrating the difference in H3K27me3 signal. Overall, we have 228 distinct GO terms associated with specific segments across one or more states (Ali Mortazavi, personal communication), and can assign over one third of genes to a GO annotation solely on the basis of its multi-cellular histone patterns. Thus the SOM analysis provides a fine-grained map of chromatin data across multiple cell types, which can then be used to relate chromatin structure to other data-types at differing levels of resolution (for instance, the large cluster of units containing any active TSS, its sub-clusters composed of units enriched in TSSs active in only one cell type, or individual map units significantly enriched for specific GO terms).

The training of the self-organising map (panel A) and analysis of the results (panels B and C) are shown. Initially we arbitrarily placed genomic segments from the chromHMM segmentation on to the toroidal map surface, although the SOM does not use the chromHMM state assignments (panel A). We then trained the map using the signal of the 12 different ChIP-seq and DNase-seq assays in the six cell types analysed. Each unit of the SOM is represented here by an hexagonal cell in a planar two-dimensional view of the toroidal map. Curved arrows indicate that traversing the edges of two dimensional view leads back to the opposite edge. The resulting map can be overlaid with any class of ENCODE or other data to view the distribution of that data within this high-resolution segmentation. In panel A the distributions of genome bases across the untrained and trained map (left and right, respectively) are shown using heatmap colours for log₁₀ values. Panel B shows the distribution of TSSs from CAGE experiments of GENCODE annotation on the planar representations of either the initial random organisation (left) or the final trained SOM (right) using heat maps coloured according to the accompanying scales. The bottom half of panel B expands the different distributions in the SOM for all expressed TSSs (left) or TSSs specifically expressed in two example cell lines, H1 hESC (centre) and HepG2 (right). Panel C shows the association of Gene Ontology (GO) terms on the same representation of the same trained SOM. We assigned genes that are within 20 kb of a genomic segment in a SOM unit to that unit, and then associated this set of genes with GO terms using a hypergeometric distribution after correcting for multiple testing. Map units that are significantly associated to GO terms are now coloured green, with increasing strength of colour reflecting increasing numbers of genes significantly associated with the GO terms for either immune response (left) or sequence-specific TF activity (centre). In each case, specific SOM units show association with these terms. The right-hand panel shows the distribution on the same SOM of all significantly associated GO terms, now colouring by GO term count per SOM unit. For sequence-specific TF activity, two example genomic regions are extracted at the bottom of panel C from neighbouring SOM units. These are regions around the DBX1 (from SOM unit 26,31, left panel) and IRX6 (SOM unit 27,30, right panel) genes, respectively, along with their H3K27me3 ChIP-seq signal for each of the Tier 1 and 2 cell types. For DBX1, representative of a set of primarily neuronal TFs associated with unit 26,31, there is a repressive H3K27me3 signal in both H1 hESC and HUVEC cells; for IRX6, representative of a set of body patterning TFs associated with SOM unit 27,30, the repressive mark is restricted largely to the embryonic stem cell.

The classifications presented here are necessarily limited by the assays and cell lines studied, and are likely to contain a number of heterogeneous classes of elements. Nonetheless, robust classifications can be made, allowing a systematic view of the human genome.

Insights into human genomic variation

We next explored the potential impact of sequence variation on ENCODE functional elements. We examined allele-specific variation using results from the GM12878 cells that are derived from an individual (NA12878) sequenced in the 1000 Genomes project, along with her parents. Since ENCODE assays are predominantly sequence-based, the trio design allows each GM12878 dataset to be divided by the specific parental contributions at heterozygous sites, producing aggregate haplotypic signals from multiple genomic sites. We examined 193 ENCODE assays for allele-specific biases using 1,409,992 phased, heterozygous SNPs and 167,096 indels (Figure 8). Alignment biases towards alleles present in the reference genome sequence were avoided utilising a sequence specifically tailored to the variants and haplotypes present in NA12878 (a ‘personalised genome’)⁷³. We found instances of preferential binding towards each parental allele. For example, comparison of the results from the POLR2A, H3K79me2, and H3K27me3 assays in the region of NACC2 (Figure 8A) shows a strong paternal bias for H3K79me2 and POL2RA and a strong maternal bias for H3K27me3, suggesting differential activity for the maternal and paternal alleles.

Panel A shows representative allele-specific information from GM12878 cells for selected assays around the first exon of the NACC2 gene (genomic region chr9:138,950,000- 138,995,000, GRCh37). Transcription signal is shown in green, and the three sections show allele specific data for three datasets (POLR2A, H3K79me2 and H3K27me3 ChIP-seq). In each case the purple signal is the processed signal for all sequence reads for the assay, while the blue and red signals show sequence reads specifically assigned to either the paternal or maternal copies of the genome, respectively. The set of common SNPs from dbSNP, including the phased, heterozygous SNPs used to provide the assignment, are shown at the bottom of the panel. NACC2 has a statistically significant paternal bias for POLR2A and the transcription associated mark H3K79me2, and has a significant maternal bias for the repressive mark H3K27me3. Panel B shows pairwise correlations of allele specific signal within single genes (below the diagonal) or within individual ChromHMM segments across the whole genome for selected DNase-seq and histone modification and TF ChIP-seq assays. The extent of correlation is coloured according to the heatmap scale indicated from positive correlation (red) through to anti-correlation (blue).

Figure 8B shows the correlation of selected allele-specific signals across the whole genome. For instance we find a strong allelic correlation between POL2RA and BCLAF1 binding, as well as negative correlation between H3K79me2 and H3K27me3, both at genes (below the diagonal, bottom left) and chromosomal segments (top right). Overall we find that positive allelic correlations among the 193 ENCODE assays are stronger and more frequent than negative correlations. This may be due to preferential capture of accessible alleles and/or the specific histone modification and TF, assays used in the project.

Rare variants, individual genomes and somatic variants

We further investigated the potential functional effects of individual variation in the context of ENCODE annotations. We divided NA12878 variants into common and rare classes, and partitioned these into those overlapping ENCODE annotation (Figure 9A, Supplementary Tables K1 and K2). We also predicted potential functional effects: for protein-coding genes, these are either non-synonymous SNPs or variants likely to induce loss of function by frame-shift, premature stop, or splice-site disruption; for other regions, these are variants that overlap a TF-binding site. We found similar numbers of potentially functional variants affecting protein-coding genes or affecting other ENCODE annotations, suggesting that many functional variants within individual genomes lie outside exons of protein-coding genes. A more detailed analysis of regulatory variant annotation is described in ref ⁷⁴.

Panel A shows the breakdown of variants in a single genome (NA12878) by both frequency (common or rare (*i.e*., variants not present in the low-coverage sequencing of 179 individuals in the pilot 1 European panel of the 1000 Genomes project⁵⁵) and by ENCODE annotation, including protein-coding gene and non-coding elements (GENCODE annotations for protein-coding genes, pseudogenes, and other ncRNAs, as well as TF-binding sites from ChIP-seq datasets, excluding broad annotations such as histone modifications, segmentations, and RNA-seq). Annotation status is further subdivided by predicted functional effect, being non-synonymous and missense mutations for protein-coding regions and variants overlapping bound TF motifs for non-coding element annotations. A substantial proportion of variants are annotated as having predicted functional effects in the non-coding category. Panel B shows one of several relatively rare occurrences, where alignment to an individual genome sequence (paternal and maternal panels) shows a different readout from the reference genome. In this case, a paternal haplotype-specific CTCF peak is identified. Panel C shows the relative level of somatic variants from whole-genome melanoma sample that occur in DHSs unique to different cell lines. The coloured bars show cases that are significantly enriched or supressed in somatic mutations. Details of ENCODE cell types can be found at http://encodeproject.org/ENCODE/cellTypes.html.

To further study the potential effects of NA12878 genome variants on TF binding regions, we performed peak-calling using a constructed personal diploid genome sequence for NA12878⁷³. We aligned ChIP-seq sequences from GM12878 separately against the maternal and paternal haplotypes. As expected, a greater fraction of reads were aligned than to the reference genome (see Supplementary Information, Supplementary Figure K1). On average, approximately 1% of TF-binding sites in GM12878 are detected in a haplotype-specific fashion. For instance, Figure 9B shows a CTCF-binding site not detected using the reference sequence that is only present on the paternal haplotype due to a 1-bp deletion (see also Supplementary Figure K2). As costs of DNA sequencing decrease further, optimized analysis of ENCODE-type data should use the genome sequence of the individual or cell being analyzed when possible.

Most analyses of cancer genomes to date have focused on characterizing somatic variants in protein-coding regions. We intersected four available whole-genome cancer datasets with ENCODE annotations (Figure 9C, Supplementary Figure L2). Overall somatic variation is relatively depleted from ENCODE annotated regions, particularly for elements specific to a cell type matching the putative tumor source (e.g., skin melanocytes for melanoma). Examining the mutational spectrum of elements in introns for cases where a strand-specific mutation assignment could be made reveals that there are mutational spectrum differences between DHSs and unannotated regions (0.06 Fisher’s Exact, Supplementary Figure L3). The suppression of somatic mutation is consistent with important functional roles of these elements within tumor cells, highlighting a potential alternative set of targets for examination in cancer.

Common variants associated with human disease and phenotypes

In recent years, GWAS have greatly extended our knowledge of genetic loci associated with human disease risk and other phenotypes. The output of these studies is a series of SNPs (“GWAS SNPs”) correlated with a phenotype, although not necessarily the functional variants. Strikingly, 88% of associated SNPs are either intronic or intergenic⁷⁵. We examined 4,860 SNP-phenotype associations for 4,492 SNPs curated in the NHGRI GWAS catalogue⁷⁵. We found that 12% of these SNPs overlap TF-occupied regions whereas 34% overlap DHSs (Figure 10A). Both figures reflect significant enrichments relative to the overall proportions of 1000 Genomes project SNPs (about 6% and 23%, respectively). Even after accounting for biases introduced by selection of SNPs for the standard genotyping arrays, GWAS SNPs show consistently higher overlap with ENCODE annotations (Figure 10A, see Supplementary Information). Furthermore, after partitioning the genome by density of different classes of functional elements, GWAS SNPs were consistently enriched beyond all the genotyping SNPs in function-rich partitions, and depleted in function-poor partitions (see Supplementary Figure M1). GWAS SNPs are particularly enriched in the segmentation classes associated with enhancers and TSSs across several cell types (see Supplementary Figure M2).

Panel A shows overlap of lead SNPs in the NHGRI GWAS SNP catalog (June 2011) with DHSs (left) or TF-binding sites (right) as red bars compared to various control SNP sets in blue. The control SNP sets are: SNPs on the Illumina 2.5M chip as an example of a widely used GWAS SNP typing panel; SNPs from the 1,000 Genomes project; SNPs extracted from 24 personal genomes (see Personal Genome Variants track at http://main.genome-browser.bx.psu.edu⁸⁰ all shown as blue bars. In addition a further control utilised 1,000 randomisations from the genotyping SNP panel, matching the SNPs with each NHGRI catalog SNP for allele frequency and distance to the nearest TSS (light blue bars with bounds at 1.5 times the interquartile range, and any outliers beyond shown as circles). For both DHSs and TF binding regions, a larger proportion of overlaps with GWAS-implicated SNPs is found compared to any of the controls sets. Panel B shows the aggregate overlap of phenotypes to selected TF-binding sites (left matrix) or DHSs in selected cell lines (right matrix), with a count of overlaps between the phenotype and the cell line/factor. Values in green squares pass an empirical p-value threshold <=0.01 (based on the same analysis of overlaps between randomly chosen, GWAS-matched SNPs and these epigenetic features) and have at least a count of 3 overlaps. The p-value for the total number of phenotype-TF associations is <0.001. Panel C shows several SNPs associated with Crohn’s disease and other inflammatory diseases that reside in a large gene desert on chromosome 5, along with some epigenetic features suggestive of function. The SNP (rs11742570) strongly associated to Crohn’s disease overlaps a GATA2 TF binding signal determined in HUVEC cells. This region is also DNaseI hypersensitive in HUVEC and T-helper Th1 and Th2 cells.

Examining the SOM of integrated ENCODE annotations (see above), we found 19 SOM map units showing significant enrichment for GWAS SNPs, including many SOM units previously associated with specific gene functions, such as the immune response regions. Thus, an appreciable proportion of SNPs identified in initial GWAS scans are either functional or lie within the length of an ENCODE annotation (~500 bp on average) and represent plausible candidates for the functional variant. Expanding the set of feasible functional SNPs to those in reasonable linkage disequilibrium, up to 71% of GWAS SNPs have a potential causative SNP overlapping a DNaseI site, and 31% of loci have a candidate SNP that overlaps a binding site occupied by a TF (see also refs ^74,76).

The GWAS catalogue provides a rich functional categorization from the precise phenotypes being studied. These phenotypic categorizations are non-randomly associated with ENCODE annotations and there is striking correspondence between the phenotype and the identity of the cell type or TF used in the ENCODE assay (Figure 10B). For example, five SNPs associated with Crohn’s disease overlap GATA2-binding sites (P-value 0.003 by random permutation or 0.001 by an empirical approach comparing to the GWAS-matched SNPs; see Supplementary information), and fourteen are located in DHSs found in immunologically relevant cell types. A notable example is a gene desert on chromosome 5p13.1 containing eight SNPs associated with inflammatory diseases. Several are close to or within DHSs in Th1 and Th2 cells as well as peaks of binding by TFs in HUVECs (Figure 10C). The latter cell line is not immunological, but factor occupancy detected there could be a proxy for binding of a more relevant factor, such as GATA3, in T-cells. Genetic variants in this region also affect expression levels of PTGER4⁷⁷, encoding the prostaglandin receptor EP4. Thus, the ENCODE data reinforce the hypothesis that genetic variants in 5p13.1 modulate the expression of flanking genes, and furthermore provide the specific hypothesis that the variants affect occupancy of a GATA factor in an allele-specific manner, thereby influencing susceptibility to Crohn’s disease.

Non-random association of phenotypes with ENCODE cell types strengthens the argument that at least some of the GWAS lead SNPs are functional or extremely close to functional variants. Each of the associations between a lead SNP and an ENCODE annotation remains a credible hypothesis of a particular functional element class or cell type to explore with future experiments. Supplementary Tables M1, M2 and M3 list all 14,885 pairwise associations across the ENCODE annotations. The accompanying papers have a more detailed examination of common variants with other regulatory information ⁷⁶.

Conclusions

The unprecedented number of functional elements identified in this study provides a valuable resource to the scientific community as well as significantly enhances our understanding of the human genome. Our analyses have revealed many novel aspects of gene expression and regulation as well as the organization of such information, as illustrated by the accompanying papers (see http://www.encodeproject.org/ENCODE/pubs.html for collected ENCODE publications). However, there are still many specific details, particularly about the mechanistic processes which generate these elements and how and where they function, that require additional experiments to elucidate.

The large spread of coverage, from our highest resolution, most conservative set of bases implicated in GENCODE protein coding gene exons (2.9%) or specific protein DNA binding (8.5%) to the broadest, most general set of marks covering the genome (approximately 80%) -- with many gradations in between -- presents a spectrum of elements with different functional properties discovered by ENCODE. 99% of the known bases in the genome are within 1.7 kbp of any ENCODE element, whereas 95% of bases are within 8 kb of a bound TF motif or DNaseI footprint. Interestingly, even using the most conservative estimates, the fraction of bases likely to be involved in direct gene regulation, even though incomplete, is significantly higher than that ascribed to protein coding exons (1.2%), raising the possibility that more information in the human genome may be important for gene regulation than for biochemical function. Many of the regulatory elements are not constrained across mammalian evolution, which to date has been one of the most reliable indication of an important biochemical event for the organism. Thus, our data provide orthologous indicators for suggesting possible functional elements.

Importantly, for the first time we have sufficient statistical power to assess the impact of negative selection on primate-specific elements, and all ENCODE classes display evidence of negative selection in these unique to primate elements. Furthermore, even with our most conservative estimate of functional elements (8.5% of putative DNA:protein binding regions) and assuming that we have already sampled half of the elements from our TF and cell type diversity, one would estimate that at a minimum 20% (17% from protein binding, and 2.9% protein coding gene exons) of the genome participates in these specific functions, with the likely figure significantly higher.

The broad coverage of ENCODE annotations enhances our understanding of common diseases with a genetic component, rare genetic diseases, and cancer, as shown by our ability to link otherwise anonymous associations to a functional element. ENCODE and similar studies provide a first step towards interpreting the rest of the genome— beyond protein-coding genes—thereby augmenting common disease genetic studies with testable hypotheses. Such information justifies performing whole-genome sequencing (rather than exome only, 1.2% of the genome) on rare diseases and investigating somatic variants in non-coding functional elements, for instance, in cancer. Furthermore since GWAS analyses typically associate disease to SNPs in large regions, comparison to ENCODE non-coding functional elements can help pinpoint putative causal variants in addition to refinement of location by fine-mapping techniques⁷⁸. Combining ENCODE data with allele-specific information derived from individual genome sequences, provides specific insight on the impact of a genetic variant. Indeed, we believe a significant goal would be to use functional data such as that derived from this project to assign every genomic variant to its possible impact on human phenotypes.

To date, ENCODE has sampled 119 of 1,800 known TFs and general components of the transcriptional machinery on a limited number of cell types and 13 of more than 60 currently known histone or DNA modifications across 147 cell types. DNaseI, FAIRE and extensive RNA assays across subcellular fractionations have been undertaken on many cell types, but overall these data reflect a minor fraction of the potential functional information encoded in the human genome. An important future goal will be to enlarge this dataset to additional factors, modifications and cell types, complementing the other related projects in this area (e.g., Roadmap Epigenomics Project, http://www.roadmapepigenomics.org/ and International Human Epigenome Consortium, http://www.ihec-epigenomes.org/). These projects will constitute foundational resources for human genomics, allowing a deeper interpretation of the organization of gene and regulatory information and the mechanisms of regulation and thereby provide important insights in human health and disease.

A full listing of the Supplementary Figures and Tables is provided in the Supplementary file “ENCODE Supplementary Figures and Tables.docx”. Additional tables are provided as stand alone files as detailed in the index of “ENCODE Supplementary Figures and Tables.docx”. The file “ENCODE Supplementary Info.docx” contains detailed analysis methods and descriptions of code provided, along with descriptions of additional analysis and figures. The supplementary information is accompanied by a Virtual Machine (VM) containing the functioning analysis data and code. Further details of the VM are available from http://encodeproject.org/ENCODE/integrativeAnalysis/VM

Supplementary Material

NIHMS381381-supplement-1.doc^{(8MB, doc)}

NIHMS381381-supplement-10.xlsx^{(173.6KB, xlsx)}

NIHMS381381-supplement-11.xls^{(20.5KB, xls)}

NIHMS381381-supplement-3.doc^{(543KB, doc)}

NIHMS381381-supplement-4.xls^{(2.1MB, xls)}

NIHMS381381-supplement-5.xlsx^{(405.6KB, xlsx)}

NIHMS381381-supplement-6.txt^{(907.7KB, txt)}

NIHMS381381-supplement-7.txt^{(2.3MB, txt)}

NIHMS381381-supplement-8.xlsx^{(38.1KB, xlsx)}

NIHMS381381-supplement-9.xlsx^{(41.4KB, xlsx)}

Acknowledgments

We thank additional members of our laboratories and institutions who have contributed to the experimental and analytical components of this project. We thank Darryl Leja for assistance with production of the figures. The Consortium is funded by grants from the NHGRI as follows: Production Grants: U54HG004570 (Bernstein); U01HG004695 (Birney); U54HG004563 (Crawford); U54HG004557 (Gingeras); U54HG004555 (Hubbard); U41HG004568 (Kent); U54HG004576 (Myers); U54HG004558 (Snyder); U54HG004592 (Stamatoyannopoulos). Pilot Grants: R01HG003143 (Dekker); RC2HG005591 and R01HG003700 (Giddings); R01HG004456-03 (Ruan); U01HG004571 (Tenenbaum); U01HG004561 (Weng); RC2HG005679 (White). This project was supported in part by American Recovery and Reinvestment Act (ARRA) funds from the NHGRI through grants U54HG004570, U54HG004563, U41HG004568, U54HG004592, R01HG003143, RC2HG005591, R01HG003541, U01HG004561, RC2HG005679 and R01HG003988 (PI: Pennacchio). In addition, work from NHGRI Groups was supported by the Intramural Research Program of the NHGRI (Elnitski, ZIAHG200323; Margulies, ZIAHG200341). Research in the Pennachio lab was performed at Lawrence Berkeley National Laboratory and at the United States Department of Energy Joint Genome Institute, Department of Energy Contract DE-AC02-05CH11231, University of California.

Authors

Overall Coordination (Data Analysis Coordination)

Ian Dunham ¹, Anshul Kundaje ²,^†.

Data Production Leads (Data Production)

Shelley F. Aldred ³, Patrick J. Collins ³, Carrie A. Davis ⁴, Francis Doyle ⁵, Charles B. Epstein ⁶, Seth Frietze ⁷, Jennifer Harrow ⁸, Rajinder Kaul ⁹, Jainab Khatun ¹⁰, Bryan R. Lajoie ¹¹, Stephen G. Landt ¹², Bum-Kyu Lee ¹³, Florencia Pauli ¹⁴, Kate R. Rosenbloom ¹⁵, Peter Sabo ¹⁶, Alexias Safi ¹⁷, Amartya Sanyal ¹¹, Noam Shoresh ⁶, Jeremy M. Simon ¹⁸, Lingyun Song ¹⁷, Nathan D. Trinklein ³.

Lead Analysts (Data Analysis)

Robert C. Altshuler ¹⁹, Ewan Birney ¹, James B. Brown ²⁰, Chao Cheng ²¹, Sarah Djebali ²², Xianjun Dong ²³, Ian Dunham ¹, Jason Ernst ¹⁹,^‡, Terrence S. Furey ²⁴, Mark Gerstein ²¹, Belinda Giardine ²⁵, Melissa Greven ²³, Ross C. Hardison ²⁵,²⁶, Robert S. Harris ²⁵, Javier Herrero ¹, Michael M. Hoffman ¹⁶, Sowmya Iyer ²⁷, Manolis Kellis ¹⁹, Jainab Khatun ¹⁰, Pouya Kheradpour ¹⁹, Anshul Kundaje ²,^†, Timo Lassmann ²⁸, Qunhua Li ²⁰,^§, Xinying Lin ²³, Georgi K. Marinov ²⁹, Angelika Merkel ²², Ali Mortazavi ³⁰, Stephen C. J. Parker ³¹, Timothy E. Reddy ¹⁴,^⊥, Joel Rozowsky ²¹, Felix Schlesinger ⁴, Robert E. Thurman ¹⁶, Jie Wang ²³, Lucas D. Ward ¹⁹, Troy W. Whitfield ²³, Steven P. Wilder ¹, Weisheng Wu ²⁵, Hualin S. Xi ³², Kevin (Yuk-Lap) Yip ²¹,^||, Jiali Zhuang ²³.

Writing Group

Bradley E. Bernstein ⁶,³³, Ewan Birney ¹, Ian Dunham ¹, Eric D. Green ³⁴, Chris Gunter ¹⁴, Michael Snyder ¹².

NHGRI Project Management (Scientific Management)

Michael J. Pazin ³⁵, Rebecca F. Lowdon ³⁵,^∇ Laura A.L. Dillon ³⁵, ^O, Leslie B. Adams ³⁵, Caroline J. Kelly ³⁵, Julia Zhang ³⁵,^††, Judith R. Wexler ³⁵,^‡‡, Eric D. Green ³⁴, Peter J. Good ³⁵, Elise A. Feingold ³⁵.

Principal Investigators (Steering Committee)

Bradley E. Bernstein ⁶,³³, Ewan Birney ¹, Gregory E. Crawford ¹⁷,³⁶, Job Dekker ¹¹, Laura Elnitski ³⁷, Peggy J. Farnham ⁷, Mark Gerstein ²¹, Morgan C. Giddings ¹⁰, Thomas R. Gingeras ⁴,³⁸, Eric D. Green ³⁴, Roderic Guigó ²²,³⁹, Ross C. Hardison ²⁵,²⁶, Timothy J. Hubbard ⁸, Manolis Kellis ¹⁹, W. James Kent ¹⁵, Jason D. Lieb ¹⁸, Elliott H. Margulies ³¹,^§§, Richard M. Myers ¹⁴, Michael Snyder ¹², John A. Stamatoyannopoulos ⁴⁰, Scott A. Tenenbaum ⁵, Zhiping Weng ²³, Kevin P. White ⁴¹, Barbara Wold ²⁹,⁴².

Boise State University Proteomics Group (Data Production and Analysis)

Jainab Khatun ¹⁰, Yanbao Yu ⁴³, John Wrobel ¹⁰, Brian A. Risk ¹⁰, Harsha P. Gunawardena ⁴³, Heather C. Kuiper ⁴³, Christopher W. Maier ⁴³, Ling Xie ⁴³, Xian Chen ⁴³, Morgan C. Giddings ¹⁰.

Broad Institute Group (Data Production and Analysis)

Bradley E. Bernstein ⁶,³³, Charles B. Epstein ⁶, Noam Shoresh ⁶, Jason Ernst ¹⁹,^‡, Pouya Kheradpour ¹⁹, Tarjei S. Mikkelsen ⁶, Shawn Gillespie ³³, Alon Goren ⁶,³³, Oren Ram ⁶,³³, Xiaolan Zhang ⁶, Li Wang ⁶, Robbyn Issner ⁶, Michael J. Coyne ⁶, Timothy Durham ⁶, Manching Ku ⁶,³³, Thanh Truong ⁶, Lucas D. Ward ¹⁹, Robert C. Altshuler ¹⁹, Matthew L. Eaton ¹⁹, Manolis Kellis ¹⁹.

Cold Spring Harbor, University of Geneva, Center for Genomic Regulation, Barcelona, RIKEN, Sanger Institute, University of Lausanne, Genome Institute of Singapore Group (Data Production and Analysis)

Sarah Djebali ²², Carrie A. Davis ⁴, Angelika Merkel ²², Alex Dobin ⁴, Timo Lassmann ²⁸, Ali Mortazavi ³⁰, Andrea Tanzer ²², Julien Lagarde ²², Wei Lin ⁴, Felix Schlesinger ⁴, Chenghai Xue ⁴, Georgi K. Marinov ²⁹, Jainab Khatun ¹⁰, Brian A. Williams ²⁹, Chris Zaleski ⁴, Joel Rozowsky ²¹, Maik Röder ²², Felix Kokocinski ⁸, ^⊥⊥, Rehab F. Abdelhamid ²⁸, Tyler Alioto ²²,⁴⁴, Igor Antoshechkin ²⁹, Michael T. Baer ⁴, Philippe Batut ⁴, Ian Bell ⁴⁵, Kimberly Bell ⁴, Sudipto Chakrabortty ⁴, Xian Chen ⁴³, Jacqueline Chrast ⁴⁶, Joao Curado ²², Thomas Derrien ²², || ||, Jorg Drenkow ⁴, Erica Dumais ⁴⁵, Jackie Dumais ⁴⁵, Radha Duttagupta ⁴⁵, Megan Fastuca ⁴, Kata Fejes-Toth ⁴, ^∇∇, Pedro Ferreira ²², Sylvain Foissac ⁴⁵, Melissa J. Fullwood ⁴⁷, O O, Hui Gao ⁴⁵, David Gonzalez ²², Assaf Gordon ⁴, Harsha P. Gunawardena ⁴³, Cédric Howald ⁴⁶, Sonali Jha ⁴, Rory Johnson ²², Philipp Kapranov ⁴⁵, †††, Brandon King ²⁹, Colin Kingswood ²²,⁴⁴, Guoliang Li ⁴⁸, Oscar J. Luo ⁴⁷, Eddie Park ³⁰, Jonathan B. Preall ⁴, Kimberly Presaud ⁴, Paolo Ribeca ²²,⁴⁴, Brian A. Risk ¹⁰, Daniel Robyr ⁴⁹, Xiaoan Ruan ⁴⁷, Michael Sammeth ²²,⁴⁴, Kuljeet Singh Sandhu ⁴⁷, Lorain Schaeffer ²⁹, Lei-Hoon See ⁴, Atif Shahab ⁴⁷, Jorgen Skancke ²², Ana Maria Suzuki ²⁸, Hazuki Takahashi ²⁸, Hagen Tilgner ²², ‡‡‡, Diane Trout ²⁹, Nathalie Walters ⁴⁶, Huaien Wang ⁴, John Wrobel ¹⁰, Yanbao Yu ⁴³, Yoshihide Hayashizaki ²⁸, Jennifer Harrow ⁸, Mark Gerstein ²¹, Timothy J. Hubbard ⁸, Alexandre Reymond ⁴⁶, Stylianos E. Antonarakis ⁴⁹, Gregory J. Hannon ⁴, Morgan C. Giddings ¹⁰, Yijun Ruan ⁴⁷, Barbara Wold ²⁹,⁴², Piero Carninci ²⁸, Roderic Guigó ²²,³⁹, Thomas R. Gingeras ⁴,³⁸.

Data Coordination Center at UC Santa Cruz (Production Data Coordination)

Kate R. Rosenbloom ¹⁵, Cricket A. Sloan ¹⁵, Katrina A. Learned ¹⁵, Venkat S. Malladi ¹⁵, Matthew C. Wong ¹⁵, Galt P. Barber ¹⁵, Melissa S. Cline ¹⁵, Timothy R. Dreszer ¹⁵, Steven G. Heitner ¹⁵, Donna Karolchik ¹⁵, W. James Kent ¹⁵, Vanessa M. Kirkup ¹⁵, Laurence R. Meyer ¹⁵, Jeffrey C. Long ¹⁵, Morgan Maddren ¹⁵, Brian J. Raney ¹⁵.

Duke University, EBI, University of Texas, Austin, University of North Carolina-Chapel Hill Group (Data Production and Analysis)

Terrence S. Furey ²⁴, Lingyun Song ¹⁷, Linda L. Grasfeder ¹⁸, Paul G. Giresi ¹⁸, Bum-Kyu Lee ¹³, Anna Battenhouse ¹³, Nathan C. Sheffield ¹⁷, Jeremy M. Simon ¹⁸, Kimberly A. Showers ¹⁸, Alexias Safi ¹⁷, Darin London ¹⁷, Akshay A. Bhinge ¹³, Christopher Shestak ¹⁸, Matthew R. Schaner ¹⁸, Seul Ki Kim ¹⁸, Zhuzhu Z. Zhang ¹⁸, Piotr A. Mieczkowski ⁵⁰, Joanna O. Mieczkowska ¹⁸, Zheng Liu ¹³, Ryan M. McDaniell ¹³, Yunyun Ni ¹³, Naim U. Rashid ⁵¹, Min Jae Kim ¹⁸, Sheera Adar ¹⁸, Zhancheng Zhang ²⁴, Tianyuan Wang ¹⁷, Deborah Winter ¹⁷, Damian Keefe ¹, Ewan Birney ¹, Vishwanath R. Iyer ¹³, Jason D. Lieb ¹⁸, Gregory E. Crawford ¹⁷,³⁶.

Genome Institute of Singapore Group (Data Production and Analysis)

Guoliang Li ⁴⁸, Kuljeet Singh Sandhu ⁴⁷, Meizhen Zheng ⁴⁷, Ping Wang ⁴⁷, Oscar J. Luo ⁴⁷, Atif Shahab ⁴⁷, Melissa J. Fullwood ⁴⁷, O O, Xiaoan Ruan ⁴⁷, Yijun Ruan ⁴⁷.

HudsonAlpha Institute, Caltech, UC Irvine, Stanford Group (Data Production and Analysis)

Richard M. Myers ¹⁴, Florencia Pauli ¹⁴, Brian A. Williams ²⁹, Jason Gertz ¹⁴, Georgi K. Marinov ²⁹, Timothy E. Reddy ¹⁴,^⊥, Jost Vielmetter ²⁹,⁴², E. Christopher Partridge ¹⁴, Diane Trout ²⁹, Katherine E. Varley ¹⁴, Clarke Gasper ²⁹,⁴², Anita Bansal ¹⁴, Shirley Pepke ²⁹,⁵², Preti Jain ¹⁴, Henry Amrhein ²⁹, Kevin M. Bowling ¹⁴, Michael Anaya ²⁹,⁴², Marie K. Cross ¹⁴, Brandon King ²⁹, Michael A. Muratet ¹⁴, Igor Antoshechkin ²⁹, Kimberly M. Newberry ¹⁴, Kenneth McCue ²⁹, Amy S. Nesmith ¹⁴, Katherine I. Fisher-Aylor ²⁹,⁴², Barbara Pusey ¹⁴, Gilberto DeSalvo ²⁹,⁴², Stephanie L. Parker ¹⁴, §§§, Sreeram Balasubramanian ²⁹,⁴², Nicholas S. Davis ¹⁴, Sarah K. Meadows ¹⁴, Tracy Eggleston ¹⁴, Chris Gunter ¹⁴, J. Scott Newberry ¹⁴, Shawn E. Levy ¹⁴, Devin M. Absher ¹⁴, Ali Mortazavi ³⁰, Wing H. Wong ⁵³, Barbara Wold ²⁹,⁴².

Lawrence Berkeley National Laboratory Group (Targeted Experimental Validation)

Matthew J. Blow ⁵⁴, Axel Visel ⁵⁴,⁵⁵, Len A. Pennachio ⁵⁴,⁵⁵.

NHGRI Groups (Data Production and Analysis)

Laura Elnitski ³⁷, Elliott H. Margulies ³¹, ^§§, Stephen C. J. Parker ³¹, Hanna M. Petrykowska ³⁷.

Sanger Institute, Washington University, Yale University, Center for Genomic Regulation, Barcelona, UCSC, MIT, University of Lausanne, CNIO Group (Data Production and Analysis)

Alexej Abyzov ²¹, Bronwen Aken ⁸, Daniel Barrell ⁸, Gemma Barson ⁸, Andrew Berry ⁸, Alexandra Bignell ⁸, Veronika Boychenko ⁸, Giovanni Bussotti ²², Jacqueline Chrast ⁴⁶, Claire Davidson ⁸, Thomas Derrien ²², || ||, Gloria Despacio-Reyes ⁸, Mark Diekhans ¹⁵, Iakes Ezkurdia ⁵⁶, Adam Frankish ⁸, James Gilbert ⁸, Jose Manuel Gonzalez ⁸, Ed Griffiths ⁸, Rachel Harte ¹⁵, David A. Hendrix ¹⁹, Cédric Howald ⁴⁶, Toby Hunt ⁸, Irwin Jungreis ¹⁹, Mike Kay ⁸, Ekta Khurana ²¹, Felix Kokocinski ⁸, ^⊥⊥, Jing (Jane) Leng ²¹, Michael F. Lin ¹⁹, Jane Loveland ⁸, Zhi Lu ⁵⁷, Deepa Manthravadi ⁸, Marco Mariotti ²², Jonathan Mudge ⁸, Gaurab Mukherjee ⁸, Cedric Notredame ²², Baikang Pei ²¹, Jose Manuel Rodriguez ⁵⁶, Gary Saunders ⁸, Andrea Sboner ⁵⁸, Stephen Searle ⁸, Cristina Sisu ²¹, Catherine Snow ⁸, Charlie Steward ⁸, Andrea Tanzer ²², Electra Tapanari ⁸, Michael L. Tress ⁵⁶, Marijke J. van Baren ⁵⁹, ⊥⊥⊥, Nathalie Walters ⁴⁶, Stefan Washietl ¹⁹, Laurens Wilming ⁸, Amonida Zadissa ⁸, Zhengdong Zhang ⁶⁰, Michael Brent ⁵⁹, David Haussler ⁶¹, Manolis Kellis ¹⁹, Alfonso Valencia ⁵⁶, Mark Gerstein ²¹, Alexandre Reymond ⁴⁶, Roderic Guigó ²²,³⁹, Jennifer Harrow ⁸, Timothy J. Hubbard ⁸.

Stanford-Yale, Harvard, University of Massachusetts Medical School, University of Southern California/UCDavis Group (Data Production and Analysis)

Stephen G. Landt ¹², Seth Frietze ⁷, Alexej Abyzov ²¹, Nick Addleman ¹², Roger P. Alexander ²¹, Raymond K. Auerbach ²¹, Suganthi Balasubramanian ²¹, Keith Bettinger ¹², Nitin Bhardwaj ²¹, Alan P. Boyle ¹², Alina R. Cao ⁶², Philip Cayting ¹², Alexandra Charos ⁶³, Yong Cheng ¹², Chao Cheng ²¹, Catharine Eastman ¹², Ghia Euskirchen ¹², Joseph D. Fleming ⁶⁴, Fabian Grubert ¹², Lukas Habegger ²¹, Manoj Hariharan ¹², Arif Harmanci ²¹, Sushma Iyengar ⁶⁵, Victor X. Jin ⁶⁶, Konrad J. Karczewski ¹², Maya Kasowski ¹², Phil Lacroute ¹², Hugo Lam ¹², Nathan Lamarre-Vincent ⁶⁴, Jing (Jane) Leng ²¹, Jin Lian ⁶⁷, Marianne Lindahl-Allen ⁶⁴, Renqiang Min ²¹, || || ||, Benoit Miotto ⁶⁴, Hannah Monahan ⁶³, Zarmik Moqtaderi ⁶⁴, Xinmeng (Jasmine) Mu ²¹, Henriette O’Geen ⁶², Zhengqing Ouyang ¹², Dorrelyn Patacsil ¹², Baikang Pei ²¹, Debasish Raha ⁶³, Lucia Ramirez ¹², Brian Reed ⁶³, Joel Rozowsky ²¹, Andrea Sboner ⁵⁸, Minyi Shi ¹², Cristina Sisu ²¹, Teri Slifer ¹², Heather Witt ⁷, Linfeng Wu ¹², Xiaoqin Xu ⁶², Koon-Kiu Yan ²¹, Xinqiong Yang ¹², Kevin (Yuk-Lap) Yip ²¹,^||, Zhengdong Zhang ⁶⁰, Kevin Struhl ⁶⁴, Sherman M. Weissman ⁶⁷, Mark Gerstein ²¹, Peggy J. Farnham ⁷, Michael Snyder ¹².

University of Albany SUNY Group (Data Production and Analysis)

Scott A. Tenenbaum ⁵, Luiz O. Penalva ⁶⁸, Francis Doyle ⁵.

University of Chicago, Stanford Group (Data Production and Analysis)

Subhrad ip Karmakar ⁴¹, Stephen G. Landt ¹², Raj R. Bhanvadia ⁴¹, Alina Choudhury ⁴¹, Marc Domanus ⁴¹, Lijia Ma ⁴¹, Jennifer Moran ⁴¹, Dorrelyn Patacsil ¹², Teri Slifer ¹², Alec Victorsen ⁴¹, Xinqiong Yang ¹², Michael Snyder ¹², Kevin P. White ⁴¹.

University of Heidelberg Group (Targeted Experimental Validation)

Thomas Auer ⁶⁹, ∇∇∇, Lazaro Centanin ⁶⁹, Michael Eichenlaub ⁶⁹, Franziska Gruhl ⁶⁹, Stephan Heermann ⁶⁹, Burkhard Hoeckendorf ⁶⁹, Daigo Inoue ⁶⁹, Tanja Kellner ⁶⁹, Stephan Kirchmaier ⁶⁹, Claudia Mueller ⁶⁹, Robert Reinhardt ⁶⁹, Lea Schertel ⁶⁹, Stephanie Schneider ⁶⁹, Rebecca Sinn ⁶⁹, Beate Wittbrodt ⁶⁹, Jochen Wittbrodt ⁶⁹.

University of Massachusetts Medical School Bioinformatics Group (Data Production and Analysis)

Zhiping Weng ²³, Troy W. Whitfield ²³, Jie Wang ²³, Patrick J. Collins ³, Shelley F. Aldred ³, Nathan D. Trinklein ³, E. Christopher Partridge ¹⁴, Richard M. Myers ¹⁴.

University of Massachusetts Medical School Genome Folding Group (Data Production and Analysis)

Job Dekker ¹¹, Gaurav Jain ¹¹, Bryan R. Lajoie ¹¹, Amartya Sanyal ¹¹.

University of Washington, University of Massachusetts Medical Center Group (Data Production and Analysis)

Gayathri Balasundaram ⁷⁰, Daniel L. Bates ¹⁶, Rachel Byron ⁷⁰, Theresa K. Canfield ¹⁶, Morgan J. Diegel ¹⁶, Douglas Dunn ¹⁶, Abigail K. Ebersol ⁷¹, Tristan Frum ⁷¹, Kavita Garg ⁷², Erica Gist ¹⁶, R. Scott Hansen ⁷¹, Lisa Boatman ⁷¹, Eric Haugen ¹⁶, Richard Humbert ¹⁶, Gaurav Jain ¹¹, Audra K Johnson ¹⁶, Ericka M. Johnson ⁷¹, Tattyana V. Kutyavin ¹⁶, Bryan R. Lajoie ¹¹, Kristen Lee ¹⁶, Dimitra Lotakis ⁷¹, Matthew T. Maurano ¹⁶, Shane J. Neph ¹⁶, Fiedencio V. Neri ¹⁶, Eric D. Nguyen ⁷¹, Hongzhu Qu ¹⁶, Alex P. Reynolds ¹⁶, Vaughn Roach ¹⁶, Eric Rynes ¹⁶, Peter Sabo ¹⁶, Minerva E. Sanchez ⁷¹, Richard S. Sandstrom ¹⁶, Amartya Sanyal ¹¹, Anthony O. Shafer ¹⁶, Andrew B. Stergachis ¹⁶, Sean Thomas ¹⁶, Robert E. Thurman ¹⁶, Benjamin Vernot ¹⁶, Jeff Vierstra ¹⁶, Shinny Vong ¹⁶, Hao Wang ¹⁶, Molly A. Weaver ¹⁶, Yongqi Yan ⁷¹, Miaohua Zhang ⁷⁰, Joshua M. Akey ¹⁶, Michael Bender ⁷⁰, Michael O. Dorschner ⁷³, Mark Groudine ⁷⁰, Michael J. MacCoss ¹⁶, Patrick Navas ⁷¹, George Stamatoyannopoulos ⁷¹, Rajinder Kaul ⁹, Job Dekker ¹¹, John A. Stamatoyannopoulos ⁴⁰.

Data Analysis Center (Data Analysis)

Ian Dunham ¹, Kathryn Beal ¹, Alvis Brazma ⁷⁴, Paul Flicek ¹, Javier Herrero ¹, Nathan Johnson ¹, Damian Keefe ¹, Margus Lukk ⁷⁴, O O O, Nicholas M. Luscombe ⁷⁵, Daniel Sobral ¹, ††††, Juan M. Vaquerizas ⁷⁵, Steven P. Wilder ¹, Serafim Batzoglou ², Arend Sidow ⁷⁶, Nadine Hussami ², Sofia Kyriazopoulou-Panagiotopoulou ², Max W. Libbrecht ², ‡‡‡‡, Marc A. Schaub ², Anshul Kundaje ²,^†, Ross C. Hardison ²⁵,²⁶, Webb Miller ²⁵, Belinda Giardine ²⁵, Robert S. Harris ²⁵, Weisheng Wu ²⁵, Peter J. Bickel ²⁰, Balazs Banfai ²⁰, Nathan P. Boley ²⁰, James B. Brown ²⁰, Haiyan Huang ²⁰, Qunhua Li ²⁰,^§, Jingyi Jessica Li ²⁰, William Stafford Noble ¹⁶,⁷⁷, Jeffrey A. Bilmes ⁷⁸, Orion J. Buske ¹⁶, Michael M. Hoffman ¹⁶, Avinash D. Sahu ¹⁶, Peter V. Kharchenko ⁷⁹, Peter J. Park ⁷⁹, Dannon Baker⁸⁰, James Taylor⁸⁰, Zhiping Weng ²³, Sowmya Iyer ²⁷, Xianjun Dong ²³, Melissa Greven ²³, Xinying Lin ²³, Jie Wang ²³, Hualin S. Xi ³², Jiali Zhuang ²³, Mark Gerstein ²¹, Roger P. Alexander ²¹, Suganthi Balasubramanian ²¹, Chao Cheng ²¹, Arif Harmanci ²¹, Lucas Lochovsky ²¹, Renqiang Min ²¹, || || ||, Xinmeng (Jasmine) Mu ²¹, Joel Rozowsky ²¹, Koon-Kiu Yan ²¹, Kevin (Yuk-Lap) Yip ²¹, ^||, Ewan Birney ¹.

Footnotes

Vertebrate Genomics Group, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK

Department of Computer Science, Stanford University, 318 Campus Drive, Stanford, CA, USA

SwitchGear Genomics, 1455 Adams Drive Suite 1317, Menlo Park, CA, USA

⁴

Functional Genomics, Cold Spring Harbor Laboratory, 1 Bungtown Rd, Cold Spring Harbor, NY, USA

⁵

College of Nanoscale Sciences and Engineering, University ay Albany-SUNY, 257 Fuller Road, NFE 4405, Albany, NY, USA

⁶

Broad Institute of MIT and Harvard, 7 Cambridge Center, Cambridge, MA, USA

⁷

Biochemistry and Molecular Biology, USC/Norris Comprehensive Cancer Center, 1450 Biggy Street, NRT 6503, Los Angeles, CA, USA

⁸

Informatics, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK

⁹

Department of Medicine, Division of Medical Genetics, University of Washington, 3720 15th Ave NE, Seattle, WA, USA

¹⁰

College of Arts and Sciences, Boise State University, 1910 University Dr., Boise, ID, USA

¹¹

Program in Systems Biology, Program in Gene Function and Expression, Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, 364 Plantation Street, Worcester, MA, USA

¹²

Department of Genetics, Stanford University, 300 Pasteur Dr., M-344, Stanford, CA, USA

¹³

Center for Systems and Synthetic Biology, Institute for Cellular and Molecular Biology, Section of Molecular Genetics and Microbiology, The University of Texas at Austin, 1 University Station A4800, Austin, TX, USA

¹⁴

HudsonAlpha Institute for Biotechnology, 601 Genome Way, Huntsville, AL, USA

¹⁵

Center for Biomolecular Science and Engineering, University of California, Santa Cruz, 1156 High Street, Santa Cruz, CA, USA

¹⁶

Department of Genome Sciences, University of Washington, 3720 15th Ave NE, Seattle, WA, USA

¹⁷

Institute for Genome Sciences & Policy, Duke University, 101 Science Drive, Durham, NC, USA

¹⁸

Department of Biology, Carolina Center for Genome Sciences, and Lineberger Comprehensive Cancer Center, The University of North Carolina at Chapel Hill, 408 Fordham Hall, Chapel Hill, NC, USA

¹⁹

Computer Science and Artificial Intelligence Laboratory, Broad Institute of MIT and Harvard, Massachusetts Institute of Technology, 32 Vassar St, Cambridge, MA, USA

²⁰

Department of Statistics, University of California, Berkeley, 367 Evans Hall, University of California, Berkeley, Berkeley, CA, USA

²¹

Computational Biology & Bioinformatics Program, Yale University, 266 Whitney Ave, New Haven, CT, USA

²²

Bioinformatics and Genomics, Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88 - 08003, Barcelona, Catalunya, Spain

²³

Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, 364 Plantation Street, Worcester, MA, USA

²⁴

Department of Genetics, The University of North Carolina at Chapel Hill, 120 Mason Farm Rd, CB#7240, Chapel Hill, NC, USA

²⁵

Center for Comparative Genomics and Bioinformatics, The Pennsylvania State University, Wartik Laboratory, University Park, PA, USA

²⁶

Department of Biochemistry and Molecular Biology, The Pennsylvania State University, 304 Wartik Laboratory, University Park, PA, USA

²⁷

Program in Bioinformatics, Boston University, 24 Cummington St, Boston, MA, USA

²⁸

RIKEN Omics Science Center, RIKEN Yokohama Institute, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan

²⁹

Division of Biology, California Institute of Technology, 156-291200 E. California Blvd, Pasadena, CA, USA

³⁰

Developmental and Cell Biology and Center for Complex Biological Systems, University of California Irvine, 2218 Biological Sciences III, Irvine, CA, USA

³¹

Genome Technology Branch, National Human Genome Research Institute, 5625 Fishers Lane, Bethesda, MD, USA

³²

Department of Biochemistry and Molecular Pharmacology, Bioinformatics Core, University of Massachusetts Medical School, 364 Plantation Street, Worcester, MA, USA

³³

Howard Hughes Medical Institute and Department of Pathology, Massachusetts General Hospital and Harvard Medical School, 185 Cambridge St CPZN 8400, Boston, MA, USA

³⁴

National Human Genome Research Institute, National Institutes of Health, 31 Center Dr., Bldg. 31, Rm. 4B09, Bethesda, MD, USA

³⁵

National Human Genome Research Institute, National Institutes of Health, 5635 Fishers Lane, Bethesda, MD, USA

³⁶

Department of Pediatrics, Division of Medical Genetics, Duke University School of Medicine, Durham, NC, USA

³⁷

National Human Genome Research Institute, National Institutes of Health, 5625 Fishers Lane, Rockville, MD, USA

³⁸

Affymetrix, Inc., 3380 Central Expressway, Santa Clara, CA, USA

³⁹

Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, Barcelona, Catalonia, Spain

⁴⁰

Department of Genome Sciences, Box 355065, and Department of Medicine, Division of Oncology, Box 358081, University of Washington, Seattle, WA, USA

⁴¹

Institute for Genomics and Systems Biology, The University of Chicago, 900 E. 57th Street, 10100 KCBD, Chicago, IL, USA

⁴²

Beckman Institute, California Institute of Technology, 156-29 1200 E. California Blvd., Pasadena, CA, 91125, US

⁴³

Dept. of Biochemistry & Biophysics, University of North Carolina School of Medicine, Campus Box 7260, 120 Mason Farm Rd., #3010 Genetic Medicine Bldg., Chapel Hill, NC, USA

⁴⁴

Centro Nacional de Análisis Genómico (CNAG), C/Baldiri Reixac 4, Torre I, Barcelona, Catalunya 08028, Spain

⁴⁵

Genomics, Affymetrix, Inc., 3380 Central Expressway, Santa Clara, CA, USA

⁴⁶

Center for Integrative Genomics, University of Lausanne, Genopode Building, 1015 Lausanne, Switzerland, Lausanne, Switzerland

⁴⁷

Genome Technology and Biology, Genome Institute of Singapore, 60 Biopolis Street, #02-01, Genome,, Singapore 138672, Singapore

⁴⁸

Computational and Systems Biology, Genome Institute of Singapore, 60 Biopolis Street, #02 01, Genome,, Singapore 138672, Singapore

⁴⁹

Department of Genetic Medicine and Development, University of Geneva Medical School, and University Hospitals of Geneva, 1 rue Michel-Servet, Geneva, Switzerland

⁵⁰

Department of Genetics, The University of North Carolina at Chapel Hill, 5078 GMB, Chapel Hill, NC, USA

⁵¹

Department of Biostatistics, Gillings School of Global Public Health, The University of North Carolina at Chapel Hill, 408 Fordham Hall, Chapel Hill, NC, USA

⁵²

Center for Advanced Computing Research, California Institute of Technology, MC 158-79, 1200 East California Blvd., Pasadena, CA, 91125, USA

⁵³

Department Statistics, Stanford University, Sequoia Hall. 390 Serra Mall,, Stanford, CA, USA

⁵⁴

DOE Joint Genome Institute, Walnut Creek, CA, USA

⁵⁵

Genomics Division, Lawrence Berkeley National Laboratory, One Cyclotron Road, MS 84-171, Berkeley, CA, USA

⁵⁶

Structural Computational Biology, Spanish National Cancer Research Centre (CNIO), Melchor Fernandez Almagro, 3 - 28029, Madrid, Spain

⁵⁷

School of Life Sciences, Tsinghua University, School of Life Sciences, Tsinghua University, Beijing, 100084, Beijing, China

⁵⁸

Dept. Pathology and Laboratory Medicine, Institute for Computational Biomedicine, Weill Cornell Medical College, 1305 York Ave, Box 140, New York, NY, USA

⁵⁹

Computer Science and Engineering, Washington University in St Louis, St Louis, MO, USA

⁶⁰

Department of Genetics, Albert Einstein College of Medicine, 1301 Morris Park Avenue, Room 353A, Bronxe, NY, USA

⁶¹

Center for Biomolecular Science and Engineering, Howard Hughes Medical Institute, University of California, Santa Cruz, 1156 High Street, Santa Cruz, CA, USA

⁶²

Genome Center, University of California-Davis, 451 Health Sciences Drive, Davis, CA, USA

⁶³

Department of Molecular, Cellular, and Developmental Biology, Yale University, 266 Whitney Ave, New Haven, CT, USA

⁶⁴

Biological Chemistry and Molecular Pharmacology, Harvard Medical School, 240 Longwood Avenue, Boston, MA, USA

⁶⁵

Biochemistry and Molecular Biology, University of Southern California, 1501 San Pablo Street, Los Angeles, CA, USA

⁶⁶

Department of Biomedical Informatics, Ohio State University, 3172C Graves Hall, 333 W Tenth Avenue, Columbus, OH, USA

⁶⁷

Department of Genetics, Yale University, Yale University School of Medicine, 333 Cedar Street, New Haven, CT, USA

⁶⁸

Department of Cellular and Structural Biology, Children’s Cancer Research Institute - UTHSCSA, Mail code 7784-7703 Floyd Curl Dr, San Antonio, TX, USA

⁶⁹

Centre for Organismal Studies (COS) Heidelberg, University of Heidelberg, Im Neuenheimer Feld 230, 69120 Heidelberg, Germany

⁷⁰

Basic Sciences Division, Fred Hutchinson Cancer Research Center, 825 Eastlake Avenue East, Seattle, WA, USA

⁷¹

Department of Medicine, Division of Medical Genetics, Box 357720, University of Washington, Seattle, WA, USA

⁷²

Division of Human Biology, Fred Hutchinson Cancer Research Center, 825 Eastlake Avenue East, Seattle, WA, USA

⁷³

Department of Psychiatry and Behavioral Sciences, Box 356560, University of Washington, Seattle, WA, USA

⁷⁴

Microarray Informatics Group, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK

⁷⁵

Genomics and Regulatory Systems Group, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK

⁷⁶

Department of Pathology, Department of Genetics, Stanford University, 300 Pasteur Drive, Stanford, CA, USA

⁷⁷

Department of Computer Science and Engineering, 185 Stevens Way, Seattle, WA, USA

⁷⁸

Department of Electrical Engineering, University of Washington, 185 Stevens Way, Seattle, WA, USA

⁷⁹

Center for Biomedical Informatics, Harvard Medical School, 10 Shattuck St, Boston, MA, USA

⁸⁰

Departments of Biology and Mathematics & Computer Science, Emory University, Atlanta, GA, USA

^†

Present Address: Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar St, Cambridge, MA, USA. Anshul Kundaje

^‡

Present Address: UCLA Biological Chemistry Department, Eli & Edythe Broad Center of Regenerative Medicine and Stem Cell Research at UCLA, Jonsson Comprehensive Cancer Center; 615 Charles E Young Dr South; Los Angeles, CA 90095, USA. Jason Ernst

^§

Present Address: Department of Statistics, 514D Wartik Lab, Penn State University, State College, PA, USA. Qunhua Li

^⊥

Present Address: Department of Biostatistics & Bioinformatics and the Institute for Genome Sciences & Policy, Duke University School of Medicine, 101 Science Drive, Durham, NC, USA. Timothy E. Reddy

^||

Present Address: Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong. Kevin (Yuk-Lap) Yip

^∇

Present Address: Department of Genetics, Washington University in St. Louis, St. Louis, Missouri, USA. Rebecca F. Lowdon

Present Address: Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD, USA. Laura A.L. Dillon

^††

Present Address: National Cancer Institute, National Institutes of Health, Bethesda, MD, USA. Julia Zhang

^‡‡

Present Address: University of California, Davis Population Biology Graduate Group, Davis, CA, USA. Judith R. Wexler

^§§

Present Address: Illumina Cambridge Ltd., Chesterford Research Park, Little Chesterford, Saffron Walden, Essex, CB10 1XL, UK.. Elliott H. Margulies

^⊥⊥

Present Address: BlueGnome Ltd., CPC4, Capital Park, Fulbourn, Cambridge, CB21 5XE, UK. Felix Kokocinski

^{|| ||}

Present Address: Institut de Génétique et Développement de Rennes, CNRS-UMR6061, Université de Rennes 1, F-35000 Rennes, Brittany, France. Thomas Derrien

^∇∇

Present Address: Caltech, 1200 E. California Blvd., Pasadena, CA 91125, USA. Kata Fejes Toth

^{O O}

Present Address: A*STAR-Duke-NUS Neuroscience Research Partnership, 8 College Road, Singapore 169857. Melissa J. Fullwood

^†††

Present Address: St. Laurent Institute, One Kendall Square, Cambridge, MA. USA. Philipp K apranov

^‡‡‡

Present Address: Department of Genetics, Stanford University, Stanford, CA 94305, USA. Hagen Tilgner

^§§§

Present Address: Biomedical Sciences (BMS) Graduate Program, University of California, San Francisco, 513 Parnassus Ave., HSE-1285, San Francisco, CA 94143-0505, USA. Stephanie L. Parker

^⊥⊥⊥

Present Address: Monterey Bay Aquarium Research Institute, Moss Landing, CA, USA. Marijke J. van Baren

^{|| || ||}

Present Address: Department of Machine Learning, NEC Laboratories America, 4 Independence Way, Princeton, NJ 08540, USA. Renqiang Min

^∇∇∇

Present Address: Neuronal Circuit Development Group, Unité de Génétique et Biologie du Développement, U934/UMR3215, Institut Curie - Centre de Recherche, Pole de Biologie du Développement et Cancer, 26, rue d’Ulm, 75248 PARIS Cedex 05, France. Thomas Auer

^{O O O}

Present Address: Cancer Research UK, Cambridge Research Institute, Li Ka Shing Centre, Robinson Way, Cambridge, CB2 0RE, UK. Margus Lukk

^††††

Present Address: Unidade de Bioinformatica, Rua da Quinta Grande, 6, P-2780-156 Oeiras, Portugal. Daniel Sobral

^‡‡‡‡

Present Address: Department of Genome Sciences, University of Washington, 3720 15th Ave NE, Seattle, WA, USA. Max W. Libbrecht

^§§§§

Present Address: Center for Bioinformatics and Computational Biology, 3115 Ag/Life Surge Bldg #296, University of Maryland, College Park, MD, USA. Avinash D. Sahu

References

1.ENCODE_Project_Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science. 2004;306:636–640. doi: 10.1126/science.1105136. 306/5696/636 [pii] [DOI] [PubMed] [Google Scholar]
2.Birney E, et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007;447:799–816. doi: 10.1038/nature05874. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Myers RM, et al. A user’s guide to the encyclopedia of DNA elements (ENCODE) PLoS biology. 2011;9:e1001046. doi: 10.1371/journal.pbio.1001046. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Waterston RH, et al. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002;420:520–562. doi: 10.1038/nature01262. [DOI] [PubMed] [Google Scholar]
5.Chiaromonte F, et al. The share of human genomic DNA under selection estimated from human-mouse genomic alignments. Cold Spring Harbor symposia on quantitative biology. 2003;68:245–254. doi: 10.1101/sqb.2003.68.245. [DOI] [PubMed] [Google Scholar]
6.Cooper GM, et al. Distribution and intensity of constraint in mammalian genomic sequence. Genome research. 2005;15:901–913. doi: 10.1101/gr.3577405. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Parker SC, Hansen L, Abaan HO, Tullius TD, Margulies EH. Local DNA topography correlates with functional noncoding regions of the human genome. Science. 2009;324:389–392. doi: 10.1126/science.1169050. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Lindblad-Toh K, et al. A high-resolution map of human evolutionary constraint using 29 mammals. Nature. 2011;478:476–482. doi: 10.1038/nature10530. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Pheasant M, Mattick JS. Raising the estimate of functional human sequences. Genome research. 2007;17:1245–1253. doi: 10.1101/gr.6406307. [DOI] [PubMed] [Google Scholar]
10.Ponting CP, Hardison RC. What fraction of the human genome is functional? Genome research. 2011;21:1769–1776. doi: 10.1101/gr.116814.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Asthana S, et al. Widely distributed noncoding purifying selection in the human genome. Proceedings of the National Academy of Sciences of the United States of America. 2007;104:12410–12415. doi: 10.1073/pnas.0705140104. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Landt SG, et al. ChIP-seq guidelines and practices used by the ENCODE and modENCODE consortia. Genome research. 2012;22(9) doi: 10.1101/gr.136184.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Li Q, Brown JB, Huang H, Bickel PJ. Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics. 5:1752–1779. [Google Scholar]
14.Harrow J, et al. GENCODE: The reference human genome annotation for the ENCODE project. Genome research. 2012 doi: 10.1101/gr.135350.111. manuscript submitted. [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Howald C, et al. Combining RT-PCR-seq and RNA-seq to catalog all genic elements encoded in the human genome. Genome research. 2012;22(9) doi: 10.1101/gr.134478.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Djebali S, et al. Landscape of transcription in human cells. Nature. 2012 doi: 10.1038/nature11233. in press. [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Derrien T, et al. The GENCODE v7 catalogue of human long non-coding RNAs: Analysis of their gene structure, evolution and expression. Genome research. 2012 doi: 10.1101/gr.132159.111. in press. [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Pei B, et al. The GENCODE Pseudogene Resource: Integration of Functional Genomics Evidence Allows Comprehensive Annotation of Partial Activity. Genome biology. 2012 doi: 10.1186/gb-2012-13-9-r51. Manuscript under review. [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Gerstein MB, et al. Architecture of the human regulatory network derived from ENCODE data. Nature. 2012 doi: 10.1038/nature11245. In Press. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Bickel PJ, Boley N, Brown JB, Huang HY, Zhang NR. Subsampling Methods for Genomic Inference. Annals of Applied Statistics. 2010;4:1660–1697. doi: 10.1214/10-Aoas363. [DOI] [Google Scholar]
21.Kaplan T, et al. Quantitative models of the mechanisms that control genome-wide patterns of transcription factor binding during early Drosophila development. PLoS genetics. 2011;7:e1001290. doi: 10.1371/journal.pgen.1001290. [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Li XY, et al. The role of chromatin accessibility in directing the widespread, overlapping patterns of Drosophila transcription factor binding. Genome biology. 2011;12:R34. doi: 10.1186/gb-2011-12-4-r34. [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Pique-Regi R, et al. Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome research. 2011;21:447–455. doi: 10.1101/gr.112623.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Zhang Y, et al. Primary sequence and epigenetic determinants of in vivo occupancy of genomic DNA by GATA1. Nucleic acids research. 2009;37:7024–7038. doi: 10.1093/nar/gkp747. [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Neph S, et al. An expansive human regulatory lexicon encoded in transcription factor footprints. Nature. 2012 doi: 10.1038/nature11212. in press. [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Whitfield TW, et al. Functional analysis of transcription factor binding sites in human promoters. Genome biology. 2012 doi: 10.1186/gb-2012-13-9-r50. in press. [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Gross DS, Garrard WT. Nuclease hypersensitive sites in chromatin. Annual review of biochemistry. 1988;57:159–197. doi: 10.1146/annurev.bi.57.070188.001111. [DOI] [PubMed] [Google Scholar]
28.Urnov FD. Chromatin remodeling as a guide to transcriptional regulatory networks in mammals. Journal of cellular biochemistry. 2003;88:684–694. doi: 10.1002/jcb.10397. [DOI] [PubMed] [Google Scholar]
29.Thurman RE, et al. The accessible chromatin landscape of the human genome. Nature. 2012 doi: 10.1038/nature11232. in press. [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Kundaje A, et al. Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements. Genome research. 2012;22 doi: 10.1101/gr.136366.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Schultz DC, Ayyanathan K, Negorev D, Maul GG, Rauscher FJ., 3rd SETDB1: a novel KAP-1-associated histone H3, lysine 9-specific methyltransferase that contributes to HP1-mediated silencing of euchromatic genes by KRAB zinc-finger proteins. Genes & development. 2002;16:919–932. doi: 10.1101/gad.973302. [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Frietze S, O’Geen H, Blahnik KR, Jin VX, Farnham PJ. ZNF274 recruits the histone methyltransferase SETDB1 to the 3′ ends of ZNF genes. PloS one. 2010;5:e15082. doi: 10.1371/journal.pone.0015082. [DOI] [PMC free article] [PubMed] [Google Scholar]
33.Boyle AP, et al. High-resolution genome-wide in vivo footprinting of diverse transcription factors in human cells. Genome research. 2011;21:456–464. doi: 10.1101/gr.112656.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
34.Hesselberth JR, et al. Global mapping of protein-DNA interactions in vivo by digital genomic footprinting. Nature methods. 2009;6:283–289. doi: 10.1038/nmeth.1313. [DOI] [PMC free article] [PubMed] [Google Scholar]
35.Zhang Y, et al. Model-based analysis of ChIP-Seq (MACS) Genome biology. 2008;9:R137. doi: 10.1186/gb-2008-9-9-r137. [DOI] [PMC free article] [PubMed] [Google Scholar]
36.Kouzarides T. Chromatin modifications and their function. Cell. 2007;128:693–705. doi: 10.1016/j.cell.2007.02.005. [DOI] [PubMed] [Google Scholar]
37.Li B, Carey M, Workman JL. The role of chromatin during transcription. Cell. 2007;128:707–719. doi: 10.1016/j.cell.2007.01.015. [DOI] [PubMed] [Google Scholar]
38.Hon GC, Hawkins RD, Ren B. Predictive chromatin signatures in the mammalian genome. Human molecular genetics. 2009;18:R195–201. doi: 10.1093/hmg/ddp409. [DOI] [PMC free article] [PubMed] [Google Scholar]
39.Zhou VW, Goren A, Bernstein BE. Charting histone modifications and the functional organization of mammalian genomes. Nature reviews Genetics. 2011;12:7–18. doi: 10.1038/nrg2905. [DOI] [PubMed] [Google Scholar]
40.Ernst J, et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature. 2011;473:43–49. doi: 10.1038/nature09906. [DOI] [PMC free article] [PubMed] [Google Scholar]
41.Hon G, Wang W, Ren B. Discovery and annotation of functional chromatin signatures in the human genome. PLoS computational biology. 2009;5:e1000566. doi: 10.1371/journal.pcbi.1000566. [DOI] [PMC free article] [PubMed] [Google Scholar]
42.Ball MP, et al. Targeted and genome-scale strategies reveal gene-body methylation signatures in human cells. Nature biotechnology. 2009;27:361–368. doi: 10.1038/nbt.1533. [DOI] [PMC free article] [PubMed] [Google Scholar]
43.Meissner A, et al. Genome-scale DNA methylation maps of pluripotent and differentiated cells. Nature. 2008;454:766–770. doi: 10.1038/nature07107. [DOI] [PMC free article] [PubMed] [Google Scholar]
44.Ogryzko VV, Schiltz RL, Russanova V, Howard BH, Nakatani Y. The transcriptional coactivators p300 and CBP are histone acetyltransferases. Cell. 1996;87:953–959. doi: 10.1016/s0092-8674(00)82001-2. [DOI] [PubMed] [Google Scholar]
45.Lister R, et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature. 2009;462:315–322. doi: 10.1038/nature08514. [DOI] [PMC free article] [PubMed] [Google Scholar]
46.Dekker J. Gene regulation in the third dimension. Science. 2008;319:1793–1794. doi: 10.1126/science.1152850. [DOI] [PMC free article] [PubMed] [Google Scholar]
47.Dostie J, et al. Chromosome Conformation Capture Carbon Copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome research. 2006;16:1299–1309. doi: 10.1101/gr.5571506. [DOI] [PMC free article] [PubMed] [Google Scholar]
48.Lajoie BR, van Berkum NL, Sanyal A, Dekker J. My5C: web tools for chromosome conformation capture studies. Nature methods. 2009;6:690–691. doi: 10.1038/nmeth1009-690. [DOI] [PMC free article] [PubMed] [Google Scholar]
49.Sanyal A, Lajoie B, Jain G, Dekker J. The long-range interaction landscape of gene promoters. Nature. 2012 doi: 10.1038/nature11279. in press. [DOI] [PMC free article] [PubMed] [Google Scholar]
50.Fullwood MJ, et al. An oestrogen-receptor-alpha-bound human chromatin interactome. Nature. 2009;462:58–64. doi: 10.1038/nature08497. [DOI] [PMC free article] [PubMed] [Google Scholar]
51.Li G, et al. Extensive promoter-centered chromatin interactions provide a topological basis for transcription regulation. Cell. 2012;148:84–98. doi: 10.1016/j.cell.2011.12.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
52.Borneman AR, et al. Divergence of transcription factor binding sites across related yeast species. Science. 2007;317:815–819. doi: 10.1126/science.1140748. [DOI] [PubMed] [Google Scholar]
53.Odom DT, et al. Tissue-specific transcriptional regulation has diverged significantly between human and mouse. Nature genetics. 2007;39:730–732. doi: 10.1038/ng2047. [DOI] [PMC free article] [PubMed] [Google Scholar]
54.Schmidt D, et al. Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science. 2010;328:1036–1040. doi: 10.1126/science.1186176. [DOI] [PMC free article] [PubMed] [Google Scholar]
55.1000_Genomes_Project_Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534. nature09534 [pii] [DOI] [PMC free article] [PubMed] [Google Scholar]
56.King MC, Wilson AC. Evolution at two levels in humans and chimpanzees. Science. 1975;188:107–116. doi: 10.1126/science.1090005. [DOI] [PubMed] [Google Scholar]
57.Spivakov M, et al. Analysis of variation at transcription factor binding sites in Drosophilaand humans. Genome biology. 2012 doi: 10.1186/gb-2012-13-9-r49. in press. [DOI] [PMC free article] [PubMed] [Google Scholar]
58.Sandelin A, et al. Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nature reviews Genetics. 2007;8:424–436. doi: 10.1038/nrg2026. [DOI] [PubMed] [Google Scholar]
59.Dong X, et al. Modeling gene expression using chromatin features in various cellular contexts. Genome biology. 2012 doi: 10.1186/gb-2012-13-9-r53. manuscript submitted. [DOI] [PMC free article] [PubMed] [Google Scholar]
60.Huff JT, Plocik AM, Guthrie C, Yamamoto KR. Reciprocal intronic and exonic histone modification regions in humans. Nature structural & molecular biology. 2010;17:1495–1499. doi: 10.1038/nsmb.1924. [DOI] [PMC free article] [PubMed] [Google Scholar]
61.Tilgner H, et al. Genomic analysis of ENCODE data: a weak but very widespread role of chromatin organization in alternative splicing. Genome research. 2012 manuscript submitted. [Google Scholar]
62.Tilgner H, et al. Deep sequencing of subcellular RNA fractions shows splicing to be predominantly co-transcriptional in the human genome but inefficient for lncRNAs. Genome research. 2012;22(9) doi: 10.1101/gr.134445.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
63.Fu Y, Sinha M, Peterson CL, Weng Z. The insulator binding protein CTCF positions 20 nucleosomes around its binding sites across the human genome. PLoS genetics. 2008;4:e1000138. doi: 10.1371/journal.pgen.1000138. [DOI] [PMC free article] [PubMed] [Google Scholar]
64.Kornberg RD, Stryer L. Statistical distributions of nucleosomes: nonrandom locations by a stochastic mechanism. Nucleic acids research. 1988;16:6677–6690. doi: 10.1093/nar/16.14.6677. [DOI] [PMC free article] [PubMed] [Google Scholar]
65.Schones DE, et al. Dynamic regulation of nucleosome positioning in the human genome. Cell. 2008;132:887–898. doi: 10.1016/j.cell.2008.02.022. [DOI] [PMC free article] [PubMed] [Google Scholar]
66.Valouev A, et al. Determinants of nucleosome organization in primary human cells. Nature. 2011;474:516–520. doi: 10.1038/nature10002. [DOI] [PMC free article] [PubMed] [Google Scholar]
67.Frietze S, et al. Cell type-specific binding patterns reveal that TCF7L2 can be tethered to the genome by association with GATA3. Genome biology. 2012 doi: 10.1186/gb-2012-13-9-r52. under review. [DOI] [PMC free article] [PubMed] [Google Scholar]
68.Yip KY, et al. Classification of human genomic regions based on experimentally-determined binding sites of more than 100 transcription-related factors. Genome biology. 2012 doi: 10.1186/gb-2012-13-9-r48. in Press. [DOI] [PMC free article] [PubMed] [Google Scholar]
69.Hoffman MM, et al. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature methods. 2012 doi: 10.1038/nmeth.1937. [DOI] [PMC free article] [PubMed] [Google Scholar]
70.Kapranov P, et al. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science. 2007;316:1484–1488. doi: 10.1126/science.1138341. [DOI] [PubMed] [Google Scholar]
71.Koch F, et al. Transcription initiation platforms and GTF recruitment at tissue-specific enhancers and promoters. Nature structural & molecular biology. 2011;18:956–963. doi: 10.1038/nsmb.2085. [DOI] [PubMed] [Google Scholar]
72.McLean CY, et al. GREAT improves functional interpretation of cis-regulatory regions. Nature biotechnology. 2010;28:495–501. doi: 10.1038/nbt.1630. [DOI] [PMC free article] [PubMed] [Google Scholar]
73.Rozowsky J, et al. AlleleSeq: analysis of allele-specific expression and binding in a network framework. Molecular systems biology. 2011;7:522. doi: 10.1038/msb.2011.54. [DOI] [PMC free article] [PubMed] [Google Scholar]
74.Boyle AP, et al. Annotation of Functional Variation in Personal Genomes Using RegulomeDB. Genome research. 2012;22(9) doi: 10.1101/gr.137323.112. [DOI] [PMC free article] [PubMed] [Google Scholar]
75.Hindorff LA, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences of the United States of America. 2009;106:9362–9367. doi: 10.1073/pnas.0903103106. [DOI] [PMC free article] [PubMed] [Google Scholar]
76.Schaub MA, Boyle AP, Kundaje A, Batzoglou S, Snyder M. Linking Disease Associations with Regulatory Information in the Human Genome. Genome research. 2012;22(9) doi: 10.1101/gr.136127.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
77.Libioulle C, et al. Novel Crohn disease locus identified by genome-wide association maps to a gene desert on 5p13.1 and modulates expression of PTGER4. PLoS genetics. 2007;3:e58. doi: 10.1371/journal.pgen.0030058. [DOI] [PMC free article] [PubMed] [Google Scholar]
78.Harismendy O, et al. 9p21 DNA variants associated with coronary artery disease impair interferon-gamma signalling response. Nature. 2011;470:264–268. doi: 10.1038/nature09753. [DOI] [PMC free article] [PubMed] [Google Scholar]
79.Cheng C, et al. Understanding transcriptional regulation by integrative analysis of transcription factor binding data. Genome research. 2012;22(9) doi: 10.1101/gr.136838.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
80.Schuster SC, et al. Complete Khoisan and Bantu genomes from southern Africa. Nature. 2010;463:943–947. doi: 10.1038/nature08795. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

NIHMS381381-supplement-1.doc^{(8MB, doc)}

NIHMS381381-supplement-10.xlsx^{(173.6KB, xlsx)}

NIHMS381381-supplement-11.xls^{(20.5KB, xls)}

NIHMS381381-supplement-3.doc^{(543KB, doc)}

NIHMS381381-supplement-4.xls^{(2.1MB, xls)}

NIHMS381381-supplement-5.xlsx^{(405.6KB, xlsx)}

NIHMS381381-supplement-6.txt^{(907.7KB, txt)}

NIHMS381381-supplement-7.txt^{(2.3MB, txt)}

NIHMS381381-supplement-8.xlsx^{(38.1KB, xlsx)}

NIHMS381381-supplement-9.xlsx^{(41.4KB, xlsx)}

PERMALINK

An Integrated Encyclopedia of DNA Elements in the Human Genome

Summary

ENCODE data production and initial analyses

Box 1.

Integration methodology

Transcribed and protein-coding regions

RNA

Regions bound by transcription factors, transcriptional machinery, and other proteins

Table 1.

DNaseI hypersensitive sites, footprints and nucleosome-depleted regions

Regions of histone modifications

Table 2.

DNA methylation

Chromosome-interacting regions

Summary of ENCODE-identified elements

The impact of selection on functional elements

Figure 1. Impact of Selection on ENCODE Functional Elements in Mammals and Human Populations.

Integration of ENCODE data with known genomic features

Promoter-anchored integration

Figure 2. Modelling Transcription Levels from Histone Modification and TF-Binding Patterns.

Transcription factor-binding site-anchored integration

Figure 3. Patterns and Asymmetry of Chromatin Modification at Transcription Factor-binding Sites.

Transcription factor co-associations

Figure 4. Co-association between Transcription Factors.

Genome-wide integration

Table 3.

Figure 5. Integration of ENCODE Data by Genome-wide Segmentation.

Figure 6. Experimental Characterisation of Segmentations.

Figure 7. High-Resolution Segmentation of ENCODE Data by Self-Organising Maps (SOM).

Insights into human genomic variation

Figure 8. Allele-Specific ENCODE Elements.

Rare variants, individual genomes and somatic variants

Figure 9. Examining ENCODE Elements on a per individual basis in the Normal and Cancer Genome.

Common variants associated with human disease and phenotypes

Figure 10. Comparison of Genome-wide Association Study-identified Loci with ENCODE Data.

Conclusions

Supplementary Material

Acknowledgments

Authors

Overall Coordination (Data Analysis Coordination)

Data Production Leads (Data Production)

Lead Analysts (Data Analysis)

Writing Group

NHGRI Project Management (Scientific Management)

Principal Investigators (Steering Committee)

Boise State University Proteomics Group (Data Production and Analysis)

Broad Institute Group (Data Production and Analysis)

Cold Spring Harbor, University of Geneva, Center for Genomic Regulation, Barcelona, RIKEN, Sanger Institute, University of Lausanne, Genome Institute of Singapore Group (Data Production and Analysis)

Data Coordination Center at UC Santa Cruz (Production Data Coordination)

Duke University, EBI, University of Texas, Austin, University of North Carolina-Chapel Hill Group (Data Production and Analysis)

Genome Institute of Singapore Group (Data Production and Analysis)

HudsonAlpha Institute, Caltech, UC Irvine, Stanford Group (Data Production and Analysis)

Lawrence Berkeley National Laboratory Group (Targeted Experimental Validation)

NHGRI Groups (Data Production and Analysis)

Sanger Institute, Washington University, Yale University, Center for Genomic Regulation, Barcelona, UCSC, MIT, University of Lausanne, CNIO Group (Data Production and Analysis)

Stanford-Yale, Harvard, University of Massachusetts Medical School, University of Southern California/UCDavis Group (Data Production and Analysis)

University of Albany SUNY Group (Data Production and Analysis)

University of Chicago, Stanford Group (Data Production and Analysis)

University of Heidelberg Group (Targeted Experimental Validation)

University of Massachusetts Medical School Bioinformatics Group (Data Production and Analysis)

University of Massachusetts Medical School Genome Folding Group (Data Production and Analysis)

University of Washington, University of Massachusetts Medical Center Group (Data Production and Analysis)

Data Analysis Center (Data Analysis)

Footnotes

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases