Variation graph toolkit improves read mapping by representing genetic variation in the reference.

Garrison E¹, Sirén J¹, Novak AM², Hickey G², et al.

Affiliations

1. Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK. Authors: Garrison E¹, Sirén J¹, Dawson ET¹, Jones W¹, Durbin R¹
2. UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California, USA. Authors: Novak AM², Hickey G², Eizenga JM², Markello C², Paten B²
3. Max-Planck-Institut für Informatik, Saarbrücken, Germany. Authors: Garg S³
4. DNAnexus, Mountain View, California, USA. Authors: Lin MF⁴

Nature Biotechnology, 20 Aug 2018, 36(9):875-879
https://doi.org/10.1038/nbt.4227 PMID: 30125266 PMCID: PMC6126949

This article is in the Europe PMC Open access subset. Refer to the copyright information in the article for licensing details.

A comment on this article appears in "Genomes for all." Nat Biotechnol. 2018 Sep 6;36(9):815-816. doi: 10.1038/nbt.4244. This article is based on a previously available preprint.

Nat Biotechnol. Author manuscript; available in PMC 2019 Feb 20.

Published in final edited form as: Nat Biotechnol. 2018 Oct; 36(9): 875–879. Published online 2018 Aug 20. https://doi.org/10.1038/nbt.4227

PMCID: PMC6126949; EMSID: EMS78750; NIHMSID: NIHMS1500758; PMID: 30125266

Erik Garrison,¹,* Jouni Sirén,¹ Adam M. Novak,² Glenn Hickey,² Jordan M. Eizenga,² Eric T. Dawson,¹,³,⁴ William Jones,¹ Shilpa Garg,⁵ Charles Markello,² Michael F. Lin,⁶ Benedict Paten,² and Richard Durbin¹,⁴,*

Associated Data

Supplementary Materials: Reporting Summary.
NIHMS78750-supplement-Reporting_Summary.pdf (70K)
Supplementary Figures.
NIHMS78750-supplement-Supplementary_Figures.docx (1.9M)
Supplementary Note 1 and Table 1.
NIHMS78750-supplement-Supplementary_Note_1_and_Table_1.pdf (137K)

Data Availability

No new data were collected for this study. The human HG002 data used for Figure 2b are available from ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/latest (calls) and http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP047086 (reads). The yeast whole genome assemblies for Figures 1 and 3 are available from http://www.ebi.ac.uk/ena/data/view/PRJEB7245, the ChIP-seq data set from https://www.encodeproject.org/files/ENCFF000ATK/, and the viral metagenome data from https://www.ebi.ac.uk/ena/data/view/ERS396648.

Abstract

Reference genomes guide our interpretation of DNA sequence data. However, conventional linear references represent only one version of each locus, ignoring variation in the population. Poor representation of an individual's genome sequence impacts read mapping and introduces bias. Variation graphs are bidirected DNA sequence graphs that compactly represent genetic variation across a population, including large-scale structural variation such as inversions and duplications¹. Previous graph genome software implementations²–⁴ have been limited by scalability or topological constraints. Here we present vg, a toolkit of computational methods for creating, manipulating, and utilizing these structures as references at the scale of the human genome. vg provides an efficient approach to mapping reads onto arbitrary variation graphs using generalised compressed suffix arrays⁵, improving accuracy over alignment to a linear reference and effectively removing reference bias. These capabilities make using variation graphs as references for DNA sequencing practical at gigabase scale, or at the topological complexity of de novo assemblies.

For small genomes, it is possible to study genetic variation by assembling whole genomes and then comparing them via whole-genome alignment⁶,⁷. For large genomes, such as the human genome, complete and accurate de novo genome assembly is impractical because of repeat complexity and scale. Therefore prior information is used to interpret new sequence data in its correct genomic context. The current practice is to align sequence reads to a single high-quality reference genome sequence that represents one haplotype at each location in the genome. Although this approach is much faster than de novo assembly and simplifies discovery and reporting of genetic variants, it leads to mapping biases towards variants matching the reference sequence and away from alternative variants. There will even be some sequence in each new sample that is entirely absent from the reference⁸.

To avoid these biases, data would need to be aligned to a “personalized” reference sequence that already incorporates the individual’s variants⁹, but in general it is not known what variants are present in a sample before aligning data from it. However, most differences between any one genome and the reference are segregating in the population. Thus, a reference structure that represents known shared variation will contain most of the correct personalised sequence for any individual.

The natural computational structure for doing this is the sequence graph¹. Sequence graphs or equivalent structures have been used previously to represent multiple sequences that contain shared differences or ambiguities in a single structure. For example, multiple sequence alignments have a natural representation as partially ordered sequence graphs¹⁰. The variant call format¹¹ (VCF), which is a common data format for describing sets of genome sequences, can be understood as defining a partially ordered graph similar to those implied by a multiple sequence alignment. Related structures frequently used in genome assembly include the De Bruijn graph¹² and string graph¹³, which collapse long repeated sequences, so the same nodes are used for different regions of the genome. Graphs to represent genetic variation have previously been used for microbial genomes and localized regions of the human genome such as the Major Histocompatibility Complex².

We define a variation graph as a sequence graph together with a set of paths representing possible sequences from a population (Figure 1). Recently, software packages have been introduced that support a subset of variation graphs that reflect local variation away from a linear reference²,³, formalising approaches introduced in FreeBayes and the GATK HaplotypeCaller for the 1000 Genomes Project analysis¹⁴–¹⁶. Our model goes beyond these in that it does not require the graph to be based on an initial linear reference, or indeed directionally ordered, and thus supports cycles and inversions. vg is the first openly available tool with these properties to scale practically to the multi-gigabase scale required for whole vertebrate genomes.

Figure 1

A region of a yeast genome variation graph. This displays the start of the subtelomeric region on the left arm of chromosome 9 in a multiple alignment of the strains sequenced by Yue et al.²², built using vg from a full genome multiple alignment generated with the Cactus alignment package⁷. The inset shows a subregion of the alignment at single base level. The colored paths correspond to separate contiguous chromosomal segments of these strains. This illustrates the ability of vg to represent paths corresponding to both colinear (inset) and structurally rearranged (main figure) regions of genomic variation.

The core data model, data structures and algorithms, and implementation of vg are described in the Online Methods. Indicative memory and compute run time requirements are given in Supplementary Table 1. Below we present results demonstrating the functionality of vg. Variant calling using vg against a variety of different human genome variation graphs is described elsewhere¹⁷.

For a species such as human, with only 0.1% nucleotide divergence on average between individual genome sequences, over 90% of 100bp reads will derive from sequence exactly matching the reference. Therefore new mappers should perform at least as well for linear reference mapping as the current standard, which we take to be bwa mem¹⁸ with default parameters. We show that vg does this, and then that around divergent sites vg maps more informatively.
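The "over 90%" figure can be checked with a back-of-the-envelope estimate. The sketch below assumes that every base differs from the reference independently with probability equal to the 0.1% average divergence; this independence model is a simplification introduced here for illustration (real variants cluster and include indels), not a calculation taken from the study.

```python
# Assumption: each base differs from the reference independently with
# probability equal to the average nucleotide divergence (0.1%).
divergence = 0.001
read_length = 100

# Probability that every base of a read matches the reference exactly.
p_exact = (1.0 - divergence) ** read_length

print(f"Expected fraction of exactly matching {read_length}bp reads: {p_exact:.3f}")
```

Under this simple model roughly nine reads in ten carry no difference from the reference, which is why performance on reference-matching reads matters so much for any new mapper.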

The final phase of the 1000 Genomes Project (1000GP) produced a data set of approximately 80 million variants in 2504 humans¹⁶. We made a series of vg graphs containing all variants or those with minor allele frequency thresholds at 0.1%, 1% or 10%, as well as a graph corresponding to the standard GRCh37 linear reference sequence without any variation. The full vg graph uses 3.92 GB when serialized to disk, and contains 3.181 Gbp of sequence, which is exactly equivalent to the length of the input reference plus the length of the novel alleles in the VCF file. Complete file sizes including indices vary from 25GB to 63GB, with details including build and mapping times given in Supplementary Table 1.

We next aligned ten million 150 bp paired end reads simulated with errors from the parentally phased haplotypes of an Ashkenazi Jewish male NA24385, sequenced by the Genome in a Bottle (GIAB) Consortium¹⁹ and not included in the 1000GP sample set, to each of these graphs as well as to the linear reference using bwa mem. Figure 2a shows the accuracy of these alignments compared with bwa mem for the 1% allele frequency threshold graph, in terms of Receiver Operating Characteristic (ROC) curves. Comparable plots for other data are given in Supplementary Figure 1.
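As an illustration of how a ROC curve can be parameterised by mapping quality, the following sketch sweeps a mapping quality threshold over simulated reads whose true origin is known. The function name `roc_points`, the input layout and the normalisation by the total number of reads are assumptions made for this example; the exact procedure used to draw Figure 2a is not described in this section.

```python
def roc_points(reads):
    """Return (threshold, false positive rate, true positive rate) triples.

    reads: list of (mapping_quality, is_correct) pairs, where is_correct says
    whether the read was placed at its simulated origin.
    Assumption: both rates are expressed relative to the total number of reads.
    """
    total = len(reads)
    points = []
    # Lower the threshold step by step; each step admits more reads.
    for threshold in sorted({mq for mq, _ in reads}, reverse=True):
        kept = [ok for mq, ok in reads if mq >= threshold]
        true_pos = sum(kept)
        false_pos = len(kept) - true_pos
        points.append((threshold, false_pos / total, true_pos / total))
    return points


# Toy data, invented for the example.
toy_reads = [(60, True), (60, True), (30, True), (30, False), (0, False)]
for point in roc_points(toy_reads):
    print(point)
```

A mapper whose curve lies further up and to the left places more reads correctly at any given error rate.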

Figure 2

Mapping accuracy for vg against the human genome. (a) ROC curves parameterised by mapping quality for 10M read pairs simulated from NA24385 as mapped by bwa mem, vg with the 1000GP 1% allele frequency threshold pangenome reference, and vg with a linear reference, using single end (se) or paired end (pe) mapping. Left: all reads; middle: reads simulated from segments matching the linear reference; right: reads simulated from segments different from the linear reference. (b) The mean alternate allele fraction at heterozygous variants previously called¹⁹ in NA24385 as a function of deletion or insertion size (SNPs at 0). Error bars are +/- one standard error.

Reads that come from parts of the sequence without differences from the reference (middle panel of Fig. 2a) map slightly better to the reference sequence (green) than to the 1000GP graph (red), which we attribute to a combination of the increase in options for alternative places to map reads provided by the variation graph, and the fact that we needed to prune some search index k-mers in the most complex regions of the graph. As expected this difference increases as the allele frequency threshold is lowered and more variants are included in the graph (Supplementary Figure 1).

For reads that were simulated from segments containing non-reference alleles (approximately 10% of reads), which are the reads relevant to variant calling, vg mapping to the 1000GP graph (red) gives better performance than either vg (green) or bwa mem (blue) mapping to the linear reference (right panel of Figure 2a), because many variants present in NA24385 are already represented in the 1000GP graph. This is particularly clear for single end mapping, since many paired end reads are rescued by the mate read mapping. Overall vg performs at least as well as bwa mem even on reference-derived reads, and substantially better on reads containing non-reference variants.

We also mapped a real human genome read set with approximately 50X coverage of Illumina 150bp paired end reads from the NA24385 sample to the 1000GP graph. vg produced mappings for 98.7% of the reads, 88.7% with reported mapping quality score 30 on the Phred scale, and 76.8% with perfect, full-length sequence identity to the reported path on the graph. For comparison, we also used vg to map these reads to the linear reference. Similar proportions of reads mapped (98.7%) and with reported quality 30 (88.8%), but considerably fewer with perfect identity (67.6%). Markedly different mappings were found for 1.0% of reads (0.9% mapping to widely separated positions on the two graphs, and 0.1% mapping to one graph but not the other). The reads mapping to widely separated positions were strongly enriched for repetitive DNA. For example, the linear reference mappings for 27.5% of these read pairs overlap various types of satellite DNA identified by RepeatMasker, compared to 3.0% of all read pairs.

To illustrate the consequences of mapping to a reference graph rather than a linear reference, we stratified the sites independently called as heterozygous in NA24385 by deletion or insertion length (0 for single nucleotide variants) and by whether the site was present in 1000GP, and measured the fraction of reads mapped to the alternate allele for each category. The results show that mapping with vg to the population graph when the variant is present in 1000GP (95.4% of sites) gives nearly balanced coverage of alternate and reference alleles independent of variant size. In contrast, mapping to the linear reference with either vg or bwa mem leads to a bias that increases progressively with deletion and (especially) insertion length (Figure 2b): for insertions around 30bp, a majority of insertion-containing reads are missing (there are over twice as many reference reads as alternate reads).

This removal of bias is important when mapping functional genomics data such as ChIP-seq data, where allele specific expression analysis can reveal genetic variation that affects function but is confounded by reference mapping bias²⁰, especially given that read lengths are typically shorter for these experiments. We compared mapping with bwa or vg for data set ENCFF000ATK from the ENCODE project²¹, which contains 14.9 million 51bp ChIP-seq reads for the H3K4me1 histone methylation mark from the NA12878 cell line. When mapping with bwa the ratio of reference to alternate allele matches at heterozygous sites was 1.20, whereas with vg to the 1000GP graph the ratio was 1.01, effectively eliminating reference bias.
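To relate the reference-to-alternate ratios quoted above to the alternate allele fraction plotted in Figure 2b, the ratio can be converted as follows. The helper `alt_fraction` is a hypothetical name used only for this illustration; an unbiased mapper at heterozygous sites would be expected to give a fraction close to 0.5.

```python
def alt_fraction(ref_to_alt_ratio: float) -> float:
    """Fraction of allele-informative reads that support the alternate allele."""
    return 1.0 / (1.0 + ref_to_alt_ratio)


# Ratios reported in the text for the ENCFF000ATK ChIP-seq data set.
for mapper, ratio in [("bwa, linear reference", 1.20), ("vg, 1000GP graph", 1.01)]:
    print(f"{mapper}: alternate allele fraction {alt_fraction(ratio):.3f}")
```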

We also explored integration of vg with the recently published GraphTyper³ method, which calls genotypes by remapping reads to a local partially ordered variation graph built from a VCF file, relying on initial global assignment to a region of the genome by mapping with bwa to a linear reference. Therefore, although GraphTyper also scales to the whole human genome because it is essentially a local method, its functionality is complementary to that of vg, which maps to a global variation graph and does not directly call genotypes. In experiments where we used vg rather than bwa as the primary mapper for GraphTyper, true positives increased marginally (0.02% for SNPs and 0.06% for indels) while false positives increased for SNPs by 0.15% and decreased for indels by 0.03%. We note, however, that GraphTyper was developed by its authors for bwa mapping.

The graphs that we have used so far were constructed from variation data obtained from mapping to a linear reference, and so are directed acyclic graphs. We next demonstrate the ability of vg to work with arbitrary graphs that include duplications, inversions, and translocations, by showing its use with multiple yeast strains independently assembled de novo using long read data²². These assemblies manifest large scale structural variation and novel sequence not detected in reference-based sequencing, including extensive rearrangement and reordering in subtelomeric regions²² as illustrated in Figure 1.

We compared four vg graphs: a linear reference graph from the standard S288c strain, a linear reference from the SK1 strain, a pangenome graph of all seven strains, and a “drop SK1” variation graph in which all sequence private to the strain SK1 was removed from the pangenome graph. The multiple genome graphs were constructed with the Cactus progressive aligner⁷, which generates graphs that typically contain cycles and are not partially ordered.

Similarly to the human experiments, we simulated 100,000 150bp paired reads from the SK1 reference, modelling sequencing errors, and mapped them to the four references. The resulting ROC curves are shown in Figure 3a. Not surprisingly, the best performance is obtained by mapping to a linear reference of the SK1 strain from which the data were simulated, with substantially higher sensitivity and specificity compared to mapping to the standard linear reference from the strain S288c with either vg or bwa mem. Mapping to the variation graphs gives intermediate performance, with over one percent more sensitivity and lower false positive rates than to the standard reference. There is surprisingly little difference between mapping to graphs with and without the SK1 private variation, probably because much of what is novel in SK1 compared to the reference is also seen in other strains. We see lower sensitivity compared to mapping just to the SK1 sequence, likely because of suppression of GCSA2 index kmers in complex or duplicated regions. In Figure 3b we show the benefit of aligning long reads to a pangenome graph compared to the S288c reference, using a set of 43,337 Pacific Biosciences SK1 reads (mean length 4.7kb) from Ref. 22.

Figure 3

Mapping short and long reads with vg to yeast genome references. (a) ROC curves obtained by mapping 100,000 simulated SK1 yeast strain 150bp paired reads against a variety of references described in the text; (b) a density plot of identity fraction when mapping 43,337 Pacific Biosciences long reads from the SK1 strain to the drop.SK1 reference or the S288c reference.

Finally, to further demonstrate the ability of vg to map to arbitrary sequence graphs, we constructed a vg graph from a metagenomic assembly of a polar freshwater viral DNA community²³ that was constructed with the minia3²⁴ assembler. We then aligned a held-out subset of 100,000 reads to this assembly graph using vg, and to the linear contigs using bwa. Although both methods map approximately 96% of the reads, vg has an average identity score of 95% compared to 87% for bwa, reflecting that the bwa alignments in many cases are not full length (Supplementary Figure 2).

In conclusion, vg implements a suite of tools for genomic sequence data analysis using general variation graph references. Using the vg toolkit, we can construct or import a graph, modify it, visualize it, and use it as a reference. vg can accurately map new sequence reads to the reference using succinct indexes of the graph and its sequence space, and can describe variation between a new sample and an arbitrary reference embedded as a path in the graph. Elsewhere¹⁷ we discuss the use of vg to map read sets and call variants against a number of alternative human reference graphs built from multiple regions of the human genome with different properties.

There are many areas for potential future development and application of vg. These include further improvements in the mapping and variant calling algorithms, potentially using long range statistical haplotype structure information, stored in a graph extension of the PBWT haplotype compression and search data structure²⁵, as proposed by Novak²⁶. Beyond variant calling, the ability to map in an unbiased way to both reference and alternate alleles is potentially important when quantitating allele-specific protein binding as shown with ChIP-seq data above or allele-specific expression²⁷. We note that graphs can also naturally represent the relationships between transcribed, spliced and edited RNA sequences and the genome from which they are transcribed, so the vg software can potentially be used for splice-aware RNAseq mapping²⁸.

We believe that genome variation graphs will underpin a new paradigm for genome sequence data analysis¹. They support the representation of structural variation using the same components (edges, nodes and paths) that are used to represent single base changes. For human, they allow more accurate and complete read mapping (Figure 2). The benefits will only be greater for other organisms with higher levels of genetic variation, or for which uncertainties remain in the reference assembly. For the biological research community to exploit these advantages, it needs software for variation graphs that scales to the genomes of humans and other complex organisms. vg is a robust and openly available platform to fulfil this need.

A life sciences reporting summary is available.

Online Methods

Model

We define a variation graph to be a graph with embedded paths G = (N, E, P) comprising a set of nodes N = n₁ … n_M, a set of edges E = e₁ … e_L, and a set of paths P = p₁ … p_Q, each of which describes the embedding of a sequence into the graph.

Each node n_i represents a sequence seq(n_i) which is built from an alphabet A = {A,C,G,T}. Nodes may be traversed in either the forward or reverse direction, with the sequence being reverse-complemented in the reverse direction. We write n*_i for the reverse-complement of node n_i, so that seq(n_i) = revcomp(seq(n*_i)); note that n_i = n**_i. For convenience, we refer to both n_i and n*_i as “nodes”. Edges represent adjacencies between the sequences of the nodes they connect. Thus, the graph implicitly encodes longer sequences as the concatenation of node sequences along walks through the graph. Edges can be identified with the ordered pairs of oriented nodes that they link, so we can write e_ij = (n_i,n_j). Edges also can be traversed in either the forward or the reverse direction, with the reverse traversal defined as e*_ij = (n*_j,n*_i). Note that graphs in vg can contain ordinary cycles (in which n_i is reachable from n_i), reversing cycles (in which n_i is reachable from n*_i), and non-cyclic instances of reversal (in which both n_i and n*_i are reachable from n_j). We implement paths as an edit string with respect to the concatenation of node sequences along a directed walk through the graph. We do not require the alignment described by the edit string to start at the beginning of the sequence of the initial node, nor to terminate at the end of the sequence of the terminal node.
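The model above can be made concrete with a small sketch. The class and method names (`Traversal`, `VariationGraph`, `walk_sequence`) are hypothetical and chosen for this illustration only; they are not the data structures of vg, which stores graphs in protobuf and in the succinct xg index. Paths are simplified here to plain walks, without the edit strings described in the text.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string over the alphabet {A, C, G, T}."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass(frozen=True)
class Traversal:
    """An oriented node: n_i when reverse is False, n*_i when it is True."""
    node_id: int
    reverse: bool = False

    def flip(self) -> "Traversal":
        return Traversal(self.node_id, not self.reverse)


class VariationGraph:
    def __init__(self):
        self.sequences = {}  # node id -> forward-strand sequence
        self.edges = set()   # pairs of Traversal objects
        self.paths = {}      # path name -> list of Traversal (simplified)

    def add_node(self, node_id: int, seq: str) -> None:
        self.sequences[node_id] = seq

    def add_edge(self, a: Traversal, b: Traversal) -> None:
        # Store e = (a, b) together with its reverse traversal e* = (b*, a*).
        self.edges.add((a, b))
        self.edges.add((b.flip(), a.flip()))

    def seq(self, t: Traversal) -> str:
        s = self.sequences[t.node_id]
        return revcomp(s) if t.reverse else s

    def walk_sequence(self, walk) -> str:
        """Concatenate node sequences along a walk, checking each adjacency."""
        for a, b in zip(walk, walk[1:]):
            if (a, b) not in self.edges:
                raise ValueError(f"no edge between {a} and {b}")
        return "".join(self.seq(t) for t in walk)


# A toy SNP bubble: ACG, then T or G, then TTA.
g = VariationGraph()
for node_id, seq in [(1, "ACG"), (2, "T"), (3, "G"), (4, "TTA")]:
    g.add_node(node_id, seq)
for a, b in [(1, 2), (1, 3), (2, 4), (3, 4)]:
    g.add_edge(Traversal(a), Traversal(b))

forward = [Traversal(1), Traversal(2), Traversal(4)]
backward = [t.flip() for t in reversed(forward)]
print(g.walk_sequence(forward))
print(g.walk_sequence(backward))
```

Walking the same nodes in the opposite order and orientation spells the reverse complement of the forward walk, which is the property that lets a bidirected graph represent both strands, and therefore inversions, with a single set of nodes.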

Implementation

The vg implementation is multithreaded and written in C++11, and is available from https://github.com/vgteam/vg (version v1.6.0 for code used in the mapping experiments) under the MIT open source software license. It provides both a primary application to support the operations we describe here, and a library libvg which applications can use to access the data structures, indexes and low level operations.

Our core representation of the graph uses Google's open source protobuf system, which directly supports serialisation onto disk for storage. We also provide a protobuf alignment format, GAM (for “Graph Alignment Map”), with analogous functionality to BAM²⁹; vg can also export mappings with respect to embedded references in BAM format. To enable read mapping and other random access operations against large sequence graphs we have implemented a succinct representation of a vg variation graph (xg) that is static but very memory and time efficient, using rank/select dictionaries and other data structures from the Succinct Data Structures Library (SDSL)³⁰. Graphs can be imported from and exported to a variety of formats, including the assembly format GFA and the W3C graph exchange format RDF. Further details about the implementation and features are available in the supplement and at the GitHub website.

Alignment

A key requirement for a reference genome is the ability to efficiently and accurately find an optimal alignment for a new DNA sequence such as a sequencing read. Analogous to the way that read mappers to linear references work, our approach to this problem is to find seed matches by an indexed search process, cluster them if there are multiple seeds close together, and then perform a local constrained dynamic programming alignment of the read against a region of the graph around each cluster. A brief description of the key steps in this process is given here, with further details in the supplement.

The GCSA2 library⁵ that vg uses for seeding can perform linear time exact match queries independent of the graph size to find super-maximal exact match (SMEM) seeds, subject to a maximum query length, in time comparable to the corresponding operations in bwa mem. SMEMs are exact matches between a query substring and a reference substring that cannot be extended in either direction, and for which there is no extension of the query substring that matches elsewhere in the graph.
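To make the SMEM definition concrete, here is a brute-force sketch on a plain string. It assumes a linear, forward-strand reference and uses Python substring search in place of an index, so it illustrates only the definition; it is not how GCSA2 works, and the function name `smems` is introduced here for illustration.

```python
def smems(query: str, reference: str, min_length: int = 1):
    """Return half-open (start, end) query intervals of super-maximal exact matches.

    For every query start we find the longest substring that occurs anywhere
    in the reference. Such a match is super-maximal if it is not contained in
    the match found from an earlier start, i.e. if it ends further right.
    """
    result = []
    furthest_end = 0
    for start in range(len(query)):
        end = start
        # Extend to the right while the substring still occurs in the reference.
        while end < len(query) and query[start:end + 1] in reference:
            end += 1
        if end > furthest_end and end - start >= min_length:
            result.append((start, end))
        furthest_end = max(furthest_end, end)
    return result


reference = "ACGTACGTTT"
query = "CGTAGTTT"
for start, end in smems(query, reference):
    print(start, end, query[start:end])
```

In this toy case the query splits into two seeds, because no reference substring spans the junction between them. The substring test makes this sketch far slower than an indexed search, which is the cost that a compressed suffix array over the graph avoids.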

After obtaining SMEMs for a query sequence using GCSA2, we cluster them using a global approximate distance metric and distance estimates provided by any nearby paths. For paired reads, we cluster all the SMEMs for both reads in the pair to preferentially support mappings where the SMEMs match a fragment model that we establish online during the alignment of the read set.

We next chain the SMEMs within each cluster by selecting the maximum likelihood path through a Markov model that rewards long SMEMs and short colinear gaps between SMEMs. In many cases there is just one SMEM in the sequence, but there are complex cases where the best SMEM sequence is not correct, and to catch these we recursively mask out the SMEMs in paths found so far and re-run the algorithm to obtain additional disjoint SMEM sequences if available.
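The idea of rewarding long seeds and short colinear gaps can be sketched with a simple dynamic programme. This is a deliberately simplified stand-in: it works on linear reference coordinates and uses an additive score with a hypothetical `gap_penalty` parameter, rather than the Markov model and graph distances that vg uses.

```python
def best_chain(seeds, gap_penalty=0.5):
    """Pick the highest-scoring colinear chain of seeds.

    seeds: list of (query_start, ref_start, length) exact matches.
    The score of a chain is the total seed length minus gap_penalty times
    the bases skipped in the query and in the reference between seeds.
    Returns (score, chain), with the chain ordered along the query.
    """
    seeds = sorted(seeds)
    if not seeds:
        return 0.0, []
    score = [float(length) for _, _, length in seeds]
    back = [None] * len(seeds)
    for j, (qj, rj, lj) in enumerate(seeds):
        for i in range(j):
            qi, ri, li = seeds[i]
            q_gap = qj - (qi + li)
            r_gap = rj - (ri + li)
            if q_gap < 0 or r_gap < 0:
                continue  # overlapping or out of order: not colinear
            candidate = score[i] + lj - gap_penalty * (q_gap + r_gap)
            if candidate > score[j]:
                score[j] = candidate
                back[j] = i
    j = max(range(len(seeds)), key=score.__getitem__)
    best = score[j]
    chain = []
    while j is not None:
        chain.append(seeds[j])
        j = back[j]
    return best, chain[::-1]


# Two colinear seeds and one distant, spurious hit (toy values).
toy_seeds = [(0, 100, 20), (25, 125, 30), (22, 5000, 10)]
print(best_chain(toy_seeds))
```

Masking the seeds of the winning chain and running the function again on the remainder mirrors the recursive search for additional disjoint chains described above.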

For each consistent sequence of SMEMs, we then obtain the subgraph containing the cluster. To avoid the complications introduced by cycles and inversions³¹, we transform the local graph region into a directed acyclic graph (DAG) while maintaining an embedding in the original, potentially cyclic bidirected graph (Supplementary figures S3, S4). We can then perform partial order alignment to the DAG¹⁰, using banded dynamic programming and an extension of Farrar's SIMD-accelerated striped Smith Waterman algorithm³².
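Once the local region is a DAG, alignment is a dynamic programme in which each graph base may have several predecessors. The sketch below is a plain, unbanded, non-SIMD local alignment with a linear gap penalty, written for illustration only. It assumes that node identifiers are already in topological order and that every node has a non-empty sequence; the scoring values and the function name `align_to_dag` are hypothetical.

```python
def align_to_dag(query, nodes, edges, match=1, mismatch=-4, gap=-6):
    """Best local alignment score of query against any path through a DAG.

    nodes: dict mapping node id to a non-empty sequence.
    edges: iterable of (from_id, to_id) pairs.
    Assumption: sorting the node ids yields a topological order.
    """
    node_preds = {n: [] for n in nodes}
    for a, b in edges:
        node_preds[b].append(a)

    # Flatten the graph to single bases, recording predecessor bases.
    bases, base_preds, last_base = [], [], {}
    for n in sorted(nodes):
        for offset, ch in enumerate(nodes[n]):
            if offset == 0:
                base_preds.append([last_base[p] for p in node_preds[n]])
            else:
                base_preds.append([len(bases) - 1])
            bases.append(ch)
        last_base[n] = len(bases) - 1

    m = len(query)
    zero_row = [0] * (m + 1)
    rows, best = [], 0
    for b, ch in enumerate(bases):
        prev_rows = [rows[p] for p in base_preds[b]] or [zero_row]
        row = [0] * (m + 1)
        for j in range(1, m + 1):
            s = match if ch == query[j - 1] else mismatch
            diagonal = max(r[j - 1] for r in prev_rows) + s  # align base to base
            skip_graph = max(r[j] for r in prev_rows) + gap  # gap in the query
            skip_query = row[j - 1] + gap                    # gap in the graph
            row[j] = max(0, diagonal, skip_graph, skip_query)
            best = max(best, row[j])
        rows.append(row)
    return best


# Toy SNP bubble (ACG, then T or G, then TA) versus the linear reference.
bubble_nodes = {1: "ACG", 2: "T", 3: "G", 4: "TA"}
bubble_edges = [(1, 2), (1, 3), (2, 4), (3, 4)]
print(align_to_dag("ACGGTA", bubble_nodes, bubble_edges))
print(align_to_dag("ACGGTA", {1: "ACGTTA"}, []))
```

A read carrying the alternate allele aligns end to end through the bubble, while against the linear reference only part of it scores. This is the effect that produces reference bias in linear mapping.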

When mapping long sequences, we split them into overlapping “chunks” (default 256bp with 32bp overlap), map those as above, then chain them using the same colinear Markov model method as described for SMEMs within a chunk. This scales effectively linearly in sequence length up to multiple megabases.
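The chunking step can be sketched as follows, using the default sizes quoted above. How vg treats the final, shorter chunk is not stated in this section, so keeping it as is represents an assumption of this example, as does the function name `chunk_read`.

```python
def chunk_read(seq, chunk_size=256, overlap=32):
    """Split a long read into overlapping chunks.

    Returns a list of (start_offset, chunk_sequence) pairs. Consecutive
    chunks share `overlap` bases; the last chunk may be shorter.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not seq:
        return []
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, seq[start:start + chunk_size]))
        if start + chunk_size >= len(seq):
            break
        start += step
    return chunks


long_read = "ACGT" * 250  # a 1,000bp toy read
for start, chunk in chunk_read(long_read):
    print(start, len(chunk))
```

Because each chunk is mapped independently and the number of chunks grows linearly with read length, the cost of the mapping stage grows linearly as well, which is consistent with the scaling behaviour described in the text.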

The vg alignment tool also uses base qualities in alignment scores and calculates adjusted mapping quality scores. Base qualities are probabilistic estimates of the confidence of each base call in a read provided by the sequencing technology. vg combines these with a probabilistic interpretation of alignment³³ to adjust the scoring function for alignments, which has previously been shown to improve variant calling accuracy³⁴. Mapping qualities³⁵ are a probability-based measure of the confidence in the localisation of a read on the reference that is important for variant calling and other downstream analyses. vg computes mapping qualities by comparing the scores of optimal and suboptimal alignments under the probabilistic alignment model, in a similar fashion to bwa mem.
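One common way to turn alignment scores into a Phred-scaled mapping quality is sketched below. It treats each candidate alignment score as a log-likelihood (up to a hypothetical `scale` factor) and reports the posterior probability that the best candidate is wrong. This is an illustrative formulation under those assumptions, not the exact computation in vg or bwa mem; the cap of 60 is likewise a choice made for this example.

```python
import math


def mapping_quality(scores, scale=1.0, cap=60):
    """Phred-scaled confidence that the best-scoring alignment is correct.

    scores: alignment scores of all candidate placements of one read.
    scale:  assumed factor converting score differences to natural-log
            likelihood ratios (hypothetical parameter).
    cap:    maximum reported quality.
    """
    if not scores:
        raise ValueError("at least one alignment score is required")
    best = max(scores)
    # Likelihood of each candidate relative to the best one (stable in floats).
    weights = [math.exp(scale * (s - best)) for s in scores]
    total = sum(weights)
    p_wrong = (total - 1.0) / total  # the best candidate has weight 1
    if p_wrong <= 0.0:
        return cap
    return min(cap, round(-10.0 * math.log10(p_wrong)))


print(mapping_quality([100]))       # a unique placement
print(mapping_quality([100, 90]))   # a clearly worse alternative exists
print(mapping_quality([100, 100]))  # two equally good placements
```

Reads with an equally good alternative placement receive a quality near zero, which is how repeats end up at the low-confidence end of the ROC curves.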

Graph editing and construction

We can build a graph either by direct construction from external graphs such as from de novo assemblies, or by a series of editing operations applied to simple starting graphs such as standard linear reference genomes. To support editing of existing graphs, vg supports operations that can split a node where sequences diverge and insert additional edges and nodes. While doing this it keeps track of the relationship to the previous graph in a translation object, which supports projection of coordinates from one version of the graph to another.

We make use of the editing operations to construct graphs from Variant Call Format (VCF) files¹¹ as produced by population sequencing projects such as the 1000 Genomes Project¹⁶, inserting a cluster of nodes and edges into a linear reference for each overlapping subset of VCF records. Edit operations also allow progressive construction of a vg graph from a set of sequences by repeated alignment and editing, so that all the initial sequences are embedded in the graph as paths. Last but not least, edit operations allow new variants to be added to an existing vg reference graph to support use cases such as incorporating novel variants from new individuals mapped and called against the graph, while retaining a coordinate mapping to the existing reference. These actions are also invertible, in that vg can generate VCF to describe the graph as a set of variants, using an arbitrarily chosen embedded path as a reference.
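The construction from a linear reference plus variants can be illustrated with a minimal sketch. It assumes biallelic, non-overlapping variants given as 0-based (position, reference allele, alternate allele) tuples with non-empty alleles, as in VCF records that carry an anchor base; overlapping records, which vg handles as clusters, are rejected here. The function name `build_graph` and the data layout are hypothetical.

```python
def build_graph(reference, variants):
    """Build a forward-only sequence graph from a reference and simple variants.

    variants: iterable of (pos0, ref_allele, alt_allele), non-overlapping.
    Returns (nodes, edges): nodes maps id -> sequence, edges is a set of
    (from_id, to_id) pairs. Node ids are assigned in left-to-right order.
    """
    nodes = {}
    edges = set()
    tails = []   # nodes whose right end connects to whatever comes next
    cursor = 0   # first reference base not yet placed in the graph

    def add_node(seq):
        node_id = len(nodes) + 1
        nodes[node_id] = seq
        for tail in tails:
            edges.add((tail, node_id))
        return node_id

    for pos, ref_allele, alt_allele in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants are not supported in this sketch")
        if reference[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError(f"reference allele mismatch at position {pos}")
        if pos > cursor:
            # Shared sequence between the previous site and this one.
            tails = [add_node(reference[cursor:pos])]
        ref_id = add_node(ref_allele)
        alt_id = add_node(alt_allele)
        tails = [ref_id, alt_id]
        cursor = pos + len(ref_allele)

    if cursor < len(reference):
        add_node(reference[cursor:])
    return nodes, edges


toy_nodes, toy_edges = build_graph("ACGTTA", [(3, "T", "G")])
print(toy_nodes)
print(sorted(toy_edges))
```

Each variant site becomes a small bubble with one branch per allele, and the reference itself remains embedded as the walk that takes the reference branch at every site.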

Experiments

Experiments were carried out on a dedicated compute node with 256 gigabytes of RAM and two 2.4GHz AMD Opteron 6378 processors with a total of 32 CPU cores. Mapping comparisons were to bwa version 0.7.15-r1142.

GraphTyper comparisons

To test how vg map complements genotyping in GraphTyper, we mapped reads from the Genome In A Bottle (GIAB) Ashkenazi Jewish Trio benchmark sample HG002 readset, and analyzed variant calling performance on chromosome 21 against the HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased.vcf.gz calls using Illumina’s Haplotype Comparison Toolset available from https://github.com/Illumina/hap.py. The bwa mem mappings were against the GRCh37d5 reference, and the vg mappings were against the 1000GP graph and then projected onto the GRCh37d5 reference, which is embedded in the 1000GP graph. GraphTyper version 1.3 was run using the dbSNP “common variant” chromosome 21 VCF from NCBI (ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/common_all_20170710.vcf.gz). Code used for the analysis is available on request.

Acknowledgements

EG, JS and RD were funded by the Wellcome Trust (grants 206194 and 207492). ETD was funded by an NIH Cambridge Trust studentship, and WJ by a Wellcome Trust MGM studentship (109083/Z/15/Z). AMN, GH, JME and BP were supported by the National Institutes of Health (5U41HG007234), the W. M. Keck Foundation (DT06172015) and the Simons Foundation (SFLIFE# 35190). We thank members of the GA4GH Reference Variation Working Group for support, ideas and comments, and Hannes Eggertsson for assistance in the integration with GraphTyper.

Go to:

Footnotes

Author Contributions

EG conceived and led the development of vg, JS developed the GCSA2 index, AMN, GH, JME and ETD contributed to the software, EG, WJ, SG, CM, MFL and RD contributed results and data analysis, RD and BP oversaw the project, and all contributed to the manuscript.

Competing Financial Interests

MFL is an employee of, and EG consults for, DNAnexus Inc. RD holds shares in and consults for Congenica Ltd and Dovetail Inc. The remaining authors declare no competing financial interests.

Data Availability

Code Availability:

vg is available at https://github.com/vgteam/vg under the MIT open source software licence.

References

1. Paten B, Novak AM, Eizenga JM, Garrison E. Genome graphs and the evolution of genome inference. Genome Res. 2017;27:665–676.

2. Dilthey A, Cox C, Iqbal Z, Nelson MR, McVean G. Improved genome inference in the MHC using a population reference graph. Nat Genet. 2015;47:682–688.

3. Eggertsson HP, et al. Graphtyper enables population-scale genotyping using pangenome graphs. Nat Genet. 2017;49:1654–1660.

4. Rakocevic G, et al. Fast and accurate genomic analyses using genome graphs. bioRxiv preprint. 2017. doi:10.1101/194530.

5. Sirén J. Indexing variation graphs. Proc 19th Workshop on Algorithm Engineering and Experiments (ALENEX); 2017.

6. Delcher AL, et al. Alignment of whole genomes. Nucleic Acids Res. 1999;27:2369–2376.

7. Paten B, et al. Cactus: Algorithms for genome multiple sequence alignment. Genome Res. 2011;21:1512–1528.

8. Li R, et al. Building the sequence map of the human pan-genome. Nat Biotechnol. 2010;28:57–63.

9. Yuan S, Qin Z. Read-mapping using personalized diploid reference genome for RNA sequencing data reduced bias for detecting allele specific expression. IEEE International Conference on Bioinformatics and Biomedicine Workshops; 2012.

10. Lee C, Grasso C, Sharlow MF. Multiple sequence alignment using partial order graphs. Bioinformatics. 2002;18:452–464.

11. Danecek P, et al. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–2158.

12. Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA. 2001;98:9748–9753.

13. Myers EW. The fragment assembly string graph. Bioinformatics. 2005;21(Suppl 2):79–85.

14. Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint. 2012. arXiv:1207.3907.

15. DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43:491–498.

16. 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature. 2015;526:68–74.

17. Novak AM, et al. Genome graphs. bioRxiv preprint. 2017. doi:10.1101/101378.

18. Li H. Aligning sequence reads, clone sequences and assembly contigs with bwa-mem. arXiv preprint. 2013. arXiv:1303.3997.

19. Zook JM, et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data. 2016;3:160025.

20. McDaniell R, et al. Heritable individual-specific and allele-specific chromatin signatures in humans. Science. 2010;328:235–239.

21. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74.

22. Yue J-X, et al. Contrasting evolutionary genome dynamics between domesticated and wild yeasts. Nat Genet. 2017;49:913–924.

23. de Cárcer DA, López-Bueno A, Pearce DA, Alcamí A. Biodiversity and distribution of polar freshwater DNA viruses. Science Advances. 2015;1:e1400127.

24. Chikhi R, Rizk G. Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms Mol Biol. 2013;8:22.

25. Durbin R. Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT). Bioinformatics. 2014;30:1266–1272.

26. Novak AM, Garrison E, Paten B. A graph extension of the positional Burrows-Wheeler transform and its applications. In: Firth M, Pedersen CN, editors. Algorithms in bioinformatics. Springer; Heidelberg: 2016. pp. 246–256.

27. Ge B, et al. Global patterns of cis variation in human cells revealed by high-density allelic expression analysis. Nat Genet. 2009;41:1216–1222.

28. Beretta S, et al. Mapping RNAseq Data to a Transcript Graph via Approximate Pattern Matching to a Hypertext. In: Figueiredo D, Martín-Vide C, Pratas D, Vega-Rodríguez M, editors. Algorithms for Computational Biology (AlCoB). Lecture Notes in Computer Science. Vol. 10252. Springer; Cham: 2017. pp. 49–61.

29. Li H, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079.

30. Gog S, Beller T, Moffat A, Petri M. From theory to practice: Plug and play with succinct data structures. International Symposium on Experimental Algorithms; Springer; 2014. pp. 326–337.

31. Myers EW, Miller W. Approximate matching of regular expressions. Bull Math Biol. 1989;51:5–37.

32. Farrar M. Striped Smith-Waterman speeds database searches six times over other SIMD implementations. Bioinformatics. 2007;23:156–161.

33. Durbin R, Eddy SR, Krogh A, Mitchison G. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press; 1998.

34. Hamada M, Wijaya E, Frith MC, Asai K. Probabilistic alignments with quality scores: an application to short-read mapping toward accurate SNP/indel detection. Bioinformatics. 2011;27:3085–3092.

35. Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27:2987–2993.

Data

Data behind the article

This data has been text mined from the article, or deposited into data resources.

BioStudies: supplemental material and supporting data

http://www.ebi.ac.uk/biostudies/studies/S-EPMC6126949?xr=true

Nucleotide Sequences

(1 citation) ENA - SRP047086

EBI Metagenomics/MGnify

https://www.ebi.ac.uk/metagenomics/projects/ERP004659

Data that cites the article

This data has been provided by curated databases and other sources that have cited the article.

ENCODE: Encyclopedia of DNA Elements

http://encodeproject.org/publications/284b81ad-327d-4126-9f75-32d46bf00ea6/

Funding

Funders who supported this work.

NHGRI NIH HHS (3)

Grant ID: T32 HG008345
Grant ID: U54 HG007990
Grant ID: U41 HG007234

NHLBI NIH HHS (1)

Grant ID: U01 HL137183

Wellcome Trust (5)

Whole genome sequence based analysis of genetic variation and genome evolution
Dr Richard Durbin, University of Cambridge
Grant ID: 207492/Z/17/Z
Cambridge, Mathematical Genomics and Medicine.
Mr William Jones, University of Cambridge
Grant ID: 109083
Grant ID: 109083/Z/15/Z
Wellcome Trust Sanger Institute - generic account for deposition of all core-funded research papers.
Prof Sir Michael Stratton, Wellcome Trust Sanger Institute
Grant ID: 206194
Whole genome sequence based analysis of genetic variation and genome evolution
Dr Richard Durbin, University of Cambridge
Grant ID: 207492
