Simultaneous Phasing of Multiple Polyploids

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 11925))

Included in the following conference series:

International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics

738 Accesses

Abstract

We address the problem of phasing polyploids specifically with polyploidy larger than two. We consider the scenario where the input is the genotype of samples along a genic chromosomal segment. In this setting, instead of NGS reads of the segments of a sample, genotype data from multiple individuals is available for simultaneous phasing. For this mathematically interesting problem, with application in plant genomics, we design and test two algorithms under a parsimony model. The first is a linear time greedy algorithm and the second is a more carefully crafted algebraic algorithm. We show that both the methods work reasonably well (with accuracy on an average larger than 80%). The former is very time-efficient and the latter improves the accuracy further.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

nPhase: an accurate and contiguous phasing method for polyploids

Article Open access 29 April 2021

GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes

Article Open access 18 March 2020

Using probabilistic genotypes in linkage analysis of polyploids

Article Open access 25 May 2021

References

Aguiar, D., Istrail, S.: Haplotype assembly in polyploid genomes and identical by descent shared tracts. Bioinformatics 29(13), i352–i360 (2013). https://doi.org/10.1093/bioinformatics/btt213. http://dx.doi.org/10.1093/bioinformatics/btt213
Article Google Scholar
Browning, S., Browning, B.: Haplotype phasing: existing methods and new developments. Nat. Rev. Genet. 12, 703 (2011)
Article Google Scholar
Chaisson, M.J., Mukherjee, S., Kannan, S., Eichler, E.E.: Resolving multicopy duplications de novo using polyploid phasing. In: Sahinalp, S.C. (ed.) RECOMB 2017. LNCS, vol. 10229, pp. 117–133. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-56970-3_8
Chapter Google Scholar
Halldórsson, B.V., Bafna, V., Edwards, N., Lippert, R., Yooseph, S., Istrail, S.: A survey of computational methods for determining haplotypes. In: Istrail, S., Waterman, M., Clark, A. (eds.) RSNPsH 2002. LNCS, vol. 2983, pp. 26–47. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24719-7_3
Chapter MATH Google Scholar
He, D., Saha, S., Finkers, R., Parida, L.: Efficient algorithms for polyploid haplotype phasing. BMC Genom. 19(2), 110 (2018)
Article Google Scholar
Motazedi, E., Finkers, R., Maliepaard, C., de Ridder, D.: Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study. Brief. Bioinform. 19(3), 387–403 (2018)
Google Scholar
Siragusa, E., Haiminen, N., Utro, F., Parida, L.: Linear time algorithms to construct populations fitting multiple constraint distributions at genomic scales. IEEE/ACM Trans. Comput. Biol. Bioinform. 16, 1132–1142 (2018)
Article Google Scholar
Siragusa, E., Haiminen, N., Finkers, R., Visser, R., Parida, L.: Haplotype assembly of autotetraploid potato using integer linear programing. Bioinformatics (2019). https://doi.org/10.1093/bioinformatics/btz060
Utro, F., et al.: iXora: exact haplotype inferencing and trait association. BMC Genet. 14(1), 48 (2013)
Article Google Scholar

Download references

Author information

Authors and Affiliations

Computational Biology Center, IBM T. J. Watson Research, Yorktown Heights, NY, 10598, USA
Laxmi Parida & Filippo Utro

Authors

Laxmi Parida
View author publications
You can also search for this author in PubMed Google Scholar
Filippo Utro
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Laxmi Parida .

Editor information

Editors and Affiliations

Universidade Nova de Lisboa, Caparica, Portugal
Maria Raposo
Universidade Nova de Lisboa, Caparica, Portugal
Paulo Ribeiro
University of Lisbon, Caparica, Portugal
Susana Sério
Università degli Studi di Napoli Parthenope, Naples, Italy
Antonino Staiano
Università degli Studi di Napoli Parthenope, Naples, Italy
Angelo Ciaramella

6 Supplement

1.1 6.1 On the Greedy Method

The Greedy Algorithm: Concrete Examples. Refer to Fig. 2 for the example input matrix G and the trie $\mathcal T$ that is constructed. The nodes in the Trie are labeled by the column identifier on top and then following vertically down from that column. For example, $v_{5,3}$ refers to the third node from top along the column marked 5. It is labeled by SNP 0 (i.e., hollow circle) and the cardinality of the label set is 4.

Greedy Min Set cover algorithm:

1.
Associate a Weight to the columns by the number of rows they cover (each multiplied by the number in that row).
2.
Sort the columns by this weight.
3.
Solution: Traverse down the sorted list in descending order till all rows are covered.

A node is untouchable if and only if all its ancestral nodes are produced by homozygosity. Every other node is touchable.

Priority to fix the matrix, after the 1’s have been assigned:

1.
Eliminate singletons.
1. (a)
  Mark a singleton column as 1 if you need to remove the 1 and −1 if you need to remove a 0.
2. (b)
  Then, search for pairs of −1 marked columns and 1 marked columns where the exchange can happen, i.e., there exists at least one row in the matrix such that both are not marked X for these 2 columns. Make the exchange.
3. (c)
  If columns are still marked; then check if they can be moved without generating new singletons.
4. (d)
  If all fails then leave the singletons as they are.
2.
Reduce gaps between siblings.
1. (a)
  Mark a column as 1 if you need to remove the 1 to get a balance. Mark it as −1 if you need to move the 0 to balance it. A balanced column is marked 0.
2. (b)
  Then, search for pairs of −1 marked columns and 1 marked columns where the exchange can happen, i.e., there exists at least one row in the matrix such that both are not marked X for these 2 columns. Make the exchange.

The Heuristics proposed are: (Heuristic I): Using set cover (explained earlier); (Heuristic II) controlling the MAF allele (1’s) along each path of the tree or haplotype; (Heuristic III (backtracking)) Collapsing two isomorphic sub-trees:

1.
they must have the same values, i.e., 0 or 1 and
2.
there must exist a column in the matrix where they can get aligned, i.e., collapsed.

and (Heuristic IV (Trie-shaking)) See Fig. 3.

1.2 6.2 On the Algebraic Method

Notation and Basic Definitions. Each element of the matrix is a genotype, say X.

Definition 1 (coded genotype X and $x_1,x_0,x_l,x_L, X_p, \{X\}$ of X). Each genotype is equivalent to a 3-tuple (triple)

The implementation tracks the states of the variables as v or $\bar{v}$ where $v \in x_L$. For brevity, we skip the details. Some concrete illustrative examples of genotypes:

$$\begin{array}{cccccc} X &{} \hbox {3-tuple} &{}\{X\} &{}|X|&{}X_p&{}x_L\\ \hline \fbox {11100} &{}(3,2,0) &{}\left\{ \fbox {11100}\right\} &{}1&{}5 &{}\emptyset \\ \fbox {1110q} &{}(3,1,1) &{}\left\{ \fbox {11100}, \fbox {11110}\right\} &{}2&{}5&{}\{q\}\\ \fbox {11qr} &{} (2,0,2) &{} \left\{ \fbox {1100}, \fbox {1110}, \fbox {1111} \right\} &{}3&{}4&{}\{q,r\} \end{array} $$

Definition 2 (VOID, empty genotype). A coded genotype X is empty when $\{X\} = \emptyset $. X is VOID when $x_1 < 0$ or $x_0 <0$ holds.

Definition 3 (X $\le $ Y). Let X and Y be genotypes. $X \le Y \Leftrightarrow x_1\le y_1, x_0 \le y_0, x_l \le y_l$.

Lemma 3

For a genotype $X \equiv (x_1,x_0,x_l)$ the following hold:

1.
$|X| = x_l + 1.$
2.
$X \subseteq Y \Leftrightarrow (X_p=Y_p) \hbox { AND } y_1 \le x_1 \le x_1 + x_l \le y_1 + y_l.$

Sketch of Proof: 2. Since $X_p=Y_p$, it is adequate to base the arguments only on the number of 1’s in X and Y. The possible number of 1’s in X is in the interval $\left[ x_1,x_1+x_l\right] $ and similarly in Y. So if $X\subseteq Y$, then $\left[ x_1,x_1+x_l\right] $ is contained in $\left[ y_1,y_1+y_l\right] $ and vice-versa, leading to the above. $\Box $

Definition 4 ($\langle X \rangle $, ploidy $\langle X \rangle _p$). $\left\langle X \right\rangle $ is defined to an ordered finite list of coded genotypes $X^1, X^2, \ldots , X^j \ldots $ with the same ploidy k. Then k is defined to be $\left\langle X \right\rangle _p$, the ploidy of the $\langle X \rangle $.

Algebra of Genotypes

Resolving Variables. Two randomized procedures variable-to-constant (v2c) and variable-to-variable (v2v) are defined below. Also, a composition of these two primitive operations in resVar() on two coded genotypes.

Primitive Genotype Operations. When X and Y are two given coded genotypes, Z is produced based on the operations as follows:

VOID/Empty Genotypes. When a primitive operation fails, i.e., either results in an empty genotype $Z \equiv (0,0,\emptyset )$ or at least one of $z_0$, $z_1$ is negative, then we resolve some of the variables, either by assigning explicit 1 or 0 (v2c) or assigning it to other variables (v2v). Note that if $z_L$ is empty, then there is no variable to resolve and this failure cannot be rescued (unless the model admits possible errors in the input). However, when $z_L$ is non-empty, there is a possibility that it can be rescued and in the following operations we minimize the number of resolved variables to do so:

Lemma 4

1. If $Z = X \cap _k Y$, then $Z_p = k$.

2. If $Z = X \setminus Y$, then $Z_p = X_p - Y_p$.

Sketch of Proof: 1. Note that the union is over genotypes that each have a ploidy of k. Since the union operation maintains the ploidy, the result must hold.

2. If the operation is not a failure, then

$$\begin{array}{lll} Z_p &{}= &{}z_1 + z_0 + z_l \\ &{}= &{}(x_1 - y_1 - y_l') + (x_0 - y_0 - y_l') + (x_l' + y_l')\\ &{}= &{}(x_1 + x_0 + x_l') - (y_1 + y_0 + y_l')\\ &{}= &{}(x_1 + x_0 + x_l' + |x_L \cap y_L|) - (y_1 + y_0 + y_l'+ |x_L \cap y_L|)\\ &{}= &{}X_p - Y_p. \end{array}$$

$\Box $

Primitive Operation Illustrative Examples

Any negative value of the tuple is flagged as VOID. The relaxed intersection $X \cap _2 Y$ is carried out as follows.

1.2.1 Operations on Row $\langle X\rangle $

Definition 5 ($\langle X\rangle \cap \langle Y\rangle ,\langle X\rangle \setminus \langle Y\rangle $). If $\left\langle X \right\rangle $ and $\left\langle Y \right\rangle $ are two rows then The intersection and difference operations on $\langle X\rangle $ and $\langle Y\rangle $ are defined as:

$$\begin{aligned} \left\langle X \right\rangle \cap \left\langle Y \right\rangle= & {} \left\langle X\cap _k Y\right\rangle , { where }\,k = \min _j \left\{ (X^j\cap Y^j)_p \right\} , \end{aligned}$$

(5)

$$\begin{aligned} \left\langle X \right\rangle \setminus \left\langle Y \right\rangle= & {} \left\langle X\setminus Y\right\rangle . \end{aligned}$$

(6)

Executing the Row-Row Operation. Let $S_x$ be the sample haplotypes associated with ith row say $\left\langle X \right\rangle $ and $S_y$ be the sample haplotypes associated with $i'$th row say $\left\langle Y \right\rangle $. Note that the set S tracks multiplicities as well, i.e., multiple haplotypes of the same sample. In other words if $S = \{a (2), b\}$, this is interpreted as two haplotypes of sample a and 1 haplotype of sample b.

The row-row operation on $\langle X\rangle $ and $\langle Y\rangle $ is defined as follows.

Case I $X_p >1$, $Y_p >1$: The intersection or overlap operation between the two row results in the following three new rows (that replace the ith and $i'$th rows):
1. 1.
  $\left\langle Z \right\rangle \leftarrow \left\langle X \right\rangle \cap \left\langle Y \right\rangle $ with $S_z = S_x \cup S_y$ and $\left\langle Z \right\rangle _p = k$, where k is defined in Eq. 5.
2. 2.
  $\left\langle V \right\rangle \leftarrow \left\langle X \right\rangle \setminus \left\langle Z \right\rangle $ with $S_v = S_x$ and $\langle V \rangle _p = \langle X \rangle _p - k$.
3. 3.
  $\left\langle W \right\rangle \leftarrow \left\langle Y \right\rangle \setminus \left\langle Z \right\rangle $ with $S_w = S_y$ and $\langle W \rangle _p = \langle Y \rangle _p - k$.
Case II $X_p >1$, $Y_p =1$: The intersection or overlap operation between the two row results in the following new row (that replace the ith row):
1. 1.
  $\left\langle V \right\rangle \leftarrow \left\langle X \right\rangle \setminus \left\langle Y \right\rangle $ with $S_v = S_x$ and $\langle V \rangle _p = \langle X \rangle _p - 1$.
2. 2.
  $S_y \leftarrow S_y \cup S_x$.

Row-Row Operation FAILURE. Let X and Y be two genotypes. Then $X \cap Y$ is successful, if and only if the following hold.

Case I $X_p >1$, $Y_p >1$: None of the following result in EMPTY/VOID: (1) $Z = X \cap Y$ (2) $X \setminus Z$ and (3) $Y \setminus Z$.
Case II $X_p >1$, $Y_p =1$: $X \setminus Y$ is not VOID.

Use “$\cap _f$” instead of “$\cap $” and “$\setminus _f$” instead of “$\setminus $” for the genotype pair when there is EMPTY or VOID result.

Empirical Lemmas. Let n be the number of samples and m the number of SNPs.

Lemma 5

Accuracy of the algorithm improves with increase in n and m.

Lemma 6

For a given fraction of heterozygous alleles and m, the value of n can be estimated where the accuracy of reconstruction saturates.

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Parida, L., Utro, F. (2020). Simultaneous Phasing of Multiple Polyploids. In: Raposo, M., Ribeiro, P., Sério, S., Staiano, A., Ciaramella, A. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2018. Lecture Notes in Computer Science(), vol 11925. Springer, Cham. https://doi.org/10.1007/978-3-030-34585-3_5

Download citation

DOI: https://doi.org/10.1007/978-3-030-34585-3_5
Published: 23 January 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34584-6
Online ISBN: 978-3-030-34585-3
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics