Abstract
We address the problem of phasing polyploids specifically with polyploidy larger than two. We consider the scenario where the input is the genotype of samples along a genic chromosomal segment. In this setting, instead of NGS reads of the segments of a sample, genotype data from multiple individuals is available for simultaneous phasing. For this mathematically interesting problem, with application in plant genomics, we design and test two algorithms under a parsimony model. The first is a linear time greedy algorithm and the second is a more carefully crafted algebraic algorithm. We show that both the methods work reasonably well (with accuracy on an average larger than 80%). The former is very time-efficient and the latter improves the accuracy further.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Aguiar, D., Istrail, S.: Haplotype assembly in polyploid genomes and identical by descent shared tracts. Bioinformatics 29(13), i352–i360 (2013). https://doi.org/10.1093/bioinformatics/btt213. http://dx.doi.org/10.1093/bioinformatics/btt213
Browning, S., Browning, B.: Haplotype phasing: existing methods and new developments. Nat. Rev. Genet. 12, 703 (2011)
Chaisson, M.J., Mukherjee, S., Kannan, S., Eichler, E.E.: Resolving multicopy duplications de novo using polyploid phasing. In: Sahinalp, S.C. (ed.) RECOMB 2017. LNCS, vol. 10229, pp. 117–133. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-56970-3_8
Halldórsson, B.V., Bafna, V., Edwards, N., Lippert, R., Yooseph, S., Istrail, S.: A survey of computational methods for determining haplotypes. In: Istrail, S., Waterman, M., Clark, A. (eds.) RSNPsH 2002. LNCS, vol. 2983, pp. 26–47. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24719-7_3
He, D., Saha, S., Finkers, R., Parida, L.: Efficient algorithms for polyploid haplotype phasing. BMC Genom. 19(2), 110 (2018)
Motazedi, E., Finkers, R., Maliepaard, C., de Ridder, D.: Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study. Brief. Bioinform. 19(3), 387–403 (2018)
Siragusa, E., Haiminen, N., Utro, F., Parida, L.: Linear time algorithms to construct populations fitting multiple constraint distributions at genomic scales. IEEE/ACM Trans. Comput. Biol. Bioinform. 16, 1132–1142 (2018)
Siragusa, E., Haiminen, N., Finkers, R., Visser, R., Parida, L.: Haplotype assembly of autotetraploid potato using integer linear programing. Bioinformatics (2019). https://doi.org/10.1093/bioinformatics/btz060
Utro, F., et al.: iXora: exact haplotype inferencing and trait association. BMC Genet. 14(1), 48 (2013)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
6 Supplement
6 Supplement
1.1 6.1 On the Greedy Method
The Greedy Algorithm: Concrete Examples. Refer to Fig. 2 for the example input matrix G and the trie \(\mathcal T\) that is constructed. The nodes in the Trie are labeled by the column identifier on top and then following vertically down from that column. For example, \(v_{5,3}\) refers to the third node from top along the column marked 5. It is labeled by SNP 0 (i.e., hollow circle) and the cardinality of the label set is 4.
Greedy Min Set cover algorithm:
-
1.
Associate a Weight to the columns by the number of rows they cover (each multiplied by the number in that row).
-
2.
Sort the columns by this weight.
-
3.
Solution: Traverse down the sorted list in descending order till all rows are covered.
A node is untouchable if and only if all its ancestral nodes are produced by homozygosity. Every other node is touchable.
Priority to fix the matrix, after the 1’s have been assigned:
-
1.
Eliminate singletons.
-
(a)
Mark a singleton column as 1 if you need to remove the 1 and −1 if you need to remove a 0.
-
(b)
Then, search for pairs of −1 marked columns and 1 marked columns where the exchange can happen, i.e., there exists at least one row in the matrix such that both are not marked X for these 2 columns. Make the exchange.
-
(c)
If columns are still marked; then check if they can be moved without generating new singletons.
-
(d)
If all fails then leave the singletons as they are.
-
(a)
-
2.
Reduce gaps between siblings.
-
(a)
Mark a column as 1 if you need to remove the 1 to get a balance. Mark it as −1 if you need to move the 0 to balance it. A balanced column is marked 0.
-
(b)
Then, search for pairs of −1 marked columns and 1 marked columns where the exchange can happen, i.e., there exists at least one row in the matrix such that both are not marked X for these 2 columns. Make the exchange.
-
(a)
The Heuristics proposed are: (Heuristic I): Using set cover (explained earlier); (Heuristic II) controlling the MAF allele (1’s) along each path of the tree or haplotype; (Heuristic III (backtracking)) Collapsing two isomorphic sub-trees:
-
1.
they must have the same values, i.e., 0 or 1 and
-
2.
there must exist a column in the matrix where they can get aligned, i.e., collapsed.
and (Heuristic IV (Trie-shaking)) See Fig. 3.
1.2 6.2 On the Algebraic Method
Notation and Basic Definitions. Each element of the matrix is a genotype, say X.
Definition 1 (coded genotype X and \(x_1,x_0,x_l,x_L, X_p, \{X\}\) of X). Each genotype is equivalent to a 3-tuple (triple)
The implementation tracks the states of the variables as v or \(\bar{v}\) where \(v \in x_L\). For brevity, we skip the details. Some concrete illustrative examples of genotypes:
Definition 2 (VOID, empty genotype). A coded genotype X is empty when \(\{X\} = \emptyset \). X is VOID when \(x_1 < 0\) or \(x_0 <0\) holds.
Definition 3 (X \(\le \) Y). Let X and Y be genotypes. \(X \le Y \Leftrightarrow x_1\le y_1, x_0 \le y_0, x_l \le y_l\).
Lemma 3
For a genotype \(X \equiv (x_1,x_0,x_l)\) the following hold:
-
1.
\(|X| = x_l + 1.\)
-
2.
\(X \subseteq Y \Leftrightarrow (X_p=Y_p) \hbox { AND } y_1 \le x_1 \le x_1 + x_l \le y_1 + y_l.\)
Sketch of Proof: 2. Since \(X_p=Y_p\), it is adequate to base the arguments only on the number of 1’s in X and Y. The possible number of 1’s in X is in the interval \(\left[ x_1,x_1+x_l\right] \) and similarly in Y. So if \(X\subseteq Y\), then \(\left[ x_1,x_1+x_l\right] \) is contained in \(\left[ y_1,y_1+y_l\right] \) and vice-versa, leading to the above. \(\Box \)
Definition 4 (\(\langle X \rangle \), ploidy \(\langle X \rangle _p\)). \(\left\langle X \right\rangle \) is defined to an ordered finite list of coded genotypes \(X^1, X^2, \ldots , X^j \ldots \) with the same ploidy k. Then k is defined to be \(\left\langle X \right\rangle _p\), the ploidy of the \(\langle X \rangle \).
Algebra of Genotypes
Resolving Variables. Two randomized procedures variable-to-constant (v2c) and variable-to-variable (v2v) are defined below. Also, a composition of these two primitive operations in resVar() on two coded genotypes.
Primitive Genotype Operations. When X and Y are two given coded genotypes, Z is produced based on the operations as follows:
VOID/Empty Genotypes. When a primitive operation fails, i.e., either results in an empty genotype \(Z \equiv (0,0,\emptyset )\) or at least one of \(z_0\), \(z_1\) is negative, then we resolve some of the variables, either by assigning explicit 1 or 0 (v2c) or assigning it to other variables (v2v). Note that if \(z_L\) is empty, then there is no variable to resolve and this failure cannot be rescued (unless the model admits possible errors in the input). However, when \(z_L\) is non-empty, there is a possibility that it can be rescued and in the following operations we minimize the number of resolved variables to do so:
Lemma 4
1. If \(Z = X \cap _k Y\), then \(Z_p = k\).
2. If \(Z = X \setminus Y\), then \(Z_p = X_p - Y_p\).
Sketch of Proof: 1. Note that the union is over genotypes that each have a ploidy of k. Since the union operation maintains the ploidy, the result must hold.
2. If the operation is not a failure, then
\(\Box \)
Primitive Operation Illustrative Examples
Any negative value of the tuple is flagged as VOID. The relaxed intersection \(X \cap _2 Y\) is carried out as follows.
1.2.1 Operations on Row \(\langle X\rangle \)
Definition 5 (\(\langle X\rangle \cap \langle Y\rangle ,\langle X\rangle \setminus \langle Y\rangle \)). If \(\left\langle X \right\rangle \) and \(\left\langle Y \right\rangle \) are two rows then The intersection and difference operations on \(\langle X\rangle \) and \(\langle Y\rangle \) are defined as:
Executing the Row-Row Operation. Let \(S_x\) be the sample haplotypes associated with ith row say \(\left\langle X \right\rangle \) and \(S_y\) be the sample haplotypes associated with \(i'\)th row say \(\left\langle Y \right\rangle \). Note that the set S tracks multiplicities as well, i.e., multiple haplotypes of the same sample. In other words if \(S = \{a (2), b\}\), this is interpreted as two haplotypes of sample a and 1 haplotype of sample b.
The row-row operation on \(\langle X\rangle \) and \(\langle Y\rangle \) is defined as follows.
-
Case I \(X_p >1\), \(Y_p >1\): The intersection or overlap operation between the two row results in the following three new rows (that replace the ith and \(i'\)th rows):
-
1.
\(\left\langle Z \right\rangle \leftarrow \left\langle X \right\rangle \cap \left\langle Y \right\rangle \) with \(S_z = S_x \cup S_y\) and \(\left\langle Z \right\rangle _p = k\), where k is defined in Eq. 5.
-
2.
\(\left\langle V \right\rangle \leftarrow \left\langle X \right\rangle \setminus \left\langle Z \right\rangle \) with \(S_v = S_x\) and \(\langle V \rangle _p = \langle X \rangle _p - k\).
-
3.
\(\left\langle W \right\rangle \leftarrow \left\langle Y \right\rangle \setminus \left\langle Z \right\rangle \) with \(S_w = S_y\) and \(\langle W \rangle _p = \langle Y \rangle _p - k\).
-
1.
-
Case II \(X_p >1\), \(Y_p =1\): The intersection or overlap operation between the two row results in the following new row (that replace the ith row):
-
1.
\(\left\langle V \right\rangle \leftarrow \left\langle X \right\rangle \setminus \left\langle Y \right\rangle \) with \(S_v = S_x\) and \(\langle V \rangle _p = \langle X \rangle _p - 1\).
-
2.
\(S_y \leftarrow S_y \cup S_x\).
-
1.
Row-Row Operation FAILURE. Let X and Y be two genotypes. Then \(X \cap Y\) is successful, if and only if the following hold.
-
Case I \(X_p >1\), \(Y_p >1\): None of the following result in EMPTY/VOID: (1) \(Z = X \cap Y\) (2) \(X \setminus Z\) and (3) \(Y \setminus Z\).
-
Case II \(X_p >1\), \(Y_p =1\): \(X \setminus Y\) is not VOID.
Use “\(\cap _f\)” instead of “\(\cap \)” and “\(\setminus _f\)” instead of “\(\setminus \)” for the genotype pair when there is EMPTY or VOID result.
Empirical Lemmas. Let n be the number of samples and m the number of SNPs.
Lemma 5
Accuracy of the algorithm improves with increase in n and m.
Lemma 6
For a given fraction of heterozygous alleles and m, the value of n can be estimated where the accuracy of reconstruction saturates.
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Parida, L., Utro, F. (2020). Simultaneous Phasing of Multiple Polyploids. In: Raposo, M., Ribeiro, P., Sério, S., Staiano, A., Ciaramella, A. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2018. Lecture Notes in Computer Science(), vol 11925. Springer, Cham. https://doi.org/10.1007/978-3-030-34585-3_5
Download citation
DOI: https://doi.org/10.1007/978-3-030-34585-3_5
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34584-6
Online ISBN: 978-3-030-34585-3
eBook Packages: Computer ScienceComputer Science (R0)