1. Introduction
A deletion channel is one of the most important channels in the history of communication. The channel has a deletion error where the symbol is being removed without knowing the position of it. Unlike many other channels where the positions of symbols remain the same, the decoder needs to specify the position of each symbol, which is called a synchronization issue. Mainly due to this issue, the deletion channel is surprisingly hard to analyze.
There are several different mathematical problem formulations of the deletion channel. One natural way is the probabilistic approach where the deletion occurs in i.i.d. manner with probability
p. Kanoria and Montanari provided an approximation of the channel capacity of binary deletion channel when
[
1]. However, the channel capacity is unknown in this setting even in binary case.
Alternatively, we can define the problem in algebraic way. We assume that there will be at most
k deletion errors while transmitting
n number of symbols. The question is the maximum size of the deletion code that can correct any
k deletion errors. There is a nice survey paper by Sloane [
2], and it is easy to see that the problem is extremely challenging even in single deletion case.
Although the problem is still open, there has been some progress in this algebraic setting. In binary case, Varshamov and Tenengolts proposed a simple code (VT-code) construction that corrects any single deletion error [
3]. It is asymptotically optimum when
n grows while the number of deletions
is fixed. VT-code is known to be optimum when
and conjectured that it is optimum for all
n. Tenengolts generalized VT-code to a non-binary version [
4]. Gabrys and Sala proposed a code that can correct two deletions [
5]. Sima and Bruck also generalized VT-code that can correct
k deletions [
6]. Both results [
5,
6] show
where
is the deletion code. This implies that they achieved order optimal redundancy where the optimality is shown by Levenshtein [
7]. However, their works do not specify the coefficient of
term while
code satisfies
in the single deletion case.
VT-code naturally suggests a lower bound of the size of the optimum single deletion code in binary case, but the upper bounds are rarely provided. Levenshtein found an analytic formula for upper bounds [
8]. Kulkarni and Kiyavashi proposed better upper bounds for any number of deletions and any size of alphabet [
9]. Cullina and Kiyavashi refined Levenshtein’s upper bound [
10].
There are attempts to find nonasymptotic upper bounds numerically for small
n. The most popular approach is using graph theory, which is based on the fact that the optimum single deletion code problem has an equivalent formulation of maximum independent set problem. In this formulation, the size of the largest single deletion code is equal to the size of the maximum independent set. A well-known upper bound of the maximum independent set is Lovász theta number, which can be obtained by solving Semidefinite Programming (SDP) relaxed problem [
11]. Upper bounds from Lovász theta number are tight for
; in other words, VT-code is optimum when the block length
n is smaller than 8. Butenko et al. provided a numerical approach to bound the size of maximum independent set [
12], and showed that VT-code is optimum when the block length is
. Kulkarni et al. proposed a bounding technique using Linear Programming (LP) relaxation of the graph problem [
13]. This bound is weaker than that of Lovász’s, but the complexity is lower so that we can compute the bounds for larger block lengths
n. For recent progress, we refer to Sloane’s webpage [
14], which also mentions that VT-code is optimum when the block length is
.
In this paper, we propose a new method that provides an improved nonasymptotic upper bound. We partially relax the graph problem, and obtain Mixed Integer Linear Program (MILP) where some of variables are relaxed but some of them are not. MILP has reasonably low complexity and provides a better bound. For example, when the block length is , we show the size of optimum single deletion code is less than or equal to 173, where the SDP provides a bound of 174.
We also find an equivalent formulation of the conjecture that “VT-code is optimum”. This equivalent form consists of several small sub-conjectures where VT-code is optimum if and only if all those sub-conjectures are true. The advantage of this equivalent conjecture is that we can numerically verify the sub-conjectures using (Mixed) Integer Programming. Note that we can disprove the original conjecture if any of sub-conjectures are not true. We numerically verified some of sub-conjectures for . This does not prove or disprove the optimality of VT-code, but it supports the original conjecture.
The remainder of the paper is organized as follows. In
Section 2, we revisit some of the known results of single deletion codes.
Section 3 presents the new bounding technique using Mixed Integer Linear Programming with improved upper bound. In
Section 4, we present an equivalent formula of the conjecture as well as some numerical supports. Finally, we conclude in
Section 5.
Notation
Let be a set of binary alphabet. We denote a n-tuple using super script, i.e., . In addition, let be an all zero vector. For clarity, we use for an n-dimensional binary vector, and for an -dimensional binary vector. We also use concatenation of binary vectors. For example, is an -dimensional binary vector where the last bit is 0.
2. Preliminaries
2.1. Single Deletion Code
Define a deletion ball
, which is a collection of
-dimensional binary vectors that are one-bit deleted version of
. We can define an insertion ball
in a similar manner.
Definition 1. For a positive integer n, a set of binary vectors is a single deletion code if for all in .
The definition implies that the single deletion code can always correct a single deletion error. The following lemma shows that the single deletion code can also correct a single insertion error.
Lemma 1. ([13], Lemma 2.1) A set of binary vectors is a single deletion code if and only if for all in . The above lemma is simply from the fact that if and only if .
2.2. Varshamov–Tenengolts Codes
Let
be a function that computes the “VT-weights” of an
n-dimensional binary vector.
Note that we do not take any modulo operations, and therefore can take value from 0 to .
For
, VT-code [
3] is defined by
which is a single deletion code [
7,
15]. Levenshtein showed that
is perfect for all
[
16]. In other words,
For any
a, we have
where the first inequality is from Varshamov [
3] and the second inequality is from Ginzburg [
17]. Thus, the size of the optimum single deletion code is lower bounded by
. An analytic formula
is given in [
2]:
where
is Euler’s totient function.
Borchers showed that
is optimum single deletion code for
[
14]. The optimality of VT-code is still open for
. In the case of
, the size of VT-code is
but the best known upper bound of the largest single deletion code is
.
2.3. Maximum Independent Set Approach
Consider a graph where all binary vectors are nodes. There exists an edge between two nodes and if and only if . Then, the optimum single deletion code corresponds to the maximum independent set. Note that the Maximum Independent Set (MIS) problem is NP-complete.
There are reduction rules in graph theory that provide an equivalent graph problem while reducing the size of the graph. Isolated vertex removal technique is useful in our case. An isolated vertex is a node for which its neighborhood forms a clique. For example, in our case, the neighborhood set of node
is
, and
forms a clique. This implies that the node
is an isolated vertex. Butenko et al. showed that there exists a maximum independent set that contains all isolated vertices [
12]. Thus, there exists an optimum single deletion code that contains both
and
.
Note that it is still infeasible to solve our MIS problem with state of the art algorithms [
18,
19] when
. Segundo et al. proposed a variation of BBMC [
20] and found the maximum independent set of the graph induced from two deletion (
) channel [
14]. However, the graph induced from single deletion channel (
) is more challenging to find the maximum independent set.
2.4. LP Relaxation
For simplicity, we define several functions and new notations. Let , and be the set that contains all nonnegative integers smaller than or equal to . We further let be the function that converts the decimal number to the binary vector (e.g., ). Note that we drop n if it is clear from the context, i.e., .
In the above section, we define a graph where there exists an edge between and if their deletion balls share an element. Instead, we define an equivalent graph where the set of nodes are . The set of edges is derived naturally where if and only if there is an edge between and .
Our goal is to find an independent set
that has maximum number of elements. For
, let
be binary variables where
if
and
if
. If there exists an edge between
i and
j, then the independent set
cannot contain both
i and
j. Thus, the following Integer Programming (IP) problem is equivalent to the maximum independent set problem.
Clearly, it is an NP-hard problem, which is extremely challenging to solve. Instead, we can relax it to an easier problem, and bounding the solution of the IP. One way of doing it is classical Linear Programming (LP) relaxation, which is given by
LP relaxation allows the variable
to take a value between 0 and 1, and the solution of relaxed problem provides an upper bound of the original IP. However, this gives a trivial solution in our case, which is
for all
. The maximum value of the objective function is
, and it is much larger than the known upper bound
[
9].
Kulkarni et al. proposed another LP relaxation [
9]. The idea is that the independent set can contain at most one node from each clique. For any
-dimensional vector
, the set of nodes
forms a clique since
for all
. Thus, the independent set
can take at most one node from
, and we have
clique constraints . This implies another LP relaxation which provides a tighter upper bound of the original IP problem.
On the other hand, Lovász proposed an SDP-relaxation of the Maximum Independent Set (MIS) problem [
11]. The solution of SDP problem is called Lovász theta number, which provides a tighter bound of MIS problem.
The following table presents the upper bound from the above relaxations as well as
which is a lower bound.
| 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| 10 | 16 | 30 | 52 | 94 | 172 | 316 |
Lovász | 10.00 | 16.84 | 30.00 | 53.03 | 95.98 | 174.73 | - |
LP | 10.25 | 17.15 | 30.32 | 53.56 | 96.52 | 175.19 | 321.27 |
Note that the complexity of LP relaxed problem is low, and we can get bounds for as well. For example, the size of maximum deletion code is smaller than or equal to 593 when . However, due to the complexity issue, we are not able to compute Lovász theta number for . For example, it took more than 24 h on our machine when .
3. Mixed Integer Linear Programming
LP is faster than the original IP problem; however it is hard to parallelize. On the other hand, IP inherently uses branch-and-bound technique which can be parallelized, but still intractable with current multi-thread processors. In this section, we propose a Mixed Integer Linear Programming (MILP) problem, which is in between LP problem and the original IP problem.
3.1. Main Results
In the LP relaxation, all variables are relaxed as described in
Section 2.4. Instead, we relax specific variables only, which provides a semi-relaxed optimization problem. More precisely, we design
and keep
to be binary variable for
while other variables are relaxed as in LP relaxation. This provides a Mixed Integer Programming (MIP) problem where variables are either integer or real numbers:
The last two constraints are because there exists a maximum independent set that contains both
and
from
Section 2.3.
MIP is generally as computationally demanding as IP problems. However, since all constraints are linear, the above optimization problem is a Mixed Integer Linear Programming (MILP) problem. MILP can be solved in reasonable amount of time if we carefully design the set . Let denote the above MILP problem.
If , then is equivalent to LP. On the other extreme, if , then is equivalent to the original IP. If is nontrivial subset of , then the solution of provides a tighter upper bound of maximum independent set problem while having low complexity.
Clearly, we prefer smaller because of complexity. Thus, the goal is designing in smart way. The main idea is increasing the size of in greedy manner. We start from fully relaxed LP problem, i.e., , and add elements one by one under certain criterion.
More precisely, we solve
in each iteration and add a node
i to
based on the following rule. Let
be the function which indicates the number of
clique constraints that contains
i. Since
if and only if
, we have
. Thus, the variable
affects the
number of clique constraints. Furthermore, if
is large, it restricts other variables in
clique constraints more. Thus, we measure the amount of “impact” of variable
by
. Finally, the algorithm finds the node
i that maximizes
and add it to
. This procedure is described in Algorithm 1.
Algorithm 1 Sequential MILP. |
Input: target threshold |
Output: new bound T |
procedure SEQMILP() |
Set , and |
do |
Solve and let T be the objective function value and X be the solution |
|
|
while and |
return T |
end procedure |
The above algorithm takes a target threshold as an input which can be a previously known upper bound of the original IP. In each iteration, it computes the objective function value T of . Whenever T is smaller than the target bound , then the algorithm halts and we get a new bound T of the size of maximum independent set. For example, suppose we let and the above program halts with , then we have a new upper bound that the size of the maximum single deletion code is strictly smaller than . In such case, we can claim that is the optimum single deletion code. On the other hand, suppose the program ends with which means . In such case, the return value T is the size of the maximum independent set because it is the objective function value of the original IP. Note that the size of is increased by 1 in each iteration, and therefore there will be at most N iterations.
Note that the way of choosing
is not an optimum way. However, it is an effective way, as shown in
Section 3.2.
3.2. Experiments
We implemented Python code using PULP python package [
21] with cbc solver [
22]. Note that cbc solver supports multiple threads. For our experiments, we used a machine with AMD Threadripper 1950X processor and 64 GB of RAM. The operating system was Ubuntu 18.04 LTS.
In the case of
, for any single deletion code
, we have an upper bound
from Lovász theta number. LP provides a weaker upper bound
. However, the proposed MILP problem
provides
when
Note that we convert the binary vectors to decimal numbers for simplicity (e.g., ). This implies which is the same upper bound from Lovász number, but obtained more efficiently.
More interestingly,
provides
when
Note that it took 27,096 s to solve the last (88-th)
with 28 threads. This provides an improved upper bound
. Thus, we have the following Corollary.
Corollary 1. For , the size of single deletion code is smaller than or equal to 173.
In case of
, for any single deletion code
, an upper bound from LP implies
. However, the proposed method
provides
when
Thus, we have an improved upper bound .
3.3. Connection to Metaheuristics
Although MILP problem has lower complexity than the original IP problem, MILP often encounter the computational issue as well. This is because the most state-of-the-art MILP solvers such as CPLEX [
23] are based on branch-and-bound techniques, and it often has exponentially large search space. Thus, the smartly fixing the variable to binary is necessary as we presented in the previous section.
Similar heuristic algorithms appear in various other computationally challenging (Mixed) Integer Programming problems, such as lot sizing problem [
24,
25] and connected facility location problem [
26]. This is commonly referred to as hybrid metaheuristics. For example, Wilbaut and Hanafi proposed iterative idea to solve MIP problems [
27]. The authors applied this idea to IP problems such as knapsack [
28]. The idea is iteratively solving the LP relaxed problem to get an upper bound and reduced problem with fixing variables to get a lower bound until the lower and upper bounds match. In this paper, we do not fix the value of variable, but remove the relaxed constraints (so that some variables remain binary). For other works in metaheuristics, we refer the interested reader to the nice survey paper by Blum et al. [
29].
4. Equivalent Conjecture
The above semi-relaxation provides an improved upper bound; however, the running time is still an issue. In this section, we provide smaller optimization problems that can provide insights for the optimality of code.
4.1. VT-Sum Based Partition
For
, let
be the set of binary vectors whose VT-weights are
i.
Clearly, is a partition of , and is a partition of .
The following lemma provides useful properties of .
Lemma 2. Suppose n and are positive integers. Then,
- 1.
.
- 2.
If , then .
- 3.
If , then .
Proof. There exists a one-to-one correspondence between and which is an element-wise binary complement.
For any element , it is easy to show that . If , then , and therefore .
For any , it is clear that if and only if . Thus, the third property is a direct consequence of the second property.
□
Remark 1. If we view the original problem as a maximum independent set problem, the above partition can be useful since:
There are no internal edges in for all i.
There are no edges between and if .
This is not exactly a “partite” graph but has a similar flavor of it, and there might be an efficient way of finding a maximum independent set.
4.2. Equivalent Conjecture
For a given single deletion code
, we also define a similar partition of
. For
, we let
. Then, we are ready to state our first lemma which is a building block of the main conjecture.
Lemma 3. Let n be a positive integer, and be a single deletion code. If there exists an integer such thatthen is not an optimum single deletion code. Proof. Suppose there exists
k such that
. Then, we can define a new single deletion code
Clearly,
is a single deletion code since
for all
and
. Then, the size of the new single deletion code is given by
This concludes the proof. □
By the first property of Lemma 2, we have for odd n. Thus, we have the following lemma as well, which is essentially the same as Lemma 3.
Lemma 4. Let n be an odd positive integer, and be a single deletion code. If there exists an integer such thatthen is not an optimum single deletion code. Proof. Suppose
n is odd and there exists
k such that
. Then, we can define a new single deletion code
Again, the size of the new code is given by
This concludes the proof. □
Conjecture 1. - 1.
For even n, any single deletion codesatisfiesfor all.
- 2.
For odd n, any single deletion codesatisfiesfor all.
The following theorem tells us that the above conjecture is equivalent to the original conjecture that “VT-code is optimum”.
Theorem 1. code is an optimum single deletion code if and only if Conjecture 1 holds.
Proof. Note that the “only if” part (for both even and odd n) directly comes from Lemmas 3 and 4.
Suppose
n is even and
for all
k. If we let
k be
, then we have
Suppose
n is odd and
for all
k. If we let
k be
, then we have
□
4.3. Special Case of
Theorem 1 implies that if we can find any code that satisfies the inequality in Equation (
1) or the inequality in Equation (
2), then VT code is not optimum. Thus, the plausible strategy to disprove the conjecture is finding a counterexample for Conjecture 1. However, in this section, we show the inequality in Equation (
1) is true for all
n when
.
We define two functions that are useful in the remaining sections. The following lemma is for the definition of the first function.
Lemma 5. For any n-dimensional binary vector , Proof. It is well-known that the size of
is
for all
[
30]. Clearly, a single insertion can only increase a VT-sum, in other words, for
,
On the other hand, a single insertion can increase a VT-sum by at most
by adding “1” to the last position. More precisely, for
, we have
Note that
are all distinct, and therefore
□
First, we define a map g which maps to where and . More precisely, we have , where . Since consists of consecutive numbers, we can determine uniquely when . On the other hand, if , both and are possible candidates for . In this case, we set .
Define another function , where . The function h simply deletes the last bit. If we combine two functions, we get . Then, the function f satisfies the following properties.
Lemma 6. - 1.
For all , we have .
- 2.
If , then .
- 3.
For all , we have .
Proof. Let
, then we have
If , then we have , and therefore .
If we let
, then we have
Recall that we have
. If
, then
, which implies
. On the other hand, if
, we have
. In such case,
achieves the maximum value when
and
. In other words,
On the other hand, it achieves the minimum value when
and
. In other words,
This concludes the proof.
□
Then, we have a theorem that supports Conjecture 1.
Theorem 2. For positive integer n, any single deletion code satisfies the following inequality. Proof. Let
. From the above lemma, we have
for all
. Since
, it is clear that
should be either 0 or
. In other words,
Suppose there exists such that , then is either 0 or . First, if , then . In this case, the weights (number of ones) of and are at most one, i.e., . This implies that which is a contradiction. On the other hand, consider the case where . Since and , we have . This implies that has at least one element, and therefore . This is a contradiction.
Thus,
is an injective function, and therefore
□
4.4. Integer Programming
As mentioned above, if we can find a single deletion code that satisfies the inequality in Equation (
1) or the inequality in Equation (
2), that immediately disproves the optimality of VT code. Since the size of the problem is relatively smaller when
k is small, we can numerically solve the Integer Programming (IP) problem without relaxation.
For example, we can check whether the inequality in Equation (
1) holds or not for some fixed
n and
k, using the following optimization problem.
For some
n and
k, if the maximum value of the objective function is smaller than or equal to
then it means that no single deletion code satisfies the inequality in Equation (
1).
Note that the number of variables are
, which is strictly smaller than
. We solve the above optimization numerically for various
n and
k. For all combinations that we tried, the maximum value of the objective function is
. The following table shows all combinations of
that we were able to check in a reasonable amount of time.
| 11 | 12 | 13 | 14 | 15 | 16 |
| ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| ✓ | ✓ | ✓ | ✓ | | |
Entries without check mark are combinations that we could not check due to the running time.
Similarly, we numerically verify the inequality in Equation (
2) from Lemma 4. For all combinations of
that we tried, we were not able to find any counterexample. The following table shows all combinations that we were able to check.
| 11 | 13 | 15 |
| ✓ | ✓ | ✓ |
| ✓ | ✓ | ✓ |
| ✓ | ✓ | ✓ |