On maximal almost balanced non-overlapping codes and non-overlapping codes with restricted run-lengths

86 Accesses
1 Altmetric
Explore all metrics

Abstract

This paper concerns non-overlapping codes, block codes motivated by synchronisation and DNA-based storage applications. Most existing constructions of these codes do not account for the restrictions posed by the physical properties of communication channels. If undesired sequences are not avoided, the system using the encoding may start behaving incorrectly. Hence, we aim to characterise all non-overlapping codes satisfying two additional constraints. For the first constraint, where approximately half of the letters in each word are positive, we derive necessary and sufficient conditions for the code’s non-expandability and improve known bounds on its maximum size. We also determine exact values for the maximum sizes of polarity-balanced non-overlapping codes having small block and alphabet sizes. For the other constraint, where long sequences of consecutive equal symbols lead to undesired behaviour, we derive bounds and constructions of constrained non-overlapping codes. Moreover, we provide constructions of non-overlapping codes that satisfy both constraints and analyse the sizes of the obtained codes.

In search of maximum non-overlapping codes

Article Open access 08 January 2024

On the maximum size of variable-length non-overlapping codes

Article 25 June 2024

Anticode-based locally repairable codes with high availability

Article 20 April 2017

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Non-overlapping codes have been studied for sixty years due to their ability to facilitate synchronisation. Since the physical characteristics of the available transmission channel may not be compatible with every code, constrained coding is often used to incorporate the additional requirements. In classical storage and communication systems, balanced codes and codes with restricted run-lengths are used to avoid voltage imbalance and charge accumulation between connected components (Immink 2004). A more novel application of constrained non-overlapping codes, introduced by Yazdi et al. (2018), is DNA-based storage, where several constraints from biochemical properties and methods should be considered in addition to code balance and restriction of run-lengths.

Although many constructions of unconstrained non-overlapping codes exist [see, for example, Levenshtein (1970), Chee et al. (2013), Blackburn (2015), Barcucci et al. (2016), Wang and Wang (2022), Stanovnik et al. (2024)], only a few methods to obtain constrained codes have been developed by Bilotta et al. (2012); Levy and Yaakobi (2018), and Yazdi et al. (2018). None of them attain the theoretical upper bound on the constrained code size. Hence, whether they are maximum cannot be concluded from current knowledge. One way of approaching this problem is by characterising the set of all non-overlapping codes satisfying a specified constraint. Then, the properties of this set can be studied to determine the optimal size.

This paper studies non-overlapping codes satisfying the balance and the run-length constraint. The first deals with the number of occurrences of positive and negative symbols in a word that should not differ by too much, and the second deals with avoiding long sequences of consecutive equal symbols. For each constraint, we first derive bounds on the code size and then determine the set of non-overlapping codes satisfying the constraint that includes all other such codes as subsets. With a computer search, we also determine the optimal code sizes for non-overlapping codes of even block sizes where the number of positive and negative symbols is equal.

The rest of the paper is organised as follows: we introduce the necessary notation and recall some known results in Sect. 2. Section 3 characterises the maximal $\epsilon $-balanced non-overlapping codes and determines an upper bound on their size. Section 4 deals with non-overlapping codes avoiding long runs, and Sect. 5 with codes satisfying both constraints. Since our constructions are given in a manner that requires exponential space, we provide an enumerative encoding algorithm in Sect. 6. Section 7 concludes the paper and suggests future research directions.

2 Preliminaries

2.1 Notation and definitions

This paper will use the following notation. The set of integers $1,\dots , n$ is denoted by $\left[ n\right] $, and the set of integers $i, \dots , n$ by $\left[ i,n\right] $. We use the Macaulay brackets to denote the ramp function

$$\begin{aligned}\langle k \rangle {:}{=}{\left\{ \begin{array}{ll} 0 & \text {if } k < 0\\ k & \text {otherwise}, \end{array}\right. }\end{aligned}$$

and $\mu (d)$ to denote the Möbius function

$$\begin{aligned} \mu (n) = {\left\{ \begin{array}{ll} (-1)^t & \text {if } n = p_1p_2\cdots p_t \text { for distinct primes } p_i, \\ 1 & \text {if } n = 1, \\ 0 & \text {if } n \text { is divisible by a square.} \end{array}\right. } \end{aligned}$$

Recall that a weak composition of a positive integer n with k parts is a k-tuple of non-negative integers whose sum equals n. A composition of a positive integer n with k parts is a k-tuple of positive integers whose sum equals n. The number of all compositions of n into k parts equals $\left( {\begin{array}{c}n-1\\ k-1\end{array}}\right) $, and the number of compositions of n into k parts such that no part exceeds l, herein denoted by c(n, k, l), is given by the formula in Eq. (1). For its proof , we refer the reader to Flajolet and Sedgewick (2009).

$$\begin{aligned} c(n,k,l) =&\sum _{j = 0}^{\lfloor (n-k)/l \rfloor } (-1)^j \left( {\begin{array}{c}k\\ j\end{array}}\right) \left( {\begin{array}{c}n-lj-1\\ k-1\end{array}}\right) . \end{aligned}$$

(1)

For sets of words L and R, LR denotes the concatenation of sets L and R, i.e., the set of all words of the form lr, where $l \in L$ and $r \in R$. If x is a word, we write xR and Lx to denote the sets $\{x\}R$ and $L\{x\}$ respectively. $L^i$ denotes the concatenation of the set L with itself i times. The Kleene star denotes the smallest superset that is closed under concatenation and includes the empty set $L^* = \bigcup _{i \ge 0} L^i$, and the Kleene plus denotes the smallest superset that is closed under concatenation and does not include the empty set $L^+ = \bigcup _{i > 0} L^i$. A word $w \in \Sigma ^n$ has a period i if $w = (w_1\cdots w_i)^+$. The smallest number i satisfying this relation is called the least period, and words with the least period n are called primitive.

A balanced Dyck word is a binary word of an even length 2n composed of n zeroes and n ones such that none of its prefixes has more zeroes than ones. We denote the set of all balanced Dyck words of length n by $\mathcal {D}_{2n}$. It is well known that $D_{2n} = C_n$ , where $C_n$ is the nth Catalan number, $C_n = \frac{1}{n+1} \left( {\begin{array}{c}2n\\ n\end{array}}\right) $. A Smirnov word (also called a Carlitz word) is any word $w_1\cdots w_n$ with no consecutive equal letters, i.e., $w_i \ne w_{i+1}$ for $i \in \left[ n-1\right] $. If, in addition, its cyclic shifts are Smirnov, the word $w_1\cdots w_n$ is called cyclic Carlitz. The ordinary generating function of cyclic Carlitz words on $\Sigma $, determined by MacFie (2019), is

$$\begin{aligned} F_q(z) = q (q-1) \frac{z^2}{(z+1)(1 - (q-1)z)}. \end{aligned}$$

Whenever referring to a block code in $\Sigma ^n$, we assume that the integer $n \ge 2$ and $\Sigma $ is an alphabet of size $q \ge 2$. We partition the alphabet $\Sigma $ into two non-empty sets, the positive letters $\Sigma _P$ and the negative letters ${\Sigma _N}$, and denote the number of positive letters in a word $w \in \Sigma ^n$ by p(w). For an $\epsilon $ such that $0 \le \epsilon \le 0.5$, a word w is called $\epsilon $-balanced if $\lceil (0.5-\epsilon ) n \rceil \le p(w) \le \lfloor (0.5 + \epsilon ) n\rfloor $. A more vague term, almost balanced, is often used to describe words with p(w) close to n/2, but the exact definition varies between resources. A word that is 0-balanced is also called polarity-balanced.

Remark 1

When constructing codes over the DNA alphabet, one sets ${\Sigma }_P = \{C,G\}$ and ${\Sigma }_N = \{T,A\}$. In the binary case, ${\Sigma }_P = \{1\}$ and ${\Sigma }_N = \{0\}$.

Remark 2

In the definition of polarity-balanced words, one usually assumes that both n and q are even and $|\Sigma _P |= |\Sigma _N |= q/2$. However, our constructions work for any non-empty partition of $\Sigma $. In this sense, our definition differs from that of Weber et al. (2013). For odd values of q, they partition the alphabet into three parts: the set of positive symbols, the set of negative symbols, and the set with a zero symbol, so that $|\Sigma _P |= |\Sigma _N |= \lfloor q/2 \rfloor $ and the number of zero symbols in a codeword is not constrained. On the other hand, we assign the zero symbol either to $\Sigma _P$ or $\Sigma _N$.

A code $C \subseteq \Sigma ^n$ is called non-overlapping if no proper prefix of a word in C occurs as a suffix of any word in C. It is called $\epsilon $-balanced (polarity-balanced) if all of its codewords are $\epsilon $-balanced (polarity-balanced respectively). A code contained in a larger code of the same class is called expandable. Otherwise, it is called maximal. A code with the largest number of words among all codes in the same class is called maximum. We denote the size of a maximum non-overlapping code by S(q, n).

2.2 Constructions

To our knowledge, two constructions of polarity-balanced non-overlapping codes exist. The binary non-overlapping codes based on Dyck paths due to Bilotta et al. (2012), provided in Construction 1 (below), are polarity-balanced if their length is even and $(2n)^{-1}$ -balanced if their length n is odd. The sizes of the obtained codes equal

$$\begin{aligned} |D_n |= {\left\{ \begin{array}{ll} C_m & n = 2m+1, \\ \sum _{i \in \left[ m/2\right] } C_iC_{m-i} & n = 2m+2, m \text { even}, \\ \sum _{i \in \left[ (m+1)/2\right] } C_iC_{m-i} - C_{(m-1)/2}^2& n = 2m+2, m \text { odd}. \end{array}\right. } \end{aligned}$$

Construction 1

(Bilotta et al. 2012) The code

$$\begin{aligned} D_{2n+1} = \{1a: a\in \mathcal {D}_{2n}\} \end{aligned}$$

is a maximal non-overlapping code. If n is even, then the code

$$\begin{aligned} D_{2n+2} = \bigcup _{i\in \left[ 0, n/2\right] } \left\{ a1b0: a \in \mathcal {D}_{2i}, b \in \mathcal {D}_{2(n-i)}\right\} \end{aligned}$$

is a maximal non-overlapping code. If n is odd, then the code

$$\begin{aligned} D_{2n+2} = \bigcup {_{i\in \left[ 0,(n+1)/2\right] } } \left\{ a1b0: a \in \mathcal {D}_{2i}, b \in D_{2(n-i)}\right\} \setminus \left\{ {1a'01b'0} , a {'} ,b {' } \in \mathcal {D}_{n-1}\right\} \end{aligned}$$

is a maximal non-overlapping code.

Another construction of a polarity-balanced non-overlapping code based on the binary construction of unconstrained non-overlapping codes developed independently in Levenshtein (1970) and Chee et al. (2013), was presented by Levy and Yaakobi (2018). It is stated in Construction 2 (below). Note that these codes also contain no zero runs longer than k. The authors prove that the size of the obtained code is lower-bounded by $\left( 2^k+2k+1-n\right) \left( {\begin{array}{c}n-2k-2\\ n/2-2\end{array}}\right) $ but for many pairs of values of k and n this bound is negative. Thus, we determine an explicit formula to compute the code size in Proposition 1. Moreover, the authors show that a q-ary polarity-balanced non-overlapping code where $|{\Sigma }_P |= |{\Sigma }_N |$ has at most $\left( {\begin{array}{c}n\\ n/2\end{array}}\right) \left( \frac{q}{2}\right) ^n / n$ codewords.

Construction 2

(Levy and Yaakobi 2018) Let n and k be integers, $k \le n/2$. The set C(n, k) that contains all words of the form $0^k1c1$ where c is a binary word with exactly $n/2-2$ ones and no zero runs of length k is a polarity-balanced non-overlapping code.

Proposition 1

Let $n > 4$ and $1 \le k \le n/2$. The number of words in the code C(n, k) obtained from Construction 2 equals

$$\begin{aligned} \sum _{r = \lceil \frac{n}{2k}\rceil - 1}^{n/2-k} \sum _{i=-1}^{1} \sum _{j = 0}^{\lfloor (n/2-2k)/r\rfloor } (-1)^j(2-|i |)\left( {\begin{array}{c}n/2-3\\ r+i\end{array}}\right) \left( {\begin{array}{c}r\\ j\end{array}}\right) \left( {\begin{array}{c}n/2-k-(k-1)j-1\\ r-1\end{array}}\right) . \end{aligned}$$

Proof

The number of words in C(n, k) equals the number of binary words of length $n-k-2$ with exactly $n/2-2$ ones and no zero run of length k. Every binary word has one of the following four forms: $0^{\alpha _1}1^{\beta _1}\cdots 0^{\alpha _r}1^{\beta _r}$, $1^{\beta _1}0^{\alpha _1}\cdots 1^{\beta _r}0^{\alpha _r}$, $0^{\alpha _1}1^{\beta _1}\cdots 0^{\alpha _{r-1}}1^{\beta _{r-1}}0^{\alpha _r}$ or $1^{\beta _1}0^{\alpha _1}\cdots 1^{\beta _r}0^{\alpha _r}1^{\beta _{r+1}}$. If there are exactly $n/2-2$ ones in the word, $\beta $ is a composition of $n/2-2$, and $\alpha $ is a composition of $n/2-k$. The restriction of runs is expressed as a restriction of the size of the parts of $\alpha $. Moreover, for $r < \lceil \frac{n}{2k}\rceil - 1$ and $r > n/2-k$ there are no compositions of $n/2-k$ into r parts not exceeding $k-1$. Thus

$$\begin{aligned} |C(n,k) |= \sum _{r = \lceil \frac{n}{2k}\rceil - 1}^{n/2-k} \biggl (2\left( {\begin{array}{c}n/2-3\\ r-1\end{array}}\right) +&\left( {\begin{array}{c}n/2-3\\ r\end{array}}\right) \\ +&\left( {\begin{array}{c}n/2-3\\ r-2\end{array}}\right) \biggr ) c(n/2-k,r,k-1) \\ =\sum _{r = \lceil \frac{n}{2k}\rceil - 1}^{n/2-k} \sum _{i=-1}^{1} (2-|i |)&\left( {\begin{array}{c}n/2-3\\ r+i\end{array}}\right) c(n/2-k,r,k-1), \end{aligned}$$

and the formula follows from Eq. (1). $\square $

In our previous paper on non-overlapping codes, Stanovnik et al. (2024), we showed that every maximal non-overlapping code is contained in the set $\mathcal {M}_{q,n}$ (defined below in Construction 3). Every non-overlapping code is, as implied by Zorn’s lemma, a subset of some code from $\mathcal {M}_{q,n}$ and in particular, this also holds for every $\epsilon $-balanced non-overlapping code and every non-overlapping with restricted run-lengths. Hence, we will study the largest subsets of codes in $\mathcal {M}_{q,n}$ that satisfy the required constraints.

Construction 3

(Stanovnik et al. 2024) Let $n \ge 3$ . We define $\mathcal {M}_{q,n}$ to be the set of all codes

$$\begin{aligned}{C = \bigcup _{i\in \left[ n-1\right] } L_i R_{n-i} \subseteq \Sigma ^n,}\end{aligned}$$

where $(L_1, R_1)$ a partition of $\Sigma $ into two non-empty parts, and $(L_i, R_i)$ a partition of $\bigcup _{j\in \left[ i-1\right] } L_j R_{i-j} $ for every $i \in \left[ 2,n-1\right] $. Every code $C \in \mathcal {M}_{q,n}$ is non-overlapping.

To prove some statements regarding the constrained non-overlapping codes, we will need some more properties of the codes from Construction 3.

Proposition 2

(Stanovnik et al. 2024) Let $C=\bigcup _{i\in \left[ n-1\right] } L_i R_{n-i}$ be a code from Construction 3 and $X {:}{=}\bigcup _{i\in \left[ 2n+1\right] } X_i$, such that $X_{2i-1} = L_i$, $X_{2i} = R_i$, and $X_{2n+1} = C$. Then no proper prefix in X occurs as a proper suffix in X.

Proposition 3

(Stanovnik et al. 2024) Let $C = \bigcup _{i\in \left[ n-1\right] } L_i R_{n-i}$ be a code obtained from Construction 3. For every word $w \in C$, there exists a unique index i such that $w \in L_iR_{n-i}$.

3 Balanced non-overlapping codes

Before we proceed with constructing maximal $\epsilon $-balanced non-overlapping codes, we provide some bounds on the size of $\epsilon $-balanced non-overlapping codes. In particular, we first count the number of $\epsilon $-balanced primitive words. This value is then used in Proposition 5 to determine an upper bound on the size of $\epsilon $-balanced non-overlapping codes. Then, we determine the exact size of the maximum polarity-balanced non-overlapping code of length 2 in Proposition 6.

Proposition 4

The number of $\epsilon $-balanced primitive words in $\Sigma ^n$ equals

$$\begin{aligned} \sum _{\begin{array}{c} d\mid n \\ (0.5-\epsilon )d \le m \le (0.5+\epsilon )d \end{array}} \mu (d) \left( {\begin{array}{c}d\\ m\end{array}}\right) |{\Sigma }_P|^{m} |{\Sigma }_N |^{d-m}. \end{aligned}$$

Proof

Let $p_1,\dots ,p_k$ be the distinct prime divisors of n. For an integer d, define

$$\begin{aligned} {Q_d = \{w \mid w = u^d, (0.5-\epsilon )d \le p(w) \le (0.5+\epsilon )d\}, }\end{aligned}$$

i.e, the set of $\epsilon $-balanced words with a period d. Its size equals

$$\begin{aligned} {|Q_d |= \sum _{(0.5-\epsilon )d \le m \le (0.5+\epsilon )d} \left( {\begin{array}{c}d\\ m\end{array}}\right) |}{{\Sigma }}{_P|^{m} |}{{\Sigma }}{_N |^{d-m}, }\end{aligned}$$

since for m between $ \lceil (0.5 - \epsilon )d\rceil $ and $\lfloor (0.5+\epsilon )d\rfloor $, there are $\left( {\begin{array}{c}d\\ m\end{array}}\right) $ choices for the position of positive symbols. There are $|\Sigma _P |^{m}$ choices to select the symbols at these positions, and $|\Sigma _N |^{d-m}$ choices to select the negative symbols in the word. Clearly, the number of $\epsilon $-balanced primitive words B equals

$$\begin{aligned} {B = Q_1 \setminus \bigcup _{\begin{array}{c} d \mid n \\ d > 1 \end{array}} Q_d. }\end{aligned}$$

If $e \mid d$, then $Q_e \subseteq Q_d$. Hence, using the principle of inclusion and exclusion and denoting $d(Y) {:}{=}\prod _{i \in Y} p_i$,

$$\begin{aligned} {|B |}&{= |Q_1 |- \sum _{\emptyset \ne Y \subseteq \left[ k\right] } (-1)^{|Y |-1} |Q_{d(Y)} |}\\&{= |Q_1 |+ \sum _{\emptyset \ne Y \subseteq \left[ k\right] } (-1)^{|Y |} |Q_{d(Y)} |}\\&{= \sum _{d\mid n} \mu (d) |Q_d |, }\end{aligned}$$

proving the claim. $\square $

Proposition 5

Let C be an $\epsilon $-balanced q-ary non-overlapping code of length n. Then

$$\begin{aligned} {|C |\le \sum _{\begin{array}{c} d\mid n \\ (0.5-\epsilon )d \le m \le (0.5+\epsilon )d \end{array}} \frac{\mu (d)}{n} \left( {\begin{array}{c}d\\ m\end{array}}\right) |}{{\Sigma }}{_P|^{m} |}{{\Sigma }}{_N |^{d-m}. }\end{aligned}$$

Proof

Let w be a word in C. Shifting preserves the number of positive letters in a word, so any cyclic shift of w is $\epsilon $-balanced. Since C is a non-overlapping code, none of the cyclic shifts of w is a word in C (otherwise, w has a prefix that occurs as a suffix in C). Moreover, w and all of its shifts are primitive – if any of them has a period d, then the prefix of w of length d occurs as a suffix of w of length d. Thus, $n|C |\le |B |$ and the proposition follows. $\square $

Proposition 6

$S_B(|\Sigma _P |,|\Sigma _N|,2) = |\Sigma _P ||\Sigma _N|.$

Proof

Every polarity-balanced code of length two is obtained by concatenating a positive and a negative letter. A letter that occurs as a prefix should not occur as a suffix to obtain a non-overlapping code. Hence, we partition the set $\Sigma _P$ into two parts, $L_P$ and $R_P$, and the set $\Sigma _N$ into parts $L_N$ and $R_N$, and write down a polarity-balanced non-overlapping code C in the form $C = L_N R_P \cup L_P R_N$. Clearly, $\max \; |C |= S_B(|\Sigma _P |,|\Sigma _N|,2)$. Compute

$$\begin{aligned} |C |=&|L_N ||R_P |+ |L_P ||R_N |\\ =&|L_N |(q/2 - |L_P |) + |L_P |(q/2 - |L_N |) \\ =&q/2 (|L_N|+ |L_P |) - 2 |L_N ||L_P |. \end{aligned}$$

This function is continuous; therefore, its global maxima lie at the boundaries or critical points. The only critical point is $(|L_N|, |L_P |) = (|\Sigma _P|/2,|\Sigma _N|/2)$, but the corresponding Hessian, $H(|\Sigma _P|/2,|\Sigma _N|/2) = \begin{bmatrix}0 & -2 \\ -2 & 0\end{bmatrix}$, has a negative determinant and hence $(|\Sigma _P|/2,|\Sigma _N|/2)$ is a saddle point. The maxima, therefore, lies at the boundaries. C is symmetric in $|L_N|$ and $|L_P|$, and therefore we only have to test the cases $|L_N|= 0$ and $|L_N|= 1$. If $|L_N|= 0$,

$$\begin{aligned} |C |= |\Sigma _N ||L_P |, \end{aligned}$$

and the maximum, $|\Sigma _N||\Sigma _P |$, is attained at $|L_P |= |\Sigma _P |$. If $|L_N|= 1$,

$$\begin{aligned} |C |=&\, |\Sigma _P |+ |\Sigma _N ||L_P |- 2 |L_P |\\ =&\, |\Sigma _P |+ (|\Sigma _N |-2)|L_P |. \end{aligned}$$

For $|\Sigma _N |> 2$, $(|\Sigma _N |-2) > 0$, and the function above attains its maximum at $|L_P |= |\Sigma _P |$.

$$\begin{aligned} |C |\le&\, |\Sigma _P |+ (|\Sigma _N |-2) |\Sigma _P |\\ =&\, |\Sigma _P |(|\Sigma _N |-1) \\<&\, |\Sigma _P ||\Sigma _N |. \end{aligned}$$

If $|\Sigma _N |= 2$, $|C |= |\Sigma _P |\le |\Sigma _P ||\Sigma _N |$, and if $|\Sigma _N |= 1$, the maximum is attained at $|L_P |= 0$, hence $|C |= |\Sigma _P |\le |\Sigma _P ||\Sigma _N |$. $\square $

One can construct an $\epsilon $-balanced non-overlapping code and in particular all maximal $\epsilon $-balanced non-overlapping codes by extracting codewords from $X \in \mathcal {M}_{q,n}$ that have between $\lceil (0.5-\epsilon )n \rceil $ and $\lfloor (0.5+\epsilon )n \rfloor $ positive letters. To obtain an extraction, we will partition the sets of prefixes so that $L_i^j$ contains all words from $L_i$ with j positive letters. Similarly, we partition the sets of suffixes so that $R_i^j$ contains all words from $R_i$ with j positive letters. Some properties of the sets $L_i^j$ and $R_i^j$ are given in Proposition 7.

Proposition 7

(i)
If $j > i$, then $L_i^j \cup R_i^j = \emptyset $.
(ii)
If $j \ne k$, then $(L_i^j \cup R_i^j) \cap ( L_i^k \cup R_i^k) = \emptyset $.
(iii)
If $i > 1$ and $0 \le j \le i$, then
$$\begin{aligned}L_i^j \cup R_i^j =\bigcup _{k=1}^{i-1} \bigcup _{m=\langle k+j-i \rangle } ^{\min (j,k)}.\end{aligned}$$

Proof

(i)
A word $x \in L_i \cup R_i$ has length i and therefore contains at most i positive symbols, so $p(x) \le i$.
(ii)
Let $x \in L_i^j \cup R_i^j$. Then, by definition, $p(x) = j$. Therefore $x \in L_i^k \cup R_i^k$, if and only if $k = p(x) = j$.
(iii)
By definition the set $L_i^j \cup R_i^j$ contains all words $w \in \bigcup _{k=1}^{i-1}(L_k R_{i-k})$ such that $p(w) = j$. Let $w = uv \in L_i^j \cup R_i^j$ be a decomposition of w such that $u \in L_k$ and $v \in R_{i-k}$. We know by Proposition 3 that this decomposition is unique. There are at most j positive letters in the prefix u and at most j positive letters in the suffix v. After applying statement (i) we obtain
$$\begin{aligned} 0 \le p(u)&\le \min (k,j), \\ 0 \le p(v)&\le \min (i-k,j), \\ p(u)&+ p(v) = j. \end{aligned}$$
Therefore $0 \le p(v) = j-p(u) \le i-k$ and $p(u) \ge j+k-i$.

$\square $

Every non-overlapping code from $\mathcal {M}_{q,n}$ can be composed in terms of the new sets (see Proposition 8 below). The parameter k in the formula denotes the number of positive letters in a codeword. The $\epsilon $-balanced codewords in X are therefore exactly those with k between $\lceil (0.5 - \epsilon ) n \rceil $ and $\lfloor (0.5 + \epsilon ) n \rfloor $. This observation leads to Proposition 4. Moreover, we determine the number of $\epsilon $-balanced codewords in X in Proposition 9.

Proposition 8

$$\begin{aligned} \bigcup _{i=1}^{n-1} L_i R_{n-i} = \bigcup _{i = 1}^{n-1} \bigcup _{j=0}^{i} \bigcup _{k = j}^{n+j-i} L_i^j R_{n-i}^{k-j}. \end{aligned}$$

Proof

From statement (i) of Proposition 7 we know that

$$\begin{aligned} \bigcup _{i=1}^{n-1} L_i R_{n-i} = \bigcup _{ i = 1}^{n-1} \bigcup _{j=0}^{i} \bigcup _{k = 0}^{n-i} L_i^jR_{n-i}^{k}. \end{aligned}$$

After rearrangement of the right-hand side by the amount of positive letters in $L_i^j R_{n-i}^k$, $j+k$, we obtain

$$\begin{aligned} \bigcup _{i=1}^{n-1} L_i R_{n-i} = \bigcup _{i = 1}^{n-1} \bigcup _{j=0}^{i} \bigcup _{k = j}^{n+j-i} L_i^j R_{n-i}^{k-j}. \end{aligned}$$

$\square $

Construction 4

Let $X = \bigcup _{i=1}^{n-1} L_iR_{n-i} \in \mathcal {M}_{q,n}$.

The largest $\epsilon $-balanced non-overlapping code contained in X is

$$\begin{aligned} C = \bigcup _{i = 1}^{n-1} \bigcup _{k = k_{\text {min}}}^{k_{\text {max}}} \bigcup _{j = \langle k + i - n \rangle }^{\min (i,k)} L_i^j R_{n - i}^{k-j}, \end{aligned}$$

where $k_{\text {min}} = \lceil (0.5 - \epsilon ) n \rceil $ and $k_{\text {max}} = \lfloor (0.5 + \epsilon ) n \rfloor $.

Proof

The set of $\epsilon $-balanced codewords in X is exactly

$$\begin{aligned}\bigcup _{i = 1}^{n-1} \bigcup _{k = k_{\text {min}}}^{k_{\text {max}}} \bigcup _{j = 0}^{\min (i,k)} L_i^j R_{n - i}^{k-j}.\end{aligned}$$

If $0 \le j < k + i - n$, then $k - j > n-i$ and there is no such word $w \in R_{n-i}$ that satisfies $p(w) = k - j$. Therefore $R_{n-i}^{k-j}$ is empty. C is non-overlapping since every subset of a non-overlapping code is itself non-overlapping. $\square $

Proposition 9

Let C be an $\epsilon $-balanced non-overlapping code from Proposition 4. Then

$$\begin{aligned} |C |= \sum _{i=1}^{n-1} \sum _{k = k_{\text {min}}}^{k_{\text {max}}} \sum _{j=\langle i-k \rangle }^{\min (i, k)} |L_i^j ||R_{n-i}^{k -j}|. \end{aligned}$$

Proof

Suppose $w \in L_i^j R_{n-i}^{p-j} \cap L_l^m R_{n-l}^{r-m}$. There are $p(w) = p = r$ positive letters in w. The code C is a subset of some non-overlapping code $X = \bigcup _{i<n} L_iR_{n-i}$ with $L_i^j \subseteq L_i$, $R_{n-i}^{p-j} \subseteq R_{n-i}$, $L_l^m \subseteq L_l$ and $R_{n-l}^{r-m} \subseteq R_{n-l}$. Therefore $w\in L_iR_{n-i} \cap L_l R_{n-l}$ and from Proposition 3 we get $i = l$. Statement (ii) of Proposition 7 now yields $j = m$. $\square $

We performed a computer search to determine the maximum sizes of binary polarity-balanced non-overlapping codes of small block sizes. Table 1 shows that Construction 1 is optimal for many block sizes. The only parameter value where the sizes differ is $n=14$. On the other hand, the codes obtained from Construction 2 are much smaller. Moreover, the computer search revealed that for $q \in \{4,6\}$ and $n \in \{4,6,8\}$ the maximum sizes of polarity-balanced non-overlapping codes also coincide with the sizes of codes obtained by the straight-forward generalisation of Construction 1. A computer search of Construction 4 becomes infeasible for larger parameter values.

Table 1 Comparison of the maximum size of a binary polarity-balanced non-overlapping code and the sizes of codes obtained by Construction 1 and 2

Full size table

One wonders whether it is possible to determine which partitions yield maximal codes. We establish necessary and sufficient conditions in Proposition 10. This generalises a result from Stanovnik et al. (2024), where a characterisation of unconstrained maximal non-overlapping codes is provided in a slightly different form since for $\epsilon = 0.5$ the set $R_{n-\sum _{m=0}^{k}i_m}^{p-\sum _{m=0}^{k}j_m} \cup L_{n-\sum _{m=0}^{k} i_m}^{p-\sum _{m=0}^{k}j_m}$ can only be empty when $k=0$, $i_0=n/2$, $j_0 = p/2$ and $L_{i_0}^{j_0} \cup R_{i_0}^{j_0} = \{x_0\}$.

Proposition 10

An $\epsilon $-balanced non-overlapping code

$$\begin{aligned} C = \bigcup _{i = 1}^{n-1} \bigcup _{k = \lceil (0.5 - \epsilon ) n \rceil }^{\lfloor (0.5 + \epsilon ) n \rfloor } \bigcup _{j = \langle k + i - n \rangle }^{\min (i,k)}, \end{aligned}$$

is maximal if and only if $x_0 \in L_{i_0}^{j_0}$ is not a prefix in C (respectively $x_0 \in R_{i_0}^{j_0}$ not a suffix in C) implies that for every $k \ge 0$, $i_0 + \dots + i_k < n$ and $j_0 + \dots + j_k \le p $ where $\lceil (0.5-\epsilon )n \rceil \le p \le \lfloor (0.5+\epsilon )n \rfloor $ and there exists a sequence of words $(x_1,\dots ,x_k)$ that are not prefixes (suffixes) in C such that $x_l \in L_{i_l}^{j_l} x_{l-1}$ for $1\le l \le k$ (respectively $x_l \in x_{l-1} R_{i_l}^{j_l}$ ), either

$$\begin{aligned} R_{n-\sum _{m=0}^{k}i_m}^{p-\sum _{m=0}^{k}j_m} \cup L_{n-\sum _{m=0}^{k} i_m}^{p-\sum _{m=0}^{k}j_m} = \emptyset , \end{aligned}$$

or $k = 0$, $i_0 = n/2$, $j_0=p/2$ and $ L_{i_0}^{j_0} \cup R_{i_0}^{j_0} = \{x_0\}$.

Proof

$(\Leftarrow ):$

Let C satisfy the assumptions. Define $\{p_k(w)\}$ to be a sequence of decompositions of a polarity-balanced word w encoded with a word over the alphabet $\{l_0,\dots ,l_{\lfloor (0.5+ \epsilon )n\rfloor }, r_0,\dots ,r_{\lfloor (0.5 + \epsilon )n\rfloor }\}$ such that

a) $p_0(w) \in \{l_0,l_1,r_0,r_1\}$ with $p_0(w) = l_0$ if $w_i \in L_1^0$, $p_0(w) = l_1$ if $w_i \in L_1^1$, $p_0(w) = r_0$ if $w_i \in R_1^0$ and $p_0(w) = r_1$ if $w_i \in R_1^1$ , and

b) if $p_k(w)$ contains $l_{i}r_{j}$ as substring, obtain $p_{k+1}(w)$ by replacing its occurrence with $l_{i+j}$ if $l_ir_j \in L_{m}^{i+j}$ or with $r_{i+j}$ if $l_ir_j \in R_m^{i+j}$ for some m.

The length of the decomposition is strictly decreasing, so the sequence is finite. The last element of this sequence is therefore in one of the following forms $l_ir_{p-j}$, $l_{\alpha _1}\cdots l_{\alpha _k}$, $r_{\alpha _1}\cdots r_{\alpha _k}$ or $r_{\alpha _1}\cdots r_{\alpha _m}l_{\alpha _{m+1}}\cdots l_{\alpha _k}$ where $(\alpha _1,\dots ,\alpha _k)$ is a weak composition of $p \in \{\lceil (0.5+\epsilon )n \rceil , \dots , \lfloor (0.5-\epsilon )n \rfloor \}$. If $p(w) = l_ir_{p-j}$, then $w \in C$. If $p(w) = l_{\alpha _1}\cdots l_{\alpha _k}$, then by assumption w has a suffix that occurs as a prefix in C. If $p(w) = r_{\alpha _1}\cdots r_{\alpha _k}$, then by assumption w has a prefix that occurs as a suffix in C. If $p(w) = r_{\alpha _1}\cdots r_{\alpha _m}l_{\alpha _{m+1}}\cdots l_{\alpha _k}$, then a shift of w that corresponds to $l_{\alpha _{m+1}}\cdots l_{\alpha _k}r_{\alpha _1}\cdots r_{\alpha _m}$ is a word in C. Hence, every polarity-balanced word not already in C overlaps with a word in C, so C is maximal.

$(\Rightarrow ):$

Suppose C is maximal and $x_o \in L_{i_0}^{j_0}$ is not a prefix in C. The case when $x_0 \in R_{i_0}^{j_0}$ is not a suffix in C is symmetric. For sake of contradiction, take the smallest k such that there exist p and $(x_1,\dots ,x_k)$ that are not prefixes in C but there exists $y \in R_{n-\sum _{m=0}^{k}i_m}^{p-\sum _{m=0}^{k}j_m} \cup L_{n-\sum _{m=0}^{k} i_m}^{p-\sum _{m=0}^{k}j_m}$. If $y \in R_{n-\sum _{m=0}^{k}i_m}^{p-\sum _{m=0}^{k}j_m}$, then $x_ky \not \in C$ since $x_k$ is not a prefix in C. By Proposition 2 , y and none of its suffixes is a prefix in C, and no prefix of y occurs as a suffix in C. If a prefix of $x_k$ or $x_k$ itself is a suffix in C, then there is a word or a prefix in $L_{i_m}$ that occurs as a suffix in C violating Proposition 2. Moreover, if $x_ky$ has a suffix longer than $n - \sum _{m=0}^{k}i_m$ that occurs as a prefix in C, then either $x_ {i_m}$ is a prefix in C or a word in $L_{i_m}$ has a suffix that is a prefix in C. The first cannot occur due to the assumption at the start of this proof, and the second due to Proposition 2. Hence, $C \cup \{x_ky\}$ is non-overlapping and larger than C. $C \cup \{x_ky\}$ is also $\epsilon $-balanced since $p(x_ky) = p(x_k)+p(y) = p$, violating the assumption that C is maximal. If $y \in L_{n-\sum _{m=0}^{k}i_m}^{p-\sum _{m=0}^{k}j_m}$, then we have to prove that unless $k = 0$, $i_0 = n/2$, $j_0=n/4$ and $ L_i^j \cup R_i^j = \{x_0\}$, $yx_k \cup C$ is a non-overlapping code larger than C. Note that in the special case mentioned, $yx_k = x_0x_0$ and overlaps with itself. To show that $yx_k\cup C$ is non-overlapping and $\epsilon $-balanced follow a similar procedure as in the previous case. To see that $yx_k$ does not belong to C, it is sufficient to observe that $yx_k$ ends in $x_0$ and from Proposition 2 we know that $x_0$ is not a suffix in C. $\square $

4 Restricting the length of the maximal run

Since the first and the last letters of a non-overlapping code are always distinct, the longest run in a cyclic shift of a codeword w is upper-bounded by the longest run of w . Furthermore, a non-overlapping code of length n cannot have any run of length l whenever $l \ge n$. Now, denote the number of a maximum q-ary non-overlapping code with run-length of at most l and length n by $S_l(q,n)$. We will use the number of cyclically run-length-restricted q-ary words to determine an upper bound on the size of a non-overlapping code with restrictions on its run-lengths. We denote the set of q-ary words where the longest (cyclic) run does not exceed l by $L_{q,l}^n$ and the corresponding generating function by $L_{q,l}(x)$.

Proposition 11

The number of q-ary words of length n such that the length of no run in the word nor in any of its cyclic shifts exceeds l equals

$$\begin{aligned} {|L_{q,l}^n|= \sum _{\sum ik_i = n} \left( {\begin{array}{c}\sum k_i\\ k_1,\dots ,k_{l}\end{array}}\right) (q-1) (q^3-q^2-q)^{\sum k_i}. }\end{aligned}$$

Proof

First observe that every word satisfying the constraints can be written as $x_1^{\alpha _1}\dots x_k^{\alpha _k}$ where $x_1 \cdots x_k$ is a cyclic Carlitz word and $1 \le \alpha _i \le l$. Thus, we can take the ordinary generating function of cyclic Carlitz words and substitute $z \mapsto q \sum _{i=1}^{l} x^i$ to obtain the ordinary generating function of the number of words satisfying the required constraint.

$$\begin{aligned} {L_{q,l}(x) }&{= \frac{q^3 (q-1) \left( \sum _{i=1}^{l} x^i\right) ^2}{\left( \sum _{i=1}^{l} x^i + 1\right) \left( 1 - (q-1)\sum _{i=1}^{l} x^i\right) } }\\&{= q^2 (q-1) \left( \sum _{i=1}^{l} x^i\right) ^2 \left( q^{-1} - (q^2 - q -1)\sum _{i=1}^{l} x^i\right) ^{-1}. }\end{aligned}$$

Applying the binomial theorem on $\left( q^{-1} - (q^2 - q -1)\sum _{i=1}^{l} x^i\right) ^{-1}$ and rearrangement of the result obtain

$$\begin{aligned} {L_{q,l}(x) = \sum _{n=2}^{\infty } (q-1) q^{n} (q^2-q-1)^{n-2} \left( \sum _{i=1}^{l} x^i\right) ^{n}. }\end{aligned}$$

After using the multinomial theorem, the generating function is rewritten as

$$\begin{aligned} {L_{q,l}(x) = \sum _{n=2}^{\infty } (q-1) q^{n} (q^2-q-1)^{n-2} \sum _{k_1 + \dots +k_l = n} \left( {\begin{array}{c}n\\ k_1,\dots ,k_l\end{array}}\right) x^{\sum _{i=1}^{l-1} i k_i}, }\end{aligned}$$

and the coefficient $\left[ x^n\right] L(x)_{q,l}$ is extracted to obtain the desired result. $\square $

The run-length restricted primitive words are counted with the same procedure as the primitive $\epsilon $-balanced words, and the upper bound on the non-overlapping code with restricted run-lengths is determined the same way as $\epsilon $-balanced non-overlapping code, giving the following results.

Lemma 1

The number of q-ary primitive words of length n such that the length of no run in the word nor in any of its cyclic shifts exceeds l equals

$$\begin{aligned} {\sum _{d \mid n} \mu (d) |L_{q,l}^d |. }\end{aligned}$$

Corollary 1

$$\begin{aligned} {S_l(q,n) \le \sum _{d \mid n} \frac{\mu (d)}{n} |L_{q,l}^d |.}\end{aligned}$$

Now let us determine lower bounds on $S_l(q,n)$. If a non-overlapping code C has a run of length $n-1$, then there exists $w \in C$ that starts with $n - 1$ identical symbols or ends with $n-1$ identical symbols. Let $L_1$ be the set of all letters $x \in \Sigma $ such that there exists a word in C beginning in x, and $R_1$ the set of all letters $y \in \Sigma $ such that there exists a word in C ending in y. Since no element in $L_1R_1$ occurs as both a prefix and a suffix in C, for any pair of symbols $x \in L_1$ and $y \in R_1$ at most one of the words $x^{n-1}y$ and $xy^{n-1}$ belongs to C. If we remove all such words, the remaining words constitute a non-overlapping code without any run of length $n-1$. Therefore

$$\begin{aligned} S_{n-2}(q,n)&\ge S(q,n) - |L_1 ||R_1 |\\&\ge S(q,n) - |L_1 |(q - |L_1 |) \\&\ge S(q,n) - q^2 / 4. \end{aligned}$$

Proposition 12

For $q > 2$,

$$\begin{aligned} {S_l(q,n) \ge (q-1)^{lm + 1 + r} (q-2)^{m}, }\end{aligned}$$

where m is the quotient and r the remainder of division of $n-2$ by l.

Proof

Partition $\Sigma $ into P and S so that one symbol belongs to P and $q-1$ symbols belong to S. The set $PS^{n-1}$ is non-overlapping by Construction 3. To avoid repetitions longer than l elements, we remove all words with the same letter on positions $1+kl$ and $2+kl$ for some positive integer k. The size of the resulting set C equals

$$\begin{aligned} |C |&= |P ||S |^{l}(|S |-1)|S |^{l-1}(|S |-1)|S |^{l-1} \dots \\&= |P ||S |^{1 + (l-1)m} \left( |S |- 1\right) ^m |S |^{r}. \end{aligned}$$

Since $q > 2$, $|S |- 1 > 0$. Hence, the code C is non-empty, non-overlapping, and has no run of length l. $\square $

Now, we proceed with constructions of maximal non-overlapping codes with restricted run-lengths. Observe step i of the iterative construction of partitions $(L_j,R_j)_{j > 0}$. If $w = uv$, $u\in L_j, v \in R_{i-j}$ for some j, $i-1> j > 2$, then u ends in a symbol from $R_1$ and v starts in a symbol from $L_1$. Hence, w has no runs longer than l if and only if both u and v do not have them. Therefore, during step i, a subsequence of identical symbols can only be lengthened in the concatenations $L_1R_{i-1}$ and $L_{i-1}R_1$. In the first case the subset of words with run-length of at most l equals

$$\begin{aligned} X_i = \{xy \mid x \in L_1, y \in R_{i-1} \text { starts with at most } (l-1) \\\text { consecutive }x's\}, \end{aligned}$$

and in the second case the subset of words with run-length at most l equals

$$\begin{aligned} Y_i = \{yx \mid x \in R_1, y \in L_{i-1} \text { ends with at most } (l-1) \\ \text { consecutive }x's\}. \end{aligned}$$

Therefore, Construction 5 (below) generates a set of non-overlapping codes with run-length of at most l. In particular, it creates the largest set of words with no run longer than l contained in a code from $\mathcal {M}_{q,n}$. Hence, every maximal non-overlapping code with runs restricted by l can be obtained from this construction. The generalisation, when run-length restrictions are different for distinct letters, is also straightforward.

Construction 5

Let the following hold for $n \ge 3$ and $l < n-1$.

(i)
$(L_1, R_1)$ is a partition of $\Sigma $ into two non-empty parts,
(ii)
for $i \in \{2,\dots , l+1\}$, $(L_i, R_i)$ is a partition of $\bigcup _{j=1}^{i-1} L_j R_{i-j} $,
(iii)
and for every $i \in \{l+2,\dots , n-1\}$, $(L_i, R_i)$ is a partition of $X_i \cup Y_i \cup \bigcup _{j=2}^{i-2} L_j R_{i-j}$.

Then the code

$$\begin{aligned}C = X_n \cup Y_n \cup \bigcup _{j=2}^{n-2} \end{aligned}$$

is non-overlapping and its longest run does not exceed l.

5 Balanced non-overlapping codes with restricted run-lengths

To generate an $\epsilon $-balanced non-overlapping code with restricted run-lengths, we combine the observations from Constructions 4 and 5. Instead of partitioning only the sets $L_i$ and $R_i$ into sets $L_i^j$ and $R_i^j$ based on the number of positive letters in a word as was done from $\epsilon $-balanced codes, we also partition the sets $X_i$ and $Y_i$ defined for codes with restricted run-lengths into sets $X_i^j$ and $Y_i^j$. Construction 6 is an immediate consequence of the previous results.

Table 2 Comparison between the number of codewords in the largest polarity-balanced codes with restricted run-lengths obtained by Constructions 7 and 8

Full size table

Construction 6

Let $n\ge 3$. The code

$$\begin{aligned}C = \bigcup _{j=j_{\text {min}}}^{j_{\text {max}}} \left( X_i^j \cup Y_i^j \cup \bigcup _{k=2}^{i-2} \bigcup _{m=\langle k+j-i \rangle } ^{\min (j,k)} L_k^mR_{i-k}^{j-m}\right) .\end{aligned}$$

where

(i)
$\left( L_1^0, R_1^0\right) $ and $\left( L_1^1, R_1^1\right) $ are such partitions of $\Sigma _N$ and $\Sigma _P$, respectively, that $L_1^0 \cup L_1^1$ and $R_1^0 \cup R_1^1$ are both non-empty,
(ii)
for $i \in \{2,\dots , l+1\}$ and $0 \le j \le i$,

$(L_i^j, R_i^j)$ is a partition of $\bigcup _{k=1}^{i-1} \bigcup _{m=\langle k+j-i \rangle } ^{\min (j,k)} L_k^mR_{i-k}^{j-m}$,
(iii)
for $i \in \{l+2,\dots , n\}$
$$\begin{aligned} X_i^j = \{xy \mid x \in L_1^0, y \in R_{i-1}^j \text { starts with at most } (l-1) \\ \text { consecutive }x's\} \\ \cup \{xy \mid x \in L_1^1, y \in R_{i-1}^{j-1} \text { starts with at most } (l-1)\\ \text { consecutive }x's\}, \end{aligned}$$
(iv)
for $i \in \{l+2,\dots , n\}$
$$\begin{aligned} Y_i^j = \{yx \mid x \in R_1^0, y \in L_{i-1}^j \text { ends with at most } (l-1) \\ \text { consecutive }x's\} \\ \cup \{yx \mid x \in R_1^1, y \in L_{i-1}^{j-1} \text { ends with at most } (l-1) \\ \text { consecutive }x's\}, \end{aligned}$$
(v)
for $i \in \{l+2,\dots , n-1\}$ and $0 \le j \le i$,

$(L_i^j, R_i^j)$ is a partition of
$$\begin{aligned}X_i^j \cup Y_i^j \cup \bigcup _{k=2}^{i-2} \bigcup _{m=\langle k+j-i \rangle } ^{\min (j,k)} L_k^mR_{i-k}^{j-m},\end{aligned}$$
(vi)
$j_{\text {min}} = \lceil (0.5 - \epsilon ) n \rceil $ and $j_{\text {max}} = \lfloor (0.5 + \epsilon ) n \rfloor $,

is an $\epsilon $-balanced non-overlapping code without runs longer than l.

Determination of the largest codes using this method is computationally exhaustive. Thus, we modify Constructions 1 and 2 to determine lower bounds on the size of a maximum polarity-balanced code that contains no runs longer than l. The comparison between the two constructions is given in Table 2.

We define $\mathcal {D}_{2n,l}$ as the set of all balanced Dyck words of length 2n having no runs longer than l. Moreover, we define $\hat{\mathcal {D}}_{2n,l}$ to be the set of all balanced Dyck words of length 2n starting with a sequence of at most $l-1$ ones, ending in a sequence of at most $l-1$ zeroes and having no runs longer than l in between. The modification of Construction 1 is now stated using these restricted Dyck words.

Construction 7

If n is even, then the code

$$\begin{aligned} D_{2n+2,l} = \bigcup _{i\in \left[ 0, n/2\right] } \{a1b0: a \in \mathcal {D}_{2i,l}, b \in \hat{\mathcal {D}}_{2(n-i),l}\} \end{aligned}$$

is a non-overlapping code with no runs longer than l. If n is odd, then the code

$$\begin{aligned} D_{2n+2,l} = \bigcup _{i\in \left[ 0,(n+1)/2\right] } \{a1b0: a \in \mathcal {D}_{2i,l}, b \in \hat{\mathcal {D}} _{2(n-i),l}\} \setminus \{ {1a'01b'0} , a {'} ,b {' } \in \hat{\mathcal {D}}_{n-1,l}\} \end{aligned}$$

is a non-overlapping code with no runs longer than l.

Proof

For an even n, the words in $D_{2n+2,l}$ are a subset of $D_{2n+2}$, hence, non-overlapping by Construction 1. For an odd n, in addition, we have to observe that the words $1a'01b'0$ where $a'$ or $b'$ have run longer than l, start with l ones or end in l zeroes, are not members of the set $\bigcup _{i\in \left[ 0,(n+1)/2\right] } \{a1b0: a \in \mathcal {D}_{2i,l}, b \in \hat{\mathcal {D}}_{2(n-i),l}\}$. In particular, we are left to show that the runs in a1b0 are at most l. The last run of a ends in 0, so it cannot be lengthened by the sequence 1b0. On the other hand, the runs at both borders of b are lengthened by one additional equal symbol. Hence, the longest run in a1b0 is at most l. $\square $

As shown in Proposition 13, one should first enumerate the restricted Dyck words to determine the size of the codes from Construction 7. To our knowledge, this problem has not been solved for general values of l. We provide some known results and simple observations in Propositions 14 and 15.

Proposition 13

$$\begin{aligned} {|D_{2n+2,l} |= {\left\{ \begin{array}{ll} \sum _{i =0}^{n/2} \; |\mathcal {D}_{2i,l} |\left|\hat{\mathcal {D}}_{2(n-i),l} \right|& n \text { even},\\ \sum _{i =0}^{(n+1)/2} \; \left|\mathcal {D}_{2i,l} \right|\left|\hat{\mathcal {D}}_{2(n-i),l} \right|- \left|\hat{\mathcal {D}}_{n-1,l}\right|^2 & n \text { odd}. \end{array}\right. } }\end{aligned}$$

Proof

We first observe that for every word in a1b0 where a and b are Dyck paths, there exist unique indices i and j such that $a \in \mathcal {D}_{i}$ and $b \in \hat{\mathcal {D}}_j$. Moreover, for an odd n, every word 1a0 such that $a \in \hat{\mathcal {D}}_{n-1,1}$ is also a word in $\mathcal {D}_{n+1,1}$. Hence,

$$\begin{aligned} \left\{ 1a01b0, a,b \in \hat{\mathcal {D}}_{n-1,l}\right\} \subseteq \bigcup _{i\in \left[ 0,(n+1)/2\right] } \left\{ a1b0: a \in \mathcal {D}_{2i,l}, { b \in \hat{\mathcal {D}}_{2(n-i),l}}\right\} , \end{aligned}$$

and the formula follows. $\square $

Proposition 14

$$\begin{aligned} |\mathcal {D}_{2n,1} |&= 1.\end{aligned}$$

(2)

$$\begin{aligned} \left|\hat{\mathcal {D}}_{2n, 2} \right|&= \left|\mathcal {D}_{2n-4, 2} \right|. \end{aligned}$$

(3)

$$\begin{aligned} \left|\mathcal {D}_{2n,n-1} \right|&= \left|\mathcal {D}_{2n} \right|- 1. \end{aligned}$$

(4)

$$\begin{aligned} \left|{\mathcal {D}}_{2n,l-1} \right|\le |&\hat{\mathcal {D}}_{2n,l} |\le |{\mathcal {D}}_{2n,l} |. \end{aligned}$$

(5)

Proof

The only Dyck word containing no runs of length two is $(10)^n$; thus, Eq. (2). To prove Eq. (3) , observe that a balanced Dyck word starting in a run of a single one and ending in a run of a single zero has a form 10b10 where b is a Dyck word of length $2n-4$. Equation (4) holds since the only balanced Dyck word having a run of n ones or zeroes is $1^n0^n$. We are left to show Eq. (5). Every balanced Dyck word without runs of length l is contained in $\hat{\mathcal {D}}_{2n,l}$. The words in $\hat{\mathcal {D}}_{2n,l}$ contain no runs longer then l. Thus $ {\mathcal {D}}_{2n,l-1} \subseteq \hat{\mathcal {D}}_{2n,l} \subseteq {\mathcal {D}}_{2n,l}$. $\square $

Proposition 15

(Donaghey 1980)

$$\begin{aligned} {|\mathcal {D}_{2n,2} |= \sum _{k=0}^{n-1} \frac{1}{n-k}\left( {\begin{array}{c}n-k\\ k+1\end{array}}\right) \left( {\begin{array}{c}n-k\\ k\end{array}}\right) . }\end{aligned}$$

The modification of Construction 2 is also straightforward. For $l < k$, there are no words in C(n, k) without zero runs of length $l+1$. For $l \ge k$, all zero runs in C(n, k) satisfy the constraint, and we only need to constrain the run-lengths of ones.

Construction 8

Let n and k be integers, $k \le n/2$ and $l > k$. The set C(n, k, l) that contains all words of the form $0^k1c1$ where c is a binary word with exactly $n/2-2$ ones, no zero run of length k and no run of ones having length l is a polarity-balanced non-overlapping code with run-length of at most l.

Proposition 16

Let $n > 4$, $1 \le k \le n/2$ and $l \ge k$. The number of words in the code C(n, k, l) obtained from Construction 8 equals

$$\begin{aligned} |C(n,k,l) |&= \sum _{r=\lceil \frac{n}{2k}\rceil - 1}^{n/2-k} \sum _{i=-1}^{1} (2-|i |)\, c(n/2-k,r,k-1)\, c(n/2-2,r,l-1). \end{aligned}$$

Proof

The statement is obtained following the procedure from the proof of Proposition 1 with an additional constraint on the sizes of $\beta _i$s being smaller than l. $\square $

Note that C(n, k, l) is not the largest subset of C(n, k) that satisfies the required constraints. It could be expanded into a code C by adding codewords of the form $0^k1c1$ where c is a binary word with exactly $n/2-2$ ones, no zero runs of length k, starting with a zero run with length of at most $l-1$, ending with a run of ones with length of at most $l-1$, having some runs of l ones, but no run of $l+1$ ones. Then, clearly $|C(n,k,l) |\le |C |\le |C(n,k,l+1) |$.

6 Enumerative encoding

The size of non-overlapping codes grows exponentially. Thus, holding the codewords in a table is not feasible for any practical application, and instead, an enumerative encoding is required. In Constructions 3, 4 and 6, the sizes of the partitions also grow exponentially, indicating that they should be equipped with a more space-efficient algorithmic procedure. We start by observing that if one of the sets $L_i$ and $R_i$ is empty in Construction 3, knowing which one is empty is sufficient to compute both sets from the partitions with smaller indices. Without loss of generality, assume that $R_i$ is empty. We can order the elements in $L_i$ so that the $k_1$-th element x lies in $L_jR_{i-j}$ if and only if

$$\begin{aligned} \sum _{l=1}^{j-1} |L_l ||R_{i-l} |< k_1 \le \sum _{l=1}^{j} |L_l ||R_{i-l}|, \end{aligned}$$

and x is the $k_2$-th element of $L_jR_{i-j}$ if $k_2 = k_1 - \sum _{l=1}^{j-1} |L_l ||R_{i-l} |$. Assuming there exist orderings on the sets $L_j$ and $R_{i-j}$, x is a concatenation of the $\lfloor k_2 / |R_{i-j} |\rfloor $-th element in $L_j$ and the $(k_2 \mod |R_{i-j}|)$-th element in $R_{i-j}$.

If the partitions $(L_l, R_l)_{l<i}$ and their sizes are stored ordered, the computation of j requires $O(i^2)$ time, the computation of $k_2$ O(i) and the computation and access of elements in $L_j$ and $R_{i-j}$ require constant time. The computation of x, therefore, requires $O(i^2)$ time. If, on the other hand, $R_j$ and $L_{i-j}$ are both empty, we could avoid storing the sets $L_j$ and $R_{i-j}$ and determine the prefix of x in $L_j$ in $O(j^2)$ and the suffix in $R_{i-j}$ in $O((i-j)^2)$ using the same procedure as above. Hence, if we store the sizes of partitions $(L_l, R_l)_{l<n}$ and those partitions with $|L_i ||R_i|\ne 0$, we can compute the k-th word of $\bigcup _{i <n} L_iR_{n-i}$ in $O(n^3)$ since at every step the subword is either divided into two shorter parts in $O(n^2)$ or accesses from memory in O(1), and, in the worst case, we repeat the division procedure $n-1$ times.

Now, let us explain that choosing a non-overlapping code with many partitions with empty parts is worth it. We have shown in Stanovnik et al. (2024) that among all non-overlapping codes $\bigcup _{i<n} L_iR_{n-i}$ with fixed partitions $(L_i, R_i)_{i\le \lfloor n/2 \rfloor }$ that achieve the maximum size, there exists one that satisfies $|L_i ||R_i |= 0$ for all $i > n/2$. Moreover, in the same article we determined the maximum non-overlapping codes for $q\in \{2,\dots ,6\}$ and $n\in \{3,\dots ,16\}$ and for all parameter values found a solution satisfying $|L_i ||R_i |= 0$ for all $i > 1+n(q+1)^{-1}$. Blackburn (2015) also proved that whenever n divides q a non-overlapping of maximum size exists with $|L_i ||R_i |= 0$ for all $i > 1$.

The described algorithm can easily be adapted for $\epsilon $-balanced non-overlapping codes by ordering the words in C so that the $k_1$-th element x of C is in $L_l^mR_{n-l}^{r-m}$ if and only if

$$\begin{aligned} \sum _{i=1}^{l-1}\sum _{k=k_{\min }}^{k_{\max }} \sum _{j=k+i-n}^{\min (i,k)} \left|L_i^j\right|\left|R_{n-i}^{k-j}\right|&< k_1 \le \sum _{i=1}^{l}\sum _{k=k_{\min }}^{k_{\max }} \sum _{j=k+i-n}^{\min (i,k)} \left|L_i^j\right|\left|R_{n-i}^{k-j}\right|, \\ \sum _{k=k_{\min }}^{r-1} \sum _{j=k+i-n}^{\min (i,k)} \left|L_l^j\right|\left|R_{n-l}^{k-j}\right|&< k_2 \le \sum _{k=k_{\min }}^{r} \sum _{j=k+i-n}^{\min (i,k)} \left|L_l^j\right|\left|R_{n-l}^{k-j}\right|, \\ \sum _{j=k+i-n}^{m-1} \left|L_l^j\right|\left|R_{n-l}^{r-j}\right|&< k_3 \le \sum _{j=k+i-n}^{m} \left|L_l^j\right|\left|R_{n-l}^{r-j}\right|, \end{aligned}$$

where

$$\begin{aligned} k_2 {:}{=}k_1 - \sum _{i=1}^{l-1}\sum _{k=k_{\min }}^{k_{\max }} \sum _{j=k+i-n}^{\min (i,k)} \left|L_i^j\right|\left|R_{n-i}^{k-j}\right|\end{aligned}$$

and

$$\begin{aligned} k_3 {:}{=}k_2 - \sum _{k=k_{\min }}^{r-1} \sum _{j=k+i-n}^{\min (i,k)} \left|L_l^j\right|\left|R_{n-l}^{k-j}\right|. \end{aligned}$$

The words in $L_i^j \cup R_i^j$ with $|L_i^j ||R_i^j |= 0$ should be ordered using the same method. If we define an ordering on $\{(x^k, y^m) \mid x\in L_1, y \in R_1, 0< k< l, 0< m < l\}$ and proceed in the same manner as above, we also obtain an algorithm for non-overlapping codes without runs of length l.

7 Conclusions

This paper develops three methods that guarantee the generation of maximal non-overlapping codes satisfying the $\epsilon $-balance constraint, the run-length limit, and the inclusion of both constraints simultaneously. Moreover, we provide simpler algorithms to obtain polarity-balanced non-overlapping codes with restricted run-length. We also offer relations with combinatorial objects that should be enumerated to determine the corresponding code sizes.

We demonstrate that none of the existing methods for constructing polarity-balanced non-overlapping codes is optimal for all alphabets and block sizes. The only tested parameter value where Construction 1 fails to provide the optimum suggests that it might be optimal when $n/2-1$ is a prime power. To support this claim, the algorithm’s search space used to determine the maximum size of Construction 4 should be sufficiently reduced to enable computations for larger block sizes. In particular, we believe that it is possible to show that there exists a maximum $\epsilon $-balanced non-overlapping code satisfying $|L_i^j ||R_i^j |= 0$ for all $i > n/2$ and all valid choices of j. Moreover, for each $i > n/2$, there should exist a polynomial in $\{|L_k^l |, |R_k^l |\mid k \le n- i, \; l \le \lfloor 0.5+\epsilon ) n\rfloor - j\}$ that determines which part of the partition $(L_i^j, R_i^j)$ is empty.

For the application in DNA-based storage, non-overlapping should have a large Hamming distance in addition to the constraints studied in this paper. Since our constructions generate maximal codes, the expected distance is small. If $x,z \in L_{i}$ and $y,w \in R_{j-i}$, the Hamming distance between xy and zw clearly equals $d_H(xy,zw) = d_H(x,z) + d_H(y,w)$. An open problem remains characterising pairs of words $x \in L_i R_{n-i}$ and $y \in L_jR_{n-j}$ at a large distance for $i \ne j$.

Data Availability

Not applicable.

References

Barcucci E, Bilotta S, Pergola E, Pinzani R, Succi J (2016) Cross-bifix-free sets generation via Motzkin paths. RAIRO-Theoretical Informatics and Applications-Informatique Théorique et Applications 50(1):81–91
Article MathSciNet MATH Google Scholar
Blackburn SR (2015) Non-overlapping codes. IEEE Transactions on Information Theory 61(9):4890–4894
Article MathSciNet MATH Google Scholar
Bilotta S, Pergola E, Pinzani R (2012) A new approach to cross-bifix-free sets. IEEE Transactions on Information Theory 58(6):4058–4063
Article MathSciNet MATH Google Scholar
Chee YM, Kiah HM, Purkayastha P, Wang C (2013) Cross-bifix-free codes within a constant factor of optimality. IEEE Transactions on Information Theory 59(7):4668–4674
Article MathSciNet MATH Google Scholar
Donaghey R (1980) Automorphisms on Catalan trees and bracketings. Journal of Combinatorial Theory, Series B 29(1):75–90
Article MathSciNet MATH Google Scholar
Flajolet P, Sedgewick R (2009) Analytic Combinatorics. Cambridge University Press, Cambridge
Book MATH Google Scholar
Immink KAS (2004) Codes for Mass Data Storage Systems. Shannon Foundation Publisher, Einhoven, The Netherlands
MATH Google Scholar
Levenshtein VI (1970) Maximum number of words in codes without overlaps. Problemy Peredachi Informatsii 6(4):88–90
MATH Google Scholar
Levy M, Yaakobi E (2018) Mutually uncorrelated codes for DNA storage. IEEE Transactions on Information Theory 65(6):3671–3691
Article MathSciNet MATH Google Scholar
MacFie A (2019) Enumerative properties of restricted words and compositions. PhD thesis, Carleton University
Stanovnik L, Moškon M, Mraz M (2024) In search of maximum non-overlapping codes. Designs, Codes and Cryptography, 1–28
Weber JH, Immink KA, Siegel PH, Swart TG (2013) Polarity-balanced codes. In: 2013 Information Theory and Applications Workshop (ITA), pp. 1–5 . IEEE
Wang G, Wang Q (2022) $Q$-ary non-overlapping codes: a generating function approach. IEEE Transactions on Information Theory 68(8):5154–5164
Article MathSciNet MATH Google Scholar
Yazdi ST, Kiah HM, Gabrys R, Milenkovic O (2018) Mutually uncorrelated primers for DNA-based data storage. IEEE Transactions on Information Theory 64(9):6283–6296
Article MathSciNet MATH Google Scholar

Download references

Funding

This research was partially supported by the scientific research program P2-0359 and by the basic research project J1-50024, both financed by the Slovenian Research and Innovation Agency and by the infrastructure program ELIXIR-SI RI-SI-2 financed by the European Regional Development Fund, the Ministry of Science, Education and Sports and by the Slovenian Research and Innovation Agency.

Author information

Authors and Affiliations

Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, 1000, Ljubljana, Slovenia
Lidija Stanovnik, Miha Moškon & Miha Mraz

Authors

Lidija Stanovnik
View author publications
You can also search for this author in PubMed Google Scholar
Miha Moškon
View author publications
You can also search for this author in PubMed Google Scholar
Miha Mraz
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Lidija Stanovnik: Conceptualization, Formal analysis, Writing-original draft. Miha Moškon: Writing-review & editing, Funding acquisition. Miha Mraz: Supervision, Writing-review & editing, Funding acquisition.

Corresponding author

Correspondence to Lidija Stanovnik.

Ethics declarations

Conflict of interest

All authors certify that they have no affiliations with or involvement in any organisation or entity with any financial or non-financial interest in the subject matter or materials discussed in this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised due to a retrospective Open Access order.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Stanovnik, L., Moškon, M. & Mraz, M. On maximal almost balanced non-overlapping codes and non-overlapping codes with restricted run-lengths. Comp. Appl. Math. 44, 109 (2025). https://doi.org/10.1007/s40314-024-03055-0

Download citation

Received: 19 June 2024
Revised: 22 October 2024
Accepted: 15 December 2024
Published: 08 January 2025
DOI: https://doi.org/10.1007/s40314-024-03055-0

Keywords

Mathematics Subject Classification

94B25

On maximal almost balanced non-overlapping codes and non-overlapping codes with restricted run-lengths

Abstract

Similar content being viewed by others

In search of maximum non-overlapping codes

On the maximum size of variable-length non-overlapping codes

Anticode-based locally repairable codes with high availability

1 Introduction

2 Preliminaries

2.1 Notation and definitions

Remark 1

Remark 2

2.2 Constructions

Construction 1

Construction 2

Proposition 1

Proof

Construction 3

Proposition 2

Proposition 3

3 Balanced non-overlapping codes

Proposition 4

Proof

Proposition 5

Proof

Proposition 6

Proof

Proposition 7

Proof

Proposition 8

Proof

Construction 4

Proof

Proposition 9

Proof

Proposition 10

Proof

4 Restricting the length of the maximal run

Proposition 11

Proof

Lemma 1

Corollary 1

Proposition 12

Proof

Construction 5

5 Balanced non-overlapping codes with restricted run-lengths

Construction 6

Construction 7

Proof

Proposition 13

Proof

Proposition 14

Proof

Proposition 15

Construction 8

Proposition 16

Proof

6 Enumerative encoding

7 Conclusions

Data Availability

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Search

Navigation