Simple Linear Work Su X Array Construction: Abstract. A Su X Array Represents The Su Xes of A String in Sorted
Simple Linear Work Su X Array Construction: Abstract. A Su X Array Represents The Su Xes of A String in Sorted
Simple Linear Work Su X Array Construction: Abstract. A Su X Array Represents The Su Xes of A String in Sorted
n
DB
logM
B
n
B
log
2
n
I/Os
O
nlogM
B
n
B
log
2
n
internal work
integer [9]
O
n
DB
logM
B
n
B
I/Os
O
nlogM
B
n
B
internal work
integer [14],skew
Cache Oblivious [15]
M/B cache blocks of size B
O
n
B
logM
B
n
B
log
2
n
n
B
logM
B
n
B
nlog n
P
+ (L +
gn
P
)
log
3
n log P
log(n/P)
nlog n
P
+Llog
2
P +
gnlog n
P log(n/P)
n
1
processors O
n/P +Llog
2
P +gn/P
log
4
n
log
2
n
log
2
n
i
[1, 2n/3] for the triples s[i, i + 2] with i mod 3 ,= 0, i.e.,
numbers with the property that s
i
s
j
if and only if s[i, i + 2] s[j, j + 2].
This can be done in linear time by radix sort and scanning the sorted sequence
Simple Linear Work Sux Array Construction 947
of triples if triple s[i, i +2] is the k-th dierent triple appearing in the sorted
sequence, we set s
i
= k.
If all triples get dierent lexicographic names, we are done with step one.
Otherwise, the sux array SA
12
of the string
s
12
= [s
i
: i mod 3 = 1] [s
i
: i mod 3 = 2]
is computed recursively. Note that there can be no more lexicographic names
than characters in s
12
so that the alphabet size in a recursive call never exceeds
the size of the string. The recursively computed sux array SA
12
represents
the desired order of the suxes S
i
with i mod 3 ,= 0. To see this, note that
s
12
[
i1
3
,
n
3
) for i mod 3 = 1 represents the sux S
i
= s[i, n)[0] via lexicographic
naming. The 0 characters at the end of s make sure that s
12
[n/3 1] is unique
in s
12
so that it does not matter that s
12
has additional characters. Similarly,
s
12
[
n+i2
3
,
2n
3
) for i mod 3 = 2 represents the sux S
i
= s[i, n) [0, 0].
The second step is easy. The suxes S
i
with i mod 3 = 0 are sorted by sorting
the pairs (s[i], S
i+1
). Since the order of the suxes S
i+1
is already implicit in
SA
12
, it suces to stably sort those entries SA
12
[j] that represent suxes S
i+1
,
i mod 3 = 0, with respect to s[i]. This is possible in linear time by a single pass
of radix sort.
The skew algorithm is so simple because also the third step is quite easy.
We have to merge the two sux arrays to obtain the complete sux array SA.
To compare a sux S
j
with j mod 3 = 0 with a sux S
i
with i mod 3 ,= 0, we
distinguish two cases:
If i mod 3 = 1, we write S
i
as (s[i], S
i+1
) and S
j
as (s[j], S
j+1
). Since i +
1 mod 3 = 2 and j + 1 mod 3 = 1, the relative order of S
j+1
and S
i+1
can
be determinded from their position in SA
12
. This position can be determined in
constant time by precomputing an array SA
12
with SA
12
[i] = j +1 if SA
12
[j] = i.
This is nothing but a special case of lexicographic naming.
1
Similarly, if i mod 3 = 2, we compare the triples (s[i], s[i + 1], S
i+2
) and
(s[j], s[j + 1], S
j+2
) replacing S
i+2
and S
j+2
by their lexicographic names in
SA
12
.
The running time of the skew algorithm is easy to establish.
Theorem 1. The skew algorithm can be implemented to run in time O(n).
Proof. The execution time obeys the recurrence T(n) = O(n) + T(,2n/3|),
T(n) = O(1) for n < 3. This recurrence has the solution T(n) = O(n). .
3 Other Models of Computation
Theorem 2. The skew algorithm can be implemented to achieve the following
performance guarantees on advanced models of computation:
1
SA
12
1 is also known as the inverse sux array of SA
12
.
948 J. Karkkainen and P. Sanders
model of computation complexity alphabet
External Memory [38]
D disks, block size B,
fast memory of size M
O
n
DB
logM
B
n
B
I/Os
O
nlogM
B
n
B
internal work
integer
Cache Oblivious [15] O
n
B
logM
B
n
B
nlog n
P
+Llog
2
P +
gnlog n
P log(n/P)
time general
P = O
n
1
processors
O
n/P +Llog
2
P +gn/P
time
integer
EREW-PRAM [25]
O
log
2
n
log
2
n
log
M/B
n/M
.
BSP: For the case of many processors, we proceed as for the EREW-PRAM
algorithm using the optimal comparison based sorting algorithm [19] that takes
time O(nlog n/P + (gn/P +L)
log n
log(n/P)
).
For the case of few processors, we can use a linear work sorting algorithm
based on radix sort [7] and a linear work merging algorithm [17]. The integer
2
Simpler randomized algorithms with favorable constant factors are also available
[10].
Simple Linear Work Sux Array Construction 949
sorting algorithm remains applicable at least during the rst (log log n) levels
of recursion of the skew algorithm. Then we can aord to switch to a comparison
based algorithm without increasing the overall amount of internal work.
CRCW PRAM: We employ the stable integer sorting algorithm [35] that
works in O(log n) time using linear work for keys with O(log log n) bits. This
algorithm can be used for the rst (log log log n) iterations. Then we can af-
ford to switch to the algorithm [22] that works for polynomial size keys at the
price of being inecient by a factor O(log log n). Lexicographic naming can be
implemented by computing prex sums using linear work and logarithmic time.
Comparison based merging can be implemented with linear work and O(log n)
time using [23]. .
The resulting algorithms are simple except that they may use complicated
subroutines for sorting to obtain theoretically optimal results. There are usually
much simpler implementations of sorting that work well in practice although they
may sacrice determinism or optimality for certain combinations of parameters.
4 Longest Common Prexes
Let lcp(i, j) denote the length of the longest common prex (lcp) of the suxes
S
i
and S
j
. The longest common prex array LCP contains the lengths of the
longest common prexes of suxes that are adjacent in the sux array, i.e.,
LCP[i] = lcp(SA[i], SA[i + 1]). A well-known property of lcps is that for any
0 i < j < n,
lcp(i, j) = min
ik<j
LCP[k] .
Thus, if we preprocess LCP in linear time to answer range minimum queries in
constant time [3,4,24], we can nd the longest common prex of any two suxes
in constant time.
We will show how the LCP array can be computed from the LCP
12
array
corresponding to SA
12
in linear time. Let j = SA[i] and k = SA[i + 1]. We
explain two cases; the others are similar.
First, assume that j mod 3 = 1 and k mod 3 = 2, and let j
= (j 1)/3 and
k
and k
in SA
12
, and thus = lcp
12
(j
, k
) = LCP
12
[SA
12
[j
] 1].
Then LCP[i] = lcp(j, k) = 3 +lcp(j +3, k +3), where the last term is at most
2 and can be computed in constant time by character comparisons.
As the second case, assume j mod 3 = 0 and k mod 3 = 1. If s[j] ,= s[k],
LCP[i] = 0 and we are done. Otherwise, LCP[i] = 1 + lcp(j + 1, k + 1), and we
can compute lcp(j + 1, k + 1) as above as 3 + lcp(j + 1 + 3, k + 1 + 3), where
= lcp
12
(j
, k
) with j
and k