

Run Generation Revisited: What Goes Up May or May Not Come Down

2015, Lecture Notes in Computer Science

Michael A. Bender†, Samuel McCauley†, Andrew McGregor‡, Shikha Singh†, Hoa T. Vu§

arXiv:1504.06501v1 [cs.DS] 24 Apr 2015

Abstract

In this paper, we revisit the classic problem of run generation. Run generation is the first phase of external-memory sorting, where the objective is to scan through the data, reorder elements using a small buffer of size M, and output runs (contiguously sorted chunks of elements) that are as long as possible. We develop algorithms for minimizing the total number of runs (or equivalently, maximizing the average run length) when the runs are allowed to be sorted or reverse sorted. We study the problem in the online setting, both with and without resource augmentation, and in the offline setting.

• We analyze alternating-up-down replacement selection (runs alternate between sorted and reverse sorted), which was studied by Knuth as far back as 1963. We show that this simple policy is asymptotically optimal. Specifically, we show that alternating-up-down replacement selection is 2-competitive and no deterministic online algorithm can perform better.

• We give online algorithms having smaller competitive ratios with resource augmentation. Specifically, we exhibit a deterministic algorithm that, when given a buffer of size 4M, is able to match or beat any optimal algorithm having a buffer of size M. Furthermore, we present a randomized online algorithm which is 7/4-competitive when given a buffer twice that of the optimal.

• We demonstrate that performance can also be improved with a small amount of foresight. We give a 3/2-competitive algorithm that has foreknowledge of the next 3M elements of the input stream. For the extreme case where all future elements are known, we design a PTAS for computing the optimal strategy a run generation algorithm must follow.

• We present algorithms tailored for "nearly sorted" inputs, which are guaranteed to have optimal solutions with sufficiently long runs.

† Department of Computer Science, Stony Brook University, Stony Brook, NY 11794-4400, USA. Email: {bender,smccauley,shiksingh}@cs.stonybrook.edu.
‡ Department of Computer Science, University of Massachusetts, Amherst, MA 01003, USA. Email: {mcgregor,hvu}@cs.umass.edu.

1 Introduction

External-memory sorting algorithms are tailored for data sets too large to fit in main memory. Generally, these algorithms begin their sort by bringing chunks of data into main memory, sorting within memory, and writing back out to disk in sorted sequences, called runs [15, 19, 26, 34]. We revisit the classic problem of how to maximize the length of these runs, the run-generation problem. The run-generation problem has been studied in its various guises for over 50 years [14, 17–19, 25, 30, 31, 34].

The most well-known external-memory sorting algorithm is multi-way merge sort [1, 8, 15, 22, 28, 29, 40, 42, 44]. Multi-way merge sort is formalized in the disk-access machine (DAM) model of Aggarwal and Vitter [1]. (The external-memory model, also called the I/O model, applies to any two levels of the memory hierarchy.) If M is the size of RAM and data is transferred between main memory and disk in blocks of size B, then an M/B-way merge sort has a complexity of $O\big((N/B)\log_{M/B}(N/B)\big)$ I/Os, where N is the number of elements to be sorted. This is the best possible [1].
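To see where this bound comes from (a standard textbook calculation, not an argument made in this paper), one can write the I/O cost of M/B-way merge sort as a recurrence and unroll it:

```latex
% Each level of the recursion scans all N elements once, costing O(N/B) I/Os;
% the recursion bottoms out when a subproblem fits in memory, i.e., has size O(M).
T(N) = \frac{M}{B}\, T\!\left(\frac{N}{M/B}\right) + O\!\left(\frac{N}{B}\right),
\qquad T(O(M)) = O\!\left(\frac{M}{B}\right).
% There are \Theta(\log_{M/B}(N/M)) = \Theta(\log_{M/B}(N/B)) levels, hence
T(N) = O\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right).
```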
A top-down description of multi-way merge sort follows. Divide the input into M/B subproblems, recursively sort each subproblem, and merge the sorted subproblems together in one final scan through the input. The base case is reached when each subproblem has size O(M), and therefore fits into RAM. A bottom-up description of the algorithm starts with the base case, which is the run-generation phase.

Naïvely, we can always generate runs of length M: ingest M elements into memory, sort them, write them to disk, and then repeat. The point of run generation is to produce runs longer than M. After all, with typical values of N and M, we rarely need more than one or two passes over the data after the initial run-generation phase. Longer runs can mean fewer passes over the data or less memory consumption during the merge phase of the sort. Because there are few scans to begin with, even saving a single scan decreases the cost of a merge sort by a significant percentage. Run generation has further advantages in databases even when a full sort is not required [22, 24].

Replacement Selection. The classic algorithm for run generation is called replacement selection [20, 26, 28]. We describe replacement selection below, assuming that elements can be read into memory and written to disk one at a time. To create an increasing run starting from an initially full internal memory, proceed as follows:

1. From main memory, select the smallest element at least as large as every element in the current run. (Data structures such as in-memory heaps can be used to identify the smallest elements in memory. However, from the perspective of minimizing I/Os, this does not matter, since computation is free in the DAM model.)
2. If no such element exists, then the run ends; select the smallest element in the buffer.
3. Eject that element, and ingest the next element, so that the memory stays full.

Replacement selection can deal with input elements one at a time, even though the DAM model transfers input between RAM and disk B elements at a time. To see why, consider two additional blocks in memory: an "input block," which stores elements recently read from disk, and an "output block," which stores elements that have already been placed in a run and will be written back to disk. To ingest an element, take it from the input block; to eject an element, put it in the output block. When the input block becomes empty, fill it from disk, and when the output block fills up, flush it to disk. As in previous work, we ignore these two blocks in this paper.

Properties of Replacement Selection. It has been known for decades that when the input appears in random order, the expected length of a run is actually 2M, not M [18, 19, 25]. In [26], Knuth gives memorable intuition for this result, conceptualizing the buffer as a snowplow traveling along a circular track. Replacement selection performs particularly well on nearly sorted data (for many intuitive notions of "nearly"), and the runs generated are much longer than M. For example, when each element in the input appears at distance at most M from its actual rank, replacement selection produces a single run.

On the other hand, replacement selection performs poorly on reverse-sorted data. It produces runs of length M, which is the worst possible.
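For concreteness, here is a minimal sketch of the classic (up-runs-only) policy in Python. It is illustrative only: the function name and streaming interface are ours, the buffer is a min-heap of elements that can still join the current run, and the input is assumed to contain at least M elements.

```python
import heapq

def replacement_selection(stream, M):
    """Yield sorted runs (as lists), using a buffer of capacity M."""
    it = iter(stream)
    active = [next(it) for _ in range(M)]   # fill the buffer; assumes >= M elements
    heapq.heapify(active)
    deferred = []                           # elements that must wait for the next run
    run = []
    for x in it:
        smallest = heapq.heappop(active)    # step 1: smallest element that fits the run
        run.append(smallest)
        # step 3: ingest x so memory stays full; it joins the current run only
        # if it is at least as large as the element just written
        if x >= smallest:
            heapq.heappush(active, x)
        else:
            deferred.append(x)
        if not active:                      # step 2: the run cannot be extended
            yield run
            run, active, deferred = [], deferred, []
            heapq.heapify(active)
    run.extend(sorted(active))              # drain the buffer at end of input
    if run:
        yield run
    if deferred:
        yield sorted(deferred)
```

The up-down variants studied in this paper differ only in the choice made when `active` runs empty: the next run may be written in either direction.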
Up-Down Replacement Selection. From the perspective of the sorting algorithm, it matters little, or not at all, whether the initially generated runs are sorted or reverse sorted. This observation has motivated researchers to think about run generation when the replacement-selection algorithm has a choice, each time a new run begins, about whether to generate an up run or a down run.

Knuth [25] analyzes the performance of replacement selection that alternates deterministically between generating up runs and down runs. He shows that for randomly generated data, this alternating policy performs worse, generating runs of expected length 3M/2 instead of 2M. Martinez-Palau et al. [34] revive this idea in an experimental study. Their two-way-replacement-selection algorithms heuristically choose whether the run generation should go up or down. Their experiments find that two-way replacement selection (1) is slightly worse than replacement selection for random input (in accordance with Knuth [25]) and (2) produces significantly longer runs on reverse-sorted inputs and on inputs that have mixed up-down runs.

Our Contributions. The results in the present paper complement these earlier results. In contrast to Knuth's negative result for random inputs [25], we show that strict up-down alternation is best possible for worst-case inputs. Moreover, we give better competitive ratios with resource augmentation and lookahead, which helps explain why heuristically choosing between up and down runs based on what is currently in memory may lead to better solutions. Resource augmentation is a standard tool used in competitive analysis [9, 11–13, 38, 39] to empower an online algorithm when comparing against an omniscient and all-powerful optimal algorithm.

Up-down run generation boils down to figuring out, each time a run ends, whether the next run should be an up run or a down run. The objective is to minimize the number of runs output. (Note that for a given input, minimizing the number of runs output is equivalent to maximizing the average length of the runs output.) We establish the following:

1. Analysis of alternating-up-down replacement selection. We revisit (online) alternating-up-down replacement selection, which was earlier analyzed by Knuth [25]. We prove that alternating-up-down replacement selection is 2-competitive and asymptotically optimal for deterministic algorithms. To put this result in context, it is known that up-only replacement selection is a constant factor better than up-down replacement selection for random inputs, but can be an unbounded factor worse than optimal for arbitrary inputs.

2. Resource augmentation with extra buffer. We analyze the effect of augmenting the buffer available to an online algorithm on its performance. We show that with a constant-factor larger buffer, it is possible to perform better than twice optimal. Specifically, we exhibit a deterministic algorithm that, when given a buffer of size 4M, matches or beats any optimal algorithm having a buffer of size M. We also design a randomized online algorithm which is 7/4-competitive using a 2M-size buffer.

3. Resource augmentation with extra visibility. We show that performance can also be improved, without augmenting the buffer, if an algorithm has limited foreknowledge of the input. In particular, we propose a deterministic algorithm which attains a competitive ratio of 3/2, using its regular buffer of size M, with a lookahead of 3M incoming elements of the input (at each step).

4. Better bounds for nearly sorted data. We give algorithms that perform well on inputs that have some inherent sortedness. We show that the greedy offline algorithm is optimal for inputs on which the optimal runs are at least 5M elements long.
We also give a 3/2-competitive algorithm with a 2M-size buffer when the optimal runs are at least 3M long. These results are reminiscent of previous literature studying sorting on inputs with "bounded disorder" [10] and adaptive sorting algorithms [16, 33, 41].

5. PTAS for the offline problem. We give a polynomial-time approximation scheme for the offline run-generation problem. Specifically, our offline polynomial-time approximation algorithm guarantees a (1 + ε)-approximation to the optimal solution. We first give an algorithm with running time $O(2^{1/\varepsilon} N \log N)$ and then improve the running time to $O\big(((1+\sqrt{5})/2)^{1/\varepsilon}\, N \log N\big)$.

Paper Outline. The paper is organized as follows. In Section 2, we formalize the up-down run generation problem and provide the necessary notation. Section 3 contains important structural properties of run generation and key lemmas used in analyzing our algorithms. The analysis of alternating-up-down replacement selection and online lower bounds are in Section 4. Algorithms with resource augmentation, along with properties of the greedy algorithm, are presented in Section 5. The offline version of the problem is studied in Section 6. Improvements on well-sorted input are presented in Section 7. Section 8 summarizes related work, and we conclude with open problems in Section 9. Due to space constraints, we defer some proofs to the appendix (Appendix A).

2 Up-Down Run Generation

In this section, we formalize the up-down run generation problem and introduce notation.

2.1 Problem Definition

An instance of the up-down run generation problem is a stream I of N elements. The elements of I are presented to the algorithm one by one, in order. They can be stored in the memory of size M available to the algorithm, which we henceforth refer to as the buffer. Each element occupies one slot of the buffer. In general, the model allows duplicate elements, although some results, particularly in Section 5 and Section 7, do require uniqueness.

We say that an algorithm A reads an element of I when A transfers the element from the input sequence to the buffer. We say that an algorithm A writes an element when A ejects the element from its buffer and appends it to the output sequence S. Every time an element is written, its slot in the buffer becomes free. Unless stated otherwise, the next element from the input takes up the freed slot. Thus the buffer is always full, except when the end of the input is reached and there are fewer than M unwritten elements. (Reading the next element of the input when there is a free slot in the buffer never hurts the performance of any algorithm. However, we allow the algorithm in the proof of Lemma 16 to maintain free slots in the buffer to simplify the analysis.)

An algorithm can decide which element to eject from its buffer based on (a) the current contents of the buffer and (b) the last element written. The algorithm may also use o(M) additional words to maintain its internal state (for example, it can store the direction of the current run). However, the algorithm cannot arbitrarily access S or I; it can only append elements to S and access the next in-order element of I. We say the algorithm is at time step t if it has written exactly t elements.

A run is a sequence of sorted or reverse-sorted elements. The cost of the algorithm is the smallest number of runs into which its output can be partitioned. Specifically, the number of runs in an output S, denoted R(S), is the smallest number of mutually disjoint sequences S1, S2, . . . , SR(S) such that each Si is a run and S = S1 ◦ · · · ◦ SR(S), where ◦ denotes concatenation.
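To make the cost measure concrete, the following sketch (ours, not from the paper) computes R(S) with a single greedy left-to-right scan, extending the current run as far as it stays non-decreasing or non-increasing; a standard exchange argument shows that greedy maximal extension yields the minimum contiguous partition.

```python
def count_runs(S):
    """Minimum number of up/down runs partitioning the sequence S."""
    if not S:
        return 0
    runs = 1
    direction = 0            # 0: undecided, +1: increasing run, -1: decreasing run
    for prev, cur in zip(S, S[1:]):
        if cur == prev:
            continue         # equal elements extend a run in either direction
        step = 1 if cur > prev else -1
        if direction == 0:
            direction = step # first strict comparison fixes the run's direction
        elif step != direction:
            runs += 1        # direction violated: a new run starts here
            direction = 0
    return runs

# e.g., count_runs([1, 3, 5, 4, 2, 2, 7]) == 3   # (1,3,5) (4,2,2) (7)
```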
We let OPT(I) be the minimum number of runs of any possible output sequence on input I, i.e., the number of runs generated by the optimal offline algorithm. If I is clear from context, we denote this as OPT. Our goal is to give algorithms that perform well compared to OPT for every I. We say that an online algorithm is β-competitive if on any input, its output S satisfies R(S) ≤ β·OPT.

At any time step, an algorithm's unwritten-element sequence comprises the contents of the buffer, concatenated with the remaining (not yet ingested) input elements. For the purpose of this definition, we assume that the elements in the buffer are stored in their arrival order (their order in the input sequence I).

Time step t is a decision point or decision time step for an algorithm A if t = 0 or if A finished writing a run at t. At a decision point, A needs to decide whether the next run will be increasing or decreasing.

2.2 Notation

We employ the following notation. We use (x ↗ y) to denote the increasing sequence x, x + 1, x + 2, . . . , y and (x ↘ y) to denote the decreasing sequence x, x − 1, x − 2, . . . , y. We use ◦ to denote concatenation: if A = a1, a2, . . . , ak and B = b1, b2, . . . , bℓ, then A ◦ B = a1, a2, . . . , ak, b1, b2, . . . , bℓ.

Let A = a1, a2, . . . , ak. We use A ⊕ x to denote the sequence a1 + x, a2 + x, . . . , ak + x. Similarly, we use A ⊗ x to denote the sequence a1·x, a2·x, . . . , ak·x.

Let A, B be sequences. We say A covers B if for all e ∈ B, we have e ∈ A. A subsequence of a sequence A = a1, . . . , ak is a sequence B = an1, an2, . . . , anℓ where 1 ≤ n1 < n2 < . . . < nℓ ≤ k.

3 Structural Properties

In this section, we identify structural properties of the problem and the tools used in the analysis of our algorithms, which will be important in the rest of the paper.

3.1 Maximal Runs

We show that in run generation, it is never a good idea to end a run early, and never a good idea to "skip over" an element (keeping it in the buffer instead of writing it out as part of the current run). To begin, we show that adding elements to an input sequence never decreases the number of runs. Note that if S′ is a subsequence of S, then R(S′) ≤ R(S) by definition.

Lemma 1. Consider two input streams I and I′. If I′ is a subsequence of I, then OPT(I′) ≤ OPT(I).

Proof. Let A be an algorithm with input stream I and output S. Suppose that A produces the optimal number of runs on I, that is, R(S) = OPT(I). Consider an algorithm A′ on I′. Algorithm A′ performs the same operations as A, but when it reaches an element that is not in I′ (but is in I), it executes a no-op. These no-ops mean that the buffer of A′ may not be completely full, since elements that A has in its buffer do not exist in the buffer of A′. Let S′ be the output of A′; S′ is a subsequence of S. Then OPT(I′) ≤ R(S′) ≤ R(S) = OPT(I).

A maximal increasing run is a run generated using the following rules (a maximal decreasing run is defined similarly):

1. Start with the smallest element in the buffer, and always write the smallest element that is at least as large as the last element written.
2. End the run only when no element in the buffer can continue the run, i.e., all elements in the buffer are smaller than the last element written.
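A short sketch of these two rules may help (ours; the helper name is hypothetical), in the same one-in-one-out model as the replacement-selection code above:

```python
import heapq

def write_maximal_up_run(buffer_elems, stream_iter, out):
    """Write one maximal increasing run to `out`; return the deferred elements."""
    active = list(buffer_elems)        # elements that can still join this run
    heapq.heapify(active)
    deferred = []                      # elements too small to join the current run
    while active:
        smallest = heapq.heappop(active)
        out.append(smallest)           # rule 1: smallest element >= last one written
        nxt = next(stream_iter, None)  # one-in-one-out: keep the buffer full
        if nxt is None:
            continue                   # input exhausted; just drain the buffer
        if nxt >= smallest:
            heapq.heappush(active, nxt)
        else:
            deferred.append(nxt)
    return deferred                    # rule 2: run ends; these seed the next run
```

A maximal decreasing run is symmetric (a max-heap, or negated keys, as sketched in Section 4).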
Lemma 2. At any decision time step, a maximal increasing (decreasing) run r covers every other (non-maximal) increasing (decreasing) run r′.

A proper algorithm is an algorithm that always writes maximal runs. We say an output is proper if it is generated by a proper algorithm. We show that there always exists an optimal proper algorithm.

Theorem 3. For any input I, there exists a proper algorithm A with output S such that R(S) = OPT(I).

Proof. We prove this by induction on the number of runs. If there is only one run, it must be maximal. Assume that all inputs It with OPT(It) = t have an optimal proper algorithm. Consider an input It+1 with OPT(It+1) = t + 1. Assume that an optimal algorithm on It+1 is AO, and that it is not proper; we will construct a proper A with the same number of runs. The first run A writes is maximal and has the same direction as the first run that AO writes; the first run AO writes may or may not be maximal. Then A is left with an unwritten-element sequence IA, and AO is left with IO. Note that OPT(IO) = t by definition. By Lemma 2, A's first run covers AO's first run, so IA is a subsequence of IO. Then by Lemma 1, OPT(IA) ≤ OPT(IO). By the inductive hypothesis, IA has an optimal proper algorithm. Thus A is a proper algorithm generating the optimal number of runs.

In conclusion, we have established that it always makes sense for an algorithm to write maximal runs. Furthermore, we use the following property of proper algorithms throughout the rest of the paper.

Property 4. Any proper algorithm satisfies the following two properties:

1. At each decision point, the elements of the buffer must have arrived while the previous run was being written.
2. A new element cannot be included in the current run if the element just written out is larger (smaller) and the current run is increasing (respectively, decreasing).

3.2 Analysis Toolbox

We now present observations and lemmas that play an integral role in analyzing the algorithms presented in the rest of the paper.

Observation 5. Consider algorithms A1 and A2 on input I. Suppose that at time step t1, algorithm A1 has written out all the elements that algorithm A2 already wrote out by some previous time step t2. Then the unwritten-element sequence of algorithm A1 at time step t1 forms a subsequence of the unwritten-element sequence of algorithm A2 at time step t2.

Lemma 6. Consider a proper algorithm A. Suppose that, at some decision time step, A can write k runs p1 ◦ · · · ◦ pk or ℓ runs q1 ◦ · · · ◦ qℓ such that |p1 ◦ · · · ◦ pk| ≥ |q1 ◦ · · · ◦ qℓ|. Then p1 ◦ · · · ◦ pk ◦ pk+1, where pk+1 is either an up or down run, covers q1 ◦ · · · ◦ qℓ. Therefore, the unwritten-element sequence after A writes pk+1 (if A writes p1 ◦ · · · ◦ pk+1) is a subsequence of the unwritten-element sequence after A writes qℓ (if A writes q1 ◦ · · · ◦ qℓ).

Proof. Since |p1 ◦ · · · ◦ pk| ≥ |q1 ◦ · · · ◦ qℓ|, the elements that are in q1 ◦ · · · ◦ qℓ but not in p1 ◦ · · · ◦ pk have to be in the buffer when pk ends. By Property 4, pk+1 will write all such elements.

The next theorem serves as a template for analyzing the algorithms in this paper. It helps us restrict our attention to comparing the output of our algorithm against that of the optimal in small partitions. We show that if in every partition i, an algorithm writes xi runs that cover the first yi runs of an optimal output (on the current unwritten-element sequence), and xi/yi ≤ β, then the algorithm outputs no more than β·OPT runs.
Theorem 7. Let A be an algorithm with output S. Partition S into k contiguous subsequences S1, S2, . . . , Sk. Let xi be the number of runs in Si. For 1 < i ≤ k, let Ii be the unwritten-element sequence after A outputs Si−1; let I1 = I and Ik+1 = ∅. Let α, β ≥ 1. For each Ii, let Si′ be the output of an optimal algorithm on Ii. If for all i ≤ k, Si covers the first yi runs of Si′ and xi/yi ≤ β, then R(S) ≤ β·OPT. Similarly, if for all i ≤ k, Si covers the first yi runs of Si′ and E[xi]/yi ≤ α, then E[R(S)] ≤ α·OPT.

Proof. For each i, let I′i+1 be the unwritten-element sequence after the first yi runs of Si′ are written (and let I′1 = I). We show by induction that $\mathrm{OPT}(I_i) \le \mathrm{OPT} - \sum_{j=1}^{i-1} y_j$ for all 1 ≤ i ≤ k + 1. The base case is OPT(I1) = OPT. For the inductive step, assume $\mathrm{OPT}(I_i) \le \mathrm{OPT} - \sum_{j=1}^{i-1} y_j$. Since Si covers the first yi runs of Si′, by Observation 5, Ii+1 is a subsequence of I′i+1, and so by Lemma 1, OPT(Ii+1) ≤ OPT(I′i+1). Moreover, OPT(I′i+1) ≤ OPT(Ii) − yi, since the remaining runs of Si′ form a valid output on I′i+1. Combining these with the inductive hypothesis,

$\mathrm{OPT}(I_{i+1}) \le \mathrm{OPT}(I_i) - y_i \le \mathrm{OPT} - \sum_{j=1}^{i} y_j.$

When i = k, this gives $\mathrm{OPT}(I_{k+1}) \le \mathrm{OPT} - \sum_{j=1}^{k} y_j$. But Ik+1 contains no elements, so OPT(Ik+1) = 0 and hence $\sum_{j=1}^{k} y_j \le \mathrm{OPT}$. Since $R(S) = \sum_{i=1}^{k} x_i \le \beta \sum_{i=1}^{k} y_i$, we conclude R(S) ≤ β·OPT. The same argument holds in expectation: $\mathbb{E}[R(S)] = \mathbb{E}\big[\sum_{i=1}^{k} x_i\big] \le \alpha \sum_{i=1}^{k} y_i \le \alpha \cdot \mathrm{OPT}$.

4 Up-Down Replacement Selection

We begin by analyzing alternating-up-down replacement selection, which deterministically alternates between writing (maximal) up and down runs. Knuth [25] showed that when the input elements arrive in a random order (all permutations of the input are equally likely), alternating-up-down replacement selection performs worse than standard replacement selection (all up runs). Specifically, he showed that the expected length of runs generated by up-down replacement selection is 1.5M on random input, compared to the expected length of 2M for replacement selection.

In this section, we show that for deterministic online algorithms, alternating-up-down replacement selection is, in fact, asymptotically optimal for any input. It generates at most twice the optimal number of runs in the worst case. This is the best possible: no deterministic algorithm can have a better competitive ratio.
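Written in terms of the maximal-run helper sketched in Section 3, the policy itself is a short loop (our sketch; the down-run writer is derived from the up-run writer by negating keys, so elements are assumed numeric):

```python
import itertools

def write_maximal_down_run(buffer_elems, stream_iter, out):
    """Maximal decreasing run, by negating keys and reusing the up-run writer."""
    neg_out = []
    leftover = write_maximal_up_run([-x for x in buffer_elems],
                                    (-x for x in stream_iter), neg_out)
    out.extend(-x for x in neg_out)
    return [-x for x in leftover]

def alternating_up_down(stream, M):
    """Alternate maximal up and down runs (2-competitive, Theorem 9)."""
    it = iter(stream)
    buf = list(itertools.islice(it, M))   # initial fill
    out, up = [], True
    while buf:
        writer = write_maximal_up_run if up else write_maximal_down_run
        buf = writer(buf, it, out)        # returns elements deferred to the next run
        up = not up
    return out
```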
4.1 Alternating-Up-Down Replacement Selection is 2-Competitive

We begin by giving a structural lemma, analyzing identical runs on two inputs in which one input is a subsequence of the other.

Lemma 8. Consider two inputs I1 and I2, where I2 is a subsequence of I1. Let S1 and S2 be proper outputs on I1 and I2 such that:

1. S1 and S2 have initial runs r1 and r2 respectively, and
2. r1 and r2 have the same direction.

Let the unwritten-element sequences after r1 and r2 be I1′ and I2′ respectively. Then I2′ is a subsequence of I1′.

Proof. Assume that r1 and r2 are up runs (a similar analysis works for down runs). Let r2′ be a run that is a subsequence of r1, consisting of all elements of r1 that are also in I2. Then r2′ can be produced by an algorithm A′ that mirrors the algorithm A generating r1. When A reads or writes an element in I2, A′ reads or writes that element; when A reads or writes an element not in I2, A′ does nothing. Since r2 is maximal, it covers r2′ by Lemma 2.

Theorem 9. Alternating up-down replacement selection is 2-competitive.

Proof. We show that we can apply Theorem 7 to this algorithm with β = 2. In any partition that is not the last one of the output, the alternating algorithm writes a maximal up run ru and then a maximal down run rd. We must show that ru ◦ rd covers any run rO written by a proper optimal algorithm on Ir, the unwritten-element sequence at the beginning of the partition.

If rO is an up run, then rO = ru and thus is covered by ru ◦ rd. If rO is a down run, consider I′, the unwritten-element sequence after ru is written; I′ is a subsequence of Ir. By Lemma 8 (with I1 = Ir and I2 = I′), ru ◦ rd covers rO.

In the last partition, the algorithm can write at most two runs while any optimal output must contain at least one run. Hence xi/yi ≤ 2 in all partitions, as required.

4.2 Lower Bounds on Online Algorithms for Up-Down Run Generation

Now, we show that no deterministic online algorithm can hope to perform better than alternating-up-down replacement selection. Then we partially answer the question of whether randomization helps overcome this impossibility result. Specifically, we show that no randomized algorithm can achieve a competitive ratio better than 3/2. We provide the main ideas of the proofs here and defer the details to Appendix A.

Theorem 10. Let A be any online deterministic algorithm with output SI on input I. Then there are arbitrarily long inputs I such that R(SI) ≥ 2·OPT(I).

Proof Sketch. Given any M elements in the buffer, every time A commits to a run direction (up/down), the adversary sets the incoming elements so that they do not help the current run. Thus A is forced to have runs of length at most M, while OPT (since it has knowledge of the future) can do better.

We also give a lower bound for randomized algorithms using similar ideas; however, in this case we do not have a matching upper bound. We use Yao's minimax principle to prove this bound. That is, we generate a randomized input and show that no deterministic algorithm can perform better than 3/2 times OPT on that input against an oblivious adversary.

Theorem 11. Let A be any online randomized algorithm. Then there are arbitrarily long input sequences I such that E[R(SI)] ≥ (3/2)·OPT(I).

5 Run Generation with Resource Augmentation

In this section, we use resource augmentation to circumvent the impossibility result on the performance of deterministic online algorithms. We consider two kinds of augmentation:

• Extra Buffer: The algorithm's buffer is a constant factor larger; that is, it can use its larger buffer to read elements from the input, rearrange them, and write to the output.

• Extra Visibility: The algorithm's buffer is restricted to be of size M, but it has prescience: the algorithm can see some elements in the immediate future (say, the next 3M elements), without the ability to write them early.

We present algorithms that, under the above conditions, achieve a competitive ratio better than 2 when compared against an optimal offline algorithm with a buffer of size M. Resource augmentation is a common tool used in competitive analysis [9, 11–13, 38, 39]. It gives the online algorithm power to make better decisions and excludes worst-case inputs, allowing us to compare its performance, more realistically, against an all-powerful offline optimal algorithm.

The results in this section require the elements of the input to be unique. Duplicate elements can nullify the extra ability to see or write future (non-repeated) elements, which is provided by visibility and buffer augmentation respectively. For example, consider the input

I = (99, 101, 100, . . . , 100, . . .),

in which the element 100 repeats cM − 2 times. On input I, any algorithm with a cM-size buffer or cM-visibility is as powerless as one without any augmentation.
Note that the assumption of distinct elements in run generation is not new. Knuth's analysis of average run lengths [25] also requires uniqueness.

We begin by analyzing the greedy algorithm for run generation. Greedy is a proper algorithm which looks into the future at each decision point, determines the length of the next up run and the next down run, and writes the longer run. Greedy is not an online algorithm. However, it is central to our resource augmentation results. The idea of resource augmentation, in part, is that the algorithm can use the extra buffer or visibility to determine, at each decision point, which direction (up or down) leads to the longer next run. We next look at some guarantees on the length of the run chosen by greedy (the greedy run) and also on the run that is not chosen by greedy (the non-greedy run).

5.1 Greedy is Good but not Great

We first show that greedy is not optimal. The following example demonstrates that greedy can be a factor of 3/2 away from optimal.

Example 12. Consider the input I = I1 ◦ (I1 ⊕ 10M) ◦ (I1 ⊕ 20M) ◦ · · · ◦ (I1 ⊕ 10cM), where

I1 = (4M + 4 ↗ 5M + 3) ◦ (M + 2) ◦ (5M + 4 ↗ 6M + 3) ◦ (2M + 1 ↗ 3M − 1) ◦ (4M + 3 ↘ 3M + 4) ◦ (2M ↘ M + 3) ◦ (M + 1 ↘ 1).

On the input I above, writing down runs repeatedly produces 2c runs: two for each I1 ⊕ 10iM. On the other hand, the output of greedy is S1 ◦ (S1 ⊕ 10M) ◦ · · · ◦ (S1 ⊕ 10cM), where

S1 = (4M + 4 ↗ 6M + 3) ◦ (M + 2) ◦ (2M + 1 ↗ 3M − 1) ◦ (3M + 4 ↗ 4M + 3) ◦ (2M ↘ M + 3) ◦ (M + 1 ↘ 1),

which contains 3c runs.

Next, we show that all runs written by the greedy algorithm (except the last two) are guaranteed to have length at least 5M/4. In contrast, up-down replacement selection can have runs of length M in the worst case.

Theorem 13. Each greedy run, except the last two runs, has length at least M + ⌈⌊M/2⌋/2⌉.

We now bound how far into the future an algorithm must see to be able to determine which direction greedy would pick at a particular decision point. Intuitively, an algorithm should never have to choose between a very long up run and a very long down run. We formalize this idea, that the non-greedy run is never too long, in the following lemma.

Lemma 14. Consider an input I with no duplicate elements. Let the two possible initial increasing and decreasing runs be r1 and r2. Then |r1| < 3M or |r2| < 3M.

The next example shows that the above bound is tight.

Example 15. Consider the input I = I1 ◦ I2 ◦ I3, where

I1 = (1 ↗ M − 1) ⊗ M,
I2 = (M² ↘ M² − M + 1),
I3 = (M − 1 ↘ 1) ◦ (M² + 2 ↗ M² + M + 1).

Then,

r1 = ((1 ↗ M − 1) ⊗ M) ◦ (M² − M + 1 ↗ M² + M + 1),
r2 = (M² ↘ M² − M + 1) ◦ ((M − 1 ↘ 1) ⊗ M) ◦ (M − 1 ↘ 1).

Thus, we have |r1| = 3M and |r2| = 3M − 1.

The following lemma sheds light on the choices made by an optimal algorithm with respect to those of greedy. It says, roughly, that if at any decision point an optimal algorithm chooses to write the non-greedy run, and then writes the next run in the opposite direction, it performs no better than an optimal algorithm which chooses the greedy run in the first place.

Lemma 16. At any decision time step, consider the two possible next maximal runs r1 and r2. If |r1| ≥ |r2|, then one of the following is the prefix of an optimal output on the unwritten-element sequence:

1. r1 ◦ r3, where r3 is a maximal run after r1 and can be either up or down.
2. r2 ◦ r4, where r4 is a maximal run after r2 with the same direction as r2.
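The quantity greedy compares can be computed by simulating both candidate runs. A sketch (ours), reusing the run writers from the earlier sections; by Lemma 14, a window of the next 3M elements suffices when the elements are distinct:

```python
def greedy_direction(buffer_elems, next_3M):
    """'up' iff the maximal up run is at least as long as the maximal down run."""
    up_out, down_out = [], []
    write_maximal_up_run(list(buffer_elems), iter(next_3M), up_out)
    write_maximal_down_run(list(buffer_elems), iter(next_3M), down_out)
    return 'up' if len(up_out) >= len(down_out) else 'down'
```

The simulation only measures lengths; no elements are actually written to the output.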
5.2 Online Algorithms with Resource Augmentation

We now present several online algorithms which use resource augmentation (buffer or visibility) to determine an up-down replacement-selection strategy, beating the competitive ratio of 2. For a concise summary of the results, see Figure 1.

Buffer size | Lookahead | Competitive ratio | Comments
M           | —         | 2                 | Deterministic
2M          | —         | 1.75              | Randomized
M           | 3M        | 1.5               | Deterministic
4M          | —         | 1                 | Deterministic

Figure 1: Summary of online algorithms for run generation on any input.

Matching OPT using a 4M-size Buffer. We present an algorithm with a 4M-size buffer that writes no more runs than an optimal algorithm with an M-size buffer. Later on, we prove that a buffer of size 4M − 2 is necessary even to beat a competitive ratio of 3/2; thus this augmentation result is optimal up to a constant.

Consider the following deterministic algorithm with a 4M-size buffer. The algorithm reads elements until its buffer is full. It then uses the contents of its buffer to determine, for an algorithm with buffer size M, whether the maximal up run or the maximal down run would be longer. If the maximal up run is longer, the algorithm uses its full buffer (of size 4M) to write a maximal up run; otherwise it writes a maximal down run. The algorithm stops when there is no element left to write.

Theorem 17. Let A be the algorithm with a 4M-size buffer described above. On any input I, A never writes more runs than an optimal algorithm with buffer size M.

Proof Sketch. At each decision point, A determines the direction that a greedy algorithm with a buffer of size M would have picked on the same unwritten-element sequence. It is able to do so using its 4M-size buffer because, by Lemma 14, the length of the non-greedy run is bounded by 3M. Note that it does not need to write any elements during this step. In each partition, A writes a maximal run r in the greedy direction and thus covers the greedy run by Lemma 2. Furthermore, r covers the non-greedy run as well, since all of the elements of that run must already be in A's initial buffer and hence get written out. An optimal algorithm (with an M-size buffer), on the unwritten-element sequence, has to choose between the greedy and the non-greedy run. Since A covers both choices of the optimal in one run, by Theorem 7, it matches or beats OPT.

A natural question is whether resource augmentation boosts performance automatically, without the run-simulation technique. The following example shows that our 2-competitive algorithm, even when allowed a 4M-size buffer, may perform no better than with an M-size buffer.

Example 18. Consider the input

(8M ↘ 1) ◦ (16M ↘ 8M + 1) ◦ · · · ◦ (8cM ↘ 8(c − 1)M + 1).

The alternating algorithm from Section 4.1, which alternates maximal up and maximal down runs, writes 2c runs given a 4M-size buffer. In contrast, the optimal output with an M-size buffer has c runs.
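A sketch of Theorem 17's strategy in the same style as before (ours; it reuses greedy_direction and the run writers from above, and assumes distinct elements):

```python
import itertools

def run_generation_4M(stream, M):
    """Simulate an M-buffer greedy choice inside a 4M buffer, then write maximally."""
    it = iter(stream)
    out = []
    buf = list(itertools.islice(it, 4 * M))      # fill the 4M-size buffer
    while buf:
        # buf[:M] is what an M-buffer algorithm would currently hold; buf[M:]
        # (up to 3M elements) is enough lookahead, by Lemma 14, to determine
        # the greedy direction without writing anything.
        d = greedy_direction(buf[:M], buf[M:])
        writer = write_maximal_up_run if d == 'up' else write_maximal_down_run
        buf = writer(buf, it, out)               # write maximally with all 4M slots
        buf.extend(itertools.islice(it, 4 * M - len(buf)))   # refill the buffer
    return out
```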
3/2-competitive using 4M-visibility. When we say that an algorithm has X-visibility (X ≥ M), or equivalently (X − M)-lookahead, we mean that the algorithm has knowledge of the next X elements of its unwritten-element sequence and can use this knowledge when deciding what to write. However, only the usual M-size buffer is used for reading and writing. Furthermore, the algorithm must continue to read elements into its buffer sequentially from I, even if it sees elements further down the stream that it would like to read or rearrange instead.

We present a deterministic algorithm which uses 4M-visibility to achieve a competitive ratio of 3/2. At each decision point, similar to the algorithm in Theorem 17, we can use the 3M-lookahead to determine the direction leading to the longer (greedy) run. However, unlike Theorem 17, we cannot use a large buffer to write future elements. Instead, we do the following: write a maximal greedy run, followed by two additional maximal runs, in the same and the opposite direction respectively. We show that, at each decision point, the above algorithm is able to cover two runs of the optimal (on the unwritten-element sequence) using three runs. Lemma 16 and Lemma 6 are key in this analysis (see Appendix A for details). Thus, we have the following.

Theorem 19. Let OPT be the optimal number of runs on input I given an M-size buffer, where I has no duplicate elements. Then there exists an online algorithm A with an M-size buffer and 4M-visibility such that A always outputs S satisfying R(S) ≤ (3/2)·OPT.

7/4-competitive using a 2M-size buffer. We have seen that it is possible to achieve a competitive ratio of 3/2 using a standard M-size buffer, as long as the algorithm is able to determine the direction leading to the longer (greedy) run (see Theorem 19). Now suppose we have only a 2M-size buffer. The algorithm picks a direction at random and writes a maximal run in that direction using its regular M-size buffer. It uses the additional M-size buffer to simulate a run in the opposite direction (and thus figure out which one is longer). With probability 1/2, the algorithm is lucky and picks the greedy direction. In this case, we can cover the first two runs of the optimal (on the unwritten-element sequence) with three runs, as in Theorem 19. With probability 1/2, the algorithm picks the wrong direction, and we spend four (alternating) runs to cover two runs of the optimal. Thus, in expectation, we achieve a competitive ratio of (1/2)(3/2) + (1/2)(4/2) = 7/4.

Theorem 20. Let OPT be the optimal number of runs on input I given an M-size buffer, where I has no duplicate elements. Then there exists an online algorithm A with a 2M-size buffer such that A always outputs S satisfying E[R(S)] ≤ (7/4)·OPT and R(S) ≤ 2·OPT.

5.3 Lower Bound for Resource Augmentation

We show that with less than (4M − 2)-augmentation, no deterministic online algorithm can be better than 3/2-competitive on all inputs. Thus, an algorithm with a buffer smaller than 4M − 2 cannot be optimal, so Theorem 17 is nearly tight. Similarly, Theorem 19 is nearly tight, since a (4M − 2)-size buffer implies (4M − 2)-visibility.

Theorem 21. With buffer size less than 4M − 2, for any deterministic online algorithm A, there exists an input I such that if S is the output of A on I, then R(S) ≥ (3/2)·OPT.

6 Offline Algorithms for Run Generation

We give offline algorithms for run generation. The offline problem is the following: given the entire input, compute (in polynomial time) the optimal strategy which, when executed by a run generation algorithm (with a buffer of size M), produces the minimum possible number of runs.

For any ε, we provide an offline polynomial-time approximation algorithm that gives a (1 + ε)-approximation to the optimal solution. This is called a polynomial-time approximation scheme, or PTAS. The running time of our first attempt is $O(2^{1/\varepsilon} N \log N)$. We then improve the running time to $O(\varphi^{1/\varepsilon} N \log N)$, where $\varphi = (1 + \sqrt{5})/2 \approx 1.618$ is the well-known golden ratio.
Simple PTAS. Our first attempt breaks the output into sequences with a small number of runs, and uses brute force to find which set of runs writes the most elements. We show that for any ε, we can achieve a (1 + ε)-approximation in polynomial time using this strategy.

Theorem 22. There exists an offline algorithm A that always writes an output S satisfying R(S) ≤ (1 + ε)·OPT. The running time of A is $O(2^{1/\varepsilon} N \log N)$.

Improved PTAS. We reduce the running time by bounding the number of choices we need to consider in a brute-force search. We do this using Lemma 16. At each decision point, an algorithm chooses between starting an increasing run r1 and a decreasing run r2. If |r1| ≥ |r2|, then by Lemma 16, we may discard the choice of r2 followed by an increasing run. Let Fd be the number of run sequences we need to consider if d runs remain to be written (for example, the naïve PTAS has Fd = 2^d). First, the algorithm must handle all run sequences beginning with r1; this is an instance of Fd−1. Then the algorithm handles all run sequences beginning with r2 followed by a decreasing run; this is an instance of Fd−2. Thus Fd = Fd−1 + Fd−2; by examination, F1 = 1 and F2 = 2. This is the Fibonacci sequence, which gives us the $\varphi^{1/\varepsilon}$ factor in the running time.

Theorem 23. There exists an offline algorithm A that writes S such that R(S) ≤ (1 + ε)·OPT. The running time of A is $O(\varphi^{1/\varepsilon} N \log N)$, where $\varphi$ is the golden ratio $(1 + \sqrt{5})/2$.
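The branching bound is easy to check numerically; a toy sketch (ours, illustrative only) of the branch count:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def branch_count(d):
    """Run-direction sequences explored when d runs remain (the F_d above).

    The longer-run branch recurses on d - 1 remaining runs; by Lemma 16, the
    shorter-run branch only needs to be followed by a run in the same
    direction, so it recurses on d - 2.
    """
    if d <= 1:
        return 1
    return branch_count(d - 1) + branch_count(d - 2)

# branch_count(2) == 2 and branch_count(10) == 89: Fibonacci growth ~ phi**d,
# versus 2**d directions explored by the naive brute-force search.
```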
7 Run Generation on Nearly Sorted Input

This section presents results showing that up-down replacement selection performs better when the input has inherent sortedness (or "bounded disorder" [34]). Replacement selection produces longer runs on nearly sorted data. In particular, if every input element is at distance at most M from its target position, then a single run is produced. Similarly, we give algorithms which perform well on inputs where the optimal runs are long. In particular, we say that an input is c-nearly-sorted if there exists a proper optimal algorithm whose output consists of runs of length at least cM.

3/2-competitive using a 2M-size Buffer. We provide a randomized online algorithm that, on inputs which are 3-nearly-sorted, achieves a competitive ratio of 3/2 while using an augmented buffer of size 2M. A sketch of the algorithm follows. At each decision point, the algorithm picks a run direction at random. It starts a maximal run in that direction, but uses its extra M-size buffer to simulate the run in the opposite direction. By Lemma 14, the algorithm can tell whether it picked the same run as greedy (with an M-size buffer), similar to Theorem 20. If the algorithm got lucky and picked the greedy run, it repeats the process. If the algorithm picked the non-greedy run, it uses some careful bookkeeping to write elements and simulate the run in the opposite direction. In doing so, the algorithm winds up at the same point in the input it would have reached had it written the greedy run in the first place, but with an additional cost of one run.

Theorem 24. There exists a randomized online algorithm A using M space in addition to its buffer such that, on any 3-nearly-sorted input I that has no duplicates, A is a 3/2-approximation in expectation. Furthermore, A is at worst a 2-approximation regardless of its random choices.

Exact Offline Algorithm on Nearly Sorted Input. We show that the greedy (offline) algorithm is a linear-time optimal algorithm on inputs which are 5-nearly-sorted. We first prove the following lemma.

Lemma 25. If a proper algorithm produces runs of length at least 5M on a given input with no duplicates, then it is optimal.

Thus, we get our required linear-time exact offline algorithm.

Theorem 26. The greedy offline algorithm, i.e., picking the longer run at each decision point, is optimal on a 5-nearly-sorted input that contains no duplicates. The running time of the algorithm is O(N).

8 Additional Related Work

Replacement Selection. The classic algorithm for run generation is replacement selection [20]. While replacement selection considers only up runs, Knuth [25] analyzed alternating up-down replacement selection in 1963. He showed that for uniformly random input, alternating up-down replacement selection produces runs of expected length 3M/2, compared to 2M for standard replacement selection [18, 19, 26].

Recently, Martinez-Palau et al. [34] introduced two-way replacement selection (2WRS), reviving the idea of up-down replacement selection. The 2WRS algorithm maintains two heaps in memory, for up and down runs, and heuristically decides in which heap each element must be placed. Their simulations show that 2WRS performs significantly better on inputs with mixed up-down, alternating up-down, and random sequences.

Replacement selection with a fixed-size reservoir appears in [17, 40]. Larson [28] introduced batched replacement selection, a cache-conscious replacement selection that works for variable-length records. Koltsidas, Müller, and Viglas [27] study replacement selection for sorting hierarchical data. Improvements for the merge phase of external sorting have been considered in [10, 15, 37, 43, 44], but this is beyond the scope of this paper.

Reordering Buffer Management. The run generation problem is reminiscent of the buffer reordering problem (also known as the sorting buffer problem), introduced by Räcke et al. [36]. It consists of a sequence of n elements that arrive over time, each having a certain color. A buffer that can store up to k elements is used to rearrange them. When the buffer becomes full, an element must be output. A cost is incurred every time an element is output that has a color different from the previous element in the output sequence. The goal is to design a scheduling strategy for the order in which elements must be output so as to minimize the total number of color changes. The buffer reordering problem models a number of important problems in manufacturing processes and network routing, and has been extensively studied both in the online and the offline case [3–7, 9, 12, 23]. The offline version of the buffer reordering problem is NP-hard [9], while the complexity of our problem remains unresolved.

Patience Sort and Longest Increasing Subsequence. An old technique for sorting decks of playing cards, patience sort [33] has two phases: the creation of sorted piles (or runs), and the merging of these runs. The elements arrive one at a time, and each one can be added to an existing run or start a new run of its own. Unlike this paper, a legal run consists only of elements decreasing in value, and patience sort can form any number of parallel runs. The goal is to minimize the number of runs. The greedy strategy of placing each element on the left-most legal run is optimal. Moreover, the minimum number of such runs is the length of the longest increasing subsequence of the input [2]. Patience sort has been studied in the streaming model [21]. Like replacement selection, patience sort is able to leverage partially sorted input data.
Chandramouli and Goldstein [10] present improvements to patience sort, and combine it with replacement selection to achieve practical speedups.

Adaptive Sorting Algorithms. Python's built-in sorting algorithm, Timsort [41], works by finding contiguous runs of increasing or decreasing values during its run-generation phase. External-memory sorting for well-ordered or "partially sorted" data has been studied by Liu et al. [32]. They minimize the I/O cost of the run-generation phase by finding "naturally occurring runs." See [16] for a survey of adaptive sorting algorithms.

9 Conclusion and Open Problems

In this paper, we present an in-depth analysis of algorithms for run generation. We establish that considering both up and down runs can substantially reduce the number of runs in an external sort. The notion of up-down replacement selection had received relatively little attention since Knuth's negative result [25], until its promise was acknowledged by the experimental work of Martinez-Palau et al. [34].

The results in our paper complement the findings of Knuth [25] and Martinez-Palau et al. [34]. In particular, strict up-down alternation being the best possible online strategy explains why heuristics for up-down run generation can lead to better performance in some cases. Moreover, our constant-factor competitive ratios with resource augmentation and lookahead may guide follow-up heuristics and practical speedups.

We conclude with open problems. Can randomization help circumvent the lower bound of 2 on the competitive ratio of online algorithms (without resource augmentation)? We know that no randomized online algorithm can have a competitive ratio better than 3/2, but there is still a gap. What is the performance of the greedy offline algorithm compared to optimal? We show that greedy can be as bad as 3/2 times optimal. Is there a matching upper bound? Can we design a polynomial-time exact algorithm for the offline run-generation problem? We find it intriguing that our attempts at an exact dynamic program require maintaining too many buffer states to run in polynomial time.

10 Acknowledgments

We gratefully acknowledge Goetz Graefe and Harumi Kuno for introducing us to this problem and for their advice. This research was supported by NSF grants CCF 1114809, CCF 1217708, IIS 1247726, IIS 1251137, CNS 1408695, CCF 1439084, and by Sandia National Laboratories.

References

[1] Alok Aggarwal and Jeffrey S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116–1127, September 1988.

[2] David Aldous and Persi Diaconis. Longest increasing subsequences: from patience sorting to the Baik-Deift-Johansson theorem. Bulletin of the American Mathematical Society, 36(4):413–432, 1999.

[3] Yuichi Asahiro, Kenichi Kawahara, and Eiji Miyano. NP-hardness of the sorting buffer problem on the uniform metric. Discrete Applied Mathematics, 160(10):1453–1464, 2012.

[4] Noa Avigdor-Elgrabli and Yuval Rabani. A constant factor approximation algorithm for reordering buffer management. In Proc. 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 973–984, 2013.

[5] Noa Avigdor-Elgrabli and Yuval Rabani. An improved competitive algorithm for reordering buffer management. In Proc. 54th Annual Symposium on the Foundations of Computer Science (FOCS), pages 1–10, 2013.

[6] Noa Avigdor-Elgrabli and Yuval Rabani. An optimal randomized online algorithm for reordering buffer management. arXiv preprint arXiv:1303.3386, 2013.

[7] Reuven Bar-Yehuda and Jonathan Laserson.
Exploiting locality: approximating sorting buffers. Journal of Discrete Algorithms, 5(4):729–738, 2007.

[8] Dina Bitton and David J. DeWitt. Duplicate record elimination in large data files. ACM Transactions on Database Systems, 8(2):255–265, 1983.

[9] Ho-Leung Chan, Nicole Megow, René Sitters, and Rob van Stee. A note on sorting buffers offline. Theoretical Computer Science, 423:11–18, 2012.

[10] Badrish Chandramouli and Jonathan Goldstein. Patience is a virtue: Revisiting merge and sort on modern processors. In Proc. 2014 ACM SIGMOD Int'l Conference on Management of Data, pages 731–742, 2014.

[11] Chandra Chekuri, Ashish Goel, Sanjeev Khanna, and Amit Kumar. Multi-processor scheduling to minimize flow time with ε resource augmentation. In Proc. 36th Annual ACM Symposium on Theory of Computing (STOC), pages 363–372, 2004.

[12] Matthias Englert and Matthias Westermann. Reordering buffer management for non-uniform cost models. In Proc. 32nd Int'l Colloquium on Automata, Languages and Programming (ICALP), pages 627–638, 2005.

[13] Leah Epstein and Rob van Stee. Online bin packing with resource augmentation. Discrete Optimization, 4(3):322–333, 2007.

[14] Terje O. Espelid. On replacement selection and Dinsmore's improvement. BIT Numerical Mathematics, 16(2):133–142, 1976.

[15] Vladimir Estivill-Castro and Derick Wood. Foundations for faster external sorting. Foundations of Software Technology and Theoretical Computer Science, 880:414–425, 1994.

[16] Vladimir Estivill-Castro and Derick Wood. A survey of adaptive sorting algorithms. ACM Computing Surveys (CSUR), 24(4):441–476, 1992.

[17] W. D. Frazer and C. K. Wong. Sorting by natural selection. Communications of the ACM, 15(10):910–913, 1972.

[18] Edward H. Friend. Sorting on electronic computer systems. Journal of the ACM, 3(3):134–168, 1956.

[19] Betty Jane Gassner. Sorting by replacement selecting. Communications of the ACM, 10(2):89–93, 1967.

[20] Martin A. Goetz. Internal and tape sorting using the replacement-selection technique. Communications of the ACM, 6(5):201–206, 1963.

[21] Parikshit Gopalan, T. S. Jayram, Robert Krauthgamer, and Ravi Kumar. Estimating the sortedness of a data stream. In Proc. 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 318–327, 2007.

[22] Goetz Graefe. Implementing sorting in database systems. ACM Computing Surveys (CSUR), 38(3):10, 2006.

[23] Sungjin Im and Benjamin Moseley. New approximations for reordering buffer management. In Proc. 25th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1093–1111, 2014.

[24] Tom Keller, Goetz Graefe, and David Maier. Efficient assembly for complex objects. In Proc. 1991 ACM SIGMOD Int'l Conference on Management of Data, pages 148–157, 1991.

[25] Donald Ervin Knuth. Length of strings for a merge sort. Communications of the ACM, 6(11):685–688, 1963.

[26] Donald Ervin Knuth. The Art of Computer Programming: Sorting and Searching, volume 3. Pearson Education, 1998.

[27] Ioannis Koltsidas, Heiko Müller, and Stratis D. Viglas. Sorting hierarchical data in external memory for archiving. Proceedings of the VLDB Endowment, 1(1):1205–1216, 2008.

[28] Per-Åke Larson. External sorting: Run formation revisited. IEEE Transactions on Knowledge and Data Engineering, 15(4):961–972, 2003.

[29] Per-Åke Larson and Goetz Graefe. Memory management during run generation in external sorting. In Proc. 1998 ACM SIGMOD Int'l Conference on Management of Data, volume 27, pages 472–483, 1998.

[30] Yen-Chun Lin.
Perfectly overlapped generation of long runs for sorting large files. Journal of Parallel and Distributed Computing, 19(2):136–142, 1993.

[31] Yen-Chun Lin and Horng-Yi Lai. Perfectly overlapped generation of long runs on a transputer array for sorting. Microprocessors and Microsystems, 20(9):529–539, 1997.

[32] Yang Liu, Zhen He, Yi-Ping Phoebe Chen, and Thi Nguyen. External sorting on flash memory via natural page run generation. The Computer Journal, 54(11):1882–1990, 2011.

[33] Colin L. Mallows. Patience sorting. Bulletin of the Inst. of Math. Appl., 5(4):375–376, 1963.

[34] Xavier Martinez-Palau, David Dominguez-Sal, and Josep Lluis Larriba-Pey. Two-way replacement selection. In Proc. of the VLDB Endowment, volume 3, pages 871–881, 2010.

[35] Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Chapman & Hall/CRC, 2010.

[36] Harald Räcke, Christian Sohler, and Matthias Westermann. Online scheduling for sorting buffers. In Proc. 10th European Symposium on Algorithms (ESA), pages 820–832, 2002.

[37] Betty Salzberg. Merging sorted runs using large main memory. Acta Informatica, 27(3):195–215, 1989.

[38] Daniel D. Sleator and Robert E. Tarjan. Amortized efficiency of list update and paging rules. Communications of the ACM, 28(2):202–208, 1985.

[39] Hongyang Sun and Rui Fan. Improved semi-online makespan scheduling with a reordering buffer. Information Processing Letters, 113(12):434–439, 2013.

[40] T. C. Ting and Y. W. Wang. Multiway replacement selection sort with dynamic reservoir. The Computer Journal, 20(4):298–301, 1977.

[41] Wikipedia. Timsort, 2004. http://en.wikipedia.org/wiki/Timsort.

[42] Weiye Zhang and Per-Åke Larson. Dynamic memory adjustment for external mergesort. In Proc. 23rd International Conference on Very Large Data Bases (VLDB), pages 376–385, 1997.

[43] Weiye Zhang and Per-Åke Larson. Buffering and read-ahead strategies for external mergesort. In Proc. 24th Int'l Conference on Very Large Data Bases (VLDB), pages 523–533, 1998.

[44] LuoQuan Zheng and Per-Åke Larson. Speeding up external mergesort. IEEE Transactions on Knowledge and Data Engineering, 8(2):322–332, 1996.

A Appendix: Omitted Proofs

Proof of Lemma 2. Without loss of generality, assume r and r′ are increasing runs. Consider any time step t when elements from both r and r′ are being written. Let Bt and B′t be the buffers of r and r′ at time step t respectively; let Ct be the set of elements in Bt that are eventually written to r, and C′t be the set of elements in B′t that are eventually written to r′. We prove inductively that C′t ⊆ Ct; this implies that r covers r′. The base case holds, as r and r′ start with the same buffer. Given C′t ⊆ Ct, we must show that (a) the element z written to r is not in C′t+1, and (b) the element e read into C′t+1 is also in Ct+1.

Consider the elements z′ and z written by r′ and r, respectively, at time t + 1. We must have z′ ≥ z; thus either z = z′, or z is never written to r′ (either way, z is not in C′t+1).

Since e is in C′t+1, it is eventually written by r′; thus e ≥ z′. Thus e ≥ z, but that means e is eventually written by r. Since e was just read, it is in Bt+1; thus e ∈ Ct+1.

Observation A. If A has just written an element e and is writing a down (up) run, then A cannot write any element larger (smaller) than e in the same run. Similarly, if A has just written e, then A cannot write both an element larger than e and an element smaller than e in the same run.

Proof of Theorem 10.
Let I1, the first M elements of the input, be I1 = 1, 2, . . . , M. We divide the rest of the input into segments of size M. Let the (t + 1)st such segment be It+1. Then

It+1 = (1 + tM ↗ M + tM)  or  (−(1 + tM) ↘ −(M + tM)).

Call these a positive segment and a negative segment respectively. At time M(t − 1) + 1 we decide whether It+1 is a positive or a negative segment, based on A. Specifically, we choose It+1 using either the direction of the run A is writing, or the value of the most recent element written. If A is writing a down run, It+1 is a positive segment; if A is writing an up run, It+1 is a negative segment. It may be that A has only written one element of a run (so A could turn this into either an up run or a down run). If this element was the smallest element in the buffer of A, then It+1 is a negative segment. Otherwise, It+1 is a positive segment.

First we show that A must write at least one new run for each It; thus R(S) ≥ t. At least one run is required for A to write I1, so for the remainder of the proof we assume t > 1. Consider time M(t − 2) + 1, when It begins. We assume that It is a positive segment; a mirroring argument works when It is a negative segment. Furthermore, note that the elements of It are the largest in the instance so far. There are two cases: A is currently writing a down run, or A is writing the initial element of a new run.

Case 1: Algorithm A is currently writing a down run. Then the elements of It must be larger than any element in A's down run. Thus A must use another run to write the elements of It, by Observation A.

Case 2: Algorithm A is writing the initial element of a new run. By construction, the element written is not the smallest element in A's buffer, but it is smaller than all elements in It. Then A must spend one run to write the smallest element in its buffer, and another to write It. Thus It causes A to write a run in addition to its current run, by Observation A.

On the other hand, an offline algorithm can write I2i and I2i+1 in one run. Assume that I2i is a positive segment; a mirroring argument works when I2i is a negative segment. If I2i+1 is positive, both can be written using an up run. If I2i+1 is negative, both can be written using a down run. Thus OPT is no more than ⌈t/2⌉.

Proof of Theorem 11. Our lower bound uses the same basic principles as Theorem 10. We first show the lower bound with some repeated elements, and then show how to perturb the elements to avoid repetitions. We generate a randomized input and show that no deterministic algorithm can perform better than 3/2 times OPT on that input. The theorem is then proven by Yao's minimax principle, which states that the best expected performance of a randomized algorithm is at least as large as the expected performance of the best deterministic algorithm over a (known) distribution of inputs. (See, e.g., [35] for details.)

As in Theorem 10, we divide the input into segments of size 4M. Call these segments It for t = 1, 2, . . . , ⌊N/4M⌋. The input is randomized: for each t, we pick one of the following two segment types, each with probability 1/2. A positive segment consists of

$I_t^1 = (1 \nearrow M)$, $I_t^2 = (2M \searrow M + 1)$, $I_t^3 = (3M \searrow 2M + 1)$, $I_t^4 = (4M \searrow 3M + 1)$;

a negative segment consists of

$I_t^1 = (1 \nearrow M)$, $I_t^2 = (-2M \nearrow -M - 1)$, $I_t^3 = (-3M \nearrow -2M - 1)$, $I_t^4 = (-4M \nearrow -3M - 1)$.

Let $I_t = I_t^1 \circ I_t^2 \circ I_t^3 \circ I_t^4$. A positive or negative segment is chosen randomly for each It with probability 1/2.
The optimal algorithm spends no more than one run per I_t, using an up run for a positive segment or a down run for a negative segment.

We show that any deterministic algorithm requires at least one new run to write I^4_{t−1} and I^1_t for t > 1. Further analysis shows that, with probability 1/2, any deterministic algorithm requires at least one run to write the remainder of I^1_t, I^2_t, and I^3_t. Note that one run is also required to write I^1_1; summing, this gives a total expected cost of (3/2)OPT.

Consider a segment I_t with t > 1. Once all of I^4_{t−1} has been read into its buffer, at least one element of I^3_{t−1} has been written. Once all of I^1_t has been read into its buffer, at least one element of I^4_{t−1} has been written. Finally, once all of I^2_t has been read into its buffer, at least one element x of I^1_t has been written. Applying Observation A, at least one new run is required to write these three elements.

Now we show that, with probability 1/2, an additional run is required to write I^2_t. Let x be the first element of I^1_t that is written (thus, the cost of writing x itself was handled in the case above; here we show when an additional run is required). Note that the algorithm must choose x before it sees any element of I^2_t (so it does not know whether I_t is positive or negative).

Suppose x ≠ 1 and I_t is a positive segment. By Observation A, an additional run is required to write both 1 and any element of I^2_t. If 1 is not written, all of I^2_t cannot be stored in the buffer; but then I^2_t and I^3_t cannot be written using one run. Similarly, suppose x = 1 and I_t is a negative segment. By Observation A, an additional run is required to write both M and any element of I^2_t; otherwise I^2_t and I^3_t require an additional run to be written. Thus any deterministic algorithm cannot perform better than a 3/2-approximation. Applying Yao's minimax principle proves the theorem.

Now we perturb the input to avoid duplicate elements. We multiply each element by ⌊N/4M⌋ and add t to each element of I_t. In other words, we use a new segment I′_t = (I_t ⊗ ⌊N/4M⌋) ⊕ t. Our arguments above only depended on the relative ordering of the elements, which is preserved by this perturbation. For example, assume I_t and I_{t−1} are both positive segments. Then all elements of I^1_t are less than all elements of I^4_{t−1}, and all elements of I^2_t are greater than all elements of I^1_t.

Proof of Theorem 13. We build S constructively. At each time step t where greedy chooses an up or down run, we show that one of its choices leads to a run of length at least 5M/4. Since the greedy algorithm always picks the longer run at each decision time step, no run of length less than 5M/4 can be part of its output (we handle the final runs below).

Consider any time step t where t < N − 5M/4; if t is larger than this value, the final run will have length N − t. Let the contents of the buffer at time t, in sorted order, be σ_1, . . . , σ_M. Let I′ be the sequence of ⌊M/2⌋ elements of I arriving after t.

Consider the run starting at σ_M and continuing downwards; call this down run r_1. Let r_2 be the up run starting at σ_1 and continuing upwards. Any element of I′ less than σ_⌈M/2⌉ will (eventually) be written out to r_1; any element greater than σ_⌊M/2⌋ will be written out to r_2. Every element must fall into one of these categories, so at least ⌈⌊M/2⌋/2⌉ elements are added to the larger run. Each run includes at least the elements already in the buffer, so the larger run has length at least M + ⌈⌊M/2⌋/2⌉.
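As a quick arithmetic check of the constant (an added aside; assuming 4 divides M, so the floor and ceiling vanish):

```latex
M + \left\lceil \frac{\lfloor M/2 \rfloor}{2} \right\rceil
  \;=\; M + \frac{M}{4}
  \;=\; \frac{5M}{4},
\qquad\text{and in general}\qquad
M + \left\lceil \frac{\lfloor M/2 \rfloor}{2} \right\rceil \;\ge\; \frac{5M-1}{4}.
```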
The last 5M/4 elements can be handled by greedy in at most 2 runs (since each run trivially has length at least M). Thus the above applies to all but the last two runs of greedy.

Proof of Lemma 14. Let S_1 be an output that writes r_1 initially and S_2 be an output that writes r_2 initially. Without loss of generality, suppose that r_1 is increasing and r_2 is decreasing. Let r_1 = r_1(1), r_1(2), . . . , r_1(k) and r_2 = r_2(1), r_2(2), . . . , r_2(ℓ).

The idea of the proof is to split these runs into two phases: (a) when the elements of r_1 are smaller than the corresponding elements of r_2, and (b) when the elements of r_1 are greater than or equal to those of r_2. During each of these two phases, we use the fact that incoming elements written by S_1 have to be in the buffer of S_2 (and vice versa) to bound their lengths. We assume that both runs write exactly one element for each element they read in; this cannot affect the length of the runs.

Let B_0 be the original buffer, i.e., the first M elements of I. Let i be the transition point between the two phases mentioned above; in other words, r_1(i + 1) ≥ r_2(i + 1) but r_1(i) < r_2(i).

Divide r_1 into s_1 and t_1, where s_1 is the first i elements of r_1 and t_1 is the remainder of r_1. We further divide s_1 into s^B_1, the elements of s_1 that are in B_0, and s^N_1, the elements of s_1 that are not in B_0. Let t^B_1 be the elements of t_1 that are in B_0. Let f_1 be the set of elements in r_1 that are read in after r_1(i + 1) is written. Let u_1 be the set of elements not in r_1 that are read in before r_1(i + 1) is written. We define the corresponding sets for r_2 as well: s^B_2, s^N_2, t^B_2, f_2, and u_2.

We can bound the sizes of several of these sets by M. Note that s_1 cannot have more than M elements, since all of them must be stored in the buffer while s_2 is being written. Thus |s^N_1| + |s^B_1| ≤ M. Similarly, |s^N_2| + |s^B_2| ≤ M. We must also have |u_1| ≤ M and |u_2| ≤ M. Finally, consider s^N_2 ∪ t^B_1. Any element in s^N_2 must be read before time step i. Since s_1 is disjoint from s_2 (by definition of i), all elements of s^N_2 must be in S_1's buffer at time step i. All elements of t^B_1 must also be in S_1's buffer at time step i, so |s^N_2| + |t^B_1| ≤ M.

Starting from time step i + 1, any new element e that is read in cannot be in both r_1 and r_2. This means that all elements of f_1 must be in the buffer of S_2 until r_2 ends, and all elements of f_2 must be in the buffer of S_1 until r_1 ends. On the other hand, all elements of u_2 must eventually be a part of r_1, and similarly for u_1 and r_2.

To begin, we show a weaker version of the lemma for runs of length 4M. We have |r_1| ≤ (|s^N_1| + |s^B_1|) + |u_2| + (|s^N_2| + |t^B_1|) + |f_1| ≤ 3M + |f_1|. Thus if |r_1| ≥ 4M, then |f_1| ≥ M. Since all elements of f_1 must be stored in the buffer of S_2 until r_2 ends, r_2 must end when the Mth element of f_1 is read in. Then we must have |r_2| < |r_1|.

We have |f_1| ≥ M − |u_2|; otherwise, |r_1| ≤ (|s^N_1| + |s^B_1|) + (|s^N_2| + |t^B_1|) + (|u_2| + |f_1|) < 3M. Consider the first M − |u_2| − 1 elements read in after i that are eventually written to r_1 (this is a prefix of f_1); call them f′_1. Since |f_1| ≥ M − |u_2|, there must be another element e ∈ f_1 that is read after all elements of f′_1. Note e ∈ r_1. Let t be the time when e arrives. At time t, the buffer of S_2 must contain all elements of u_2, as well as all elements of f′_1 and e. The buffer of S_2 is then full of elements that cannot be written in r_2. Hence, S_2 is forced to start a new run at time t < |r_1|, so |r_2| < |r_1|.
Then we must have |f_2| + |u_1| < M: none of the elements in these sets is in r_1, and all of them must be stored in S_1's buffer until r_1 ends (which is after r_2 ends). Finally, we have

|r_2| ≤ (|s^N_2| + |s^B_2|) + (|s^N_1| + |t^B_2|) + (|u_1| + |f_2|) < 3M,

as required.

Proof of Theorem 17. Algorithm A will simulate the maximal up run, r_1, and the maximal down run, r_2, to see which is longer, but it does not actually need to write any elements during this simulation. By Lemma 14, if we find that one run has length at least 3M, it must be the longer run.

We now describe exactly how to simulate a run r using 4M space without writing any elements. Algorithm A simulates the run one step at a time. We refer to the actions and buffer of an algorithm with an M-size buffer writing r as the simulated algorithm. Without loss of generality, assume r is an up run.

Assume that all elements are stored in the buffer in the order they arrive. Thus, after t elements have been written, the first M + t elements of the buffer are exactly the elements the simulated algorithm has read from the input up to time t. Of these elements, M must be in the buffer of the simulated algorithm, while the other t will have been written to r; however, A does not explicitly keep track of which elements are in the simulated buffer.

The algorithm A keeps track of ℓ, the last element written to r, because at each time t, all of the first M + t elements larger than ℓ are (a) in the simulated algorithm's buffer at time t and (b) will be written to r at a later point. Thus, once no element among the first M + t is larger than ℓ, r must end.

At each time step, the smallest element larger than ℓ is written to r. Specifically, at time step t, A finds the smallest element e among the first M + t elements of the buffer that is larger than ℓ. This is the next element of r. Thus, in the next time step, A updates ℓ ← e and repeats. If no such e can be found, no element among the first M + t (and thus no element in the simulated buffer) can continue the run, so the run ends at time t. The last time A can update ℓ is when the simulated algorithm has seen all elements in A's buffer; in other words, t = 3M. By Lemma 14, this is sufficient to determine which run is longer.

The algorithm now knows which run is longer; without loss of generality, assume |r_1| ≥ |r_2|. Then the algorithm writes a maximal run r using its 4M-size buffer in the direction of r_1. Run r is guaranteed to contain all elements of r_1 by Lemma 2. Since r_2 has length less than 3M by Lemma 14, all of its elements must already be in the 4M-size buffer. Thus they are written during r, because a maximal run always writes its buffer contents. The first initial run of a proper optimal algorithm on the unwritten-element sequence has to be either r_1 or r_2. Since r covers both r_1 and r_2, by Theorem 7 with β = 1, A never writes more runs than an optimal algorithm with an M-size buffer.
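The simulation above is mechanical enough that a short sketch may help. The following is a hypothetical illustration (not the paper's code), assuming distinct elements and that `arrivals` lists the input in arrival order with `arrivals[:M]` forming the initial buffer; the early exit at t = 3M reflects Lemma 14.

```python
def simulated_up_run_length(arrivals, M):
    """Simulate, without writing anything, the maximal up run that an
    M-buffer algorithm would write (proof of Theorem 17)."""
    last = float("-inf")   # ell: the last element "written" to the run
    t = 0
    while True:
        # After t simulated writes, the simulated algorithm has seen the
        # first M + t arrivals; those larger than `last` are exactly the
        # buffer elements that will still join the run.
        seen = arrivals[:min(M + t, len(arrivals))]
        candidates = [x for x in seen if x > last]
        if not candidates:
            return t       # nothing can continue the run: it ends at time t
        last = min(candidates)
        t += 1
        if t >= 3 * M:     # by Lemma 14, this run is already the longer one
            return t
```

Running this once in each direction (negating the comparisons for a down run) identifies the longer maximal run, matching the decision step in the proof.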
Lemma A. Consider two algorithms A_1 and A_2 that have the same remaining input I when they both start writing a new maximal run, called r_1 and r_2 respectively. Let their buffers at this point be B_1 and B_2 (which may not be full), and assume max(B_1 \ B_2) ≤ min(B_2 \ B_1). If r_1 and r_2 are increasing, then all elements in I written to r_2 are also written to r_1. Similarly, if r_1 and r_2 are decreasing, all elements in I written to r_1 are also written to r_2.

Proof. It suffices to prove the first case, where r_1 and r_2 are both increasing, as the other case can be proven similarly.

After r_1(t) and r_2(t) are written, let C_1(t) and C_2(t) be the sets of elements in the buffers of A_1 and A_2 that will be written to r_1 and r_2 at some point in the future, i.e., the sets of elements that are at least as large as r_1(t) or r_2(t), respectively. It is straightforward to prove by induction that the invariant max(C_1(t) \ C_2(t)) ≤ min(C_2(t) \ C_1(t)) always holds. We note that this invariant implies r_1(t + 1) ≤ r_2(t + 1). Therefore, if this invariant is true for all t < min{|r_1|, |r_2|}, any incoming element e ∈ I satisfies e ∈ r_2 ⇒ e ∈ r_1, as required.

We now prove the invariant. The base case is true since C_1(0) = B_1 and C_2(0) = B_2. Suppose the invariant holds for t; then r_1(t + 1) ≤ r_2(t + 1) and a new element e is read in.

Case 1: if e ∈ C_2(t + 1), then e ≥ r_2(t + 1) ≥ r_1(t + 1), so e ∈ C_1(t + 1).

Case 2: if e ∈ C_1(t + 1) and e ∉ C_2(t + 1), then e < r_2(t + 1) ≤ min(C_2(t + 1)).

Hence, the invariant holds for t + 1.

Observation B. On an input I, let r_1 ◦ · · · ◦ r_k be the first k runs of an optimal output, and let r′_1 ◦ · · · ◦ r′_k be k runs that cover r_1 ◦ · · · ◦ r_k. Then r′_1 ◦ · · · ◦ r′_k are also the first k runs of an optimal output.

Proof of Lemma 16. Without loss of generality, assume r_1 and r_2 are the initial maximal increasing and decreasing runs, respectively, and |r_1| ≥ |r_2|. Suppose r_2 ◦ r_3, where r_3 is increasing, is a prefix of an optimal output S_OPT(I).

Consider the case |r_1| = |r_2|. Let the buffers at the end of these two runs be B_1 and B_2, and let j be the smallest index such that r_1(j + 1) > r_2(j + 1). Consider any new element e ∈ I that is read in before r_1(j + 1) is written. Obviously, we have e ∈ B_1 ⇒ e ≤ r_1(j), and e ∈ B_2 ⇒ e ≥ r_2(j). Now consider any new element e that is read in after r_1(j + 1) and r_2(j + 1) are written. It is easy to see the following: e ∈ B_1 \ B_2 ⇒ e ≤ r_1(j + 1), and e ∈ B_2 \ B_1 ⇒ e ≥ r_2(j + 1). Therefore,

max(B_1 \ B_2) ≤ max(r_1(j), r_2(j + 1)) ≤ min(r_2(j), r_1(j + 1)) ≤ min(B_2 \ B_1).

The situation can be visualized in Figure 2 as follows. If an incoming element cannot be written in the current run, it lies below or above (depending on whether the run is increasing or decreasing) the last element written. The regions are marked with their associated sets described above.

[Figure 2: Visualizing the buffer states. The runs r_f (covered by r_1) and r_2 are shown around indices j and j + 1, with the regions B_2 \ B_f, B_2 ∩ B_f, and B_f \ B_2 marked; H denotes the buffer elements larger than r_1(k).]

Consider r_1 ◦ r_4, where r_4 is a maximal increasing run. Every element of r_2 will be written in either r_1 or r_4 by Lemma 6. If e ∈ r_3, then we consider the cases e ∈ B_2 and e ∈ r_3 \ B_2. If e ∈ B_2, then e is either in r_1 or in B_1, which means e is either in r_1 or in r_4. If e ∈ r_3 \ B_2, then e ∈ r_4 by Lemma A, using the fact that max(B_1 \ B_2) ≤ min(B_2 \ B_1). Thus, r_1 ◦ r_4 covers r_2 ◦ r_3. As a result, r_1 is also a prefix of an optimal output S_OPT(I)′ by Observation B.

Now suppose |r_1| > |r_2|, and let |r_2| = k. The simplest argument goes as follows. Instead of arguing based on r_1 directly, we consider a run r_f that is increasing but may not be maximal. Consider an algorithm A_f that writes r_f(1) = r_1(1), . . . , r_f(k) = r_1(k). Then, without reading any new element in, it finishes its first run r_f by writing out all elements in its buffer that are larger than r_1(k) (the set of these elements is H in Figure 2). After this extra step, let the buffer be B_f. Using the exact same argument as above, we have that

max(B_f \ B_2) ≤ max(r_f(j), r_2(j + 1)) ≤ min(r_2(j), r_f(j + 1)) ≤ min(B_2 \ B_f).
Using the same argument as in the first case, we have that r_f followed by a maximal increasing run covers r_2 followed by a maximal increasing run. Hence, r_f is a prefix of an optimal output. Since r_1 covers r_f, it is also a prefix of an optimal output by Observation B.

Proof of Theorem 19. In any partition that is not the last one, let I_r be the unwritten-element sequence and let r_1, r_2 be the two possible maximal initial runs, where r_1 is increasing and r_2 is decreasing. Without loss of generality, suppose |r_1| ≥ |r_2|. We use the simulation technique of Theorem 17 to determine which run is longer. The algorithm writes r_1 ◦ r_3 ◦ r_4, where r_3 and r_4 are maximal runs in the same and the opposite direction as r_1, respectively. The algorithm stops when there is no element left to write. We break our analysis into cases based on which runs are in an optimal output. In each case, we show that Theorem 7 proves a competitive ratio of β = 3/2.

If r_2 is a prefix of a proper optimal output S_OPT(I_r), let r_5 be the maximal run after r_2 in S_OPT(I_r). After writing r_3, the algorithm has already written out all elements of r_2 by Lemma 6. Let the unwritten-element sequence after writing r_3 be I_3 and let the unwritten-element sequence after writing r_2 be I_2. By Lemma 6, I_3 is a subsequence of I_2. According to Lemma 16, r_5 has to be decreasing in order to possibly yield fewer runs than writing r_1 initially. Hence, applying Lemma 8 to r_4 and r_5, we know that at the end of r_4, the algorithm has written all elements of r_2 and r_5. Thus, r_1 ◦ r_3 ◦ r_4 covers r_2 ◦ r_5.

If r_1 ◦ r_3 is a prefix of S_OPT(I_r), then we are done, as r_1 ◦ r_3 ◦ r_4 trivially covers r_1 ◦ r_3.

If r_1 is a prefix of S_OPT(I_r) but r_1 ◦ r_3 is not, then let r′_3 be the maximal run opposite to r_3; i.e., r_1, r′_3 are the first two runs of S_OPT(I_r). Note that I_3 is a subsequence of I_1. Hence, applying Lemma 8 to r′_3 on input I_1 and to r_4 on input I_3, we have that at the end of r_4, the algorithm has written out all elements of r_1 ◦ r′_3. Thus, r_1 ◦ r_3 ◦ r_4 covers r_1 ◦ r′_3.

In the last partition, since A outputs at most 3 runs, it can only achieve a ratio worse than 3/2 if the optimal algorithm wrote out a single run. But then that run is longer, and A would choose it. Therefore, we have R(S) ≤ (3/2) · OPT.
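A compact sketch of this phase structure follows (hypothetical code, not the paper's). It reuses the `maximal_run` helper from the earlier sketch, and its direction test cheats by simulating both runs on copies of the state, standing in for the bounded-space simulation of Theorem 17.

```python
def phase(buffer, rest):
    """One phase of the 3/2-competitive strategy (Theorem 19): find the
    direction of the longer maximal initial run, then write maximal runs
    in that direction, the same direction again, and the opposite one.
    `rest` is the remaining input as a list; a full implementation would
    also return the unconsumed input."""
    def run_len(up):  # simulate one maximal run on copies of the state
        run, _ = maximal_run(list(buffer), iter(list(rest)), up)
        return len(run)

    d = run_len(True) >= run_len(False)  # True iff the up run is at least as long
    stream = iter(rest)
    # Write r1, r3, r4: directions d, d, and the opposite of d.
    return [maximal_run(buffer, stream, up)[0] for up in (d, d, not d)]
```

As shown in the proof, the three runs written per phase cover the first two runs of an optimal output, which is where the 3/2 ratio comes from via Theorem 7.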
Proof of Theorem 20. In each partition, let the unwritten-element sequence be I_r and the optimal proper output of I_r be S_OPT(I_r). The algorithm randomly picks the direction of the next run and writes a maximal run in that direction using an M-size buffer. It uses the extra M buffer slots to simulate the buffer state of the maximal run in the other direction, to check whether the run it chose is at least as long as the other run. If the algorithm picked the run that is at least as long as the other run, it then writes a maximal run in the same direction followed by another maximal run in the opposite direction. The algorithm stops when there is no more element to write. In the proof of Theorem 19, we showed that these three runs cover the first two runs of S_OPT(I_r). If the algorithm picked the shorter run, then it writes three more maximal runs with alternating directions. We know that the first two runs with alternating directions cover the first run of S_OPT(I_r), as argued in the proof of Theorem 9; hence, the next two runs with alternating directions cover the second run of S_OPT(I_r).

In the last partition, if OPT(I_r) = 2, the analysis is the same. If OPT(I_r) = 1, then the optimal output must be the longer maximal run. The algorithm, if it picked the shorter run, will cover the longer run when it writes the next maximal run in the opposite direction, as shown in the proof of Theorem 9. Therefore, we have E[x_i]/y_i ≤ (1/2) · (4/2) + (1/2) · (3/2) = 7/4. Applying Theorem 7 with α = 7/4 and β = 2, we have E[R(S)] ≤ (7/4) · OPT and R(S) ≤ 2 · OPT.

Proof of Theorem 21. Suppose an algorithm has a (4M − 3)-size buffer. Consider the input I_1 ◦ e ◦ I_2, where

I_1 = (1 ր M − 1) ◦ (2M − 1 ց M) ◦ (3M ր 4M − 2) ◦ (−M ց −2M + 2).

If S first writes −2M + 2, then let e = −2M + 1.
• Case 1: if S writes e = −2M + 1 next, then let I_2 = (0 ց −(M − 1)). Thus, S has to spend at least two runs while an optimal output uses one run: (4M − 2 ց 3M) ◦ (2M − 1 ց −2M + 1).
• Case 2: if S writes −2M + 3 next, let I_2 = (−2M ց −10M) ◦ (2M ր 3M − 1). Then S has to spend at least 3 runs while an optimal output has 2 runs: (4M − 2 ց 3M) ◦ (2M − 2 ց −10M) ◦ (2M ր 3M − 1).

Similarly, if S first writes 4M − 2, then let e = 4M − 1.
• Case 1: if S writes e = 4M − 1 next, then let I_2 = (2M ր 3M − 1).
• Case 2: if S writes 4M − 3 next, let I_2 = (4M ր 10M) ◦ (0 ց −M + 1).

If S first writes e′ ∉ {−2M + 2, 4M − 2}, then let e = −2M + 1 and I_2 = (0 ց −(M − 1)). Thus, S has to spend at least two runs while an optimal output uses one run: (4M − 2 ց 3M) ◦ (2M − 1 ց −2M + 1).

Proof of Theorem 22. We apply Theorem 7 with x = ⌈1/ε⌉ + 1 and y = ⌈1/ε⌉. In any partition except the last one, the algorithm chooses the combination of ⌈1/ε⌉ maximal runs r_1 ◦ · · · ◦ r_⌈1/ε⌉ whose output is longest (ties are broken arbitrarily) and writes out one extra run r_{⌈1/ε⌉+1}. By Lemma 6, r_1 ◦ · · · ◦ r_{⌈1/ε⌉+1} covers the first ⌈1/ε⌉ runs of a proper optimal output of the unwritten-element sequence I_r in ⌈1/ε⌉ + 1 runs. In the last partition, the algorithm chooses a combination of runs with the smallest number of runs. Therefore, we obtain a β = 1 + 1/⌈1/ε⌉ ≤ 1 + ε approximation.

There are 2^{⌈1/ε⌉+1} combinations to consider (each run can be up or down). The length of a run can be calculated in O(N_i) time by simulating it directly, where N_i is the length of the longest output, namely |r_1 ◦ · · · ◦ r_{⌈1/ε⌉+1}|. Since N_i items are then written out, the total running time is O(∑_{i=1}^{t} N_i 2^{⌈1/ε⌉}) = O(N 2^{1/ε}). Searching for the shortest way to write out the remaining elements (once I_r = ∅) takes O(N 2^{1/ε}) time, which does not affect the running time.

Proof of Theorem 23. In each partition, we restrict the search to combinations of ⌈1/ε⌉ consecutive runs that write the longest sequence, as described above. By Lemma 16, if d runs remain to be written out, we must examine one subcase with d − 1 runs remaining and one with d − 2 runs remaining. Thus, the number of combinations we need to consider is F_d = F_{d−1} + F_{d−2}. Therefore, the running time of this step is O(F_⌈1/ε⌉ N_i log N_i) = O(ϕ^{⌈1/ε⌉} N_i log N_i). This is because F_⌈1/ε⌉ = (ϕ^{⌈1/ε⌉} − ψ^{⌈1/ε⌉})/√5 ≤ ϕ^{⌈1/ε⌉}.
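The enumeration in the proof of Theorem 22 is easy to phrase as code. Below is a hedged sketch (hypothetical, not the paper's implementation); it reuses the `maximal_run` helper from the earlier sketch and searches all 2^{k+1} up/down patterns for the next k + 1 maximal runs, keeping the pattern that writes the most elements.

```python
from itertools import product

def best_direction_pattern(buffer, rest, k):
    """Search step of the PTAS (Theorem 22), with k = ceil(1/eps): among
    all 2^(k+1) up/down patterns for the next k + 1 maximal runs, return
    the pattern that writes the longest output. Runs are simulated on
    copies of the buffer and the remaining input `rest`."""
    best_dirs, best_len = None, -1
    for dirs in product((True, False), repeat=k + 1):
        b, s = list(buffer), iter(list(rest))  # fresh copies per pattern
        total = sum(len(maximal_run(b, s, up)[0]) for up in dirs)
        if total > best_len:
            best_dirs, best_len = dirs, total
    return best_dirs
```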
Proof of Theorem 24. At each decision time step, A flips a coin to pick a direction for the next run r. It begins writing an up or down run according to the coin flip. Meanwhile, A uses M additional space to simulate r′, the run in the opposite direction. In particular, it simulates the contents of the buffer at each time step, as well as the last element written. Note that A does not need to keep track of the most recent element read when simulating r′, as it is always the last element in the buffer.

By Lemma 14, the run with the incorrect direction has length less than 3M and the run with the correct direction has length 3M or more. Thus, A can tell whether it picked the correct direction. With probability 1/2, A writes the longer run. In that case, it knows it picked the correct direction, and it repeats, flipping another coin.

Now consider the case where A picks the wrong direction. When r ends (at time t), r′ is continuing. Then A must act exactly as if it had written r′. Specifically, we cannot simply cover r′ and use an argument akin to Theorem 7, as then the unwritten-element sequence may not be 3-nearly-sorted. To simulate r′, A has two tasks: (a) A must write all elements that were written by r′ but not by r, and (b) A must "undo" writing any element that was written during r but is not in r′, in case these elements are required to make a subsequent run have length 3M. Divide the buffer into two halves: B_r is the buffer after writing r, and B_{r′} is the buffer being simulated when r ends.

The first task is to ensure that A writes, in the direction of r′, all elements written by r′ that were not written by r. These elements must be in B_r, since they were not written by r; and they must not be in B_{r′}, since they were written by r′. Thus, A can simply write out each element in B_r that is not in B_{r′} and continue writing r′ from that time step.

The second task is to ensure that all elements written during r that were not written during r′ cannot affect future run lengths. These elements must be in B_{r′} but not in B_r. We mark these elements as special ghost elements. We can do this with O(1) additional space by moving them to the front of the buffer and keeping track of how many of them there are. During subsequent runs, these are considered to be a part of A's buffer. However, when A would normally want to write one of these elements out, it instead simply deletes it from its buffer without writing any element. That said, A still counts these deletions towards the size of the run. Note that our buffer never overflows, as A continues to write (or delete) one element per time step. When this simulation is finished, the contents of A's buffer are exactly what they would have been had it written out r′ in the first place; however, some are ghost elements, and will be deleted instead of written. Then A repeats, flipping another coin.

For each run in the optimal output, either A writes that run exactly, for cost 1 (with probability 1/2), or A writes another run and makes up for its mistake by simulating the correct run exactly, for cost 2 (with probability 1/2). Thus A has expected cost (3/2)OPT. In the worst case, A guesses incorrectly each time, for a total cost of 2 · OPT.

Proof of Lemma 25. Suppose we are at a decision time step; without loss of generality, assume this time step is 0. Let the next two possible maximal runs be r_1 and r_2, which are up and down, respectively. Without loss of generality, suppose |r_1| ≥ 5M. Let r_3 be the maximal decreasing run that follows r_2. By Lemma 16, either writing r_1 or writing r_2 ◦ r_3 is optimal on the unwritten-element sequence. Call the two outputs S_1 and S_2, respectively.

Let s^B_1, s^N_1, t^B_1, f_1, u_1, s^B_2, s^N_2, t^B_2, f_2, u_2 be the same sets described in the proof of Lemma 14. Let r_1 = (x_1, . . . , x_k), r_2 = (y_1, . . . , y_ℓ), and r_3 = (y_{ℓ+1}, . . . , y_q).
Let the buffers of S_1 and S_2 after time step ℓ be B_{1,ℓ} and B_{2,ℓ}. Let j ≥ ℓ be the smallest index such that x_{j+1} ≥ y_{j+1}.

Similar to the proof of Lemma 14, we let r_{1,2} = {x_{ℓ+1}, . . . , x_j} and t_{1,2} = {x_{j+1}, . . . , x_k}. Let s^N_{1,2} be the set of elements in r_{1,2} but not in B_{1,ℓ}. Let s^B_{1,2} be the set of elements in r_{1,2} that are also in B_{1,ℓ} \ r_2. Let t^B_{1,2} be the set of elements in t_{1,2} that are also in B_{1,ℓ} \ r_2. Let u_{1,2} be the set of elements not in r_1 that are read in before x_{j+1} is written. Let f_{1,2} be the set of elements in r_1 that are read in after x_{j+1} is written.

We define s_3 = {y_{ℓ+1}, . . . , y_j} and t_3 = {y_{j+1}, . . . , y_q}. Let s^N_3 be the set of elements in s_3 but not in B_{2,ℓ}. Let s^B_3 be the set of elements in s_3 ∩ B_{2,ℓ}. Let u_3 be the set of elements that are not in r_3 and are read in before y_{j+1} is written. Let f_3 be the set of elements in r_3 that are read in after y_{j+1} is written.

Since the buffer of S_2 must keep all elements in s^N_{1,2} and s^B_{1,2} at time step j + 1, we have |s^N_{1,2}| + |s^B_{1,2}| ≤ M. We have

|r_1| = |r_2| + (|u_3| + |s^N_3|) + (|s^N_{1,2}| + |s^B_{1,2}|) + |f_{1,2}| ≤ (2M + |u_1| + |f_2|) + (M − |u_1| − |f_2|) + M + |f_{1,2}| = 4M + |f_{1,2}|.

Since |f_{1,2}| ≥ M by our assumption that |r_1| ≥ 5M, r_3 has to end before r_1, by the same argument as in the proof of Lemma 14. Since |r_1| ≥ |r_2 ◦ r_3|, r_1 followed by any maximal run covers r_2 ◦ r_3. Therefore, r_1 is an optimal prefix of the unwritten-element sequence. At every time step, a maximal run of length 5M or more is always a prefix of an optimal output on the unwritten-element sequence, as required.