External Sorting: Demetris Zeinalipour
External Sorting: Demetris Zeinalipour
External Sorting: Demetris Zeinalipour
University of Cyprus
EPL446 Advanced Database Systems
Lecture 9
External Sorting
Chapter 13: Ramakrishnan & Gehrke
Demetris Zeinalipour
http://www.cs.ucy.ac.cy/~dzeina/courses/epl446
EPL446: Advanced Database Systems - Demetris Zeinalipour (University of Cyprus)
9-1
9-2
Lecture Outline
External Sorting
13.1) When does a DBMS sort Data.
13.2) A Simple Two-Way Merge-Sort
(
)
13.3) External Merge-Sort (E
)
Exclude 13.3.1: Minimizing the Number of
Runs.
Query Optimization
and Execution
Relational Operators
Files and Access Methods
Buffer Management
Disk Space Management
DB
9-3
Idea Outline
Pass 0 (Sort Lists):For every page, read it, sort it, write it out
Pass 1, 2, , etc. (Merge Lists): see next page for merging concept
Buffer
Disk
Main
memory
OUTPUT
INPUT 2
Disk
Disk
Main memory
buffers
Passes 1,2,
Pass 0
Disk
9-5
Merging Lists
( 2 )
Merging Lists Outline (Phase 1,2,)
1. Load the next sorted runs R1 and R2 into main memory buffers B1 and B2 a
page-at-a-time (i.e., initially first page from each run) (see left figure)
If B1[i] was smallest item then i++ else j++ (see right figure)
If OUTPUT gets full, it is appended to the end of a file on DISK and cleared in RAM.
4. Repeat the above until either index i or j reaches the end of its buffer.
At this point write the remaining records to OUTPUT, flush it to disk and finish.
5. Repeat procedure from 1-4 until all runs have been traversed.
B1
B1
OUTPUT
16
30
B2
min(B1[i],B2[j])
B2
Disk
Disk
OUTPUT
15
28 60
9-6
Example
Execution
3,4
6,2
9,4
8,7
5,6
3,1
Input file
PASS 0
3,4
2,6
4,9
7,8
5,6
1,3
1-page runs
PASS 1
4,7
8,9
2,3
4,6
1,3
5,6
2-page runs
PASS 2
2,3
4,4
6,7
8,9
1,2
3,5
6
4-page runs
PASS 3
1,2
2,3
3,4
4,5
6,6
7,8
9
PASS 4
9-7
log 2 N 1
log10 7
1 2.8 1 4
log10 2
log 2 7 1
log 2 5 1 2.3 1 4
log 2 4 1 2 1 3
2 N * (# passes)
3,4
3,4
6,2
2,6
9,4
4,9
8,7
7,8
5,6
5,6
3,1
1,3
Input file
PASS 0
1-page runs
PASS 1
4,7
8,9
2,3
4,6
1,3
5,6
2-page runs
2
PASS 2
2,3
4,4
6,7
8,9
1,2
3,5
6
4-page runs
log 2 N
PASS 3
1,2
2,3
2 * 7 * (log 2 7 1) 2 * 7 * 4 56
3,4
8-page runs
i.e., (read+write) * 7 pages * 4 passes
4,5
That can be validated on the right figure
6,6
Pass#0=2*7
Pass#1=2*7
7,8
Pass#2=2*7
Pass$3=2*7
9
9-8
EPL446: Advanced Database Systems - Demetris Zeinalipour (University of Cyprus)
Phase 0
We
want to
end up
with 1 Merging
Lists
final run
9-9
9-10
Idea Outline
Pass 0 (Sort): Sort the N pages using B buffer pages
Pass 0
Passes 1,2,
9-11
2-way Sort
1log 2 N
2
0
Pass
3: Database
Sorted file
of 108
pages
EPL446:
Advanced
Systems
- Demetris
Zeinalipour (University of Cyprus)
9-12
B=3
100
7
1,000
10
10,000
13
100,000
17
1,000,000
20
10,000,000
23
100,000,000
26
1,000,000,000 30
B=5
4
5
7
9
10
12
14
15
B=9
3
4
5
6
7
8
9
10
1 log
/ B of Cyprus)
EPL446:
Advanced
Database
Zeinalipour
(University
* Results
generated
withSystems
formula:- Demetris
B 1 N
9-13
External Merge-Sort
( )
External MergeSort Pseudocode
Phase 0
9-14
Double Buffering
( )
An I/O request takes time to complete. Only think about all the
involved layers and delays (Disk Delays, Buffer Manager, etc)
To reduce wait time of CPU for I/O request to complete, can prefetch
() into `shadow block ( )
Main Idea: When all tuples of INPUTi have been consumed, the CPU
can process INPUTi which is prefetched into main memory instead of
waiting for INPUTi to refill. same idea applies to OUTPUT.
INPUT 1
INPUT 1'
INPUT 2
INPUT 2'
OUTPUT
OUTPUT'
Disk
INPUT k
block size
Disk
INPUT k'
9-15
B+ tree is clustered
9-16
Index
(Directs search)
Data Entries
("Sequence set"
Data Records
9-17
Data Entries
("Sequence set")
9-18