
An Object-Oriented Model For Implementing Cosequential Processes


Module 3

Cosequential operations involve the coordinated processing of two or more sequential lists to
produce a single output list. Sometimes the processing results in a merging, or union, of the
lists; at other times the goal is a matching, or intersection, of the items in the lists.

AN OBJECT-ORIENTED MODEL FOR IMPLEMENTING COSEQUENTIAL PROCESSES
We present a single, simple model that can be the basis for the construction of any kind of
cosequential process.
Matching Names in Two Lists
Suppose we want to output the names common to the two lists shown in Fig. 8.1. This
operation is usually called a match operation, or an intersection. We assume, for the moment,
that we will not allow duplicate names within a list and that the lists are sorted in ascending
order.

At each step in the processing of the two lists, we can assume that we have two items to
compare: a current item from List 1 and a current item from List 2. Let's call these two
current items Item(1) and Item(2). We can compare the two items to determine whether
Item(1) is less than, equal to, or greater than Item(2):

 If Item(1) is less than Item(2), we get the next item from List 1;
 If Item(1) is greater than Item(2), we get the next item from List 2; and
 If the items are the same, we output the item and get the next items from both lists.

It turns out that this can be handled very cleanly with a single loop containing one three-way
conditional statement, as illustrated in the algorithm of Fig. 8.2. (See also Program 7.)
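
The following is a minimal C++ sketch of that loop. The three-way test follows the rules
above; the use of sorted vectors in place of the book's list classes is an illustrative
assumption, not the exact code of Fig. 8.2:

#include <iostream>
#include <string>
#include <vector>

// Output the names common to two ascending, duplicate-free lists.
void Match(const std::vector<std::string>& list1,
           const std::vector<std::string>& list2) {
    size_t i = 0, j = 0;
    while (i < list1.size() && j < list2.size()) {
        if (list1[i] < list2[j])        // Item(1) < Item(2): advance List 1
            i++;
        else if (list1[i] > list2[j])   // Item(1) > Item(2): advance List 2
            j++;
        else {                          // items match: output and advance both
            std::cout << list1[i] << '\n';
            i++; j++;
        }
    }
}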

Although the match procedure appears to be quite simple, there are a number of matters that
have to be dealt with to make it work reasonably well.
Initializing: we need to arrange things in such a way that the procedure gets going properly.
Getting and accessing the next list item: we need simple methods that support getting the
next list element and accessing it.
Synchronizing: we have to make sure that the current item from one list is never so far ahead
of the current item on the other list that a match will be missed. Sometimes this means getting
the next item from List 1, sometimes from List 2, sometimes from both lists.
Handling end-of-file conditions: when we get to the end of either List 1 or List 2, we need to
halt the program.
Recognizing errors: when an error occurs in the data (for example, duplicate items or items
out of sequence), we want to detect it and take some action.
Merging Two Lists
The three-way-test, single-loop model for cosequential processing can easily be modified to
handle merging of lists simply by producing output for every case of the if-then-else
construction, since a merge is a union of the list contents. Method Merge2Lists is given in Fig.
8.5. Once again, you should use this logic to work, step by step, through the lists provided in
Fig. 8.1 to see how the resynchronization is handled and how the use of HighValues
forces the procedure to finish both lists before terminating.
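
A corresponding sketch of the merge, under the same assumptions as the Match sketch above:
every branch of the three-way test now produces output, and the leftover tail of either list is
flushed at the end, which is the role the text assigns to HighValues:

#include <iostream>
#include <string>
#include <vector>

// Output the union of two ascending, duplicate-free lists.
void Merge2Lists(const std::vector<std::string>& list1,
                 const std::vector<std::string>& list2) {
    size_t i = 0, j = 0;
    while (i < list1.size() && j < list2.size()) {
        if (list1[i] < list2[j])      std::cout << list1[i++] << '\n';
        else if (list1[i] > list2[j]) std::cout << list2[j++] << '\n';
        else {                        // equal: output once, advance both
            std::cout << list1[i++] << '\n';
            j++;
        }
    }
    while (i < list1.size()) std::cout << list1[i++] << '\n';  // finish list 1
    while (j < list2.size()) std::cout << list2[j++] << '\n';  // finish list 2
}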

Summary of the Cosequential Processing Model


It is important to understand that the model makes certain general assumptions about the
nature of the data and the type of problem to be solved. Following is a list of the assumptions,
together with clarifying comments.

Given these assumptions, the essential components of the model are:
 Initialization. Previous item values for all files are set to the low value; then current
records for all files are read from the first logical records in the respective files.
 One main synchronization loop is used, and the loop continues as long as relevant
records remain.
 Within the body of the main synchronization loop is a selection based on comparison
of the record keys from the respective input file records. If there are two input files, the
selection takes the form given in function Match of Fig. 8.2.
 Input files and output files are sequence checked by comparing the previous item
value with the new item value when a record is read. After a successful sequence
check, the previous item value is set to the new item value to prepare for the next
input operation on the corresponding file. (A sketch of such a routine follows this list.)
 High values are substituted for actual key values when end-of-file occurs. The main
processing loop terminates when high values have occurred for all relevant input files.
The use of high values eliminates the need to add special code to deal with each end-
of-file condition.
 All possible I/O and error detection activities are to be relegated to supporting
methods so the details of these activities do not obscure the principal processing logic.
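
As one illustration of the last three components, here is a hedged sketch of a
NextItemInList-style routine that sequence-checks each new record and substitutes a high
value at end-of-file; the stream-based I/O and the HIGH_VALUE sentinel are assumptions,
not the textbook's class design:

#include <fstream>
#include <stdexcept>
#include <string>

const std::string HIGH_VALUE = "\xFF";   // sorts after any real key (assumed)

std::string NextItemInList(std::ifstream& in, std::string& previous) {
    std::string item;
    if (!(in >> item))                   // end-of-file: return the high value so
        return HIGH_VALUE;               // the main loop can ignore this list
    if (item <= previous)                // sequence check: keys must ascend
        throw std::runtime_error("item out of sequence: " + item);
    previous = item;                     // remember for the next sequence check
    return item;
}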
APPLICATION OF THE MODEL TO A GENERAL LEDGER PROGRAM
The Problem
Suppose we are given the problem of designing a general ledger posting program as part of
an accounting system. The system includes a journal file and a ledger file. The ledger
contains month-by-month summaries of the values associated with each of the bookkeeping
accounts. A sample portion of the ledger, containing only checking and expense accounts, is
illustrated in Fig. 8.6.
The journal file contains the monthly transactions that are ultimately to be posted to the
ledger file. Figure 8.7 shows these journal transactions.

Once the journal file is complete for a given month, meaning that it contains all of the
transactions for that month, the journal must be posted to the ledger. Posting involves
associating each transaction with its account in the ledger.
How is the posting process implemented? Clearly, it uses the account number as a key to
relate the journal transactions to the ledger records. One obvious approach is to process the
journal transaction by transaction, seeking into the ledger for the matching account as each
transaction is read; but this requires a random access for every single transaction.
A better solution is to begin by collecting all the journal transactions that relate to a given
account. This involves sorting the journal transactions by account number, producing a list
ordered as in Fig. 8.9. Now we can create our output list by working through both the ledger
and the sorted journal cosequentially, meaning that we process the two lists sequentially and
in parallel. This concept is illustrated in Fig. 8.10.

Application of the Model to the Ledger Program
The monthly ledger posting program must perform two tasks:
 It needs to update the ledger file with the correct balance for each account for the
current month.
 It must produce a printed version of the ledger that not only shows the beginning and
current balance for each account but also lists all the journal transactions for the
month.
As you can see in Fig. 8.11, the printed output from the monthly ledger posting program
shows the balances of all ledger accounts, whether or not there were transactions for the
account.

In summary, there are three different steps in processing the ledger entries:
1. Immediately after reading a new ledger object, we need to print the header line and
initialize the balance for the next month from the previous month's balance.
2. For each transaction object that matches, we need to update the account balance.
3. After the last transaction for the account, the balance line should be printed. This is the
place where a new ledger record could be written to create a new ledger file.

Figure 8.13 has the code for the three-way-test loop of method PostTransactions.

The reasoning behind the three-way test is as follows:
1. If the ledger (master) account number (Item[1]) is less than the journal (transaction)
account number (Item[2]), then there are no more transactions to add to the ledger account
this month (perhaps there were none at all), so we print the ledger account balances
(ProcessEndMaster) and read in the next ledger account (NextItemInList(1)). If the account
exists (MoreMasters is true), we print the title line for the new account (ProcessNewMaster).

2. If the account numbers match, then we have a journal transaction that is to be posted to the
current ledger account. We add the transaction amount to the account balance for the new
month (ProcessCurrentMaster), print the description of the transaction (ProcessItem(2)), then
read the next journal entry (NextItemInList(2)).

3. If the journal account is less than the ledger account, then it is an unmatched journal
account, perhaps due to an input error. We print an error message (ProcessTransactionError)
and continue with the next transaction. A compilable skeleton of this loop is sketched below.
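
The following skeleton is one way the loop of Fig. 8.13 might look. The method names come
from the text, while the stub bodies, the HIGH sentinel, and the integer keys are assumptions
made so the sketch is self-contained:

#include <iostream>

const int HIGH = 1 << 30;                      // high value sentinel (assumed)
int Item[3];                                   // Item[1] = ledger, Item[2] = journal

bool NextItemInList(int i) { Item[i] = HIGH; return false; }  // stub: would read next key
void ProcessNewMaster()        { std::cout << "print account title line\n"; }
void ProcessCurrentMaster()    { std::cout << "add transaction to balance\n"; }
void ProcessEndMaster()        { std::cout << "print account balance line\n"; }
void ProcessTransactionError() { std::cout << "unmatched journal account\n"; }
void ProcessItem(int i)        { std::cout << "print transaction from list " << i << '\n'; }

void PostTransactions() {
    while (Item[1] != HIGH || Item[2] != HIGH) {
        if (Item[1] < Item[2]) {               // no more transactions for this account
            ProcessEndMaster();                // print the account balances
            if (NextItemInList(1))             // read the next ledger account
                ProcessNewMaster();            // print its title line
        } else if (Item[1] == Item[2]) {       // transaction posts to this account
            ProcessCurrentMaster();            // add amount to the month's balance
            ProcessItem(2);                    // print the transaction description
            NextItemInList(2);                 // read the next journal entry
        } else {                               // journal key below ledger key
            ProcessTransactionError();         // report the unmatched transaction
            NextItemInList(2);
        }
    }
}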

EXTENSION OF THE MODEL TO INCLUDE MULTIWAY MERGING


The most common application of cosequential processes requiring more than two input files
is a K-way merge, in which we want to merge K input lists to create a single, sequentially
ordered output list. K is often referred to as the order of a K-way merge.
A K-way Merge Algorithm (see also Program 8)
Suppose we keep an array of lists and an array of the items (or keys) currently being used
from each list in the cosequential process:
list[0], list[1], list[2], ... list[k-1]
Item[0], Item[1], Item[2], ... Item[k-1]
The main loop for the merge processing requires a call to a MinIndex function to find the
index of the item with the minimum collating-sequence value, and an inner loop that finds all
lists that are using that item:

int minItem = MinIndex(Item, k);        // find the index of the minimum item

ProcessItem(minItem);                   // Item[minItem] is the next output
for (i = 0; i < k; i++)                 // look at each list
    if (Item[minItem] == Item[i])       // advance list i
        MoreItems[i] = NextItemInList(i);
Clearly, the expensive parts of this procedure are finding the minimum and testing to see in
which lists the item occurs and which files therefore need to be read. Note that because the
item can occur in several lists, every one of these if tests must be executed on every cycle
through the loop. However, it is often possible to guarantee that a single item, or key, occurs
in only one list. In this case, the procedure becomes simpler and more efficient.
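
One plausible MinIndex, assuming string keys (an exhausted list holds a high value, so it
never wins the comparison):

#include <string>

int MinIndex(const std::string Item[], int k) {
    int minIndex = 0;
    for (int i = 1; i < k; i++)
        if (Item[i] < Item[minIndex])   // strictly smaller: new minimum
            minIndex = i;
    return minIndex;
}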

A Selection Tree for Merging Large Numbers of Lists


The K-way merge described earlier works nicely if K is no larger than 8 or so. When we
begin merging a larger number of lists, the set of sequential comparisons to find the key with
the minimum value becomes noticeably expensive.
The use of a selection tree is an example of the classic time-versus-space trade-off we so
often encounter in computer science. The concept underlying a selection tree can be readily
communicated through a diagram such as that in Fig. 8.15.

The selection tree is a kind of tournament tree in which each higher-level node represents the
"winner" (in this case the minimum key value) of the comparison between the two descendent
keys. The minimum value is always at the root node of the tree. If each key has an associated
reference to the list from which it came, it is a simple matter to take the key at the root, read
the next element from the associated list, then run the tournament again. Since the tournament
tree is a binary tree, its depth is the ceiling function ┌log2 K┐ for a merge of K lists. The
number of comparisons required to establish a new tournament winner is, of course, related
to this depth rather than being a linear function of K.
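
A sketch of the idea in C++: here a std::priority_queue, which is itself a heap, stands in
for the selection tree, giving the same O(log K) cost per output key; the vector-of-lists input
is an assumed stand-in for K open files:

#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <vector>

void KWayMerge(const std::vector<std::vector<std::string>>& lists) {
    using Entry = std::pair<std::string, size_t>;          // (key, list index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> tree;
    std::vector<size_t> pos(lists.size(), 0);              // cursor per list

    for (size_t i = 0; i < lists.size(); i++)              // load the first keys
        if (!lists[i].empty()) tree.push({lists[i][0], i});

    while (!tree.empty()) {
        auto [key, i] = tree.top();                        // the current "winner"
        tree.pop();
        std::cout << key << '\n';                          // output the minimum key
        if (++pos[i] < lists[i].size())                    // rerun the tournament with
            tree.push({lists[i][pos[i]], i});              // the next key from list i
    }
}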

A SECOND LOOK AT SORTING IN MEMORY
Consider the problem of sorting a disk file that is small enough to fit in memory. The
operation we described involves three separate steps:
1. Read the entire file from disk into memory.
2. Sort the records using a standard sorting procedure, such as shell sort.
3. Write the file back to disk.
The total time taken to sort the file is the sum of the times for the three steps. Can we
improve on the time that it takes for this memory sort?
Overlapping Processing and I/O: Heapsort
Most of the time when we use an internal sort, we have to wait until we have the whole file in
memory before we can start sorting. Is there an internal sorting algorithm that is reasonably
fast and that can begin sorting numbers immediately as they are read, rather than waiting for
the whole file to be in memory?
In fact there is: it is called heapsort.
Heapsort keeps all of the keys in a structure called a heap. A heap is a binary tree with the
following properties:
1. Each node has a single key, and that key is greater than or equal to the key at its parent
node.
2. It is a complete binary tree, which means that all of its leaves are on no more than two
levels and that all of the keys on the lower level are in the leftmost positions.
3. Because of properties 1 and 2, storage for the tree can be allocated sequentially as an array
in such a way that the root node is index 1 and the indexes of the left and right children of
node i are 2i and 2i + 1, respectively.
Figure 8.16 shows a heap in both its tree form and as it would be stored in an array.

Building the Heap While Reading the File
The algorithm for heapsort has two parts. First we build the heap; then we output the keys in
sorted order. The first stage can occur at virtually the same time that we read the data, so in
terms of elapsed time it comes essentially free. The Insert method that adds a string to the
heap is shown in Fig. 8.17. Figure 8.18 contains a sample application of this algorithm.

Figure 8.17: Insert method to insert a key into the heap
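
A minimal sketch of the insert step, in the spirit of Fig. 8.17 rather than its exact code,
assuming a min-heap of strings kept in an array whose root is index 1 (heap[0] is an unused
placeholder, following property 3 above):

#include <string>
#include <utility>
#include <vector>

void Insert(std::vector<std::string>& heap, const std::string& key) {
    if (heap.empty()) heap.push_back("");     // index 0 is a placeholder
    heap.push_back(key);                      // the new key starts at the end
    size_t k = heap.size() - 1;
    while (k > 1 && heap[k] < heap[k / 2]) {  // smaller than parent: bubble up
        std::swap(heap[k], heap[k / 2]);
        k /= 2;
    }
}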

How do we make the input overlap with the heap-building procedure? We read a block of
records at a time into an input buffer and then operate on all of the records in the block before
going on to the next block. Each time we read a new block, we just append it to the end of the
heap (that is, the input buffer "moves" as the heap gets larger). The first new record is then at
the end of the heap array, as required by the Insert function (Fig. 8.17). Once that record is
absorbed into the heap, the next new record is at the end of the heap array, ready to be
absorbed into the heap, and so forth.
Use of an input buffer avoids an excessive number of seeks, but it still doesn't let input occur
at the same time that we build the heap. The way to make processing overlap with I/O is to
use more than one buffer. With multiple buffering, as we process the keys in one block from
the file, we can simultaneously read later blocks from the file.
Figure 8.19 illustrates the technique that we have just described, in which we append each
new block of records to the end of the heap, thereby employing a memory-sized set of input
buffers. Now we read new blocks as fast as we can, never having to wait for processing
before reading a new block.

Sorting While Writing to the File
The second and final step involves writing the heap in sorted order again, it is possible to
overlap I/O (in this case writing), with processing. First, let's look at how to output the sorted
keys. Retrieving the keys in order is $imply a repetition of the following steps:

1. Determine the value of the key in the first position of the heap. This is the smallest value in
the heap.

2. Move the value in the last heap position into the first position, and decrease the number of
elements by one. The heap is now out of order at its root.

3. Reorder the heap by exchanging the root element with the smaller of its children and
moving it down the tree to its new position until the heap is back in order.

Each time these three steps are executed, the smallest value is retrieved and removed from the
heap. Figure 8.20 contains the code for method Remove that implements these steps.
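
A matching sketch of Remove for the heap used in the Insert sketch above, implementing the
three steps just listed; again the array layout with the root at index 1 is an assumption, and
the heap must be non-empty:

#include <string>
#include <utility>
#include <vector>

std::string Remove(std::vector<std::string>& heap) {
    std::string smallest = heap[1];        // step 1: the key at the root
    heap[1] = heap.back();                 // step 2: last element to the front
    heap.pop_back();                       //         and shrink the heap
    size_t k = 1, n = heap.size() - 1;     // n = index of the last element
    while (2 * k <= n) {                   // step 3: sift down to restore order
        size_t child = 2 * k;              // left child
        if (child + 1 <= n && heap[child + 1] < heap[child])
            child++;                       // pick the smaller child
        if (heap[k] <= heap[child]) break; // heap order restored
        std::swap(heap[k], heap[child]);
        k = child;
    }
    return smallest;
}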

First, we know immediately which record will be written first in the sorted file; next, we
know what will come second; and so forth. So as soon as we have identified a block of
records, we can write that block, and while we are writing that block, we can identify the
next block, and so forth.

MERGING AS A WAY OF SORTING LARGE FILES ON DISK


The multiway merge algorithm discussed in Section 8.3 provides the beginning of an
attractive solution to the problem of sorting large files.
As an example, consider a file with 8,000,000 records, each of which is 100 bytes long and
contains a key field that is 10 bytes long. We can create a sorted subset of our full file by
reading records into memory until the memory work area is almost full, sorting the records in
this work area, then writing the sorted records back to disk as a sorted subfile. We call such a
sorted subfile a run.
A schematic view of this run creation and merging process is provided in Fig. 8.21. This
solution to our sorting problem has the following features:
1. It can, in fact, sort large files and can be extended to files of any size.
2. Reading of the input file during the run-creation step is sequential and hence is much faster
than input that requires seeking for every record individually (as in a keysort).
3. Reading through each run during merging and writing the sorted records is also sequential.
Random accesses are required only as we switch from run to run during the merge operation.
4. If a heapsort is used for the in-memory part of the merge, as described in Section 8.4, we
can overlap these operations with I/O so the in-memory part does not add appreciably to the
total time for the merge.
5. Since I/O is largely sequential, tapes can be used if necessary for both input and output
operations.
How Much Time Does a Merge Sort Take?
We see in Fig. 8.21 that there are four times when I/O is performed. During the sort phase:
1. Reading all records into memory for sorting and forming runs, and
2. Writing sorted runs to disk.
During the merge phase:
3. Reading sorted runs into memory for merging, and
4. Writing sorted file to disk.
Let's look at each of these in order.
Step 1: Reading Records into Memory for Sorting and Forming Runs
Since we sort the file in 10-megabyte chunks, we read 10 megabytes at a time from the file.
In a sense, memory is a 10-megabyte input buffer that we fill up eighty times to form the
eighty runs. In computing the total time to input each run, we need to include the amount of
time it takes to access each block (seek time + rotational delay), plus the amount of time it
takes to transfer each block. Seek and rotational delay times are 8 msec and 3 msec,
respectively, so the total time per seek is 11 msec. The transmission rate is approximately
14,500 bytes per msec. Total input time for the sort phase consists of the time required for 80
seeks, plus the time required to transfer 800 megabytes:
Access: 80 seeks x 11 msec = 1 sec
Transfer: 800 megabytes @ 14,500 bytes/msec = 60 sec
Total: 61 sec
Step 2: Writing Sorted Runs to Disk
In this case, writing is just the reverse of reading: the same number of seeks and the same
amount of data to transfer. So it takes another 61 seconds to write the 80 sorted runs.
Step 3: Reading Sorted Runs into Memory for Merging
Since we have 10 megabytes of memory for storing runs, we divide 10 megabytes into 80
parts for buffering the 80 runs. In a sense, we are reallocating our 10 megabytes of memory
as 80 input buffers. Each of the 80 buffers then holds 1/80th of a run (125,000 bytes), so we
have to access each run 80 times to read all of it. Because there are 80 runs, in order to
complete the merge operation (Fig. 8.22) we end up making
80 runs x 80 seeks = 6,400 seeks.
Total seek and rotation time is then 6,400 x 11 msec = 70 seconds.
Since 800 megabytes is still transferred, transfer time is still 60 seconds.

Step 4: Writing the Sorted File to Disk
To compute the time for writing the file, we need to know how big our output buffers are. To
keep matters simple, let us assume that we can allocate two 200,000-byte output buffers.
With 200,000-byte buffers, we need to make 4,000 seeks.
Total seek and rotation time is then 4,000 x 11 msec = 44 seconds. Transfer time is still 60
seconds.
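
Adding up the rounded figures above gives a rough estimate for the whole job: the sort phase
costs about 61 + 61 = 122 seconds, and the merge phase about (70 + 60) + (44 + 60) = 234
seconds, for a total of roughly 356 seconds, or about six minutes, to sort the 800-megabyte
file.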

MULTILEVEL INDEXING AND B-TREES


Introduction
The limitation of simple indexing, namely that the index file may itself be so voluminous that
only rather small parts of it can be kept in main memory at one time, led to the application of
tree concepts to file structures.
The evolution of trees for file structures started with applying the binary search tree (BST) to
file structures. Its limitation was the unbalanced growth of the BST.
This gave rise to applying the AVL tree concept to file structures. With this, the problem of
height balance was addressed, but keeping the tree height-balanced required many local
rotations. This limitation gave rise to the development of B-trees.
Binary tree: a tree in which each node has at most two children.
Leaf: a node at the lowest level of a tree.
Height-balanced tree: a tree in which the difference between the heights of subtrees is
limited.
Statement of the problem
 Searching an index must be faster than binary searching.
 Inserting and deleting must be as fast as searching.
Indexing with Binary Search Trees
Given the sorted list below, we can express a binary search of this list as a binary search tree,
as shown in the figure below.
AX, CL, DE, FB, FT, HN, JD, KF, NR, PA, RF, SD, TK, WS, YJ

KF

FB SD

CL HN PA WS

AX DE FT JD NR RF TK YJ

Using elementary data structure techniques, it is a simple matter to create a node that contains
right and left link fields so the binary search tree can be constructed as a linked structure.


The table below shows the record contents for a linked representation of the above binary tree.

ROOT = 9

Index  Key  Left child  Right child      Index  Key  Left child  Right child
0      FB   10          8                8      HN   7           1
1      JD   --          --               9      KF   0           3
2      RF   --          --               10     CL   4           12
3      SD   6           13               11     NR   15          --
4      AX   --          --               12     DE   --          --
5      YJ   --          --               13     WS   14          5
6      PA   11          2                14     TK   --          --
7      FT   --          --               15     LV   --          --

The records in the file illustrated above appear in random rather than sorted order. The
sequence of the records in the file has no necessary relation to the structure of the tree: all the
information about the logical structure is carried in the link fields. The very positive
consequence that follows from this is that if we add a new key to the file, such as LV, we
need only link it to the appropriate leaf node to create a tree that provides search performance
as good as we would get with a binary search on a sorted list. (LV appears as index 15 in the
table above.)
AVL Trees (G. M. Adel'son-Velskii and E. M. Landis)
An AVL tree is a binary tree that maintains height balance (HB(1)) by means of localized
reorganizations of its nodes.
 The heights of the two subtrees of any node differ by no more than one level.
 AVL trees maintain HB(1) with local rotations of the nodes.
 AVL trees are not perfectly balanced.
 Worst-case search with an AVL tree takes at most 1.44 log2(N + 2) comparisons.
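As a rough worked example (numbers assumed): for N = 1,000,000 keys, the AVL worst case
is about 1.44 log2(1,000,002) ≈ 1.44 x 19.9 ≈ 29 comparisons, versus 20 for a perfectly
balanced tree.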
Examples of AVL trees and examples of non-AVL trees:

Paged Binary Trees
Disk utilization of a binary search tree is extremely inefficient: when a node of a binary
search tree is read, it yields only three useful pieces of information, the key value and the
addresses of the left and right subtrees, yet each disk read retrieves a minimum of one full
page.
The paged binary tree attempts to address this problem by locating multiple binary nodes on
the same disk page. Paging divides a binary tree into pages and then stores each page in a
block of contiguous locations on disk, which reduces the number of seeks associated with a
search.
 Worst-case search with a balanced paged binary tree with page size M is
log_(M+1)(N + 1) comparisons.
 Balancing a paged binary tree can involve rotations across pages, involving physical
movement of nodes.
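As an illustrative calculation (numbers assumed): with pages holding M = 511 keys, a
balanced paged tree over N = 134,217,727 keys can be searched in log512(134,217,728) = 3
page reads in the worst case, compared with about log2(N + 1) = 27 node reads, each
potentially a seek, for an unpaged binary search tree.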

The Problem with Paged Binary Trees
 The approach works well only when we have the entire set of keys in hand before the
tree is built.
 Trees built by arbitrary insertions can become badly out of balance.
Multilevel Indexing: A Better Approach to Tree Indexes
Consider an 80 MB file that has 8,000,000 records. Each record is 100 bytes, and each key is
10 bytes. An index of this file has 8,000,000 key-reference pairs divided among a sequence of
index records. Let's suppose that we can put 100 key-reference pairs in a single index record;
hence there are 80,000 records in the index.
The 100 lowest keys are placed in the first index record, and that record is written to the
index file. The next 100 keys go into the next record of the file, and so on. This continues
until we have 80,000 index records in the index file. We must find a way to speed up the
search of this 80,000-record index file, so we construct an index of the index file, and then an
index of that index. This is shown in the figures below.
Data file (80 MB): 8,000,000 records of 100 bytes each.

Index file to the data file: 8,000,000 keys in 80,000 records, each record holding 100 keys:
  record 1:      1, 2, 3, 4, ..., 100
  record 2:      101, 102, 103, ..., 200
  record 3:      201, 202, 203, ..., 300
  ...
  record 80,000: 7,999,901, 7,999,902, ..., 8,000,000

Index to the index file: 80,000 keys (the largest key of each index record) in 800 records,
each record holding 100 keys:
  record 1: 100, 200, 300, 400, ...
  ...
  last record: ..., 7,999,900, 8,000,000

Index to the index to the index file: 800 keys in 8 records, each record holding 100 keys.
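
To see why so few levels are needed: each index level reduces the number of records by a
factor of 100, the fan-out given by the number of key-reference pairs per record, from 80,000
to 800 to 8. The 8-record top level is small enough to hold in memory, so locating any key
then costs one record read per lower level.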
B-Trees: Working up from the Bottom
B-trees are multilevel indexes that solve the problem of the linear cost of insertion and
deletion.
In B-trees an overflowing record is split into two records, each half full. Deletion takes a
similar strategy, merging two records into a single record when necessary. Each node of a
B-tree is an index record.
Each record has the same maximum number of key-reference pairs, called the order of the
B-tree. The records also have a minimum number of key-reference pairs, typically half of the
order.
 An attempt to insert a new key into an index record that is not full is handled simply by
updating the index record.
 If the new key is the new largest key in the index record, it becomes the new
higher-level key of that record, and the next higher level of the index must be updated.
 When insertion into an index record causes it to overflow, it is split into two
records, each with half of the keys. Since a new index node has been created at this
level, the largest key in this new node must be inserted into the next higher level node.
This is called promotion of the key. (A sketch of the split step appears after this list.)
 Sometimes promotion of a key may cause an overflow at that level. This in turn causes
that node to be split, and a key to be promoted to the next level. This continues as far
as necessary; when the overflow propagates all the way up, another level is added to
the multilevel index.
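
The following is a hedged sketch of the split step only. The node layout, the ORDER
constant, and the convention that each key in an interior node is the largest key of the
corresponding child are illustrative assumptions, not the textbook's class:

#include <string>
#include <vector>

const int ORDER = 4;                     // maximum keys per node (assumed)

struct BTreeNode {
    std::vector<std::string> keys;       // kept in ascending order
    std::vector<BTreeNode*> children;    // empty at the leaf level
};

// Split a node holding ORDER + 1 keys into two roughly half-full nodes.
// Returns the new right sibling; the caller would then promote the largest
// key of each half into the parent node, as described above.
BTreeNode* Split(BTreeNode* node) {
    BTreeNode* sibling = new BTreeNode;
    size_t mid = node->keys.size() / 2;  // about half the keys stay put
    sibling->keys.assign(node->keys.begin() + mid, node->keys.end());
    node->keys.resize(mid);
    if (!node->children.empty()) {       // interior node: move child pointers too
        sibling->children.assign(node->children.begin() + mid, node->children.end());
        node->children.resize(mid);
    }
    return sibling;
}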

Example of Creating a B-Tree (also refer to the example solved in class).
Construct a B-tree of order 4 for the following sequence of keys:
C G J X N S U O A E B H I F K L Q R T V
Step 1: After insertion of C G J X

[C G J X]

Step 2: After insertion of N S U

Root:   [J X]
Leaves: [C G J] [N S U X]

Step 3: After insertion of O A

Root:   [J S X]
Leaves: [A C G J] [N O S] [U X]

Step 4: After insertion of E B H I

Root:   [E J S X]
Leaves: [A B C E] [G H I J] [N O S] [U X]
Step 5: After insertion of F K

Root:    [J X]
Level 2: [E H J] [S X]
Leaves:  [A B C E] [F G H] [I J] [K N O S] [U X]

Step 6: After insertion of L Q R T V

Root:    [J X]
Level 2: [E H J] [N S X]
Leaves:  [A B C E] [F G H] [I J] [K L N] [O Q R S] [T U V X]

B-Tree Nomenclature
 Order of a B-tree:
 According to Bayer, McCreight, and Comer, the order of a B-tree is the
minimum number of keys that can be in a page of the tree.
 Knuth defines the order of a B-tree as the maximum number of descendants
that a page can have.
 Leaf:
 Bayer and McCreight refer to the lowest level of keys in a B-tree as the
leaf level.
 Knuth considers the leaf of a B-tree to be one level below the lowest level
of keys. In other words, he considers the leaves to be the actual data records
that might be pointed to by the lowest level of keys in the tree.
Formal Definition of B-Tree Properties
For a B-tree of order m:
 Every page has a maximum of m descendants.
 Every page, except for the root and the leaves, has a minimum of ┌m/2┐ descendants.
 The root has a minimum of 2 descendants (unless it is a leaf).
 All of the leaves are on the same level.
 The leaf level forms a complete, ordered index of the associated data file.
Worst-Case Search Depth
The worst-case search is the maximum number of disk accesses required to locate a key in
the tree. Since every key appears at the leaf level, the worst case is simply the depth of the
tree. The worst case occurs when every page of the tree has only the minimum number of
descendants, since the keys are then spread over a tree of maximal height and minimal
breadth.
For a B-tree of order m, the minimum number of descendants from the root page is two, so
the second level of the tree contains only two pages. Each of these pages in turn has at least
┌m/2┐ (the ceiling of m/2) descendants. The general pattern of the relation between depth
and the minimum number of descendants takes the following form:

Level      Minimum number of descendants
1 (root)   2
2          2 * ┌m/2┐
3          2 * (┌m/2┐)^2
4          2 * (┌m/2┐)^3
...        ...
d          2 * (┌m/2┐)^(d-1)

So, in general, for any level d of a B-tree, the minimum number of descendants extending
from that level is
2 * (┌m/2┐)^(d-1)
For a tree with N keys in its leaves, we can express the relationship between the number of
keys and the minimum height d as
N ≥ 2 * (┌m/2┐)^(d-1)
Solving for d, we arrive at the expression:
d ≤ 1 + log┌m/2┐(N/2)
For problems based on this depth calculation, refer to the class notes.
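As a quick worked illustration (with assumed numbers): for a B-tree of order m = 512
holding N = 1,000,000 keys, ┌m/2┐ = 256, so d ≤ 1 + log256(500,000) ≈ 1 + 2.37 ≈ 3.4.
Since the depth must be a whole number, such a tree needs at most 3 levels.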
Deletion, Merging, and Redistribution
Deletion of a key can result in several different situations. Consider the B-tree below and
what happens when we delete some of its keys.
The simplest situation is illustrated by deleting key ‘C’:

Root:    [J X]
Level 2: [E H J] [N S X]
Leaves:  [A B C E] [F G H] [I J] [K L N] [O Q R S] [T U V X]

Removal of key ‘C’ causes a change only in the leaf node:

Root:    [J X]
Level 2: [E H J] [N S X]
Leaves:  [A B E] [F G H] [I J] [K L N] [O Q R S] [T U V X]
Removal of key ‘X’ causes a change in level 2, which must also be reflected in level 1:

Root:    [J V]
Level 2: [E H J] [N S V]
Leaves:  [A B E] [F G H] [I J] [K L N] [O Q R S] [T U V]
Removal of key ‘J’ causes the leaf containing it to underflow, so that leaf is merged with its
sibling:

Root:    [I V]
Level 2: [E I] [N S V]
Leaves:  [A B E] [F G H I] [K L N] [O Q R S] [T U V]

Deletion of keys ‘U’ and ‘V’ causes the rightmost leaf node to underflow. Here a
redistribution of keys among the nodes is needed. This redistribution changes level 2, which
in turn must be reflected in level 1:

Root:    [I T]
Level 2: [E I] [N R T]
Leaves:  [A B E] [F G H I] [K L N] [O Q R] [S T]

Redistribution during Insertion: A Way to Improve Storage Utilization
Redistribution during insertion is a way of avoiding, or at least postponing, the creation of
new pages. Rather than splitting a full page and creating two approximately half-full pages,
redistribution lets us place some of the overflowing keys into another page. This makes the
B-tree more efficient in its space utilization. For example, consider the B-tree below:

Root:    [J X]
Level 2: [E H J] [N S X]
Leaves:  [A B C E] [F G H] [I J] [K L N] [O Q R S] [T U V X]
Now if key ‘D’ is inserted, the leftmost leaf overflows. Splitting it would add one more leaf
node to the B-tree, leaving two approximately half-full pages. If instead we redistribute keys
among the sibling nodes, we can postpone the split. The B-tree below shows the result after
the insertion of key ‘D’ with redistribution:

Root:    [J X]
Level 2: [D H J] [N S X]
Leaves:  [A B C D] [E F G H] [I J] [K L N] [O Q R S] [T U V X]
Buffering of Pages: Virtual B-Trees
Consider an index file of 1 MB of records when only 256 KB of memory is available. Given
a page size of 4 KB, holding around 64 keys per page, the B-tree can be contained in three
levels, and we need at most two disk seeks to retrieve any key from the B-tree. To improve on
this, we create a page buffer to hold some number of B-tree pages, perhaps 5, 10, or more. As
we read pages in from the disk in response to user requests, we fill up the buffer. Then, when
a page is requested, we access it from memory if we can, thereby avoiding a disk access. If
the page is not in memory, we read it into the buffer from secondary storage, replacing one of
the pages that was previously there. A B-tree that uses a memory buffer in this way is
sometimes referred to as a virtual B-tree. Common page replacement policies are:
 LRU: least recently used.
 Replacement based on page height: here we always retain the pages that occur
at the highest levels of the tree.
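
A small sketch of an LRU page buffer such a virtual B-tree might use. The Page type, the
buffer capacity, and the readPageFromDisk callback are assumptions; the point is the
recently-used ordering maintained by the list and map pair:

#include <functional>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

struct Page { std::string data; };                 // placeholder page contents

class LRUPageBuffer {
    size_t capacity;                               // e.g. 5-10 pages
    std::list<std::pair<long, Page>> pages;        // front = most recently used
    std::unordered_map<long, std::list<std::pair<long, Page>>::iterator> index;
public:
    explicit LRUPageBuffer(size_t cap) : capacity(cap) {}

    Page& Fetch(long pageAddress, std::function<Page(long)> readPageFromDisk) {
        auto it = index.find(pageAddress);
        if (it != index.end()) {                   // hit: no disk access needed
            pages.splice(pages.begin(), pages, it->second);  // mark most recent
            return pages.front().second;
        }
        if (pages.size() == capacity) {            // buffer full: evict the
            index.erase(pages.back().first);       // least recently used page
            pages.pop_back();
        }
        pages.emplace_front(pageAddress, readPageFromDisk(pageAddress));
        index[pageAddress] = pages.begin();
        return pages.front().second;
    }
};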

