Nothing Special   »   [go: up one dir, main page]

Unit-3 Part 2 Indexing and Hashing

Download as pdf or txt
Download as pdf or txt
You are on page 1of 36

Contents:

● Basic Concepts
● Ordered Indices
● Static Hashing
● Dynamic Hashing
INDEXING
Ordered Indices
1)Primary Index
i) Dense Index
ii) Sparse Index
iii) Multilevel Index
2)Secondary Index
Basic Concepts
● Indexing mechanisms used to speed up access to desired data.
● E.g., Index in a book to find a particular topic.
● An index file consists of records (called index entries) of the
form
search-key pointer

● Search Key - attribute value to find records in a file.


● Pointer- Location of the disc block where the record is located.

● Index files are typically much smaller than the original file.
● Two basic kinds of indices:
● Ordered indices: search keys are stored in sorted order
● Hash indices: search keys are distributed uniformly across
“buckets” using a “hash function”.
Index Evaluation Metrics
● Access types
● records with a specified value in the attribute (or)
● records with an attribute value falling in a specified
range of values (e.g. 10000 < salary < 40000)
● Access time
● Insertion time
● Deletion time
● Space overhead
Ordered Indices
● In an ordered index, index entries are stored in the sorted
order of the search key values.

● Primary index (clustering index ): an index whose search


key specifies an order same as the sequential order of the
file.
● Secondary index(non-clustering index ): an index whose
search key specifies an order different from the sequential
order of the file.

● Index-sequential file: ordered sequential file with a primary


index.
Dense Index Files
● Dense index — Index record appears for every search-key value
in the file.
Sparse Index Files
● Sparse Index: contains index records for only some search-key values.
● Applicable when records are sequentially ordered on search-key
● To locate a record with search-key value K we:
● Find index record with largest search-key value < K
● Search file sequentially starting at the record to which the index record
points
Sparse Index Files (Cont.)
● Compared to dense indices:
● Less space and less maintenance overhead for
insertions and deletions.
● Generally slower than dense index for locating records.
Multilevel Index
● If primary index does not fit in memory, access becomes
expensive.
● Solution: treat primary index kept on disk as a sequential
file and construct a sparse index on it.
● outer index – a sparse index of primary index
● inner index – the primary index(Dense Index) file

● If even outer index is too large to fit in main memory, yet


another level of index can be created, and so on.
● Indices at all levels must be updated on insertion or
deletion from the file.
Multilevel Index (Cont.)
Index Update: Record Deletion
● If deleted record was the only record in the file with its particular search-key
value, the search-key is deleted from the index also.
● Single-level index deletion:
● Dense indices – deletion of search-key: similar to file record deletion.
● Sparse indices –
● if deleted key value exists in the index, the value is replaced by the
next search-key value in the file (in search-key order).
● If the next search-key value already has an index entry, the entry is
deleted instead of being replaced.
Index Update: Record Insertion
● Single-level index insertion:
● Perform a lookup using the key value from inserted
record
● Dense indices – if the search-key value does not
appear in the index, insert it.
● Sparse indices – if index stores an entry for each
block of the file, no change needs to be made to the
index unless a new block is created.
● If a new block is created, the first search-key value
appearing in the new block is inserted into the index.
● Multilevel insertion (as well as deletion) algorithms are
simple extensions of the single-level algorithms
Secondary Indices

Secondary index on balance field of account

● Index record points to a bucket that contains pointers to all the


actual records with that particular search-key value.
● Secondary indices have to be dense.
HASHING
1) Static Hashing
i) Hash File Organisation
ii) Hash Indicies
2) Dynamic Hashing
Static Hashing
● A bucket is a unit of storage containing one or more
records (a bucket is typically a disk block).
● In a hash file organization we obtain the bucket of a
record directly from its search-key value using a hash
function.
● Hash function h is a function from the set of all search-
key values K to the set of all bucket addresses B.
● Hash function is used to locate records for access,
insertion as well as deletion.
● Records with different search-key values may be
mapped to the same bucket; thus entire bucket has to be
searched sequentially to locate a record.
Example of Hash File Organization

Hash file organization of account file, using branch_name


as key (See figure in next slide.)

● There are 10 buckets,


● The binary representation of the ith character is
assumed to be the integer i.
● The hash function returns the sum of the binary
representations of the characters modulo 10
● E.g. h(Perryridge) = 5 h(RoundHill) = 3
h(Brighton) = 3
Example of Hash File Organization
Hash file
organization of
account file, using
branch_name as key
Hash Functions
● Worst hash function maps all search-key values to the same
bucket; this makes access time proportional to the number of
search-key values in the file.
● An ideal hash function is uniform, i.e., each bucket is
assigned the same number of search-key values from the set
of all possible values.
● Typical hash functions perform computation on the internal
binary representation of the search-key.
● For example, for a string search-key, the binary
representations of all the characters in the string could be
added and the sum modulo the number of buckets could
be returned.
Handling of Bucket Overflows
● Bucket overflow can occur because of
● Insufficient buckets
● Skew in distribution of records. This can occur due
to two reasons:
● multiple records have same search-key value
● chosen hash function produces non-uniform
distribution of key values
● Although the probability of bucket overflow can be
reduced, it cannot be eliminated; it is handled by
using overflow buckets.
Handling of Bucket Overflows
● Overflow chaining – the overflow buckets of a given
bucket are chained together in a linked list.
● This technique is called open hashing.
Hash Indices
● Hashing can be used not only for file organization, but
also for index-structure creation.
● A hash index organizes the search keys, with their
associated record pointers, into a hash file structure.
● Strictly speaking, hash indices are always secondary
indices
● if the file itself is organized using hashing, a separate
primary hash index on it using the same search-key is
unnecessary.
● However, we use the term hash index to refer to both
secondary index structures and hash organized files.
Example of Hash Index
Deficiencies of Static Hashing
● In static hashing, function h maps search-key values to a
fixed set of B of bucket addresses. Databases grow or shrink
with time.
● If initial number of buckets is too small, and file grows,
performance will degrade due to too much overflows.
● If space is allocated for anticipated growth, a significant
amount of space will be wasted initially (and buckets will
be underfull).
● If database shrinks, again space will be wasted.
● One solution: periodic re-organization of the file with a new
hash function
● Expensive, disrupts normal operations
● Better solution: allow the number of buckets to be modified
dynamically.
Dynamic Hashing
● Good for database that grows and shrinks in size
● Allows the hash function to be modified dynamically
● Extendable hashing – one form of dynamic hashing
● Hash function generates values over a large range — typically b-bit
integers, with b = 32.
● At any time use only a prefix of the hash function to index into a
table of bucket addresses.
● Let the length of the prefix be i bits, 0  i  32.
● Bucket address table size = 2i. Initially i = 0
● Value of i grows and shrinks as the size of the database grows
and shrinks.
● Multiple entries in the bucket address table may point to a bucket.
● Thus, actual number of buckets is < 2i
● The number of buckets also changes dynamically due to
combining and splitting of buckets.
In this structure, i2 = i3 = i, whereas i1 = i – 1 (see next
slide for details)
Use of Extendable Hash Structure
● Each bucket j stores a value ij
● All the entries that point to the same bucket have the same
values on the first ij bits.
● To locate the bucket containing search-key Kj:
1. Compute h(Kj) = X
2. Use the first i high order bits of X as a displacement into
bucket address table, and follow the pointer to appropriate bucket
● To insert a record with search-key value Kj
● follow same procedure as look-up and locate the bucket, say j.
● If there is room in the bucket j insert record in the bucket.
● Else the bucket must be split and insertion re-attempted.
● Overflow buckets used instead in some cases.
Insertion in Extendable Hash Structure (Cont)
To split a bucket j when inserting record with search-key value Kj:
● If i > i j (more than one pointer to bucket j)
● allocate a new bucket z, and set i j = i z = (i j + 1)
● Update the second half of the bucket address table entries originally
pointing to j, to point to z
● remove each record in bucket j and reinsert (in j or z)
● recompute new bucket for Kj and insert record in the bucket (further
splitting is required if the bucket is still full)
● If i = ij (only one pointer to bucket j)
● If i reaches some limit b, or too many splits have happened in this
insertion, create an overflow bucket
● Else
● increment i and double the size of the bucket address table.
● replace each entry in the table by two entries that point to the same
bucket.
● recompute new bucket address table entry for Kj
Now i > ij so use the first case above.
Deletion in Extendable Hash Structure
● To delete a key value,
● locate it in its bucket and remove it.
● The bucket itself can be removed if it becomes empty
(with appropriate updates to the bucket address table).
● Combining of buckets can be done (can combine only
with a “buddy” bucket having same value of ij and
same ij –1 prefix, if it is present)
● Decreasing bucket address table size is also possible
● Note: decreasing bucket address table size is an
expensive operation and should be done only if
number of buckets becomes much smaller than the
size of the table
Initial Hash structure, bucket size = 2
Example (Cont.)
● Hash structure after insertion of one Brighton and two
Downtown records
Hash structure after insertion of Mianus record
Hash structure after insertion of three Perryridge records
Example (Cont.)
● Hash structure after insertion of Redwood and Round Hill records
THANK YOU

You might also like