Hashing RPK
Hashing
• Hashing is an effective way to store elements in a data structure.
• It reduces the number of comparisons needed to find an element.
• It can be used to access a stored record directly.
• Two concepts of hashing:
– Hash Table
– Hash function
Hash Table
• Data structure used for storing and retrieving data very quickly.
• Data is inserted using a key value. Hence every entry in the hash table is associated with a key.
Index   Employee ID
[0]     1000
[1]     1001
[2]     1022
[3]     1023
Hashing Techniques
• Various types of hash functions are used to place the record in the hash table:
1. Division method
2. Mid square method
3. Multiplicative hash function
4. Digit folding
5. Digit analysis
1. Division method
• The hash function depends on the remainder of a division.
• Example: the keys {54, 72, 89, 37} are to be placed in a hash table of size 10.
• H(key) = key % table_size
• 54 % 10 = 4, 72 % 10 = 2, 89 % 10 = 9, 37 % 10 = 7
Array Index   Number
[0]
[1]
[2]           72
[3]
[4]           54
[5]
[6]
[7]           37
[8]
[9]           89
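The division method above can be sketched in a few lines of Python. This is only an illustration: the names division_hash and build_table are made up for this example, and collisions are not handled.

```python
def division_hash(key: int, table_size: int) -> int:
    """Division method: the slot is the remainder of key / table_size."""
    return key % table_size


def build_table(keys, table_size):
    """Place each key in its slot (hypothetical helper, no collision handling)."""
    table = [None] * table_size
    for key in keys:
        table[division_hash(key, table_size)] = key
    return table


table = build_table([54, 72, 89, 37], 10)
for index, value in enumerate(table):
    print(f"[{index}] {'' if value is None else value}")
```

With the four keys from the slide, 72 lands in slot 2, 54 in slot 4, 37 in slot 7 and 89 in slot 9.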
2. Mid square method
3. Multiplicative hash function
4. Digit folding
5. Digit analysis
Applications
• Keeping track of customer account information at
a bank
– Search through records to check balances and perform
transactions
• Keep track of reservations on flights
– Search to find empty seats, cancel/modify reservations
• Search engine
– Looks for all documents containing a given word
Special Case: Dictionaries
• Dictionary = data structure that supports mainly
two basic operations: insert a new item and
return an item with a given key
• Queries: return information about the set S:
– Search (S, k)
– Minimum (S), Maximum (S)
– Successor (S, x), Predecessor (S, x)
• Modifying operations: change the set
– Insert (S, k)
– Delete (S, k) – not very often
Direct Addressing
• Assumptions:
– Key values are distinct
– Each key is drawn from a universe U = {0, 1, . . . , m - 1}
• Idea:
– Store the items in an array, indexed by keys
Direct Addressing (cont’d)
Operations
Alg.: DIRECT-ADDRESS-SEARCH(T, k)
return T[k]
Alg.: DIRECT-ADDRESS-INSERT(T, x)
T[key[x]] ← x
Alg.: DIRECT-ADDRESS-DELETE(T, x)
T[key[x]] ← NIL
• Running time for these operations: O(1)
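A minimal Python sketch of the three direct-address operations, assuming keys are distinct integers in 0..m-1. The class name and the (key, value) item layout are choices made for this example, not part of the slides.

```python
class DirectAddressTable:
    """Array indexed directly by key; every operation is O(1)."""

    def __init__(self, m: int):
        self.slots = [None] * m  # None plays the role of NIL

    def search(self, k: int):
        return self.slots[k]

    def insert(self, key: int, value):
        self.slots[key] = (key, value)

    def delete(self, key: int):
        self.slots[key] = None


t = DirectAddressTable(8)
t.insert(3, "record for key 3")
print(t.search(3))  # (3, 'record for key 3')
t.delete(3)
print(t.search(3))  # None
```

The price of O(1) access is one slot per possible key, whether or not that key is ever stored.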
Comparing Different Implementations
                    Insert   Search
direct addressing   O(1)     O(1)
ordered array       O(N)     O(lg N)
ordered list        O(N)     O(N)
unordered array     O(1)     O(N)
unordered list      O(1)     O(N)
Examples Using Direct Addressing
Example 1:
Example 2:
Hash Tables
• When K is much smaller than U, a hash table
requires much less space than a direct-address
table
– Can reduce storage requirements to |K|
– Can still get O(1) search time, but only in the average
case, not in the worst case
Hash Tables
Idea:
– Use a function h to compute the slot for each key
– Store the element in slot h(k)
Example: HASH TABLES
[Figure: the hash function h maps the actual keys K = {k1, k2, k3, k4, k5}, a subset of the universe of keys U, to slots 0 to m-1 of the table; h(k2) = h(k5)]
Revisit Example 2
Do you see any problems
with this approach?
[Figure: keys k1 to k5 from the universe U mapped by h to slots 0 to m-1; k2 and k5 map to the same slot, h(k2) = h(k5). Collisions!]
Collisions
• Two or more keys hash to the same slot!!
• For a given set K of keys
– If |K| ≤ m, collisions may or may not happen,
depending on the hash function
– If |K| > m, collisions will definitely happen (i.e., there
must be at least two keys that have the same hash
value)
• Avoiding collisions completely is hard, even with
a good hash function
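Both cases can be checked with a short Python sketch. It assumes the hash function h(k) = k mod m purely for illustration; find_collisions is a hypothetical helper.

```python
from collections import defaultdict


def find_collisions(keys, m):
    """Group keys by slot (assumed h(k) = k mod m) and keep slots with 2+ keys."""
    slots = defaultdict(list)
    for k in keys:
        slots[k % m].append(k)
    return {slot: ks for slot, ks in slots.items() if len(ks) > 1}


# |K| > m: 11 keys, 10 slots, so at least one slot must hold two keys.
print(find_collisions(range(11), 10))     # {0: [0, 10]}

# |K| <= m: collisions depend on the keys and the hash function.
print(find_collisions([1, 2, 3], 10))     # {}
print(find_collisions([5, 15, 25], 10))   # {5: [5, 15, 25]}
```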
Handling Collisions
• We will review the following methods:
– Chaining
– Open addressing
• Linear probing
• Quadratic probing
• Double hashing
Handling Collisions Using Chaining
• Idea:
– Put all elements that hash to the same slot into a
linked list
Collision with Chaining - Discussion
• Choosing the size of the table
– Small enough not to waste space
– Large enough such that lists remain short
– Typically 1/5 or 1/10 of the total number of elements
Insertion in Hash Tables
Alg.: CHAINED-HASH-INSERT(T, x)
insert x at the head of list T[h(key[x])]
Searching in Hash Tables
Alg.: CHAINED-HASH-SEARCH(T, k)
search for an element with key k in list T[h(k)]
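Insertion and search with chaining can be sketched as follows. This is a sketch under assumptions: h(k) = k mod m is used as the hash function, and a collections.deque stands in for the linked list (appendleft inserts at the head in O(1)). The class name is hypothetical.

```python
from collections import deque


class ChainedHashTable:
    def __init__(self, m: int):
        self.m = m
        self.table = [deque() for _ in range(m)]  # one chain per slot

    def _h(self, key: int) -> int:
        return key % self.m  # assumed hash function

    def insert(self, key: int, value):
        # CHAINED-HASH-INSERT: put the item at the head of its chain.
        self.table[self._h(key)].appendleft((key, value))

    def search(self, key: int):
        # CHAINED-HASH-SEARCH: scan only the chain of slot h(key).
        for k, v in self.table[self._h(key)]:
            if k == key:
                return v
        return None


t = ChainedHashTable(10)
t.insert(54, "a")
t.insert(64, "b")    # collides with 54 in slot 4
print(t.search(54))  # a
print(t.search(74))  # None
```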
Analysis of Hashing with Chaining:
Worst Case
• How long does it take to search for an element with a given key?
• Worst case:
– All n keys hash to the same slot
– Worst-case time to search is Θ(n), plus time to compute the hash function
[Figure: table T with slots 0 to m-1; all n keys form a single chain in one slot]
Analysis of Hashing with Chaining:
Average Case
• Average case
– depends on how well the hash function distributes the n keys among the m slots
• Simple uniform hashing assumption:
– Any given element is equally likely to hash into any of the m slots (i.e., the probability of collision, Pr(h(x) = h(y)), is 1/m)
• Length of a list:
length of T[j] = n_j, j = 0, 1, . . . , m – 1
• Number of keys in the table:
n = n_0 + n_1 + · · · + n_{m-1}
• Average value of n_j:
E[n_j] = α = n/m
[Figure: table T with chains of lengths n_0, n_1, ..., n_{m-1}; some chains are empty, e.g., n_0 = 0 and n_{m-1} = 0]
Load Factor of a Hash Table
• Load factor of a hash table T:
α = n/m
Case 1: Unsuccessful Search
(i.e., item not stored in the table)
Theorem
An unsuccessful search in a hash table takes expected time Θ(1 + α) under
the assumption of simple uniform hashing
(i.e., probability of collision Pr(h(x)=h(y)), is 1/m)
Proof
• Searching unsuccessfully for any key k
– need to search to the end of the list T[h(k)]
• Expected length of the list:
– E[n_{h(k)}] = α = n/m
• Expected number of elements examined in an unsuccessful search is α
• Total time required is:
– O(1) (for computing the hash function) + α = Θ(1 + α)
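The argument can be checked numerically with a small Python sketch. It models simple uniform hashing by dropping each of the n keys into a uniformly random slot; the function name and the seed are arbitrary choices for this example.

```python
import random


def unsuccessful_search_cost(n: int, m: int, seed: int = 0):
    """Return (average chain length, longest chain) for n keys in m slots."""
    rng = random.Random(seed)
    chain_lengths = [0] * m
    for _ in range(n):
        chain_lengths[rng.randrange(m)] += 1  # each slot equally likely
    # An unsuccessful search for a key hashing to slot j scans all n_j elements,
    # and every slot j is equally likely, so the expected scan is the mean of n_j.
    average = sum(chain_lengths) / m
    return average, max(chain_lengths)


average, longest = unsuccessful_search_cost(n=5000, m=1000)
print(average)  # 5.0, i.e. alpha = n/m
print(longest)  # individual chains can be longer than alpha
```

The mean is exactly n/m because the chain lengths always sum to n; the longest chain shows how far single slots can deviate from that average.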
Case 2: Successful Search
Analysis of Search in Hash Tables
• If m (# of slots) is proportional to n (# of
elements in the table):
• n = O(m)
• α = n/m = O(m)/m = O(1)
Searching takes constant time on average
Hash Functions
• A hash function transforms a key into a table
address
• What makes a good hash function?
(1) Easy to compute
(2) Approximates a random function: for every input,
every output is equally likely (simple uniform hashing)
• In practice, it is very hard to satisfy the simple
uniform hashing property
– i.e., we don’t know in advance the probability
distribution that keys are drawn from
Good Approaches for Hash Functions
The Division Method
• Idea:
– Map a key k into one of the m slots by taking
the remainder of k divided by m
h(k) = k mod m
• Advantage:
– fast, requires only one operation
• Disadvantage:
– Certain values of m are bad, e.g.,
• power of 2
• non-prime numbers
Example - The Division Method
• If m = 2^p, then h(k) is just the least significant p bits of k
– p = 1 ⇒ m = 2: h(k) ∈ {0, 1}, least significant 1 bit of k
– p = 2 ⇒ m = 4: h(k) ∈ {0, 1, 2, 3}, least significant 2 bits of k
• Choose m to be a prime, not close to a power of 2
[Table: keys hashed with m = 97 and m = 100; column 2: k mod 97, column 3: k mod 100]
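A short Python sketch of why the choice of m matters. The sample keys are made up for this example: three keys that share their low three bits, and four keys that end in "00".

```python
def division_hash(k: int, m: int) -> int:
    return k % m


# m = 2^p keeps only the p least significant bits of k.
p = 3
m = 2 ** p
keys = [0b10110101, 0b11100101, 0b00000101]  # all end in binary 101
low_bits = [division_hash(k, m) for k in keys]
print(low_bits)  # [5, 5, 5]

# Keys that share their last two decimal digits all collide for m = 100,
# while the prime m = 97 spreads them out.
round_keys = [100, 200, 300, 400]
mod_100 = [division_hash(k, 100) for k in round_keys]
mod_97 = [division_hash(k, 97) for k in round_keys]
print(mod_100)  # [0, 0, 0, 0]
print(mod_97)   # [3, 6, 9, 12]
```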
The Multiplication Method
Idea:
• Multiply key k by a constant A, where 0 < A < 1
• Extract the fractional part of kA
• Multiply the fractional part by m
• Take the floor of the result
h(k) = ⌊m (k A mod 1)⌋
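The four steps can be sketched in Python as below. The constant A = (√5 - 1)/2 ≈ 0.618 is an assumption for this example (a value often suggested for this method); the slides only require 0 < A < 1. Floating-point arithmetic is used for simplicity.

```python
import math

A = (math.sqrt(5) - 1) / 2  # assumed constant, about 0.6180339887


def multiplication_hash(k: int, m: int, a: float = A) -> int:
    fractional = (k * a) % 1.0          # fractional part of k*A
    return math.floor(m * fractional)   # scale to 0..m-1 and take the floor


# k*A = 76300.0041151..., fractional part 0.0041151..., times 16384 = 67.42...
print(multiplication_hash(123456, 16384))  # 67
```

Unlike the division method, the value of m is not critical here, so a power of 2 is a convenient choice.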
Universal Hashing
• In practice, keys are not randomly distributed
• Any fixed hash function might yield Θ(n) time
• Goal: hash functions that produce random
table indices irrespective of the keys
• Idea:
– Select a hash function at random, from a designed
class of functions at the beginning of the execution
Universal Hashing
Definition of Universal Hash Functions
H = {h(k): U → {0, 1, ..., m-1}}
How is this property useful?
Pr(h(x) = h(y)) ≤ 1/m for any two distinct keys x and y, when h is chosen at random from H
Universal Hashing – Main Result
Designing a Universal Class
of Hash Functions
• Choose a prime number p large enough so that every
possible key k is in the range [0 ... p – 1]
Zp = {0, 1, …, p - 1} and Zp* = {1, …, p - 1}
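• Define the following hash function
h_{a,b}(k) = ((a k + b) mod p) mod m, with a ∈ Zp* and b ∈ Zp
E.g.: p = 17, m = 6
h_{3,4}(8) = ((3·8 + 4) mod 17) mod 6
= (28 mod 17) mod 6
= 11 mod 6
= 5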
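A Python sketch of drawing one function from such a class. It assumes the standard form h_{a,b}(k) = ((a k + b) mod p) mod m with a chosen from Zp* and b from Zp; the factory name make_universal_hash is hypothetical.

```python
import random


def make_universal_hash(p: int, m: int, a=None, b=None, rng=random):
    """Return h(k) = ((a*k + b) mod p) mod m; a and b are random unless given."""
    if a is None:
        a = rng.randrange(1, p)  # a in Zp* = {1, ..., p-1}
    if b is None:
        b = rng.randrange(0, p)  # b in Zp = {0, ..., p-1}

    def h(k: int) -> int:
        return ((a * k + b) % p) % m

    return h


# Fixed a = 3, b = 4: ((3*8 + 4) mod 17) mod 6 = 11 mod 6 = 5
h = make_universal_hash(p=17, m=6, a=3, b=4)
print(h(8))  # 5

# In normal use a and b are picked once, at random, when the table is created.
h_random = make_universal_hash(p=17, m=6)
```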
Advantages of Universal Hashing
Open Addressing
• If we have enough contiguous memory to store all the keys
(m > N), store the keys in the table itself
• The hash function takes the probe number p as a second argument:
h(k,p), p=0,1,...,m-1
• Probe sequences
<h(k,0), h(k,1), ..., h(k,m-1)>
• Linear probing
• Quadratic probing
• Double hashing
[Figure: e.g., insert 14]
Linear probing: Inserting a key
• Idea: when there is a collision, check the next available
position in the table (i.e., probing)
Linear probing: Deleting a key
• Problems
– Cannot mark the slot as empty
– Impossible to retrieve keys inserted after that slot was occupied
• Solution
– Mark the slot with a sentinel value DELETED
• The deleted slot can later be used for insertion
• Searching will be able to find all the keys
[Figure: table with slots 0 to m-1]
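Insertion, search and deletion with linear probing can be sketched as below. Assumptions for this example: integer keys, distinct keys, and h(k, i) = (k mod m + i) mod m as the probe sequence. The class and constant names are made up.

```python
EMPTY = None
DELETED = object()  # sentinel marking a slot whose key was removed


class LinearProbingTable:
    def __init__(self, m: int):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key: int):
        start = key % self.m  # assumed base hash function
        for i in range(self.m):
            yield (start + i) % self.m

    def insert(self, key: int) -> int:
        for j in self._probe(key):
            if self.slots[j] is EMPTY or self.slots[j] is DELETED:
                self.slots[j] = key  # DELETED slots are reused
                return j
        raise OverflowError("hash table is full")

    def search(self, key: int):
        for j in self._probe(key):
            if self.slots[j] is EMPTY:
                return None          # a truly empty slot ends the search
            if self.slots[j] is not DELETED and self.slots[j] == key:
                return j             # DELETED slots are skipped, not a stop
        return None

    def delete(self, key: int):
        j = self.search(key)
        if j is not None:
            self.slots[j] = DELETED
        return j


t = LinearProbingTable(10)
print([t.insert(k) for k in (5, 15, 25)])  # [5, 6, 7]
t.delete(15)
print(t.search(25))  # 7, still reachable past the DELETED slot
print(t.insert(35))  # 6, the DELETED slot is reused
```

If delete stored EMPTY instead of DELETED, the search for 25 would stop at slot 6 and wrongly report the key as missing.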
Primary Clustering Problem
• Some slots become more likely than others
• Long chunks of occupied slots are created ⇒ search time increases!!
[Figure: probability that the next key is placed in a given empty slot, e.g., slot b: 2/m, slot d: 4/m, slot e: 5/m]
Quadratic probing
i=0,1,2,...
Double Hashing
(1) Use one hash function to determine the first slot
(2) Use a second hash function to determine the
increment for the probe sequence
h(k,i) = (h1(k) + i h2(k) ) mod m, i=0,1,...
• Initial probe: h1(k)
• Second probe is offset by h2(k) mod m, and so on ...
• Advantage: avoids clustering
• Disadvantage: harder to delete an element
• Can generate at most m^2 probe sequences
Double Hashing: Example
h1(k) = k mod 13
h2(k) = 1 + (k mod 11)
h(k,i) = (h1(k) + i h2(k)) mod 13
• Insert key 14:
h(14,0) = h1(14) = 14 mod 13 = 1 (slot 1 holds 79)
h(14,1) = (h1(14) + h2(14)) mod 13 = (1 + 4) mod 13 = 5 (slot 5 holds 98)
h(14,2) = (h1(14) + 2 h2(14)) mod 13 = (1 + 8) mod 13 = 9 (slot 9 is free, so 14 is stored there)
Table after the insertion:
Slot  Key
0
1     79
2
3
4     69
5     98
6
7     72
8
9     14
10
11    50
12
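The example can be replayed in Python. The starting contents of the table (79, 69, 98, 72, 50 in slots 1, 4, 5, 7, 11) are read from the slide's figure; the helper names probe and insert are made up for this sketch.

```python
def h1(k: int) -> int:
    return k % 13


def h2(k: int) -> int:
    return 1 + (k % 11)


def probe(k: int, i: int, m: int = 13) -> int:
    """i-th slot of the double-hashing probe sequence for key k."""
    return (h1(k) + i * h2(k)) % m


def insert(table, k: int) -> int:
    for i in range(len(table)):
        j = probe(k, i, len(table))
        if table[j] is None:
            table[j] = k
            return j
    raise OverflowError("hash table is full")


table = [None] * 13
for slot_index, key in {1: 79, 4: 69, 5: 98, 7: 72, 11: 50}.items():
    table[slot_index] = key

print([probe(14, i) for i in range(3)])  # [1, 5, 9]
slot = insert(table, 14)
print(slot)  # 9
```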
Analysis of Open Addressing
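• a = load factor (n/m), with a < 1
• Expected number of probes in an unsuccessful search is at most
Σ_{k=0}^{∞} a^k = 1/(1 - a)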
Analysis of Open Addressing (cont’d)
Unsuccessful retrieval:
a=0.5 E(#steps) = 2
a=0.9 E(#steps) = 10
Successful retrieval:
a=0.5 E(#steps) = 3.387
a=0.9 E(#steps) = 3.670
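The unsuccessful-retrieval figures above match the bound 1/(1 - a). A small Python check, assuming that bound (the function name is hypothetical):

```python
def expected_unsuccessful_probes(a: float) -> float:
    """Upper bound 1/(1 - a) on expected probes for load factor 0 <= a < 1."""
    if not 0 <= a < 1:
        raise ValueError("load factor must satisfy 0 <= a < 1")
    return 1 / (1 - a)


for a in (0.5, 0.9):
    print(a, round(expected_unsuccessful_probes(a), 3))
# 0.5 2.0
# 0.9 10.0
```

The bound grows quickly as a approaches 1, which is why open-addressing tables are usually kept well below full.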