
Final Hashing


Data Structure and Algorithm

Hashing
• Find items with keys matching a given search key
– Given an array A containing n keys and a search key
x, find the index i such that x = A[i]
– As in the case of sorting, a key could be part of a
large record.

2
Hashing
• The search time of each algorithm discussed so
far depends on the number n of elements in the
collection S of data.
• A searching technique that is essentially
independent of the number n is called hashing or
hash addressing. The terminology used in our
presentation of hashing is oriented toward file
management.

3
Hashing
• First of all, we assume that there is a file F of n
records with a set K of keys which uniquely
determine the records in F.
• Secondly, we assume that F is maintained in
memory by a table T of m memory locations and
that L is the set of memory addresses of the
locations in T.
• For notational convenience, we assume that the
keys in K and the addresses in L are (decimal)
integers.

4
Hashing
• The subject of hashing will be introduced by the following
example
– Suppose a company with 68 employees assigns a 4-digit employee number
to each employee which is used as the primary key in the company's
employee file. We can, in fact, use the employee number as the address of
the record in memory. The search will require no comparisons at all.
Unfortunately, this technique will require space for 10 000 memory
locations, whereas space for fewer than 70 such locations would actually be
used. Clearly, this tradeoff of space for time is not worth the expense.

5
Hashing
• The general idea of using the key to determine the address
of a record is an excellent one, but it must be modified so
that a great deal of space is not wasted. This modification
takes the form of a function H from the set K of keys into the
set L of memory addresses. Such a function,
H: K → L
is called a hash function or hashing function.
• Unfortunately, such a function H may not yield distinct
values: it is possible that two different keys k1 and k2 will
yield the same hash address.

6
Hashing
• This situation is called a collision, and some method must be
used to resolve it. Accordingly, the topic of hashing is
divided into two parts: (1) hash functions and (2) collision
resolution. We discuss these two parts separately.

7
Hash Functions
• The two principal criteria used in selecting a hash function
H: K →L are as follows:
– First of all the function H should be very easy and quick to compute.
– Second the function H should, as far as possible, uniformly distribute
the hash addresses throughout the set L so that there are a
minimum number of collisions.
– Naturally, there is no guarantee that the second condition can be
completely fulfilled without actually knowing beforehand the keys and
addresses.
– However, certain general techniques do help. One technique is to
"chop" a key k into pieces and combine the pieces in some way to
form the hash address H(k). (The term "hashing" comes from this
technique of "chopping" a key into pieces.) Such functions can be
easily and quickly evaluated by the computer.

8
Hash Functions
• Some popular hash functions are given below. (Each of
them, in effect, "chops" the key into pieces.)
– Division method.
• Choose a number m larger than the number n of keys in K. (The
number m is usually chosen to be a prime number or a number
without small divisors, since this frequently minimizes the number
of collisions.) The hash function H is defined by
H(k) = k (mod m) or H(k) = k (mod m) + 1
Here k (mod m) denotes the remainder when k is divided by m.
The second formula is used when we want the hash addresses
to range from 1 to m rather than from 0 to m- 1.

10
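As a minimal sketch of the division method (the function names are our own, not from the slides; m = 97 matches the prime used in the worked example later on):

```python
# Division method: H(k) = k mod m, with m chosen as a prime
# (97 here, as in the slides' example of 100 two-digit addresses).

def hash_division(k, m=97):
    """Hash addresses in the range 0 .. m-1."""
    return k % m

def hash_division_one_based(k, m=97):
    """Variant whose addresses range from 1 to m."""
    return k % m + 1

# Employee numbers from the example:
for k in (3205, 7148, 2345):
    print(k, "->", hash_division(k))   # 3205 -> 4, 7148 -> 67, 2345 -> 17
```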
Hash Functions

– Midsquare method.
• The key k is squared. Then the hash function H is defined by
H(k) = l
where l is obtained by deleting digits from both ends of k².
We emphasize that the same positions of k² must be used
for all of the keys.

– Folding method.
• The key k is partitioned into a number of parts, k1, k2, ..., kr,
where each part, except possibly the last, has the same number
of digits as the required address. Then the parts are added
together, ignoring the last carry:
H(k) = k1 + k2 + · · · + kr
where the leading-digit carries, if any, are ignored. Sometimes,
for extra "milling," the even-numbered parts, k2, k4, ..., are
each reversed before the addition.
11
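A sketch of both methods in Python (the helper names and digit-selection parameters are our own choices; the midsquare version keeps the fourth and fifth digits from the right, matching the worked example later in the slides):

```python
def hash_midsquare(k, lo=3, hi=5):
    """Square k, then keep the digit positions lo .. hi-1, counting
    from the right end (the same positions must be used for every key)."""
    digits = str(k * k)
    return int(digits[len(digits) - hi : len(digits) - lo])

def hash_folding(k, width=2, reverse_even=False):
    """Split the decimal key into width-digit parts (left to right),
    add the parts, and ignore any leading-digit carry (mod 10**width).
    reverse_even=True reverses every even-numbered part for extra
    "milling" before the addition."""
    s = str(k)
    parts = [s[i:i + width] for i in range(0, len(s), width)]
    total = 0
    for idx, part in enumerate(parts, start=1):
        if reverse_even and idx % 2 == 0:
            part = part[::-1]
        total += int(part)
    return total % 10 ** width

print(hash_midsquare(3205), hash_folding(7148))  # 72 19
```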
Hash Functions : Example

– Consider the company in Example 9.9, each of whose 68
employees is assigned a unique 4-digit employee number. Suppose
L consists of 100 two-digit addresses: 00, 01, 02, ..., 99. We apply
the above hash functions to each of the following employee
numbers:
3205, 7148, 2345.

– Division method.
• Choose a prime number m close to 99, such as m = 97. Then
H(3205) = 4, H(7148) = 67, H(2345) = 17
That is, dividing 3205 by 97 gives a remainder of 4, dividing 7148
by 97 gives a remainder of 67, and dividing 2345 by 97 gives a
remainder of 17.

13
Hash Functions : Example

In the case that the memory addresses begin with 01 rather than
00, we choose the function H(k) = k (mod m) + 1 to obtain:
H(3205) = 4 + 1 = 5, H(7148) = 67 + 1 = 68, H(2345) = 17 + 1 = 18.

– Midsquare method.
• The following calculations are performed:
k:    3205        7148        2345
k²:   10 272 025  51 093 904  5 499 025
H(k): 72          93          99

Observe that the fourth and fifth digits, counting from the right, are
chosen for the hash address.

14
Hash Functions : Example

– Folding method.
• Chopping the key k into two parts and adding yields the following
hash addresses:
H(3205) = 32 + 05 = 37, H(7148) = 71 + 48 = 19,
H(2345) = 23 + 45 = 68

Observe that the leading digit 1 in H(7148) is ignored. Alternatively,
one may want to reverse the second part before adding, thus
producing the following hash addresses:
H(3205) = 32 + 50 = 82, H(7148) = 71 + 84 = 55,
H(2345) = 23 + 54 = 77.

15
Hash Tables
• When K is much smaller than U, a hash table
requires much less space than a direct-address
table
– Can reduce storage requirements to |K|
– Can still get O(1) search time, but in the average
case, not the worst case

16
Hash Tables
Idea:
– Use a function h to compute the slot for each key
– Store the element in slot h(k)

• A hash function h transforms a key into an index in a
hash table T[0…m-1]:
h : U → {0, 1, . . . , m - 1}
• We say that k hashes to slot h(k)
• Advantages:
– Reduce the range of array indices handled: m instead of |U|
– Storage is also reduced

17
Example: HASH TABLES
[Figure: universe of keys U with actual keys K = {k1, ..., k5}
mapped by h into table slots 0 .. m-1; note that h(k2) = h(k5)]
18
Do you see any problems
with this approach?
[Figure: same mapping as above; keys k2 and k5 map to the same
slot since h(k2) = h(k5): Collisions!]
19
Collisions
• Two or more keys hash to the same slot!!
• For a given set K of keys
– If |K| ≤ m, collisions may or may not happen,
depending on the hash function
– If |K| > m, collisions will definitely happen (i.e., there
must be at least two keys that have the same hash
value)
• Avoiding collisions completely is hard, even with
a good hash function

20
Collision Resolution
• Suppose we want to add a new record R with
key k to our file F, but suppose the memory
location with address H(k) is already occupied.
This situation is called a collision.
• This subsection discusses two general ways of
resolving collisions.

21
Collision Resolution
• The particular procedure that one chooses
depends on many factors.
• One important factor is the ratio of the number n
of keys in K (which is the number of records in
F) to the number m of hash addresses in L. This
ratio, λ = n/m, is called the load factor.

22
Handling Collisions
• We use the following methods:
– Chaining
– Open addressing
• Linear probing
• Quadratic probing
• Double hashing

• We discuss chaining first.

23
Handling Collisions Using Chaining
• Idea:
– Put all elements that hash to the same slot into a
linked list

– Slot j contains a pointer to the head of the list of all
elements that hash to j
24
Collision with Chaining - Discussion
• Choosing the size of the table
– Small enough not to waste space
– Large enough such that lists remain short
– Typically 1/5 or 1/10 of the total number of elements

• How should we keep the lists: ordered or not?
– Not ordered!
• Insert is fast
• Can easily remove the most recently inserted elements

25
Insertion in Hash Tables
Alg.: CHAINED-HASH-INSERT(T, x)
insert x at the head of list T[h(key[x])]

• Worst-case running time is O(1)


• Assumes that the element being inserted isn’t
already in the list
• It would take an additional search to check if it
was already inserted
26
Deletion in Hash Tables
Alg.: CHAINED-HASH-DELETE(T, x)
delete x from the list T[h(key[x])]

• Need to find the element to be deleted.


• Worst-case running time:
– Deletion depends on searching the corresponding list

27
Searching in Hash Tables
Alg.: CHAINED-HASH-SEARCH(T, k)

search for an element with key k in list T[h(k)]

• Running time is proportional to the length of the
list of elements in slot h(k)

28
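The three CHAINED-HASH operations above can be sketched together in Python (a minimal illustration; the class name, the use of Python lists as chains, and the division-method slot function are our own choices, not from the slides):

```python
class ChainedHashTable:
    """Hashing with chaining: slot j holds the list of all (key, value)
    pairs whose keys hash to j."""

    def __init__(self, m=97):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def _slot(self, key):
        return key % self.m          # division-method slot function

    def insert(self, key, value):
        # CHAINED-HASH-INSERT: place the element at the head of the chain.
        self.slots[self._slot(key)].insert(0, (key, value))

    def search(self, key):
        # CHAINED-HASH-SEARCH: scan the chain in slot h(key).
        for k, v in self.slots[self._slot(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        # CHAINED-HASH-DELETE: locate the element, then unlink it.
        chain = self.slots[self._slot(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False

t = ChainedHashTable()
t.insert(3205, "record A")
t.insert(7148, "record B")
print(t.search(3205))   # record A
```

Insertion at the head of the chain is O(1), as the slides note; search and deletion must scan the chain in slot h(k).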
Analysis of Hashing with Chaining:
Worst Case
• How long does it take to
search for an element with a
given key?
• Worst case:
– All n keys hash to the same slot
– Worst-case time to search is
Θ(n), plus time to compute the
hash function
[Figure: table T with a single chain holding all n keys]
29
Analysis of Hashing with Chaining:
Average Case
• Average case
– depends on how well the hash function
distributes the n keys among the m slots
• Simple uniform hashing assumption:
– Any given element is equally likely to hash
into any of the m slots (i.e., the probability of
collision Pr(h(x) = h(y)) is 1/m)
• Length of the list at slot j: nj, for j = 0, 1, . . . , m - 1
• Number of keys in the table:
n = n0 + n1 + · · · + nm-1
• Average value of nj:
E[nj] = α = n/m
[Figure: table T with chains of lengths n0 = 0, n2, n3, ..., nm-1 = 0]
30
Load Factor of a Hash Table
• Load factor of a hash table T:
α = n/m
– n = # of elements stored in the table
– m = # of slots in the table = # of linked lists
• α encodes the average number of
elements stored in a chain
• α can be <, =, > 1
[Figure: table T, slots 0 .. m-1, each nonempty slot pointing to a chain]

31
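A quick empirical illustration of α as the average chain length, under the simple-uniform-hashing idealization (the values of n and m are arbitrary choices for the demonstration, not from the slides):

```python
import random

m, n = 100, 1000                 # slots and keys (illustrative values)
chain_lengths = [0] * m
for _ in range(n):
    # Idealized simple uniform hashing: each key picks a slot uniformly.
    chain_lengths[random.randrange(m)] += 1

alpha = n / m
average = sum(chain_lengths) / m
# The chain lengths always sum to n, so the average is exactly alpha:
print(average, alpha)            # both are 10.0
```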
Analysis of Search in Hash Tables
• If m (# of slots) is proportional to n (# of
elements in the table):

• n = O(m)

• α = n/m = O(m)/m = O(1)

⇒ Searching takes constant time on average

32
Open Addressing
• If we have enough contiguous memory to store all the keys
(m > n) ⇒ store the keys in the table itself
• No need to use linked lists anymore
• Basic idea:
– Insertion: if a slot is full, try another one,
until you find an empty one
– Search: follow the same sequence of probes
– Deletion: more difficult ... (we'll see why)
• Search time depends on the length of the
probe sequence!
33
Generalize hash function notation:
• A hash function contains two arguments now:
(i) Key value, and (ii) Probe number
h(k,p), p = 0,1,...,m-1
• Probe sequences
<h(k,0), h(k,1), ..., h(k,m-1)>
– Must be a permutation of <0,1,...,m-1>
– There are m! possible permutations
– Good hash functions should be able to
produce all m! probe sequences
(example probe sequence: <1, 5, 9>)
34
Common Open Addressing Methods
• Linear probing
• Quadratic probing
• Double hashing

• Note: None of these methods can generate
more than m² different probing sequences!

35
Linear probing: Inserting a key
• Idea: when there is a collision, check the next available
position in the table (i.e., probing)
h(k,i) = (h1(k) + i) mod m
i = 0,1,2,...
• First slot probed: h1(k)
• Second slot probed: h1(k) + 1
• Third slot probed: h1(k) + 2, and so on

probe sequence: < h1(k), h1(k)+1, h1(k)+2, .... >

• Can generate at most m probe sequences. Why?
(each sequence wraps around to the start of the table and is
completely determined by its first slot h1(k))
36
Linear probing: Searching for a key
• Three cases:
(1) Position in table is occupied with an
element of equal key
(2) Position in table is empty
(3) Position in table is occupied with a
different element
• Case 2: probe the next higher index
until the element is found or an
empty position is found
• The process wraps around to the
beginning of the table
[Figure: table with keys placed at h(k1), h(k3), h(k4), and the
colliding keys k2, k5 at and after slot h(k2) = h(k5)]
37
Linear probing: Deleting a key
• Problems
– Cannot mark the slot as empty
– Impossible to retrieve keys inserted
after that slot was occupied
• Solution
– Mark the slot with a sentinel value
DELETED
• The deleted slot can later be used
for insertion
• Searching will be able to find all the
keys
38
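The insert/search/delete behavior above can be sketched in Python (a minimal illustration; the sentinel objects, the table size, and the choice h1(k) = k mod m are our own, not from the slides):

```python
EMPTY, DELETED = object(), object()   # sentinel values for slot states

class LinearProbingTable:
    """Open addressing with linear probing: h(k, i) = (h1(k) + i) mod m.
    Deleting marks the slot DELETED so later searches probe past it."""

    def __init__(self, m=13):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key):
        h1 = key % self.m
        for i in range(self.m):
            yield (h1 + i) % self.m   # wraps around to the table start

    def insert(self, key):
        for j in self._probe(key):
            if self.slots[j] is EMPTY or self.slots[j] is DELETED:
                self.slots[j] = key   # DELETED slots are reused on insertion
                return j
        raise RuntimeError("table full")

    def search(self, key):
        for j in self._probe(key):
            if self.slots[j] is EMPTY:    # truly empty slot stops the search
                return None
            if self.slots[j] == key:      # DELETED slots are probed past
                return j
        return None

    def delete(self, key):
        j = self.search(key)
        if j is not None:
            self.slots[j] = DELETED
        return j

t = LinearProbingTable(13)
print(t.insert(1), t.insert(14), t.insert(27))  # 1 2 3: all collide at slot 1
```

Note how, after deleting 14 from slot 2, a search for 27 still succeeds because the DELETED sentinel does not stop the probe sequence.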
Quadratic probing

h(k,i) = (h1(k) + c1·i + c2·i²) mod m
i = 0,1,2,...
(c1 and c2 are auxiliary constants, c2 ≠ 0)
39
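The slide leaves the probe formula to a figure; a common form is h(k,i) = (h1(k) + c1·i + c2·i²) mod m. A minimal sketch, assuming the frequent choice c1 = c2 = 1/2 with m a power of two (the function name, h1(k) = k mod m, and the constants are our own illustrative choices):

```python
def quadratic_probe_sequence(key, m):
    """Probe sequence h(k, i) = (h1(k) + c1*i + c2*i*i) mod m with
    c1 = c2 = 1/2, so the offset at step i is the triangular number
    (i + i*i) // 2. With m a power of two, the sequence visits every
    slot exactly once."""
    h1 = key % m
    return [(h1 + (i + i * i) // 2) % m for i in range(m)]

seq = quadratic_probe_sequence(5, 8)
print(seq[0], sorted(seq) == list(range(8)))  # 5 True: starts at h1(5),
                                              # and is a permutation of 0..7
```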
Double Hashing
(1) Use one hash function to determine the first slot
(2) Use a second hash function to determine the
increment for the probe sequence
h(k,i) = (h1(k) + i·h2(k)) mod m, i = 0,1,...
• Initial probe: h1(k)
• Second probe is offset by h2(k) mod m, and so on ...
• Advantage: avoids clustering
• Disadvantage: harder to delete an element
• Can generate m² probe sequences maximum

40
Double Hashing: Example
h1(k) = k mod 13
h2(k) = 1 + (k mod 11)
h(k,i) = (h1(k) + i·h2(k)) mod 13

• Insert key 14:
h(14,0) = h1(14) = 14 mod 13 = 1        (occupied by 79)
h(14,1) = (h1(14) + h2(14)) mod 13
        = (1 + 4) mod 13 = 5            (occupied by 98)
h(14,2) = (h1(14) + 2·h2(14)) mod 13
        = (1 + 8) mod 13 = 9            (empty: insert 14 here)

Table after insertion: slot 1: 79, slot 4: 69, slot 5: 98,
slot 7: 72, slot 9: 14, slot 11: 50 (all other slots empty).

41
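The steps of the example can be checked in Python (a minimal sketch; representing the table as a dict of occupied slots, and the function names, are our own choices):

```python
def h1(k):
    return k % 13

def h2(k):
    return 1 + (k % 11)

def probe(k, i, m=13):
    # h(k, i) = (h1(k) + i * h2(k)) mod m
    return (h1(k) + i * h2(k)) % m

# Occupied slots from the example, as a dict mapping slot -> key:
table = {1: 79, 4: 69, 5: 98, 7: 72, 11: 50}

def insert(table, k, m=13):
    for i in range(m):
        j = probe(k, i, m)
        if j not in table:        # first empty slot along the probe sequence
            table[j] = k
            return j
    raise RuntimeError("table full")

print(insert(table, 14))  # probes slots 1 (79), 5 (98), then inserts at 9
```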
