Nothing Special   »   [go: up one dir, main page]

DBMS - Unit 4

Download as pdf or txt
Download as pdf or txt
You are on page 1of 52

MAILAM ENGINEERING COLLEGE, MAILAM

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


UNIT IV
IMPLEMENTATION TECHNIQUES : RAID – File Organization – Organization of
Records in Files – Indexing and Hashing –Ordered Indices – B+ tree Index
Files – B tree Index Files – Static Hashing – Dynamic Hashing – Query
Processing Overview – Algorithms for SELECT and JOIN operations – Query
optimization using Heuristics and Cost Estimation.

1. What are the advantages and disadvantages of indexed sequential file? MAY-
2011
The advantage of ordering records in a sequential file according to a key is that you
can then search the file more quickly. If you know the key value that you want, you
can use one of the relatively fast searches.
The disadvantage is that when you insert, you need to rewrite at least everything
after the insertion point, which makes inserts very expensive unless they are done at
the end of the file.
2. What is database tuning? MAY-2011
Database tuning describes a group of activities used to optimize and
homogenize the performance of a database. It usually overlaps with query tuning, but
refers to design of the database files, selection of the database management system
(DBMS), operating system and CPU the DBMS runs on.
3. Give the measures of quality of a disk.
 Capacity
 Access time
 Seek time
 Data transfer rate
 Reliability
 Rotational latency time.
4. Compare sequential access devices versus random access devices with an
example.
Sequential access devices Random access devices
Must be accessed from the It is possible to read data
beginning from any location

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 2


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Ex: tape storage Ex: tape storage
Access to data is much slower Access to data is faster
Cheaper than disk Expensive when compared
with disk

5. What are the types of storage devices?


 Primary storage
 Secondary storage
 Tertiary storage
6. Draw the storage device hierarchy according to their speed and their cost.
 Cache
 Main memory
 Flash memory
 Magnetic disk
 Optical disk
 Magnetic tapes
7. What are called jukebox systems?
Jukebox systems contain a few d rives and numerous disks that can be
loaded into one of the drives automatically.
8. What is called remapping of bad sectors?
If the controller detects that a sector is damaged when the disk is initially
formatted, or when an attempt is made to write the sector, it can logically map
the sector to a different physical location.
9. Define access time.
Access time is the time from when a read or write request is issued to
when data transfer begins.
10. Define seek time.
The time for repositioning the arm is called the seek time and it
increases with the distance that the arm is called the seek time.
11. Define average seek time.
The average seek time is the average of the seek times, measured over a
sequence of random requests.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 3


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
12. Define rotational latency time.
The time spent waiting for the sector to be accessed to appear under the
head is called the rotational latency time.
13. Define average latency time.
The average latency time of the disk is one-half the time for a full
rotation of the disk.
14. What is meant by data-transfer rate?
The data-transfer rate is the rate at which data can be retrieved from or
stored to the disk.
15. What is meant by mean time to failure?
The mean time to failure is the amount of time that the system could run
continuously without failure.
16. What are a block and a block number?
A block is a contiguous sequence of sectors from a single track of one
platter. Each request specifies the address on the disk to be referenced. That
address is in the form of a block number.
17. What are called journaling file systems?
File systems that support log disks are called journaling file systems.
18. What is the use of RAID? (MAY/June 2013)
A variety of disk-organization techniques, collectively called redundant
arrays of independent disks are used to improve the performance and
reliability.
19. Explain how reliability can be improved through redundancy?
 The simplest approach to introducing redundancy is to duplicate every
disk. This technique is called mirroring or shadowing. A logical disk then
consists of two physical disks, and write is carried out on both the disk.

 If one of the disks fails the data can be read from the other. Data will be
lost if the second disk fails before the first failed disk is repaired.
20. What is called mirroring?
The simplest approach to introducing redundancy is to duplicate every
disk. This technique is called mirroring or shadowing.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 4


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
21. What is called mean time to repair?
The mean time to failure is the time it takes to replace a failed disk and
to restore the data on it.
22. What is called bit-level striping?
Data striping consists of splitting the bits of each byte across multiple
disks. This is called bit-level striping.
23. What is called block-level striping?
Block level striping stripes blocks across multiple disks. It treats the array of
disks as a large disk, and gives blocks logical numbers.
24. What are the two main goals of parallelism?
Load –balance multiple small accesses, so that the throughput of such
accesses increases. Parallelize large accesses so that the response time of large
accesses is reduced
25. What are the factors to be taken into account when choosing a RAID level?
 Monetary cost of extra disk storage requirements.
 Performance requirements in terms of number of I/O operations
 Performance when a disk has failed.
 Performances during rebuild.
26. What is meant by software and hardware RAID systems?
RAID can be implemented with no change at the hardware level, using
only software modification. Such RAID implementations are called software
RAID systems and the systems with special hardware support are called
hardware RAID systems.
27. Define hot swapping?
Hot swapping permits the removal of faulty disks and replaces it by new
ones without turning power off. Hot swapping reduces the mean time to repair.
28. Which level of RAID is best? Why?
RAID level 1 is the RAID level of choice for many applications with
moderate storage requirements and high I/O requirements. RAID 1 follows
mirroring and provides best write performance.
29. Distinguish between fixed length records and variable length records?
Fixed length records
Every record has the same fields and field lengths are fixed.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 5


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Variable length records
File records are of same type but one or more of the fields are of varying
size.
30. What are the ways in which the variable-length records arise in database
Systems?
 Storage of multiple record types in a file.
 Record types that allow variable lengths for one or more fields.
 Record types that allow repeating fields.
31. Explain the use of variable length records.
 They are used for Storing of multiple record types in a file.
 Used for storing records that has varying lengths for one or more
fields.
 Used for storing records that allow repeating fields
32. What is the use of a slotted-page structure and what is the information
present in the header?
The slotted-page structure is used for organizing records within a single block. The
header contains the following information.
 The number of record entries in the header.
 The end of free space
 An array whose entries contain the location and size of each record.
33. What are the two types of blocks in the fixed –length representation?
Define them.
• Anchor block: Contains the first record of a chain.
• Overflow block: Contains the records other than those that are the
first record of a chain.
34. What is known as heap file organization?
In the heap file organization, any record can be placed anywhere in the file
where there is space for the record. There is no ordering of records. There is a
single file for relation.
35. What is known as sequential file organization?
In the sequential file organization, the records are stored in sequential
order, according to the value of a “search key” of each record.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 6


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
36. What is hashing file organization?
In the hashing file organization, a hash function is computed on some
attribute of each record. The result of the hash function specifies in which
block of the file the record should be placed.
37. What is known as clustering file organization?
In the clustering file organization, records of several different relations
are stored in the same file.
38. What is an index?
An index is a structure that helps to locate desired records of a relation
quickly, without examining all records.
39. What are the two types of ordered indices? DEC-2009
 Primary index
 Secondary index
 Clustered index
40. What are the types of indices?
 Ordered indices – Search keys are stored in sorted order.
 Hash indices – Search keys are distributed uniformly across
buckets using hash function.
41. What are the techniques to be evaluated for both ordered indexing and
hashing?
 Access types
 Access time
 Insertion time
 Deletion time
 Space overhead
42. What is known as a search key?
An attribute or set of attributes used to look up records in a file is called
a search key.
43. What is a primary index?
A primary index is an index whose search key also defines the sequential
order of the file.
44. What are called index-sequential files?
The files that are ordered sequentially with a primary index on the
search key are called index-sequential files.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 7


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
45. What are the two types of indices?
 Dense index
 Sparse index
46. What are called multilevel indices?
Indices with two or more levels are called multilevel indices.
47. What are called secondary indices?
Indices whose search key specifies an order different from sequential
order of the file are called secondary indices. The pointers in secondary index
do not point directly to the file. Instead each points to a bucket that contains
pointers to the file.
48. What are the disadvantages of index sequential files?
The main disadvantage of the index sequential file organization is that
performance degrades as the file grows. This degradation is remedied by
reorganization of the file.
49. What is a B+ Tree index?
A B+ Tree index takes the form of a balanced tree in which every path
from the root of the root of the root of the tree to a leaf of the tree is of the same
length.
50. What is B-Tree?
A B-tree eliminates the redundant storage of search-key values .It allows
search key values to appear only once.
51. What is hashing?
Hashing allows us to find the address of a data item directly by
computing a hash function on the search key value of the desired record.
52. How do you create index in SQL?
We create index by he create index command.
Create index <index name> on <relation name> (<attribute list>)
53. Distinguish between static hashing and dynamic hashing?
Static hashing
Static hashing uses a hash function in which the set of bucket adders is
fixed. Such hash functions cannot easily accommodate databases that grow
larger over time.
Dynamic hashing

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 8


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Dynamic hashing allows us to modify the hash function dynamically.
Dynamic hashing copes with changes in database size by splitting and
coalescing buckets as the database grows and shrinks.
54. What is a hash index?
A hash index organizes the search keys, with their associated pointers,
into a hash file structure.
55. What can be done to reduce the occurrences of bucket overflows in a
hash file organization?
To reduce bucket overflow the number of bucket is chosen to be
(nr/fr)*(1+d).
We handle bucket overflow by using
• Overflow chaining (closed hashing)
• Open hashing
56. Differentiate open hashing and closed hashing (overflow chaining)
Closed hashing (overflow chaining)
If a record must be inserted in to a bucket b, and b is already
full, the system provides an overflow bucket for b, and inserts the record in
to the overflow bucket. If the overflow bucket is also full, the system provides
another overflow bucket, and so on. All the overflow buckets of a given
buckets are chained together in a linked list, overflow handling using
linked list is known as closed hashing.
Open hashing
The set of buckets is fixed, and there are no overflow chains. Instead, if a
bucket is full, the system inserts records in some other bucket in the initial set
of buckets.
57. What is linear probing?
Linear probing is a type of open hashing. If a bucket is full the system
inserts records in to the next bucket that has space. This is known as linear
probing.
58. What is called query processing?
Query processing refers to the range of activities involved in extracting
data from a database.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 9


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
59. What are the steps involved in query processing?
The basic steps are:
 Parsing and translation
 Optimization
 Evaluation
60. What is called an evaluation primitive?
A relational algebra operation annotated with instructions on how to
evaluate is called an evaluation primitive.
61. Define query optimization.
Query optimization refers to the process of finding g the lowest –cost
method of evaluating a given query.
62. What is called a query –execution engine?
The query execution engine takes a query evaluation plan, executes that
plan, and returns the answers to the query.
63. How do you measure the cost of query evaluation?
The cost of a query evaluation is measured in terms of a number of
different resources including disk accesses, CPU time to execute a query, and
in a distributed database system the cost of communication
64. List out the operations involved in query processing
 Selection operation
 Join operations.
 Sorting.
 Projection
 Set operations
 Aggregation
65. What are called as index scans?
Search algorithms that use an index are referred to as index scans.
66. What is called as external sorting?
Sorting of relations that do not fit into memory is called as external sorting.
67. Explain nested loop join?
Nested loop join consists of a pair of nested for loops.
Example: r| | s
T s r is the outer relation and s is the inner relation.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 10


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
68. What is meant by block nested loop join?
Block nested loop join is the variant of the nested loop join where every
block of the inner relation is paired with ever y block of the outer
relation. Within each pair of blocks ever y tuple in one block is paired
with every tuple in the other blocks to generate all pairs of tuples.
69. What is meant by hash join?
In the hash join algorithm a hash function h is used to implement
partition tuples of both relations.
70. What is called as recursive partitioning?
The system repeats the splitting of the input until each partition of the
build input fits in the memory. Such partitioning is called recursive
partitioning.
71. What is called as an N-way merge?
The merge operation is a generalization of the two-way merge used by the
standard in-memory sort-merge algorithm. It merges N runs, so it is called an
N-way merge.
72. What is known as fudge factor?
The number of partitions is increased by a small value called the fudge
factor, which is usually 20 percent of the number of hash partitions computed.
73. What are ordered indices? DEC-2011
Ordered Indices is a types of index in which search keys are stored in
sorted order.
74. Distinguish between sparse index and dense index. DEC-2011,Apr/may
2019
Dense Index:
 An index record appears for every search key value in file.
 This record contains search key value and a pointer to the actual record.
Sparse Index:
 Index records are created only for some of the records.
 To locate a record, we find the index record with the largest search key
value less than or equal to the search key value we are looking for.
 We start at that record pointed to by the index record, and proceed along
the pointers in the file (that is, sequentially) until we find the desired

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 11


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
record.
75. What is the difference between primary index and secondary index. MAY-
2010
Primary index:
 In a sequentially ordered file, the index whose search key specifies
the sequential order of the file. Also called clustering index.
 The search key of a primary index is usually but not necessarily
the primary key. An index structure that is defined on the ordering
field
Secondary index:
 An index whose search key specifies an order different from the
sequential order of the file. Also called non-clustering index.
76. Define the terms fragmentation and replication, in terms of where data is
stored. DEC-2008
Fragmentation:
The system partitions the relation into several fragments and stores
each fragments at a different site.
Replication:
The system maintains several identical copies of the relation and stores
each replica at a different site.
77. What are structured data types? What are collection types, in particular?
DEC-2008
Sets are instances of collection types. Other instances of collection types
include arrays and multi sets.

78. Which are the factors to be considered for the evaluation of indexing and
hashing techniques? DEC-2010
Refer Part A - Q. No. 41
80. What is the drawback of flash memory? DEC-2010
Disadvantages:
 Flash memory cells have a limited number of write and erase cycles before
failing.
 Most flash drives do not have a write-protection mechanism

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 12


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
 Smaller size devices, such as flash drives make them easier to lose
 Currently costs a lot more per gigabyte than traditional hard drives for large
storage capacities.
 May require a special version of a program to run on a flash-based drive to
protect from prematurely wearing out the drive.
81. Give a diagrammatic representation to indicate the basic steps in query
processing. DEC-2002

82. How to choose the best evaluation plan for a query? Explain. MAY-2003
The optimizer should choose the access path that retrieves the fewest records in the
most efficient way by estimating the different cost and choosing the methods with the
least estimated cost.
83. Draw the structure of a typical node of a B+ tree and explain briefly. DEC-
2003 (Nov/Dec 2019)
B+ tree considers all the keys in nodes except the leaves as dummies. All keys are
duplicated in the leaves. This has the advantage that is all the leaves are linked
together sequentially, the entire tree may be scanned without visiting the higher
nodes at all.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 13


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

B+ Tree Structure

• A B + -Tree consists of one or more blocks of data, called nodes, linked


together by pointers. The B + -Tree is a tree structure. The tree has a single
node at the top, called the root node. The root node points to two or more
blocks , called child nodes. Each child nodes points to further child nodes and
so on.
• The B + -Tree consists of two types of (1) internal nodes and (2) leaf nodes:
• Internal nodes point to other nodes in the tree.
• Leaf nodes point to data in the database using data pointers. Leaf nodes also
Contain an additional pointer, called the sibling pointer, which is used to
improve the efficiency of certain types of search.
85. What can be done to reduce the occurrences of bucket overflows in a hash
file organization? DEC-2007
To reduce the occurrence of overflows
1. Choose the hash function more carefully, and make better estimation of the
relation size.
2. If the estimated size of the relation is n, and the number of records per block
is fr, allocate (n/fr)*(1+d) buckets instead of (nr/fr) buckets.
Here d is a fudge factor, typically around 0.2. Some space is wasted.
86. How does a B tree differ from B+ tree? Why is a B+ tree usually preferred as
an access structure to a data file. DEC-2008
A B tree of order m (maximum number of children for each node) is a tree
which satisfies:
1.Every node has at most m children.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 14


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
2.Every node (except root) has at least m⁄2 children.
3.The root has at least two children if it is not a leaf node.
4.All leaves appear in the same level, and carry information.
5.A non-leaf node with k children contains k−1 keys.
In a B+ tree, in contrast to a B-tree, all records are stored at the leaf level of
the tree, only keys are stored in interior nodes.
Advantage of B+ trees over B trees is they allow you to in pack more
pointers to other nodes by removing pointers to data, thus increasing the fan
out and thus decreasing the depth of the tree.
In B tree we have to traverse the whole tree structure to get result while
it is easier to search in B+ tree as the leaves form a linked list.
Since the internal nodes in B+ tree contains only pointers(index) to the key
values it very useful in implementing Indexed Sequential files.
We can traverse easily in B+ tree using the indices to reach the required key.

87. Describe flash memory. MAY-2010


Flash memory is an electronic non-volatile computer storage device that
can be electrically erased and reprogrammed.
Flash memory was developed from EEPROM (electrically erasable
programmable read-only memory). There are two main types of flash memory, which
are named after the NAND and NOR logic gates.
88. Mention all the operation of files(Apr/may 2019)
Various file operation are 1. Creation of file 2. Insertion of data 3. Deletion of
data 4.Searching desired data from a file
89.Which cost Components Contribute to Query Execution?
The cost of query evaluation can be measured in terms of a number of different
resources,
 disk accesses
 CPU time to execute a query
 cost of communication
 number of block transfers from disk

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 15


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
PART-B
1. What is Raid? List the different levels in RAID technology and explain
its features. MAY-2011
What is RAID? Briefly explain different levels of RAID. Discuss the factors to
be considered in choosing a RAID level. DEC-2013
Explain the concept of RAID. NOV 2016,Apr/may 2019
Explain what a RAID system is. How does it improve performance and
reliability. Discuss the level 3 and level 4 of the RAID. MAY 2017
Key points:

 Introduction
 RAID Levels
o LEVEL 0 – non redundant striping
o LEVEL 1 – mirrored disk
o LEVEL 2 – memory style error correcting code
o LEVEL 3 – bit interleaved parity
o LEVEL 4 – block interleaved parity
o LEVEL 5 – block interleaved distributed parity
o LEVEL 6 – P + Q redundancy
 Choices of RAID Level

Introduction:
o RAID - Redundant Array of Independent Disks are used to improve
the performance and reliability of disks.
o The main features of RAID are,
 High capacity and high speed by using multiple disks in parallel

 High reliability by storing data redundantly for data recovery under disk
failures.
 Improvement of Reliability via Redundancy
 Redundancy - store extra information that can be used to rebuild
information lost inadisk failure. This can be achieved through Mirroring
(or shadowing).

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 16


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
 Mirroring duplicates every disk. Logical disk consists of two physical
disks.
Every write is carried out on both disks
 Reads can take place from either disk
 If one disk in a pair fails, data is still available from the other
 Data loss would occur only if a disk fails, and its mirror disk. also fails
before the system is repaired. Probability of combined event is very
small.
 Mean time to data loss depends on mean time to failure, and mean time
to repair
RAID Levels:
 LEVEL 0 – non redundant striping
 LEVEL 1 – mirrored disk
 LEVEL 2 – memory style error correcting code
 LEVEL 3 – bit interleaved parity
 LEVEL 4 – block interleaved parity
 LEVEL 5 – block interleaved distributed parity
 LEVEL 6 – P + Q redundancy
In the figure, P indicates error-correcting bits, and C indicates a second copy of
the data.
LEVEL 0:
o RAID level 0 refers to disk arrays with striping at the level of blocks, but
without any redundancy (such as mirroring or parity bits).
o Figure shows an array of size 4.

o Lowest cost of all RAID organization


o Improves the speed and performance of the storage device.
o There is no parity and backup .

LEVEL 1:

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 17


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
o RAID level 1 refers to disk mirroring with block striping and 100% redundancy
in case of a failure.
o Figure Shows a mirrored organization that holds four disks worth of data.
o Used where availability and transaction rates are more important than storage
efficiency

LEVEL 2:
o RAID level 2, known as memory-style error-correcting-code (ECC)
organization, employs parity bits.
o Memory systems have long used parity bits for error detection and correction.
o Storage efficiency increases as the number of data disks increases
o Each byte in a memory system may have a parity bit associated with it that
records whether the numbers of bits in the byte that are set to 1 is even (parity
= 0) or odd (parity = 1).
o If one of the bits in the byte gets damaged (either a 1 becomes a 0, or a 0
becomes a 1), the parity of the byte changes and thus will not match the stored
parity.
o Similarly, if the stored parity bit gets damaged, it will not match the computed
parity.
o Thus, all 1-bit errors will be detected by the memory system.
o Error-correcting schemes store 2 or more extra bits, and can reconstruct the
data if a single bit gets damaged.
o The idea of error-correcting codes can be used directly in disk arrays by
striping bytes across disks.
o For example, the first bit of each byte could be stored in disk 1, the second bit
in disk 2, and so on until the eighth bit is stored in disk 8, and the error-
correction bits are stored in further disks.
o The disks labeled P stores the error correction bits.
o If one of the disks fails, the remaining bits of the byte and the associated error-
correction bits can be read from other disks, and can be used to reconstruct
the damaged data.
o Figure shows an array of size 4.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 18


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

LEVEL 3:
o RAID level 3, bit-interleaved parity organization, improves on level 2 by
exploiting the fact that disk controllers, unlike memory systems, can detect
whether a sector has been read correctly, so a single parity bit can be used for
error correction, as well as for detection.
o RAID 3 stripes the data onto multiple disks. The parity bit generated for data
word
is stored on a different disk.
o Data is interleaved bit-wise over the data disks and a single parity disk is used
and hence overcomes single disk failures.
o Each read operation will access all data disks and a write operation will access
all
the data disks and the single parity disk .
o To recover data in a failed disk, the XOR of bits from other disks including
pari~
bit disk is computed.
o If one of the sectors gets damaged, the system knows exactly which sector it is,
and, for each bit in the sector, the system can figure out whether it s a 1 or a 0
by computing the parity of the corresponding bits from sectors in the other
disks.
o If the parity of the remaining bits is equal to the stored parity, the missing bit is
0; otherwise, it is 1.

Benefits:
o Compared to Level 2 - it is less expensive in the number of extra disks (it has only
a one-disk overhead).
o Compared to Level 1 – It needs only one parity disk for several regular disks
o The transfer rate for reading or writing a single block is N times faster than a
RAID level 1 organization using N-way striping.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 19


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
o RAID level 3 supports a lower number of I/O operations per second, since every
disk has to participate in every I/O request.
LEVEL 4:
o RAID level 4, block-interleaved parity organization, uses block level striping and
in addition keeps a parity block on a separate disk for corresponding blocks from
N other disks.
o In this level, an entire block of data is written onto data disks and then the parity
is generated and stored on a different disk.
o This level uses block-level striping. The size of the blocks are called striping units.
o If one of the disks fails, the parity block can be used with the corresponding
blocks from the other disks to restore the blocks of the failed disk. A block read
accesses only one disk, allowing other requests to be processed by the other disks.
o Thus, the data-transfer rate for each access is slower, but multiple read accesses
can proceed in parallel, leading to a higher overall I/O rate.
o The transfer rates for large reads is high, since all the disks can be read in
parallel; large writes also have high transfer rates, since the data and parity can
be written in parallel.
o Small independent writes, on the other hand, cannot be performed in parallel. A
single write requires four disk accesses: two to read the two old blocks, and two to
write the two blocks.

LEVEL 5:
o RAID level 5, block-interleaved distributed parity, improves on level 4 by
partitioning data and parity among all N + 1 disks, instead of storing data in N
disks and parity in one disk.
o This level is called the block interleaved distributed parity disk array . Writes
whole data blocks onto different disks .
o Parity bits generated for data block stripe are distributed among all the data disks
o rather than storing them on a different dedicated disk.
o Provides best small read, large read and large write performance
o In level 5, all disks can participate in satisfying read requests.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 20


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
o For each set of N logical blocks, one of the disks stores the parity, and the other N
disks store the blocks.

o The P’s are distributed across all the disks.


o For example, with an array of 5 disks, the parity block, labeled Pk, for logical
blocks 4k, 4k+1, 4k+2, 4k+3 is stored in disk (k mod 5)+1; the corresponding
blocks of the other four disks store the 4 data blocks 4k to 4k + 3.
o The following table indicates how the first 20 blocks, numbered 0 to 19, and their
parity blocks are laid out.

Benefit :
It offers better read–write performance at the same cost than level 4.
LEVEL 6:
o RAID level 6, the P + Q redundancy scheme, is much like RAID level 5, but
stores extra redundant information to guard against multiple disk failures.
o In this level, two independent parities (P and Q) are generated and stored in
distributed fashion among multiple disks. Two parities provide additional fault
tolerance.
o Instead of using parity, level 6 uses error-correcting codes such as the Reed–
Solomon codes. 2 bits of redundant data are stored for every 4 bits of data.

Factors in choosing RAID level


 Monetary cost
 Performance: Number of I/O operations per second, and bandwidth
during normal operation
 Performance during failure
 Performance during rebuild of failed disk

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 21


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Including time taken to rebuild failed disk

2. Discuss about primary file storage system.

 The File is a collection of records. Using the primary key, we can access the
records. The type and frequency of access can be determined by the type of
file organization which was used for a given set of records.

 File organization is a logical relationship among various records. This


method defines how file records are mapped onto disk blocks.

 File organization is used to describe the way in which the records are stored
in terms of blocks, and the blocks are placed on the storage medium.

 The first approach to map the database to the file is to use the several files
and store only one fixed length record in any given file. An alternative
approach is to structure our files so that we can contain multiple lengths for
records.

 Files of fixed length records are easier to implement than the files of variable
length records.

There several ways of organizing records in files

 Sequential – store records in sequential order, based on the value of the search
key of each record
 Heap – a record can be placed anywhere in the file where there is space
 Hashing – a hash function computed on some attribute of each record; the
result specifies in which block of the file the record should be placed

Sequential File Organization

 A sequential file is designed for efficient processing of records in sorted order


based on some search-key.

 A search key is any attribute or set of attributes; it need not be the primary
key, or even a super key.

 To permit fast retrieval of records in search-key order, we chain together


records by pointers. The pointer in each record points to the next record in
search-key order.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 22


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
 To minimize the number of block accesses in sequential file processing, we
store records physically in search-key order, or as close to search-key order
as possible.

Figure : Sequential file of account records


The records are stored in search-key order, using branchname as the search key.
Insertion Rules :
1. Locate the record in the file that comes before the record to be
inserted in search-key order.
2. If there is a free record within the same block, insert the new record
there.
3. Otherwise, insert the new record in an overflow block.

Figure: Sequential file after insertion


 If relatively few records need to be stored in overflow blocks, this approach
works well.
 However, the file should be reorganized so that it is once again physically
stored in sequential order. Such reorganizations are costly, and must be done
during times when the system load is low.
Clustering File Organization
 Simple file structure stores each relation in a separate file.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 23


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
 Can instead store several relations in one file using a clustering file organization.
 E.g., clustering organization of customer and depositor:
Mapping of Objects to Files
 Mapping objects to files is similar to mapping tuples to files in a relational system; object data
can be stored using file structures.
 Objects in O-O databases may lack uniformity and may be very large; such objects have to
managed differently from records in a relational system.
 Set fields with a small number of elements may be implemented using data structures such as
linked lists. o Set fields with a larger number of elements may be implemented as separate
relations in the database.
 Set fields can also be eliminated at the storage level by normalization.
 Similar to conversion of multivalued attributes of E-R diagrams to relations Objects are
identified by an object identifier (OID);
 logical identifiers do not directly specify an object‘s physical location; must maintain an index
that maps an OID to the object‘s actual location.
 physical identifiers encode the location of the object so the object can be found directly. Physical
OIDs typically have the following parts:
1. a volume or file identifier
2. a page identifier within the volume or file
3. an offset within the page
3. Describe the different method of implementing variable length records. (File
organization)MAY-2011

 Fixed Length Representation


o Reserved Space
o List Representation
 Disadvantage
 Variable-Length Records :
 Byte String Representation
 Slotted Page Structure

FIXED-LENGTH REPRESENTATION

 Another way to implement variable-length records efficiently in a file system is


to use one or more fixed-length records to represent one variable-length

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 24


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
record.

 There are two ways of doing this:

o Reserved Space
o List representation
1. Reserved space - If there is a maximum record length that is never exceeded, we
can use fixed-length records of that length and the free space is filled with a
special null, or end-of-record, symbol.

2. List representation - Variable-length records can be represented by lists of fixed


length records, chained together by pointers.

If we choose to apply the reserved-space method to our account example, we


need to select a maximum record length. A record in this file is of the account-list
type, but with the array containing exactly three elements. Those branches with fewer
than three accounts have records with null fields. We use the symbol ⊥ to represent
this situation. The reserved-space method is useful when most records have a length
close to the maximum. Otherwise, a significant amount of space may be wasted.
Disadvantage:
 Waste of space in all records except the first in a chain.
 The first record needs to have the branch-name value, but subsequent records
do not.
To deal with this problem, we allow two kinds of blocks in our file:
1. Anchor block, which contains the first record of a chain

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 25


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
2. Overflow block, which contains records other than those that are the first
record of a chain
Thus, all records within a block have the same length, even though not all
records in the file have the same length.

VARIABLE-LENGTH RECORDS :
 Variable-length records arise in database systems in several ways:
 Storage of multiple record types in a file
 Record types that allow variable lengths for one or more fields
 Record types that allow repeating fields
Consider an account record,

 We define account-info as an array with an arbitrary number of elements. That


is, the type definition does not limit the number of elements in the array. There
is no limit on how large a record can be.
Byte-String Representation
 A simple method for implementing variable-length records is to attach a
special end of record (⊥) symbol to the end of each record.
 We can then store each record as a string of consecutive bytes.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 26


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Fig : Byte String Representation


Disadvantages:
 It is not easy to reuse space occupied formerly by a deleted record.
 There is no space for records to grow longer.
 If a variable-length record becomes longer, it must be moved
Slotted-Page Structure
 Slotted-page structure is commonly used for organizing records within a single
block.

Fig : Slotted Page Structure


 There is a header at the beginning of each block, containing the following
information:
1. The number of record entries in the header
2. The end of free space in the block
3. An array whose entries contain the location and size of each record
 The actual records are allocated contiguously in the block, starting from the
end of the block.
 The free space in the block is contiguous, between the final entry in the header
array, and the first record.
 If a record is inserted, space is allocated for it at the end of free space, and an
entry containing its size and location is added to the header.
 If a record is deleted, the space that it occupies is freed, and its entry is set to
deleted.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 27


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
 Further, the records in the block before the deleted record are moved, so that
the free space created by the deletion gets occupied, and all free space is again
between the final entry in the header array and the first record.
 The end-of-free-space pointer in the header is appropriately updated as well.
 The slotted-page structure requires that there be no pointers that point
directly to records. Instead, pointers must point to the entry in the header that
contains the actual location of the record.
4. Explain the various indexing schemes used in database environment. MAY-
2011
What is the use of an index structure and explain the concept of ordered
Indices? DEC-2004
Explain the different properties of indexes in detail. MAY-2007
Explain the index schemas used in database systems. MAY-2008

 Indices- Definition
 Types of indices
o Ordered Indices
o Hash Indices
 Ordered Indices
o Primary index
 Dense Index
 Sparse Index
 Multilevel Indices
o Secondary Indices
 Hash indices

 The database system would look up an index to find on which disk block
the corresponding record resides, and then fetch the disk block, to get the record.
 There are two basic kinds of indices:
 Ordered indices - Based on a sorted ordering of the values.
 Hash indices - Based on a uniform distribution of values across a range
of buckets. The bucket to which a value is assigned is determined by a
function, called a hash function.
 Each technique must be evaluated on the basis of these factors:

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 28


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
 Access types : The types of access that are supported efficiently.
Access types can include finding records with a specified attribute
value and finding records whose attribute values fall in a
specified range.
 Access time : The time it takes to find a particular data item, or
set of Items.
 Insertion time: The time it takes to insert a new data item. This
value Includes the time it takes to find the correct place to insert
the new data item, as well as the time it takes to update the
index structure.
 Deletion time: The time it takes to delete a data item. This value
includes the time it takes to find the item to be deleted, as well as the
time it takes to update the index structure.
 Space overhead: The additional space occupied by an index
structure.
ORDERED INDICES
 Search Key : An attribute or set of attributes used to look up records
in a file.
 An ordered index stores the values of the search keys in sorted order,
and associates with each search key the records that contain it.
 The records in the indexed file may themselves be stored in some
sorted order.
 If the file containing the records is sequentially ordered, a primary
index is an index whose search key also defines the sequential order of
the file. Primary indices are also called clustering indices.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 29


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
PRIMARY INDEX

 All files are ordered sequentially on some search key.

 Files with a primary index on the search key, are called index- sequential files.

 They are designed for applications that require both sequential processing of
the entire file and random access to individual records.

 Figure shows a sequential file of account records taken from our banking
example.

 The records are stored in search-key order, with branch-name used as the
search key.

 Types of primary index

 Dense index
 Sparse index
 Multilevel indices
Dense index:
 In a dense primary index, the index record contains the search-key
value and a pointer to the first data record with that search-key value.
 The rest of the records with the same search key-value would be stored
sequentially after the first record, since, because the index is a primary
one, records are sorted on the same search key.
 Dense index implementations may store a list of pointers to all records
with the same search-key value.

Figure : Dense Index


Insertion

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 30


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
 The system performs a lookup using the search-key value that appears in the
record to be inserted.
1. If the search-key value does not appear in the index, the system inserts an index
record with the search-key value in the index at the appropriate position.
2. Otherwise the following actions are taken:
a. If the index record stores pointers to all records with the same search-key
value, the system adds a pointer to the new record to the index record.
b. If the index record stores a pointer to only the first record with the search-
key value. The system then places the record being inserted after the other
records with the same search-key values.
Deletion
 To delete a record, the system first looks up the record to be deleted.
1. If the deleted record was the only record with its particular search-key value, then
the system deletes the corresponding index record from the index.
2. Otherwise if the index record stores pointers to all records with the same search-
key value, the system deletes the pointer to the deleted record from the index
record.
3. Otherwise, the index record stores a pointer to only the first record with the
search - key value. In this case, if the deleted record was the first record with
the search-key value, the system updates the index record to point to the next
record.
Sparse index:
 An index record appears for only some of the search-key values.
 To locate a Record, we find the index entry with the largest search-key value
that is less than or equal to the search-key value for which we are looking.
 We start at the record pointed to by that index entry, and follow the pointers in
the file until we find the desired record.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 31


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Figure : Sparse Index


Insertion
The system performs a lookup using the search-key value that appears in the
record to be inserted.
Assume that the index stores an entry for each block.
 If the system creates a new block, it inserts the first search-key value
appearing in the new block into the index.
 On the other hand, if the new record has the least search-key value in its
block, the system updates the index entry pointing to the block; if not, the
system makes no change to the index.
Deletion
To delete a record, the system first looks up the record to be deleted.
1. If the index does not contain an index record with the search-key value of
the deleted record, nothing needs to be done to the index.
2. Otherwise the system takes the following actions:
a. If the deleted record was the only record with its search key, the system
replaces the corresponding index record with an index record for the next
search-key value. If the next search-key value already has an index
entry, the entry is deleted instead of being replaced.
b. Otherwise, if the index record for the search-key value points to the
record being deleted, the system updates the index record to point to the
next record with the same search-key value.
Advantage of Sparse Index:
1. It requires less space
2. It impose less maintenance overhead for insertions and deletions.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 32


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Disadvantage:
1. It is slower compared to dense index.
Multilevel Indices
 Indices with two or more levels are called multilevel indices.
 To locate a record, first use binary search on the outer index to find the record
or the largest search-key value less than or equal to the one that we desire.
 The pointer points to a block of the inner index.
 We scan this block until we find the record that has the largest search-key
value less than or equal to the one that we desire.
 The pointer in this record points to the block of the file that contains the
record for which we are looking.
 Each level of index could correspond to a unit of physical storage. Thus, we
may have indices at the track, cylinder, and disk levels.
Example: Dictionary

Figure : Two Level Multilevel indices


SECONDARY INDEX
 Secondary indices must be dense, with an index entry for every search-key
value, and a pointer to every record in the file.
 A secondary index must contain pointers to all the records.
 The pointers in such a secondary index do not point directly to the file.
Instead, each points to a bucket that contains pointers to the file.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 33


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
 Secondary indices improve the performance of queries that use keys other
than the search key of the primary index.
 However, they impose a significant overhead on modification of the database.

Figure: Secondary index on account file, on non candidate key balance


Hash Indices

 A hash index organizes the search keys, with their associated pointers,
into a hash file structure.
 Apply a hash function on a search key to identify a bucket, and store the
key and its associated pointers in the bucket.

 On the account file, for the search key account-number, the hash function
in the figure computes the sum of the digits of the account number
modulo 7.
 The hash index has seven buckets, each of size 2.
 One of the buckets has three keys mapped to it, so it has an overflow bucket.

Figure: Hash index on search key acc_no of account file

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 34


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
4. Explain static and dynamic Hashing Techniques?MAY-2008
Describe a hash file organization. DEC-2004
Explain various hashing techniques. MAY-2007

Static hashing
• Hash file organization
• Hash function
• Bucket overflows
o Reason for overflow
o Handling of overflows
• Hash indices
Dynamic Hashing
• Extendable Hashing

STATIC HASHING
Hash File Organization
 In a hash file organization, we obtain the address of the disk block
containing a desired record directly by computing a function on the search-key
value of the record.
 Bucket is used to denote a unit of storage that can store one or more records.
 A bucket is typically a disk block, but could be chosen to be smaller or larger
than a disk block.
Hash Functions
 Let K denote the set of all search-key values, and let B denote the set of all
bucket addresses.
 A hash function h is a function from K to B. Let h denote a hash function. To
insert a record with search key Ki, we compute h(Ki), which gives the address
of the bucket for that record.
 Qualities of Hash function :
• The distribution is uniform.
• The distribution is random.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 35


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Figure: Hash organization of account file, with branch-name as the key.


Bucket Overflows
 When a record is inserted, the bucket to which it is mapped has space to store
the record. If the bucket does not have enough space, a bucket overflow is
said to occur.
 Reason for Bucket overflows:
o Insufficient buckets. The number of buckets, nB, must be chosen
such that nB > nr/fr, where nr denotes the total number of records that
will be stored, and fr denotes the number of records that will fit in a
bucket. The number of buckets is chosen to be (nr/fr) * (1 + d), where d
is a fudge factor, typically around 0.2.
o Skew. Some buckets are assigned more records than others, so a
bucket may overflow even when other buckets still have space. This
situation is called bucket skew.
Skew can occur for two reasons:
1. Multiple records may have the same search key.
2. The chosen hash function may result in non uniform distribution

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 36


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
of search keys.
o Handling of overflow buckets
 Closed Hashing
 Open Hashing
Closed Hashing / Separate Chaining

Figure : Overflow chaining in a hash structure


 If a record must be inserted into a bucket b, and b is already full, the system
provides an overflow bucket for b, and inserts the record into the overflow
bucket.
 If the overflow bucket is also full, the system provides another overflow bucket,
and so on.
 All the overflow buckets of a given bucket are chained together in a linked list,
called overflow chaining
Open Hashing / Linear probing
 The set of buckets is fixed, and there are no overflow chains.
 Instead, if a bucket is full, the system inserts records in some other bucket in
the initial set of buckets B.
 Hash Indices
 A hash index organizes the search keys, with their associated pointers, into a
hash file structure. Apply a hash function on a search key to identify a bucket,
and
store the key and its associated pointers in the bucket.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 37


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Figure: Hash index on search key acc_no of account file


 On the account file, for the search key account-number. The hash function in
the figure computes the sum of the digits of the account number modulo 7.
The hash index has seven buckets, each of size 2. One of the buckets has
three keys mapped to it, so it has an overflow bucket.
DYNAMIC HASHING
 There are three options for static hashing
1. Choose a hash function based on the current file size. This option will result
in performance degradation as the database grows.
2. Choose a hash function based on the anticipated size of the file at some
Point in the future. Although performance degradation is avoided, a significant
amount of space may be wasted initially.
3. Periodically reorganize the hash structure in response to file growth. Such a
reorganization involves choosing a new hash function, recomputing the hash function
on every record in the file, and generating new bucket assignments.
This reorganization is a massive, time-consuming operation one form of
dynamic hashing, called extendable hashing.
Extendable hashing

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 38


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Extendable hashing copes with changes in database size by splitting and
coalescing buckets as the database grows and shrinks. As a result, space efficiency is
retained.
Example:
Account Table

 Hash function for branch name

 Initial extendable hash structure

 Hash structure after inserting all the records of the account relation

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 39


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Note : Refer class notes for detailed example


6. Briefly describe about B+ tree index file structure. (DEC-2010)

Mention the purpose of indexing . How this can be done by B+


tree? Explain.MAY-2010
Describe the structure of B+ tree and list the characteristics of a
B+ tree. DEC-2007
Explain the structure of B+ Tree and how to process queries in B+ tree.
DEC-2013,(Apr/May 2019)

B+ Tree - Introduction
 Structure of B+ tree
 Structure of the leaf nodes
 Queries On B+ Trees
 Find Operation
Algorithm
 Insert Operation
Algorithm
 Delete Operation
Algorithm

B+ TREE INDEX FILES


 The main disadvantage of the index-sequential file organization is that
performance degrades as the file grows.
 A B+ tree index is the form of a balanced tree in which every path
from the root of the tree to a leaf of the tree is of the same length.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 40


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
 Each non leaf node in the tree has between n/2 and n children, where
n is fixed for a particular tree.

Structure of a B+ Tree
 A B+ tree index is a multilevel index.

Figure: Typical node of a B+ Tree


 It contains upto n − 1 search-key values K1, K2, . . .,Kn−1, and n
pointers P1, P2, . . . , Pn. The search-key values within a node are kept
in sorted order; thus, if i < j, then Ki < Kj .
Structure of the leaf nodes

 For i = 1, 2, . . . , n − 1, pointer Pi points to either a file record with search-key value Ki


or to a bucket of pointers, each of which points to a file record with search-key value Ki.

Figure: A leaf node for account B+ Tree


 Each leaf can hold up to n − 1 values. Leaf nodes can contain as few as
(n−1)/2 values. The ranges of values in each leaf do not overlap.
 If Li and Lj are leaf nodes and i < j, then every search key value in Li is less than
every search-key value in Lj.
 The structure of non leaf nodes is the same as that for leaf nodes, except
that all pointers are pointers to tree nodes.
 A non leaf node may hold up to n pointers, and must hold at least n/2 pointers.
 The number of pointers in a node is called the fanout of the node.
 Let us consider a node containing m pointers. For i = 2, 3, . . .,m − 1, pointer Pi
points to the subtree that contains search-key values less than Ki and greater
than or equal to Ki−1.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 41


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
 Pointer Pm points to the part of the subtree that contains those key values greater
than or equal to Km−1, and pointer P1 points to the part of the subtree that
contains those search-key values less than K1

Figure: Complete B+ tree for the account file (n = 3).

QUERIES ON B+ TREES
 Find Operation
To find all records with a search-key value of V, steps are
1. Examine the root node, looking for the smallest search-key value greater
than V.
2. If the search-key value is Ki, then follow pointer Pi to another node.
3. If no such value, then k ≥ Km−1, where m is the number of pointers in the
node.
4. Follow Pm to another node. Repeat the steps from 1-4.
5. At the leaf node, the search key value Ki equals V is found, then pointer Pi
directs to the desired record or bucket.
6. If the value V is not found in the leaf node, no record with key value V exists.
Algorithm

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 42


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
 Insert Operation
Steps:
1. Find the leaf node in which the search-key value would appear.
2. If the search-key value already appears in the leaf node, add the new record
to the file and to the bucket a pointer to the record.
3. If the search-key value does not appear, insert the value in the leaf node,
and position it such that the search keys are still in order. Then new record is
inserted in the file , if necessary, a new bucket is
created with the appropriate pointer.
Algorithm :

Example Let us insert branch name “Clearview” into a B+ tree.

“Clearview” should appear in the node containing “Brighton” and “Downtown.”


There is no room to insert the search-key value “Clearview.” Therefore, the node is
split into two nodes with Brighton and Clearview in one node and Downtown in
another node.

Figure: Insertion of “Clearview” into B+ tree

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 43


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
 Delete Operation
Steps are
1. Find the record to be deleted, and remove it from the file.
2. Remove the search-key value from the leaf node.
Algorithm

Example :
Let us delete “ downtown “ from the B+ tree.

Locate the entry for “Downtown” then delete the entry for “Downtown” from its
leaf node, the leaf becomes empty. This node must be eliminated from the B+ tree. To
delete a leaf node, we must delete the pointer to it from its parent.

Figure : Deletion of Downtown from B+ tree

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 44


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
7. What are the steps involved in Query processing? How would you estimate
the cost of query ? ( DEC-2011)(Nov/Dec 2019)
 Query Processing - Definition
 Steps in query processing
 Query Execution Plan or Query Evaluation Plan
 Measures of Query Cost

Definition
 Query processing refers to the range of activities involved in extracting data
from a database.
 The activities include translation of queries in high-level database languages
into expressions that can be used at the physical level of the file system, a
variety of query-optimizing transformations, and actual evaluation of queries.
Steps involved in Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation
 Before query processing can begin, the system must translate the query into a
usable form.
 Query processing is to translate a given query into its internal form.
 In generating the internal form of the query, the parser checks the syntax of the
user’s query, verifies that the relation names appearing in the query are names
of the relations in the database, and so on.
 The system constructs a parse-tree representation of the query, which it then
translates into a relational-algebra expression.
 Each SQL query can itself be translated into a relational-algebra expression in
one of several ways.
 Furthermore, the relational-algebra representation of a query specifies only
partially how to evaluate a query.
 There are usually several ways to evaluate relational-algebra expressions.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 45


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Figure: Steps in query processing


Example : consider the query
select balance
from account
where balance < 2500
This query can be translated into either of the following relational-algebra
expressions:
• σbalance<2500 (Πbalance (account))
• Πbalance (σbalance<2500 (account))
Each relational-algebra operation can be executed by one of several different
algorithms.

Figure : Query Evaluation Plan


Query Execution plan or Query Evaluation plan :
 A sequence of primitive operations that can be used to evaluate a query is a
query execution plan or query-evaluation plan.
 The query-execution engine takes a query-evaluation plan, executes that
plan, and returns the answers to the query.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 46


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
 Once the query plan is chosen, the query is evaluated with that plan, and
the result of the query is the output.
 In order to optimize a query, a query optimizer must know the cost of each
operation.
Measures of Query Cost
 The cost of query evaluation can be measured in terms of a number of
different resources,
 disk accesses
 CPU time to execute a query
 cost of communication
 number of block transfers from disk
 A more accurate measure would therefore estimate
1. The number of seek operations performed
2. The number of blocks read
3. The number of blocks written.
8. Explain the following with relevant examples

1. B Tree
The primary distinction between the two approaches is that
1. A B-tree eliminates the redundant storage of search-key values. A B-
tree allows search-key values to appear only once.
2. Able to store the index in fewer tree nodes than in the corresponding B+
tree index.
Structure of a Leaf Node

Ki – Search Key value


Pi – record pointer or bucket pointer, pointing to the record
or bucket containing the record.
Structure of a non Leaf Node

In non leaf nodes, the


pointers Pi - tree

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 47


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
pointers
Bi - bucket or file-record pointers.
 In the generalized B-tree in the figure, there are n – 1 keys in the
leaf node, but there are m − 1 keys in the nonleaf node.

 This discrepancy occurs because nonleaf nodes must include


pointers Bi, thus reducing the number of search keys that can be
held in these nodes.

 The number of nodes accessed in a lookup in a B-tree depends on where


the search key is located.

 It is possible to find the desired value in a B-tree before reaching a leaf


node.

 However, roughly n times as many keys are stored in the leaf level of a B-
tree as in the nonleaf levels, and, since n is typically large, the benefit of
finding certain values early is relatively small.

 Lookup in a B-tree is faster for some search keys but slower for others.
Lookup time is still proportional to the logarithm of the number of search
keys.

 Deletion in a B-tree is more complicated. In a B-tree, the deleted entry may


appear in a nonleaf node.

 The proper value must be selected as a replacement from the tree of the
node containing the deleted entry.

 Specifically, if search key Ki is deleted, the smallest search key appearing in


the subtree of pointer Pi+1 must be moved to the field formerly occupied by
Ki.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 48


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
9. Explain in detail about algorithm for SELECT and JOIN operations?( DEC-
2010)(Nov/Dec 2019)
SELECTION OPERATION
In query processing, the file scan is the lowest-level operator to access
data. File scans are search algorithms that locate and retrieve records that
fulfill a selection condition.
In relational systems, a file scan allows an entire relation to be read in
those cases where the relation is stored in a single, dedicated file.
Basic Algorithms
Consider a selection operation on a relation whose tuples are stored
together in one file.
Two scan algorithms to implement the selection operation are:
• A1 (linear search) - In a linear search, the system scans each file block and
tests all records to see whether they satisfy the selection condition.
• A2 (binary search) - If the file is ordered on an attribute, and the selection
condition is an equality comparison on the attribute, binary search is used to locate
records that satisfy the selection.
Selections Using Indices
Index structures are referred to as access paths, since they provide a
path through which data can be located and accessed. Search algorithms that
use an index are referred to as index scans
Search algorithms that use an index are:
• A3 (primary index, equality on key) - For an equality comparison on a Key
attribute with a primary index, we can use the index to retrieve a single Record that
satisfies the corresponding equality condition.
• A4 (primary index, equality on non key) – Multiple records can be Retrieved
by using a primary index when the selection condition specifies an Equality
comparison on a non key attribute, A. The only difference from the Previous case is
that multiple records may need to be fetched.
• A5 (secondary index, equality) - Selections specifying an equality Condition
can use a secondary
index. This strategy can retrieve a single record if the equality condition is on a
key; multiple records may get retrieved if the indexing field is not a key.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 49


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Selections Involving Comparisons
• A6 (primary index, comparison) - A primary ordered index can be used when the
selection condition is a comparison.
For comparison conditions of the form A > v or A ≥ v, a primary index on A can be
used to direct the retrieval of tuples, as follows. For A ≥ v, look up the value v in the
index to find the first tuple in the file that has a value of A = v. A file scan starting
from that tuple up to the end of the file returns all tuples that satisfy the condition.
For A > v, the file scan starts with the first tuple such that A > v.
For comparisons of the form A < v or A ≤ v, an index lookup is not required. For A
< v, we use a simple file scan starting from the beginning of the file, and continuing
up to (but not including) the first tuple with attribute A = v. The case of A ≤ v is
similar, except that the scan continues up to (but not including) the first tuple with
attribute A > v. In either case, the index is not useful.
• A7 (secondary index, comparison) - A secondary ordered index is to guide
retrieval for comparison conditions involving <,≤,≥, or >. The lowest level index blocks
are scanned, either from the smallest value up to v (for <and ≤), or from v up to the
maximum value (for > and ≥).
Implementation of Complex Selections
• Conjunction: A conjunctive selection is a selection of the form
σθ1∧θ2∧···∧θn(r)
• Disjunction: A disjunctive selection is a selection of the form
σθ1∨θ2∨···∨θn(r)

• Negation: The result of a selection σ¬θ(r) is the set of tuples of r for which the

condition θ evaluates to false. In the absence of nulls, this set is simply the set of
tuples that are not in σθ(r).
• A8 (conjunctive selection using one index)- First determine whether an access
path is available for an attribute. If so then, one of the selection algorithms A2
through A7 can retrieve records satisfying that condition.
• A9 (conjunctive selection using composite index) - If the selection specifies an
equality condition on two or more attributes, and a composite index exists on these
combined attribute fields, then the index can be searched directly.
• A10 (conjunctive selection by intersection of identifiers) - This algorithm
requires indices with record pointers, on the fields involved in the individual

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 50


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
conditions. The algorithm scans each index for pointers to tuples that satisfy an
individual condition.
The cost of algorithm A10 is the sum of the costs of the individual index
scans, plus the cost of retrieving the records in the intersection of the
retrieved lists of pointers. This cost can be reduced by sorting the list of
pointers and
retrieving records in the sorted order.
• A11 (disjunctive selection by union of identifiers) - If access paths are available
on all the conditions of a disjunctive selection, each index is scanned for pointers to
tuples that satisfy the individual condition. The union of all the retrieved pointers
yields the set of pointers to all tuples that satisfy the disjunctive condition.
Algorithms for implementing JOIN Operation
 Join: time-consuming operation. We will consider only natural join
operation
 Two-way join: join on two files.
 Multiway join: involving more than two files
 The following examples of two-way JOIN operation (RΘ A=BS) will be used: –
OP6: EMPLOYEE Θ DNO=DNUMBER DEPARTMENT
OP7: DEPARTMENT Θ MGRSSN=SSN EMPLOYEE
J1: Nested-loop join (brute force)
 For each record t in R (outer loop), retrieve every record s from S (inner loop)
and test whether the two records satisfy the join condition t[A] = s[B].
J2: Single-loop join (using an access structure to retrieve the matching records)
 If an index (or hash key) exists for one of the two join attributes (e.g B of S),
retrieve each record t in R, one at a time (single loop), and then use the access
structure to retrieve directly all matching records s from S that satisfy s[B] =
t[A].
J3: Sort-merge join:
 If the records of R and S are physically sorted (ordered) by value of the join
attributes A and B, respectively, we can implement the join in the most efficient
way.
 Both files are scanned concurrently in order of the join attributes, matching the
records that have the same values for A and B.

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 51


MAILAM ENGINEERING COLLEGE, MAILAM
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
 If the files are not sorted, they may be sorted first by using external sorting.
 Pairs of file blocks are copied into memory buffers in order and records of each
file are scanned only once each for matching with the other file if A & B are key
attributes.
 The method is slightly modified in case where A and B are not key attributes.
J4: Hash-join
 The records of files R and S are both hashed to the same hash file using the
same hashing function on the join attributes A of R and B of S as hash keys.
April/may 2019
PART A

1. Define Dense index?(page no:11,Q.no:74)


2. Mention all the operation of files? (page no:15,Q.no:84)
PART B

1.a) What is RAID? Briefly discuss about RAID. (page no:15,Q.no:1)


Or
b) Describe the structure of B+ tree and give the algorithm for search in the B+
tree with example? (page no:40,Q.no:6)
NOVEMBER/DECEMBER 2019
PART-A
1. How do you represent leaf node of a B+ tree of order p ? ( page no:15,Q.no:85)
2. Which cost components contribute to query execution? ( page no:13,Q.no:83)
PART-A
1.a)i) Explain the various levels of RAID systems? (10 Marks)( page no:15,Q.no:1)
ii) Why data dictionary storage is important? (3) (Refer to Unit 1)
b)i) With simple algorithms explain the computing of Nested-loop join and
Block Nested-loop join. (10) ( page no:49,Q.no:9)
ii) Sketch and concise the basic steps in Query Processing. (3) ( page
no:46,Q.no:7)

PREPARED BY:S.VANAKOVARAYAN AP/CSE, D.SARANYA AP/CSE 52

You might also like