Nothing Special   »   [go: up one dir, main page]

DBMS Unit 3

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 81

UNIT 3

DATA STORAGE AND QUERY PROCESSING

Record storage and Primary file


organization- Secondary storage
Devices- Operations on Files- Heap
File- Sorted Files- Hashing
Techniques – Index Structure for files
–Different types of Indexes- B-Tree -
B+Tree – Query Processing.
Storage System in DBMS
• A database system provides an ultimate view of the stored
data.
Types of Data Storage
Classification of Physical Storage
Media
 Speed with which data can be accessed
 Cost per unit of data
 Reliability
 data loss on power failure or system crash
 physical failure of the storage device
 Can differentiate storage into:
 volatile storage: loses contents when power is
switched off
 non-volatile storage:
 Contents persist even when power is
switched off.
 Includes secondary and tertiary storage, as
well as battery-backed up main-memory.

Database System Concepts - 5th Edition 11.4 ©Silberschatz, Korth and Sudarshan
• File – A file is named collection of related information that is
recorded on secondary storage such as magnetic disks, magnetic
tables and optical disks.
Types of file organization

 Sequential file organization


 Heap file organization
 Hash file organization
 B+ file organization
 Indexed sequential access method (ISAM)
 Cluster file organization
Sequential File Organization
This method is the easiest method for file organization. In this
method, files are stored sequentially.
This method can be implemented in two ways:
1. Pile File Method
2. Sorted File Method
1. Pile File Method
• It is a quite simple method. In this method, we store the record
in a sequence, i.e., one after another.
• Here, the record will be inserted in the order in which they are
inserted into tables.
• In case of updating or deleting of any record, the record will be
searched in the memory blocks.
• When it is found, then it will be marked for deleting, and the
new record is inserted.
Inserting a new record

When a new record is inserted, it is placed at the end of the


file.
2. Sorted File Method
• In this method, the new record is always inserted at the file's
end, and then it will sort the sequence in ascending or
descending order.
• Sorting of records is based on any primary key or any other key.
• In the case of modification of any record, it will update the
record and then sort the file, and lastly, the updated record is
placed in the right place.
Inserting a new record
Pros of sequential file organization
• It contains a fast and efficient method for the huge amount of
data.
• It is simple in design. It requires no much effort to store the data.
• This method is used when most of the records have to be
accessed like grade calculation of a student, generating the
salary slip, etc.
• This method is used for report generation or statistical
calculations.

Cons of sequential file organization


• It will waste time as we cannot jump on a particular record that
is required but we have to move sequentially which takes our
time.
• Sorted file method takes more time and space for sorting the
records.
Heap file organization
• It is the simplest and most basic type of organization. It works
with data blocks.
• In heap file organization, the records are inserted at the file's
end. When the records are inserted, it doesn't require the
sorting and ordering of records.
• When the data block is full, the new record is stored in some
other block.
• This new data block need not to be the very next data block,
but it can select any data block in the memory to store new
records. The heap file is also known as an unordered file.
If a new record is inserted, then in the above case it will be
inserted into data block 1.
Insertion of a new record

If we want to search, update or delete the data in heap file


organization, then we need to traverse the data from staring of the
file till we get the requested record.
• If the database is very large then searching, updating or deleting of
record will be time-consuming because there is no sorting or
ordering of records.
• In the heap file organization, we need to check all the data until we
get the requested record.

Pros of Heap file organization


• It is a very good method of file organization for bulk insertion. If
there is a large number of data which needs to load into the
database at a time, then this method is best suited.

Cons of Heap file organization


• This method is inefficient for the large database because it takes
time to search or modify the record.
Hash File Organization
• Hashing is an efficient technique to directly search the location of
desired data on the disk without using index structure.
• Data is stored at the data blocks whose address is generated by
using hash function.
• The memory location where these records are stored is called as
data block or data bucket.
• Data bucket – Data buckets are the memory locations where
the records are stored. These buckets are also considered
as Unit Of Storage.
• Hash Function – Hash function is a mapping function that
maps all the set of search keys to actual record address.
Generally, hash function uses primary key to generate the hash
index – address of the data block.
• Hash Index-The prefix of an entire hash value is taken as a
hash index.
• Every hash index has a depth value to signify how many bits
are used for computing a hash function.
• These bits can address 2n buckets. When all these bits are
consumed ? then the depth value is increased linearly and twice
the buckets are allocated.
Types of Hashing
Static Hashing
• In static hashing, the resultant data bucket address will always be
the same.
• Here, there will be no change in the bucket address.
• Hence in this static hashing, the number of data buckets in
memory remains constant throughout.
Operations of Static Hashing
• Searching a record
When a record needs to be searched, then the same hash function
retrieves the address of the bucket where the data is stored.
• Insert a Record
When a new record is inserted into the table, then we will
generate an address for a new record based on the hash key and
record is stored in that location.
• Delete a Record
To delete a record, we will first fetch the record which is
supposed to be deleted. Then we will delete the records for that
address in memory.
• Update a Record
To update a record, we will first search it using a hash function,
and then the data record is updated.
• Static hashing is further divided into
1.Open hashing
2.Close hashing
1. Open Hashing
• When a hash function generates an address at which data is
already stored, then the next bucket will be allocated to it. This
mechanism is called as Linear Probing.
2. Close Hashing
• When buckets are full, then a new data bucket is allocated for
the same hash result and is linked after the previous one. This
mechanism is known as Overflow chaining.
Dynamic Hashing
• The dynamic hashing method is used to overcome the problems of
static hashing like bucket overflow.
• In this method, data buckets grow or shrink as the records
increases or decreases. This method is also known as Extendable
hashing method.
• This method makes hashing dynamic, i.e., it allows insertion or
deletion without resulting in poor performance.
Indexing in DBMS
• Indexing is used to optimize the performance of a database
by minimizing the number of disk accesses required when a
query is processed.
• The index is a type of data structure. It is used to locate and
access the data in a database table quickly.

Index structure:
Indexes can be created using some database columns.
• The first column of the database is the search key that contains
a copy of the primary key or candidate key of the table. The
values of the primary key are stored in sorted order so that the
corresponding data can be accessed easily.
• The second column of the database is the data reference. It
contains a set of pointers holding the address of the disk block
where the value of the particular key can be found.
Ordered indices
How do we find the data record using
the sparse index ?
Look up search key ≤ 40 in sparse index file
Follow database address to data block (or
next level index !!)
Search key in data block
B - Tree
B-Tree is a self-balanced search tree in which every
node contains multiple keys and has more than two
children.
B-Tree of Order m has the following properties...
Property #1 - All leaf nodes must be at same level.
Property #2 - All nodes except root must have at least [m/2]-1 keys
and maximum of m-1 keys.
Property #3 - All non leaf nodes except root (i.e. all internal nodes)
must have at least m/2 children.
Property #4 - If the root node is a non leaf node, then it must
have atleast 2 children.
Property #5 - A non leaf node with n-1 keys must have n number of
children.
Property #6 - All the key values in a node must be in Ascending
Order.
Operations on a B-Tree
The following operations are performed on a B-Tree...
1.Search
2.Insertion
3.Deletion
B+ Tree
• B+ Tree is an extension of B Tree which allows efficient
insertion, deletion and search operations.
Structure of B+ Tree

•In the B+ tree, every leaf node is at equal distance from the root
node.
•The B+ tree is of the order n where n is fixed for every B+ tree.

•It contains an internal node and leaf node.


Structure of an internal node of a B+-tree
Structure of an internal node (including the root
node):
Query Processing
 Query processing:
 Is the list of activities that are perform to obtain the
required tuples that satisfy a given query.
 Query optimization:
 The process of choosing a suitable
execution strategy for processing a query.
 Two internal representations of a query:
 Query Tree
 Query Graph
Steps in Query Processing
Parser checks the syntax of Translator translates the
query and verifies query into its internal
attribute name and form (relational algebra)
relation name
Parser and Relational algebra
Query
translator expression

Choose best execution plan


Optimizer
Execute the query‐evaluation
plan and returns output
Evaluation
Query output Execution plan
engine

Database Catalog
Data Statistics about Data
Translating SQL Queries into Relational
Algebra
SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > ( SELECT MAX (SALARY)
FROM EMPLOYEE
WHERE DNO = 5);

SELECT LNAME, FNAME SELECT MAX (SALARY)


FROM EMPLOYEE FROM EMPLOYEE
WHERE SALARY > C WHERE DNO = 5

πLNAME, FNAME (σSALARY>C(EMPLOYEE)) πMAX SALARY (σDNO=5 (EMPLOYEE))


Measures of Query Cost
 Cost is generally measured as the total time required to execute a
statement/query.
 Factors contribute to time cost
1. Disk accesses (time to process a data request and retrieve the required
data from the storage device)
2. CPU time to execute a query
3. Network communication cost
 Disk access is the predominant (major) cost, since disk access is
slow as compared to in‐memory operation.
 Cost to write a block is greater than cost to read a block because
data is read back after being written to ensure that the write was
successful.
Selection operation
 Symbol:  (Sigma)
 Notation:  condition(Relation)
 Operation: Selects tuples from a relation that satisfy a
given condition.
RollNo Name Branch SPI
101 Raj CE 8
102 Meet ME 9
103 Harsh EE 8
104 Punit CE 9

 Branch=‘CE’ (Student)
RollNo Name Branch SPI
101 Raj CE 8
104 Punit CE 9
Search algorithm for selection operation
1. Linear search (A1)
2. Binary search (A2)
Linear search (A1)
 It scans each blocks and tests all records to see whether
they
satisfy the selection condition.
• Cost of linear search (worst case) = br
br denotes number of blocks containing records from relation r
 If the selection condition is there on a (primary) key attribute,
then system can stop searching if the required record is found.
• cost of linear search (best case) = (br /2)
 Linear search can be applied regardless of
• selection condition or
• ordering of records in the file (relation)
 This algorithm is slower than binary search algorithm.
Binary search (A2)
 Generally, this algorithm is used if selection is an equality
comparison on the (primary) key attribute and file (relation) is
ordered (sorted) on (primary) key attribute.
 cost of binary search = [log2(br)]
• br denotes number of blocks containing records from relation r
 If the selection is on non (primary) key attribute then multiple
block may contains required records, then the cost of scanning
such blocks need to be added to the cost estimate.
 This algorithm is faster than linear search algorithm.
Evaluation of expressions
 Expression may contain more than one operations,
solving expression will be difficult if it contains more than one
expression. customer )
 Cust_Name ( Balance<2500 (account)
 To evaluate such expression we need to evaluate each operation
one by one in appropriate order.
3  Cust_Name

2
Bottom to top
Execution

1 customer
 Balance<2500
account
Query optimization
 It is a process of selecting the most efficient query evaluation
plan from the available possible plans.
Customer )
 Cust_Name ( Balance<2500 (Account)
Efficient plan 2 records 4 records

Customer ))
 Cust_Name ( Balance<2500 (Account

Customer 4 records Account 4 records


Cid Ano Cust_name Ano Balance
C01 A01 Raj A01 3000
C02 A02 Meet A02 1000
C03 A03 Harsh A03 2000
C04 A04 Punit A04 4000
Approaches to Query Optimization
1. Exhaustive Search Optimization
• Generates all possible query plans and then the best plan is selected.
• It provides best solution.
2. Heuristic Based Optimization
 Heuristic based optimization uses rule‐based
approaches for query optimization.
optimization
• Performs select and project operations before join operations. This is done
by moving the select and project operations down the query tree. This
reduces the number of tuples available for join.
• Avoid cross‐product operation because they result in very large‐sized
intermediate tables.
• This algorithms do not necessarily produce the best query plan.
Transformation of relational expressions
 Two relational algebra expressions are said to be equivalent if the
two expressions generate the same set of tuples.
 Example:
Customer Account
Cid Ano Cust_name Ano Balance Cust_name
C01 A01 Raj A01 3000 Meet
C02 A02 Meet A02 1000 Harsh
C03 A03 Harsh A03 2000
C04 A04 Punit A04 4000

Customer ))
 Cust_Name ( Balance<2500 (Account

Customer )
 Cust_Name ( Balance<2500 (Account)
Transformation of relational expressions
1. Combined selection operation can be divided into sequence of
individual selections. This transformation is called cascade of σ.
Customer Output
Cid Ano Cust_name Balance Cid Ano Cust_name Balance
C01 1 Raj 3000 C02 2 Meet 1000
C02 2 Meet 1000
C03 3 Harsh 2000
C04 4 Punit 4000

σAno<3 Λ Balance<2000 (Customer) = σAno<3 (σBalance<2000 (Customer))

= σθ1(σθ2 (E))
σθ1Λθ2 (E)
Transformation of relational expressions
2. Selection operations are commutative.

Customer Output
Cid Ano Cust_name Balance Cid Ano Cust_name Balance
C01 1 Raj 3000 C02 2 Meet 1000
C02 2 Meet 1000
C03 3 Harsh 2000
C04 4 Punit 4000

σAno<3 (σBalance<2000 (Customer)) = σBalance<2000 (σAno<3 (Customer))

σθ1(σθ2 (E) = σθ2(σθ1 (E))


Transformation of relational expressions
3. If more than one projection operation is used in expression then
only the outer projection operation is required. So skip all the
other inner projection operation.
Customer Output
Cid Ano Cust_name Balance Cust_name
C01 1 Raj 3000 Raj
C02 2 Meet 1000 Meet
C03 3 Harsh 2000 Harsh
C04 4 Punit 4000 Punit

∏Cust_name (∏Ano, Cust_name (Customer)) = (Customer)


∏Cust_name
∏L1 (∏L2 (… (∏Ln (E))…)) = ∏L1 (E)
Transformation of relational expressions
4. Selection operation can be joined with cartesian product and
theta join.
Customer Balance Output
Cid Ano Cust_name Ano Balance Cid Ano Cust_name Balance
C01 1 Raj 1 3000 C01 1 Raj 3000
C02 2 Meet 2 1000 C02 2 Meet 1000
C03 3 Harsh 3 2000
C04 4 Punit 4 4000

σ (Customer
Ano<3 = (Customer) σ Ano<3
Balance) (Balance)
σ θ (E1 E2) = E1 θ E2

σ θ1 (E1 θ2E2) =
θ1Λθ2 E2
E1
Transformation of relational expressions
5. Theta operations are commutative.

Customer Balance Output


Cid Ano Cust_name Ano Balance Cid Ano Cust_name Balance
C01 1 Raj 1 3000 C01 1 Raj 3000
C02 2 Meet 2 1000 C02 2 Meet 1000
C03 3 Harsh 3 2000
C04 4 Punit 4 4000

(Balance) σ Ano<3 = (Customer) σ Ano<3


(Customer) (Balance)

E1 θ E2 = E2 θ E1
Transformation of relational expressions
6. Natural join operations are associative.
(E1 E2) E3 = E1 (E2 E3)
7. Selection operation distribute over theta join operation
under the following condition
• When all the attributes in the selection condition θ0 involves only
the attributes of the one of the expression (says E1) being joined.

σ θ0 (E1 θ E2) = (σθ0 (E1)) θ E2


• When the selection condition θ1 involves only the attributes of E1 and θ2
involves only the attributes of E2.
E2) = (σθ1(E1) (σθ2 (E2)))
σθ1Λθ2 (E1 θ θ
Transformation of relational expressions
8. Set operations union and intersection are commutative.
set difference is not commutative Union Intersect
Customer Employee Output Output
Cust_name Emp_name Name Name
Raj Meet Raj Meet
Meet Suresh Meet
Suresh

Customer U = Employee U
Employee Customer
Customer ∩ Employee = Employee ∩ Customer

E1 U E2 = E2
U E1 E1 ∩ E2 =
E2 ∩ E1
Transformation of relational expressions
9. Set operations union and intersection are associative.
Union Intersect
Customer Employee Student Output Output
Cust_name Emp_name Emp_name Name Name
Raj Meet Raj Raj Meet
Meet Suresh Meet Meet
Suresh

(Customer U Employee) U Student = Customer U (Employee U Student)

(Customer ∩ Employee) ∩ Student = Customer ∩ (Employee ∩ Student)

(E1 U E2) U E3 = E1 U
(E2 U E3) (E1 ∩ E2) ∩ E2 =
E1 ∩ (E2 ∩ E3)
Transformation of relational expressions
10. Selection operation distributes over U, ∩ and –.
σ (E1 – E2)
θ = σθ(E1) – σθ(E2)
similarly selection operation is distributed for U and ∩ also.
Result of the expressions
 Expression may contain more than one operations,
solving expression will be difficult if it contains more than one
expression. customer )
 Cust_Name ( Balance<2500 (account)
 To evaluate such expression we need to evaluate each operation
one by one in appropriate order.
3  Cust_Name

2
Bottom to top
Execution

1 customer
 Balance<2500
account

You might also like