DBMS 3 - 5 Unit PM
What is SQL?
SQL is short for Structured Query Language, and it is pronounced as S-Q-L or
sometimes as See-Quell.
This database language is mainly designed for maintaining the data in relational
database management systems. It is a special tool used by data professionals for
handling structured data (data which is stored in the form of tables). It is also designed
for stream processing in RDSMS.
You can easily create and manipulate databases, and access and modify table rows
and columns, etc. This query language became an ANSI standard in 1986 and an ISO
standard in 1987.
If you want to get a job in the field of data science, it is the most important query
language to learn. Big enterprises like Facebook, Instagram, and LinkedIn use SQL for
storing data in the back-end.
The classic query engine allows data professionals and users to handle non-SQL queries.
The architecture of SQL is shown in the following diagram:
Set By
Mr. Palampalli Mohan
Update View
The SQL UPDATE Query is used to modify the existing records in a table or a view.
It is a Data Manipulation Language Command as it only modifies the data of the
database object.
Since it interacts directly with the data of a table or view, the UPDATE statement needs
to be used cautiously. If the rows to be modified are not selected beforehand, all the
rows in the table (or in the table associated with the view) will be affected, and the
correct data will either be lost or need to be reinserted. Therefore, to filter the records
that need to be modified, use a WHERE clause. With a WHERE clause, you can update
either a single row or multiple rows.
The UPDATE statement makes use of locks on each row while modifying them in a
table or view, and once the row is modified, the lock is released. Therefore, it can
either make changes to a single row or multiple rows with a single query.
Example
Assume we have created a table named Customers using a CREATE TABLE statement
along the following lines (the exact query was lost from these notes; this is an
illustrative reconstruction):

CREATE TABLE Customers (
   ID INT NOT NULL,
   NAME VARCHAR(20) NOT NULL,
   AGE INT NOT NULL,
   SALARY DECIMAL(18, 2),
   PRIMARY KEY (ID)
);

An UPDATE against it would then look like:

UPDATE Customers SET SALARY = 15000 WHERE ID = 3;
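A runnable sketch of the difference a WHERE clause makes, using Python's sqlite3 with an illustrative Customers table (the column names and values here are assumptions, not from the notes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Customers (ID INTEGER PRIMARY KEY, Name TEXT, Age INTEGER, Salary REAL)")
cur.executemany("INSERT INTO Customers VALUES (?, ?, ?, ?)",
                [(1, "Ramesh", 32, 2000.0), (2, "Khilan", 25, 1500.0), (3, "Kaushik", 23, 2000.0)])

# With a WHERE clause, only the matching row is modified.
cur.execute("UPDATE Customers SET Salary = 3000.0 WHERE ID = 2")
salary_khilan = cur.execute("SELECT Salary FROM Customers WHERE ID = 2").fetchone()[0]
print(salary_khilan)  # 3000.0

# Without a WHERE clause, every row in the table is modified.
cur.execute("UPDATE Customers SET Age = Age + 1")
total_age = cur.execute("SELECT SUM(Age) FROM Customers").fetchone()[0]
print(total_age)  # (32+1) + (25+1) + (23+1) = 83
```

Forgetting the WHERE clause is exactly the failure mode the paragraph above warns about: the second UPDATE silently touched all three rows.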
Relational database design (RDD) models information and data as a set of tables
with rows and columns. Each row of a relation/table represents a record, and each
column represents an attribute of data. The Structured Query Language (SQL) is
used to manipulate relational databases. The design of a relational database is
composed of four stages, where the data are modeled into a set of related tables.
The stages are −
• Define relations/attributes
• Define primary keys
• Define relationships
• Normalization
Relational databases differ from other databases in their approach to organizing data
and performing transactions. In an RDD, the data are organized into tables and all
types of data access are carried out via controlled transactions. Relational database
design satisfies the ACID (atomicity, consistency, isolation, and durability) properties
required from a database design. Relational database design mandates the use of a
database server in applications for dealing with data management problems.
• Once you have decided on the purpose of the database, gather the data
that are needed to be stored in the database. Divide the data into subject-
based tables.
• Choose one column (or a few columns) as the so-called primary key,
which uniquely identifies each of the rows.
Step 3 − Create Relationships among Tables
A database consisting of independent and unrelated tables serves little purpose (you
may consider using a spreadsheet instead). The power of a relational database lies
in the relationship that can be defined between tables. The most crucial aspect in
designing a relational database is to identify the relationships among tables. The
types of relationship include:
• one-to-many
• many-to-many
• one-to-one
One-to-Many
In a "class roster" database, a teacher may teach zero or more classes, while a class
is taught by one (and only one) teacher.
The column teacherID in the child table Classes is known as the foreign key. A foreign
key of the child table is a primary key of the parent table, used to reference the parent
table.
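A minimal sketch of this parent/child relationship using Python's sqlite3; the extra columns and data are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.execute("CREATE TABLE Teachers (teacherID INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE Classes (
    classID   INTEGER PRIMARY KEY,
    title     TEXT,
    teacherID INTEGER REFERENCES Teachers(teacherID)  -- foreign key to the parent
)""")
conn.execute("INSERT INTO Teachers VALUES (1, 'Alice')")
conn.execute("INSERT INTO Classes VALUES (10, 'Algebra', 1)")   # one teacher ...
conn.execute("INSERT INTO Classes VALUES (11, 'Geometry', 1)")  # ... many classes

# A child row referencing a non-existent parent is rejected.
rejected = False
try:
    conn.execute("INSERT INTO Classes VALUES (12, 'Biology', 99)")
except sqlite3.IntegrityError:
    rejected = True
classes_of_alice = conn.execute(
    "SELECT COUNT(*) FROM Classes WHERE teacherID = 1").fetchone()[0]
print(rejected, classes_of_alice)  # True 2
```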
Many-to-Many
In a "product sales" database, a customer's order may contain one or more products;
and a product can appear in many orders. In a "bookstore" database, a book is written
by one or more authors; while an author may write zero or more books. This kind of
relationship is known as many-to-many.
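A many-to-many relationship is normally implemented with a third "junction" table holding one row per pair. A minimal sketch using Python's sqlite3; the table names (Books, Authors, BookAuthors) and data are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Books   (bookID   INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE Authors (authorID INTEGER PRIMARY KEY, name  TEXT);
CREATE TABLE BookAuthors (               -- junction table: one row per (book, author) pair
    bookID   INTEGER REFERENCES Books(bookID),
    authorID INTEGER REFERENCES Authors(authorID),
    PRIMARY KEY (bookID, authorID)
);
INSERT INTO Books   VALUES (1, 'Databases'), (2, 'Algorithms');
INSERT INTO Authors VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO BookAuthors VALUES (1, 1), (1, 2), (2, 2);  -- book 1 has two authors; Bob wrote two books
""")
rows = conn.execute("""
    SELECT a.name, COUNT(*) FROM Authors a
    JOIN BookAuthors ba ON ba.authorID = a.authorID
    GROUP BY a.name ORDER BY a.name
""").fetchall()
print(rows)  # [('Ann', 1), ('Bob', 2)]
```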
One-to-One
In a "product sales" database, a product may have optional supplementary
information such as an image, a longer description, and comments. Keeping them inside
the Products table would result in many empty spaces (in those records without these
optional data). Furthermore, these large data items may degrade the performance of
the database.
Column Data Types
You need to choose an appropriate data type for each column. Common data types
include integers, floating-point numbers, strings (or text), date/time, binary, and
collections (such as enumeration and set).
Step 4 − Refine & Normalize the Design
Apply the normalization rules to check whether the design is structurally correct and
optimal.
Functional Dependency ?
The functional dependency is a relationship that exists between two attributes. It
typically exists between the primary key and non-key attribute within a table.
X → Y
The left side of an FD is known as the determinant; the right side is known as the
dependent.
For example: the Emp_Id attribute can uniquely identify the Emp_Name attribute of
the employee table, because if we know the Emp_Id, we can tell the employee name
associated with it:
Emp_Id → Emp_Name
Example:
ID → Name,
Name → DOB
Normalization ?
o Normalization is the process of organizing the data in the database.
o Normalization is used to minimize the redundancy from a relation or set of relations.
It is also used to eliminate undesirable characteristics like Insertion, Update, and
Deletion Anomalies.
o Normalization divides the larger table into smaller tables and links them using relationships.
o The normal form is used to reduce redundancy from the database table.
The main reason for normalizing relations is to remove these anomalies. Failure to
eliminate anomalies leads to data redundancy and can cause data integrity and other
problems as the database grows. Normalization consists of a series of guidelines that
help you create a good database structure.
o Insertion Anomaly: An insertion anomaly occurs when one cannot insert a new tuple
into a relation due to a lack of data.
o Deletion Anomaly: The delete anomaly refers to the situation where the deletion of
data results in the unintended loss of some other important data.
o Update Anomaly: An update anomaly occurs when an update of a single data value
requires multiple rows of data to be updated.
Normal Form   Description
1NF           A relation will be in 1NF if it contains only atomic (single-valued) attributes.
2NF           A relation will be in 2NF if it is in 1NF and all non-key attributes are fully functionally dependent on the primary key.
3NF           A relation will be in 3NF if it is in 2NF and no transitive dependency exists for non-prime attributes.
BCNF          A relation will be in BCNF if, for every functional dependency X → Y, X is a super key.
4NF           A relation will be in 4NF if it is in Boyce Codd normal form and has no multi-valued dependency.
5NF           A relation is in 5NF if it is in 4NF and does not contain any join dependency; joining should be lossless.
Advantages of Normalization
o Normalization helps to minimize data redundancy.
o Greater overall database organization.
o Data consistency within the database.
o Much more flexible database design.
o Enforces the concept of relational integrity.
Disadvantages of Normalization
o You cannot start building the database before knowing what the user needs.
o The performance degrades when normalizing the relations to higher normal forms, i.e.,
4NF, 5NF.
o It is very time-consuming and difficult to normalize relations of a higher degree.
o Careless decomposition may lead to a bad database design, leading to serious
problems.
EMPLOYEE table:
EMP_ID   EMP_NAME   EMP_PHONE                 EMP_STATE
14       John       7272826385, 9064738238    UP
The decomposition of the EMPLOYEE table into 1NF is shown below:
EMP_ID   EMP_NAME   EMP_PHONE      EMP_STATE
14       John       7272826385     UP
14       John       9064738238     UP
Example: Let's assume, a school can store the data of teachers and the subjects they
teach. In a school, a teacher can teach more than one subject.
TEACHER table
TEACHER_ID   SUBJECT     TEACHER_AGE
25           Chemistry   30
25           Biology     30
47           English     35
83           Math        38
83           Computer    38
To convert the given table into 2NF, we decompose it into two tables:
TEACHER_DETAIL table:
TEACHER_ID TEACHER_AGE
25 30
47 35
83 38
TEACHER_SUBJECT table:
TEACHER_ID SUBJECT
25 Chemistry
25 Biology
47 English
83 Math
83 Computer
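This decomposition is lossless: joining TEACHER_DETAIL and TEACHER_SUBJECT on TEACHER_ID reconstructs the original TEACHER table. A minimal sketch, with plain dicts and lists standing in for the tables:

```python
teacher_detail  = {25: 30, 47: 35, 83: 38}                      # TEACHER_ID -> TEACHER_AGE
teacher_subject = [(25, 'Chemistry'), (25, 'Biology'),
                   (47, 'English'), (83, 'Math'), (83, 'Computer')]

# Natural join on TEACHER_ID.
rejoined = sorted((tid, subj, teacher_detail[tid]) for tid, subj in teacher_subject)

original = sorted([(25, 'Chemistry', 30), (25, 'Biology', 30),
                   (47, 'English', 35), (83, 'Math', 38), (83, 'Computer', 38)])
print(rejoined == original)  # True -- no information was lost by the decomposition
```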
A relation is in third normal form if it holds at least one of the following conditions for
every non-trivial functional dependency X → Y:
1. X is a super key.
2. Y is a prime attribute, i.e., each element of Y is part of some candidate key.
Example:
EMPLOYEE_DETAIL table:
Non-prime attributes: In the given table, all attributes except EMP_ID are non-
prime.
That is why we need to move EMP_CITY and EMP_STATE to the new EMPLOYEE_ZIP
table, with EMP_ZIP as the primary key.
EMPLOYEE table:
EMPLOYEE_ZIP table:
EMP_ZIP   EMP_STATE   EMP_CITY
201010    UP          Noida
02228     US          Boston
60007     US          Chicago
06389     UK          Norwich
462007    MP          Bhopal
Example: Let's assume there is a company where employees work in more than one
department.
EMPLOYEE table:
1. EMP_ID → EMP_COUNTRY
2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
The table is not in BCNF because neither EMP_DEPT nor EMP_ID alone is a key.
To convert the given table into BCNF, we decompose it into three tables:
EMP_COUNTRY table:
EMP_ID EMP_COUNTRY
264 India
EMP_DEPT table:
EMP_DEPT_MAPPING table:
EMP_ID EMP_DEPT
D394 283
D394 300
D283 232
D283 549
Functional dependencies:
1. EMP_ID → EMP_COUNTRY
2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate keys:
For the first table: EMP_ID
For the second table: EMP_DEPT
For the third table: {EMP_ID, EMP_DEPT}
Now, this is in BCNF because the left side of both functional dependencies is a
key.
Example
STUDENT
STU_ID   COURSE     HOBBY
21       Computer   Dancing
21       Math       Singing
34 Chemistry Dancing
74 Biology Cricket
59 Physics Hockey
The given STUDENT table is in 3NF, but COURSE and HOBBY are two independent
entities; there is no relationship between COURSE and HOBBY, so the table contains a
multi-valued dependency. To bring the above table into 4NF, we decompose it into
two tables:
STUDENT_COURSE
STU_ID COURSE
21 Computer
21 Math
34 Chemistry
74 Biology
59 Physics
STUDENT_HOBBY
STU_ID HOBBY
21 Dancing
21 Singing
34 Dancing
74 Cricket
59 Hockey
Example
SUBJECT LECTURER SEMESTER
Suppose we add a new semester, Semester 3, but do not yet know the subject or who
will be teaching it, so we would have to leave LECTURER and SUBJECT as NULL. But all
three columns together act as the primary key, so we cannot leave the other two
columns blank.
So to make the above table into 5NF, we can decompose it into three relations P1, P2
& P3:
P1
SEMESTER SUBJECT
Semester 1 Computer
Semester 1 Math
Semester 1 Chemistry
Semester 2 Math
P2
SUBJECT LECTURER
Computer Anshika
Computer John
Math John
Math Akash
Chemistry Praveen
P3
SEMESTER LECTURER
Semester 1 Anshika
Semester 1 John
Semester 2 Akash
Semester 1 Praveen
DBMS Architecture ?
o The DBMS design depends upon its architecture. The basic client/server architecture is
used to deal with a large number of PCs, web servers, database servers and other
components that are connected with networks.
o The client/server architecture consists of many PCs and workstations which are
connected via a network.
o DBMS architecture depends upon how users are connected to the database to get their
request done.
Types of DBMS Architecture
Database architecture can be seen as single-tier or multi-tier. Logically, database
architecture is of two types: 2-tier architecture and 3-tier architecture.
1-Tier Architecture
o In this architecture, the database is directly available to the user: the user can sit
directly on the DBMS and use it.
o Any changes done here are made directly on the database itself, and it does not
provide a handy tool for end users.
o The 1-tier architecture is used for the development of local applications, where
programmers can communicate directly with the database for a quick response.
2-Tier Architecture
o The 2-tier architecture is the same as basic client-server. In the two-tier architecture,
applications on the client end can directly communicate with the database at the server
side. For this interaction, APIs like ODBC and JDBC are used.
o The user interfaces and application programs run on the client side.
o The server side is responsible for providing functionalities like query processing and
transaction management.
o To communicate with the DBMS, the client-side application establishes a connection
with the server side.
3-Tier Architecture
o The 3-tier architecture contains another layer between the client and the server. In
this architecture, the client cannot communicate directly with the server.
o The application on the client end interacts with an application server, which in turn
communicates with the database system.
o The end user has no idea about the existence of the database beyond the application
server, and the database has no idea about any user beyond the application server.
o The 3-tier architecture is used in the case of large web applications.
Data may be stored in several places in two ways using distributed data storage:
1. Replication - With this approach, the entire relation is stored redundantly at
two or more sites. If the entire database is available at every site, it is a fully
redundant database. Replication means the system maintains several copies of the
data.
2. Fragmentation - With this approach, the relation is divided into smaller fragments,
and each fragment is stored at the sites where it is needed. The fragments must be
defined in a way that allows the original relation to be reconstructed from them, so
that there is no data loss. As fragmentation does not create duplicate data,
consistency is not a concern.
A distributed DBMS should provide:
o Location independence
o Distributed query processing
o Distributed transaction management
o Hardware independence
o Operating system independence
o Network independence
o Transaction transparency
o DBMS independence
Performance measures
There are two main measures of performance of a database system, which are
explained below −
• Throughput − The number of tasks that can be completed in a given
time interval. A system that processes a large number of small
transactions can improve throughput by processing many transactions
in parallel.
• Response time − The amount of time it takes to complete a single task
from the time it is submitted. A system that processes large transactions
can improve response time, as well as throughput by performing
subtasks of each transaction in parallel.
Benefits of parallel Database
The benefits of the parallel database are explained below −
Speed
Speed is the main advantage of parallel databases. The server breaks up a request
for a user database into parts and sends each part to a separate computer.
Each computer works on its part of the request, and the outputs are combined and
returned to the client. This speeds up most requests for data, so large databases can
be accessed more quickly.
Capacity
As more users request access to the database, network administrators add more
machines to the parallel server, increasing its overall capacity.
For example, a parallel database enables a large online store to serve information to
thousands of users at the same time. With a single server, this level of performance
is not feasible.
Reliability
Despite the failure of any computer in the cluster, a properly configured parallel
database will continue to work. The database server senses that there is no response
from a single computer and redirects its function to the other computers.
Many companies, such as online retailers, want their database to be accessible as
fast as possible. This is where a parallel database stands out.
This method also helps in conducting scheduled maintenance one computer at a
time: technicians send the server a command to take the affected machine offline,
then perform the required maintenance and updates.
UNIT - 4
DBMS Concurrency Control ?
Concurrency Control is the management procedure that is required for controlling
concurrent execution of the operations that take place on a database.
But before knowing about concurrency control, we should know about concurrent
execution.
For example:
Consider the below diagram where two transactions TX and TY, are performed on
the same account A where the balance of account A is $300.
o At time t1, transaction TX reads the value of account A, i.e., $300 (only read).
o At time t2, transaction TX deducts $50 from account A that becomes $250 (only
deducted and not updated/write).
o Alternately, at time t3, transaction TY reads the value of account A that will be $300
only because TX didn't update the value yet.
o At time t4, transaction TY adds $100 to account A that becomes $400 (only added but
not updated/write).
o At time t6, transaction TX writes the value of account A that will be updated as $250
only, as TY didn't update the value yet.
o Similarly, at time t7, transaction TY writes the values of account A, so it will write as
done at time t4 that will be $400. It means the value written by TX is lost, i.e., $250 is
lost.
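The schedule above can be replayed deterministically in a few lines; the variable names are illustrative:

```python
# TX and TY both read A = 300, each computes a new value from its stale local
# copy, and TY's final write overwrites TX's -- the classic lost update.
account_A = 300

tx_local = account_A        # t1: TX reads A (300)
tx_local -= 50              # t2: TX deducts $50 locally (250), not yet written
ty_local = account_A        # t3: TY reads A, still 300 (TX has not written yet)
ty_local += 100             # t4: TY adds $100 locally (400), not yet written
account_A = tx_local        # t6: TX writes 250
account_A = ty_local        # t7: TY writes 400 -- TX's update ($250) is lost

print(account_A)  # 400, not the correct 350
```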
For example:
o At time t1, transaction TX reads the value from account A, i.e., $300.
o At time t2, transaction TY reads the value from account A, i.e., $300.
o At time t3, transaction TY updates the value of account A by adding $100 to the
available balance, and then it becomes $400.
o At time t4, transaction TY writes the updated value, i.e., $400.
o After that, at time t5, transaction TX reads the available value of account A, and that will
be read as $400.
o It means that within the same transaction TX, two different values of account A are
read: $300 initially and, after the update made by transaction TY, $400. This is an
unrepeatable read and is therefore known as the Unrepeatable Read problem.
Thus, in order to maintain consistency in the database and avoid such problems in
concurrent execution, management is needed, and that is where the concept of
concurrency control comes into play.
We will understand and discuss each protocol one by one in our next sections.
Property of Transaction
1. Atomicity
2. Consistency
3. Isolation
4. Durability
Atomicity
o It states that either all operations of the transaction take place at once, or the
transaction is aborted.
o There is no midway, i.e., the transaction cannot occur partially. Each transaction is
treated as one unit and either run to completion or is not executed at all.
Abort: If a transaction aborts then all the changes made are not visible.
Example: Let's assume the following transaction T consists of T1 and T2. Account A
holds Rs 600 and account B holds Rs 300, and T transfers Rs 100 from account A to
account B.
T1              T2
Read(A)         Read(B)
A := A - 100    B := B + 100
Write(A)        Write(B)
If transaction T fails after the completion of T1 but before the completion of T2, then
the amount will be deducted from A but not added to B. This shows an inconsistent
database state. In order to ensure the correctness of the database state, the
transaction must be executed in its entirety.
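A minimal sketch of this all-or-nothing behavior, using Python's sqlite3 with an illustrative Accounts table and a simulated failure between the two updates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO Accounts VALUES ('A', 600), ('B', 300)")
conn.commit()

try:
    conn.execute("UPDATE Accounts SET balance = balance - 100 WHERE name = 'A'")  # T1
    raise RuntimeError("crash between T1 and T2")   # simulated failure before T2
    conn.execute("UPDATE Accounts SET balance = balance + 100 WHERE name = 'B'")  # T2 (never runs)
    conn.commit()
except RuntimeError:
    conn.rollback()   # abort: the partial deduction from A is undone

balances = dict(conn.execute("SELECT name, balance FROM Accounts"))
print(balances)  # {'A': 600, 'B': 300} -- all or nothing
```

Because the transaction was rolled back rather than committed halfway, no money disappeared from A.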
Consistency
o The integrity constraints are maintained so that the database is consistent before and
after the transaction.
o The execution of a transaction will leave a database in either its prior stable state or a
new stable state.
o The consistent property of database states that every transaction sees a consistent
database instance.
o The transaction is used to transform the database from one consistent state to another
consistent state.
For example: The total amount must be the same before and after the transaction, so
that the database remains consistent. If T1 completes but T2 fails, an inconsistency
will occur.
Isolation
o It shows that the data which is used at the time of execution of a transaction cannot
be used by the second transaction until the first one is completed.
o In isolation, if the transaction T1 is being executed and using the data item X, then that
data item can't be accessed by any other transaction T2 until the transaction T1 ends.
o The concurrency control subsystem of the DBMS enforces the isolation property.
Durability
o The durability property guarantees the permanence of the database's consistent
state: once a transaction commits, the changes it made are permanent.
o They cannot be lost by the erroneous operation of a faulty transaction or by the system
failure. When a transaction is completed, then the database reaches a state known as
the consistent state. That consistent state cannot be lost, even in the event of a system's
failure.
o The recovery subsystem of the DBMS is responsible for the durability property.
Recoverability of Schedule ?
o The above Table 1 shows a schedule with two transactions. T1 reads and
writes the value of A, and that value is read and written by T2. T2 commits, but
later T1 fails, so we have to roll back T1. T2 should also be rolled back because
it read a value written by T1, but T2 cannot be rolled back because it has already
committed. This type of schedule is known as an irrecoverable schedule.
o Irrecoverable schedule: A schedule is irrecoverable if Tj reads the updated
value of Ti and Tj commits before Ti commits.
o The above Table 2 shows a schedule with two transactions. Transaction T1 reads
and writes A, and that value is read and written by transaction T2. But later T1
fails, so we have to roll back T1. T2 should be rolled back because it read a value
written by T1. As T2 has not committed before T1 commits, we can roll back
transaction T2 as well. So the schedule is recoverable with cascading rollback.
o Recoverable with cascading rollback: The schedule will be recoverable with
cascading rollback if Tj reads the updated value of Ti. Commit of Tj is delayed
till commit of Ti.
o The above Table 3 shows a schedule with two transactions. Transaction T1 reads
and writes A and commits, and only then is that value read and written by T2. So
this is a cascadeless recoverable schedule.
Serializable schedule ?
o Serializability of schedules is used to find non-serial schedules that allow
transactions to execute concurrently without interfering with one another.
o It identifies which schedules are correct when executions of the transaction have
interleaving of their operations.
o A non-serial schedule will be serializable if its result is equal to the result of its
transactions executed serially.
Here,
1. Shared lock:
o It is also known as a read-only lock. Under a shared lock, the data item can only be
read by the transaction.
o It can be shared between transactions, because a transaction holding a shared lock
cannot update the data item.
2. Exclusive lock:
o Under an exclusive lock, the data item can be both read and written by the
transaction.
o This lock is exclusive: multiple transactions cannot hold it on the same data item
simultaneously, so they cannot modify the same data at the same time.
Growing phase: In the growing phase, a new lock on the data item may be acquired
by the transaction, but none can be released.
Shrinking phase: In the shrinking phase, existing locks held by the transaction may be
released, but no new locks can be acquired.
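The two phases can be checked mechanically: a transaction's sequence of lock operations obeys two-phase locking only if no lock is acquired after the first unlock. A minimal sketch (the function name and operation format are illustrative):

```python
def two_phase(lock_ops):
    """Return True if a sequence of ('lock'|'unlock', item) ops obeys 2PL."""
    shrinking = False
    for op, _item in lock_ops:
        if op == 'unlock':
            shrinking = True    # the first release starts the shrinking phase
        elif shrinking:         # acquiring a lock after any unlock violates 2PL
            return False
    return True

ok  = [('lock', 'A'), ('lock', 'B'), ('unlock', 'A'), ('unlock', 'B')]
bad = [('lock', 'A'), ('unlock', 'A'), ('lock', 'B'), ('unlock', 'B')]
print(two_phase(ok), two_phase(bad))  # True False
```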
If lock conversion is allowed, then upgrading a lock (from shared to exclusive) may
happen only in the growing phase, and downgrading a lock (from exclusive to shared)
only in the shrinking phase.
Example:
The following way shows how unlocking and locking work with 2-PL.
Transaction T1:
Transaction T2:
Deadlock in DBMS ?
A deadlock is a condition where two or more transactions are waiting indefinitely for
one another to give up locks. Deadlock is said to be one of the most feared
complications in a DBMS, as no task ever finishes and all remain in a waiting state forever.
For example: In the student table, transaction T1 holds a lock on some rows and needs
to update some rows in the grade table. Simultaneously, transaction T2 holds locks on
some rows in the grade table and needs to update the rows in the Student table held
by Transaction T1.
Now, the main problem arises. Now Transaction T1 is waiting for T2 to release its lock
and similarly, transaction T2 is waiting for T1 to release its lock. All activities come to a
halt state and remain at a standstill. It will remain in a standstill until the DBMS detects
the deadlock and aborts one of the transactions.
Deadlock Avoidance
o When a database is stuck in a deadlock, it is better to avoid the deadlock than to
abort or restart transactions afterwards, as that is a waste of time and resources.
o A deadlock avoidance mechanism is used to detect any deadlock situation in advance.
A method like the wait-for graph can be used for detecting deadlock situations, but it
is suitable only for smaller databases. For larger databases, the deadlock prevention
method can be used.
Deadlock Detection
In a database, when a transaction waits indefinitely to obtain a lock, the DBMS
should detect whether the transaction is involved in a deadlock. The lock manager
maintains a wait-for graph to detect a deadlock cycle in the database. The wait-for
graph for the above scenario is shown below:
Deadlock Prevention
o Deadlock prevention method is suitable for a large database. If the resources are
allocated in such a way that deadlock never occurs, then the deadlock can be
prevented.
o The database management system analyzes the operations of a transaction to
determine whether they can create a deadlock situation. If they can, the DBMS never
allows that transaction to be executed.
Wait-Die scheme
In this scheme, if a transaction requests a resource that is already held with a
conflicting lock by another transaction, then the DBMS simply checks the timestamps
of both transactions and allows the older transaction to wait until the resource is
available for execution.
Let's assume there are two transactions Ti and Tj, and let TS(T) be the timestamp of
any transaction T. If Tj holds a lock on some resource and Ti requests that resource,
then the following actions are performed by the DBMS:
1. Check if TS(Ti) < TS(Tj) - If Ti is the older transaction and Tj has held some resource,
then Ti is allowed to wait until the data-item is available for execution. That means if
the older transaction is waiting for a resource which is locked by the younger
transaction, then the older transaction is allowed to wait for resource until it is available.
2. Check if TS(Ti) > TS(Tj) - If Ti is the younger transaction and the resource is held by
the older transaction Tj, then Ti is killed (rolled back) and restarted later with a random
delay but with the same timestamp.
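The decision rule can be sketched as a small function; the function name and return values are illustrative (a smaller timestamp means an older transaction):

```python
def wait_die(ts_ti, ts_tj):
    """Ti requests a resource held by Tj; decide Ti's fate under wait-die."""
    if ts_ti < ts_tj:
        return "wait"   # Ti is older: it is allowed to wait for the resource
    return "die"        # Ti is younger: it is rolled back and restarted later
                        # with the SAME timestamp, so it eventually grows old
                        # enough to wait instead of dying (no starvation)

print(wait_die(ts_ti=5, ts_tj=9))   # wait (older transaction waits)
print(wait_die(ts_ti=9, ts_tj=5))   # die  (younger transaction is killed)
```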
Note that the timestamp-ordering protocol ensures freedom from deadlock, since no
transaction ever waits. However, the schedule may not be recoverable and may not
even be cascade-free.
Recovery Techniques
The recovery techniques of a database are as follows −
• Deferred Update
• Immediate Update
• Checkpoint
Deferred Update Method
In this technique, the database is not physically updated on disk until after a
transaction reaches its commit point. The updates are first stored persistently in
the log and then written to the database.
Before the commit point, the transaction's updates are managed in its local
transaction workspace (buffers). If a transaction fails before reaching the commit
point, it will not have changed the database.
Consequently, there is no need to UNDO. It is, however, necessary to REDO the
operations of a committed transaction from the log, since their effect may not yet
have been recorded in the database.
Caching/Buffering
In this technique, one or more disk pages that include the data items to be updated
are cached into main-memory buffers and updated in memory before being written
back to disk.
A collection of in-memory buffers, called the DBMS cache, is maintained by the
DBMS for holding these buffers. A directory is used to keep track of which database
items are in the buffers.
Log-Based Recovery
o The log is a sequence of records. Log of each transaction is maintained in some stable
storage so that if any failure occurs, then it can be recovered from there.
o If any operation is performed on the database, then it will be recorded in the log.
o But the process of storing the logs should be done before the actual transaction is
applied in the database.
Let's assume there is a transaction to modify the City of a student. The following logs
are written for this transaction (the update record shown is an illustrative
reconstruction; the original values were lost from these notes):
<Tn, Start>
<Tn, City, 'old city', 'new city'>
<Tn, Commit>
1. If the log contains both the record <Ti, Start> and the record <Ti, Commit>, then
transaction Ti needs to be redone.
2. If the log contains the record <Ti, Start> but contains neither <Ti, Commit> nor
<Ti, Abort>, then transaction Ti needs to be undone.
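The two rules above can be sketched as a classifier over a list of log records; the record format ('Start'/'Commit'/'Abort', transaction id) is an illustrative assumption:

```python
def classify(log):
    """Scan a log and decide which transactions to redo and which to undo."""
    started, finished, committed = set(), set(), set()
    for rec, txn in log:
        if rec == 'Start':
            started.add(txn)
        elif rec in ('Commit', 'Abort'):
            finished.add(txn)
            if rec == 'Commit':
                committed.add(txn)
    redo = started & committed    # rule 1: Start and Commit both present
    undo = started - finished     # rule 2: Start present, no Commit/Abort
    return redo, undo

log = [('Start', 'T1'), ('Start', 'T2'), ('Commit', 'T1')]
print(classify(log))  # ({'T1'}, {'T2'}) -- redo T1, undo T2
```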
Shadow paging is one of the techniques used to recover from failure. Recovery
means getting back information that has been lost; shadow paging helps maintain
database consistency in case of failure.
Concept of shadow paging
Now let us see the concept of shadow paging step by step −
• Step 1 − Page is a segment of memory. Page table is an index of pages.
Each table entry points to a page on the disk.
• Step 2 − Two page tables are used during the life of a transaction: the
current page table and the shadow page table. Shadow page table is a
copy of the current page table.
• Step 3 − When a transaction starts, both tables look identical; the
current table is then updated for each write operation.
• Step 4 − The shadow page is never changed during the life of the
transaction.
• Step 5 − When the current transaction is committed, the shadow page
entry becomes a copy of the current page table entry and the disk block
with the old data is released.
• Step 6 − The shadow page table is stored in non-volatile memory. If the
system crash occurs, then the shadow page table is copied to the
current page table.
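The steps above can be sketched with in-memory dictionaries standing in for the two page tables; the page ids and contents are illustrative:

```python
current = {0: 'page0-v1', 1: 'page1-v1'}
shadow  = dict(current)            # steps 2-3: shadow starts as a copy of current

current[1] = 'page1-v2'            # a write updates only the CURRENT table
assert shadow[1] == 'page1-v1'     # step 4: shadow never changes mid-transaction

crashed = True
if crashed:
    current = dict(shadow)         # step 6: after a crash, restore from the shadow
else:
    shadow = dict(current)         # step 5: on commit, the shadow adopts current

print(current[1])  # page1-v1 -- the uncommitted write is discarded after a crash
```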
The shadow paging is represented diagrammatically as follows:-
Advantages
The advantages of shadow paging are as follows −
• There is no overhead of writing log records.
• Recovery is inexpensive, as no undo or redo operations are needed after a crash.
UNIT – 5
Architecture of Parallel Databases |
DBMS ?
A parallel DBMS is a DBMS that runs across multiple processors or CPUs and is
mainly designed to execute query operations in parallel wherever possible. A
parallel DBMS links a number of smaller machines to achieve the same throughput
as expected from a single large machine.
There are mainly three architectural designs for a parallel DBMS. They are as
follows:
1. Shared Memory Architecture
2. Shared Disk Architecture
3. Shared Nothing Architecture
Let’s discuss them one by one:
1. Shared Memory Architecture - In Shared Memory Architecture, multiple CPUs
are attached to an interconnection network and share a single global main
memory and common disk arrays. Note that in this architecture a single copy of a
multi-threaded operating system and a multi-threaded DBMS can support these
multiple CPUs. Shared memory is a tightly coupled architecture in which multiple
CPUs share their memory. It is also known as symmetric multiprocessing (SMP).
This architecture ranges from personal workstations that support a few
microprocessors in parallel up to large RISC-based machines.
Advantages :
1. Communication between the CPUs is efficient, since it takes place through
the shared memory.
Disadvantages :
1. If the number of CPUs increases, the problems of interference and
memory contention also increase.
2. There also exists a scalability problem.
2. Shared Disk Architecture -
In Shared Disk Architecture, multiple CPUs are attached to an interconnection
network, and each CPU has its own private memory, but all of them share access
to a common set of disks. It is also known as a cluster.
Advantages :
1. The interconnection network is no longer a bottleneck, since each CPU has
its own memory.
2. Load-balancing is easier in shared disk architecture.
3. There is better fault tolerance.
3. Shared Nothing Architecture -
Shared Nothing Architecture is a multiple-processor architecture in which each
processor has its own memory and disk storage. Multiple CPUs are attached to an
interconnection network through a node, and no two CPUs can access the same
disk area. In this architecture, no sharing of memory or disk resources is done.
It is also known as Massively Parallel Processing (MPP).
Advantages :
1. It has better scalability, as no sharing of resources is done.
2. Multiple CPUs can be added.
Disadvantages:
1. The cost of communication is higher, as it involves sending data and
software interaction at both ends.
2. The cost of non-local disk access is higher than in shared disk
architectures.
Note that this technology is typically used for very large databases of size
on the order of 10^12 bytes (a terabyte), or for systems that process thousands
of transactions per second.
4. Hierarchical Architecture -
This architecture is a combination of the shared disk, shared memory and shared
nothing architectures. It is scalable due to the availability of more memory
and many processors, but it is costlier than the other architectures.
• Round-robin partitioning –
In Round Robin partitioning, the tuples of a relation are taken in any
order, and the ith tuple is sent to disk number (i % n), so the disks
take turns receiving new rows of data. This technique ensures an even
distribution of tuples across disks and is ideally suited for
applications that wish to read the entire relation sequentially for each
query.
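The round-robin rule above can be shown in a few lines (hypothetical tuples, 3 disks):

```python
# Round-robin partitioning: the i-th tuple goes to disk (i % n).
n_disks = 3
tuples = ["t%d" % i for i in range(10)]

disks = [[] for _ in range(n_disks)]
for i, t in enumerate(tuples):
    disks[i % n_disks].append(t)     # disks take turns receiving rows

print([len(d) for d in disks])       # -> [4, 3, 3]: as even as possible
```

With 10 tuples and 3 disks, no disk holds more than one tuple above its fair share, which is exactly the even-distribution property the text describes.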
• Schema partitioning –
In schema partitioning, different tables within a database are placed on
different disks. See figure 2 below:
figure – 2
2. Intra-query parallelism :
Intra-query parallelism refers to the execution of a single query in parallel
on multiple CPUs, using a shared-nothing parallel architecture technique. It
uses two types of approaches:
• First approach –
In this approach, each CPU can execute the duplicate task against some
data portion.
• Second approach –
In this approach, the task can be divided into different sectors with
each CPU executing a distinct subtask.
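The two approaches can be contrasted with a small sketch, simulated sequentially here (each piece could run on its own CPU); the rows and subtasks are illustrative:

```python
# Hypothetical rows of a Vehicles-like relation: (type, value).
rows = [("car", 5), ("bike", 2), ("car", 7), ("bus", 1)]

# First approach: every CPU runs the SAME task (a filter) on its data portion.
portions = [rows[0:2], rows[2:4]]                         # data split across 2 CPUs
partials = [[r for r in p if r[0] == "car"] for p in portions]
result = [r for part in partials for r in part]           # merge partial results
print(result)                                             # -> [('car', 5), ('car', 7)]

# Second approach: the task is split into DISTINCT subtasks.
count = len(rows)                  # subtask for CPU 1: count the rows
total = sum(v for _, v in rows)    # subtask for CPU 2: sum the values
print(count, total)                # -> 4 15
```

The first approach parallelizes over the data; the second parallelizes over the work itself.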
3. Inter-query parallelism :
In inter-query parallelism, each CPU executes multiple transactions; this is
called parallel transaction processing. A DBMS uses transaction dispatching to
carry out inter-query parallelism. We can also use some other methods, such as
efficient lock management. In this method, each query is run sequentially,
which slows down long-running queries. In such cases, the DBMS must understand
the locks held by the different transactions running on different processes.
Inter-query parallelism on shared disk architecture performs best when the
transactions that execute in parallel do not access the same data. It is also
the easiest form of parallelism in a DBMS, and it gives an increased
transaction throughput.
4. Intra-operation parallelism :
Intra-operation parallelism is a sort of parallelism in which we parallelize the
execution of each individual operation of a task like sorting, joins, projections,
and so on. The level of parallelism is very high in intra-operation parallelism.
This type of parallelism is natural in database systems. Let’s take an SQL query
example:
SELECT * FROM Vehicles ORDER BY Model_Number;
5. Inter-operation parallelism :
When different operations in a query expression are executed in parallel, then it
is called inter-operation parallelism. They are of two types –
• Pipelined parallelism –
In pipeline parallelism, the output row of one operation is consumed
by the second operation even before the first operation has produced
the entire set of rows in its output
• Independent parallelism –
In this parallelism, the operations in query expressions that are not
dependent on each other can be executed in parallel. This parallelism is
very useful in the case of the lower degree of parallelism.
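Pipelined parallelism can be illustrated with generators, which hand each output row to the next operation before the full result set exists (the operations and rows here are illustrative):

```python
# Pipelined parallelism: the second operation consumes rows as the first
# produces them, without waiting for the full output.
def scan(rows):
    for r in rows:
        yield r                      # producer emits one row at a time

def select_expensive(rows, threshold):
    for r in rows:
        if r[1] > threshold:
            yield r                  # consumer handles each row immediately

rows = [("v1", 10), ("v2", 50), ("v3", 30)]
pipeline = select_expensive(scan(rows), 20)
print(list(pipeline))                # -> [('v2', 50), ('v3', 30)]
```

Independent parallelism, by contrast, would run two such pipelines side by side when neither needs the other's output.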
Consider two simple operations: scanning a relation and loading a relation. If
the relation is partitioned across several disks, pages can be read in parallel
while scanning it, and the retrieved tuples can then be merged. More generally,
the idea also applies when retrieving all tuples that meet a selection
condition. If hashing or range partitioning is used, selection queries can be
answered by going to just those processors that contain relevant tuples.
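The last point can be sketched with a range-partitioned relation: a selection only needs to scan the partitions whose range overlaps the predicate. The boundaries and tuples are hypothetical:

```python
# Range partitioning on salary; boundaries split the domain into 3 ranges.
boundaries = [100, 200]              # partition 0: <100, 1: 100-199, 2: >=200

def partition_of(salary):
    for i, b in enumerate(boundaries):
        if salary < b:
            return i
    return len(boundaries)

partitions = [[], [], []]
for name, salary in [("a", 50), ("b", 150), ("c", 250), ("d", 120)]:
    partitions[partition_of(salary)].append((name, salary))

# A selection WHERE salary >= 200 only visits partition 2.
print(partitions[2])                 # -> [('c', 250)]
```

With round-robin partitioning the same selection would have to scan every partition, which is why hash and range partitioning win for selective queries.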
Sorting
A simple idea is to let each CPU sort the part of the relation that is on its local disk and
to then merge these sorted sets of tuples. The degree of parallelism is likely to be
limited by the merging phase.
A better idea is to redistribute all tuples in the relation using range partitioning. For
example, if we want to sort a collection of employee tuples by salary, salary values
range from 10 to 210, and we have 20 processors, we could send all tuples with salary
values in the range 10 to 20 to the first processor, all in the range 20 to 30 to the
second, and so on. (Prior to the redistribution, while tuples are distributed across the
processors, we cannot assume that they are distributed according to salary ranges.)
Each processor then sorts the tuples assigned to it, using some sequential sorting
algorithm. For example, a processor can collect tuples until its memory is full, then sort
these tuples and write out a run, until all incoming tuples have been written to such
sorted runs on the local disk. These runs can then be merged to create the sorted
version of the set of tuples assigned to this processor. The entire sorted relation can
be retrieved by visiting the processors in an order corresponding to the ranges
assigned to them and simply scanning the tuples.
The basic challenge in parallel sorting is to do the range partitioning so that each
processor receives roughly the same number of tuples; otherwise, a processor that
receives a disproportionately large number of tuples to sort becomes a bottleneck and
limits the scalability of the parallel sort. One good approach to range partitioning is to
obtain a sample of the entire relation by taking samples at each processor that initially
contains part of the relation. The (relatively small) sample is sorted and used to identify
ranges with equal numbers of tuples. This set of range values, called a splitting vector,
is then distributed to all processors and used to range partition the entire relation.
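The whole scheme — sample, build a splitting vector, range-partition, sort each part — can be sketched as follows. The "processors" are plain lists and the data is random, purely for illustration:

```python
import random

# Parallel sort by range partitioning with a sampled splitting vector.
random.seed(0)
relation = [random.randint(10, 210) for _ in range(1000)]   # salaries
n_proc = 4

# 1. Sort a small sample and pick values that split it into equal parts:
#    this is the splitting vector.
sample = sorted(random.sample(relation, 100))
splits = [sample[len(sample) * i // n_proc] for i in range(1, n_proc)]

# 2. Range-partition the whole relation using the splitting vector.
parts = [[] for _ in range(n_proc)]
for v in relation:
    i = sum(v >= s for s in splits)     # index of the range v falls into
    parts[i].append(v)

# 3. Each processor sorts its own part; visiting the processors in range
#    order and concatenating yields the globally sorted relation.
result = []
for p in parts:
    result.extend(sorted(p))
print(result == sorted(relation))        # -> True
```

Because the splits come from a sorted sample with equal-sized gaps, each partition receives roughly the same number of tuples, which avoids the bottleneck described above.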
A particularly important application of parallel sorting is sorting the data entries in tree-
structured indexes. Sorting data entries can significantly speed up the process of bulk-
loading an index.
Joins
Suppose that we want to join two relations, say, A and B, on the age attribute. We
assume that they are initially distributed across several disks in some way that is not
useful for the join operation, that is, the initial partitioning is not based on the join
attribute. The basic idea for joining A and B in parallel is to decompose the join into a
collection of k smaller joins. We can decompose the join by partitioning both A and B
into a collection of k logical buckets or partitions. By using the same partitioning
function for both A and B, we ensure that the union of the k smaller joins computes
the join of A and B; this idea is similar to the intuition behind the partitioning phase of a
sequential hash join. Because A and B are initially distributed across several
processors, the partitioning step can itself be done in parallel at these processors. At
each processor, all local tuples are retrieved and hashed into one of k partitions, with
the same hash function used at all sites, of course.
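A minimal sketch of this decomposition, with toy relations (each bucket's local join could run on a different processor):

```python
# Hash-partitioned parallel join on "age", simulated.
A = [("alice", 30), ("bob", 40), ("carol", 30)]   # (name, age)
B = [(30, "hr"), (40, "eng"), (50, "ops")]        # (age, dept)
k = 3

def hash_part(age):
    return hash(age) % k          # same hash function used at all sites

buckets_A = [[] for _ in range(k)]
buckets_B = [[] for _ in range(k)]
for name, age in A:
    buckets_A[hash_part(age)].append((name, age))
for age, dept in B:
    buckets_B[hash_part(age)].append((age, dept))

# The union of the k smaller joins equals the join of A and B.
joined = []
for i in range(k):                # each i is an independent, smaller join
    for name, a_age in buckets_A[i]:
        for b_age, dept in buckets_B[i]:
            if a_age == b_age:
                joined.append((name, a_age, dept))
print(sorted(joined))
# -> [('alice', 30, 'hr'), ('bob', 40, 'eng'), ('carol', 30, 'hr')]
```

Tuples with equal `age` always land in the same bucket, so no matching pair can be missed by joining buckets independently.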
Alternatively, we can partition A and B by dividing the range of the join attribute age
into k disjoint subranges and placing A and B tuples into partitions according to
the subrange to which their age values belong. For example, suppose that we have
10 processors and the join attribute is age, with values from 0 to 100. Assuming uniform
distribution, A and B tuples with 0 ≤ age < 10 go to processor 1, those with
10 ≤ age < 20 go to processor 2, and so on. This approach is likely to be more
susceptible than hash partitioning to data skew (i.e., the number of tuples to be
joined can vary widely across partitions), unless the subranges are carefully
determined.
If range partitioning is used, the algorithm outlined above leads to a parallel version of
a sort-merge join, with the advantage that the output is available in sorted order. If
hash partitioning is used, we obtain a parallel version of a hash join.
Optimizing a single query for parallel execution has received more attention; systems
typically optimize queries without regard to other queries that might be executing at
the same time.
• The result of one operator can be pipelined into another. For example, consider a
left-deep plan in which all the joins use index nested loops. The result of the first (i.e.,
the bottom-most) join provides the outer relation tuples for the next join node. As tuples
are produced by the join, they can be used to probe the inner relation in the second
join. The result of the second join can similarly be pipelined into the next join, and
so on.
• Multiple independent operations can be executed concurrently. For example,
consider a (non-left-deep) plan in which relations A and B are joined, relations C and D
are joined, and the results of these two joins are finally joined. Clearly, the join
of A and B can be executed concurrently with the join of C and D.
An optimizer that seeks to parallelize query evaluation has to consider several issues,
and we will only outline the main points. The cost of executing individual operations in
parallel (e.g., parallel sorting) obviously differs from executing them sequentially, and
the optimizer should estimate operation costs accordingly. Next, the plan that returns
answers quickest may not be the plan with the least cost.
For example, the cost of A ⋈ B plus the cost of C ⋈ D plus the cost of joining their
results may be more than the cost of the cheapest left-deep plan. However, the time
taken is the time for the more expensive of A ⋈ B and C ⋈ D, plus the time to join
their results. This time may be less than the time taken by the cheapest left-deep plan.
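The arithmetic behind this trade-off is simple; with hypothetical unit costs it looks like this:

```python
# Illustrative numbers: total work vs completion time for a bushy plan.
cost_AB, cost_CD, cost_final = 60, 50, 30       # hypothetical join costs
cost_left_deep = 100                             # cheapest left-deep plan

bushy_cost = cost_AB + cost_CD + cost_final      # 140: more total work...
bushy_time = max(cost_AB, cost_CD) + cost_final  # 90: ...but finishes sooner,
                                                 # since the two lower joins
                                                 # run in parallel
print(bushy_cost > cost_left_deep, bushy_time < cost_left_deep)  # -> True True
```

The bushy plan does more work in total (140 > 100) yet completes earlier (90 < 100), which is exactly why a parallelizing optimizer must consider bushy trees.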
This observation suggests that a parallelizing optimizer should not restrict itself to only
left-deep trees and should also consider bushy trees, which significantly enlarge the
space of plans to be considered. Finally, there are a number of parameters, such as
available buffer space and the number of free processors, that will be known only at
run-time.
o Autonomy:
It refers to the distribution of control within the database system and the degree of
autonomy enjoyed by each individual DBMS.
o Heterogeneity:
It refers to the similarity or dissimilarity of the databases, system components, and
data models.
o Client-Server Architecture:
This is a two-level architecture, where clients and servers are the points or levels
at which the main functionality is divided. The server provides various functions,
such as managing transactions, managing the data, processing queries, and
optimization.
o Peer-to-Peer Architecture:
In this architecture, each node or peer acts as both a server and a client, and it
performs its database services in both roles. The peers coordinate their efforts and
share their resources with one another.
In a Homogeneous Database, each site stores the same database. Since each site has
the same database stored, all the data management schemes, the operating system,
and the data structures are the same across all sites. They are, therefore, simple to
handle.
o Replication:
This method involves redundantly storing the full relation at two or more locations.
Since a complete database can be accessed from each site, it becomes a redundant
database. Systems preserve copies of the data as a result of replication.
This has advantages because it makes more data accessible at many locations.
Additionally, query requests can now be handled in parallel.
However, there are some drawbacks as well. Data must be updated frequently: any
changes performed at one site must be recorded at every site where that relation is
stored, in order to avoid inconsistent results. This incurs a lot of overhead.
Additionally, since concurrent access must now be monitored across several sites,
concurrency control becomes far more complicated.
o Fragmentation:
According to this method, the relations are divided (i.e., broken up into smaller
pieces), and each fragment is stored at the various locations where it is needed. To
ensure there is no data loss, the fragments must be created in a way that allows for
the reconstruction of the original relation.
Ways of fragmentation:
o Horizontal Fragmentation:
In Horizontal Fragmentation, the relational table or schema is broken down into
groups of one or more rows, and each group of rows forms one fragment of the
schema. It is also called splitting by rows.
o Vertical Fragmentation:
In this fragmentation, a relational table or schema is divided into some more schemas
of smaller sizes. A common candidate key must be present in each fragment in order
to guarantee a lossless join. This is also called splitting by columns.
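Both kinds of fragmentation, and the lossless reconstruction of the original relation, can be demonstrated on a tiny illustrative relation where "id" plays the common candidate key:

```python
# A small relation as a list of row dictionaries (illustrative data).
rows = [{"id": 1, "name": "a", "city": "X"},
        {"id": 2, "name": "b", "city": "Y"}]

# Horizontal fragmentation: split by rows (here, by city).
# The union of the fragments restores the relation.
frag_x = [r for r in rows if r["city"] == "X"]
frag_y = [r for r in rows if r["city"] == "Y"]
assert sorted(frag_x + frag_y, key=lambda r: r["id"]) == rows

# Vertical fragmentation: split by columns; every fragment keeps the key
# "id", so a join on "id" restores the relation losslessly.
v1 = [{"id": r["id"], "name": r["name"]} for r in rows]
v2 = [{"id": r["id"], "city": r["city"]} for r in rows]
rebuilt = [{**a, **b} for a in v1 for b in v2 if a["id"] == b["id"]]
assert rebuilt == rows
print("lossless")                    # -> lossless
```

Dropping "id" from either vertical fragment would make the join ambiguous, which is why the common candidate key is required.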
Note: Most of the time, a hybrid approach of replication and fragmentation is used.
Distributed Transactions :
If a client transaction calls actions on multiple servers, it is said to be
distributed. Distributed transactions can be structured in two different
ways:
1. Flat transactions
2. Nested transactions
FLAT TRANSACTIONS :
A flat transaction has a single initiating point (Begin) and a single end
point (Commit or Abort). Flat transactions are usually very simple and are
generally used for short activities rather than larger ones.
In a flat transaction, a client makes requests to multiple servers. Transaction
T, for example, is a flat transaction that performs operations on objects in
servers X, Y, and Z.
A flat client transaction completes the previous request before moving on to
the next one. As a result, each transaction visits the server objects in order.
When servers utilize locking, a transaction can only wait for one object at a
time.
Flat Transaction
Nested Transaction
In this section, we will see how the above techniques are implemented in a
distributed database system.
Coordinated Checkpointing :
As the name suggests, coordinated checkpointing synchronises all processes to
write their state to local stable storage at the same time. Coordinated
checkpointing’s key benefit is that the saved state is automatically globally
consistent, preventing cascading rollbacks that could cause a domino effect.
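A minimal sketch of coordinated checkpointing, assuming three hypothetical processes whose states are plain dictionaries (a real system would use a two-phase message protocol to pause and resume them):

```python
# Coordinated checkpointing, simulated: all processes write their state to
# stable storage at the same logical instant, with no updates in between.
processes = {"p1": {"x": 1}, "p2": {"y": 2}, "p3": {"z": 3}}
stable_storage = {}

def take_coordinated_checkpoint():
    # Because no process applies new updates between these writes, the set
    # of saved states is globally consistent by construction.
    for pid, state in processes.items():
        stable_storage[pid] = dict(state)

take_coordinated_checkpoint()
processes["p1"]["x"] = 99            # later updates don't touch the checkpoint

def recover():
    # After a crash, every process rolls back to the SAME checkpoint,
    # so no cascading rollback (domino effect) can occur.
    for pid in processes:
        processes[pid] = dict(stable_storage[pid])

recover()
print(processes["p1"]["x"])          # -> 1
```

Uncoordinated checkpoints, by contrast, can force processes to roll back repeatedly while searching for a mutually consistent set of states.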
Message Logging :
The core principle of message logging is that if the transmission of messages
can be replayed, we can still reach a globally consistent state, but without
having to restore that entire state from stable storage. Instead, any messages
that have been sent since the last checkpoint are simply retransmitted and
handled accordingly.