
DBMS 3 - 5 Unit PM


Set By

Mr. Palampalli Mohan


DBMS 3 UNIT

SQL Definition and Update View?

What is SQL?
SQL is short for Structured Query Language, and it is pronounced as S-Q-L
or sometimes as See-Quell.

This database language is mainly designed for maintaining the data in relational
database management systems. It is a special tool used by data professionals for
handling structured data (data that is stored in the form of tables). It is also designed
for stream processing in relational data stream management systems (RDSMS).

You can easily create and manipulate a database, access and modify table rows
and columns, and so on. This query language became an ANSI standard in 1986
and an ISO standard in 1987.

If you want to get a job in the field of data science, it is the most important query
language to learn. Big enterprises such as Facebook, Instagram, and LinkedIn use SQL
for storing data in the back-end.

The classic query engine allows data professionals and users to handle non-SQL queries.
[Diagram: SQL architecture - not reproduced]

Update View
The SQL UPDATE Query is used to modify the existing records in a table or a view.
It is a Data Manipulation Language Command as it only modifies the data of the
database object.
Since it only interacts with the data of a table or a view, the UPDATE statement needs
to be used cautiously. If the rows to be modified are not filtered beforehand, all the rows
in the table associated with the view will be affected, and the original data will either
be lost or need to be reinserted. Therefore, to filter the records that need to be modified,
you can use a WHERE clause. Using a WHERE clause, you can update either a single
row or multiple rows.
The UPDATE statement takes a lock on each row while modifying it in a
table or view, and once the row is modified, the lock is released. Therefore, it can
make changes to either a single row or multiple rows with a single query.

Update View Statement


A view is a database object that can contain rows (all or selected) from an existing
table. It can be created from one or many tables, depending on the SQL query
provided to create the view. A view in a database can be updated under certain
conditions, which are given below −
• The SELECT clause may not contain the keyword DISTINCT.
• The SELECT clause may not contain summary functions.
• The SELECT clause may not contain set functions.
• The SELECT clause may not contain set operators.
• The SELECT clause may not contain an ORDER BY clause.
• The FROM clause may not contain multiple tables.
• The WHERE clause may not contain subqueries.
• The query may not contain GROUP BY or HAVING.
• Calculated columns may not be updated.
• All NOT NULL columns from the base table must be included in the
view in order for the INSERT query to function.
Syntax
The basic syntax of the UPDATE query with a WHERE clause is as follows −
UPDATE view_name
SET column1 = value1, column2 = value2...., columnN = valueN
WHERE [condition];
You can combine N number of conditions using the AND or the OR operators.

Example
Assume we have created a table named Customers using the CREATE
TABLE statement using the following query −

CREATE TABLE CUSTOMERS(

ID INT NOT NULL,

NAME VARCHAR (20) NOT NULL,

AGE INT NOT NULL,

ADDRESS CHAR (25) ,

SALARY DECIMAL (18, 2),

PRIMARY KEY (ID)

);
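The example stops after creating the table. As a minimal, hedged continuation, a view over CUSTOMERS could be created and then updated through; the view name CUSTOMERS_VIEW and the sample values below are illustrative assumptions, not part of the original example:

-- Create a simple, updatable view over the CUSTOMERS table
CREATE VIEW CUSTOMERS_VIEW AS
SELECT ID, NAME, AGE, ADDRESS, SALARY
FROM CUSTOMERS;

-- Update a single row through the view; the WHERE clause keeps
-- the change from touching every row behind the view
UPDATE CUSTOMERS_VIEW
SET SALARY = 5000.00
WHERE ID = 1;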

Data Security and Data Integrity ?



1. Data Security :
Data security refers to the protection of data from unauthorized users: only
authorized users are allowed to access the data. In a database, the DBA or the head
of a department can access all the data. Some users are only allowed to retrieve
data, whereas others are allowed to retrieve as well as to modify the data.
2. Data Integrity :
Data integrity means that the data contained in the database is both correct and
consistent. For this purpose, the data stored in the database must satisfy certain
rules (integrity constraints). The DBMS provides different ways to implement such
constraints, for example through keys such as the Primary Key, Secondary Key, and
Foreign Key. This improves data integrity in a database.
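Both ideas map onto standard SQL. The sketch below is illustrative only: the ACCOUNTS table and the user report_user are assumptions, and the foreign key reuses the CUSTOMERS table defined earlier in this unit.

-- Data integrity: declarative rules keep the data correct and consistent
CREATE TABLE ACCOUNTS (
    ACC_NO  INT PRIMARY KEY,                         -- primary key rule
    CUST_ID INT NOT NULL REFERENCES CUSTOMERS(ID),   -- foreign key rule
    BALANCE DECIMAL(18, 2) CHECK (BALANCE >= 0)      -- domain rule
);

-- Data security: only authorized users may access or modify the data
GRANT SELECT ON ACCOUNTS TO report_user;     -- retrieval only
REVOKE UPDATE ON ACCOUNTS FROM report_user;  -- no modification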
Difference between Data Security and Data Integrity:

1. Data security refers to the prevention of data corruption through the use of controlled access mechanisms. Data integrity refers to the quality of data, which assures the data is complete and has a whole structure.

2. The motive of data security is the protection of data. The motive of data integrity is the validity of data.

3. The work of data security is to ensure that only the people who should have access to the data are the ones who can access it. The work of data integrity is to check that the data is correct and not corrupt.

4. Data security refers to making sure that data is accessed by its intended users, thus ensuring the privacy and protection of data. Data integrity refers to the structure of the data and how it matches the schema of the database.

5. Some of the popular means of data security are authentication/authorization, masking, and encryption. Some of the means to preserve data integrity are backing up, error detection, designing a suitable user interface, and correcting data.

6. Data security relates to the physical protection of data against accidental or intentional loss, misuse, or destruction. Data integrity relates to the logical protection (correct, complete, and consistent) of data.

7. Data security avoids unauthorized access to data. Data integrity avoids human error when data is entered.

8. Data security can be implemented through user accounts (passwords) and authentication schemes. Data integrity can be implemented through rules such as the Primary Key, Foreign Key, and Relationship.

Relational Database Management System ?

Relational database design (RDD) models information and data into a set of tables
with rows and columns. Each row of a relation/table represents a record, and each
column represents an attribute of the data. The Structured Query Language (SQL) is
used to manipulate relational databases. The design of a relational database is
composed of four stages, in which the data are modeled into a set of related tables.
The stages are −

• Define relations/attributes
• Define primary keys
• Define relationships
• Normalization
Relational databases differ from other databases in their approach to organizing data
and performing transactions. In an RDD, the data are organized into tables and all
types of data access are carried out via controlled transactions. Relational database
design satisfies the ACID (atomicity, consistency, isolation, and durability) properties
required of a database design. Relational database design mandates the use of a
database server in applications for dealing with data management problems.

Relational Database Design Process


Database design is more art than science, as you have to make many decisions.
Databases are usually customized to suit a particular application. No two customized
applications are alike, and hence, no two databases are alike. Guidelines (usually in
terms of what not to do rather than what to do) are provided for making these design
decisions, but the choices ultimately rest with the designer.
Step 1 − Define the Purpose of the Database (Requirement Analysis)

• Gather the requirements and define the objective of your database.


• Drafting out the sample input forms, queries and reports often help.
Step 2 − Gather Data, Organize in tables and Specify the Primary Keys

• Once you have decided on the purpose of the database, gather the data
that needs to be stored in the database. Divide the data into subject-
based tables.
• Choose one column (or a few columns) as the so-called primary key,
which uniquely identifies each of the rows.
Step 3 − Create Relationships among Tables
A database consisting of independent and unrelated tables serves little purpose (you
may consider using a spreadsheet instead). The power of a relational database lies
in the relationship that can be defined between tables. The most crucial aspect in
designing a relational database is to identify the relationships among tables. The
types of relationship include:

• one-to-many
• many-to-many
• one-to-one
One-to-Many
In a "class roster" database, a teacher may teach zero or more classes, while a class
is taught by one (and only one) teacher.

The column teacherID in the child table Classes is known as the foreign key. A foreign key of a
child table is a primary key of a parent table, used to reference the parent table.
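In SQL, this parent-child relationship could be declared roughly as follows; the column lists and types are assumptions added for illustration:

CREATE TABLE Teachers (
    teacherID INT PRIMARY KEY,   -- primary key of the parent table
    name      VARCHAR(50)
);

CREATE TABLE Classes (
    classID   INT PRIMARY KEY,
    title     VARCHAR(50),
    teacherID INT REFERENCES Teachers(teacherID)  -- foreign key: each class has one teacher
);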

Many-to-Many
In a "product sales" database, a customer's order may contain one or more products;
and a product can appear in many orders. In a "bookstore" database, a book is written
by one or more authors; while an author may write zero or more books. This kind of
relationship is known as many-to-many.
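A many-to-many relationship is normally implemented with a third table (a junction table) whose composite primary key pairs the two parents. A rough sketch, with hypothetical table and column names:

CREATE TABLE Orders   (orderID   INT PRIMARY KEY);
CREATE TABLE Products (productID INT PRIMARY KEY);

-- Junction table: one row per (order, product) pair
CREATE TABLE OrderDetails (
    orderID   INT REFERENCES Orders(orderID),
    productID INT REFERENCES Products(productID),
    quantity  INT,
    PRIMARY KEY (orderID, productID)  -- each pair must be unique
);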

One-to-One
In a "product sales" database, a product may have optional supplementary
information such as image, more description and comment. Keeping them inside the
Products table results in many empty spaces (in those records without these optional
data). Furthermore, these large data may degrade the performance of the database.
Column Data Types
You need to choose an appropriate data type for each column. Common data types
include integers, floating-point numbers, strings (or text), date/time, binary, and
collections (such as enumeration and set).
Step 4 − Refine & Normalize the Design
For example:

• adding more columns,
• creating a new table for optional data using a one-to-one relationship,
• splitting a large table into two smaller tables,
• other methods.

Functional Dependency ?
The functional dependency is a relationship that exists between two attributes. It
typically exists between the primary key and non-key attribute within a table.

X → Y

The left side of an FD is known as the determinant, and the right side is known
as the dependent.

For example:

Assume we have an employee table with attributes: Emp_Id, Emp_Name, Emp_Address.

Here, the Emp_Id attribute can uniquely identify the Emp_Name attribute of the employee
table, because if we know the Emp_Id, we can tell the employee name associated with
it.

Functional dependency can be written as:

Emp_Id → Emp_Name

We can say that Emp_Name is functionally dependent on Emp_Id.


Types of Functional dependency

1. Trivial functional dependency


o A → B has trivial functional dependency if B is a subset of A.
o The following dependencies are also trivial: A → A, B → B

Example:

Consider a table with two columns, Employee_Id and Employee_Name.

{Employee_Id, Employee_Name} → Employee_Id is a trivial functional dependency,
as Employee_Id is a subset of {Employee_Id, Employee_Name}.
Also, Employee_Id → Employee_Id and Employee_Name → Employee_Name are trivial
dependencies too.
2. Non-trivial functional dependency
o A → B has a non-trivial functional dependency if B is not a subset of A.
o When A ∩ B is empty, then A → B is called completely non-trivial.

Example:

ID → Name
Name → DOB
Normalization ?
o Normalization is the process of organizing the data in the database.
o Normalization is used to minimize the redundancy from a relation or set of relations.
It is also used to eliminate undesirable characteristics like Insertion, Update, and
Deletion Anomalies.
o Normalization divides the larger table into smaller tables and links them using relationships.
o The normal form is used to reduce redundancy from the database table.

Why do we need Normalization?


The main reason for normalizing the relations is to remove these anomalies. Failure to
eliminate anomalies leads to data redundancy and can cause data integrity and other
problems as the database grows. Normalization consists of a series of guidelines that
help guide you in creating a good database structure.

Data modification anomalies can be categorized into three types:

o Insertion Anomaly: An insertion anomaly occurs when one cannot insert a new tuple
into a relation due to lack of data.
o Deletion Anomaly: A deletion anomaly refers to a situation where the deletion of
some data results in the unintended loss of other important data.
o Update Anomaly: An update anomaly occurs when an update of a single data value
requires multiple rows of data to be updated.

Types of Normal Forms:


Normalization works through a series of stages called normal forms. The normal forms
apply to individual relations. A relation is said to be in a particular normal form if it
satisfies the corresponding constraints.

Following are the various types of Normal forms:



Normal Form   Description

1NF    A relation is in 1NF if it contains only atomic values.

2NF    A relation is in 2NF if it is in 1NF and all non-key attributes are fully functionally dependent on the primary key.

3NF    A relation is in 3NF if it is in 2NF and no transitive dependency exists.

BCNF   A stronger definition of 3NF is known as Boyce-Codd normal form.

4NF    A relation is in 4NF if it is in Boyce-Codd normal form and has no multi-valued dependency.

5NF    A relation is in 5NF if it is in 4NF and does not contain any join dependency; the joining should be lossless.

Advantages of Normalization
o Normalization helps to minimize data redundancy.
o Greater overall database organization.
o Data consistency within the database.
o Much more flexible database design.
o Enforces the concept of relational integrity.
Disadvantages of Normalization
o You cannot start building the database before knowing what the user needs.
o The performance degrades when normalizing the relations to higher normal forms, i.e.,
4NF, 5NF.
o It is very time-consuming and difficult to normalize relations of a higher degree.
o Careless decomposition may lead to a bad database design, leading to serious
problems.

First Normal Form (1NF)


o A relation is in 1NF if it contains only atomic values.
o It states that an attribute of a table cannot hold multiple values; it must hold
only single (atomic) values.
o First normal form disallows multi-valued attributes, composite attributes, and
their combinations.

Example: Relation EMPLOYEE is not in 1NF because of the multi-valued attribute
EMP_PHONE.

EMPLOYEE table:

EMP_ID   EMP_NAME   EMP_PHONE                 EMP_STATE

14       John       7272826385, 9064738238    UP

20       Harry      8574783832                Bihar

12       Sam        7390372389, 8589830302    Punjab

The decomposition of the EMPLOYEE table into 1NF has been shown below:


EMP_ID   EMP_NAME   EMP_PHONE    EMP_STATE


14 John 7272826385 UP

14 John 9064738238 UP

20 Harry 8574783832 Bihar

12 Sam 7390372389 Punjab

12 Sam 8589830302 Punjab

Second Normal Form (2NF)


o For a relation to be in 2NF, it must be in 1NF.
o In the second normal form, all non-key attributes must be fully functionally dependent
on the primary key.

Example: Let's assume, a school can store the data of teachers and the subjects they
teach. In a school, a teacher can teach more than one subject.

TEACHER table

TEACHER_ID SUBJECT TEACHER_AGE

25 Chemistry 30

25 Biology 30

47 English 35

83 Math 38

83 Computer 38

In the given table, the non-prime attribute TEACHER_AGE depends on TEACHER_ID alone,
which is a proper subset of the candidate key {TEACHER_ID, SUBJECT}. That's why it
violates the rule for 2NF.


To convert the given table into 2NF, we decompose it into two tables:

TEACHER_DETAIL table:
TEACHER_ID TEACHER_AGE

25 30

47 35

83 38

TEACHER_SUBJECT table:

TEACHER_ID SUBJECT

25 Chemistry

25 Biology

47 English

83 Math

83 Computer
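The same decomposition can be sketched in SQL; the column types are assumptions:

CREATE TABLE TEACHER_DETAIL (
    TEACHER_ID  INT PRIMARY KEY,  -- TEACHER_AGE now depends on the whole key
    TEACHER_AGE INT
);

CREATE TABLE TEACHER_SUBJECT (
    TEACHER_ID INT REFERENCES TEACHER_DETAIL(TEACHER_ID),
    SUBJECT    VARCHAR(30),
    PRIMARY KEY (TEACHER_ID, SUBJECT)  -- the full candidate key
);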

Third Normal Form (3NF)


o A relation will be in 3NF if it is in 2NF and does not contain any transitive
dependency.
o 3NF is used to reduce data duplication. It is also used to achieve data
integrity.
o If there is no transitive dependency for non-prime attributes, then the relation
must be in third normal form.

A relation is in third normal form if it holds at least one of the following conditions for
every non-trivial functional dependency X → Y.

1. X is a super key.
2. Y is a prime attribute, i.e., each element of Y is part of some candidate key.

Example:
EMPLOYEE_DETAIL table:

EMP_ID   EMP_NAME    EMP_ZIP   EMP_STATE   EMP_CITY

222      Harry       201010    UP          Noida

333      Stephan     02228     US          Boston

444      Lan         60007     US          Chicago

555      Katharine   06389     UK          Norwich

666      John        462007    MP          Bhopal

Super keys in the table above:

{EMP_ID}, {EMP_ID, EMP_NAME}, {EMP_ID, EMP_NAME, EMP_ZIP}, and so on.

Candidate key: {EMP_ID}

Non-prime attributes: In the given table, all attributes except EMP_ID are non-prime.

Here, EMP_STATE & EMP_CITY dependent on EMP_ZIP and EMP_ZIP dependent


on EMP_ID. The non-prime attributes (EMP_STATE, EMP_CITY) transitively
dependent on super key(EMP_ID). It violates the rule of third normal form.

That's why we need to move EMP_CITY and EMP_STATE to a new
EMPLOYEE_ZIP table, with EMP_ZIP as its primary key.

EMPLOYEE table:

EMP_ID EMP_NAME EMP_ZIP

222 Harry 201010

333 Stephan 02228

444 Lan 60007


555 Katharine 06389

666 John 462007

EMPLOYEE_ZIP table:

EMP_ZIP EMP_STATE EMP_CITY

201010 UP Noida

02228 US Boston

60007 US Chicago

06389 UK Norwich

462007 MP Bhopal
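Expressed as DDL, the 3NF decomposition might look like the following sketch (column types are assumptions):

CREATE TABLE EMPLOYEE_ZIP (
    EMP_ZIP   VARCHAR(10) PRIMARY KEY,  -- determinant of STATE and CITY
    EMP_STATE VARCHAR(10),
    EMP_CITY  VARCHAR(30)
);

CREATE TABLE EMPLOYEE (
    EMP_ID   INT PRIMARY KEY,
    EMP_NAME VARCHAR(30),
    EMP_ZIP  VARCHAR(10) REFERENCES EMPLOYEE_ZIP(EMP_ZIP)  -- no transitive dependency remains
);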

Boyce Codd normal form (BCNF)


o BCNF is an advanced version of 3NF; it is stricter than 3NF.
o A table is in BCNF if, for every functional dependency X → Y, X is a super key of
the table.
o For BCNF, the table should be in 3NF, and for every FD, the LHS must be a super key.

Example: Let's assume there is a company where employees work in more than one
department.

EMPLOYEE table:

EMP_ID   EMP_COUNTRY   EMP_DEPT     DEPT_TYPE   EMP_DEPT_NO

264      India         Designing    D394        283

264      India         Testing      D394        300

364      UK            Stores       D283        232

364      UK            Developing   D283        549


In the above table, the functional dependencies are as follows:


EMP_ID → EMP_COUNTRY
EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}

Candidate key: {EMP_ID, EMP_DEPT}

The table is not in BCNF because neither EMP_DEPT nor EMP_ID alone is a super key.

To convert the given table into BCNF, we decompose it into three tables:

EMP_COUNTRY table:

EMP_ID   EMP_COUNTRY

264      India

364      UK

EMP_DEPT table:

EMP_DEPT DEPT_TYPE EMP_DEPT_NO

Designing D394 283

Testing D394 300

Stores D283 232

Developing D283 549

EMP_DEPT_MAPPING table:

EMP_ID   EMP_DEPT

264      Designing

264      Testing

364      Stores

364      Developing

Functional dependencies:

1. EMP_ID → EMP_COUNTRY
2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}

Candidate keys:

For the first table: EMP_ID
For the second table: EMP_DEPT
For the third table: {EMP_ID, EMP_DEPT}

Now, this is in BCNF because the left side of both functional dependencies is a
key.

Fourth normal form (4NF)


o A relation will be in 4NF if it is in Boyce-Codd normal form and has no multi-valued
dependency.
o For a dependency A → B, if for a single value of A multiple values of B exist, then the
relation has a multi-valued dependency.

Example
STUDENT

STU_ID COURSE HOBBY

21 Computer Dancing

21 Math Singing
34 Chemistry Dancing

74 Biology Cricket

59 Physics Hockey

The given STUDENT table is in 3NF, but COURSE and HOBBY are two independent
entities. Hence, there is no relationship between COURSE and HOBBY.

In the STUDENT relation, the student with STU_ID 21 has two
courses, Computer and Math, and two hobbies, Dancing and Singing. So there is a
multi-valued dependency on STU_ID, which leads to unnecessary repetition of data.


So to make the above table into 4NF, we can decompose it into two tables:

STUDENT_COURSE

STU_ID COURSE

21 Computer

21 Math

34 Chemistry

74 Biology

59 Physics

STUDENT_HOBBY

STU_ID HOBBY
21 Dancing

21 Singing

34 Dancing

74 Cricket

59 Hockey

Fifth normal form (5NF)


o A relation is in 5NF if it is in 4NF, does not contain any join dependency, and all
joining is lossless.
o 5NF is satisfied when all the tables are broken into as many tables as possible in order
to avoid redundancy.
o 5NF is also known as Project-Join Normal Form (PJ/NF).

Example
SUBJECT LECTURER SEMESTER

Computer Anshika Semester 1

Computer John Semester 1

Math John Semester 1

Math Akash Semester 2

Chemistry Praveen Semester 1


In the above table, John takes both Computer and Math classes for Semester 1 but
does not take the Math class for Semester 2. In this case, a combination of all these
fields is required to identify a valid record.

Suppose we add a new semester, Semester 3, but do not yet know the subject or who
will be teaching it, so we would have to leave Lecturer and Subject as NULL. But all
three columns together act as the primary key, so we cannot leave the other two
columns blank.

So to make the above table into 5NF, we can decompose it into three relations P1, P2
& P3:


P1

SEMESTER SUBJECT

Semester 1 Computer

Semester 1 Math

Semester 1 Chemistry

Semester 2 Math

P2

SUBJECT LECTURER

Computer Anshika

Computer John

Math John
Math Akash

Chemistry Praveen

P3

SEMESTER     LECTURER

Semester 1   Anshika

Semester 1   John

Semester 2   Akash

Semester 1   Praveen

DBMS Architecture ?
o The design of a DBMS depends upon its architecture. The basic client/server architecture is
used to deal with a large number of PCs, web servers, database servers and other
components that are connected over networks.
o The client/server architecture consists of many PCs and workstations that are
connected via the network.
o DBMS architecture depends upon how users are connected to the database to get their
requests done.
Types of DBMS Architecture

Database architecture can be seen as single-tier or multi-tier. Logically, multi-tier database
architecture is of two types: 2-tier architecture and 3-tier architecture.

1-Tier Architecture
o In this architecture, the database is directly available to the user. It means the user can
directly sit at the DBMS and use it.
o Any changes done here are made directly on the database itself. It doesn't provide
a handy tool for end users.
o The 1-tier architecture is used for the development of local applications, where
programmers can communicate directly with the database for a quick response.

2-Tier Architecture
o The 2-tier architecture is the same as a basic client-server setup. In the two-tier architecture,
applications on the client end can directly communicate with the database at the server
side. For this interaction, APIs like ODBC and JDBC are used.
o The user interfaces and application programs run on the client side.
o The server side is responsible for providing functionalities like query processing and
transaction management.
o To communicate with the DBMS, the client-side application establishes a connection with
the server side.

Fig: 2-tier Architecture

3-Tier Architecture
o The 3-tier architecture contains another layer between the client and the server. In this
architecture, the client can't directly communicate with the server.
o The application on the client end interacts with an application server, which in turn
communicates with the database system.
o The end user has no idea about the existence of the database beyond the application
server, and the database has no idea about any user beyond the application server.
o The 3-tier architecture is used in the case of large web applications.

Fig: 3-tier Architecture

Distributed Database System in DBMS ?


A distributed database is essentially a database that is dispersed across numerous sites,
i.e., on various computers or over a network of computers, and is not restricted to a
single system. A distributed database system is spread across several locations with
distinct physical components. This can be necessary when different people from all
over the world need to access a certain database. It must be handled such that, to
users, it seems to be a single database.

Types:

1. Homogeneous Database: A homogeneous database stores data uniformly across
all locations. All sites utilize the same operating system, database management system,
and data structures. They are therefore simple to manage.



2. Heterogeneous Database: With a heterogeneous distributed database, different
locations may employ different software and schemas, which may cause issues with
queries and transactions. Moreover, one site might not even be aware of the existence
of the other sites. Different machines may use different operating systems and database
applications, and they may even employ different database data models.
Translations are therefore necessary for communication across the various sites.

Data may be stored at several sites in two ways using distributed data storage:

1. Replication - With this strategy, the entire relation is stored redundantly at
two or more sites. If the entire database is available at every site, it is a fully
redundant database. Systems preserve copies of the data as a result of
replication.
2. Fragmentation - In this method, the relations are broken up into smaller fragments
and each fragment is stored at the various locations where it is needed. To ensure there is
no data loss, the fragments must be created in a way that allows for the reconstruction of
the original relation. As fragmentation doesn't result in duplicate data, consistency is
not a concern.

Relationships can be fragmented in one of two ways:


o Separating the relation into groups of tuples by rows results in horizontal
fragmentation, where each tuple is allocated to at least one fragment (see the SQL
sketch after this list).
o Vertical fragmentation, also known as splitting by columns, occurs when a relation's
schema is split up into smaller schemas. A common candidate key must be present in
each fragment in order to guarantee a lossless join.
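As a rough single-node sketch, horizontal fragmentation can be emulated in plain SQL by splitting a relation by rows and reconstructing it with UNION ALL; the REGION column and the table names here are hypothetical:

-- Each fragment holds the rows needed at one site
CREATE TABLE CUSTOMERS_EAST AS
SELECT * FROM CUSTOMERS WHERE REGION = 'EAST';

CREATE TABLE CUSTOMERS_WEST AS
SELECT * FROM CUSTOMERS WHERE REGION = 'WEST';

-- Lossless reconstruction of the original relation
CREATE VIEW CUSTOMERS_ALL AS
SELECT * FROM CUSTOMERS_EAST
UNION ALL
SELECT * FROM CUSTOMERS_WEST;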

Sometimes a strategy that combines fragmentation and replication is employed.

Uses for distributed databases


o The corporate management information system makes use of it.
o Multimedia apps utilize it.
o Used in hotel chains, military command systems, etc.
o The production control system also makes use of it
Characteristics of distributed databases
Distributed databases are logically connected to one another when they are part of a
collection, and they frequently form a single logical database. In a distributed database,
data is physically stored across several sites and is separately handled at each site.

Generally speaking, distributed databases have the following characteristics:

o Location independence
o Distributed query processing
o Distributed transaction management
o Hardware independence
o Operating system independence
o Network independence
o Transaction transparency
o DBMS independence

Types of Distributed Database


o Replicated data creates instances of the data in different parts of the database.
By using replicated data, distributed databases can access identical data locally,
which reduces bandwidth. Replicated data can be divided into two categories:
read-only and writable data.
o In read-only versions of replicated data, only the initial instance can be changed;
all subsequent replications are then updated to match. Writable data can be
modified, but only the initial occurrence is affected.
o Horizontally fragmented data uses primary keys that point to a single database
record. Horizontal fragmentation is typically used when business locations
only want access to the database for their own branch.
o Vertically fragmented data is organized using primary keys that are duplicates of
each other and accessible to each branch of the database. Vertically fragmented data
is used when a company's branch and central location deal with the same accounts
differently.
o Data that has been edited or modified for decision-support databases is referred to as
reorganized data. Reorganized data is generally used when two distinct systems are
handling transactions and decision support. When there are numerous requests,
online transaction processing must be reconfigured, and decision support systems
can be challenging to manage.
o In order to accommodate various departments and circumstances, separate-schema
data separates the database and the software used to access it. Often, there is overlap
between the various databases and separate-schema data.

What is a parallel database and explain how it


works?

A parallel database is one that involves multiple processors working in parallel on
the database to provide its services.
A parallel database system seeks to improve performance through parallelization of
various operations, such as loading data, building indexes, and evaluating queries.
Parallel systems improve processing and I/O speeds by using multiple CPUs and disks
in parallel.

Working of parallel database


Let us discuss how a parallel database works, step by step −
Step 1 − Parallel processing divides a large task into many smaller tasks and
executes the smaller tasks concurrently on several CPUs, completing the work more
quickly.
Step 2 − The driving force behind parallel database systems is the demand of
applications that have to query extremely large databases (of the order of terabytes)
or that have to process a large number of transactions per second.
Step 3 − In parallel processing, many operations are performed simultaneously, as
opposed to serial processing, in which the computational steps are performed
sequentially.
[Diagram: working of a parallel database - not reproduced]

Performance measures
There are two main resources of performance of a database system, which are
explained below −
• Throughput − The number of tasks that can be completed in a given
time interval. A system that processes a large number of small
transactions can improve throughput by processing many transactions
in parallel.
• Response time − The amount of time it takes to complete a single task
from the time it is submitted. A system that processes large transactions
can improve response time, as well as throughput by performing
subtasks of each transaction in parallel.
Benefits of parallel Database
The benefits of the parallel database are explained below −

Speed
Speed is the main advantage of parallel databases. The server breaks up a user's
database request into parts and sends each part to a separate computer.
The computers work on the parts in parallel and combine the outputs, returning them to
the client. This speeds up most requests for data, so that large databases can be
accessed more quickly.

Capacity
As more users request access to the database, network administrators can add more
machines to the parallel server, increasing its overall capacity.
For example, a parallel database enables a large online store to serve thousands of
users accessing information at the same time. With a single server, this level
of performance is not feasible.

Reliability
A properly configured parallel database will continue to work despite the failure of any
single computer in the cluster. The database server senses that there is no response
from one computer and redirects its work to the other computers.
Many companies, such as online retailers, want their database to be accessible as
much of the time as possible. This is where a parallel database stands out.
This method also helps in conducting scheduled maintenance one computer at a time:
the administrator sends a server command to take the affected machine out of service,
then performs the required maintenance and updates.

Benefits for queries


Parallel query processing can benefit the following types of queries −
• Select statements that scan large numbers of pages but output a few
rows only.
• Select statements that include union, order by, or distinct, since these
queries can populate worktables in parallel, and can make use of
parallel sorting.
• Select statements that use merge joins can use parallel processing for
scanning tables and also for sorting and merging.
• Select statements where the reformatting strategy is chosen by the
optimizer, since these can populate worktables in parallel, and can make
use of parallel sorting.
• Create index statements, and the alter table - add constraint clauses
that create indexes, unique and primary keys.

UNIT - 4
DBMS Concurrency Control ?
Concurrency Control is the management procedure that is required for controlling
concurrent execution of the operations that take place on a database.

But before knowing about concurrency control, we should know about concurrent
execution.

Concurrent Execution in DBMS


o In a multi-user system, multiple users can access and use the same database at one
time, which is known as the concurrent execution of the database. It means that the
same database is executed simultaneously on a multi-user system by different users.
o While working on the database transactions, there occurs the requirement of using the
database by multiple users for performing different operations, and in that case,
concurrent execution of the database is performed.
o The simultaneous execution should be performed in an interleaved manner, and no
operation should affect the other executing operations, thus maintaining the
consistency of the database. Concurrent execution of transaction operations therefore
gives rise to several challenging problems that need to be solved.

Problems with Concurrent Execution


In a database transaction, the two main operations are READ and WRITE. These two
operations need to be managed during concurrent execution of transactions, because
if they are not performed in an interleaved, controlled manner, the data may become
inconsistent. The following problems occur with concurrent execution of operations:

Problem 1: Lost Update Problems (W - W Conflict)


The problem occurs when two different database transactions perform the read/write
operations on the same database items in an interleaved manner (i.e., concurrent
execution) that makes the values of the items incorrect hence making the database
inconsistent.

For example:
Consider the schedule below, where two transactions TX and TY are performed on
the same account A, whose balance is $300.

o At time t1, transaction TX reads the value of account A, i.e., $300 (only read).
o At time t2, transaction TX deducts $50 from account A that becomes $250 (only
deducted and not updated/write).
o Alternately, at time t3, transaction TY reads the value of account A that will be $300
only because TX didn't update the value yet.
o At time t4, transaction TY adds $100 to account A that becomes $400 (only added but
not updated/write).
o At time t6, transaction TX writes the value of account A that will be updated as $250
only, as TY didn't update the value yet.
o Similarly, at time t7, transaction TY writes the values of account A, so it will write as
done at time t4 that will be $400. It means the value written by TX is lost, i.e., $250 is
lost.

Hence, the data becomes incorrect and the database is left in an inconsistent state.


Dirty Read Problems (W-R Conflict)
The dirty read problem occurs when one transaction updates an item of the database
and then somehow fails; before the data gets rolled back, the updated database item
is accessed by another transaction. This creates a Write-Read conflict between the
two transactions.

For example:

Consider two transactions TX and TY, shown below, performing read/write
operations on account A, where the available balance in account A is $300:

o At time t1, transaction TX reads the value of account A, i.e., $300.


o At time t2, transaction TX adds $50 to account A that becomes $350.
o At time t3, transaction TX writes the updated value in account A, i.e., $350.
o Then at time t4, transaction TY reads account A that will be read as $350.
o Then at time t5, transaction TX rolls back due to a server problem, and the value changes
back to $300 (as initially).
o But the value of account A that transaction TY read and committed remains $350, which
is the dirty read; this is therefore known as the Dirty Read Problem.
Unrepeatable Read Problem (R-W Conflict)
Also known as the Inconsistent Retrievals Problem, this occurs when, within a single
transaction, two different values are read for the same database item.

For example:

Consider two transactions, TX and TY, performing read/write operations on
account A, which has an available balance of $300. The schedule is shown below:

o At time t1, transaction TX reads the value from account A, i.e., $300.
o At time t2, transaction TY reads the value from account A, i.e., $300.
o At time t3, transaction TY updates the value of account A by adding $100 to the
available balance, and then it becomes $400.
o At time t4, transaction TY writes the updated value, i.e., $400.
o After that, at time t5, transaction TX reads the available value of account A, and that will
be read as $400.
o It means that within the same transaction TX, two different values of account A are
read: $300 initially, and $400 after the update made by transaction TY. This is an
unrepeatable read and is therefore known as the Unrepeatable Read Problem.
Thus, in order to maintain consistency in the database and avoid such problems that
take place in concurrent execution, management is needed, and that is where the
concept of Concurrency Control comes into role.

Concurrency Control Protocols


The concurrency control protocols ensure the atomicity, consistency, isolation,
durability and serializability of the concurrent execution of the database transactions.
Therefore, these protocols are categorized as:

o Lock Based Concurrency Control Protocol


o Time Stamp Concurrency Control Protocol
o Validation Based Concurrency Control Protocol

We will understand and discuss each protocol one by one in our next sections.

Transaction property or ACID PROPERTIES


The transaction has four properties. These are used to maintain consistency in a
database, before and after the transaction.

Property of Transaction
1. Atomicity
2. Consistency
3. Isolation
4. Durability

Atomicity
o It states that all operations of the transaction take place at once; if not, the transaction
is aborted.
o There is no midway, i.e., the transaction cannot occur partially. Each transaction is
treated as one unit and either run to completion or is not executed at all.

Atomicity involves the following two operations:

Abort: If a transaction aborts, then the changes made are not visible.

Commit: If a transaction commits, then the changes made are visible.

Example: Let's assume that the following transaction T consists of T1 and T2. Account A
holds Rs 600 and account B holds Rs 300, and T transfers Rs 100 from account A to
account B.

T1              T2
Read(A)         Read(B)
A := A - 100    B := B + 100
Write(A)        Write(B)

If the transaction T fails after the completion of transaction T1 but before completion
of transaction T2, then the amount will be deducted from A but not added to B. This
shows the inconsistent database state. In order to ensure correctness of database state,
the transaction must be executed in entirety.
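The same transfer can be written as an explicit SQL transaction; the ACCOUNTS table, its columns, and the PostgreSQL-style BEGIN/COMMIT syntax are assumptions for illustration. If a failure occurs between the two updates, ROLLBACK (or crash recovery) undoes the first update, so the transfer is all-or-nothing:

BEGIN;  -- start transaction T

UPDATE ACCOUNTS SET BALANCE = BALANCE - 100 WHERE ACC_NAME = 'A';  -- T1
UPDATE ACCOUNTS SET BALANCE = BALANCE + 100 WHERE ACC_NAME = 'B';  -- T2

COMMIT; -- both updates become visible together
-- On any error before COMMIT, issue ROLLBACK instead,
-- and neither update is applied.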

Consistency
o The integrity constraints are maintained so that the database is consistent before and
after the transaction.
o The execution of a transaction will leave a database in either its prior stable state or a
new stable state.
o The consistent property of database states that every transaction sees a consistent
database instance.
o The transaction is used to transform the database from one consistent state to another
consistent state.

For example: The total amount must be the same before and after the transaction.

Total before T occurs = 600 + 300 = 900

Total after T occurs = 500 + 400 = 900

Therefore, the database is consistent. In the case when T1 is completed but T2 fails,
then inconsistency will occur.
Isolation
o It shows that the data which is used at the time of execution of a transaction cannot
be used by the second transaction until the first one is completed.
o In isolation, if the transaction T1 is being executed and using the data item X, then that
data item can't be accessed by any other transaction T2 until the transaction T1 ends.
o The concurrency control subsystem of the DBMS enforces the isolation property.

Durability
o The durability property indicates the permanence of the database's consistent
state. It states that the changes made by a completed transaction are permanent.
o They cannot be lost by the erroneous operation of a faulty transaction or by a system
failure. When a transaction is completed, the database reaches a state known as
the consistent state, and that consistent state cannot be lost, even in the event of a
system failure.
o The recovery subsystem of the DBMS is responsible for the durability property.

Recoverability of Schedule ?

o Sometimes a transaction may not execute completely due to a software issue, a
system crash or a hardware failure. In that case, the failed transaction has to be
rolled back. But some other transaction may also have used a value produced by the
failed transaction, so those transactions have to be rolled back as well.
o [Table 1: schedule not reproduced]

o Table 1 above shows a schedule with two transactions. T1 reads and
writes the value of A, and that value is read and written by T2. T2 commits, but
later T1 fails. Due to the failure, we have to roll back T1. T2 should also be
rolled back because it read the value written by T1, but T2 can't be rolled back
because it has already committed. This type of schedule is known as an irrecoverable
schedule.
o Irrecoverable schedule: The schedule is irrecoverable if Tj reads the
updated value of Ti and Tj commits before Ti commits.
o [Table 2: schedule not reproduced]

o Table 2 above shows a schedule with two transactions. Transaction T1 reads
and writes A, and that value is read and written by transaction T2. But later on,
T1 fails, so we have to roll back T1. T2 should also be rolled back because it read
the value written by T1. As T2 had not committed before T1's failure, we can roll
back transaction T2 as well. So this schedule is recoverable with cascading rollback.
o Recoverable with cascading rollback: The schedule is recoverable with
cascading rollback if Tj reads the updated value of Ti, and the commit of Tj is
delayed until the commit of Ti.
o [Table 3: schedule not reproduced]

o Table 3 above shows a schedule with two transactions. Transaction T1 reads
and writes A and commits, and only then is that value read and written by T2. So
this is a cascadeless recoverable schedule.

Serializable schedule ?
o The serializability of schedules is used to find non-serial schedules that allow
transactions to execute concurrently without interfering with one another.
o It identifies which schedules are correct when the executions of the transactions
have interleaving of their operations.
o A non-serial schedule is serializable if its result is equal to the result of its
transactions executed serially.

Here (the schedules themselves are not reproduced),

Schedule A and Schedule B are serial schedules.

Schedule C and Schedule D are non-serial schedules.


Types of Lock-Based Protocol ?
In this type of protocol, a transaction cannot read or write data until it acquires an
appropriate lock on it. There are two types of locks:

1. Shared lock:

o It is also known as a Read-only lock. Under a shared lock, the data item can only be read
by the transaction.
o It can be shared between transactions because, while a transaction holds a shared lock,
it can't update the data item.

2. Exclusive lock:


o Under an exclusive lock, the data item can be both read and written by the
transaction.
o This lock is exclusive: while it is held, multiple transactions cannot modify the same
data simultaneously.
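In many SQL systems these two lock types surface as row-level locking clauses. A sketch in PostgreSQL-style SQL; the table and column names are assumptions:

-- Session 1: shared (read) lock - other readers are allowed,
-- but writers are blocked until this transaction ends
BEGIN;
SELECT BALANCE FROM ACCOUNTS WHERE ACC_NO = 1 FOR SHARE;
COMMIT;

-- Session 2: exclusive (write) lock - no other transaction may
-- lock or modify the row until COMMIT releases it
BEGIN;
SELECT BALANCE FROM ACCOUNTS WHERE ACC_NO = 1 FOR UPDATE;
UPDATE ACCOUNTS SET BALANCE = BALANCE - 50 WHERE ACC_NO = 1;
COMMIT;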

There are four types of lock protocols available:


1. Simplistic lock protocol
It is the simplest way of locking the data during a transaction. Simplistic lock-based
protocols require every transaction to obtain a lock on the data before inserting,
deleting or updating it, and the data item is unlocked after the transaction completes.

2. Pre-claiming Lock Protocol


o Pre-claiming lock protocols evaluate the transaction to list all the data items on which
it needs locks.
o Before initiating execution of the transaction, it requests the DBMS for locks on
all those data items.
o If all the locks are granted, this protocol allows the transaction to begin. When the
transaction is completed, it releases all the locks.
o If all the locks are not granted, the transaction rolls back
and waits until all the locks are granted.

3. Two-phase locking (2PL)


o The two-phase locking protocol divides the execution of a transaction into
three parts.
o In the first part, when the transaction starts executing, it seeks permission for
the locks it requires.
o In the second part, the transaction acquires all the locks. The third part starts as
soon as the transaction releases its first lock.
o In the third part, the transaction cannot demand any new locks; it only releases the
acquired locks.

There are two phases of 2PL:

Growing phase: In the growing phase, a new lock on the data item may be acquired
by the transaction, but none can be released.

Shrinking phase: In the shrinking phase, existing lock held by the transaction may be
released, but no new locks can be acquired.

In the example below, if lock conversion is allowed, then the following can happen:

Upgrading of a lock (from S(a) to X(a)) is allowed in the growing phase.
Downgrading of a lock (from X(a) to S(a)) must be done in the shrinking phase.

Example:

The following shows how unlocking and locking work with 2-PL (the schedule itself is not reproduced).

Transaction T1:

o Growing phase: from step 1-3


o Shrinking phase: from step 5-7
o Lock point: at 3

Transaction T2:

o Growing phase: from step 2-6


o Shrinking phase: from step 8-9
o Lock point: at 6
4. Strict Two-phase locking (Strict-2PL)
o The first phase of Strict-2PL is similar to 2PL. In the first phase, after acquiring all the
locks, the transaction continues to execute normally.
o The only difference between 2PL and Strict-2PL is that Strict-2PL does not release a lock
immediately after using it.
o Strict-2PL waits until the whole transaction commits, and then it releases all the locks
at once.
o The Strict-2PL protocol therefore does not have a shrinking phase of lock release.

Deadlock in DBMS ?
A deadlock is a condition where two or more transactions are waiting indefinitely for
one another to give up locks. Deadlock is said to be one of the most feared
complications in a DBMS, as no task ever gets finished and is in a waiting state forever.

For example: In the student table, transaction T1 holds a lock on some rows and needs
to update some rows in the grade table. Simultaneously, transaction T2 holds locks on
some rows in the grade table and needs to update the rows in the Student table held
by Transaction T1.

Now the main problem arises: transaction T1 is waiting for T2 to release its lock,
and similarly, transaction T2 is waiting for T1 to release its lock. All activities come to a
halt and remain at a standstill until the DBMS detects the deadlock and aborts one of
the transactions.
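The Student/Grade example corresponds to an interleaving like the sketch below; the column names are hypothetical, and the comments mark what each session does (PostgreSQL-style syntax assumed):

-- Session 1 (transaction T1)
BEGIN;
UPDATE STUDENT SET NAME = 'X' WHERE STU_ID = 1;   -- T1 locks a STUDENT row
-- Meanwhile, session 2 (transaction T2) runs:
--   BEGIN;
--   UPDATE GRADE SET MARK = 90 WHERE STU_ID = 1; -- T2 locks a GRADE row
UPDATE GRADE SET MARK = 80 WHERE STU_ID = 1;      -- T1 now waits for T2's lock
-- T2 then runs:
--   UPDATE STUDENT SET NAME = 'Y' WHERE STU_ID = 1;
-- T2 waits for T1 -> deadlock; the DBMS detects it and aborts one transaction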

Deadlock Avoidance
o When a database may get stuck in a deadlock state, it is better to avoid the deadlock
rather than abort or restart transactions, since that wastes time and resources.
o A deadlock avoidance mechanism is used to detect any deadlock situation in advance. A
method like the "wait-for graph" is used for detecting deadlock situations, but this
method is suitable only for smaller databases. For larger databases, the deadlock
prevention method can be used.

Deadlock Detection
In a database, when a transaction waits indefinitely to obtain a lock, the DBMS
should detect whether the transaction is involved in a deadlock or not. The lock
manager maintains a wait-for graph to detect deadlock cycles in the database.

Wait for Graph


o This is a suitable method for deadlock detection. In this method, a graph is created
based on the transactions and their locks. If the created graph has a cycle or closed loop,
then there is a deadlock.
o The system maintains the wait-for graph for every transaction that is
waiting for data held by others, and keeps checking whether there
is any cycle in the graph.

The wait-for graph for the above scenario is shown below: [graph not reproduced]

Deadlock Prevention
o The deadlock prevention method is suitable for a large database. If the resources are
allocated in such a way that a deadlock never occurs, then the deadlock can be
prevented.
o The database management system analyzes the operations of a transaction to determine
whether they can create a deadlock situation or not. If they can, the DBMS never allows
that transaction to be executed.

Wait-Die scheme
In this scheme, if a transaction requests a resource that is already held with a
conflicting lock by another transaction, the DBMS simply checks the timestamps of
both transactions and allows the older transaction to wait until the resource is
available for execution.
Let's assume there are two transactions Ti and Tj, and let TS(T) be the timestamp of any
transaction T. If Tj holds a lock on some resource and Ti is requesting a resource
held by Tj, then the following actions are performed by the DBMS:

1. Check if TS(Ti) < TS(Tj) - If Ti is the older transaction and Tj has held some resource,
then Ti is allowed to wait until the data item is available for execution. That means if
the older transaction is waiting for a resource which is locked by the younger
transaction, then the older transaction is allowed to wait for the resource until it is available.
2. Check if TS(Ti) < TS(Tj) - If Ti is the older transaction and has held some resource, and Tj
is waiting for it, then Tj is killed and restarted later with a random delay but with the
same timestamp.

Wound wait scheme


o In the wound-wait scheme, if an older transaction requests a resource held by
a younger transaction, the older transaction forces the younger one to be killed
(wounded) and to release the resource. After a small delay, the younger transaction
is restarted, but with the same timestamp.
o If the older transaction holds a resource that is requested by the younger
transaction, then the younger transaction is asked to wait until the older one releases it.

Timestamp based concurrency control ?


o The Timestamp Ordering Protocol is used to order the transactions based on their
timestamps. The order of the transactions is nothing but the ascending order of
their creation.
o The priority of the older transaction is higher, which is why it executes first. To determine
the timestamp of a transaction, this protocol uses system time or a logical counter.
o The lock-based protocol is used to manage the order between conflicting pairs among
transactions at execution time, but timestamp-based protocols start working as
soon as a transaction is created.
o Let's assume there are two transactions T1 and T2. Suppose transaction T1
entered the system at time 007 and transaction T2 entered the system at time 009.
T1 has the higher priority, so it executes first, as it entered the system first.
o The timestamp ordering protocol also maintains the timestamps of the last 'read' and
'write' operations on each data item.
Basic Timestamp ordering protocol works as follows:

1. Check the following condition whenever a transaction Ti issues a Read(X) operation:

o If W_TS(X) > TS(Ti), then the operation is rejected.
o If W_TS(X) <= TS(Ti), then the operation is executed.
o The timestamps of all the data items are updated.

2. Check the following condition whenever a transaction Ti issues a Write(X) operation:

o If TS(Ti) < R_TS(X), then the operation is rejected.
o If TS(Ti) < W_TS(X), then the operation is rejected and Ti is rolled back; otherwise, the
operation is executed.

Where,

TS(Ti) denotes the timestamp of the transaction Ti.

R_TS(X) denotes the Read time-stamp of data-item X.

W_TS(X) denotes the Write time-stamp of data-item X.

Advantages and Disadvantages of TO protocol:


o The TO protocol ensures serializability, since the precedence graph has edges only from older transactions to younger ones and therefore contains no cycles. [Precedence graph not reproduced]

o The TO protocol ensures freedom from deadlock, which means no transaction ever waits.
o But the schedule may not be recoverable and may not even be cascade-free.

What are the Database recovery techniques?

Recovery Techniques
The recovery techniques of a database are as follows −

Log Based Recovery

Logs are sequences of records that are used to keep track of the operations
performed during a transaction. Logs are written before the actual change and stored
on a stable storage medium.
The log based recovery procedure works in three different ways, as follows −

• Deferred Update
• Immediate Update
• Checkpoint

Deferred Update Method
In this technique, the database is not physically updated on disk until after a
transaction reaches its commit point. After that, the updates are stored
persistently in the log and then written to the database.
Before the commit point, the transaction's updates are managed in its local
transaction workspace, such as buffers. If a transaction fails before reaching the
commit point, it will not have changed the database.
Consequently, there is no need to UNDO anything. It may be necessary to REDO
the effects of the operations of a committed transaction from the log, because their
effect may not yet have been recorded in the database.

Immediate Update Method

In this technique, the database may be updated by certain operations of a
transaction before the transaction reaches its commit point. These operations are
reliably recorded in the log on disk before they are applied, that is, written before the
modification.
If a transaction aborts after recording some changes to the database, but before
the commit point, the effect of its operations on the database must be undone.

Caching/Buffering
In this approach, one or more disk pages that include the data items to be updated
are cached into main memory buffers and then updated in memory before being
written back to disk.
A collection of in-memory buffers, called the DBMS cache, is maintained by the
DBMS for holding these buffers. A directory is used to keep track of which database
items are in the buffers.

Log-Based Recovery
o The log is a sequence of records. Log of each transaction is maintained in some stable
storage so that if any failure occurs, then it can be recovered from there.
o If any operation is performed on the database, then it will be recorded in the log.
o But the process of storing the logs should be done before the actual transaction is
applied in the database.

Let's assume there is a transaction to modify the City of a student. The following logs
are written for this transaction.

o When the transaction is initiated, it writes a 'start' log.

<Tn, Start>

o When the transaction modifies the City from 'Noida' to 'Bangalore', another log
is written to the file.

<Tn, City, 'Noida', 'Bangalore'>


o When the transaction is finished, it writes another log record to indicate the end of the transaction.

<Tn, Commit>

There are two approaches to modify the database:

1. Deferred database modification:


o The deferred modification technique occurs if the transaction does not modify the
database until it has committed.
o In this method, all the logs are created and stored in the stable storage, and the
database is updated when a transaction commits.

2. Immediate database modification:


o The immediate modification technique occurs if the database is modified while the transaction is still active.
o In this technique, the database is modified immediately after every operation: each update is first recorded in the log and then applied to the database.

Recovery using Log records


When the system crashes, it consults the log to find which transactions need to be undone and which need to be redone.

1. If the log contains both the record <Ti, Start> and the record <Ti, Commit>, then the transaction Ti needs to be redone.
2. If the log contains the record <Ti, Start> but contains neither <Ti, Commit> nor <Ti, Abort>, then the transaction Ti needs to be undone.
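
A minimal Python sketch of this log scan (the record layout, a tuple whose first two fields are the transaction id and the record type, is an assumption made for this example):

def classify_transactions(log):
    # Scan the log and decide which transactions to redo and which to undo.
    started, finished = set(), {}
    for record in log:
        txn, kind = record[0], record[1]
        if kind == "start":
            started.add(txn)
        elif kind in ("commit", "abort"):
            finished[txn] = kind
    redo = {t for t in started if finished.get(t) == "commit"}
    undo = {t for t in started if t not in finished}
    return redo, undo

# Example: T1 committed before the crash, T2 did not.
log = [("T1", "start"), ("T1", "update", "City", "Noida", "Bangalore"),
       ("T1", "commit"), ("T2", "start")]
print(classify_transactions(log))   # ({'T1'}, {'T2'})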

What is shadow paging in DBMS?

Shadow paging is one of the techniques used to recover from failure. Recovery means getting back information that has been lost; shadow paging helps to maintain database consistency in case of failure.
Concept of shadow paging
Now let us see the concept of shadow paging step by step −
• Step 1 − Page is a segment of memory. Page table is an index of pages.
Each table entry points to a page on the disk.
• Step 2 − Two page tables are used during the life of a transaction: the
current page table and the shadow page table. Shadow page table is a
copy of the current page table.
• Step 3 − When a transaction starts, both the tables look identical, the
current table is updated for each write operation.
• Step 4 − The shadow page is never changed during the life of the
transaction.
• Step 5 − When the current transaction is committed, the shadow page
entry becomes a copy of the current page table entry and the disk block
with the old data is released.
• Step 6 − The shadow page table is stored in non-volatile storage. If a
system crash occurs, the shadow page table is copied to the
current page table.
Diagrammatically, shadow paging is represented by the two page tables (current and shadow), both pointing to the database pages on disk.
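
The copy-on-write idea behind these steps can be sketched in a few lines of Python (lists stand in for the disk and the page tables; all names are hypothetical):

disk = ["page0-v1", "page1-v1", "page2-v1"]   # disk modeled as a list of pages
shadow_table = [0, 1, 2]                      # never changed during the transaction
current_table = list(shadow_table)            # starts as an identical copy

def write_page(page_no, new_data):
    # Copy-on-write: place the new version in a fresh disk slot and
    # repoint only the current page table at it.
    disk.append(new_data)
    current_table[page_no] = len(disk) - 1

def commit():
    # On commit, the current page table becomes the new shadow table.
    global shadow_table
    shadow_table = list(current_table)

def abort():
    # On abort (or after a crash), simply fall back to the shadow table.
    global current_table
    current_table = list(shadow_table)

write_page(1, "page1-v2")
abort()   # the database is unchanged: page 1 still maps to its old version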

Advantages
The advantages of shadow paging are as follows −

• No need for log records.
• No need for an UNDO/REDO algorithm.
• Recovery is faster.
Disadvantages
The disadvantages of shadow paging are as follows −
• Data is fragmented or scattered.
• Garbage collection problem. Database pages containing old versions of
modified data need to be garbage collected after every transaction.
• Concurrent transactions are difficult to execute.

UNIT – 5
Architecture of Parallel Databases | DBMS ?
A parallel DBMS is a DBMS that runs across multiple processors or CPUs and is
mainly designed to execute query operations in parallel, wherever possible. A
parallel DBMS links a number of smaller machines to achieve the same throughput
as expected from a single large machine.
In Parallel Databases, mainly there are three architectural designs for parallel
DBMS. They are as follows:
1. Shared Memory Architecture
2. Shared Disk Architecture
3. Shared Nothing Architecture
Let’s discuss them one by one:
1. Shared Memory Architecture – In Shared Memory Architecture, there are
multiple CPUs that are attached to an interconnection network. They share a
single (global) main memory and common disk arrays. Note that in this
architecture, a single copy of a multi-threaded operating system and a
multi-threaded DBMS can support these multiple CPUs. Shared memory is a
tightly coupled architecture in which multiple CPUs share their memory. It is
also known as Symmetric Multiprocessing (SMP). Systems of this kind range from
personal workstations that support a few microprocessors in parallel up to large
RISC-based multiprocessor systems.

Shared Memory Architecture


Advantages :
1. It has high-speed data access for a limited number of processors.
2. The communication is efficient.
Disadvantages :
1. It cannot scale beyond 80 or 100 CPUs in parallel.
2. The bus or the interconnection network gets blocked as a large
number of CPUs are added.
2. Shared Disk Architectures :
In Shared Disk Architecture, various CPUs are attached to an interconnection
network. In this, each CPU has its own memory and all of them have access to the
same disk. Also, note that here the memory is not shared among CPUs therefore
each node has its own copy of the operating system and DBMS. Shared disk
architecture is a loosely coupled architecture optimized for applications that are
inherently centralized. They are also known as clusters.

Shared Disk Architecture

Advantages :
1. The interconnection network is no longer a bottleneck, since each CPU has its
own memory.
2. Load-balancing is easier in shared disk architecture.
3. There is better fault tolerance.
Disadvantages :
1. If the number of CPUs increases, the problems of interference and
memory contention also increase.
2. There also exists a scalability problem.
3. Shared Nothing Architecture :
Shared Nothing Architecture is a multiple-processor architecture in which each
processor has its own memory and disk storage. In this, multiple CPUs are
attached to an interconnection network through a node. Also, note that no two
CPUs can access the same disk area. In this architecture, no memory or disk
resources are shared. It is also known as Massively Parallel Processing
(MPP).

Shared Nothing Architecture

Advantages :
1. It has better scalability, as no resources are shared.
2. Multiple CPUs can be added.
Disadvantages:
1. The cost of communication is higher, as it involves sending data and
software interaction at both ends.
2. The cost of non-local disk access is higher than in shared disk
architectures.
Note that this technology is typically used for very large databases (on the
order of 10^12 bytes, i.e., a terabyte) or for systems that process thousands of
transactions per second.
4. Hierarchical Architecture :
This architecture is a combination of the shared memory, shared disk, and shared
nothing architectures. It is scalable due to the availability of more
memory and many processors, but it is costlier than the other architectures.

Parallel Query Evaluation in DBMS ?



Parallelism in a query allows us to execute multiple queries in parallel by
decomposing them into parts that can work concurrently. This can be achieved by a
shared-nothing architecture. Parallelism is also used to speed up query
execution as more and more resources, like processors and disks, are
provided. We can achieve parallelism in a query by the following methods :
1. I/O parallelism
2. Intra-query parallelism
3. Inter-query parallelism
4. Intra-operation parallelism
5. Inter-operation parallelism
1. I/O parallelism :
It is a form of parallelism in which the relations are partitioned across multiple
disks, in order to reduce the retrieval time of relations from the disk.
The input data is partitioned, and the processing is then done in
parallel on each partition. The main partitioning strategies, illustrated in the
sketch after this list, are as follows:
• Hash partitioning –
As we already know, a hash function is a fast, deterministic function. Each
row of the original relation is hashed on the partitioning attributes, and the
hash value determines the disk on which the row is placed.
• Range partitioning –
In range partitioning, contiguous attribute-value ranges are assigned to
each disk.

• Round-robin partitioning –
In round-robin partitioning, the relation is scanned in any
order. The ith tuple is sent to disk number (i % n). So, disks take
turns receiving new rows of data. This technique ensures the even
distribution of tuples across disks and is ideally suitable for
applications that wish to read the entire relation sequentially for each
query.

• Schema partitioning –
In schema partitioning, different tables within a database are placed on
different disks. See figure 2 below:

figure – 2
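
As a toy illustration of the first three strategies, here is a minimal Python sketch (the tuple layout, the key position, the range boundaries, and n, the number of disks, are all assumptions made for this example):

n = 4   # number of disks

def hash_partition(row, key_index):
    # The partitioning attribute is hashed; the hash value picks the disk.
    return hash(row[key_index]) % n

def range_partition(row, key_index, boundaries=(100, 200, 300)):
    # Contiguous key ranges are assigned to consecutive disks.
    key = row[key_index]
    for disk, upper in enumerate(boundaries):
        if key < upper:
            return disk
    return len(boundaries)

def round_robin_partition(i):
    # The ith tuple (in scan order) goes to disk number (i % n).
    return i % n

rows = [(1, "a", 150), (2, "b", 250), (3, "c", 50)]
for i, row in enumerate(rows):
    print(hash_partition(row, 2), range_partition(row, 2), round_robin_partition(i))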

2. Intra-query parallelism :
Intra-query parallelism refers to the execution of a single query as a parallel
process on different CPUs, using a shared-nothing parallel architecture.
This uses two types of approaches:
• First approach –
In this approach, each CPU executes a copy of the same task, each
against some portion of the data.
• Second approach –
In this approach, the task is divided into different subtasks, with
each CPU executing a distinct subtask.
3. Inter-query parallelism :
In inter-query parallelism, each CPU executes multiple transactions. This is
called parallel transaction processing. The DBMS uses transaction
dispatching to carry out inter-query parallelism. We can also use some other
methods, like efficient lock management. In this method, each query is run
sequentially, which leads to the slowing down of long-running queries. In such
cases, the DBMS must understand the locks held by different transactions running on
different processes. Inter-query parallelism on shared disk architecture performs
best when the transactions that execute in parallel do not access the same data. Also,
it is the easiest form of parallelism in a DBMS, and it increases transaction
throughput.
4. Intra-operation parallelism :
Intra-operation parallelism is a sort of parallelism in which we parallelize the
execution of each individual operation of a task like sorting, joins, projections,
and so on. The level of parallelism is very high in intra-operation parallelism.
This type of parallelism is natural in database systems. Let’s take an SQL query
example:
SELECT * FROM Vehicles ORDER BY Model_Number;
5. Inter-operation parallelism :
When different operations in a query expression are executed in parallel, then it
is called inter-operation parallelism. They are of two types –
• Pipelined parallelism –
In pipelined parallelism, the output rows of one operation are consumed
by a second operation even before the first operation has produced
its entire set of output rows (see the generator sketch after this list).
• Independent parallelism –
In independent parallelism, the operations in a query expression that do not
depend on one another can be executed in parallel. This form of parallelism is
most useful when the available degree of parallelism is low.
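
Python generators give a compact (single-process) model of pipelined parallelism: each operator consumes rows from its producer one at a time, without waiting for the full intermediate result. A minimal sketch, with hypothetical operators:

def scan(rows):
    for row in rows:
        yield row                    # produce one row at a time

def select_age_over(rows, limit):
    for row in rows:
        if row["age"] > limit:
            yield row                # consumed as soon as it is produced

def project_names(rows):
    for row in rows:
        yield row["name"]

table = [{"name": "A", "age": 25}, {"name": "B", "age": 40}]
# The three operators form a pipeline; no operator materializes its full output.
for name in project_names(select_age_over(scan(table), 30)):
    print(name)                      # prints: B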

Parallelizing Individual Operations

Bulk Loading and Scanning

We first consider two simple operations: scanning a relation and loading a relation.
If the relation is partitioned across several disks, pages can be read in parallel
while scanning it, and the retrieved tuples can then be merged. More generally, the idea also
applies when retrieving all tuples that meet a selection condition. If hashing or range
partitioning is used, selection queries can be answered by going to just those
processors that contain relevant tuples.

Sorting

A simple idea is to let each CPU sort the part of the relation that is on its local disk and
to then merge these sorted sets of tuples. However, the degree of parallelism is likely to be
limited by the merging phase.

A better idea is to redistribute all tuples in the relation using range partitioning. For
example, if we want to sort a collection of employee tuples by salary, salary values
range from 10 to 210, and we have 20 processors, we could send all tuples with salary
values in the range 10 to 20 to the first processor, all tuples in the range 20 to 30 to
the second, and so on. (Prior to the redistribution,
while tuples are distributed across the processors, we cannot assume that they are
distributed according to salary ranges.)

Each processor then sorts the tuples assigned to it, using some sequential sorting
algorithm. For example, a processor can collect tuples until its memory is full, then sort
these tuples and write out a run, until all incoming tuples have been written to such
sorted runs on the local disk. These runs can then be merged to create the sorted
version of the set of tuples assigned to this processor. The entire sorted relation can
be retrieved by visiting the processors in an order corresponding to the ranges
assigned to them and simply scanning the tuples.

The basic challenge in parallel sorting is to do the range partitioning so that each
processor receives roughly the same number of tuples; otherwise, a processor that
receives a disproportionately large number of tuples to sort becomes a bottleneck and
limits the scalability of the parallel sort. One good approach to range partitioning is to
obtain a sample of the entire relation by taking samples at each processor that initially
contains part of the relation. The (relatively small) sample is sorted and used to identify
ranges with equal numbers of tuples. This set of range values, called a splitting vector,
is then distributed to all processors and used to range partition the entire relation.
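
A minimal Python sketch of this scheme, run sequentially for clarity (the per-processor sorts are simulated with a loop; the function names and the sample size are hypothetical):

import random

def build_splitting_vector(relation, num_procs, sample_size=100):
    # Sort a small sample and pick cut points with equal tuple counts.
    sample = sorted(random.sample(relation, min(sample_size, len(relation))))
    step = len(sample) // num_procs
    return [sample[i * step] for i in range(1, num_procs)]

def parallel_sort(relation, num_procs):
    splitters = build_splitting_vector(relation, num_procs)
    partitions = [[] for _ in range(num_procs)]
    for t in relation:                         # range-partition every tuple
        p = sum(1 for s in splitters if t >= s)
        partitions[p].append(t)
    runs = [sorted(part) for part in partitions]   # each "processor" sorts locally
    result = []
    for run in runs:                           # visit processors in range order
        result.extend(run)
    return result

data = [random.randint(10, 210) for _ in range(1000)]
assert parallel_sort(data, 20) == sorted(data)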

A particularly important application of parallel sorting is sorting the data entries in tree-
structured indexes. Sorting data entries can significantly speed up the process of bulk-
loading an index.

Joins

Suppose that we want to join two relations, say, A and B, on the age attribute. We
assume that they are initially distributed across several disks in some way that is not
useful for the join operation, that is, the initial partitioning is not based on the join
attribute. The basic idea for joining A and B in parallel is to decompose the join into a
collection of k smaller joins. We can decompose the join by partitioning both A and B
into a collection of k logical buckets or partitions. By using the same partitioning
function for both A and B, we ensure that the union of the k smaller joins computes
the join of A and B; this idea is similar to the intuition behind the partitioning phase of a
sequential hash join. Because A and B are initially distributed across several
processors, the partitioning step can itself be done in parallel at these processors. At
each processor, all local tuples are retrieved and hashed into one of k partitions, with
the same hash function used at all sites, of course.

Alternatively, we can partition A and B by dividing the range of the join attribute age
into k disjoint subranges and placing A and B tuples into partitions according to
the subrange to which their age values belong. For example, suppose that we have
10 processors, the join attribute is age, with values from 0 to 100. Assuming a uniform
distribution, A and B tuples with 0 ≤ age < 10 go to processor 1, those with 10 ≤ age < 20 go to
processor 2, and so on. This approach is likely to be more susceptible than hash
partitioning to data skew (i.e., the number of tuples to be joined can vary widely across
partitions), unless the subranges are carefully determined.

If range partitioning is used, the algorithm outlined above leads to a parallel version of
a sort-merge join, with the advantage that the output is available in sorted order. If
hash partitioning is used, we obtain a parallel version of a hash join.
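
A minimal Python sketch of the hash-partitioned variant, run sequentially here (the relation layout, tuples keyed on a configurable field, is an assumption made for this example):

def parallel_hash_join(A, B, k, key_a=0, key_b=0):
    # Partition both relations with the same hash function, then join
    # each partition pair independently (conceptually, in parallel).
    parts_a = [[] for _ in range(k)]
    parts_b = [[] for _ in range(k)]
    for t in A:
        parts_a[hash(t[key_a]) % k].append(t)    # same hash function
    for t in B:
        parts_b[hash(t[key_b]) % k].append(t)    # used at every site
    result = []
    for i in range(k):                           # k smaller joins
        index = {}
        for t in parts_a[i]:
            index.setdefault(t[key_a], []).append(t)
        for t in parts_b[i]:
            for match in index.get(t[key_b], []):
                result.append(match + t)
    return result

A = [(25, "emp1"), (30, "emp2")]
B = [(25, "dept-x"), (40, "dept-y")]
print(parallel_hash_join(A, B, k=4))   # [(25, 'emp1', 25, 'dept-x')]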

Parallel Query Optimization ?

Optimizing a single query for parallel execution has received more attention; systems
typically optimize queries without regard to other queries that might be executing at
the same time.

Two kinds of interoperation parallelism can be exploited within a query:

• The result of one operator can be pipelined into another. For example, consider a
left-deep plan in which all the joins use index nested loops. The result of the first (i.e.,
the bottom-most) join serves as the outer relation tuples for the next join node. As tuples
are produced by the first join, they can be used to probe the inner relation in the second
join. The result of the second join can similarly be pipelined into the next join, and
so on.
• Multiple independent operations can be executed concurrently. For example,
consider a (non-left-deep) plan in which relations A and B are joined, relations C and D
are joined, and the results of these two joins are finally joined. Clearly, the join
of A and B can be executed concurrently with the join of C and D.

An optimizer that seeks to parallelize query evaluation has to consider several issues,
and we only outline the main points here. The cost of executing individual operations in
parallel (e.g., parallel sorting) obviously differs from the cost of executing them
sequentially, and the optimizer should estimate operation costs accordingly. Next, the
plan that returns answers quickest may not be the plan with the least cost.

For example, the cost of A ⋈ B plus the cost of C ⋈ D plus the cost of joining their
results may be more than the cost of the cheapest left-deep plan. However, the time
taken is the time for the more expensive of A ⋈ B and C ⋈ D, plus the time to join
their results. This time may be less than the time taken by the cheapest left-deep plan.
This observation suggests that a parallelizing optimizer should not restrict itself to only
left-deep trees and should also consider bushy trees, which significantly enlarge the
space of plans to be considered. Finally, there are a number of parameters, such as
available buffer space and the number of free processors, that will be known only at
run-time.
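
To make the cost-versus-response-time trade-off concrete, here is a small worked sketch in Python with made-up cost numbers (purely illustrative):

# Hypothetical costs, where work units equal time units on one processor.
cost_AB, cost_CD, cost_top = 40, 35, 10    # the three pieces of a bushy plan
cost_left_deep = 70                        # cost of the cheapest left-deep plan

bushy_total_cost = cost_AB + cost_CD + cost_top       # 85: more total work
bushy_response = max(cost_AB, cost_CD) + cost_top     # 50: A ⋈ B and C ⋈ D overlap

print(bushy_total_cost > cost_left_deep)   # True: the bushy plan costs more
print(bushy_response < cost_left_deep)     # True: yet it finishes sooner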

Distributed Database Architecture in DBMS ?
Distributed Database System:
A Distributed Database System is a kind of database that is present or divided across more
than one location, which means it is not limited to a single computer system; it is
spread over a network of systems. A distributed database is
physically present on different systems at different locations. This can be necessary
when different users from all over the world need to access a specific database. For a
user, it should be handled in such a way that it appears to be a single database.

Parameters of Distributed Database Systems:


o Distribution:

It describes how data is physically distributed among the various sites.

o Autonomy:
It reveals the division of power inside the Database System and the degree of
autonomy enjoyed by each individual DBMS.

o Heterogeneity:

It speaks of the similarity or differences between the databases, system parts, and data
models.

Common Architecture Models of Distributed


Database Systems:
o Client-Server Architecture of DDBMS:

This is a two-level architecture in which the main functionality is divided between
clients and servers. The server provides various functions, such as transaction
management, data management, query processing, and query optimization.

o Peer-to-peer Architecture of DDBMS:

In this architecture, each node or peer is considered as a server as well as a client, and
it performs its database services as both (server and client). The peers coordinate their
efforts and share their resources with one another.

o Multi DBMS Architecture of DDBMS:

This is an amalgam of two or more independent Database Systems that functions as a


single integrated Database System.

Types of Distributed Database Systems:


o Homogeneous Database System:

In a homogeneous system, each site stores the same database. Since each site has
the same database stored, all the data management schemes, the operating system,
and the data structures will be the same across all sites. They are, therefore, simple to
handle.

o Heterogeneous Database System:


In this type of Database System, different sites are used to store the data and relational
tables, which makes it difficult for database administrators to do the transactions and
run the queries into the database. Additionally, one site might not even be aware of
the existence of the other sites. Different operating systems and database applications
may be used by various computers. Since each system has its own database model to
store the data, therefore it is required there should be translation schemes to establish
the connections between different sites to transfer the data.

Distributed Data Storage:


There are two methods by which we can store the data on different sites:

o Replication:

This method involves redundantly storing the full relation at two or more locations.
Since a complete copy of the database can be accessed from each site, it becomes a redundant
database. Systems preserve copies of the data as a result of replication.

This has advantages because it makes more data accessible at many locations.
Additionally, query requests can be handled in parallel.

However, there are some drawbacks as well. Data must be updated frequently: any
change performed at one site must be recorded at every site where that relation
is stored, in order to avoid inconsistent results. This introduces a great deal of overhead.
Additionally, since concurrent access must now be monitored across several sites,
concurrency control becomes far more complicated.

o Fragmentation:

According to this method, the relations are divided (i.e., broken up into smaller
fragments), and each fragment is stored at the sites where it is needed. To
ensure there is no data loss, the fragments must be created in a way that allows the
original relation to be reconstructed.

Since Fragmentation doesn't result in duplicate data, consistency is not a concern.

Ways of fragmentation:
o Horizontal Fragmentation:
In horizontal fragmentation, the relational table or schema is broken down into
groups of one or more rows, and each fragment holds a subset of the rows of the schema.
It is also called splitting by rows.

o Vertical Fragmentation:

In this fragmentation, a relational table or schema is divided into smaller schemas.
A common candidate key must be present in each fragment in order
to guarantee a lossless join. This is also called splitting by columns.

Note: Most of the time, a hybrid approach of replication and fragmentation is used.
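
A small Python sketch showing why both kinds of fragmentation are lossless: horizontal fragments are recombined by union, vertical fragments by a join on the shared key (the toy table and all names are hypothetical):

employees = [(1, "A", "Delhi"), (2, "B", "Pune"), (3, "C", "Delhi")]

# Horizontal fragmentation: split by rows (here, by city).
frag_delhi = [r for r in employees if r[2] == "Delhi"]
frag_pune = [r for r in employees if r[2] == "Pune"]
assert sorted(frag_delhi + frag_pune) == sorted(employees)   # union restores it

# Vertical fragmentation: split by columns; each fragment keeps the key (id).
frag_names = [(r[0], r[1]) for r in employees]
frag_cities = [(r[0], r[2]) for r in employees]
rejoined = [(i, name, city) for (i, name) in frag_names
            for (j, city) in frag_cities if i == j]          # join on the key
assert sorted(rejoined) == sorted(employees)                 # lossless join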

Application of Distributed Database Systems:


o Multimedia apps use it.
o The manufacturing control system also makes use of it.
o Another application is by corporate management for the information system.
o It is used in hotel chains, military command systems, etc.

Distributed Query Processing ?

In a distributed database system, processing a query comprises optimization at both
the global and the local level. The query enters the database system at the client or
controlling site. Here, the user is validated, and the query is checked, translated, and
optimized at the global level.
The architecture can be represented as −

Mapping Global Queries into Local Queries


The process of mapping global queries to local ones can be realized as follows −
• The tables required in a global query have fragments distributed across
multiple sites. The local databases have information only about local
data. The controlling site uses the global data dictionary to gather
information about the distribution and reconstructs the global view from
the fragments.
• If there is no replication, the global optimizer runs local queries at the
sites where the fragments are stored. If there is replication, the global
optimizer selects the site based upon communication cost, workload,
and server speed.
• The global optimizer generates a distributed execution plan so that the
least amount of data transfer occurs across the sites. The plan states
the location of the fragments, the order in which the query steps need to be
executed, and the processes involved in transferring intermediate
results.
• The local queries are optimized by the local database servers. Finally,
the local query results are merged together through union operation in
case of horizontal fragments and join operation for vertical fragments.

Distributed Transactions ?

Distributed Transactions :
If a client transaction calls actions on multiple servers, it is said to be
distributed. Distributed transactions can be structured in two different
ways:
1. Flat transactions
2. Nested transactions
FLAT TRANSACTIONS :
A flat transaction has a single initiating point (Begin) and a single end
point (Commit or Abort). Flat transactions are usually very simple and are generally used
for short activities rather than larger ones.
In a flat transaction, a client makes requests to multiple servers. Transaction
T, for example, is a flat transaction that performs operations on objects in
servers X, Y, and Z.
A flat client transaction completes each request before moving on to the
next one. As a result, it visits the server objects in sequential order.
When servers use locking, a transaction can wait for only one object at a time.

Flat Transaction

Limitations of a flat Transaction :


• All work is lost in the event of a crash.
• Only one DBMS may be used at a time.
• No partial rollback is possible.
NESTED TRANSACTIONS :
A transaction that includes other transactions between its initiating point and
its end point is known as a nested transaction. The transactions nested inside a
transaction are called sub-transactions.
The top-level transaction in a nested transaction can open sub-transactions,
and each sub-transaction can open more sub-transactions, down to any
depth of nesting.
A client's transaction T opens up two sub-transactions, T1 and T2, which
access objects on servers X and Y, as shown in the diagram below.
T1.1, T1.2, T2.1, and T2.2, which access objects on the servers M, N, and
P, are opened by the sub-transactions T1 and T2.

Nested Transaction

In the nested transaction strategy, sub-transactions at the same level can
execute concurrently. Here, in the diagram above, T1 and T2 invoke objects on
different servers, so they can run in parallel and are therefore concurrent.
The four sub-transactions T1.1, T1.2, T2.1, and T2.2 can also run in parallel.
Consider a distributed transaction (T) in which a customer transfers :
• Rs. 105 from account A to account C and
• Subsequently, Rs. 205 from account B to account D.
It can be viewed/ thought of as :
Transaction T :
Start
Transfer Rs 105 from A to C :
Deduct Rs 105 from A(withdraw from A) & Add Rs 105 to C(deposit to
C)
Transfer Rs 205 from B to D :
Deduct Rs 205 from B (withdraw from B)& Add Rs 205 to D(deposit to
D)
End
Assuming :
1. Account A is on server X,
2. Account B is on server Y, and
3. Accounts C and D are on server Z.
The transaction T involves four requests – 2 for deposits and 2 for
withdrawals. They can be treated as sub-transactions (T1, T2, T3, T4)
of the transaction T.
Transaction T is thus designed as a set of four nested sub-transactions: T1, T2, T3, and T4.
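
A minimal Python sketch of this structure, using threads to show that the sub-transactions can run concurrently (the balances dictionary and lock are stand-ins for the servers; all names and amounts are hypothetical, following the example above):

import threading

balances = {"A": 500, "B": 500, "C": 0, "D": 0}
lock = threading.Lock()

def sub_transaction(account, amount):
    # Each sub-transaction runs on the server that holds the account.
    with lock:
        balances[account] += amount

# The top-level transaction T opens four sub-transactions:
# two withdrawals (T1, T3) and two deposits (T2, T4).
subs = [threading.Thread(target=sub_transaction, args=a)
        for a in [("A", -105), ("C", 105), ("B", -205), ("D", 205)]]
for t in subs:
    t.start()    # the sub-transactions execute concurrently
for t in subs:
    t.join()     # T commits only after all sub-transactions finish

print(balances)  # {'A': 395, 'B': 295, 'C': 105, 'D': 205}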

Concurrency Control in Distributed Systems ?

In this section, we will see how the above techniques are implemented in a
distributed database system.

Distributed Two-phase Locking Algorithm


The basic principle of distributed two-phase locking is the same as that of the basic two-phase
locking protocol. However, in a distributed system there are sites designated as lock
managers. A lock manager controls lock acquisition requests from transaction
monitors. In order to enforce coordination between the lock managers at the various
sites, at least one site is given the authority to see all transactions and detect lock
conflicts.
Depending upon the number of sites that can detect lock conflicts, distributed two-
phase locking approaches can be of three types −
• Centralized two-phase locking − In this approach, one site is
designated as the central lock manager. All the sites in the environment
know the location of the central lock manager and obtain lock from it
during transactions.
• Primary copy two-phase locking − In this approach, a number of sites
are designated as lock control centers. Each of these sites has the
responsibility of managing a defined set of locks. All the sites know
which lock control center is responsible for managing lock of which data
table/fragment item.
• Distributed two-phase locking − In this approach, there are a number
of lock managers, where each lock manager controls locks of data items
stored at its local site. The location of the lock manager is based upon
data distribution and replication.
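
The three variants differ only in which site's lock manager a transaction must contact for a given data item. A minimal Python sketch (the site names and the item-to-site maps are assumptions made for this example):

HOME_SITE = {"X": "site2", "Y": "site3"}     # the site storing each item
LOCK_CENTER = {"X": "site1", "Y": "site2"}   # primary-copy assignment

def lock_manager_for(item, scheme):
    # Return the site whose lock manager handles this item.
    if scheme == "centralized":
        return "site1"               # the single designated central lock manager
    if scheme == "primary-copy":
        return LOCK_CENTER[item]     # a fixed lock control center per item
    if scheme == "distributed":
        return HOME_SITE[item]       # the local site that stores the item
    raise ValueError(scheme)

for scheme in ("centralized", "primary-copy", "distributed"):
    print(scheme, lock_manager_for("X", scheme))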

Recovery in Distributed Systems ?


Recovery from an error is essential to fault tolerance; an error is the part of a
system's state that may lead to a failure. The whole idea of error recovery is to replace
an erroneous state with an error-free state. Error recovery can be broadly divided
into two categories.
1. Backward Recovery:
The main task in backward recovery is to bring the system from its present
erroneous state back into a previously correct state. To make this possible, the
system's state must be recorded periodically and restored when something goes
wrong. Each time (part of) the system's present state is recorded, a checkpoint
is said to have been made.
2. Forward Recovery:
In this case, when the system has entered an erroneous state, instead of moving it
back to a previous, checkpointed state, an attempt is made to bring the system into a
correct new state from which it can continue to execute. The fundamental issue
with forward error recovery techniques is that potential errors must be
anticipated in advance. Only then is it feasible to correct those errors and
move the system to a new state.
These two kinds of recovery are used to provide fault tolerance in
distributed systems.
Stable Storage :
Stable storage, which can survive anything except major disasters like floods and
earthquakes, is another option. It can be implemented with a pair of ordinary disks,
where each block on drive 2 is an exact copy of the corresponding block on
drive 1. Whenever a block is updated, the block on drive 1 is updated and verified
first; then the same block on drive 2 is updated.
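
A minimal Python sketch of this careful-write discipline over two simulated drives (plain lists stand in for the disks; all names are hypothetical):

drive1 = ["blk0", "blk1"]
drive2 = list(drive1)        # drive 2 mirrors drive 1

def stable_write(block_no, data):
    # Update and verify the block on drive 1 first, then mirror it to drive 2.
    drive1[block_no] = data
    assert drive1[block_no] == data   # verify the first write succeeded
    drive2[block_no] = data

def recover(block_no):
    # After a crash, a mismatch means a write was interrupted between the
    # two drives; the drive-1 copy (written and verified first) wins.
    if drive1[block_no] != drive2[block_no]:
        drive2[block_no] = drive1[block_no]
    return drive1[block_no]

stable_write(1, "blk1-v2")
print(recover(1))   # blk1-v2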
Checkpointing :
In a fault-tolerant distributed system, backward error recovery requires the system
to routinely save its state onto stable storage. In particular, we need to record a
distributed snapshot, also known as a consistent global state. In a distributed
snapshot, if a process P has recorded the receipt of a message, then there should
also be a process Q that has recorded the sending of that message; after all, the
message has to originate somewhere.

Coordinated Checkpointing :
As the name suggests, coordinated checkpointing synchronises all processes to
write their state to local stable storage at the same time. Coordinated
checkpointing’s key benefit is that the saved state is automatically globally
consistent, preventing cascading rollbacks that could cause a domino effect.
Message Logging :
The core principle of message logging is that if the transmission of messages can
be replayed, we can still reach a globally consistent state without having to
restore the entire state from stable storage. Instead, any messages that have been
sent since the last checkpoint are simply retransmitted and handled again.