Advanced Database Systems Lecture Note

Advanced Database Systems

Chapter one
Transactions processing and concurrency
management
What is a transaction?
 A Transaction is a mechanism for applying the desired
modifications/operations to a database. It is evident in real life that the
final database instance after a successful manipulation of the content of
the database is the most up-to-date copy of the database.
 Action, or series of actions, carried out by a single user or application program,
which accesses or changes contents of database. (i.e. Logical unit of work on the
database.)
 A transaction could be a whole program, part/module of a program or a
single command.
 Changes made in real time to a database are called transactions. Examples
include ATM transactions, credit card approvals, flight reservations, hotel
check-in, phone calls, supermarket scanning, academic registration and
billing.
 A transaction could be composed of one or more database and non-
database operations.
 Transforms database from one consistent state to another, although
consistency may be violated during transaction.
 A database transaction is a unit of interaction with database management
system or similar system that is treated in a coherent and reliable way
independent of other transactions.
Transaction processing system
 A system that manages transactions and controls their access to a DBMS is
called a TP monitor. A transaction processing system (TPS) generally consists
of a TP monitor, one or more DBMSs, and a set of application programs
containing transactions.
 In the database field, a transaction is a group of logical operations that must
all succeed or fail as a group. Systems dedicated to supporting such
operations are known as transaction processing systems.

In comparison with a database transaction, an application program is a series of
transactions with non-database processing in between.


Distinction between business transactions and online transactions:


 A business transaction is an interaction in the real world, usually
between an enterprise and a person, where something is
exchanged.
 An online transaction is the execution of a program that performs
an administrative or real-time function, often by accessing shared
data sources, usually on behalf of an online user (although some
transactions are run offline in batch).

What we are interested in is the online transaction, which is the interaction
between the users of a database system and the shared data stored in the
database. This transaction program contains the steps involved in the business
transaction.

Transactions can be started, attempted, then committed or aborted via the data
manipulation commands of SQL.

Any transaction can have one of two outcomes:
 Success - transaction commits and database reaches a new consistent state
 Committed transaction cannot be aborted or rolled back.
 How do you discard a committed transaction?
 Failure - transaction aborts, and database must be restored to consistent
state before it started.
 Such a transaction is rolled back or undone.
 Aborted transaction that is rolled back can be restarted later.
In many data manipulation languages, there are keywords used to indicate
different states of a transaction.

The DBMS has no inherent way of knowing the logical grouping of operations
that are supposed to be done together. Hence, it must provide the users a means
to logically group their operations. Key words such as BEGIN TRANSACTION,
COMMIT and ROLLBACK (or their equivalents) are available in many data
manipulation languages to delimit transactions.

A single transaction might require several queries, each reading and/or writing
information in the database. When this happens it is usually important to be sure
that the database is not left with only some of the queries carried out. For
example, when doing a money transfer, if the money was debited from one
account, it is important that it also be credited to the depositing account. Also,
transactions should not interfere with each other.
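
To make the money-transfer example concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, its columns and the account numbers are illustrative assumptions, not part of the lecture material.

```python
import sqlite3

# Minimal sketch: debit one account and credit another as ONE transaction.
# Assumes a table accounts(acct_no, balance) already exists.
conn = sqlite3.connect("bank.db")
try:
    cur = conn.cursor()
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE acct_no = ?", ("A",))
    cur.execute("UPDATE accounts SET balance = balance + 100 WHERE acct_no = ?", ("B",))
    conn.commit()    # success: both updates become permanent together
except Exception:
    conn.rollback()  # failure: neither update survives
    raise
finally:
    conn.close()
```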


A transaction is expected to exhibit some basic features or properties to be
considered a valid transaction. These features are:

A: Atomicity
C: Consistency
I: Isolation
D: Durability

These are referred to as the ACID properties of a transaction. Without the ACID
properties, the integrity of the database cannot be guaranteed.

Atomicity
It is the "all or none" property.
Every transaction should be considered an atomic process which cannot be
subdivided into smaller tasks. Due to this property, just like an atom which
either exists or does not exist, a transaction has only two states: Done or Never Started.
Done - a transaction must complete successfully and its effect
should be visible in the database.
Never Started - If a transaction fails during execution then all its
modifications must be undone to bring back the database to
the last consistent state, i.e., remove the effect of failed
transaction.
No state between Done and Never Started

Consistency
If the transaction code is correct, then the transaction, at the end of its execution,
must leave the database consistent. A transaction should transform the database
from one consistent state to another consistent state.

Isolation
A transaction must execute without interference from other concurrent
transactions and its intermediate or partial modifications to data must not be
visible to other transactions.

Durability
The effect of a completed transaction must persist in the database, i.e., its updates
must be available to other transactions immediately after the end of its execution,
and it should not be affected by failures occurring after the completion of the
transaction.


In practice, these properties are often relaxed somewhat to provide better
performance.

The transaction subsystem of a DBMS is depicted in the following figure.

Fig. The transaction subsystem of the DBMS

State of a Transaction
A transaction is an atomic operation from the user's perspective. But it consists of a
collection of operations and can pass through a number of states during its execution.

A transaction can end in three possible ways.


1. Successful Termination: when a transaction completes the execution of all
operations in it and reaches the COMMIT command.
2. Suicidal Termination: when the transaction detects an error during its
processing and decides to abort itself before the end of the transaction,
performing a ROLLBACK.
3. Murderous Termination: When the DBMS or the system force the
execution to abort for any reason. And hence, rolled back.


[Figure: transaction state diagram.] After Start, Modify operations change the
database. If no error is detected and the transaction is OK to commit, it Commits
and the database reaches a consistent state at the end of the transaction. If an
error is detected by the transaction itself, or the system detects an error, the
transaction is Aborted and a transaction- or system-initiated Rollback is
performed, leaving the database unmodified in its previous consistent state.

Most SQL statements seem very short and easy to execute. But the reverse is true
if you consider each as a one-command transaction. Actually, a database system
interprets a transaction not as an application program but as a logical sequence
of low-level read and write operations (referred to as primitives).

Ways of Transaction Execution


In a database system many transactions are executed. Basically there are two
ways of executing a set of transactions:
(a) Serially: In a serial execution, transactions are executed strictly one after
the other. Transaction Ti completes and writes its results to the database,
and only then is the next transaction Tj scheduled for execution. This
means that only one transaction is being executed in the system at any one
time, and data is not shared between transactions at any specific time.


In serial transaction execution, one transaction being executed does not interfere
with the execution of any other transaction.
Good things about serial execution:
 Correct execution, i.e., if the input is correct then the output will be
correct.
 Fast execution, since all the resources are available to the active transaction.
The worst thing about serial execution is very inefficient resource utilization, i.e.,
reduced parallelism.

Notations: Read(x): read data item x from the database.
Write(x): write data item x into the database.

Example of a serial execution


Suppose data items X = 10, Y = 6, and N = 1, and T1 and T2 are transactions.
T1 T2
read (X) read (X)
X := X+N X := X+N
write (X) write (X)
read (Y)
Y := Y+N
write (Y)

We execute this transaction serially as follows:


Time  T1                    T2
t1    read (X)   {X = 10}
t2    X := X+N   {X = 11}
t3    write (X)  {X = 11}
t4    read (Y)   {Y = 6}
t5    Y := Y+N   {Y = 7}
t6    write (Y)  {Y = 7}
t7                          read (X)   {X = 11}
t8                          X := X+N   {X = 12}
t9                          write (X)  {X = 12}
Final values of X, Y at the end of T1 and T2: X = 12 and Y = 7.
Thus we can witness that in serial execution of transactions, if we have two transactions
Ti and Ti+1, then Ti+1 will only be executed after the completion of Ti.

(b) Concurrently: the reverse of serial execution; in this scheme the individual
operations of the transactions, i.e., reads and writes, are interleaved in some
order.

Time  T1                    T2
t1    read (X)   {X = 10}
t2                          read (X)   {X = 10}
t3    X := X+N   {X = 11}
t4                          X := X+N   {X = 11}
t5    write (X)  {X = 11}
t6                          write (X)  {X = 11}
t7    read (Y)   {Y = 6}
t8    Y := Y+N   {Y = 7}
t9    write (Y)  {Y = 7}

Final values at the end of T1 and T2: X = 11 and Y = 7. This improves
resource utilization but unfortunately gives an incorrect result.
The correct value of X is 12, but in concurrent execution X = 11, which is
incorrect. The reason for this error is the incorrect sharing of X by T1 and T2.
In serial execution T2 read the value of X written by T1 (i.e., 11), but in
concurrent execution T2 read the same value of X (i.e., 10) as T1 did, and the
update made by T1 was overwritten by T2's update.
This is the reason the final value of X is one less than what is produced by
serial execution.


But, why concurrent execution? Reasons:


 Many systems have an independent component to handle I/O,
like a DMA (Direct Memory Access) module.
 CPU may process other transactions while one is in I/O
operation
 Improves resource utilization
 Increases the throughput of the system.


Problems Associated with Concurrent Transaction Processing
Although two transactions may be correct in themselves, interleaving of their
operations may produce an incorrect result, which calls for control over access.
With concurrent transaction processing one can enhance the throughput of the
system: as reading and writing are performed from and to secondary storage,
the system will not be idle during these operations if there is concurrent
processing.

Every transaction is correct when executed alone, but this would not
guarantee that the interleaving of operations from these transactions will
produce a correct result.

The three potential problems caused by concurrency are:


 Lost Update Problem
 Uncommitted Dependency Problem
 Inconsistent Analysis Problem.

Lost Update Problem


Successfully completed update on a data set by one transaction is
overridden by another transaction/user.
E.g. Account with balance A=100.
 T1 reads the account A
 T1 withdraws 10 from A
 T1 makes the update in the Database
 T2 reads the account A
 T2 adds 100 on A
 T2 makes the update in the Database

 In the above case, if done one after the other (serially) then we have
no problem.
 If the execution is T1 followed by T2 then A=190
 If the execution is T2 followed by T1 then A=190

 But if they start at the same time in the following sequence:


 T1 reads the account A=100
 T1 withdraws 10 making the balance A=90


 T2 reads the account A=100


 T2 adds 100 making A=200
 T1 makes the update in the Database A=90
 T2 makes the update in the Database A=200
 After the successful completion of the operations in this schedule, the
final value of A will be 200, which overrides the update made by the
first transaction that changed the value from 100 to 90.

Uncommitted Dependency Problem


Occurs when one transaction can see intermediate results of another
transaction before it is committed.
E.g.
 T2 increases the balance by 100, making it 200, but then aborts
before the transaction is committed. T1 reads 200, subtracts 10 and
makes it 190. But the actual balance should be 90.

Inconsistent Analysis Problem


Occurs when one transaction reads several values, but a second transaction
updates some of them during the execution, and before the completion, of the
first.
E.g.
 T2 would like to add the values A=10, B=20 and C=30. After
the values are read by T2 and before its completion, T1 updates
the value of B to 50. At the end of the execution of the two
transactions, T2 will come up with a sum of 60 while it
should be 90, since B was updated to 50.

As discussed above, the objective of a concurrency control protocol is to
schedule transactions in such a way as to avoid any interference between
them. This demands a new principle in transaction processing: serializability
of the schedule of execution of multiple transactions.
Related to the inconsistent analysis problem are the phantom read and
fuzzy read problems.


Serializability
In any transaction processing system in which concurrent processing is
implemented, there will be a concept called a schedule, which determines
the execution sequence of the operations of the different transactions.
Schedule: a time-ordered sequence of the important actions taken by
one or more transactions. A schedule represents the order in which
instructions are executed in the system, in chronological order.

The scheduler component of a DBMS must ensure that the individual
steps of different transactions preserve consistency.

1. Serial Schedule: a schedule where the operations of each
transaction are executed consecutively, without any interleaved
operations from other transactions. There is no guarantee that the
results of all serial executions of a given set of transactions will be
identical.

2. Non-serial Schedule: a schedule where operations from a set of
concurrent transactions are interleaved. The objective of
serializability is to find non-serial schedules that allow transactions
to execute concurrently without interfering with one another. In
other words, we want to find non-serial schedules that are equivalent
to some serial schedule. Such a schedule is called serializable.

If a non-serial schedule can come up with a result equivalent to some
serial schedule, we call the schedule serializable; that is, the sequence
of execution of the operations will produce a value identical to that of
a serially executed transaction, which is always correct.

Thus serializability identifies those concurrent executions of
transactions that are guaranteed to ensure consistency.

Serialization
 Objective of serialization is to find schedules that allow transactions
to execute concurrently without interfering with one another.
 If two transactions only read data, order is not important.


 If two transactions either read or write completely separate data
items, they do not conflict and order is not important.
 If one transaction writes a data item and another reads or writes the
same data item, order of execution is important
 Possible solution: Run all transactions serially.
 This is often too restrictive as it limits degree of concurrency or
parallelism in system.

Test for Serializability


 Constrained write rule: transaction updates data item based on its old
value, which is first read by the transaction.
 Under the constrained write rule we can use precedence graph to test
for serializability.

A non-serial schedule can be tested for serializability by using a
method called the precedence graph.
A precedence graph is composed of:
 Nodes representing the transactions, and
 Arcs (directed edges) connecting the nodes, indicating the order of
execution of the transactions.

The directed edge Ti → Tj implies that, in an equivalent serial
execution, Ti must be executed before Tj.

The rules of the precedence graph:
Represent each transaction in the schedule with a node. Then draw a
directed arc from Ti to Tj if:
1. Tj Reads an item that is already Written by Ti
 Ti Write(x) Then Tj Read(x)
2. Tj Writes an item that already has been Read by Ti, or
 Ti Read(x) Then Tj Write(x)
3. Tj Writes an item that already has been Written by Ti
 Ti Write(x) Then Tj Write(x)
After drawing the precedence graph, if there is any cycle in the
graph, the schedule is said to be non-serializable. Otherwise, if there
is no cycle in the graph, the schedule is equivalent to some serial
schedule and is serializable.

Example 1:
R1(x) W2(x) W1(x)

Check: T1→T2 (R1(x) precedes W2(x)) and T2→T1 (W2(x) precedes W1(x)).
Both edges exist, so the graph contains a cycle and the schedule is not serializable.

Example 2:

R1(x) W2(x) W1(x) W3(x)

Check each ordered pair: T1→T2, T2→T1, T2→T3, T3→T2, T1→T3, and T3→T1.
Here too the edges T1→T2 and T2→T1 are both present, so the graph contains a
cycle and the schedule is not serializable.


Concurrency Control Techniques

Concurrency Control is the process of managing simultaneous operations
on the database without having them interfere with one another. It prevents
interference when two or more users are accessing the database
simultaneously and at least one is updating data. Although two
transactions may be correct in themselves, interleaving of operations may
produce an incorrect result.

Three basic concurrency control techniques:


1. Locking methods
2. Time stamping
3. Optimistic

Locking and timestamping are pessimistic (conservative) approaches, since they
delay transactions in case they conflict with other transactions.

The optimistic approach assumes that conflict is rare: it allows transactions to
proceed unhindered and checks for conflicts only at the end, when a transaction
commits.

Locking Method
The locking method is a mechanism for preventing simultaneous access to a
shared resource during a critical operation.

A LOCK is a mechanism for enforcing limits on access to a resource in an
environment where there are many threads of execution. Locks are one
way of enforcing concurrency control policies. A transaction uses locks to
deny access to other transactions and so prevent incorrect updates.

A lock prevents another transaction from modifying an item, or even
reading it in the case of a write lock.


Lock (X): if a transaction T1 applies a lock on data item X, then X is locked
and is not available to any other transaction.
Unlock (X): T1 unlocks X; X becomes available to other transactions.

Types of a Lock
Shared lock: A Read operation does not change the value of a data item.
Hence a data item can be read by two different transactions
simultaneously under shared lock mode. So, to only read a
data item, T1 will do: Shared lock (X), then Read (X), and finally
Unlock (X).
Exclusive lock: A write operation changes the value of the data item.
Hence two write operations from two different transactions
or a write from T1 and a read from T2 are not allowed. A
data item can be modified only under Exclusive lock. To
modify a data item T1 will do: Exclusive lock (X), then Write
(X) and finally Unlock (X).

When these locks are applied, then a transaction must behave in a special
way. This special behavior of a transaction is referred to as well-formed.

Well-formed: a transaction is well-formed if it does not lock an already
locked data item and it does not try to unlock an unlocked data
item.

Locking - Basic Rules


 If transaction has a shared lock on an item, it can read but not
update the item.
 If a transaction has an exclusive lock on an item, it can both read and
update the item.
 Reads cannot conflict, so more than one transaction can hold shared
locks simultaneously on same item.
 Exclusive lock gives transaction exclusive access to that item.
 Some systems allow transaction to upgrade a shared lock to an
exclusive lock, or vice-versa.


Examples: T1 and T2 are two transactions executed under locking as
follows. T1 locks A in exclusive mode. When T2 wants to lock A, it finds
it locked by T1, so T2 waits for T1 to unlock A. When A is released, then
T2 locks A and begins execution.
Suppose a lock on a data item is applied, the data item is processed and it
is unlocked immediately after reading/writing is completed as follows.
Initial values of A = 10 and B = 20.

Serial execution of T1 and then T2:
  T1: Lock (A)
  T1: read (A)   {A = 10}
  T1: A := A + 100
  T1: write (A)  {A = 110}
  T1: Unlock (A)
  T1: Lock (B)
  T1: read (B)   {B = 20}
  T1: B := B + 10
  T1: write (B)  {B = 30}
  T1: Unlock (B)
  T2: Lock (B)
  T2: read (B)   {B = 30}
  T2: B := B * 5
  T2: write (B)  {B = 150}
  T2: Unlock (B)
  T2: Lock (A)
  T2: read (A)   {A = 110}
  T2: A := A + 20
  T2: write (A)  {A = 130}
  T2: Unlock (A)
  Final result: A = 130, B = 150

Concurrent execution of T1 and T2 (with immediate unlocking):
  T1: Lock (A)
  T1: read (A)   {A = 10}
  T1: A := A + 100
  T1: write (A)  {A = 110}
  T1: Unlock (A)
  T2: Lock (B)
  T2: read (B)   {B = 20}
  T2: B := B * 5
  T2: write (B)  {B = 100}
  T2: Unlock (B)
  T1: Lock (B)
  T1: read (B)   {B = 100}
  T1: B := B + 10
  T1: write (B)  {B = 110}
  T1: Unlock (B)
  T2: Lock (A)
  T2: read (A)   {A = 110}
  T2: A := A + 20
  T2: write (A)  {A = 130}
  T2: Unlock (A)
  Final result: A = 130, B = 110


The final result of the two transactions using the two types of execution
(serial and concurrent) is not the same. This indicates that the above
method of locking and unlocking is not correct: although unlocking data
items immediately increases the concurrency of execution, it violates the
isolation and atomicity of the transactions. Immediate unlocking is not
trustworthy. Thus, to preserve consistency we have to use another
approach to locking, the two-phase locking scheme.

Two-Phase Locking (2PL)


A transaction follows the 2PL protocol if all locking operations precede the
first unlock operation in the transaction.
The 2PL protocol demands locking and unlocking of a transaction to have
two phases.
 Growing phase - acquires all locks but cannot release any
locks. Upgrading of shared lock to exclusive locks is done here
if allowed.
 Shrinking phase - releases locks but cannot acquire any new
locks. Down grading of exclusive locks to shared lock is done
here if allowed.

Hence the 2PL protocol avoids the three problems of concurrent execution.
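
The two phases can be made concrete with a short, illustrative Python sketch (the class and method names are assumptions, not a real DBMS interface): once a transaction releases its first lock it enters the shrinking phase and may not acquire any new lock.

```python
class TwoPhaseLockingTxn:
    """Toy transaction object that enforces the 2PL discipline."""
    def __init__(self, name):
        self.name = name
        self.held = set()
        self.shrinking = False   # becomes True at the first unlock

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: lock requested after first unlock")
        self.held.add(item)      # growing phase: acquire locks freely

    def unlock(self, item):
        self.shrinking = True    # growing phase is over
        self.held.discard(item)

t = TwoPhaseLockingTxn("T1")
t.lock("A"); t.lock("B")         # growing phase
t.unlock("A")                    # shrinking phase begins
# t.lock("C")                    # would raise: not allowed under 2PL
```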

Locking methods: problems

Deadlock: may result when two (or more) transactions are each waiting
for locks held by the other to be released.

Deadlock - possible solutions

Only one way to break deadlock: abort one or more of the transactions in
the deadlock.
Deadlock should be transparent to user, so DBMS should restart
transaction(s).


Two general techniques for handling deadlock:


 Deadlock prevention.
 Deadlock detection and recovery.

Timeout
The deadlock detection could be done using the technique of TIMEOUT.
Every transaction will be given a time to wait in case of deadlock. If a
transaction waits for the predefined period of time in idle mode, the DBMS
will assume that deadlock occurred and it will abort and restart the
transaction.

WFG (the Wait-For Graph)

 Nodes represent transactions.
 A directed edge Ti → Tj is drawn if Ti is waiting to lock
an item already locked by Tj.
 Deadlock exists if there is a cycle in the graph.

If every transaction in a schedule follows 2PL, the schedule is
serializable. However, problems can occur with the interpretation
of when locks can be released.
 Suppose Ti aborts, but Tj has read a data item written by Ti
 Then Tj must abort; if Tj had been allowed to commit
earlier, the schedule is not recoverable.
 Further, any transaction that has read a data item written
by Tj must also abort
This can lead to cascading rollback…

In order to avoid such problems, it is recommended to leave the release of
locks until the end of the transaction.

A schedule is recoverable if, for each pair of transactions Ti and Tj,
whenever Tj reads an item previously written by Ti, the commit operation
of Ti precedes that of Tj.


Time-stamping Method

Timestamp: a unique identifier created by the DBMS that indicates the
relative starting time of a transaction.
Can be generated by:
 using system clock at the time transaction started, or
 Incrementing a logical counter every time a new transaction starts.
Time-stamping is a concurrency control protocol that orders transactions
in such a way that older transactions, i.e., transactions with smaller
timestamps, get priority in the event of conflict.

 Transactions are ordered globally based on their timestamps, so that older
transactions, i.e., transactions with earlier timestamps, get priority in the
event of conflict.
 Conflict is resolved by rolling back and restarting transaction.
 Since there is no need to use lock there will be No Deadlock.

In timestamp ordering, the schedule is equivalent to the particular serial
order that corresponds to the order of the transaction timestamps.

To implement this scheme, every transaction is given a timestamp, which
is a unique identifier of the transaction.

If Ti came to processing prior to Tj, then the TS of Tj will be larger than
the TS of Ti.

Again each data item will have a timestamp for Read and Write.
 WTS(A) which denotes the largest timestamp of any transaction
that successfully executed Write(A)
 RTS(A) which denotes the largest timestamp of any transaction
that successfully executed Read(A)
These timestamps are updated whenever a new Read (A) or Write
(A) instruction is executed.

Read/write proceeds only if last update on that data item was carried out
by an older transaction. Otherwise, transaction requesting read/write is
restarted and given a new timestamp.


The timestamp ordering protocol ensures that any conflicting read
and write operations are executed in timestamp order. The protocol
manages concurrent execution such that the timestamps determine the
serializability order.

Rules for permitting execution of operations in the time-stamping method:
Suppose that Transaction Ti issues Read(A)
 If TS(Ti) < WTS(A): this implies that Ti needs to read a value of
A which was already overwritten. Hence the read operation
must be rejected and Ti is rolled back.
 If TS(Ti) >= WTS(A): then the read is executed and RTS(A) is set
to the maximum of RTS(A) and TS(Ti).
Suppose that Transaction Ti issues Write(A)
 If TS(Ti) < RTS(A): this implies that the value of A that Ti is
producing was previously needed, and it was assumed that it
would never be produced. Hence, the Write operation must be
rejected and Ti is rolled back.
 If TS(Ti) < WTS(A): this implies that Ti is attempting to write an
obsolete value of A. Hence, this write operation can be ignored.
 Otherwise the Write operation is executed and WTS(A) is set to
the maximum of WTS(A) or TS(Ti).
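
The rules above can be sketched compactly in Python; the dictionaries TS, RTS and WTS and the returned strings are assumptions made for illustration (a real DBMS would also restart a rolled-back transaction with a new timestamp):

```python
def read_item(ti, a, TS, RTS, WTS):
    if TS[ti] < WTS.get(a, 0):
        return "rollback"                    # value of A already overwritten
    RTS[a] = max(RTS.get(a, 0), TS[ti])      # record the youngest reader
    return "read executed"

def write_item(ti, a, TS, RTS, WTS):
    if TS[ti] < RTS.get(a, 0):
        return "rollback"                    # a younger transaction already read A
    if TS[ti] < WTS.get(a, 0):
        return "ignored"                     # obsolete write: skip it
    WTS[a] = max(WTS.get(a, 0), TS[ti])
    return "write executed"

TS, RTS, WTS = {"T1": 5, "T2": 9}, {}, {}
print(write_item("T2", "A", TS, RTS, WTS))   # write executed (WTS[A] = 9)
print(read_item("T1", "A", TS, RTS, WTS))    # rollback: T1 is older than the writer
```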

A transaction that is rolled back due to conflict will be restarted and
given a new timestamp.

Optimistic Technique
 Locking and assigning and checking timestamp values may be
unnecessary for some transactions
 Assumes that conflict is rare.
 When transaction reaches the level of executing commit, a
check is performed to determine whether conflict has occurred.
If there is a conflict, transaction is rolled back and restarted.
 Based on assumption that conflict is rare and more efficient to
let transactions proceed without delays to ensure serializability.


 At commit, a check is made to determine whether conflict has
occurred.
 If there is a conflict, transaction must be rolled back and
restarted.
 Potentially allows greater concurrency than traditional
protocols.

 Three phases:
1. Read
2. Validation
3. Write

1. Optimistic Techniques - Read Phase


 Extends from start until immediately before commit.
 Transaction reads values from database and stores them in local
variables. Updates are applied to a local copy of the data.

2. Optimistic Techniques - Validation Phase


 Follows the read phase.
 For read-only transaction, checks that data read are still current values.
If no interference, transaction is committed, else aborted and restarted.
 For update transaction, checks transaction leaves database in a
consistent state, with serializability maintained.

3. Optimistic Techniques - Write Phase


 Follows successful validation phase for update transactions.
 Updates made to local copy are applied to the database.
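
An illustrative Python sketch of the validation idea follows; the function name and data structures are assumptions. A transaction's read set is checked against the write sets of the transactions that committed while it was running:

```python
def validate(read_set, overlapping_write_sets):
    """Return True if the transaction may commit, False if it must restart."""
    for write_set in overlapping_write_sets:
        if read_set & write_set:      # we read an item another txn changed
            return False              # conflict: abort and restart
    return True                       # no interference: apply local updates

print(validate({"A", "B"}, [{"C"}]))        # True  - no conflict
print(validate({"A", "B"}, [{"B", "D"}]))   # False - conflict on B
```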

Granularity of data items

Granularity is the size of the data items chosen as the unit of protection
by a concurrency control protocol. See the figure below.
It could be:
 The entire database
 A file
 A page (a section of physical disk in which relations are
stored)(sometimes also called a block)


 A record
 A field value of a record
The granularity has an effect on the performance of the system. As locking
prevents access to the data, the size of the data item locked determines how
many other transactions are prevented from having access. If the entire database
is locked, consistency will be highly maintained, but less performance of the
system will be witnessed. If a single data item is locked, consistency may be at
risk, but concurrent processing and performance will be enhanced. Thus, as one
goes from the entire database to a single value, performance and concurrent
processing are enhanced, but consistency is at risk and needs a good
concurrency control mechanism and strategy.

Database
 └─ File1, File2, File3
     └─ Page1, Page2, Page3
         └─ Record1, Record2, Record3
             └─ Field1, Field2, Field3

Fig. Granularity of data items


Transaction Subsystem
Refer to the earlier figure (the transaction subsystem of the DBMS).

 Transaction manager
 Scheduler
 Recovery manager
 Buffer manager

 The Transaction Manager in the DBMS coordinates transactions on behalf of
application programs. It communicates with the scheduler (lock manager),
which is responsible for implementing a strategy for concurrency control.

 The Scheduler in the DBMS ensures that the individual steps of different
transactions preserve consistency.

 The Recovery Manager ensures that the database is restored to the original
state in case a failure occurs.

 The Buffer Manager is responsible for the transfer of data between disk
storage and main memory.


Database Recovery
Database recovery is the process of restoring database to a correct state in
the event of a failure.
A database recovery is the process of eliminating the effects of a failure
from the database.
Recovery, in database systems terminology, means restoring the last
consistent state of the data items.

 Need for Recovery Control


o Two types of storage: volatile (main memory) and
nonvolatile.
o Volatile storage does not survive system crashes.
o Stable storage represents information that has been
replicated in several nonvolatile storage media with
independent failure modes.
Types of failures
A failure is a state where data inconsistency is visible to transactions if
they are scheduled for execution. The kind of failure could be:
 System crashes, resulting in loss of main memory.
 Media failures, resulting in loss of parts of secondary storage.
 Application software errors.
 Natural physical disasters.
 Carelessness or unintentional destruction of data or facilities.
 Sabotage.

In databases, a failure can generally be categorized into one of the
following major groups:
1. Transaction failure: a transaction cannot continue with its
execution, therefore, it is aborted and if desired it may be restarted
at some other time. Reasons: Deadlock, timeout, protection
violation, or system error.
2. System failure: the database system is unable to process any
transactions. Some of the common reasons of system failure are:
wrong data input, register overflow, addressing error, power failure,
memory failure, etc.
3. Media failure: failure of non-volatile storage media (mainly disk).
Some of the common reasons are: head crash, dust on the recording
surfaces, fire, etc.


To make the database secure, one should formulate a "plan of attack" in
advance. The plan will be used in case of database damage, which may
range from minor inconsistency to total loss of the data due to hazardous
events.
The basic steps in performing a recovery are

1. Isolating the database from other users. Occasionally, you may need
to drop and re-create the database to continue the recovery.

2. Restoring the database from the most recent useable dump.

3. Applying transaction log dumps, in the correct sequence, to the
database to make the data as current as possible.

It is a good idea to test your backup and recovery plans periodically by
loading the backups and transaction logs into a test database and verifying
that your procedure really works.

One can recover databases after three basic types of problems: user error,
software failure, and hardware failure.

Each type of failure requires a recovery mechanism. In a transaction
recovery, the effects of the failed transaction, if any, are removed from the
database. In a system failure, the effects of failed transactions have to be
removed from the database and the effects of completed transactions have
to be installed in the database.

The database recovery manager is responsible for guaranteeing the
atomicity and durability properties of ACID.

The execution of a transaction T is correct if, given a consistent database, T
leaves the database in the next consistent state. Let the initial consistent
state of the database be S1; T changes the consistent state from S1 to S2.
During this transition the database may go into an inconsistent state, which
is visible only to T. If the system or T fails, the database will not be able to
reach S2, and consequently the visible state would be an inconsistent one.
From this state we can go either forward (to state S2) or backward (to state S1).


Example:
The initial value of A=100, B=200 and C=300
The Required state after the execution of T1 is A=500, B=800 and C=700
Thus S1= (100,200,300)
S2= (500,800,700)
Transaction (T1)
Time Operation
1 A=A+200
2 B=B-100
3 C=C-200
Failure
4 A=A+200
5 B=B+700
6 C=C+600

Force writing: the explicit writing of the buffers to secondary storage.


Transactions and Recovery


 Transactions represent basic unit of recovery.
 Recovery manager responsible for atomicity and durability.
 If failure occurs between commit and database buffers being
flushed to secondary storage then, to ensure durability, recovery
manager has to redo (roll forward) transaction's updates.
 If transaction had not committed at failure time, recovery
manager has to undo (rollback) any effects of that transaction for
atomicity.
 Partial undo - only one transaction has to be undone.
 Global undo - all transactions have to be undone.
Transaction Log: Execution history of concurrent transactions.

 The DBMS starts at time t0 but fails at time tf. Assume the data for
transactions T2 and T3 have been written to secondary storage.
 T1 and T6 have to be undone. In the absence of any other information,
the recovery manager has to redo T2, T3, T4, and T5.
 tc is the checkpoint time chosen by the DBMS.
Recovery Facilities
 DBMS should provide following facilities to assist with recovery:
o Backup mechanism: that makes periodic backup copies of
database.
o Logging facility: that keeps track of current state of
transactions and database changes.
o Checkpoint facility: that enables updates to database in
progress to be made permanent.
o Recovery manager: allows the DBMS to restore the
database to a consistent state following a failure.


Restoring the database means transforming the state of the database to the
immediate good state before the failure. To do this, the change made on
the database should be preserved. Such kind of information is stored in a
system log or transaction log file.

Log File
 Contains information about all updates to database:
o Transaction records.
o Checkpoint records.
 Often used for other purposes (for example, auditing).
 Transaction records contain:
o Transaction identifier.
o Type of log record, (transaction start, insert, update, delete,
abort, commit).
o Identifier of data item affected by database action (insert,
delete, and update operations).
o Before-image of data item.
o After-image of data item.

 Log management information.


 Log file sometimes split into two separate random-access files.
 Potential bottleneck; critical in determining overall performance.


Check pointing
Checkpoint: is a point of synchronization between database and log file.
All buffers are force-written to secondary storage.

 A checkpoint record is created containing the identifiers of all active
transactions.
 When failure occurs, redo all transactions that committed since
the checkpoint and undo all transactions active at time of crash.
 In previous example, with checkpoint at time tc, changes made
by T2 and T3 have been written to secondary storage.
 Thus:
o Only redo T4 and T5,
o Undo transactions T1 and T6.


Recovery Techniques
Damage to the database could be either physical, resulting in the loss of the
stored data, or mere inconsistency of the database state after the failure. For
each case we have a recovery mechanism:
1. If database has been damaged:
 Need to restore last backup copy of database and reapply
updates of committed transactions using log file.
 Extensive damage/catastrophic failure (physical media
failure): restored by using the backup copy and by
re-executing the committed transactions from the log up to
the time of failure.

2. If database is only inconsistent:


 No physical damage/only inconsistent: the restoring is
done by reversing the changes made on the database by
consulting the transaction log.
 Need to undo changes that caused inconsistency. May also
need to redo some transactions to ensure updates reach
secondary storage.
 Do not need backup, but can restore database using before-
and after-images in the log file.

Recovery Techniques for Inconsistent Database State

Recovery is required only if the database has been updated. The kind of
recovery also depends on the kind of update made on the database.

Database update: a transaction's updates to the database can be applied
in two ways, deferred and immediate.
 Three main recovery techniques:
1. Deferred Update
2. Immediate Update
3. Shadow Paging


Deferred Update
 Updates are not written to the database until after a transaction has
reached its commit point.
 If transaction fails before commit, it will not have modified database
and so no undoing of changes required.
 May be necessary to redo updates of committed transactions as their
effect may not have reached database.
 If a transaction aborts, ignore its log records; do nothing with a
transaction having only "transaction start" and "transaction abort"
log records.
 A transaction first modifies all its data items and then writes all its
updates to the final copy of the database. No change is going to be
recorded on the database before commit. The changes will be made
only on the local transaction workplace. Update on the actual
database is made after commit and after the change is recorded on
the log. Since there is no need to perform undo operation it is also
called NO-UNDO/REDO Algorithm
 The redo operations are made in the order they were written to log.
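
The following simplified Python sketch shows the NO-UNDO/REDO idea; the log record layout and the values are invented for illustration:

```python
# Deferred update: the log holds after-images only; nothing reached the
# database before commit, so recovery redoes committed work and simply
# ignores records of transactions that never committed.
log = [
    ("start",  "T1"),
    ("write",  "T1", "A", 500),   # after-image of A
    ("commit", "T1"),
    ("start",  "T2"),
    ("write",  "T2", "B", 800),   # T2 never committed
]

committed = {rec[1] for rec in log if rec[0] == "commit"}
database = {"A": 100, "B": 200}

for rec in log:                   # redo in the order written to the log
    if rec[0] == "write" and rec[1] in committed:
        _, _, item, after_image = rec
        database[item] = after_image

print(database)                   # {'A': 500, 'B': 200} -- no undo needed
```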

Immediate Update/ Update-In-Place


 Updates are applied to database as they occur.
 Need to redo updates of committed transactions following a failure.
 May need to undo effects of transactions that had not committed at
time of failure.
 Essential that log records are written before write to database. Write-
ahead log protocol.
 If no "transaction commit" record in log, then that transaction was
active at failure and must be undone.
 Undo operations are performed in reverse order in which they were
written to log.

As soon as a transaction updates a data item, it updates the final copy
of the database on the database disk. While making the update, the
change is recorded in the transaction log to permit a rollback operation
in case of failure. UNDO and REDO are required to make the
transactions consistent; thus this is called the UNDO/REDO algorithm.
The algorithm undoes all in-place updates of transactions that failed
before commit. The redo is required because some transactions that
committed may have updates that have not yet reached the database. If
we don't have this second scenario, then the other variation of this
algorithm is called the UNDO/NO-REDO algorithm.

Shadow Paging
 Maintain two page tables during the life of a transaction: a current page
table and a shadow page table.
 When transaction starts, two pages are the same.
 Shadow page table is never changed thereafter and is used to restore
database in event of failure.
 During transaction, current page table records all updates to
database.
 When transaction completes, current page table becomes shadow
page table.
 No log record management
 However, it has an overhead of managing pages i.e. page
replacement issues have to be handled.
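
A toy Python sketch of the idea follows (the page-table contents are invented): the shadow page table is left untouched while the transaction runs, and commit amounts to one atomic table swap.

```python
shadow_table = {1: "block-17", 2: "block-42"}   # page number -> disk block
current_table = dict(shadow_table)              # identical copy at txn start

# The transaction rewrites page 2: the new version goes to a FRESH block,
# so the old block (still referenced by the shadow table) is preserved.
current_table[2] = "block-99"

commit = True
if commit:
    shadow_table = current_table                # current table becomes shadow
# On failure we keep the old shadow_table: the database is unchanged,
# and no log records were needed to recover.
```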


Chapter Two
Query Processing and Optimization

DB system features
 Crash Recovery
 Integrity Checking
 Security
 Concurrency Control
 Query Processing and Optimization
 File Organization and Optimization
Query Processing
The aim of query processing is to find information in one or more databases and
deliver it to the user quickly and efficiently. Traditional techniques work well for
databases with standard, single-site relational structures, but databases
containing more complex and diverse types of data demand new query
processing and optimization techniques.

Query Processing can be divided into four main phases:


1. Decomposition
2. Optimization
3. Code generation, and
4. Execution

Consider the following query on two relations, Staff and Branch:

Staff (StaffNo, name, position, salary, sex, BranchNo)
Branch (BranchNo, name, city)

E.g. Find all managers who manage a branch in Addis.

We can write this query in SQL as follows:

SELECT s.*
FROM staff s, branch b
WHERE b.branchNo = s.branchNo AND (s.position = 'Manager' AND b.city = 'Addis')


Assume
i. one record is accessed at a time; there are n staff, m branches, x non-managers,
and y non-Addis branches, for some integers n, m, x, y;
ii. intermediate results are stored on disk;
iii. the final result (write) is ignored, because its cost is the same for all the
expressions.

Then, this high-level SQL can be transformed into the following three
equivalent low-level relational algebra expressions.

σ(position='Manager') ∧ (city='Addis') ∧ (Staff.branchNo=Branch.branchNo)(Staff × Branch)

Analysis:
i. read each tuple from the two relations: n + m reads
ii. create a table for the Cartesian product: n × m writes
iii. test each tuple of step ii: n × m reads
Total no. of disk accesses: 2(n × m) + n + m
or
(position=’manager’)(City=’Addis’)(Staff staff.branchNo=branch.branchNo
Branch)
Analysis:
i. read each tuple from the two relations n+m reads
ii. create a table of the Join n writes
iii. test each tuple of step 2n read
Total No. of Disk access: 3(n) +m
Or
(σposition='Manager'(Staff)) ⋈Staff.branchNo=Branch.branchNo (σcity='Addis'(Branch))

Analysis:
i. test each tuple from the two relations: n + m reads
ii. create the "manager_Staff" and "addis_Branch" relations: (n-x) + (m-y) writes
iii. create a join of the two relations of step ii: (n-x) + (m-y) reads
Total no. of disk accesses: 2(n-x) + 2(m-y) + n + m
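
To compare the three strategies, one can plug illustrative numbers into the three totals; the values of n, m, x and y below are made up for the comparison:

```python
n, m = 1000, 50      # staff tuples, branch tuples
x, y = 990, 45       # non-managers, non-Addis branches

cartesian_cost = 2 * (n * m) + n + m                    # select over Cartesian product
join_cost = 3 * n + m                                   # select over the join
pushed_select_cost = 2 * (n - x) + 2 * (m - y) + n + m  # select before join

print(cartesian_cost, join_cost, pushed_select_cost)
# 101050 3050 1080 -> performing the selections first is cheapest
```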

Which of the expressions given above do you think is best (Optimal)?


Phases of Query processing

Query Decomposition
Query decomposition is the process of transforming a high level query into a
relational algebra query, and to check that the query is syntactically and
semantically correct. Query decomposition consists of parsing and validation

Input: Relational Algebra query on global relations

Typical stages in query decomposition are:


1. Analysis: lexical and syntactical analysis of the query (correctness).
Query tree will be built for the query containing leaf node
for base relations, one or many non-leaf nodes for relations
produced by relational algebra operations and root node for
the result of the query. Sequence of operation is from the
leaves to the root.
2. Normalization: convert the query into a normalized form. The
WHERE predicate is converted to conjunctive (∧) or disjunctive
(∨) normal form.
3. Semantic Analysis: reject normalized queries that are not
correctly formulated or are contradictory. A query is incorrect if
its components do not contribute to generating the result; it is
contradictory if its predicate cannot be satisfied by any tuple.
Algorithms: relation connection graph and normalized attribute
connection graph.
4. Simplification: to detect redundant qualifications, eliminate
common sub-expressions, and transform the query to a
semantically equivalent but more easily and effectively
computed form.
5. Query Restructuring: more than one translation is possible; use
transformation rules, rearranging nodes so that the most
restrictive condition is executed first.
Most real-world data is not well structured. Today's databases typically contain
much non-structured data such as text, images, video, and audio, often
distributed across computer networks. In this complex environment, efficient
and accurate query processing becomes quite challenging.

There could be tons of tricks (not only in storage and query processing, but also
in concurrency control, recovery, etc.). Different tricks may work better in
different usage scenarios, and the same tricks get used over and over again in
different applications.


Query processing: execute transactions on behalf of the query and print the
result. Steps in query processing:

Example: SELECT Customer.name
FROM Customer, Invoice
WHERE region = 'Kansas City' AND amount > 1000


Query Optimization
What is wrong with the ordinary query?
 Everyone wants the performance of their database to be optimal. In
particular, there is often a requirement for a specific query or object that is
query based, to run faster.
 Problem of query optimization is to find the sequence of steps that produces
the answer to user request in the most efficient manner, given the database
structure.
 The performance of a query is affected by the tables or queries that underlie
the query and by the complexity of the query.
When data/workload characteristics change
 The best navigation strategy changes
 The best way of organizing the data changes

Query optimizers are one of the main means by which modern database systems
achieve their performance advantages. Given a request for data manipulation or
retrieval, an optimizer will choose an optimal plan for evaluating the request
from among the manifold alternative strategies; i.e., there are many ways
(access paths) of accessing the desired file/record. The optimizer tries to select the
most efficient (cheapest) access path for accessing the data. The DBMS is responsible
for picking the best execution strategy based on various considerations.

Query optimizers are among the largest and most complex modules of
database systems.

Most efficient processing: Least amount of I/O and CPU resources.


Selection of the best method: In a non-procedural language the system does the
optimization at the time of execution. On the other hand in a procedural
language, programmers have some flexibility in selecting the best method. For
optimizing the execution of a query the programmer must know:
 File organization
 Record access mechanism and primary or secondary key.
 Data location on disk.
 Data access limitations.

To write correct code, application programmers need to know how data is
organized physically (e.g., which indexes exist). To write efficient code,
application programmers also need to worry about data/workload characteristics.
One has to cope with change! Real-time changes make it preferable to give the
responsibility of optimization to the DBMS.


Example: Consider relations r(AB) and s(CD). We require r × s.

Method 1
a. Load next record of r in RAM.
b. Load all records of s, one at a time and concatenate with r.
c. All records of r concatenated?
NO: goto a.
YES: exit (the result in RAM or on disk).
Performance: Too many accesses.

Method 2: Improvement
a. Load as many blocks of r as possible, leaving room for one block of s.
b. Run through the s file completely, one block at a time.
Performance: reduces the number of times the s blocks are loaded by a factor
equal to the number of r records that can fit in main memory.
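
Method 2 can be sketched as a block nested-loop over files modelled as lists of blocks; the sample data, block sizes and function name are assumptions for illustration only.

```python
def block_nested_loop_product(r_blocks, s_blocks, ram_blocks):
    """Yield r x s, loading r in chunks of (ram_blocks - 1) blocks."""
    chunk = ram_blocks - 1                  # leave room for one block of s
    for i in range(0, len(r_blocks), chunk):
        # all r records currently held in RAM
        r_chunk = [rec for blk in r_blocks[i:i + chunk] for rec in blk]
        for s_blk in s_blocks:              # one s block in RAM at a time
            for r_rec in r_chunk:
                for s_rec in s_blk:
                    yield r_rec + s_rec     # concatenate the tuples

r = [[("a1", "b1"), ("a2", "b2")], [("a3", "b3")]]   # r(AB) in two blocks
s = [[("c1", "d1")], [("c2", "d2")]]                 # s(CD) in two blocks
print(list(block_nested_loop_product(r, s, ram_blocks=3)))
```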

Considerations during query Optimization:


 Narrow down intermediate result sets quickly. SELECT before JOIN
 Use access structures (indexes).


Approaches to Query Optimization


Heuristics Approach
 The heuristic approach uses the knowledge of the characteristics of the
relational algebra operations and the relationship between the operators
to optimize the query.
 Thus the heuristic approach of optimization will make use of:
o Properties of individual operators
o Association between operators
o Query Tree: a graphical representation of the operators, relations,
attributes and predicates and processing sequence during query
processing.
 A query tree is composed of three main parts:
i. The leaves: the base relations used for processing the
query/extracting the required information.
ii. The root: the final result/relation output after the operations
on the relations used for query processing.
iii. Nodes: intermediate results or relations before reaching
the final result.
 The sequence of execution of operations in a query tree starts
from the leaves, continues through the intermediate nodes, and
ends at the root.

 The properties of each operation and the associations between operators are
analyzed using a set of rules called TRANSFORMATION RULES. Use of the
transformation rules will transform the query into a relatively good execution
strategy.


Transformation Rules for Relational Algebra


1. Cascade of SELECTION: conjunctive SELECTION operations can cascade
into individual SELECTION operations and vice versa:

σ(c1 ∧ c2 ∧ c3)(R) = σc1(σc2(σc3(R))), where each ci is a predicate

2. Commutativity of SELECTION operations:

σc1(σc2(R)) = σc2(σc1(R))

3. Cascade of PROJECTION: in a sequence of PROJECTION operations, only
the last in the sequence is required:

πL1(πL2(πL3(πL4(R)))) = πL1(R)

4. Commutativity of SELECTION with PROJECTION and vice versa:

If the predicate c1 involves only the attributes in the projection list (L1), then the
selection and projection operations commute:

π<a1,a2,...,an>(σc1(R)) = σc1(π<a1,a2,...,an>(R)),
where the attributes of c1 are in {a1, a2, ..., an}

5. Commutativity of THETA JOIN/Cartesian product:

R × S is equivalent to S × R
(R ⋈c1 S) = (S ⋈c1 R)
This also holds for Equi-Join and Natural-Join.

6. Commutativity of SELECTION with THETA JOIN:
a. If the predicate c1 involves only attributes of one of the relations (R)
being joined, then the selection and join operations commute:

σc1(R ⋈c S) = (σc1(R)) ⋈c S

b. If the predicate is of the form c1 ∧ c2, where c1 involves only attributes of
R and c2 involves only attributes of S, then the selection and theta join
operations commute:

σ(c1 ∧ c2)(R ⋈c S) = (σc1(R)) ⋈c (σc2(S))

7. Commutativity of PROJECTION and THETA JOIN:

If the projection list is of the form L1, L2, where L1 involves only attributes of R
and L2 involves only attributes of S being joined, and the predicate c involves
only attributes in the projection list, then the PROJECTION and JOIN operations
commute as:

πL1,L2(R ⋈c S) = (πL1(R)) ⋈c (πL2(S))

However, if the join condition c contains additional attributes not in the projection
list L = L1 ∪ L2, say M = M1 ∪ M2 where M1 is from R and M2 is from S, then the
PROJECTION and JOIN operations commute as follows:

πL1,L2(R ⋈c S) = πL1,L2((πL1,M1(R)) ⋈c (πL2,M2(S)))

8. Commutativity of the set operations UNION and INTERSECTION, but not
SET DIFFERENCE:

R ∪ S = S ∪ R and R ∩ S = S ∩ R

9. Associativity of THETA JOIN, CARTESIAN PRODUCT, UNION and
INTERSECTION:

(R θ S) θ T = R θ (S θ T), where θ is one of ×, ⋈, ∪, ∩

10. Commuting SELECTION with set operations:

σc(R θ S) = (σc(R)) θ (σc(S)), where θ is one of ∪, ∩, −

11. Commuting PROJECTION with UNION:

πL1(S ∪ R) = πL1(S) ∪ πL1(R)


The heuristic approach is implemented by applying the above transformation
rules in the following sequence of steps.

Sequence for Applying Transformation Rules


1. Use
Rule-1 Cascade SELECTION
2. Use
Rule-2: Commutativity of SELECTION
Rule-4: Commuting SELECTION with PROJECTION
Rule-6: Commuting SELECTION with JOIN and CARTESIAN
Rule-10: commuting SELECTION with SET OPERATIONS
3. Use
Rule-9: Associativity of Binary Operations (JOIN, CARTESIAN,
UNION and INTERSECTION). Rearrange nodes so that the most
restrictive operations are performed first (moving them as far down the
tree as possible).
4. Combine each Cartesian product with the subsequent selection operation
(forming a join).
5. Use
Rule-3: Cascade of PROJECTION
Rule-4: Commuting PROJECTION with SELECTION
Rule-7: Commuting PROJECTION with JOIN and CARTESIAN
Rule-11: commuting PROJECTION with UNION

Main Heuristic
The main heuristic is to first apply operations that reduce the size (the cardinality
and/or the degree) of the intermediate relation. That is:
a. Perform SELECTION as early as possible: that will reduce the
cardinality (number of tuples) of the relation.
b. Perform PROJECTION as early as possible: that will reduce the
degree (number of attributes) of the relation.
 Both a and b will be accomplished by placing the
SELECT and PROJECT operations as far down the tree as
possible.
c. SELECT and JOIN operations with the most restrictive conditions,
i.e., those resulting in relations with the smallest absolute size,
should be executed before other similar operations. This is achieved
by reordering the JOIN nodes in the tree.


Example: consider the following schemas and the query, where the EMPLOYEE
and the PROJECT relations are related by the WORKS_ON relation.

EMPLOYEE (EEmpID, FName, LName, Salary, Dept, Sex, DoB)


PROJECT (PProjID, PName, PLocation, PFund, PManagerID)
WORKS_ON (WEmpID, WProjID)

WEmpID (employee identification) and WProjID (project identification) are
foreign keys in the WORKS_ON relation, referencing the EMPLOYEE (EEmpID)
and PROJECT (PProjID) relations respectively.

Query: The manager of the company working on road construction would like
to view the names of employees born before January 1, 1965 who are
working on the project named Ring Road.

Relational Algebra representation of the query will be:

π<FName, LName>(σ<DoB < Jan 1 1965 ∧ WEmpID=EEmpID ∧ PProjID=WProjID ∧ PName='Ring Road'>(EMPLOYEE × WORKS_ON × PROJECT))

The SQL equivalence for the above query will be:

SELECT FName, LName
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE DoB < 'Jan 1 1965' AND EEmpID = WEmpID AND WProjID = PProjID AND PName = 'Ring Road'


The initial query tree will be:

π<FName, LName>
 └─ σ(DoB<Jan 1 1965) ∧ (WEmpID=EEmpID) ∧ (PProjID=WProjID) ∧ (PName='Ring Road')
     └─ ×
         ├─ ×
         │   ├─ EMPLOYEE
         │   └─ WORKS_ON
         └─ PROJECT

By applying the first step (cascading the selection) we will come up with the
following structure.

σ(DoB<Jan 1 1965)(σ(WEmpID=EEmpID)(σ(PProjID=WProjID)(σ(PName='Ring Road')(EMPLOYEE × WORKS_ON × PROJECT))))

By applying the second step, it can be seen that some conditions involve
attributes that belong to a single relation (DoB belongs to EMPLOYEE and
PName belongs to PROJECT); thus those selection operations can be commuted
with the Cartesian product. Then, since the condition WEmpID=EEmpID involves
the EMPLOYEE and WORKS_ON relations, the selection with this condition can
be cascaded.


σ(PProjID=WProjID) ((σ(PName='Ring Road') (PROJECT)) × (σ(WEmpID=EEmpID) (WORKS_ON × (σ(DoB<Jan 1 1965) (EMPLOYEE)))))

The query tree after this modification will be:

π<FName, LName>
 └ σ(PProjID=WProjID)
    └ ×
       ├ σ(WEmpID=EEmpID)
       │  └ ×
       │     ├ σ(DoB<Jan 1 1965)
       │     │  └ EMPLOYEE
       │     └ WORKS_ON
       └ σ(PName='Ring Road')
          └ PROJECT


Using the third step, perform the most restrictive operations first.

From the query given we can see that the selection on PROJECT is more restrictive than the selection on EMPLOYEE. Thus, it is better to perform the selection on PROJECT before the one on EMPLOYEE; the nodes are rearranged to achieve this.

π<FName, LName>
 └ σ(WEmpID=EEmpID)
    └ ×
       ├ σ(PProjID=WProjID)
       │  └ ×
       │     ├ σ(PName='Ring Road')
       │     │  └ PROJECT
       │     └ WORKS_ON
       └ σ(DoB<Jan 1 1965)
          └ EMPLOYEE


Using the fourth step, combine each Cartesian product with the subsequent selection operation into a single JOIN operation.

π<FName, LName>
 └ ⋈(WEmpID=EEmpID)
    ├ ⋈(PProjID=WProjID)
    │  ├ σ(PName='Ring Road')
    │  │  └ PROJECT
    │  └ WORKS_ON
    └ σ(DoB<Jan 1 1965)
       └ EMPLOYEE


Using the fifth step, perform the projections as early as possible.

π<FName, LName>
 └ ⋈(WEmpID=EEmpID)
    ├ π<WEmpID>
    │  └ ⋈(PProjID=WProjID)
    │     ├ π<PProjID>
    │     │  └ σ(PName='Ring Road')
    │     │     └ PROJECT
    │     └ WORKS_ON
    └ π<FName, LName, EEmpID>
       └ σ(DoB<Jan 1 1965)
          └ EMPLOYEE


Cost Estimation Approach to Query Optimization

The main idea is to minimize the cost of processing a query. The cost function is
comprised of:
 I/O cost + CPU processing cost + communication cost + Storage cost

These components might have different weights in different processing environments.

The DBMS will use information stored in the system catalogue for the purpose of estimating cost. The main target of query optimization is to minimize the size of the intermediate relation. The size will have an effect on the cost of:
 Disk Access
 Data Transportation
 Storage space in the Primary Memory
 Writing on Disk

The statistics in the system catalogue used for cost estimation purpose are:
 Cardinality of a relation: the number of tuples contained in a relation
currently (r)
 Degree of a relation: number of attributes of a relation
 Number of tuples on a relation that can be stored in one block of memory
 Total number of blocks used by a relation
 Number of distinct values of an attribute (d)
 Selection Cardinality of an attribute (S): the average number of records that will satisfy an equality condition on the attribute, S = r/d

By using the above information one could calculate the cost of executing a query and select the best strategy, i.e. the one with the minimum cost of processing.
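
For example (hypothetical figures): if a relation currently holds r = 10,000 tuples and one of its attributes has d = 50 distinct values, the selection cardinality is S = r/d = 10,000/50 = 200, so a query with an equality condition on that attribute is estimated to return about 200 tuples. Comparing such estimates for alternative execution plans lets the optimizer choose the cheapest one.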

Cost Components for Query Optimization

The costs of query execution can be calculated for the following major process we
have during processing.
1. Access Cost of Secondary Storage
Data is going to be accessed from secondary storage, as any query will need some part of the data stored in the database. The disk access cost can
again be analyzed in terms of:
 Searching
 Reading, and
 Writing, data blocks used to store some portion of a relation.


The disk access cost will vary depending on the file organization used and
the access method implemented for the file organization. In addition to
the file organization, the data allocation scheme, whether the data is
stored contiguously or in scattered manner, will affect the disk access cost.

2. Storage Cost
While processing a query, as any query would be composed of many
database operations, there could be one or more intermediate results before
reaching the final output. These intermediate results should be stored in
primary memory for further processing. The bigger the intermediate relation,
the larger the memory requirement, which will have impact on the limited
available space. This will be considered as a cost of storage.

3. Computation Cost
Query is composed of many operations. The operations could be database
operations like reading and writing to a disk, or mathematical and other
operations like:
 Searching
 Sorting
 Merging
 Computation on field values

4. Communication Cost
In most database systems the database resides at one site and various queries originate from different terminals. This has an impact on the performance of the system, adding cost to query processing. Thus, the cost of transporting data between the database site and the terminal from which the query originates should be analyzed.


Pipelining

Pipelining is another method used for query optimization. It is sometimes referred to as on-the-fly processing of queries. While query optimization tries to reduce the size of intermediate results, pipelining goes further by applying several conditions to a single intermediate result continuously, as tuples are produced. Thus the technique is said to reduce the number of intermediate relations in query execution.
Pipelining performs multiple operations on a single relation in a pipeline.
Ex:
Let's say we have a relation on employees with the following schema:

Employee(ID, FName, LName, DoB, Salary, Position, Dept)

If a query would like to extract supervisors with salary greater than 2000, the
relational algebra representation of the query will be:

σ(Salary>2000) ∧ (Position=Supervisor) (Employee)


After reading the relation from the memory, the system could perform the
operation by cascading the SELECT operation.

1. Approach One

σ(Salary>2000) (σ(Position=Supervisor) (Employee))


Using this approach we will have the following relations:
 Employee
 The relation created by the first operation:

R1 = σ(Position=Supervisor) (Employee)

 The resulting relation of the second operation:

R2 = σ(Salary>2000) (R1)
2. Approach Two
One can read the tuples of the relation Employee one at a time, perform both tests on each tuple in a pipeline, and create the final relation at once, without materializing R1. This is what is called PIPELINING.
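
In SQL terms the two cascaded tests correspond to a single query with a conjunctive condition; a minimal sketch (the pipelining itself happens inside the query executor, not in the SQL text):

SELECT *
FROM Employee
WHERE Position = 'Supervisor'   -- first test, applied to each tuple as it is read
  AND Salary > 2000;            -- second test, applied in the same pass

A pipelined executor evaluates both predicates on every tuple as it streams in, so the intermediate relation R1 is never materialized.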


Chapter three
Database Integrity and Security
Topics:
 Database Integrity Rules

 Database Security Issues


o Discretionary access control
o Mandatory access control

o Statistical database security

In today's society, information is a critical resource not only in the fields of


industry, commerce, education, or medicine, but also in the fields of
military, diplomacy, or government. Some information is so important that it has to be protected. For example, data corruption or
fabrication in a hospital database could result in patients' receiving the
wrong medication. Disclosure or modification of military information
could endanger national security.
 Privacy – Ethical and legal rights that individuals have with regard to control over the dissemination and use of their personal information
 Database security – Protection of information contained in the
database against unauthorized access, modification or destruction
 Database integrity – Mechanism that is applied to ensure that the
data in the database is correct and consistent

A good database security management system has not only the following
characteristics: data independence, shared access, minimal redundancy,
data consistency, and data integrity but also the following characteristics:
privacy, integrity, and availability.
 Privacy signifies that an unauthorized user cannot disclose data
 Integrity ensures that an unauthorized user cannot modify data
 Availability ensures that data be made available to the authorized
user unfailingly
 Copyright ensures the native rights of individuals as a creator of
information.
 Validity ensures activities to be accountable by law.


With strong enforcement and management of these, it is said that the database system can effectively prevent accidental security and integrity threats arising from system errors, improper authorization and concurrent-usage anomalies. In addition, to have an efficient system, it should also prevent malicious or intentional security and integrity threats, in which a computer system operator, or a programmer acting as a hacker, can bypass security.

There are certain security policy issues that we should recognize: we should consider administrative control policies, decide which security features offered by the DBMS are used to implement the system, and decide whether the focus of security administration is left with the DBA and whether it is centralized or decentralized. Besides, one should also decide on the ownership of shared data.

When we talk about the levels of security protection, it may start from organization and administrative security, then physical and personnel security, communication security, and information systems security.

Database security and integrity is about protecting the database from being
inconsistent and being disrupted. We can also call it database misuse.

Database misuse could be Intentional or Accidental, where accidental


misuse is easier to cope with than intentional misuse.
Accidental inconsistency could occur due to:
 System crash during transaction processing
 Anomalies due to concurrent access
 Anomalies due to redundancy
 Logical errors

Likewise, even though there are various threats that could be categorized in this group, intentional misuse could be:
 Unauthorized reading of data
 Unauthorized modification of data or
 Unauthorized destruction of data

Most systems implement good Database Integrity to protect the system


from accidental misuse while there are many computer based measures to


protect the system from intentional misuse; these are termed Database Security measures.

Levels of Security Measures


Security measures can be implemented at several levels and for different
components of the system. These levels are:

1. Physical Level: concerned with securing the site containing the


computer system. The backup systems should also be physically
protected from access except for authorized users. In other words, the
site or sites containing the computer systems must be physically
secured against armed or sneaky entry by intruders.
2. Human Level: concerned with authorization of database users for
access the content at different levels and privileges.
3. Operating System: concerned with the weaknesses and strengths of the operating system's security on data files. No matter how secure the database system is, a weakness in operating system security may serve as a means of unauthorized access to the database. This also includes protection of data in primary and secondary memory from unauthorized access.
4. Database System: concerned with data access limits enforced by the database system, such as passwords, isolated transactions, etc. Some database system users may be authorized to access only a limited portion of the database. Other users may be allowed to issue queries, but may be forbidden to modify the data. It is the responsibility of the database system to ensure that these authorization restrictions are not violated.
5. Application Level: Since almost all database systems allow remote
access through terminals or networks, software-level security with the
network software is as important as physical security, both on the
Internet and networks private to an enterprise.

Even though we can have different levels of security and authorization on data objects and users, deciding who accesses which data is a policy matter rather than a technical one.


These policies
 should be known by the system: they should be encoded in the system
 should be remembered: they should be saved somewhere (in the system catalogue)

 Database Integrity constraints contribute to maintaining a


secure database system by preventing data from becoming
invalid and hence giving misleading or incorrect results

 Domain Integrity means that each column in any table will have a set of allowed values and cannot assume any value other than those specified in its domain.

 Entity Integrity means that in each table the primary key


(which may be composite) satisfies both of two conditions:
1. That the primary key is unique within the table and
2. That the primary key column(s) contains no null values.

 Referential Integrity means that in the database as a whole,


things are set up in such a way that if a column exists in two
or more tables in the database (typically as a primary key in
one table and as a foreign key in one or more other tables),
then any change to a value in that column in any one table
will be reflected in corresponding changes to that value where
it occurs in other tables. This means that the RDBMS must be
set up so as to take appropriate actions to spread a change—in
one table—from that table to the other tables where the
change must also occur.

The effect of the existence and maintenance of referential


integrity is, in short, that if a column exists in two or more
tables in the database, every occurrence of the column will
contain only values that are consistent across the database.

 Key constraints: in a relational database, there are certain collections of attributes with a special role in maintaining the integrity of the database. These attributes are named Primary Key, Candidate Key, Foreign Key, and so on. These key(s) should obey rules set by the relational data model.
 Enterprise Constraint means some business rules set by the
enterprise on how to use, manage and control the database

 Database Security - the mechanisms that protect the database


against intentional or accidental threats. Database security
encompasses hardware, software, people and data.

Database Management Systems supporting multi-user database


system must provide a database security and authorization
subsystem to enforce limits on individual and group access
rights and privileges.

Security Issues and general considerations


 Legal, ethical and social issues regarding the right to
access information
 Physical control issues regarding how to keep the database
physically secured.
 Policy issues regarding privacy of individual level at
enterprise and national level
 Operational consideration on the techniques used
(password, etc) to access and manipulate the database
 System level security including operating system and
hardware control
 Security levels and security policies in enterprise level

The designer and the administrator of a database should first


identify the possible threat that might be faced by the system in
order to take counter measures.


Threat may be any situation or event, whether intentional or


accidental, that may adversely affect a system and consequently
the organization

 A threat may be caused by a situation or event involving a


person, action, or circumstance that is likely to bring harm to
an organization
 The harm to an organization may be tangible or intangible
Tangible – loss of hardware, software, or data

Intangible – loss of credibility or client confidence

Examples of threats:
 Using another person's means of access
 Unauthorized amendment/modification or copying of data
 Program alteration
 Inadequate policies and procedures that allow a mix of
confidential and normal out put
 Wire-tapping
 Illegal entry by hacker
 Blackmail
 Theft of data, programs, and equipment
 Failure of security mechanisms, giving greater access than
normal
 Staff shortages or strikes
 Inadequate staff training
 Viewing and disclosing unauthorized data
 Electronic interference and radiation
 Data corruption owing to power loss or surge
 Fire (electrical fault, lightning strike, arson), flood, bomb
 Physical damage to equipment
 Breaking cables or disconnection of cables
 Introduction of viruses
 An organization deploying a database system needs to
identify the types of threat it may be subjected to and initiate


appropriate plans and countermeasures, bearing in mind the


costs of implementing each.

Countermeasures: Computer Based Controls


 The types of countermeasure to threats on computer systems
range from physical controls to administrative procedures

 Despite the range of computer-based controls that are available, it


is worth noting that, generally, the security of a DBMS is only as
good as that of the operating system, owing to their close
association

 The following are computer-based security controls for a


multi-user environment:
 Authorization
 The granting of a right or privilege that enables a subject to
have legitimate access to a system or a system‘s object
 Authorization controls can be built into the software, and
govern not only what system or object a specified user can
access, but also what the user may do with it
 Authorization controls are sometimes referred to as access
controls
 The process of authorization involves authentication of
subjects (i.e. a user or program) requesting access to objects
(i.e. a database table, view, procedure, trigger, or any other
object that can be created within the system)

 Views
 A view is the dynamic result of one or more relational operations on the base relations to produce another relation
 A view is a virtual relation that does not actually exist in the
database, but is produced upon request by a particular user
 The view mechanism provides a powerful and flexible
security mechanism by hiding parts of the database from
certain users


 Using a view is more restrictive than simply having certain


privileges granted to a user on the base relation(s)
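
As a minimal sketch (table, column and user names here are hypothetical), a view can expose only the non-sensitive columns of a base relation, and privileges can then be granted on the view instead of the base table:

CREATE VIEW StaffPublic AS
SELECT EmpID, FName, LName, Dept
FROM Employee;    -- Salary and DoB remain hidden

GRANT SELECT ON StaffPublic TO clerk;    -- the clerk never sees the base table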

 Backup and recovery


 Backup is the process of periodically taking a copy of the
database and log file (and possibly programs) on to offline
storage media
 A DBMS should provide backup facilities to assist with the
recovery of a database following failure
 Database recovery is the process of restoring the database to a
correct state in the event of a failure
 Journaling is the process of keeping and maintaining a log file
(or journal) of all changes made to the database to enable
recovery to be undertaken effectively in the event of a failure
 The advantage of journaling is that, in the event of a failure,
the database can be recovered to its last known consistent
state using a backup copy of the database and the information
contained in the log file
 If no journaling is enabled on a failed system, the only means
of recovery is to restore the database using the latest backup
version of the database
 However, without a log file, any changes made after the last
backup to the database will be lost

 Integrity
 Integrity constraints contribute to maintaining a secure
database system by preventing data from becoming invalid
and hence giving misleading or incorrect results
 Domain Integrity: setting the allowed set of values
 Entity integrity: demanding Primary key values not to assume
a NULL value
 Referential integrity: enforcing Foreign Key values to have a value that already exists in the corresponding Candidate Key attribute(s), or be NULL.
 Key constraints: the rules the Relational Data Model has on
different kinds of Key.


Encryption
Authorization may not be sufficient to protect data in database
systems, especially when there is a situation where data should be
moved from one location to the other using network facilities.

Encryption is used to protect information stored at a particular site or


transmitted between sites from being accessed by unauthorized
users.
Encryption is the encoding of the data by a special algorithm that
renders the data unreadable by any program without the decryption
key.
It is not possible for encrypted data to be read unless the reader
knows how to decipher/decrypt the encrypted data.

 If a database system holds particularly sensitive data, it may be


deemed necessary to encode it as a precaution against possible
external threats or attempts to access it
 The DBMS can access data after decoding it, although there is a
degradation in performance because of the time taken to
decode it
 Encryption also protects data transmitted over communication
lines
 To transmit data securely over insecure networks requires the
use of a Cryptosystem, which includes:
1. An encryption key to encrypt the data (plaintext)
2. An encryption algorithm that, with the encryption key,
transforms the plaintext into ciphertext
3. A decryption key to decrypt the ciphertext
4. A decryption algorithm that, with the decryption key,
transforms the ciphertext back into plaintext

 Data encryption standard (DES) is an approach which does


both a substitution of characters and a rearrangement of their
order based on an encryption key.


Types of Cryptosystems

 Cryptosystems can be categorized into two


1. Symmetric encryption – uses the same key for both
encryption and decryption and relies on safe communication
lines for exchanging the key.
2. Asymmetric encryption – uses different keys for encryption
and decryption e.g. RSA
 Generally, symmetric algorithms are much faster to execute on a computer than asymmetric ones. On the other hand, asymmetric algorithms are more secure than symmetric algorithms.

 RAID technology (Redundant Array of Independent Disks)

 The hardware that the DBMS is running on must be fault-


tolerant, meaning that the DBMS should continue to operate
even if one of the hardware components fails. This suggests
having redundant components that can be seamlessly
integrated into the working system whenever there is one or
more component failures. The main hardware components
that should be fault-tolerant include disk drives, disk
controllers, CPU, power supplies, and cooling fans. Disk
drives are the most vulnerable components with the shortest
times between failures of any of the hardware components.
 RAID works on having a large disk array comprising an
arrangement of several independent disks that are organized
to improve reliability and at same time increase performance
 Performance is increased through data striping
 Data striping – the data is segmented into equal size
partitions (the striping unit) which are transparently
distributed across multiple disks.
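
For example (hypothetical figures): with a striping unit of 64 KB and an array of 4 disks, stripe i of a file is placed on disk i mod 4, so a 256 KB request can be served by all four disks reading in parallel.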


Security at different Levels of Data


Almost all RDBMSs provide security at different levels and
formats of data. This includes:
1. Relation Level: permission to have access to a specific
relation.
2. View Level: permission to data included in the view
and not in the named relations
3. Hybrid (Relation/View): the case where only part of a
single relation is made available to users through View.

Any database access request will have the following three major
components
1. Requested Operation: what kind of operation is requested
by a specific query?
2. Requested Object: on which resource or data of the database
is the operation sought to be applied?
3. Requesting User: who is the user requesting the operation
on the specified object?

The database should be able to check for all the three


components before processing any request. The checking is
performed by the security subsystem of the DBMS.

Authentication
 All users of the database will have different access levels and
permission for different data objects, and authentication is the
process of checking whether the user is the one with the
privilege for the access level.
 It is the process of checking that users are who they say they are.
 Each user is given a unique identifier, which is used by the
operating system to determine who they are


 Thus the system will check whether the user with a specific
username and password is trying to use the resource.
 Associated with each identifier is a password, chosen by the
user and known to the operation system, which must be
supplied to enable the operating system to authenticate who
the user claims to be

Authorization/Privilege
Authorization refers to the process that determines the mode in
which a particular (previously authenticated) client is allowed to
access a specific resource controlled by a server.
Most of the time, authorization is implemented by using Views.
 Views are unnamed relations containing part of one or more
base relations creating a customized/personalized view for
different users.
 Views are used to hide data that a user needs not to see.

Forms of user authorization


There are different forms of user authorization on the resource of the
database. These forms are privileges on what operations are allowed
on a specific data object.

User authorization on the data/extension


1. Read Authorization: the user with this privilege is allowed
only to read the content of the data object.
2. Insert Authorization: the user with this privilege is allowed
only to insert new records or items to the data object.
3. Update Authorization: users with this privilege are allowed to
modify content of attributes but are not authorized to delete the
records.
4. Delete Authorization: users with this privilege are only
allowed to delete a record and not anything else.


 Different users, depending on the power of the user, can have one
or the combination of the above forms of authorization on
different data objects.
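
In SQL these forms of authorization map onto the standard privileges; a minimal sketch (relation and user names are hypothetical):

GRANT SELECT ON Employee TO user1;            -- read authorization
GRANT INSERT ON Employee TO user2;            -- insert authorization
GRANT UPDATE (Salary) ON Employee TO user3;   -- update authorization, restricted to one column
GRANT DELETE ON Employee TO user4;            -- delete authorization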

User authorization on the database schema

1. Index Authorization: deals with permission to create as well as delete an index on a relation.

2. Resource Authorization: deals with permission to add/create a new relation in the database.

3. Alteration Authorization: deals with permission to add as well as delete attributes.

4. Drop Authorization: deals with permission to delete an existing relation.
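
The operations these schema-level authorizations govern look as follows in SQL (a sketch with hypothetical names; the privilege syntax itself varies between DBMSs):

CREATE INDEX idx_emp_dept ON Employee (Dept);          -- requires index authorization
CREATE TABLE ProjectArchive (ProjID INT PRIMARY KEY);  -- requires resource authorization
ALTER TABLE Employee ADD COLUMN Phone VARCHAR(15);     -- requires alteration authorization
DROP TABLE ProjectArchive;                             -- requires drop authorization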

Role of DBA in Database Security

The database administrator is responsible for making the database as secure as possible. For this, the DBA should have more powerful privileges than every other user. The DBA provides capabilities for database users while they access the content of the database.

The major responsibilities of DBA in relation to authorization of


users are:
1. Account Creation: involves creating different accounts for
different USERS as well as USER GROUPS.

2. Security Level Assignment: involves assigning different users to different categories of access levels.


3. Privilege Grant: involves giving different levels of privileges


for different users and user groups.

4. Privilege Revocation: involves denying or canceling


previously granted privileges for users due to various
reasons.

5. Account Deletion: involves deleting an existing account of a user or user group. It is similar to denying all privileges of those users on the database.
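
A minimal sketch of these responsibilities in SQL (user and object names are hypothetical; CREATE USER/DROP USER syntax varies between DBMSs):

CREATE USER hr_clerk IDENTIFIED BY 'secret';    -- account creation
GRANT SELECT, INSERT ON Employee TO hr_clerk;   -- privilege grant
REVOKE INSERT ON Employee FROM hr_clerk;        -- privilege revocation
DROP USER hr_clerk;                             -- account deletion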


Approaches to Database Security

There are two broader approaches to security. The two types


of database security mechanisms are:
1. Discretionary security mechanisms
 Grant different privileges to different users and
user groups on various data objects
 The privilege is to access different data objects
 The mode of the privilege could be
 Read,
 Insert,
 Delete,
 Update files, records or fields.
 Is more flexible
 One user can have A but not B and another user
can have B but not A

2. Mandatory security mechanisms


 Enforce multilevel security
 classifying data and users into various
security classes (or levels) and
implementing the appropriate security
policy of the organization.
 Each data object will have certain classification
level
 Each user is given certain clearance level
 Only users who can pass the clearance level can
access the data object
 Is comparatively inflexible/rigid
 If one user can have A but not B, then B is accessed only by users with higher privilege, and we cannot have a user with B but not A


The ability to classify users into a hierarchy of groups provides a powerful tool for administering large systems with thousands of users and objects.

A database system can support one or both of the security


mechanisms to protect the data.

In most systems it is better to explicitly identify what is allowed rather than what is not allowed: if some object is authorized, then it is not constrained, and everything else remains protected by default.


Statistical Database Security


 Statistical databases contain information about
individuals which may not be permitted to be seen by
others as individual records.

 Such databases may contain information about various


populations.

 Example: Medical Records, Personal Data like address,


salary, etc

 Such kind of databases should have special security


mechanisms so that confidential information about people will not be disclosed to other users.

 Thus statistical databases should have additional security


techniques which will protect the retrieval of individual
records.

 Only queries with statistical aggregate functions like


Average, Sum, Min, Max, Standard Deviation, Median, Count, etc. should be executed.

 Queries retrieving confidential attributes should be


prohibited.

 To keep the user from making inferences about individuals from the retrieved data, one can also enforce a constraint on the minimum number of records or tuples in the resulting relation by setting a threshold.
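
As a sketch (relation name and threshold value are hypothetical), a statistical database would accept the first query below and reject the second:

-- Permitted: an aggregate over a population, guarded by a minimum group size
SELECT AVG(Salary)
FROM MedicalStaff
WHERE City = 'Addis'
HAVING COUNT(*) >= 10;   -- threshold against inference on very small groups

-- Prohibited: retrieves a confidential attribute of an individual
SELECT Salary
FROM MedicalStaff
WHERE FName = 'Abebe';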


Chapter Four

Distributed Database Systems

o Before the development of database technology, every


application used to have its own data in the application logic.

o Database development facilitates the integration of data


available in an organization from a number of applications and
enforces security on data access on a single local site. But it is
not always the case that organizational data reside in one
central site. This demands that databases at different sites be integrated and synchronized with all the facilities of the database approach.

o This will be made possible by computer networks and data communication, optimized by the internet, mobile and wireless computing, and intelligent devices.

o This leads to Distributed Database Systems.

o The decentralization approach to database mirrors the natural organizational structure of companies, which are logically distributed into divisions, departments, projects and so on, and physically distributed into offices, plants and factories, each of which maintains its own operational data.

o Distributed Database is not a centralized database.


Data Distribution Strategies


[Figure: Centralized DB vs. Distributed DB]
 Distributed DB stores logically related shared data and metadata
at several physically independent sites connected via network
 Distributed DBMS is the software system that permits the
management of a Distributed DB and makes the distribution
transparent to the user.
 Data allocation is the process of deciding where to allocate/store
particular data item.
 There are 3 data allocation strategies:
1. Centralized: the entire DB is located at a single
site. And computers access through the network.
Known as distributed processing.
2. Partitioned: the DB is split into several disjoint
parts (called partitions, segments or fragments)
and stored at several sites
3. Replicated: copies of one or more partitions are
stored at several sites
3.1 Selective: combines fragmentation (with locality of reference for data that is less frequently updated), replication and centralization, as appropriate for the data.


3.2 Complete: a copy of the database is made available at each site. Snapshot is one method used here.

 In a distributed database system, the database is logically


stored as single database but physically fragmented on several
computers. The computers in a distributed system
communicate with each other through various communication
media, such as high-speed buses or telephone lines.

 A distributed database system has the following components.

o Local DBMS
o Distributed DBMS (DDBMS)
o Global System Catalog(GSC)
o Data communication (DC)

 A distributed database system consists of a collection of sites,


each of which maintains a local database system (Local DBMS)
but each local DBMS also participates in at least one global
transaction where different databases are integrated together.

o Local Transaction: transactions that access data only in


that single site
o Global Transaction: transactions that access data in
several sites.

 Parallel DBMS: a DBMS running across multiple processors and disks that is designed to execute operations in parallel, whenever possible, in order to improve performance.
Three architectures for parallel DBMS:
 Shared Memory: fast data access for a limited number of processors.
 Shared Disk: for applications that are inherently centralized.
 Shared Nothing: massively parallel.


 What makes DDBMS different is that


o The various sites are aware of each other
o Each site provides a facility for executing both local and
global transactions.

 The different sites can be connected physically in different


topologies.
o Fully Connected Network,
o Partially Connected,
o Tree Network,
o Star Network and
o Ring Network
 The differences between these topologies are based on:
o Installation Cost: cost of linking sites physically.
o Communication Cost: cost to send and receive messages
and data
o Reliability: resistance to failure
o Availability: degree to which data can be accessed
despite the failure.
 The distribution of the database sites could be:
1. Large Geographical Area: Long-Haul Network
 relatively slow
 less reliable
 uses telephone line, microwave, satellite

2. Small Geographical Area: Local Area Network


 higher speed
 lower rate of error
 use twisted pair, base band coaxial, broadband
coaxial, fiber optics

 Even though integration of data implies centralized storage and


control, in distributed database systems the intention is
different. Data is stored in different database systems in a
decentralized manner but act as if they are centralized through
development of computer networks.


 A distributed database system consists of loosely coupled sites


that share no physical component and database systems that
run on each site are independent of each other.

 Those which share physical components are known as Parallel


DBMS.

 Transactions may access data at one or more sites

 Organization may implement their database system on a


number of separate computer system rather than a single,
centralized mainframe. Computer Systems may be located at
each local branch office.

Functions of a DDBMS
 DDBMS have the following functionality.
 Extended Communication Services to provide
access to remote sites.
 Extended Data Dictionary – to store data distribution details; hence the need for a global system catalog.
 Distributed Query Processing – optimization of queries involving remote data access.
 Extended security- access control to a distributed
data
 Extended Concurrency Control –maintain
consistency of replicated data.
 Extended Recovery Services- failures of individual
sites and the communication line.


Issues in DDBMS

How is data stored in DDBMS?


There are several ways of storing a single relation in distributed
database systems.

 Replication:
o System maintains multiple copies of similar data
(identical data)
o Stored in different sites, for faster retrieval and fault
tolerance.
o Duplicate copies of the tables can be kept on each system
(replicated). With this option, updates to the tables can
become involved (of course the copies of the tables can be
read-only).
o Advantage: Availability, Increased parallelism (if only
reading)
o Disadvantage: increased overhead of update

 Fragmentation:
o Relation is partitioned into several fragments stored in
distinct sites
 The partitioning could be vertical, horizontal or
both.

o Horizontal Fragmentation
 Systems can share the responsibility of storing
information from a single table with individual
systems storing groups of rows
 Performed by the Selection Operation
 The whole content of the relation is reconstructed
using the UNION operation


o Vertical Fragmentation
 Systems can share the responsibility of storing
particular attributes of a table.
 Needs an attribute carrying tuple identity (the primary key value is repeated in each fragment)
 Performed by the Projection Operation
 The whole content of the relation is reconstructed using the Natural JOIN operation on the attribute carrying tuple identity (the primary key values); see the SQL sketch after the correctness rules below


o Both (hybrid fragmentation)


 A system can share the responsibility of storing
particular attributes of a subset of records in a given
relation.
 Performed by projection then selection or selection then
projection relational algebra operators.
 Reconstruction is made by combined effect of Union and
natural join operators

Fragmentation is correct if it fulfils the following


1. Complete: - a data item must appear in at least one
fragment of a given relation R (R1, R2…Rn).
2. Reconstruction:- it must be possible to reconstruct a
relation from the fragments
3. Disjointness: - a data item should only be found in a
single fragment except for vertical fragmentation (the
primary key is repeated for reconstruction).
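
A minimal SQL sketch of fragment reconstruction (fragment and attribute names are hypothetical):

-- Horizontal fragments of EMPLOYEE, each defined by a selection:
--   EmpAddis holds the tuples with Branch = 'Addis', EmpDire those with Branch = 'Dire'
SELECT * FROM EmpAddis
UNION
SELECT * FROM EmpDire;      -- reconstruction by UNION

-- Vertical fragments of EMPLOYEE, each defined by a projection
-- (the primary key EmpID is repeated in both):
--   EmpName(EmpID, FName, LName) and EmpPay(EmpID, Salary, Dept)
SELECT *
FROM EmpName NATURAL JOIN EmpPay;   -- reconstruction by natural join on EmpID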


 Data transparency:
The degree to which system user may remain unaware of the
details of how and where the data items are stored in a
distributed system.

1. Distribution transparency: even though there are many systems, they appear as one, i.e. are seen as a single, logical entity.
2. Replication transparency: copies of data floating around everywhere also seem like just one copy to the developers and users.
3. Fragmentation transparency: a table that is actually stored in parts across many sites may seem like just a single table at a single site.
4. Location transparency: the user doesn't need to know where a data item is physically located.

How does it work?


Distributed computing can be difficult to implement, particularly for
replicated data that can be updated from many systems. In order to
operate a distributed database system has to take care of

 Distributed Query Processing

 Distributed Transaction Management

 Replicated Data Management: if you are going to have copies of data on many machines, how often does the data get updated when it is changed on another system? Who is in charge of propagating the updates to the data?

 Distributed Database Recovery: if one machine goes down, how does that affect the others?

 Security: Just like any computer network, a distributed system


needs to have a common way to validate users entering from
any computer in the network of servers.


 Common Data Dictionary: your schema now has to be distributed and work in connection with schemas created on many systems.

Homogeneous and Heterogeneous


Distributed Databases
 In a homogeneous distributed database
 All sites have identical software (DBMS)
 Are aware of each other and agree to cooperate in
processing user requests.
 Each site surrenders part of its autonomy in terms of right
to change schemas or software
 Appears to the user as a single system

 In a heterogeneous distributed database


 Different sites may use different schemas and software
(DBMS)
 Difference in schema is a major problem for query
processing
 Difference in software is a major problem for
transaction processing
 Sites may not be aware of each other and may provide
only limited facilities for cooperation in transaction
processing.
 May need gateways to interface one another.


Why DDBMS/Advantages
1. Many existing systems
 Maybe you have no choice.
 Possibly there are many different existing systems, with possibly different kinds of systems (Oracle, Informix, others), that need to be used together.
2. Data sharing and distributed control:
 A user at one site may be able to access data that is available at another site.
 Each site can retain some degree of control over local data
 We will have local as well as global database administrators
3. Reliability and availability of data
 If one site fails, the rest can continue operation as long as a transaction does not demand data from the failed site, or the data it needs is replicated at other sites
4. Speedup of query processing
 If a query involves data from several sites, it may be possible to split the query into sub-queries that can be executed at several sites, which is parallel processing
 A query can be sent to the least heavily loaded site
5. Expansion (Scalability)
 In a distributed environment you can easily expand by adding more machines to the network.

Disadvantages of DDBMS
1. Software Development Cost
 Is difficult to install, thus is costly
2. Greater Potential for Bugs
 Parallel processing may endanger correctness of algorithms
3. Increased Processing Overhead
 Exchange of messages between sites – high communication latency
 Due to communication protocol overhead


4. Communication problems
5. Increased Complexity and Data Inconsistency
Problems
 Since clients can read and modify closely related data stored
in different database instances concurrently.

6. Security Problems: network and replicated data security.

Query Processing and Transaction


Management in DDBMS

Query Processing
There are different strategies to process a specific query, which in
turn increase the performance of the system by minimizing
processing time and cost. In addition to the cost estimates we have
for a centralized database (disk access, relation size, etc), we have to
consider the following in distributed query processing:

o Cost of data transmission over the huge network


o Gain of parallel processing of a single query

For the case of Replicated data allocation, even though parallel


processing is used to increase performance, update will have a great
impact since all the sites containing the data item should be updated.

For the case of fragmentation, update works more like in the centralized database, but reconstruction of the whole relation will require accessing data from all sites containing part of the relation.

Let the distributed database has three sites (S1, S2, and S3). And two
relations, EMPLOYEE and DEPARTMENT are located at S1 and S2
respectively without any fragmentation. And a query is initiated
from S3 to retrieve employees (First Name, 15 bytes long; Last Name, 15 bytes long; and Department Name, 10 bytes long; a total of 40 bytes per result record) together with the department they are working in.

Let:
For EMPLOYEE we have the following information:
 10,000 records
 each record is 100 bytes long
For DEPARTMENT we have the following information:
 100 records
 each record is 35 bytes long
There are three ways of executing this query:
1. Transfer both EMPLOYEE and DEPARTMENT to S3 and perform the join there: needs a transfer of 10,000*100 + 100*35 = 1,003,500 bytes.
2. Transfer DEPARTMENT to S1, perform the join there (the result is 40*10,000 = 400,000 bytes) and transfer the result to S3: we need 3,500 + 400,000 = 403,500 bytes to be transferred.
3. Transfer EMPLOYEE to S2, perform the join there (the result is 40*10,000 = 400,000 bytes) and transfer the result to S3: we need 1,000,000 + 400,000 = 1,400,000 bytes to be transferred.

Then one can select the strategy that minimizes the data transfer cost for this specific query (here, the second strategy, at 403,500 bytes). Other optimization steps may also be included to make the processing more efficient by reducing the size of the relations using projection.

Transaction Management
Transaction is a logical unit of work constituted by one or more
operations executed by a single user. A transaction begins with the
user's first executable query statement and ends when it is committed
or rolled back.

A Distributed Transaction is a transaction that includes one or


more statements that, individually or as a group, update data on two
or more distinct nodes of a distributed database.


Representation of Query in Distributed Database

SQL Statement:  SELECT * FROM dept@sales.midroc.telecom.et;
Object:         dept
Database:       sales
Domain:         midroc.telecom.et

There are two types of transaction in DDBMS to access data from


other sites:

1. Remote Transaction: contains only statements that access a


single remote node. Thus, Remote Query statement is a query
that selects information from one or more remote tables, all of
which reside at the same remote node or site.

o For example, the following query accesses data from the dept
table in the Addis schema (the site) of the remote sales database:

o SELECT * FROM Addis.dept@sales.midroc.telecom.et;

o A remote update statement is an update that modifies data in


one or more tables, all of which are collocated at the same
remote node. For example, the following query updates the
branch table in the Addis schema of the remote sales database:

UPDATE Addis.dept@sales.midroc.telecom.et
SET loc = 'Arada'
WHERE BranchNo = 5;

2. Distributed Transaction: contains statements that access more


than one node.

o A distributed query statement retrieves information from two


or more nodes.
o If all statements of a transaction reference only a single remote
node, the transaction is remote, not distributed.


o A database must guarantee that all statements in a transaction,


distributed or non-distributed, either commit or roll back as a
unit. The effects of an ongoing transaction should be invisible
to all other transactions at all nodes; this transparency should
be true for transactions that include any type of operation,
including queries, updates, or remote procedure calls.
o For example, the following query accesses data from the local
database as well as the remote sales database:

SELECT ename, dname
FROM Awassa.emp AW, Addis.dept@sales.midroc.telecom.et AD
WHERE AW.deptno = AD.deptno;
{Employee data is stored in Awassa and sales data is stored in Addis; there is an employee responsible for each sale}

Analyze the following types of transactions:

 Remote query

select client_nm
from clients@accounts.motorola.com;

 Distributed query

select project_name, student_nm
from intership@accounts.motorola.com i, student s
where s.stu_id = i.stu_id;

 Remote Update

update intership@accounts.motorola.com
set stu_id = '242'
where stu_id = '200';


 Distributed Update

update intership@accounts.motorola.com
set stu_id = '242'
where stu_id = '200';

update student
set stu_id = '242'
where stu_id = '200';

commit;
Concurrency Control
o There are various techniques used for concurrency control in
centralized database systems. The techniques in distributed
database system are similar with the centralized approach with
additional implementation requirements or modifications.
o The main difference or the change that should be incorporated is
the way the lock manager is implemented and how it functions.
o There are different schemes for concurrency control in DDBS

1. Non-Replicated Scheme
o No data is replicated in the system
o All sites will maintain a local lock manager (local lock and
unlock)
o If site Si needs a lock on data in site Sj, it sends a message to the lock manager of site Sj and the locking will be handled by site Sj
o All the locking and unlocking principles are handled by
the local lock manager in which the data object resides.
o Is simple to implement
o Need three message transfers
o To request a lock
o To notify grant of lock
o To request unlock


2. Single Coordinator Approach


o The system choose one single lock manager that resides in
one of the sites (Si)
o All locks and unlocks requests are made at site Si where
the lock manager resides(Si)
o Is simple to implement
o Needs two message transfers
o To request a lock
o To request unlock
o Simple deadlock handling
o Could be a bottleneck since all processes are handled at
one site
o Is vulnerable/at risk if the site with the lock manager fails

In general there are three varieties of the 2PL (Two phase locking)
protocol in the DDBMS environment. Implementing the basic 2PL in
distributed systems assumes that data is distributed across multiple
machines.

Centralized 2PL:
 A single site responsible for granting and releasing locks
 Each site‘s transaction manager communicates with this
centralized lock manager and with its own local data manager
 Has only one lock manager for the entire site.
Primary 2PL:
 Each replicated data item is assigned a primary copy; the lock
manager on that primary copy is responsible for granting and
releasing locks(distributed locking) and updates are propagated
as soon as possible to the slave copies
 Distributes the lock manager to a number of sites
Distributed 2PL:
 Assumes data is completely replicated
 The schedulers (Lock managers) at each site are responsible in
granting and releasing locks as well as forwarding operations to
the local data manager.
 Distributes the lock manager to every site in the DDBS.
 Has communication overhead than the others


Chapter Five

Introduction to Object-Oriented
Database Systems
Object Orientation
• Object Orientation
• Set of design and development principles
• Based on autonomous computer structures known as
objects
• Software is constructed from standard reusable components
• OO Contribution areas
• Programming Languages
• Graphical User Interfaces
• Databases
• Design
• Operating Systems

Evolution of OO Concepts
• Concepts stem from object-oriented programming
languages (OOPLs)
• Ada, ALGOL, LISP, SIMULA
• OOPLs goals
• Easy-to-use development environment
• Powerful modeling tools for development
• Decrease in development time
• Make reusable code
• OO Attributes
• Data set not passive
• Data and procedures bound together
• Objects can act on itself


Advanced Database Applications


The application of database is being highly practiced in complex
applications demanding different way of modeling and designing.
Some of these applications are:
• Computer Aided Design (CAD)
• Computer Aided Manufacturing
• Computer Aided Software Engineering
• Network Management System
• Multimedia System
• Digital Publication
• Geographic Information System
• Interactive and Dynamic Web Sites

Drawbacks/Weaknesses of Relational DBMS


In addition to the emergence of many advanced database
application areas, there were some drawbacks on the relational
database management system.

1. Poor representation of 'real world' entities
 Relations do not correspond to real world objects
 Normalization creates relations which don't correspond to real world objects

2. Semantic overloading (semantically Overloaded)


 Representation of complex relationship is difficult
 Eg. M:N relationship is represented by adding one
additional relation (making the relation an entity)
 One cannot distinguish a relation from a
relationship.
 Difficult to distinguish different types of
relationship.

3. Poor support for integrity and enterprise constraints


 Many commercial DBMSs do not support all integrity constraints


 The relational model does not support enterprise constraints; they have to be enforced through the DBMS or the application

4. Homogeneous data structure


 Has vertical and horizontal homogeneity
 Horizontal Homogeneity: Each tuple of a
relation have same number and type of
attributes
 Vertical Homogeneity: Values of an attribute
should come from same domain.
 Field value should be atomic

5. Limited operations
 Relational model has fixed operations on the data
 Does not allow additional/new operations

6. Difficulty in handling recursive queries


 Direct and indirect recursive queries of a single
relation cannot be represented.
 Query to extract information from recursive
relationship between tuples of same entity is difficult
to represent.

7. Impedance mismatch
 Mixing of different programming paradigms
 Mismatch between the languages used

8. Poor navigational process


 Access is based on navigating individual records from
different relations

Object oriented database management systems


definitions
The definition of an OODBMS has three main components: the definitions of the object oriented data model (OODM), the object oriented database (OODB) and the object oriented database management system (OODBMS).

An Object Oriented Data Model (OODM) is a logical data model


that captures the semantics of object supported in the object oriented
programming.
An Object Oriented Data Base (OODB) is a persistent and sharable
collection of objects defined by an object oriented data model
(OODM).
An OODBMS (Object Oriented Database Management System) is a manager of an OODB.

OO Concepts
• Object is a uniquely identifiable entity that contains both the attributes that describe the state of the 'real world' object and the actions that are associated with it.
• OODBMS can manage complex, highly interrelated
information.
• Abstract representation of a real-world entity
• Unique identity
• Embedded properties
• Ability to interact with other objects and self
OID (Object Identity)
• Each object is unique in OO systems
• Unique to object
• Not a primary key (a PK is unique only within a relation, and a PK is selected from attributes, making it dependent on the state)
• Is invariant (will not change)
• Independent of values of attributes ( two objects can have
same state but will have different OID)
• Is invisible to users
• Entity integrity is enforced
• Relationship: embedding the OID of one object into the other (e.g. embed the OID of a branch into an employee object)


• Advantage of using OID:


• Are efficient
• Are fast
• Cannot be modified by users (system generated)
• Independent of content

Attributes
• Called instance variables
• Domain

Object state
• Object values at any given time
• Values of attributes at any given point in time.

Methods


• Code/function that performs operation on object‘s data


• Has name and body
• Methods define the behavior of object
• Can be used to change the state of the object by
modifying attribute values

Messages
• Means by which objects communicate
• Request from one object to the other to activate one of its
methods
• Invokes method/calls method to be applied
• Sent to object from real world or other object
• Notation: Object.Method
• Eg: StaffObject.updateSalary(salary)


Classes
• Blueprint for defining similar objects
• Objects with similar attributes and respond to same
message are grouped together
• Defined only once and used by many objects
• Collection of similar objects
• Shares attributes and structure
• Objects in a class are called instances of the class
• Eg: Class Definition : defining the class BRANCH

BRANCH
Attributes
  branchNo
  street
  city
  postcode
Methods
  Print()
  getPostCode()
  numberOfStaff()

Objects of the BRANCH class

branchNo=B005      branchNo=B007      branchNo=B003
street=st1         street=st2         street=st2
city=Addis         city=Dire          city=Bahirdar
postcode=1425      postcode=452       postcode=85


Object Characteristics

Class Hierarchy
• Super class
• Subclass
Inheritance
• Ability of object to inherit the data structure and behavior
of classes above it
• Single inheritance – class has one immediate super class
• Multiple – class has more than one immediate super class


Method Overriding
• Method redefined at subclass

Polymorphism
• Allows different objects to respond to same message in
different ways. i.e. Use the same method differently.

Object Classification
• Simple
 Only single-valued attributes
 No attributes refer to other objects

• Composite
 At least one multi-valued attribute
 No attributes refer to other object

• Compound
 At least one attribute that references other object

• Hybrid
 Repeating group of attributes
 At least one refers to other object

Characteristics of OO Data Model


• Supports complex objects
• Must be extensible
• Supports encapsulation
• Exhibit inheritance
• Supports object identity

OO vs. E-R Model Components


OO Data Model        E-R Model
Type                 Entity definition
Object               Entity instance
Class                Entity set
Instance variable    Attribute
N/A                  Primary key
OID                  N/A
Class hierarchy      E-R diagram
OODBMS
• Object-oriented database technology is a marriage of object-
oriented programming and database technologies.
• Database management system integrates benefits of typical
database systems with OODM characteristics
• Handles a mix of data types (since OODBMS permits new data
definition)
• Follows OO rules
• Follows DBMS rules

OO and Database Design

• Provides data identification and the procedures for data
manipulation
• Data and procedures form a self-contained entity
• Design is iterative and incremental
• The DBA does more programming
• Lack of standards

OODBMS Advantages
• More semantic information
• Support for complex objects
• Extensibility of data types (user defined data types)
• May improve performance with efficient caching
• Versioning
• Polymorphism: one operation shared by many objects and each
acting differently
• Reusability
• Inheritance speeds development and application: defining new
objects in terms of previously defined objects (Incremental
Definition)
• Potential to integrate DBMSs into single environment
• Relationship between objects is represented explicitly
supporting both navigational and associative access to
information.

OODBMS Disadvantages
• Strong opposition from the established RDBMSs
• Lack of theoretical foundation
• No standard
• No single data model
• Throwback to old pointer systems
• Lack of standard ad hoc query language
• Lack of business data design and management tools
• Steep learning curve
• Low market presence
• Lack of compatibility between different OODBMSs

Storing Objects in a relational database


Since the main driver for an object oriented database management system
is object oriented programming languages, one approach to achieving
persistence in object oriented programs is to use an RDBMS as the
underlying persistence infrastructure. However, in order to apply this
approach, we need to design two important things:
 Design the relations to reflect the class hierarchy and the real
object relationships.
 Design how objects will be accessed, i.e.
o Writing code to decompose objects into tuples and store the
decomposed objects in a relation
o Writing code to read tuples back from relations and
reconstruct the objects.
In general, there are three basic techniques to map a hierarchy of classes
into relations:
i) Mapping each class or subclass into a relation
ii) Mapping each subclass into a relation
iii) Mapping the whole hierarchy into a single relation

The code to make objects persistent and to read objects back from the
database therefore depends on the strategy chosen. In all the approaches
some semantics is lost (for example, which class is the superclass and
which is the subclass), so that semantics has to be rebuilt in each
application, which is subject to duplication of code and potential
inconsistencies.
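
To illustrate the second design task, here is a minimal Python sketch (using the standard-library sqlite3 module and a stripped-down Branch class; the schema and names are assumptions for the example) of decomposing an object into a tuple and reconstructing it on read:

    import sqlite3

    class Branch:
        def __init__(self, branch_no, street, city, postcode):
            self.branch_no, self.street = branch_no, street
            self.city, self.postcode = city, postcode

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE branch (branchNo TEXT PRIMARY KEY, "
                 "street TEXT, city TEXT, postcode INTEGER)")

    def save(b):
        # Decompose the object into a tuple and store it in the relation.
        conn.execute("INSERT INTO branch VALUES (?, ?, ?, ?)",
                     (b.branch_no, b.street, b.city, b.postcode))

    def load(branch_no):
        # Read the tuple back and reconstruct the object.
        row = conn.execute("SELECT * FROM branch WHERE branchNo = ?",
                           (branch_no,)).fetchone()
        return Branch(*row) if row else None

    save(Branch("B005", "st1", "Addis", 1425))
    b = load("B005")   # a Branch object rebuilt from the stored tuple

Exactly this kind of save/load code has to be written (and kept consistent) for every class in the hierarchy, which is the duplication problem noted above.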

Chapter Six
Data warehousing
 Data warehouse is an integrated, subject-oriented, time-
variant, non-volatile database that provides support for
decision making.
 Integrated: centralized, consolidated database that
integrates data derived from the entire organization.
 Consolidates data from multiple and diverse
sources with diverse formats.
 Helps managers to better understand the
company's operations.
 Subject-Oriented: the data warehouse contains data
organized by topics, e.g. Sales, Marketing, Finance,
Customer, etc.
 Time variant: in contrast to the operational data that
focus on current transactions, the warehouse data
represent the flow of data through time.
 The data warehouse contains data that reflect what
happened last week, last month, in the past five
years, and so on.
 Snapshots of data in the organization at different
points in time.
 Non-volatile: once data enter the data warehouse,
they are never changed, because the data in the
warehouse represent the company's entire history, not
just operational data.

 DW is organized around major subjects, such as
customer, product, and sales.

 DW focuses on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing.
 DW provides a simple and concise view around
particular subject issues by excluding data that are not
useful in the decision support process.
 Constructed by integrating multiple, heterogeneous data
sources
 relational databases, flat files, on-line transaction records
 Data cleaning and data integration techniques are
applied.
 Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources

 Ultimately, information is created from data warehouses. Such
information becomes the basis for rational decision making.
 The data found in a data warehouse is analyzed to discover
previously unknown data characteristics, relationships,
dependencies, or trends.

Differences between database and data warehouse


 Because data is added all the time, the warehouse keeps growing.
 The data warehouse and operational environments are
separated. Data warehouse receives its data from
operational databases.
 Data warehouse environment is characterized by read-
only transactions to very large data sets.
 Operational environment is characterized by numerous
update transactions to a few data entities at a time.
 Data warehouse contains historical data over a long time
horizon.

Data in a data warehouse is different from data in current
transactions. Current transaction data is called online data
and is processed using OnLine Transaction Processing
(OLTP).

Differences between data warehouse and data in OLTP


The main difference between these two systems is that the data
warehouse contains historical data whereas OLTP contains
current data used in day-to-day transaction processing.

OLTP System                               Data Warehousing System

Current data                              Historical data
Detailed data                             Summarized data
Dynamic data                              Largely static data
Repetitive and already known operations   Ad hoc and unstructured operations
Predicted pattern of usage                Unpredictable pattern of usage
Transaction driven                        Analysis driven
Application oriented                      Subject oriented
Support day-to-day decisions              Support strategic decisions
Large number of operational users         Low number of managerial users

Data Warehouse vs. Operational DBMS


 OLTP (On-Line Transaction Processing)
 Major task of traditional relational DBMS
 Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.

 OLAP (On-Line Analytical Processing)


 Major task of data warehouse system
 Data analysis and decision making

 Distinct features (OLTP vs. OLAP):


 User and system orientation: customer vs. market
 Data contents: current, detailed vs. historical,
consolidated
 Database design: ER + application vs. star schema +
subject
 View: current, local vs. evolutionary, integrated
 Access patterns: update vs. read-only but complex
queries

 Different aspects of comparing OLTP with OLAP

                       OLTP System                 OLAP System

User                   Clerk, IT professional      Knowledge workers
Function               Day-to-day operation        Decision support
DB design              Application oriented        Subject oriented
Data                   Current, up-to-date,        Historical, summarized,
                       detailed                    integrated
Usage                  Repetitive                  Ad hoc
Access                 Read/write                  Lots of scans, read only
Unit of work           Short, simple transaction   Complex query
# of records accessed  Tens                        Millions
# of users             Thousands                   Hundreds
DB size                100 MB-GB                   100 GB-TB
Performance metric     Transaction throughput      Query throughput

Benefits of Data Warehouse


If data is kept in a data warehouse and used for high level
decision making and knowledge extraction, there are many
advantages to be gained, including:
o Potential high returns on investment
o Competitive advantage on market
o Increased productivity

Data Mining
In simple definition, data mining is the extraction or 'mining' of
knowledge from large amounts of data. "Mining of gold from
sand or soil is not sand or soil mining but GOLD mining."
Similarly, data mining is actually knowledge mining or
knowledge discovery that is useful for decision making.
 Most organizations, these days, are in a data-rich but
information-poor situation.
 Data mining tools predict future trends and behaviors, allowing
businesses to make proactive, knowledge-driven decisions.
The availability of large amounts of data, coupled with the need
for powerful data analysis tools, motivates the development of
data mining tools.
 Data mining tools can answer business questions that
traditionally were too time-consuming to resolve.
 Evolution of Data Mining:
 Data collection → data stored and retrieved → more
information needed
Similar terminologies:
• Knowledge Mining
• Knowledge extraction
• Knowledge discovery
• Pattern analysis

• Data archaeology

 Data in the real world is dirty


• incomplete: lacking attribute values, lacking
certain attributes of interest, or containing only
aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes
or names
 No quality data, no quality mining results!
• Quality decisions must be based on quality data
• Data warehouse needs consistent integration of
quality data
 The first task in data mining is to preprocess the data
warehouse so that quality data is used for knowledge
extraction.

Major Tasks in Data Preprocessing


• Data cleaning: Fill in missing values, smooth noisy
data, identify or remove outliers, and resolve
inconsistencies
• Data integration: Integration of multiple databases,
data cubes, or files
• Data transformation: Normalization and
aggregation
• Data reduction: Obtains reduced representation in
volume but produces the same or similar analytical
results
• Data discretization: Part of data reduction but with
particular importance, especially for numerical data
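
The sketch below illustrates a few of these tasks on a toy dataset using the pandas library (a common but not required choice; the column names and values are invented for the example):

    import pandas as pd

    # Toy data with a missing value and an obvious outlier.
    df = pd.DataFrame({"age":    [25, 32, None, 41, 230],
                       "income": [300, 450, 500, 610, 580]})

    # Data cleaning: fill in the missing value, remove the outlier.
    df["age"] = df["age"].fillna(df["age"].mean())
    df = df[df["age"] < 120]

    # Data transformation: min-max normalization to the range [0, 1].
    df["income"] = (df["income"] - df["income"].min()) / \
                   (df["income"].max() - df["income"].min())

    # Data discretization: bin the numeric attribute into categories.
    df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                             labels=["young", "middle", "senior"])
    print(df)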

Types of knowledge to be mined

• Characterization: to study the behavior or
characteristics of a data object for further analysis.
• Discrimination: study patterns so that objects are
grouped for the purpose of implementing different
strategies; the groups are discriminated from one another.
• Association: the process of analyzing existing
patterns of occurrence of events to see the
interdependency between objects and events.
• Classification/Clustering: the process of dividing
a dataset into mutually exclusive groups such that
the members of each group are as "close" as possible
to one another, and different groups are as "far" as
possible from one another, where distance is
measured with respect to the specific variable(s) you
are trying to predict.
• Prediction: a structure and process for predicting
the values of specified variables in a dataset.
• Outlier analysis: preprocessing to deal with data
items found at the extreme ends that would affect
knowledge discovery.
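
As an illustration of the clustering idea (members of a group close together, different groups far apart), here is a minimal k-means sketch in Python with numpy; the two-cluster data is invented for the example:

    import numpy as np

    def kmeans(points, k, iters=10):
        # Start from k randomly chosen data points as initial centroids.
        rng = np.random.default_rng(0)
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iters):
            # Assign each point to its nearest centroid (Euclidean distance).
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :],
                                   axis=2)
            labels = dists.argmin(axis=1)
            # Move each centroid to the mean of its assigned points
            # (keep the old centroid if a cluster ends up empty).
            centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        return labels, centroids

    data = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                     [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])
    labels, centers = kmeans(data, k=2)
    print(labels)   # two groups: within-group distances small, between-group large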

Given databases of sufficient size and quality, data mining
technology can generate new business opportunities by
providing these capabilities:
 Automated prediction of trends and behaviors.
 Automated discovery of previously unknown patterns.

The most commonly used techniques in data mining are:

 Artificial neural networks: Non-linear predictive models
that learn through training and resemble biological neural
networks in structure.

 Decision trees: Tree-shaped structures that represent sets of
decisions. These decisions generate rules for the classification of
a dataset. Specific decision tree methods include Classification
and Regression Trees (CART) and Chi-Squared Automatic
Interaction Detection (CHAID).

 Genetic algorithms: Optimization techniques that use
processes such as genetic combination, mutation, and natural
selection in a design based on the concepts of evolution.

 Nearest neighbor method: A technique that classifies each
record in a dataset based on the classes of the k records most
similar to it. Sometimes called the k-nearest neighbor
technique (see the sketch after this list).

 Rule induction: The extraction of useful if…then… rules from
data based on statistical significance.
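
A minimal sketch of the k-nearest neighbor idea in plain Python (the tiny labeled dataset is invented for illustration):

    import math
    from collections import Counter

    def knn_classify(train, query, k=3):
        # 'train' is a list of ((x, y), label) pairs.
        # Sort the records by Euclidean distance to the query point.
        by_distance = sorted(train, key=lambda rec: math.dist(rec[0], query))
        # Majority class among the k closest records wins.
        votes = Counter(label for _, label in by_distance[:k])
        return votes.most_common(1)[0][0]

    train = [((1, 1), "low"), ((1, 2), "low"), ((2, 1), "low"),
             ((8, 8), "high"), ((8, 9), "high"), ((9, 8), "high")]
    print(knn_classify(train, (2, 2)))   # "low": its 3 nearest records are "low"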

Or the techniques of data mining can be categorized into the
following modeling techniques.

 Predictive models. These types of models predict how likely
an event is. Usually, the higher the score, the more likely the
event is. For example, how likely a credit card transaction is
to be fraudulent, or how likely an airline passenger is to be a
terrorist, or how likely a company is to go bankrupt.

 Summary models. These models summarize data. For
example, a cluster model can be used to divide credit card
transactions or airline passengers into different groups
depending upon their characteristics.

 Network models. These types of models uncover certain
structures in data represented by nodes and links. For
example, a network model for a terrorist cell might use nodes
representing individuals and links representing meetings.

 Association models. Sometimes certain events occur
frequently together. For example, purchases of certain items,
such as beer and pretzels, or a sequence of events associated
with component failure. Association models are used to find
and characterize these co-occurrences.
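
A toy sketch of the co-occurrence counting behind association models, in plain Python (the transactions are invented; production systems use algorithms such as Apriori over much larger data):

    from itertools import combinations
    from collections import Counter

    transactions = [{"beer", "pretzels", "milk"},
                    {"beer", "pretzels"},
                    {"milk", "bread"},
                    {"beer", "pretzels", "bread"}]

    # Count how often each pair of items appears together in a transaction.
    pair_counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    # Support = fraction of transactions containing the pair.
    for pair, count in pair_counts.most_common(3):
        print(pair, "support =", count / len(transactions))
    # ('beer', 'pretzels') appears in 3 of 4 transactions: support = 0.75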
