Gabor Bodnar
RISC-Linz, Johannes Kepler University, A-4040 Linz, Austria
email: Gabor.Bodnar@risc.uni-linz.ac.at
www: http://www.risc.uni-linz.ac.at/people/gbodnar
January 23, 2005
Contents

0 Introduction
   0.1 Foreword
   0.2 Information Sources

1 Data Modeling
   1.1 Introduction
   1.2 The Entity-Relationship Model
       1.2.1 Entities, Attributes, Relationships
       1.2.2 Classification of Relationships
       1.2.3 Keys
       1.2.4 Entity-Relationship Diagrams
       1.2.5 Entity-Relationship Design
       1.2.6 Exercises
   1.3 The Relational Model
       1.3.1 Relational Structure
       1.3.2 Relational Algebra
       1.3.3 Functional Dependencies
       1.3.4 Normal forms
       1.3.5 Indexing and Hashing
       1.3.6 Exercises
   1.4 SQL
       1.4.1 Data Definition
       1.4.2 Simple Queries
       1.4.3 Database Modification
       1.4.4 Views and Joins
       1.4.5 Embedded SQL
       1.4.6 Exercises

3 XML
   3.1 XML basics
       3.1.1 Syntax
       3.1.2 XHTML
   3.2 Namespaces
       3.2.1 Uniform Resource Identifiers
       3.2.2 Namespaces in XML
   3.3 XML Schema
       3.3.1 Schema declaration
       3.3.2 Types, Elements, Attributes
       3.3.3 Exercises
   3.4 XPath
       3.4.1 Data Model
       3.4.2 Location Paths
       3.4.3 Expressions, Core Functions
   3.5 XSL Transformations
       3.5.1 Templates
       3.5.2 Additional features
       3.5.3 Exercises
   3.6 XML Query
       3.6.1 Data Model and Types
       3.6.2 Expressions
       3.6.3 Query Prolog
       3.6.4 Exercises
Chapter 0
Introduction
0.1 Foreword
0.2 Information Sources
In this section we summarize the information sources these lecture notes rely on. These are by no means the only sources that can be used to study the topics discussed in the notes, but they provide a good introduction to them.
For the (relational) data modeling part of the notes a standard reference is [3], which gives an extensive overview and a thorough introduction to databases. It is also worthwhile to look at the excellent lecture notes of [14].
For the part on on-line databases one can also use [3], and in particular for search engines [7].
For the XML part a brief, to-the-point introduction can be found in [2]. Moreover, the original specifications are all available online: [8, 9, 10, 13, 11].
The importance of searching the Internet for further information cannot be emphasized enough. By the nature of the topic, an enormous amount of information is available online, waiting to be discovered.
Chapter 1
Data Modeling
Data modeling is an important part of database design. The data model focuses on what data should be stored in the database, exploring the relations between data entities. The goal of the data model is to ensure that all data objects required by the database are accurately represented. It provides the overall logical structure (the blueprint) of the database. In this chapter we take a look at relational data modeling techniques and SQL, a commonly used data manipulation language.
1.1 Introduction
Databases are now ubiquitous in all areas where information has to be stored and processed. Their importance comes from the value of the information they store. The most important criteria for them can be described by two keywords: reliability (covering many subareas, from robustness to issues related to concurrency and security) and efficiency (covering not only data manipulation speed but also flexibility towards new requirements and the degree to which the database supports application software development).
A database management system (DBMS) consists of
- a collection of structured, interrelated and persistent data,
- a set of application programs to access, update and manage the data.
Example 1.1. Let us consider the tasks of the administration of a university. It has to maintain files of teachers, students, courses, the relationships between courses and students for each semester, the grades arising from the previous relationships, and countless other things. The administration has to be able to perform the following tasks: add/remove a student/teacher/course, assign courses to teachers/students each semester, add/modify grades of a student for the courses she took, generate the list of grades of students, generate the list of courses given by a teacher, etc.
This amounts to managing large pieces of data with additional constraints (for instance, courses can build on each other) and numerous reports to be generated. Moreover, it should not be too difficult to upgrade the system to meet new requirements that come from the real world (if the law changes, a new faculty/institute/branch of studies starts, etc.).
Thinking a bit about this example shows the major problems we run into when trying to solve it in an ad-hoc way. Here are the problems a DBMS is commonly expected to cope with:
Data redundancy and inconsistency.
- The same information should not appear at several places in the system.
- Related data should be updated consistently.
Data integrity.
- The system should ensure that the data stored in it fulfills certain prescribed constraints.
- The system should adapt to changes of the constraints.
- The system should be able to recover from crashes.
Data access.
- The system should be able to generate answers to a large class of queries.
- The system should support efficient data retrieval (e.g. by indexing, hashing).
Data isolation.
- Data (elements) of different types and/or magnitudes should be handled by the system (like text documents, numerical data, photos, etc.).
Concurrency.
- The system should be able to work with multiple users at the same time.
- Concurrent modifications should not disturb the consistency of the data.
- Deadlocks should be avoided.
Security.
- The system should handle access rights for users.
- Remote access to the database should be secure.
Data abstraction is the most important tool to treat the above problems conceptually. There are three levels of abstraction (sometimes also referred to as database schemes):
Physical level: It deals with the storage and retrieval of the data on the disk (data and index file structures). This level is usually hidden from the users by the DBMS.
Conceptual level: On this level the questions of what data are stored and what the relationships between these data are get decided. This is the level of the database administrator.
View level: It describes a subset of the data in the database for a particular set of users. The same data can be viewed in many different ways (for instance the teacher wants a list of the students subscribed to his course while the dean of studies is interested in the number of people attending the course). This is the level of the users of the database.
Data models are collections of conceptual tools to describe data, data relations, data constraints and data semantics. There are object-based, record-based and physical data models.
Object-based models:
- Describe data on the conceptual and view levels.
- Provide flexible structuring capabilities.
- Data constraints can be specified explicitly.
- Examples are: the Entity-Relationship (ER) model, the Object-oriented model, the Functional model, etc.
Record-based models:
- Describe data on the conceptual and view levels.
- Used for databases with fixed record structure; the sizes of the fields of the records are usually also fixed.
- There exist query languages for record-based models (e.g. SQL).
- Examples are: the Relational model, the Network model, the Hierarchical model, etc.
A snapshot of the data contained in a database at a given point in time
is called an instance of the database. In contrast, the database scheme is the
overall design of the database (structure, relations, constraints in general).
An important concept is that of data independence, which expresses that changes in the database scheme at a given level of abstraction should not affect the schemes at higher levels. Thus we have:
Physical data independence: Changes on the physical level should not make changes in the data model necessary, and therefore application programs need not be rewritten.
Logical data independence: Changes on the conceptual level should not make it necessary to rewrite application programs.
Standard terminology in DBMSes:
Data Definition Language (DDL): Used to describe the database scheme (structure, relations, constraints). The compiled DDL statements form the data dictionary (the metadata describing the database).
Data Manipulation Language (DML): Used to pose queries (select) or to modify (insert, update, delete) the database. In nonprocedural DMLs the user only specifies what data is needed, while in procedural DMLs the user also specifies how it should be retrieved.
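For instance (anticipating the SQL of Section 1.4, and using a hypothetical Student table), a DDL statement describes structure, while DML statements query and modify the data:

CREATE TABLE Student (          -- DDL: defines a relation scheme
  sid   char(7),
  lname varchar(20))

SELECT lname FROM Student       -- DML: a nonprocedural query

INSERT INTO Student             -- DML: a modification
VALUES ('0251563', 'Schmidt')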
1.2 The Entity-Relationship Model
The ER model views the real world as a set of entities and relationships between them. This modeling technique maps well to the relational model (the constructs of the ER model can easily be turned into corresponding tables) discussed in Section 1.3.
1.2.1 Entities, Attributes, Relationships
values from the nonempty strings of Latin characters, which will be the domain of these attributes (viewing the attributes as functions, from a strict mathematical point of view what we call here domain is actually the codomain of the function, but let us not change the well-established terminology of the ER model). The student-ID can be a seven-digit positive number and the birth-date can be a string of the format dd-mm-yyyy with the usual semantics.
An entity (a particular student of the university) can then be described as {(first-name, Susan), (last-name, Becker), (student-ID, 0256098), (birth-date, 27-06-1978)}. Of course, in any kind of information processing the entity appears only via its description by the set of attributes which are relevant from the point of view of the task to be carried out.
A relationship represents an association between two or more entities. A relationship set is the set of relationships of the same type. Recall that a relation over the sets E1, . . . , Er is a subset of E1 × · · · × Er (with the ER terminology this is a relationship set).
Example 1.3. Continuing Example 1.2, let another entity set be the set of courses offered in the current semester at the university. Let the attributes describing a course be the course-ID, teacher-ID and course-title, and let a course be described by {(course-ID, 322527), (teacher-ID, 024532), (course-title, Analysis I)}.
If the student of the previous example takes this course, there is a relationship between the two entities (the student and the course).
If we take the entity sets of all students and all the courses offered, the arising relationship set will describe which student takes which courses this semester.
It can be easily recognized that a relationship can also be taken as an abstract
entity, and hence a relationship set can be described just as an entity set.
Example 1.4. Continuing Example 1.3, the relationship set that associates
students to the courses they take in a semester can be described via the entity
set whose elements are described by the attributes student-ID and course-ID.
The relationship between Susan Becker and the Analysis I course would then
be described by {(student-ID, 0256098),(course-ID, 322527)}.
1.2.2 Classification of Relationships
1.2.3 Keys
Since entities must be distinguishable, we must be able to choose a set of attributes for each entity set so that no two entities in the set are described by the same set of attribute values. Usually the full set of attributes is more than enough to distinguish entities, and what we try to do is to find small subsets of attributes that still have this property.
A superkey is a nonempty set of attributes which allows us to uniquely
identify an entity of the entity set.
A superkey which does not have a proper subset which is also a superkey is
called a candidate key.
A primary key is a candidate key which is chosen by the database administrator to identify entities in an entity set.
Remark 1.9. There might be more than one candidate key (since set inclusion is only a partial order). Which candidate key becomes the primary key is thus a matter of choice.
An entity set that does not possess sufficient attributes to form a primary
key is called a weak entity set. One that does have a primary key is called a
strong entity set. A weak entity set must be part of a One-to-Many relationship
in a subordinate role while the entity set with the dominant role must be a
strong one so that it allows distinction between the entities of the weak set via
the relation.
A discriminator is a set of attributes in the weak entity set that allows the distinction of the entities via a given relation with a strong entity set on which the weak one's existence depends. A primary key for a weak entity set is the primary key of the strong entity set with which it is related, together with a discriminator with respect to this relation.
Example 1.10. Take the entity set of all the telephones in Austria, where an entity is described by the attribute phone-number (without any country or area codes). This is a weak entity set because many phones in different areas can have the same number (although the entities, that is the phones, are different). To make the system work we have to introduce the entity set of phone areas, where an entity is described by the attribute area-code.
The One-to-Many relation associates to an area each phone in it. Since the areas cover Austria completely, no phones exist without an area; thus there is an existence dependency in which the areas are the dominant and the phones are the subordinate entities.
The area-code is a superkey, a candidate key, and (after we declare it so) a primary key of the entity set of phone areas. The phone-number is a discriminator of the entity set of telephones with respect to the primary key we have chosen for the phone areas, via the relation we established between areas and phones. The area-code together with the phone-number forms a primary key for the entity set of telephones in Austria.
The primary key of a relationship set is formed from the primary keys of the
entity sets in the relationship set.
A foreign key is a set of attributes for an entity set in a One-to-Many binary
relation on the Many side, which identifies the corresponding parent entity
for each entity of the set. Foreign keys provide means to maintain data integrity in databases (called referential integrity). For instance in the example
of Figure 1.1, if we add member-ID to the attributes of the Book entity set
(with a special null value for books not borrowed) it becomes a foreign key that
identifies the borrower of any book in the library.
1.2.4 Entity-Relationship Diagrams
[Figure 1.1: ER diagram of a library, with entity sets Book (book-ID, author, title) and Member (member-ID, first-name, last-name) connected by the Borrows relationship.]
[Table: the telephone example of Example 1.10, with attributes area-name, area-code, phone-number, owner-first-name and owner-last-name for the areas Linz, Salzburg, Innsbruck, Bregenz and Wien.]
1.2.5 Entity-Relationship Design
Example 1.11. Take the entity sets of students and courses and the Many-to-Many relationship set between them that associates students with the courses they attend. We then introduce a new entity set named Attendance with two attributes, student-ID and course-ID, whose entities express the fact that a certain student attends a given course. Then we establish a relationship between a student and an entity in Attendance if the student-ID is the same, and we act analogously for courses and entities of Attendance. The ER diagram is shown in Figure 1.2.
Please note that the new ER diagram requires that for each Attendance entity a student and a course entity exist.
[Figure 1.2: ER diagrams of students and courses; the Many-to-Many relationship Student–Attends–Course is resolved via the Attendance entity set (student-ID, course-ID) with the relationships Takes and Attended-by.]
[Further diagrams: a complex Works-on relationship between Student, Attendance and Assignment (assignment-ID, difficulty), and its resolution via the CourseWork entity set (student-ID, course-ID, assignment-ID) with the relationships Does/Assigns, Assigned-to and Used-in.]
Another important design tool is generalization, which arranges entity sets into hierarchies. For instance, a general entity set of vehicles can be specialized into
personal cars (adding for instance the attributes: PS and color), trucks (adding
maximal-weight and type-of-load), buses (adding nr-of-seats and comfort-level),
etc.
In practice, there is the important phase of requirement analysis before the phase of data modeling. Very briefly, requirement analysis is used to achieve the following goals:
- Determine what data (viewed as elementary objects) need to be included in the database.
- Collect the descriptive information about these objects.
- Identify and classify the relationships between these objects.
- Identify what transactions will be executed on the database and how they will affect the data objects.
- Identify data integrity rules.
The last two points concern already the functional view of the database.
Then in the data modeling phase, using the ER model, the tasks are the following:
- Identifying entity and relationship sets.
- Drafting the initial ER diagram.
- Refining the ER diagram by resolving complex relationships.
- Adding key attributes to the diagram.
- Adding non-key attributes.
- Constructing generalization hierarchies.
- Adding integrity rules to the model.
The ER modeling method provides a wide variety of possibilities and there is no standardized way to achieve optimal designs. One has to balance between possibilities such as whether to describe a class of objects by an entity set or via a relationship set, whether to allow weak entity sets or prefer strong ones, whether to apply generalization and aggregation or omit them to keep the ER model simpler, and so on.
1.2.6
Exercises
Since 1 October 2003 car dealers can offer models of competitors in the same
exhibition area in the EU. A dealer is contracted with certain car manufacturers
who supply him with certain models which are eventually bought by certain
customers.
Taking this simplified model of car selling at a car dealer do the following.
Exercise 1.2.1. Identify the entities and the entity sets they belong to in this model.
Exercise 1.2.2. Which attributes would you choose to describe entities in the
entity sets you identified previously?
Of course the sets of attributes can be made quite big if one wants to model reality more and more accurately. Here it is enough to introduce a few attributes, without aiming at completeness.
Exercise 1.2.3. Enumerate the relationships between the entities in this model.
Exercise 1.2.4. Sketch a first ER diagram of this model.
Exercise 1.2.5. Resolve complex relationships (relationships with degree > 2
and N:M relationships) and draw the refined ER diagram of the model.
Exercise 1.2.6. Decide which are the most important entities in the model
and choose primary keys for them.
You might want to introduce new attributes at this step: remember that,
for instance, student IDs are introduced artificially to describe students (as
important entities) in the corresponding model.
Exercise 1.2.7. Prescribe integrity constraints for the model. (Which entity's existence depends on which other's? Are there foreign keys?)
1.3 The Relational Model
The relational model was formally introduced by E. F. Codd in 1970 and became the dominant modeling technique in the IT sector. It represents data in the form of two-dimensional tables, and a relational database is a collection of such tables, each having a unique name.
1.3.1 Relational Structure
Let Di be domains (sets) for attributes (denoted by distinct names, say Ai) for i = 1, . . . , n. Then an n-ary relation over the Di is a subset of their cartesian product:
R ⊆ D1 × · · · × Dn.
If R is a finite set, it can be described as a table with m = #R rows and n columns:

d11 . . . d1n
d21 . . . d2n
 ⋮        ⋮
dm1 . . . dmn

A row (or tuple) represents that the attribute values appearing in it are in relationship with each other by R. The attribute values are arranged in each row so that the jth column contains the value from Dj.
Remark 1.13. We typeset attribute and table names in typewriter style, with
the first letter of table names capitalized.
It is also commonly required that attributes of primary keys are never null
(entity integrity). And for foreign keys it is required that (in a row) either all
attribute values in the key are null or they match a row in the table for which
the foreign key is the primary key (referential integrity).
1.3.2 Relational Algebra
For the examples of this section, consider the following small database:

Student
sid      fname    lname
0251563  Werner   Schmidt
0245654  Andrea   Ritter
0245675  Daniela  Schmidt

Attendance
sid      cid
0251563  327456
0251563  327564
0245654  327456

Course
cid     title
327456  Analysis I
327564  Algebra I

select The operation selects the tuples of a relation that satisfy a given predicate. It is denoted by σ. For example
σ lname=Schmidt (Student)
results the relation

sid      fname    lname
0251563  Werner   Schmidt
0245675  Daniela  Schmidt
The atomic predicates allowed in the parameter of σ are =, ≠, <, ≤, >, ≥; moreover, with the logical connectives ∧, ∨, ¬ compound formulas can be built.
project The operation projects the relation onto the set of specified attributes and additionally eliminates multiple rows. It is denoted by π. For example
π fname,lname (Student)
results the relation

fname    lname
Werner   Schmidt
Andrea   Ritter
Daniela  Schmidt
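For comparison, in SQL (discussed in Section 1.4) selection corresponds to the WHERE clause and projection to the select list; a sketch of the two examples above:

SELECT * FROM Student WHERE lname = 'Schmidt'

SELECT DISTINCT fname, lname FROM Student

The DISTINCT keyword is needed because, unlike π, SQL does not eliminate duplicate rows by default.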
cartesian product The operation combines each tuple from its first argument with each tuple from the second. The result is a relation whose attributes are the disjoint union of the attributes of the arguments. The number of rows of the cartesian product is the product of the numbers of rows of the arguments. It is denoted by ×. For example
Student × Course
results the relation

sid      fname    lname    cid     title
0251563  Werner   Schmidt  327456  Analysis I
0251563  Werner   Schmidt  327564  Algebra I
0245654  Andrea   Ritter   327456  Analysis I
0245654  Andrea   Ritter   327564  Algebra I
0245675  Daniela  Schmidt  327456  Analysis I
0245675  Daniela  Schmidt  327564  Algebra I
Whenever ambiguity may arise, we can prepend the relation name to attribute names to specify the intended attributes explicitly. For instance, if we had a Teacher relation with attributes fname and lname, and we took Student × Teacher, for the last names in the resulting table we would need to specify whether we want Student.lname or Teacher.lname.
rename The operation just changes the name of its argument to the one given in its parameter. This is useful, for instance, when we need to take the cartesian product of a relation with itself. It is denoted by ρ. For example, if we want to select the students with identical last names, we could use the following:
π Student.fname,Student.lname (σ P (Student × ρ Student2 (Student)))
where P is the predicate
Student.lname = Student2.lname ∧ Student.sid ≠ Student2.sid.
The result is the relation

fname    lname
Werner   Schmidt
Daniela  Schmidt
union The operation takes the set-theoretic union of two compatible relations, where compatible means that the sets of attributes and the corresponding domains are the same. It is denoted by ∪. Let us define a new relation for teachers:

Teacher
fname     lname
Wolfgang  Schiffer
Sabrina   Kaiser

Then Teacher ∪ π fname,lname (Student) results the relation

fname     lname
Wolfgang  Schiffer
Werner    Schmidt
Sabrina   Kaiser
Daniela   Schmidt
Andrea    Ritter
difference The operation takes the set-theoretic difference of two compatible relations. It is denoted by −.
intersection The operation takes the set-theoretic intersection of two compatible relations. It is denoted by ∩, and it can be defined as A ∩ B = A − (A − B).
natural join The operation combines a cartesian product, a selection and a projection operation into a single operation. This is justified by its frequent application and by the fact that it can be implemented more efficiently. It is denoted by ⋈. Let A and B be relations with attribute sets R and S respectively and let {r1, . . . , rn} = R ∩ S; then
A ⋈ B = π R∪S (σ A.r1=B.r1 ∧ ··· ∧ A.rn=B.rn (A × B)).
If R ∩ S = ∅ then A ⋈ B = A × B; we also have A ⋈ B = B ⋈ A, and the operation binds from left to right.
For example Student ⋈ Attendance results

sid      fname    lname    cid
0251563  Werner   Schmidt  327456
0251563  Werner   Schmidt  327564
0245654  Andrea   Ritter   327456

and (Student ⋈ Attendance) ⋈ Course results

sid      fname    lname    cid     title
0251563  Werner   Schmidt  327456  Analysis I
0251563  Werner   Schmidt  327564  Algebra I
0245654  Andrea   Ritter   327456  Analysis I
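In SQL the natural join can be written directly; a sketch:

SELECT * FROM Student NATURAL JOIN Attendance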
division The operation is, in a sense, the inverse of the cartesian product. Let A and B be relations with attribute sets R and S such that S ⊆ R; then A ÷ B is the relation over R − S consisting of those tuples whose combination with every tuple of B appears in A. It is denoted by ÷. For example ((Student ⋈ Attendance) ⋈ Course) ÷ Course results

sid      fname    lname
0251563  Werner   Schmidt

The relation of the dividend can be found above with all five attributes. The common attributes of this relation and Course are cid and title. The result contains only one tuple, which comes from the tuples in the dividend whose cid and title attributes matched every tuple in the divisor. In still simpler words: there is only one student who takes all the courses that can be found in the database.
assignment The operation is denoted by ←, and works like the assignment operator in imperative programming languages. If the name of a relation of the database stands on the left hand side of the operator, the statement results in a modification of the database. On the other hand, if a new name gets assigned to a relation, this does not introduce a new relation into the database. The new name acts only as a temporary variable which can be used to simplify expressions in subsequent statements. For example the division operation could be rewritten as tmp1 ← π R−S (A), tmp2 ← π R−S ((π R−S (A) × B) − A) and A ÷ B = tmp1 − tmp2.
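Division has no direct SQL counterpart; it is usually expressed with a double NOT EXISTS. A sketch, using the tables above, that lists the students attending every course in the database:

SELECT s.sid, s.fname, s.lname
FROM Student s
WHERE NOT EXISTS (            -- there is no course ...
  SELECT * FROM Course c
  WHERE NOT EXISTS (          -- ... that the student does not attend
    SELECT * FROM Attendance a
    WHERE a.sid = s.sid AND a.cid = c.cid))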
There are also the following extended operations defined:
generalized projection This operation is analogous to projection, with the additional feature that arithmetic expressions formed from the attributes and constants of the corresponding domains are allowed as parameters of the operation.
For example, let us assume that we have generated a relation Analysis-result (via the operations above) for students, with the attributes sid, pts1, pts2, pts3, where the pts values are the points the students reached on the first, second and third exam of the Analysis I course. Then the generalized projection
π sid,(pts1+pts2+pts3)/3 (Analysis-result)
will describe the average score of the students. With a more complicated expression one might even define the grade immediately.
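In SQL, arithmetic expressions are likewise allowed in the select list; a sketch (with the relation written as Analysis_result):

SELECT sid, (pts1 + pts2 + pts3) / 3 AS average
FROM Analysis_result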
inner join The joins we discussed so far are inner joins. The inner qualification is often skipped because a join is an inner join by default. An inner join, as opposed to an outer join, includes only those rows of the operands in the result that satisfy the join condition. For instance, in the Student ⋈ Attendance operation the student Daniela Schmidt does not appear in the result, because there is no matching row for her tuple in the Attendance table.
outer join This operation comes in three variants (listed below). The purpose of the operation is to include in the result not only those rows of the left/right/both operands that satisfy the join condition, but also the ones that do not have matching pairs. In this sense an outer join is an extension of the join operation which guarantees that all rows of the chosen operands will appear in the result (possibly with NULL values padded for the attributes coming from the other operand).
left outer join The result will also contain those tuples from the first (left) argument of the operation that did not match any tuple from the second (right) argument, padding the missing attributes with null values.
right outer join Analogously to the previous case, the tuples of the second (right) argument are carried over and padded with null values as necessary.
full outer join This operation, analogously to the previous cases, carries over every tuple from both of its arguments and pads them with null values (on the left or on the right) as necessary.
For instance, in the example for join, the tuple (0245675, Daniela, Schmidt) could not be matched with any tuple from the other relation; therefore it does not appear in the result. Computing (Student ⟕ Attendance) ⟕ Course with left outer joins instead would result

sid      fname    lname    cid     title
0251563  Werner   Schmidt  327456  Analysis I
0251563  Werner   Schmidt  327564  Algebra I
0245654  Andrea   Ritter   327456  Analysis I
0245675  Daniela  Schmidt  NULL    NULL
self join A join operation is called a self join if both operands are the same (they can be made formally different by a rename operation). For instance, let us consider the relation People with relation scheme id, fname, lname, boss-id, where id is an internal identifier of the people of the company and boss-id is the identifier of the boss of the employee (for the top ranking people the boss-id is NULL). Then the following self join produces the table in which a row contains an employee and his boss:
People ⋈ People.boss-id=People2.id ρ People2 (People).
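A sketch of the same self join in SQL (writing the attributes as boss_id and id):

SELECT e.fname, e.lname, b.fname AS boss_fname, b.lname AS boss_lname
FROM People e, People b
WHERE e.boss_id = b.id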
To obtain accumulated information from the database we can use aggregation operations. In general such an operation has the following form:
G1,...,Gn 𝒢 F1(A1),...,Fm(Am) (E),
where the Gi are grouping attributes, the Fi are aggregate functions (such as sum, avg, count, min, max) and the Ai are attributes of the relation E. For example, let Credits be the relation

sid      cid     credit
0251563  327456  3
0251563  327564  2
0245654  327456  3
where credit gives the number of credits the student with sid obtained for the
course with cid. Then the aggregation operation
sid 𝒢 Creditsum←sum(credit) (Credits)
results the relation

sid      Creditsum
0251563  5
0245654  3
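The SQL counterpart of aggregation is the GROUP BY clause; a sketch:

SELECT sid, SUM(credit) AS Creditsum
FROM Credits
GROUP BY sid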
1.3.3 Functional Dependencies
Integrity constraints are used to express conditions the database has to fulfill in order for the data it holds to be consistent. However, complex constraints can drastically set back performance, because they have to be checked for each modification of the database.
We discuss briefly: domain constraints, referential integrity, assertions, triggers and functional dependencies.
Domain constraints restrict the possible values of attributes. They are usually easy to check and are thus provided by many DBMSes. A simple example is to specify that the attribute credit in the Credits relation is numeric and takes values in the range from 0 to 5.
Let A and B be relations with describing attribute sets R and S respectively. Then P ⊆ R is a foreign key for a candidate key L ⊆ S of B if for every tuple t in A there is a tuple u in B such that t[P] = u[L] (where t[P] denotes the restriction of the tuple t to the attributes in P). Such a condition is a referential integrity constraint. It can also be expressed as π P (A) ⊆ π L (B).
Example 1.19. For example, take the relations Student and Attendance from
Section 1.3.2. In Attendance the attribute sid is a foreign key for the primary
key sid of Student. Back in Example 1.11 we expressed the referential integrity
in the ER diagram by a line crossing the arrow from the Student entity set to
the Attendance one.
Modifications in the database can destroy referential integrity; therefore, using the previous notation, we must ensure the following:
- when a tuple t is inserted into A, there must already be a tuple u in B with t[P] = u[L],
- when a tuple u is deleted from B and there is still a tuple t in A with t[P] = u[L], either the deletion is aborted or t is also deleted from A (depending on the application),
- when tuples in A are updated, actions analogous to the case of insertion have to be taken, and when updating B, actions analogous to the case of deletion (to maintain referential integrity in both cases).
An assertion is a predicate the database should always satisfy. The previous integrity constraints are special cases of assertions. Using the university database example, an assertion could be the requirement that no teacher has more than five courses in a semester.
A trigger is a statement (in the query language) the DBMS executes automatically whenever a set of conditions becomes true. For instance, deletion of a
tuple from the Student relation could trigger deletion of referring tuples from
the Attendance relation.
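A sketch of such a trigger (trigger syntax varies considerably between DBMSes; the following is in the style of the SQL standard):

CREATE TRIGGER attendance_cleanup
AFTER DELETE ON Student
REFERENCING OLD ROW AS o
FOR EACH ROW
DELETE FROM Attendance WHERE Attendance.sid = o.sid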
Functional dependencies
Functional dependencies represent integrity constraints on sets of attributes in the database. They are in strong correspondence with causal, temporal, etc. dependencies in the real world; therefore finding them is an important part of the modeling process.
Let R be a set of attributes and let P and S be subsets of R. Then S functionally depends on P, denoted as P → S, if for any legal relation A with R being its set of describing attributes and for any tuples t and u of A, t[P] = u[P] implies t[S] = u[S].
In the definition above, legal means any relation that describes the real world situation we are modeling. Viewed the other way around, prescribing functional dependencies for the relations of the database means constraining the set of legal relations; that is, functional dependencies are a tool to describe which relations are legal.
We say that S irreducibly (or fully) functionally depends on P if P → S and we do not have Q → S for any proper subset Q ⊂ P. In other words, P → S is irreducible if Q ↛ S for every Q ⊂ P.
Remark 1.20. We can see that this concept generalizes the concept of superkeys: the fact that K is a superkey can be expressed as K → R.
Example 1.21. Consider the relation (Student ⋈ Attendance) ⋈ Course, shown in Section 1.3.2. In this relation {sid} → {fname, lname}, {cid} → {title}, and the candidate key is {sid, cid}, on which the set of all attributes functionally depends.
A functional dependency is called trivial if it is satisfied by all relations.
Armstrong's axioms for functional dependencies say that for P, Q, S, T ⊆ R we have
- Q ⊆ P implies P → Q (reflexivity),
- P → Q implies P ∪ S → Q ∪ S (augmentation),
- P → Q and Q → S implies P → S (transitivity).
These axioms are sound (consistent) and complete.
Some additional rules derived from the axioms are:
- P → P (self-determination),
- P → Q ∪ S implies P → Q and P → S (decomposition),
- P → Q and P → S implies P → Q ∪ S (union),
- P → Q and S → T implies P ∪ S → Q ∪ T (composition),
- P → Q and Q ∪ S → T implies P ∪ S → T (pseudotransitivity).
A functional dependency P → S is called transitive (via Q) if there exist functional dependencies P → Q and Q → S with S ⊈ Q and Q ↛ P.
If F is a set of functional dependencies, the set of all dependencies it implies is called the closure of F, denoted by F⁺.
The set of attributes of R determined by P ⊆ R under F is called the closure of P, denoted by P⁺.
Let F and G be two sets of functional dependencies; if F⁺ ⊆ G⁺ then we say that G covers F. If F and G mutually cover each other, they are called equivalent.
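For instance, with F = {{sid} → {fname}, {sid} → {lname}, {cid} → {title}} from Example 1.21, we have {sid}⁺ = {sid, fname, lname}, while {sid, cid}⁺ is the whole attribute set; the latter confirms that {sid, cid} is a superkey.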
A set F of functional dependencies is called irreducible if
- the right hand side of every functional dependency in F is a singleton,
- no attribute can be discarded from the left hand side of any functional dependency in F without changing F⁺,
- no functional dependency can be discarded from F without changing F⁺.

1.3.4 Normal forms
the latter. This can be easily checked via functional dependencies. Moreover
from the results of Example 1.22, we get that it is also a dependency preserving
decomposition.
A relation scheme is in first normal form (1NF), if all the attribute values in
it are atomic (by definition there cannot be two identical tuples in the relation).
For defining 2NF and 3NF we assume that the relation scheme has only one candidate key, which is the primary key. An attribute that is not a member of any candidate key is referred to as a non-key attribute.
A relation scheme is in second normal form (2NF) (with respect to a set of functional dependencies F) if it is in 1NF and every non-key attribute irreducibly functionally depends on the primary key.
A relation scheme is in third normal form (3NF) (with respect to a set of functional dependencies F) if it is in 2NF and every non-key attribute non-transitively depends on the primary key.
A relation scheme is in Boyce-Codd normal form (BCNF) (with respect to a set of functional dependencies F) if the left hand side of every nontrivial irreducible functional dependency is a candidate key.
Example 1.25. Consider first the usual example from Section 1.3.2 with relation scheme {sid, fname, lname, cid, title}. The irreducible set of functional dependencies under consideration is the one at the end of Example 1.22. The only candidate key is the primary key {sid, cid}.
Before further discussion, please also note that now we are concerned with relation schemes, and the properties have to be fulfilled by any legal relation on them. Therefore the table in Section 1.3.2 is just a particular example.
The relation scheme is in 1NF, but not in 2NF because, for instance, {fname} functionally depends on {sid}, which is a proper subset of the primary key.
Let us consider then the relation scheme {sid, pts1, pts2, grade} of a teacher giving only one course, which describes that on the course a student achieved so and so many points on the tests and finally obtained a grade. The primary key is {sid}. This scheme is in 2NF, but not in 3NF because of {pts1, pts2} → {grade}.
The main idea of 3NF and BCNF, informally speaking, is that when one considers the graph of nontrivial functional dependencies, the arrows should point from candidate keys to non-key attribute sets. The difference is that 3NF does not fulfill this requirement when there is more than one composite candidate key and they overlap. If we drop the requirement that the relation scheme has only one candidate key, we can see this difference via a somewhat artificial example.
Let us assume that course names also include some kind of unique identifier, so that in Course the attribute title is also a candidate key. Consider then the relation scheme {sid, cid, title, grade}, describing that a student got some grade on a course given by its ID and name. This relation might be created to list grades publicly, so that no names of students appear, while the name of the course might be more informative than just its ID. The candidate keys are {sid, cid} and {sid, title} (let the first be the primary key) and we have the following functional dependencies:
{sid, cid} → {title}
{sid, cid} → {grade}
{sid, title} → {cid}
{sid, title} → {grade}
{cid} → {title}
{title} → {cid}.
Allowing more than one candidate key, the relation scheme with this set of functional dependencies is in 3NF (the only non-key attribute is grade, which irreducibly and non-transitively depends on the primary key). But it is not in BCNF, because the left hand sides of the nontrivial irreducible functional dependencies {cid} → {title} and {title} → {cid} are not candidate keys.
We could decompose this relation scheme as {sid, cid, grade} and {cid, title} to get the resulting relation schemes in BCNF. (End of Example 1.25)
As we saw, BCNF is more rigorous than 3NF, so it might not always be the optimal solution to try to achieve it. The goal of good relational database design is therefore to achieve (by decomposition) that the relation schemes of the database scheme are in BCNF, are lossless join decompositions and are dependency preserving. If this cannot be achieved, 3NF is also acceptable instead of BCNF.
Remark 1.26. There are also higher normal forms denoted by 4NF and 5NF.
We do not cover them in this discussion.
1.3.5 Indexing and Hashing
In this section we give a brief overview of some of the issues that are important from the practical point of view. They concern efficient information retrieval from the database, using files with special structure to accelerate search in the tables. In this section we talk about files and records instead of tables and tuples, to indicate that the data structures are considered on the physical level.
Indices
The time required to process a query can become very high as the files of the database grow. To speed up the retrieval process, DBMSes provide data structures, called index or hash files, for several kinds of fast search algorithms. Indices accelerate queries involving attributes whose domains are ordered, while hashes map any kind of attribute to an address space (e.g. the numbers from 0 to 255), allowing us to split up the search for the records with the requested attribute values.
The techniques can be evaluated from several viewpoints: types of access
supported efficiently (search for specific values, value ranges, etc.), access time,
insert/delete/update times, space overhead.
The set of attributes on which an index (or hash) file is created is called the
search key for the index (or hash). Relations of the database can have multiple
indices and hashes assigned to them.
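In SQL, indices are typically created and dropped with statements of the following kind (a sketch; the exact syntax and options are DBMS specific):

CREATE INDEX lname_index ON Student (lname)

DROP INDEX lname_index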
Consider a file of the database whose records are ordered sequentially according to a set of attributes. The index whose search key is this set of attributes is called the primary index. In this case the indexed file is also said to be in index-sequential order with respect to this index. Such a search key is usually (but not necessarily) the primary key of the relation stored in the file.
An index is called dense if an index record appears (in the index file) for
every search key value in the indexed file. Otherwise, when only some of the
search key values appear in the index file, the index is called sparse.
For sparse indices we first locate the largest search key value in the index file which is not greater than the search key value we are looking for. This provides a pointer into the indexed file, and we have to continue the search in the indexed file from the pointed record on. This is slower than with a dense index, but it can also require a lot less space for the index file.
Another solution to cope with the index size problem is to use multi-level
indices. In this approach the outer index file is a sparse index for the search key
that indexes not the data file but another dense or sparse index on the same
search key. There can be a hierarchy of indices in which the lowest will index
the data file.
It is also advantageous to align indices with physical storage units (i.e. disks, cylinders, blocks), such that one can store the outermost index file in memory and the index hierarchy will guide the DBMS to the right disk, cylinder and the appropriate block on it which should be read in.
Insertion and deletion require updating the affected index files. If we delete a record from a data file, we have to check for each index file (on this data file) whether the deleted record was the last one with the given search key value. If so, the index record for that value has to be deleted from the index file. For sparse indices, deleting a search key value from the index file can be substituted by updating it to the next larger value in the data file, if this larger value is not already in the index file. If such an update is not possible, the search key value must be deleted from the index file.
When a new record is inserted into the data file, for dense indices we have to insert the new search key value into those index files where it is not already included. For sparse indices the index file has to be updated only if a new block is created by the insertion of the new record; i.e. the first record of the new block will be indexed. (Remark: it depends on the indexing policy what is considered a new block; e.g. a block can consist of all the records in the table for which the lname attribute starts with the same letter.)
Indices that do not define the sequential order of the file are called secondary indices. Such indices have to be dense, because the records of the indexed file are not sorted by the search key. If the search key is not a candidate key of the indexed file, the index file can be organized as a two-level index: we search for the given search key value on the outer level, which points to the appropriate part of the inner level, which stores pointers to the records of the data file with the given search key value.
B-trees
There are several data structures used in index files, each suitable for different kinds of search methods. The most popular ones are B-trees and B⁺-trees. The most important feature of B-trees is that they are always balanced; thus the depth of the tree grows only logarithmically in the number of indexed records.
To construct a B-tree we fix an even integer n and require that
- the root stores minimum 1 and maximum n key values,
- every other node stores minimum n/2 and maximum n key values,
- all leaf nodes are at the same depth.
[Table: an example of a secondary index on the cid attribute of a course file; each cid value is associated with pointers (1-5) to the records holding that cid and the corresponding title.]
Between the search key values in a node, pointers to subtrees are placed. The subtrees have the property that each value stored in them lies between the two values neighboring the pointer that links the subtree to the node. (Of course, to each search key value belongs a bucket of pointers to the indexed file, addressing the records that have the given search key value. We do not explicitly display these pointers in the examples.) An example can be found in Figure 1.5 with n = 4.
Insertion into a B-tree occurs at leaf nodes. If there is enough space on the node, as in the case of inserting 3 in the example of Figure 1.5, we just add the value at its appropriate place. If the new element causes a node overflow (the number of key values becomes greater than n on the node), as in the case of inserting 17 in the example of Figure 1.5, we split the node into two, taking out the middle element and inserting it into the parent. We also update the neighboring pointers so that they refer to the subtrees we just created by the splitting. If the parent also overflows, we continue the splitting and propagating till we get a node with at most n elements. In the worst case we have to create a new root.
[Figure 1.5: insertion into a B-tree with n = 4 and initial key values 5, 10, 15, 18; inserting 3 fits into the node, while inserting 17 causes an overflow and a split around the middle element 15.]
Deletion from a B-tree may cause a node underflow, in which case we first try to pull over elements from a sibling via the parent; this is not possible when the node and the sibling have less than n elements altogether. Then we unify the two nodes and the element in between from the parent node to form a single new node, and discard the taken element of the parent (see the deletion of 8 in the example of Figure 1.6). If this action causes a node underflow in the parent, we again have to try to pull over elements from siblings or, if that does not work, unify the node with a sibling. Of course this can cause a node underflow in the parent of the parent; therefore the process can propagate up to the root of the tree, and if the root happens to become empty after a unification step, we just drop it and the new root will be the unified node.
[Figure 1.6: deletion from a B-tree; deleting 8 unifies two nodes via the parent, and deleting 5 pulls over elements from the left sibling via the parent.]
possible between the addresses. If a hash function satisfies the latter requirement, we can choose the size of the address space and of the buckets according to the expected number of records. However, bucket overflows are usually unavoidable in practice.
Dynamic hashing copes with the problems of bucket overflow by allowing
dynamic splitting and merging of the buckets in which the pointers are kept
according to the fluctuation of the size of the hashed file.
Extendible hashing is a specific type of dynamic hashing which incurs fairly low performance overhead. The idea behind it is similar to that of B-trees. Let the function h map to b-bit unsigned binary integers (e.g. b = 32). We use only the first i bits of the hash values to determine the bucket address, where i can initially be 1 and it increases/decreases with the size of the database.
The i-bit hash values are listed in the bucket address table with pointers to buckets, i.e. for each hash value the pointer shows us the bucket in which we can find the pointers to the records whose search key values hash to the given hash value. The bucket with index j is addressed by the first i_j bits of the first i bits of the hash values (clearly i_j ≤ i). This means that 2^(i−i_j) rows of the bucket address table point to the jth bucket (the bucket address table has 2^i rows altogether). The i_j numbers are stored in the buckets.
Locating records with given search keys should be clear.
Inserting a pointer to a record with a new key value K into the hashing scheme requires first computing h(K) and determining from the first i bits the bucket the pointer goes to; let it be of index j. If the bucket is not yet full, we insert the pointer into it.
If it is full and i_j < i, there are at least two entries in the bucket address table pointing to j. We create a new bucket, say k, set i_k = i_j + 1 and increase i_j by one. We distribute the initial segments of hash values that were formerly mapped to j equally between j and k (e.g. a value maps to j if the last bit is 0 and maps to k otherwise). Then we rehash all records that belonged to bucket j and redistribute them in the same way between j and k.
There is a small chance that all of them will go into one of the buckets (i.e. the last bit of the first i_k-bit segment of the hash values is the same for all the records). In this case we have to increase the number of bits, split and rehash again.
If the bucket is full and i_j = i, we have to increase i and i_j by one. This means that the size of the bucket address table doubles. We set the pointers of each row of the new table to the bucket to which the row identified by the first i − 1 bits pointed in the old table (i is considered with its new, increased value here). Then two entries point to each bucket, in particular to the jth one, and we can act as we did in the previous case (see Figure 1.7).
Deletion is done similarly by applying unifying strategies to merge buckets
with only a few pointers in them. The strategy can be taken from B-trees.
As a final word on indexing and hashing, we can remark that the decision on which one to use for which search keys has to consider:
- whether the periodic re-organization of index and hash file structures pays off,
- the relative frequency of insertions and deletions,
- whether we want better average access times or better worst case access times,
- the type of queries we expect on the database.
[Figure 1.7: Insertion into an extendible hash scheme (maximum bucket size 4): splitting up bucket 1, redistributing and inserting a seventh key, with hash values h(K1) = 00, h(K2) = 10, h(K3) = 01, h(K4) = 00, h(K5) = 11, h(K6) = 00.]
The last two items can decide whether we should use an index or a hash. Indices usually do not provide as good average access times as hashes for a particular search key; on the other hand, they are better in worst case timings. Also, if the queries ask for whole ranges of values, indices are usually the better choice.
1.3.6 Exercises
Let us move on from selling cars (Section 1.2.6) to maintaining the administration of already sold cars. Let us assume that the simplified database of the police contains the following tables: Person, Car, Category, Penalty, Speeding. Let the relation schemes be the following.
For Person: id (9 digit number), fname, lname, address (character data storing the first and last names and the addresses respectively).
For Car: rnr (character data; the registration number), id (the ID of the owner), model (character data identifying the model of the car).
For Category: code (character data identifying the penalty category), fee (a number giving the amount in EUR to be paid in the corresponding penalty category).
For Penalty: slimit (a number describing a speed limit), speed (a number with slimit < speed, describing the minimum speed value of the penalty zone), code (character data identifying the penalty category).
For Speeding: slimit (a number giving the speed limit on the road where the radar is installed), rnr (the registration number of a car), speed (the speed value of the car). For example:
Person
id         fname     lname       address
129432374  Michaela  Schumacher  4040 Linz, Freistaedter Str. 45
123695034  Gerhold   Berger      1010 Wien, Altstrasse 12
132487549  Hans      Brecht      6020 Innsbruck, Technikerstr. 36

Car
rnr        id         model
L 3452 AS  129432374  VW Passat
W 2336 KL  123695034  Mercedes C220
FR 543 HS  143547895  Nissan Micra
I 5436 S   132487549  BMW 530D

Category
code  fee
S1    30
S2    50
S3    100
S4    500

Penalty
slimit  speed  code
50      55     S1
50      60     S2
50      80     S4
80      85     S1
80      100    S3
80      120    S4

Speeding
slimit  rnr        speed
50      L 3452 AS  80
50      I 5436 S   72
80      W 2336 KL  109
80      L 3452 AS  93
50      W 2336 KL  54
80      W 2336 KL  115
Let us extend the list of predicates which can be used to describe conditions with the pattern matching operator LIKE(pattern, arg), where pattern is a string value in which '%' matches any sequence of characters and '?' matches any single character. The predicate is true if the argument string arg matches the pattern.
Determine relational algebra expressions that carry out the following computations.
Exercise 1.3.1. Compute another penalty table where instead of the code of the penalty category the fee appears.
Exercise 1.3.2. Compute the table of first and last names of people who own
a Mercedes.
Exercise 1.3.3. Compute the table of registration numbers and models of cars
in the area of Linz.
Exercise 1.3.4. Compute the table of first and last names, addresses, speed
limits and speeding values from the data captured by the radar.
Exercise 1.3.5. Compute the table of names of people who were captured by
a radar violating every speed limit value in the Penalty table (i.e. both 50 and
80).
Let us assume that the penalty system is accumulative, so that a driver has to pay the sum of the penalty fees of the categories his speeding falls into. For example, driving at 85 on a road with speed limit 50 falls into the S1, S2 and S4 categories (because all three rows of the Penalty table for limit 50 apply); thus the driver has to pay 30 + 50 + 500 = 580 EUR.
Exercise 1.3.6. Let v denote a given speed. Compute a penalty table for v
which contains the speed limits and the accumulated sum of fees one has to pay
if one drives with speed v under the given speed limit.
Exercise 1.3.7. Compute the table of first and last names, addresses, registration numbers, speed limits, speeding values and accumulated fees from the
data captured by the radar.
Exercise 1.3.8. Delete all entries from the Speeding table which do not require
paying a fee.
Exercise 1.3.9. Choose primary keys for all the tables. Check that the set of
all attributes irreducibly functionally depends on the chosen keys.
Do you think that in the Person and Car tables the attributes address and model were chosen appropriately? Wouldn't it have been simpler to select all the cars in the area of Linz if the address attribute were split up into relevant parts?
Exercise 1.3.10. Redesign the relation schemes of the Person and Car tables
so that the problems mentioned above are easier to handle.
Exercise 1.3.11. After solving the previous exercise, normalize the database
scheme to 3NF.
Exercise 1.3.12. After solving the previous exercise, set up the data in the
example tables using now the new database scheme.
Exercise 1.3.13. Construct a B-tree with maximum node size 4 which contains the search key values 1, 3, 5, 7, 9, 10, 16, 17, 20, 21. Then insert the values 11, 12, 13, 14 one after the other. Then delete 10 and finally 12.
Exercise 1.3.14. From the insertion and deletion procedures for B-trees, derive insertion and deletion procedures for B⁺-trees.
1.4 SQL

1.4.1 Data Definition
A part of SQL is a data definition language (DDL) that allows us to define/modify/delete tables (i.e. relation schemes) with domains for the attributes, define or drop indices, and specify integrity constraints, authorization information and storage structure information.
The standard domains (i.e. types) for attributes are the following:
char(n) a character string of fixed length n,
varchar(n) a variable length character string of maximal length n,
int an integer (length is hardware dependent),
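Of the table definition example that originally stood here only two NOT NULL constraints survive; a minimal sketch of such a definition (column names chosen to match the Student table used in the later examples) is:

CREATE TABLE Student (
    sid    int       NOT NULL,
    fname  char(20)  NOT NULL,
    lname  char(20))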
Integrity constraints can be expressed with the CHECK clause. For instance, checking the existence dependency via the foreign keys can be done as
follows.
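The original example did not survive the page break; a sketch in the same spirit, requiring in standard SQL (not every DBMS enforces subqueries in CHECK) that each Attendance row refers to an existing student, is:

CREATE TABLE Attendance (
    sid  int NOT NULL,
    cid  int NOT NULL,
    CHECK (sid IN (SELECT sid FROM Student)))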
1.4.2 Simple Queries
Example 1.30. First let us note in general that there are usually numerous
ways to get the same result, and there can be great differences in the performance
of the queries we can use. The queries of this example need not be optimal
from this point of view, because their purpose is to demonstrate the constructs
they use, not their practical application.
This example selects the intersection of the students who take the Analysis I
course with the ones who take the Algebra I course.
(SELECT Student.fname, Student.lname
FROM Student, Attendance
WHERE Student.sid = Attendance.sid AND Attendance.cid = 327456)
INTERSECT
(SELECT Student.fname, Student.lname
FROM Student, Attendance
WHERE Student.sid = Attendance.sid AND Attendance.cid = 327564)
The next example selects the names of the students attending at least one
course.
SELECT fname, lname
FROM Student
WHERE sid IN (
SELECT Student.sid
FROM Student, Attendance
WHERE Student.sid = Attendance.sid)
1.4.3 Database Modification
Example 1.32. Inserting a new row into the Student table could be done in
the following way.
INSERT INTO Student
VALUES (0245768, 'Alexander', 'Wurz')
After the table definition
CREATE TABLE Namelist (
    fname  char(20),
    lname  char(20))
we can fill the new table with the names of students using the following insert
clause.
INSERT INTO Namelist
SELECT fname, lname FROM Student
Adding a student whose first name is not yet known (to be added later), could
be done as follows.
INSERT INTO Namelist (lname)
VALUES ('Hackl')
Then later . . .
UPDATE Namelist
SET fname = 'Hans'
WHERE fname IS NULL AND lname = 'Hackl'
Deleting all students whose last names do not start with R or S (e.g. to keep
only the students the teacher got in his exercise group after distributing them
in alphabetical order) can be done with the following statement.
DELETE FROM Namelist
WHERE NOT (lname LIKE 'R%' OR lname LIKE 'S%')
1.4.4 Views and Joins
Finally, we briefly discuss how to construct views and joins in SQL. The CREATE VIEW statement has the following structure.
CREATE VIEW view name AS query
The query is any legal query as defined in Section 1.4.2. After its definition the view
name can, in principle, appear everywhere where the name of a regular relation
can. However, in certain situations this can lead to problems (see Example 1.33);
therefore most DBMSes allow modifications via views only if the view is defined
using a single table.
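For instance, a (hypothetical) view collecting the names of students who attend at least one course could be defined as:

CREATE VIEW Attendees AS
    SELECT Student.fname, Student.lname
    FROM Student, Attendance
    WHERE Student.sid = Attendance.sid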
The DROP command also has a version for views.
DROP VIEW view name
We have the following join types: inner join, left outer join, right outer join,
full outer join. The join operation is specified by a further condition (given
by the NATURAL, ON, USING clauses) so that the full join specification is as
follows.
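In the usual SQL notation (a sketch; the exact grammar varies slightly between systems) the full specification is roughly:

table1 [ NATURAL ] [ INNER | LEFT | RIGHT | FULL [ OUTER ] ] JOIN table2
    [ ON condition | USING (col name [, col name]*) ]

For instance, SELECT * FROM Student NATURAL JOIN Attendance joins the two tables on their common sid attribute.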
sid and cid data are missing (the user of Sc does not even know about these
attributes), and if we wanted to substitute null values for them, the new row
wouldn't even appear in the Sc table (as a join cannot be formed on null values).
This is the reason why database modification via views is allowed only if the
view is defined in terms of a single table.
1.4.5 Embedded SQL
1.4.6 Exercises
Get access to an RDBMS (PostgreSQL, MySQL, MS Access, etc.) and try the
commands out in practice.
Exercise 1.4.1. Create the example tables of Section 1.3.6 using SQL.
Exercise 1.4.2. Translate all the relational algebra expressions which you obtained as solutions for exercises in Section 1.3.6 into SQL queries and try them
on the example tables.
Exercise 1.4.3. Create a view for listing speeding values with the data computed in Exercise 1.3.7.
Chapter 2
On-Line Information Systems
In this chapter we take a short outlook on problems and applications that usually
occur in information systems working on a network, serving (or collecting
information from) a large class of users. Typical examples are: serving web pages
with dynamic content, providing 24/7 services (commerce, booking, shopping,
etc.), Internet search engines, and so on. Usually these systems are supported
by powerful RDBMSes; thus this chapter is also related to Chapter 1.
In Section 2.2 the reader is pointed to a few free software systems that
can be used to build on-line information systems. Though these systems cost
nothing, in many respects they represent the state of the art in their classes, and thus
are widely used.
2.1 On-Line Databases
2.1.1 Security Control
Security control in DBMSes has the basic forms of discretionary and mandatory
control. Basically all the DBMSes support the first kind and some additionally
support the second kind. Defining a discretionary control scheme consists of
defining authorities: named security clauses that grant certain privileges to
certain users on certain tables (or views). The corresponding SQL command is:
GRANT { privilege [, privilege]* | ALL [ PRIVILEGES ] }
ON [ TABLE ] tablename [, tablename]*
TO user [, user]* [ WITH GRANT OPTION ]
where privilege is one of the SQL privileges (SELECT, INSERT, UPDATE, DELETE, etc.), tablename names a table or view, and user identifies the grantee.
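For instance, the following (hypothetical) authority would let user hans query and update the Student table and pass these privileges on:

GRANT SELECT, UPDATE ON TABLE Student TO hans WITH GRANT OPTION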
performed, but they should forbid obtaining (or even deducing) information
about individual rows. (Think, for instance, of a DB with collected medical
data from which one can determine how many people suffered from a given
illness last year in the country, but one should not be able to find out the
names of these people.) Such DBs try to protect individual data by methods
like allowing only queries where at least n rows are involved in the aggregation
(for some n > 1). However, this does not give full protection (try to construct
a counterexample!).
Once the security scheme is implemented on the database, the security subsystem of the DBMS can identify each user by a login name, authenticated by
a password. However, if the clients need not be in a closed network, it can also
be mandatory to communicate via a secure channel.
One of the most common protocols is SSH (Secure Shell), which can establish secure connections between hosts, encrypt the data that are exchanged (also
in the authentication phase), and provide secure tunneling capabilities for a
wide variety of applications. Another such protocol is SSL (Secure Sockets
Layer), which provides secure data exchange similarly to SSH and is used mostly
in WWW applications. Both protocols allow the use of public-key cryptography,
usually based on the RSA cryptographic scheme. These cryptographic tools are
considered strong nowadays (relying on the difficulty of factoring integers), and
the corresponding protocols are widely used.
Remark 2.1. In short, the RSA (Rivest-Shamir-Adleman) algorithm can be
summarized as follows:
Key generation: generate two random primes p, q of approximately the same
bit-length such that n = pq is of the required length (e.g. 1024 bits). Set
h = (p - 1)(q - 1), choose an integer 0 < e < h such that gcd(e, h) = 1 and
compute 0 < d < h such that ed ≡ 1 (mod h). The public key is the pair (n, e)
and the private key is (n, d); the values p, q, h must also be kept secret.
Encryption: Let the message be m such that m < n (in practice a long
message can be broken up into segments of length smaller than log2(n), and
the bit sequence is just understood as the number m). The (segment of the)
encrypted message corresponding to (the segment represented by) m is: c = m^e mod n.
Decryption: Let (a segment of) the encrypted message be c; the corresponding (segment of the) decrypted message is: m = c^d mod n.
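As an illustration, here is a minimal sketch of the scheme in Python with textbook-sized primes (completely insecure; the numbers are chosen only for demonstration):

# Toy RSA with tiny primes -- for illustration only, not secure.
p, q = 61, 53
n = p * q                  # modulus: 3233
h = (p - 1) * (q - 1)      # 3120
e = 17                     # public exponent, gcd(e, h) = 1
d = pow(e, -1, h)          # private exponent: e*d = 1 (mod h); Python 3.8+
m = 65                     # a message segment, m < n
c = pow(m, e, n)           # encryption: c = m^e mod n
assert pow(c, d, n) == m   # decryption: c^d mod n recovers m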
2.1.2 Transaction Management
if the name of a student contains a typo in the Student table, the database
containing it is consistent but not correct. On the other hand, what will be at
the center of our interest in the rest of this section is to guarantee that a correct
database gets transferred into a correct one by a concurrent execution of
individually correct transactions.
Transactions which are valid on their own can, when executed concurrently,
leave the database in an incorrect or even in an inconsistent state. There are
three basic kinds of interference:
Lost update The transactions read and update the same row of a table concurrently, as in the diagram below.

Time  Transaction 1  Transaction 2
t1    retrieve r
t2                   retrieve r
t3    update r
t4                   update r

Uncommitted dependency A transaction retrieves or updates a row which has been updated by another transaction that is afterwards rolled back, as in the two diagrams below.

Time  TA 1      TA 2
t1    update r
t2              retrieve r
t3    rollback

Time  TA 1      TA 2
t1    update r
t2              update r
t3    rollback

Inconsistent analysis A transaction aggregates over rows while another transaction updates some of them, as in the diagram below.

Time  Transaction 1  Transaction 2
t1    retrieve r1
t2    aggregate r1
t3                   update r3
t4    retrieve r2
t5    aggregate r2
t6                   update r1
t7    retrieve r3
t8    aggregate r3

The result of the aggregate function is wrong no matter how we view it,
because it was computed with the old value of r1 and the new value of r3;
it therefore reflects a state of the database which need not be correct.
Such situations can be handled with locking mechanisms that deny access
to a row, or a set of rows, in a table for other transactions while modifications
are being done. We distinguish the following lock types:
Exclusive lock (write lock, X-lock) Requests for any other lock on the object
must wait till the active exclusive lock is released.
Shared lock (read lock, S-lock) Requests for other shared locks on the object
are immediately granted; exclusive locks must wait until the object is
released.
The locking is usually done implicitly by the transaction manager, depending on
the operations in the transaction.
The locking protocol can then be defined as follows. A transaction that
intends to retrieve a row must first acquire an S-lock on it. If a transaction
intends to modify a row, it must first acquire an X-lock on it, or, if it already
has an S-lock on it, it must upgrade it to an X-lock. If a locking request from
a transaction cannot be immediately granted, the request goes into a waiting
queue, from which the requests are granted on a first-come first-served basis.
If the object has an S-lock on it, other S-lock requests (for this object) can
be immediately granted. However, this can force the X-lock requests in the
waiting queue to wait for a long time (potentially forever). This phenomenon
is called starvation. The system should be prepared to avoid starvation, e.g.
by not granting out-of-queue locks anymore if the first request in the waiting
queue has been waiting for more than a given amount of time.
In this scenario a new problem can occur, if two or more transactions wait
on each other to release a lock. This situation is called a deadlock. We can see
this in the diagram of the lost update problem now with locking applied.
Time  Transaction 1        Transaction 2
t1    acquire S-lock on r
t2    retrieve r
t3                         acquire S-lock on r
t4                         retrieve r
t5    X-lock request on r
t6    wait                 X-lock request on r
t7    wait ...             wait ...
Predicting and avoiding deadlocks can be very difficult, and the detection of a deadlock is also not trivial (finding cycles in graphs that represent which transaction
waits for which others). So in practice, transaction managers just assume
that if a transaction has not done any work for more than a given time limit,
it is deadlocked. In such a case the system chooses a victim transaction from
the set of presumably deadlocked ones, which suffers an implicit rollback
and thereby releases all the locks it held. (This is also a reason why a valid
transaction can be rolled back in an OLTP environment.)
Another important concept in showing correctness of an execution of a given
set of transactions is serializability. An (interleaved) execution (also called a
schedule) of a set of transactions (which are individually correct by assumption)
is called serializable if and only if it produces the same result as some serial
execution of the same set of transactions, independently of the (a priori correct)
initial state we start from.
Time  Transaction 1        Transaction 2
t1                         acquire X-lock on r
t2                         update r
t3                         release X-lock on r
t4    acquire S-lock on r
t5    retrieve r
t6    release S-lock on r
t7    commit
t8                         rollback
Since the second transaction is rolled back, the first one should also be rolled
back, because it used a value of r which was updated by the second and then
restored. However, it is not possible to roll the first transaction back since it
has already committed.
Therefore to achieve recoverability, we require that a transaction which uses
data modified by another transaction must not commit before the other terminates (and if that rolls back, the first must be rolled back too). Such a schedule
is called cascade-free.
Remark 2.5. To increase the transaction throughput of the DBMS, in practice
systems do not require full serializability and cascade-freeness. Of course, this
can lead to interference in critical situations. The systems control the tightness
of the requirements via isolation levels.
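For instance, in standard SQL (the exact syntax and the available levels vary between systems) the isolation level of a transaction can be set as follows:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE

Weaker levels, such as READ COMMITTED or REPEATABLE READ, admit more interference but allow higher throughput.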
Another approach to avoid starvation (frequently used together with isolation levels) is multi-version concurrency control. In this approach each
transaction sees the snapshot (or version) of the database which is created at
its starting time. Reads do not block writes and vice versa; only writes on
the same row can block each other. The implementation of a multi-version
system is much more complicated than that of a traditional system; however,
higher performance can be achieved with fewer compromises.
Finally, the requirements for an online transaction processing system can be
summarized in the ACID test: Atomicity, Consistency (or Correctness), Isolation and Durability.
Atomicity Results of transactions are either all committed or rolled back (all
or nothing principle).
Consistency (or Correctness in the more rigorous sense) The database is transformed by a transaction from a valid state into another valid state. (Consistency, however, need not be preserved at intermediate points of a transaction.)
Isolation (or serializable and non-cascaded in the more rigorous sense) The result of a transaction is invisible for other transactions until the transaction
commits.
Durability Once a transaction has committed, its results are permanently
recorded in the database, so that they survive any subsequent system
crashes.
Example 2.6. Although the concepts should be clear on their own, here are
a few wordy examples to illustrate what these requirements mean when the
on-line information system is, say, a discussion forum.
Atomicity: When a new user registers, he might have to type in lots of data
through possibly several pages. Whenever a page is submitted in the course of
the registration process, some of the tables of the system might get updated right
away. If the user changes his mind at the last page and cancels the registration,
the whole transaction is rolled back as if the registration process had never
been started.
Consistency: Let the consistency criteria require that all postings on the
forum have a user ID which is a foreign key for the table of users. If a
careless administrator tries to delete the account of a user without first deleting
all his postings from the forum, the DBMS should trigger an implicit rollback on the
transaction, since it would create orphaned postings (postings without a valid
user ID), violating integrity constraints.
Isolation: When a new user registers (as in the point on Atomicity), as long as
the registration process is not completed, that is, the corresponding transaction
is not committed, the updates which are made in the tables of the database are
not visible to other transactions. For instance, if the administrator of the site
starts a query to produce some statistics on the registered users of the forum,
the new user will not appear in the statistics, because his registration was not
yet committed when the administrator's query started, although the user's data
might have already been inserted into some of the tables of the database.
Durability: Let us suppose that the site of the forum also sells things on-line,
and let a user order something. He sends his request, which the e-commerce
module of the system processes successfully, but a millisecond after the transaction
commits, a fatal malfunction in the UPS cuts off the power for the computers
of the service provider. The HW staff manages to bypass the smoking UPS
in about 10 minutes, and after another 5 minutes of booting and initialization
the system is up and running again. The e-commerce subsystem continues to
process the new order (inserted by the committed transaction) and the user gets
his gadget in a few days.
Example 2.7. Typical examples of OLTP systems are: e-commerce (on-line
shopping, booking and banking systems), e-trading, on-line service providing (discussion forums, email, homepages, etc.), distributed source archiving systems (for
software development, publishing, etc.), e-government and e-administration,
and countless other applications.
2.1.3 Static Archives
the initial distribution of the independent and dependent attributes does not
reflect reality as well as another distribution would. In this case it might occur
that an independent and a dependent attribute have to swap roles (this is called
pivoting). Also, the independent attributes might form hierarchies (e.g. hour,
day, week, month, year) with respect to which the dependent values can be
aggregated. Moving from a lower level of aggregation to a higher one is called
drill up, and the opposite is called drill down.
The advantage of multi-dimensional databases is that they can give fast
responses to queries. Their disadvantage is the phenomenon of data explosion, against which developers try to apply sparse representation techniques.
Also, multi-dimensional databases are not as mature as relational ones; thus in
OLAP the underlying DBMS is often a relational one.
When the multi-dimensional view of the data is based on traditional relational databases, the subject is referred to as ROLAP (R stands for relational).
The multi-dimensional model is represented in the simplest case by a star scheme
(over the tables of the database) in the relational model. At the corners of the
star are the dimension tables, each representing an independent dimension (e.g.
date, product, etc.), and in the middle is the fact table, which connects all the
dimension tables with a multiple join. A row in the fact table contains foreign
keys to each dimension table, providing the coordinates of the corresponding
cell in the multi-dimensional model, and the values of the dependent attributes
at the given coordinates (see Figure 2.1). When the hierarchy of the dimensions
is represented explicitly via normalization, the scheme is called a snowflake scheme.
Figure 2.1: A star scheme. The fact table (productID, customerID, storeID, date, units_sold) references the dimension tables Products (productID, product_name, price), Customers (customerID, customer_name, customer_address), Stores (storeID, store_address, store_category) and Dates (date, year, month, day).
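A typical ROLAP query joins the fact table of Figure 2.1 with some of its dimension tables and aggregates over the rest; a hypothetical example (assuming the fact table is simply named Fact):

SELECT product_name, year, SUM(units_sold)
FROM Fact, Products, Dates
WHERE Fact.productID = Products.productID AND Fact.date = Dates.date
GROUP BY product_name, year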
The ROLLUP scheme takes the initial sublists of a list of attributes to group
on, including the empty list (which results in an overall sum).
GROUP BY ROLLUP (col name [,col name]*)
The CUBE scheme takes all the possible subsets of the given set of attributes.
GROUP BY CUBE (col name [,col name]*)
Remark 2.9. Assume that there are columns with names a and b. Please
note that
GROUP BY ROLLUP (a, b)
is equivalent to
GROUP BY GROUPING SETS ((a, b), (a), ())
and
GROUP BY CUBE (a, b)
is equivalent to
GROUP BY GROUPING SETS ((a, b), (a), (b), ())
In the resulting tables of queries using these special groupings, the attributes on which no grouping was made are set to NULL. In these cases the
meaning of the null value is not the same as originally.
Example 2.10. Let us take the specification of the Result table from Example 1.31
and let it contain the following data.

Result
sid      cid     grade
0251563  327456  1
0251563  327564  2
0245654  327456  1
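A query consistent with the result table shown below (an assumption; grouping on sid and on cid separately, with cid being NULL in the rows grouped on sid) is:

SELECT cid, AVG(grade) AS avg_grade
FROM Result
GROUP BY GROUPING SETS ((sid), (cid))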
cid     avg grade
NULL    1.5
NULL    1.0
327456  1.0
327564  2.0
Data mining is an important application relying on the integrated and consistent data that data warehouses can provide. Its task is exploratory
data analysis. Data mining tools can usually also cooperate with OLAP tools
and utilize their strength in query processing and in the multi-dimensional view
of the data.
Data mining tools apply statistical techniques to the large amount of data
available in the data warehouse and look for patterns (trends, regularities) in
the data. That is, the purpose of data mining is to discover new information in
the data. Commonly used methods come from the area of machine learning:
Association rules They are used for the identification of attribute values that
occur together more frequently than would be expected if the values (or
the real objects they represent) were independent.
For example, if we analyze the archives of a supermarket, we might find
after looking at 200000 bills that in 70% of the 100000 cases when somebody bought bread, she also bought milk. We say then that the association
rule (bread ⇒ milk) has 70% confidence with 50% support on a population
of 200000.
Genetic algorithms They simulate the process of evolution in nature. The
model works with a population of individual entities (data items), operations of mutation and crossover, and a fitness function on individuals. A
cycle of the algorithm usually consists of evaluating the fitness function on
each individual and selecting the best ones. Then a new population is computed
using the mutation and crossover operations on the selected subset, and
the old population is replaced with it.
A common application of genetic algorithms is multi-dimensional optimization (the fitness function is being optimized) in cases where there are
no fast special purpose algorithms available.
Decision trees A classification method to isolate homogeneous classes of data
for some target attributes and to describe the classes with as few test
attribute values as possible. Initially the whole population of data items is
assigned to the root of the tree and all test attributes are considered
available. Then at an intermediate step, the algorithm takes a leaf node
of the tree, and if all its data items have the same target attribute values,
the leaf is declared to be a class and is not divided further. Otherwise,
an available test attribute is chosen which maximizes some measure
of the information gain when the data items of the node are split up into
groups distinguished by the value of the chosen attribute. The subsets
in the resulting partitioning of the items become new leaves of the tree,
attached to the node which was split up.
Clustering Algorithms in this category associate data items which are
similar to each other (with respect to some metric) by forming groups of
them. There are several kinds of clustering algorithms, and they can also
be parametrized to fine-tune their behavior (e.g. setting thresholds that
determine when to split a cluster into smaller ones and when to do the
reverse, etc.).
2.1.4 Search Engines
A special large-scale database application is the well-known category of search
engines, also known as information retrieval systems (IR systems for short).
Search enginology is a science of its own, comprising not only database management but several steps from data acquisition to ranking and presentation of
the results. The internal details and algorithms of specific search engines are
industrial secrets, so there is very little information available publicly. In this
section we take a very brief look at how search engines work.
The first important component of a search engine is the database. It is
an inverted index, where the search keys are the terms we can search for, and
the entries are pointers to the documents which contain them. Moreover,
additional information is stored that reflects the relevance of the document in
the context of the given search key.
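A minimal sketch of the idea in Python (a toy index without any relevance information):

# A toy inverted index: term -> list of (document id, position) pairs.
from collections import defaultdict

index = defaultdict(list)

def add_document(doc_id, text):
    for pos, term in enumerate(text.lower().split()):
        index[term].append((doc_id, pos))

add_document(1, "databases store structured data")
add_document(2, "search engines index web data")
print(index["data"])    # [(1, 3), (2, 4)] -- the documents containing "data"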
Data Acquisition
A primary task in building up and maintaining such a database is data acquisition. This means not only finding valid URLs on the web but also relevant
information about them which can be used in the indexing process. There are
two basic kinds of data acquisition. The first is the human-powered one, in which
the authors of the web pages submit their URLs, or the editors of the company
collect the information and submit it to the document processing unit. The
second is the automated way, in which programs, also called web crawlers,
spiders or robots, roam the Internet going from link to link and collect
information on the web pages they find.
These programs are often referred to as software agents, to emphasize that they
have an otherwise unusual degree of autonomy. They usually start with an
archived list of URLs (considered as the best ones in the database) and follow
the links found on these pages. They provide data for the parser and indexer
subsystem that extends/updates the database (also called the index) whenever
appropriate. They might filter the data they find on a particular page, selecting
only the keywords or other meta tags, or sending only a limited amount of data
from the beginning of the page, etc. Recent programs even dig into the (doc,
ppt, pdf, ps, etc.) documents they find available on the Internet to explore the
domain of information that exists in these special formats.
Most of these agents are well behaved in the sense that they respect the
wish of the webmaster if he wants to keep them away from certain parts of
his site. The policy for the software agents can be set up by creating a file
robots.txt in the root of the served directory structure. The file has a fixed
structure which is parsed by the robot. For instance, with the following file the
administrator can exclude all robots from the cgi-bin and the tmp directories.
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Document Processing
Once the data has been retrieved, the parsing and indexing phase comes (as we
saw, some preliminary parsing or filtering can also be done in the crawler). This is
processor.
The first step is the tokenization of the input query. Usually this just
means breaking the query string down into alphanumeric substrings separated by whitespace and/or punctuation signs.
Then the tokenized string is parsed to detect control keywords like Boolean
operators (AND, OR, etc.), proximity operators (NEAR), and reserved punctuation (quotation marks). After this the system might immediately form the
query for the database and execute it.
If more preprocessing is required, then usually stop word removal and stemming come next. This, just as in the case of the document processor, can
simplify the query a lot, especially for search engines that apply natural language
processing (NLP).
After this the query is formed for the inverse index. The way this is done can
depend on the type of matching the system uses, i.e. the process of searching the
database for documents that satisfy the requirements of the query. For instance,
if Boolean matching is used, the tree of a logical expression is created (leaves are
search terms and the other nodes are logical connectives). This is again a point
where the query can be executed on the database.
Advanced search engines may go further, and before the execution the query
is expanded. The expansion includes search terms that did not appear in the
original query but are synonyms, generalizations or specializations of terms
that appear. For instance, searching for "car" might also include "auto" and
"vehicle".
Moreover, if there is more than one query term, the terms can be weighted by
the query processor (or, in unusual cases, the weight information can come from
the user). After this step the expanded and weighted query is executed on the
database. The matching in most cases boils down to a binary search.
Ranking of Results
The next important subsystem is the one that ranks the (usually very large)
set of results obtained for a query. These ranking algorithms assign scores
to each document that reflect its relevance for the given query. The scoring
algorithms are also treated as industrial secrets, but they usually regard the
presence/absence, weights, frequency, proximity and location of query terms in
the document; the size, age and attached meta data of the document; the links that
point outward from the document and the links that point to the document from
other documents; and many other things (like the paid entries in Google).
An interesting ranking scheme based on the inward/outward links of documents is the HITS algorithm (Hyperlink Induced Topic Search). It is based
on the simple observation that if a document A has a link to document B then
the author of A considers B as a valuable information source. Moreover, if A
points to a lot of good documents then the opinion of the author of A is more
valuable, which also increases the value of B.
A document which points to many others is a good hub, and a document
to which many others point is a good authority. The goodness of documents is
established in an iterative process.
For a given query the HITS algorithm needs an initial set of documents
called the root set. Then this set gets extended with documents that point to
documents in the root set (using the database) and with documents to which
documents from the root set point. Then each page gets assigned a hub number
and an authority number, both initialized to 1. In an iteration of the algorithm
these numbers are renewed by the following rules: the new authority number of
a page P is the sum of the old hub numbers of the pages that point to P, and the new
hub number of P is the sum of the old authority numbers of the pages P points
to. After the hub and authority numbers are updated, they are normalized to
avoid their limitless increase. The iteration is repeated until the system gets
close to an equilibrium state.
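A minimal sketch of the iteration in Python (the link graph is a made-up example):

import math

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # page -> pages it points to
hub = {p: 1.0 for p in links}
auth = {p: 1.0 for p in links}

for _ in range(50):
    # new authority number: sum of the old hub numbers of the pages pointing here
    new_auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
    # new hub number: sum of the old authority numbers of the pages pointed to
    new_hub = {p: sum(auth[q] for q in links[p]) for p in links}
    # normalize both vectors to avoid limitless increase
    na = math.sqrt(sum(v * v for v in new_auth.values()))
    nh = math.sqrt(sum(v * v for v in new_hub.values()))
    auth = {p: v / na for p, v in new_auth.items()}
    hub = {p: v / nh for p, v in new_hub.items()}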
This algorithm ranks pages by popularity, and it also has its drawbacks:
mutually reinforcing relationships between hosts (if they have many links to
each other) and topic drifting (as the extension of the root set might include
documents which are not relevant for the original query). These issues are solved
or alleviated by modifications of the original algorithm.
Based on the ranking, the results are listed for the user. One more information source for sophisticated search engines is the user himself. They might
allow the user to provide feedback or to modify the query and search again.
Sometimes the user gives feedback just by clicking on the link of the chosen
page, which points to the site of the search engine and contains some additional
data to redirect the browser to the requested document. With this the user
ends up where he wanted, and the system at the site of the search engine also
gets to know the choice of the user.
Finally, we mention that there are plenty of documents (easily locatable
with any search engine) that explain how to use search engines effectively and
how to improve the relevance of the search results by well-posed queries. As
this topic is outside the scope of the lecture, we do not discuss it here.
2.1.5 Exercises
Transactions are supported by many state-of-the-art DBMSes; therefore solving
these exercises is not at all difficult. A more complicated task would be to
design the transaction blocks of time-critical systems for optimal throughput.
The exercises are recommended to be solved in practice, using your favorite
DBMS and programming language (e.g. PostgreSQL/Perl).
The context is the exercise database introduced in Section 1.3.6.
Exercise 2.1.1. Design a simple procedure (function/subroutine) for deletions
of entries from the Car table which enforces the following constraint: the entry
of a car cannot be deleted if there is more than 50 EUR accumulated unpaid
fee in the database. Remarks:
The total fee can come from many different speedings.
Paid penalties are assumed to be removed from the Speeding table.
If the total amount of unpaid penalties is at most 50 EUR, the entry of
the car can be removed after all the speeding entries of the car are removed
from the Speeding table.
The procedure has to use transactions to modify the database, so that the deletion operation transforms the database from a correct state to a correct state.
Exercise 2.1.2. Design a procedure for the deletion of entries from the Person
table with an input boolean parameter cd. If cd is false, the person cannot be
deleted as long as he has a car registered; if cd is true, the person requested
to unregister his car(s), so the entries in the Car table that belong to the
person in question should be deleted as well. (Use the procedure of Exercise 2.1.1 to
do the deletion and to decide whether the person can unregister his car(s).) The
procedure has to use transactions for the database modification.
Exercise 2.1.3. Design a procedure for updating registration numbers. Input
parameters are the old and the new numbers. The procedure has to do the
modification in a transaction to ensure consistency of the database.
Exercise 2.1.4. Find as many free DBMSes as you can that fulfill the ACID
test. Find their official web pages, the nearest mirror for download, and documentation in German (or languages other than English).
2.2 In Practice
2.2.1 Web Servers
2.2.2
2.2.3
Information about Perl can be found in [4] and about PHP in [5].
Chapter 3
XML
Extensible Markup Language (XML) is a globally accepted, vendor-independent standard for representing text-based data. The organization behind XML
and many other web-related technologies (like the Hypertext Markup Language,
HTML, or Cascading Style Sheets, CSS) is the World Wide Web Consortium
(W3C). The W3C can best be described as a committee where different representatives of the development community collaboratively decide standards for
the Web. XML was derived from the Standard Generalized Markup Language
(SGML), which has existed for decades, via substantial simplifications.
The most important goals XML was designed to achieve are:
Simplicity XML documents should be strictly and simply structured.
Compatibility XML is platform independent. It should be easy to write or
update applications that make use of XML.
Legibility XML documents should be human readable.
Applications of XML include (but are not limited to)
Data storage Providing a standard way for storing text data in a hierarchically structured way.
Data exchange Serving as a common language to which text based data can
be transformed from various formats.
Templating Creating template documents that describe various fields and attributes, for instance in case of business forms.
HTML translation Data stored in XML format can easily be transformed
into HTML for Web publication.
SOAP Simple Object Access Protocol to exchange messages in a platform independent way.
In this chapter we give a brief introduction to the basics of XML; we discuss
namespaces, then document templates; after this we take a look at accessing
data in XML documents, and finally we discuss transformations of XML documents.
3.1 XML basics
3.1.1 Syntax
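As a running example for this section, consider the following folder of emails (a reconstruction; the element and attribute names are the ones referred to below):

<?xml version="1.0" encoding="UTF-8"?>
<folder>
  <email date="20 Aug 2003">
    <from>robert@company.com</from>
    <to>oliver@company.com</to>
    <subject>Meeting</subject>
    Could we meet this week to discuss the interface problem
    in the NTL project? --Rob
  </email>
  <email date="21 Aug 2003">
    <from>oliver@company.com</from>
    <to>rob@company.com</to>
    <cc>amy@company.com</cc>
    <subject>Re: Meeting</subject>
    OK. What about today at 16:00?
  </email>
</folder>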
As we can see from the example above, when we convert data (e.g. emails) into
XML, its structure is indicated by additional text (adhering to certain rules); in
other words, the original text is marked up. The first kind of text, which carries the
data being represented, is called character data or PCDATA (parsed character
data), and the second kind, delimited by < and >, is the markup. The markup
consists of tags; for instance the pair <cc> and </cc> form a tag, where the
former is the starting tag and the latter is the ending tag.
The outermost tag is called the root element (in the example it is folder,
the actual tags are formed by adding the <, > signs and prepending / for the
end-tag). There must be exactly one root element in an XML document, into
which all the other tags are nested. The nesting must be a proper one, in the
sense that for each non-root element there is a unique element which contains
it and which contains no other element that also contains it.
An XML document may also contain a prolog, which is just the text before
the root element. The prolog is not part of the structured data; it usually
contains compiler directives or the self-identification of the document.
Elements
Elements are the primary structuring units of XML documents. They are
marked with start- and end-tags, where the name of the end tag must match
the name of the start tag with a / prepended. The element, as a logical unit,
contains the start-tag, its content and the end-tag. Between the start- and
end-tags, elements can contain any combination of character data and other
elements.
The content of elements can be
element if the element contains only elements (e.g. folder in the previous
example),
character if it contains only character data (e.g. to in the previous example),
mixed if it contains both (e.g. email in the previous example),
empty if it contains nothing (e.g. <x></x>).
Empty-content elements can be abbreviated by putting a slash before the >
sign of their start-tag and omitting the end-tag (e.g. <x/>).
The relationships between the elements can be classified quite naturally.
Child element It is an element of another one in the first nesting level (e.g.
both email elements are children of folder).
Parent element It is the reverse of the child relationship (e.g. folder is the
parent of both email elements).
Sibling element These are elements with the same parent (e.g. the elements
from, to, cc, subject inside an email element are siblings).
Descendant element It is an element in the transitive closure of the child
relationship (e.g. any of the to elements is a descendant of folder).
Ancestor element It is an element in the transitive closure of the parent relationship (e.g. folder is an ancestor of any other elements, except itself,
which is not surprising as it is the root element).
Please note that the parent is uniquely defined in consequence of the proper
nesting we required from XML documents.
Names for elements can be chosen by the user according to the following
rules.
Names are case sensitive.
Names cannot contain spaces.
Names starting with xml (in any case combination) are reserved for
standardization.
Names can only start with letters or with the _, : characters.
They can contain alphanumeric characters and the _, -, :, . characters.
The colon character, although allowed in names, should only be used for namespaces (see Section 3.2).
Attributes
As the example at the beginning of the section shows, elements can also contain
attributes, which are name-value pairs, listed in the start-tags of elements. Here
are the rules for attribute insertion into elements.
The naming rules of elements apply also for attributes.
Elements can contain zero or more attributes.
The names of the attributes must be unique within a start-tag.
Attributes cannot appear in end-tags.
Attribute values must be enclosed in single or double quotes.
Attributes can be resolved into elements, and character data can be put into
attributes (though this might not make much sense in many cases). For instance,
in the example we could have written the first email as follows.
<email>
<date>20 Aug 2003</date>
<from>robert@company.com</from>
<to>oliver@company.com</to>
<subject>Meeting</subject>
Could we meet this week to discuss the interface problem
in the NTL project? --Rob
</email>
Or, for instance, the second email could have been formed as follows.
<email date="21 Aug 2003" from="oliver@company.com"
       to="rob@company.com" cc="amy@company.com">
<subject>Re: Meeting</subject>
On 20 Aug 2003 rob@company.com wrote
> Could we meet today to discuss the interface problem
> in the NTL project? --Rob
OK. What about today at 16:00?
</email>
There are no strict rules about when to use attributes and when to use elements,
so it is up to the user's style to decide this question. As a generic rule, one can
put those data into attributes which are not important for most applications
(or humans).
Additional XML syntax
Since characters like < and & have special meaning in XML, they cannot appear
directly in character data. Whenever such a special character has to be included
anyway, one has to substitute it with the corresponding entity reference that
starts with a & and ends in a semicolon. The next table summarizes the entity
references which can be used to substitute special characters.
Character  Entity reference
<          &lt;
>          &gt;
"          &quot;
'          &apos;
&          &amp;
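For instance, the character data a < b & b > c has to be written as a &lt; b &amp; b &gt; c.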
In attribute values, the same kind of quotation mark as the one used as the
delimiter must be replaced with the corresponding entity reference. For example,
the following are equally valid attribute settings.
<player name='Thomas "Giant" Stevens' />
<player name="Thomas &quot;Giant&quot; Stevens" />
Comments can be included in the XML document anywhere outside other
markup with the following syntax.
<!-- Comment text comes here. -->
Here <!-- and --> are the delimiters of the comment and obviously, the text of
the comment should not contain the character sequence of the ending delimiter.
The first line of the example (<?xml version="1.0" encoding="UTF-8"?>)
is the XML declaration, which is not required; however, it is recommended to
include it. Currently (summer 2003) only the 1.0 version of XML is in effect,
so that attribute is fixed. The given encoding specifies the usual ASCII-compatible one (with
UTF-16 it is Unicode).
The prolog may also include processing instructions that certain applications
might interpret. For instance the line
<?xml-stylesheet href="my.css" type="text/css"?>
will tell a web browser to apply the my.css stylesheet for formatting the
XML document when displaying it.
A CDATA section can be used, outside other markup, to include text literally
(special characters like <, & are not recognized) in XML documents. This
comes in handy when one wants to insert larger text blocks and does not want to
replace all special characters with their corresponding entity references. The
starting string is <![CDATA[ and the closing one is ]]>. For example, one could
use it to include HTML code in XML documents.
<example>
<![CDATA[
<html>
<head><title>My first web-page</title></head>
<body><p>Hello world!</p></body>
</html>
]]>
</example>
Obviously, the ]]> character sequence should not appear in the included text.
To summarize the most important requirements for well-formed XML documents, we recall the following.
XML elements are marked with start- and end-tags whose names must
match.
There can be only one root element.
The elements must be properly nested into each other.
Elements can have attributes to store additional character data.
We have to obey certain naming rules for elements and attributes.
Special characters must be replaced by entity references, except inside a
CDATA block.
3.1.2 XHTML
3.2 Namespaces
Namespaces are useful in solving the following complications. First of all, with
their help naming conflicts can be resolved. For instance, suppose an XML document
contains data about publishers and authors, and both have a describing attribute
name, which appears as an element with tag name in the document. The instances
<name>Springer</name>
<name>Donald E. Knuth</name>
would describe entities of different categories, while this fact would be very
difficult for a computer program to recognize.
The ad-hoc solution in this example would be to use different tags for authors
and publishers. However, in general this might not be so easy; for instance, if
the document has to be created by merging the XML document of publishers
with the one of authors, ad-hoc name changes of tags might be undesirable.
Namespaces are a standard way to solve this problem. In this example, the
documents of the publishers and of the authors should have unique namespaces
declared in them. These can be thought of as prefixes to be prepended for names
in the document. Then merging the two files is no problem anymore, because
the different prefixes will identify to which category the element belongs.
The other problem in which namespaces can help is to divide the structured
document further into application-specific groups. If the data the applications
need from the document are separable, it is advantageous to indicate which
data is used by which application via putting the corresponding tag-names into
different namespaces.
3.2.1 Uniform Resource Identifiers
Before we go into the details of XML namespaces, let us briefly discuss URIs
(Uniform Resource Identifiers) and URLs (Uniform Resource Locators), as they
will often be used in namespace definitions.
URIs are character sequences with restricted syntax that reference things
with identity (e.g. mailto:gbodnar@risc.uni-linz.ac.at). URLs are a special subclass of URIs, namely the ones that identify a resource via a representation
of its primary access mechanism (e.g. http://www.risc.uni-linz.ac.at/education/courses/).
We distinguish absolute and relative URIs. Absolute URIs refer to the resource independently of the context in which the identifier is used. Relative URIs
describe the difference between the current context and the absolute identifier
of the resource in a hierarchical namespace.
Example 3.1. Take for instance the URL http://www.risc.uni-linz.ac.at,
which is an absolute identifier for the main homepage of RISC. From any
webpage on the Internet, if a link uses this URL as a pointer, a click on it
will take the browser to the homepage of RISC. The absolute URL for the
page with the list of RISC related courses in the current semester is
http://www.risc.uni-linz.ac.at/education/courses/.
Within the RISC home page, the link Education in the middle column uses
the relative URL /education/courses/ (actually the implementation uses only
/education/, which is redirected to the above given URL). This describes how
to get the target URL from the URL of the current page, namely by appending
it to the current URL. If this relative URL appears on another homepage, it will
point to the corresponding location within that site.
Please note that if you move the cursor over the link, the browser usually
displays the appended absolute URL in the status bar (viewing the page source
reveals the used URLs).
To determine an absolute URI from a relative one requires a context, which
is called the base URI. There are the following methods to determine it.
The base URI may be embedded into the document so that relative URIs
in the document have the context explicitly set.
If not in the previous case, the base URI can be determined by the retrieval
context of the entity (resource) of one encapsulation level higher (e.g. the
resource is a table which appears in a web page).
If not in the previous cases, i.e. the resource is not encapsulated into
another entity, the last URI, used for retrieving the document which
contains the current URI, is used as a base URI (e.g. the case of the
/education/courses/ example).
If none of the above conditions apply, the base URI is defined by the
context of the application (in this case different applications can interpret
the same URI differently).
URIs can have up to four components: scheme, authority, path, query. Each
component has a rigorous syntax which it has to obey, but we will not present
these rules in full detail here (the interested reader is advised to consult the
RFC 2396 document).
The scheme component defines the semantics for the rest of the string of the
URI. The scheme component starts with a lower case letter and is separated
from the rest of the URI by a colon. Relative URIs are distinguished from
absolute ones in that they do not have a scheme component. An example is the
http character sequence in the URLs that indicates the scheme for hypertext
transfer protocol used on the WWW.
The authority component defines the top hierarchical element in the namespace. It is typically given by an Internet-based server in the form
[userinfo@]host[:port]
A few examples follow.
gbodnar@risc.uni-linz.ac.at
secure.access.com:443
193.170.37.129
The path component contains data specific to the given authority and/or
scheme, identifying the resource within their context. The hierarchical structure is indicated by / signs, which separate path segments from each other. In
Example 3.1 the relative URL /education/courses/ is actually a path component that appears also in the corresponding absolute URL.
The query component is a string to be interpreted by the resource. It is
separated from the path component by a question mark. A natural usage of
query components is, for instance, encoding data filled in by the user on a web
form and passing it as the set of input parameters to the program which is
supposed to process them. The following example is an input query for the LEO online
dictionary.
http://dict.leo.org/?search=namespace&lang=en
In this example the so-called GET method is applied for parameter passing. In
this standard the parameters are listed in the query string as attribute=value
pairs separated by ampersands.
3.2.2 Namespaces in XML
Of course, uniqueness can be guaranteed only if the company owns the domain
name and the URIs are maintained centrally within the company, so that no two
developers will invent and use the same URI.
Namespaces can be declared within any element of an XML document, using
the reserved attribute name xmlns. For instance
<list xmlns:p="http://big.company.com/product-info">
<p:product>
<p:name>Disney Mouse Pad</p:name>
<p:id>231069948</p:id>
<p:price>3.49 EUR</p:price>
</p:product>
</list>
declares that inside the list element the prefix p (the string between xmlns:
and the equals sign) will be used as an alias for the namespace http://big.
company.com/product-info. Prefixes are there to simplify the notation in an XML
document, as it would be cumbersome and unreadable to spell out the long URI
each time.
Whenever a prefix is given in a namespace declaration, all descendants of
the element which belong to the namespace must be named with the prefix
prepended and separated by a colon from the tag name (e.g. p:id). This is
also called qualifying the name (e.g. id is qualified to be in the namespace with
alias p) and the name with a prefix is called a qualified name. Both start- and
end-tag names have to be qualified (consistently). Attribute names can also be
qualified.
If no prefix is given in a namespace declaration, so that the attribute is just
xmlns, it declares the default namespace for the element. In all descendants of the element, unqualified tag names automatically fall into the
default namespace. Therefore the list of products example above can also be
written as follows.
<list xmlns="http://big.company.com/product-info">
<product>
<name>Disney Mouse Pad</name>
<id>231069948</id>
<price>3.49 EUR</price>
</product>
</list>
However, there is a difference between the two ways: in the first case the list
element is not in the namespace, while in the second it is.
Default namespace declarations are available to the element in which the
declaration takes place and to its descendants. With a prefixed namespace
declaration the element in which the declaration takes place belongs to the
namespace only if its name is qualified with the prefix we define. So an equivalent
qualified version of the previous default namespace declaration example is the
following.
<p:list xmlns:p="http://big.company.com/product-info">
<p:product>
<p:name>Disney Mouse Pad</p:name>
<p:id>231069948</p:id>
<p:price>3.49 EUR</p:price>
</p:product>
</p:list>
The section of the XML document to which a namespace declaration applies is called its scope. If one uses a qualified name outside the scope of the
corresponding namespace, the qualification will simply be ineffective. For instance, in the following XML document the second book element is not in the
namespace http://www.mybookstore.com/catalog; the string b:book in that
case stands just for itself.
<catalog>
<b:book xmlns:b="http://www.mybookstore.com/catalog"
isbn="0-321-19784-4">
<b:author>C. J. Date</b:author>
<b:title>An Introduction to Database Systems</b:title>
</b:book>
<b:book isbn="0-8129-9191-5">
<b:author>G. Brill</b:author>
<b:title>Codenotes for XML</b:title>
</b:book>
</catalog>
Remark 3.3. Attribute names are not automatically in the namespace of the
enclosing elements. For instance, in the example above the isbn attribute is not
a member of the declared namespace.
Prefixes used as aliases for namespaces can be reassigned in an XML document; however, it is not advisable to do so, in order to keep the document as
readable as possible.
3.3 XML Schema
XML schemas are templates for XML documents. With XML schemas one
can specify what structures the modeled XML documents may have and
impose restrictions on their contents. XML schemas are XML documents themselves that must follow strict syntactical rules, defined in the Schema Primer
Recommendation, available at http://www.w3.org/TR/xmlschema-0/. This
specification can be considered a schema for schemas.
This concept replaces the one of Document Type Definition (DTD), which
was used originally for the same purpose. There are several advantages of
schemas over DTDs.
While DTD syntax was inconsistent with XML, schemas are valid XML
documents themselves, so there is no additional parser necessary in XML
applications.
XML schemas provide a wide range of data types (integers, dates, strings,
etc.) for element content and attribute values.
Schemas allow the users to define custom data types, and support inheritance between types.
3.3.1 Schema declaration
3.3.2 Types, Elements, Attributes
The main part of an XML schema defines what elements and attributes an
instance can have and what types of data they are allowed to contain.
First we discuss elements, then attributes and finally types.
The most trivial elements are the simple-content ones, which can contain only
character data, possibly in some prescribed format, and cannot contain
child elements. Here is an example of a simple element declaration, where the
tag name that identifies the element must be simple and the content is some
text.
<xsd:element name="simple" type="xsd:string"/>
Please note that it is also assumed that the schema definition is as in Example 3.4, so the xsd prefix is bound to the schema definition namespace.
The element tag is defined in the xsd namespace and denotes that an element definition follows. The name attribute describes the name of the element
being declared and the type attribute its data type which is also identified with
a qualified name (i.e. string is defined in the meta-schema).
Complex element definitions allow us to specify the internal structure of the
declared elements. We will use the following XML schema constructions:
sequence This requires that the instance contains the elements that appear in
the schema in the same order and by default exactly once.
all This allows the elements in the definition group to appear in any order, and by
default exactly once. The all element can occur only in global declarations and only as a single element group declarator. The children of an all
element must be individual elements (no element groups are allowed).
choice This allows only one of the elements in the definition group to appear
in an instance document inside the complex element.
The following two attributes can appear in local element declarations and control
the number of occurrence of the declared element in the instance document:
minOccurs Specifies the minimal number of occurrences of the declared element in its parent element.
maxOccurs Specifies the maximal number of occurrences of the declared element in its parent element.
Optional elements can be prescribed via the attribute minOccurs="0", and elements that can occur arbitrarily many times can be prescribed via the attribute
maxOccurs="unbounded".
Example 3.6. The following definition (schema fragment) prescribes a fixed
structure for the product element.
<xsd:element name="product">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="name" type="xsd:string"/>
<xsd:element name="id" type="xsd:unsignedInt"/>
<xsd:element name="price" type="xsd:float"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
An example instance of this element definition can be taken from Section 3.2
with the slight modification of dropping the EUR characters from the price
element.
<product>
<name>Disney Mouse Pad</name>
<id>231069948</id>
<price>3.49</price>
</product>
In the next example each child element must appear exactly once, but the
order is arbitrary.
<xsd:element name="product_extension">
<xsd:complexType>
<xsd:all>
<xsd:element name="made_in" type="xsd:string"/>
<xsd:element name="oem" type="xsd:boolean"/>
</xsd:all>
</xsd:complexType>
</xsd:element>
An example instance of this element can be the following.
<product_extension>
<oem>true</oem>
<made_in>Japan</made_in>
</product_extension>
In the following example the occurrence constraints on child elements are
prescribed explicitly.
<xsd:element name="product_sold_in">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="country" type="xsd:string"
minOccurs="1" maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
An example instance of this element can be the following.
<product_sold_in>
<country>Germany</country>
<country>United Kingdom</country>
</product_sold_in>
With the group element we can define element groups on the global level
and we can refer to them in other declarations.
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="payment">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="customer" type="xsd:string"/>
<xsd:element name="address" type="xsd:string"/>
<xsd:group ref="payment_method"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:group name="payment_method">
<xsd:choice>
<xsd:element name="credit_card" type="xsd:string"/>
<xsd:element name="bank_transer" type="xsd:string"/>
</xsd:choice>
</xsd:group>
</xsd:schema>
A valid instance is
<payment>
<customer>K. Boehm</customer>
<address>Goethestr 5, Linz</address>
<credit_card>VISA1208549856745634/0508</credit_card>
</payment>
and another is
<payment>
<customer>K. Boehm</customer>
<address>Goethestr 5, Linz</address>
<bank_transfer>IBAN57486739485934590000/34660</bank_transfer>
</payment>
So far the elements were either simple or complex with only child element content. One can also declare an element with mixed content using the mixed="true" attribute setting in the declaration. For instance, the following declaration allows the element note to contain arbitrarily many emp elements with character data in between.
<xsd:element name="note">
<xsd:complexType mixed="true">
<xsd:sequence>
<xsd:element name="emp" type="xsd:string"
minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
The following element is then an instance of this schema.
<note>The extra lecture will be <emp>tomorrow at 16:00</emp>
in the lecture hall <emp>HS 13</emp>.</note>
The nillable attribute can be used to declare that an instance of the element may carry the special attribute nil (from the schema instance namespace), which marks explicitly that the instance has no meaningful value (similar to NULL values in RDBMSs). The following example declares that the date element is nillable.
<xsd:element name="date" type="xsd:date" nillable="true"/>
Then the following instance is valid.
<date xsi:nil="true"></date>
Attributes are declared similarly to child elements, with the difference that attributes always have simple content. In an attribute declaration the name and type attributes prescribe the corresponding qualities of the attribute being declared. The use attribute controls whether the declared attribute must appear, and the default and fixed attributes can set a default or a fixed value for it. The use attribute can have the following values:
required The declared attribute must be given explicitly on every occurrence of its element in instances of the schema.
optional The declared attribute may be omitted; this is the default. If the declaration carries a default attribute, an absent attribute is taken to have the given value; if it carries a fixed attribute, the attribute may still be omitted, but if it appears, its value has to be the one given.
prohibited The declared attribute must not appear on any occurrence of the corresponding element.
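A few attribute declaration sketches illustrate these options (the attribute names are hypothetical):

<xsd:attribute name="id" type="xsd:ID" use="required"/>
<xsd:attribute name="unit" type="xsd:string" default="piece"/>
<xsd:attribute name="version" type="xsd:string" fixed="1.0"/>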
Attributes can be defined only inside complex element declarations, so if
we want to define a simple element with attributes, we have to define it as a
complex element with simple content using the simpleContent element. The
attribute declarations always come after the child element declarations.
Example 3.7. In the first declaration of Example 3.6 we could define the
currency for the price tag as an attribute as follows.
<xsd:element name="product">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="name" type="xsd:string"/>
<xsd:element name="id" type="xsd:unsignedInt"/>
<xsd:element name="price">
<xsd:complexType>
<xsd:simpleContent>
<xsd:extension base="xsd:float">
<xsd:attribute name="currency" type="xsd:string"
use="required"/>
</xsd:extension>
</xsd:simpleContent>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
An example instance of this element can be taken from Example 3.6 and now
we can (actually we have to) add the currency as an attribute of price.
<product>
<name>Disney Mouse Pad</name>
<id>231069948</id>
<price currency="EUR">3.49</price>
</product>
Or, using Example 3.6, we include a time stamp attribute for the product_extension element.
<xsd:element name="product_extension">
<xsd:complexType>
<xsd:all>
<xsd:element name="made_in" type="xsd:string"/>
<xsd:element name="oem" type="xsd:boolean"/>
</xsd:all>
<xsd:attribute name="time_stamp" type="xsd:dateTime"
use="optional"/>
</xsd:complexType>
</xsd:element>
Reusing the corresponding instance from Example 3.6 we can have the following
elements.
<product_extension>
<oem>true</oem>
<made_in>Japan</made_in>
</product_extension>
<product_extension time_stamp="2003-05-30T13:20:00.000+01:00">
<oem>false</oem>
<made_in>Japan</made_in>
</product_extension>
In local declarations we can refer to global ones via the ref attribute. This simply means that the declaration of the referenced global element is used in place of the local one. This makes schema declarations more readable by modularizing the definitions.
For instance, we could have taken out the definition of the price element in
the first declaration of Example 3.7 to the global level and then in the declaration
of product we could have written the following.
...
<xsd:element name="name" type="xsd:string"/>
<xsd:element name="id" type="xsd:unsignedInt"/>
<xsd:element ref="price"/>
...
Alternatively, we can define a new type on the global level, say price_type, and use it as we use the standard types in element declarations. In this case, if a target namespace is set, price_type becomes a member of the target namespace, so we have to find means to reach it when we refer to it in the element declaration. This can be done by defining an alias (prefix) bound to the same namespace as the target namespace. The whole schema looks like:
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://big.company.com/product-info"
xmlns:p="http://big.company.com/product-info">
<xsd:element name="product">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="name" type="xsd:string"/>
<xsd:element name="id" type="xsd:unsignedInt"/>
<xsd:element name="price" type="p:price_type"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:complexType name="price_type">
<xsd:simpleContent>
<xsd:extension base="xsd:float">
<xsd:attribute name="currency" type="xsd:string"
use="required"/>
</xsd:extension>
</xsd:simpleContent>
</xsd:complexType>
</xsd:schema>
Here is an instance of this schema.
<p:product xmlns:p="http://big.company.com/product-info">
<name>Disney Mouse Pad</name>
<id>231069948</id>
<price currency="EUR">3.49</price>
</p:product>
Think over why certain elements are qualified and others are not.
XML Schema supports modularity by enabling a schema to include other
schemas using the include element.
<xsd:include schemaLocation="http://big.company.com/schemas/s1.xsd"/>
This will include the schema definitions in the file locatable by the given URI
into the current schema. It is required that the target namespace of the included
and including schemas be the same.
It is also possible to use named types from other schemas having target
namespaces other than the target namespace of the current schema. In this
case we have to assign an alias to the target namespace of the imported schema
and use the import element.
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://big.company.com/product-info"
xmlns:html="http://www.w3.org/1999/xhtml"
xmlns:e="http://big.company.com/employee-info">
<xsd:import namespace="http://www.w3.org/1999/xhtml"/>
<xsd:import namespace="http://big.company.com/employee-info"
schemaLocation="http://big.company.com/schemas/e.xsd"/>
<!-- Rest of the schema. -->
</xsd:schema>
The declarations of the imported schemas are accessible via the aliases of the corresponding namespaces.
XML Schema provides more than forty built-in data types, of which we have already seen a few above (the complete list is available under the URL at the beginning of this section). We have also created new types anonymously, so that they could be used only at the place of their definition. Moreover, we have seen how XML Schema allows us to introduce new named types that can be used like the built-in ones after their definition.
Example 3.8. Here is another example of using a named type. From the first declaration of Example 3.6 we can extract the complex type definition as follows.
<xsd:complexType name="product_type">
<xsd:sequence>
<xsd:element name="name" type="xsd:string"/>
<xsd:element name="id" type="xsd:unsignedInt"/>
<xsd:element name="price" type="xsd:float"/>
</xsd:sequence>
</xsd:complexType>
And then the declaration of the product element could simply be
<xsd:element name="product" type="product_type"/>
The main mechanisms for defining new data types are restriction and extension. Built-in data types provide one or more facets through which one can create new types by restriction. An example is the type of strings that match a given regular expression. The following declaration restricts the strings of the new type to the format nnn-nn-nnnn, where each n stands for a digit (e.g. 323-90-0982).
<xsd:simpleType name="serial_number">
<xsd:restriction base="xsd:string">
<xsd:pattern value="\d{3}-\d{2}-\d{4}"/>
</xsd:restriction>
</xsd:simpleType>
The complete list of facets for each built-in type can be found at http://www.w3.org/TR/xmlschema-0/.
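For instance, numeric range and enumeration facets can be used in the same way (a sketch; the type names are illustrative):

<xsd:simpleType name="percentage">
<xsd:restriction base="xsd:integer">
<xsd:minInclusive value="0"/>
<xsd:maxInclusive value="100"/>
</xsd:restriction>
</xsd:simpleType>

<xsd:simpleType name="currency_code">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="EUR"/>
<xsd:enumeration value="USD"/>
</xsd:restriction>
</xsd:simpleType>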
For complex data types one can also use extension to add new elements and attributes to an already existing type. This example extends the product_type declaration from Example 3.8 with the elements of product_extension.
<xsd:complexType name="extended_product_type">
<xsd:complexContent>
<xsd:extension base="product_type">
<xsd:sequence>
<xsd:element name="made_in" type="xsd:string"/>
<xsd:element name="oem" type="xsd:boolean"/>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
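Assuming an element declaration such as <xsd:element name="extended_product" type="extended_product_type"/> (the element name is hypothetical), a valid instance contains the elements of the base type followed by the new ones:

<extended_product>
<name>Disney Mouse Pad</name>
<id>231069948</id>
<price>3.49</price>
<made_in>Japan</made_in>
<oem>true</oem>
</extended_product>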
3.3.3 Exercises
Exercise 3.3.1. Consider the following XML Schema.
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="member">
<xsd:complexType>
<xsd:all>
<xsd:element name="bid">
<xsd:complexType>
<xsd:simpleContent>
<xsd:extension base="xsd:integer">
<xsd:attribute name="timestamp" type="xsd:dateTime"
use="required"/>
</xsd:extension>
</xsd:simpleContent>
</xsd:complexType>
</xsd:element>
<xsd:element name="name" type="xsd:string" minOccurs="1"
maxOccurs="1"/>
</xsd:all>
<xsd:attribute name="id" type="xsd:integer" use="required"/>
</xsd:complexType>
</xsd:element>
</xsd:schema>
Find the points at which the following XML document violates the prescriptions
of the previous schema.
<member id="A2342" item-id="3489873">
<name>Thomas Keller</name>
<bid timestamp="2003-12-30T18:20:35" currency="EUR">34</bid>
<bid timestamp="2003-12-30T18:28:32" currency="EUR">36.5</bid>
</member>
Exercise 3.3.2. Modify the XML schema of Exercise 3.3.1 in a way that the
example XML document becomes valid with respect to the modified schema.
Exercise 3.3.3. Consider the following XML document, with the intended restrictions that the id attribute of the entry element is mandatory, the author, title and year elements have to be present, and there can be more than one author of a publication. No namespaces are used.
<bib>
<entry id="Cox:00">
<author>
<fname>David</fname>
<lname>Cox</lname>
</author>
<title>Update on Toric Geometry</title>
<year>2000</year>
</entry>
<entry id="Becker_Weispfenning:93">
<author>
<fname>Thomas</fname>
<lname>Becker</lname>
</author>
<author>
<fname>Volker</fname>
<lname>Weispfenning</lname>
</author>
<title>Groebner bases</title>
<publisher>Springer</publisher>
<year>1993</year>
</entry>
</bib>
Write an XML Schema for documents of this kind that validates, in particular, this example.
Exercise 3.3.4. Consider the following XML Schema.
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="mdb">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="movie" maxOccurs="unbounded">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="title" type="xsd:string"/>
<xsd:element name="subtitle" type="xsd:string" minOccurs="0"/>
<xsd:element name="director" type="xsd:string"/>
<xsd:element name="cast">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="role" maxOccurs="unbounded">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="name" type="xsd:string"/>
<xsd:element name="actor" type="xsd:string" minOccurs="0"/>
<xsd:element name="voice" type="xsd:string" minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>
Invent an XML document which is valid with respect to this schema.
3.4 XPath
XPath is a part of the XSL (eXtensible Stylesheet Language) family. Its primary purpose is to provide common syntax and functionality to address parts of XML documents. Besides this, it can also be used for basic manipulation of strings, numbers and booleans. XPath operates on the logical structure of an XML document and addresses its parts with a syntax that resembles the path constructions in URLs.
XPath models an XML document as a tree of nodes, which can be imagined
as an extension of the tree of elements of the XML document. In this extended
tree not only the XML elements, but also their attributes, the character data
they may contain, the namespaces and the processing instructions appear as
nodes.
Besides the data model, another important concept of the language is the XPath expression. With expressions it is possible to compute objects (like strings, numbers, sets of nodes) from the data of XML documents. The advanced expressions of XPath rely on a library of functions, which is general enough to allow simple text and numerical data manipulation.
Expressions are always evaluated in some context, which is described by the
context node, context size and context position (think of a list of nodes on which
an expression is to be evaluated; the size of the list is the context size and when
the expression is evaluated on the ith element on this list, the context position
is i), variable bindings and namespace declarations.
XPath is prominently used in XSLT to access and apply simple transformations to the data of XML documents, e.g. to convert them to HTML files which can then be displayed in web browsers. XPath expressions can appear as attribute values in XML documents; in this case the special characters must be substituted by entity references.
3.4.1 Data Model
The nodes in the tree of the XPath data model can be of the following types:
root, element, text, attribute, namespace, processing instruction, comment.
Each node represents a part of the text of the underlying XML document. The
nodes can be ordered by the document order, in which a node precedes another
if the first character that belongs to that node comes before the first character
of the other node in the XML document. The nodes in the data model do not share children; moreover, every node (except the root node) has exactly one parent node, which is an element node or the root node.
Remark 3.11. Please note that the root node is the first one in the document
order, and that an element and its attributes always precede the descendants of
the element. Namespace nodes precede the attribute nodes by definition.
Every node has an associated string value, which can be computed in the following ways. The string value of the root node is the concatenation of the string values of all its text node descendants in document order.
Every XML element has an element node in the XPath model. The string value of an element node is the concatenation of the string values of all its text node descendants in document order.
An attribute node in the XPath data model has as its parent the element node corresponding to the XML element in which the attribute is defined. However, the attribute node is not considered a child of its parent element. The string value of an attribute is simply its value as a string (after some normalization).
3.4.2 Location Paths
Location paths are special expressions for selecting a set of nodes, possibly but
not necessarily relative to the context node. A location path consists of location
steps composed together from left to right, separated by / (slashes). The first
location step selects a set of nodes relative to the context node. Then each
resulting node is used as a context node for the next location step, which again results in a set of nodes, and so on. If the first character of a location path is /,
the initial context node will be the root node. In this case the location path is
called absolute, otherwise it is relative.
Example 3.13. A first example could be the way one navigates in a hierarchically organized directory structure on a file system (the analogy is so natural that the syntax is even taken over in the abbreviated notation of path expressions). The following location path specification selects all the summary elements which are children of any element in the reports child of the parent node of the context node. The context node can be imagined as the current working directory if we use the file system analogy.
../reports/*/summary
On a UNIX file system this would correspond to selecting all the summary files
in any subdirectory of the reports directory of the parent directory.
Please note that there is still an important difference between file systems and XML documents: in a file system a directory cannot contain more than one subdirectory or file with the same name; an XML element, however, can have many children with the same name. So while the ../reports specification selects at most one subdirectory on a file system, as a location path for an XML document it selects every reports child of the parent of the context node.
Location steps have the following parts:
axis Specifies the (in-tree) relationship between the context node and the nodes selected by the location step.
node test Specifies the node type or name for the nodes selected by the location step (it is separated from the axis by ::).
predicates Further expressions with boolean values that refine the selected node set. This part can be omitted.
The following axes are available: child, descendant, parent, ancestor, following-sibling, preceding-sibling, following, preceding, attribute, namespace, self, descendant-or-self, ancestor-or-self. Most of the names are self-explanatory, e.g. the descendant axis contains the descendants of the context node (recall that attribute and namespace nodes are excluded), and the following-sibling axis contains all the following siblings of the context node (those with higher context positions). The attribute axis contains the attributes of the context node.
The following (resp. preceding) axis contains all nodes in the same XML document that come after (resp. before) the context node in the document order (obtained by a left-to-right depth-first search), excluding descendants (resp. ancestors) and attribute and namespace nodes. The following-sibling and preceding-sibling axes are empty for attribute and namespace nodes, and the attribute and namespace axes are empty unless the context node is an element. The self axis contains the context node itself.
Each axis has a principal node type. For the attribute axis the principal node type is attribute, for the namespace axis it is namespace, and for all other axes it is element. Only those nodes pass a node test which have the same principal node type as the axis preceding the node test. One can require exact matching of node names by adding the name after the axis specification, separated by ::. For example, child::x will select all the child nodes of the context node with name x. The node test denoted by * selects all nodes on the given axis that match the principal node type of the axis. For instance child::* selects all element children of the context node (because the principal node type of the child axis is element). The text() node test selects text nodes on the given axis.
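A few location paths in full syntax, evaluated against the email document of Section 3.1 (a sketch reusing the folder and email element names of that document):

/child::folder/child::email
(selects all email element children of the top-level folder element)

/child::folder/child::email/attribute::date
(selects the date attributes of these email elements)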
Example 3.14. Let us consider the example XML document with emails at the beginning of Section 3.1. Figure 3.1 illustrates a part of the node tree of the document (the parsing direction is clockwise/top-down instead of left to right). Namespace nodes are omitted to simplify the figure.
The absolute path
[Figure 3.1: part of the node tree of the email document, showing the document root, the xml declaration (processing instruction), the folder and email element nodes, the date attribute nodes, and the text nodes of the from, to and subject children.]
Abbreviation  Meaning
(none)        by default the child:: prefix can be omitted
*             stands for "for all"
@*            selects all attributes of the context node
@name         selects the name attribute of the context node
.             selects the context node
..            selects the parent of the context node
/             selects the root node
//            selects as descendant-or-self
Table 3.1: XPath abbreviations
Example 3.15. The following location path selects the from elements of the
email elements for which the date attribute is 21 Aug 2003. We use here the
abbreviated notation.
/folder/email[@date = "21 Aug 2003"]/from
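In unabbreviated syntax the same location path reads as follows.

/child::folder/child::email[attribute::date = "21 Aug 2003"]/child::from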
3.4.3 Expressions, Core Functions
The simplest expressions are numeric and string literals, variable references and function calls. Numbers are double precision IEEE floating point numbers, and string literals stand for themselves, with the possibility of using special encodings (like Unicode). Variables can be assigned using other XML technologies, for instance XSLT; they differ from variables in usual programming languages in that their value cannot be changed freely after the assignment. The value held by the variable x can be referenced by $x.
The basic arithmetic operations are available for numbers. But please note that the operators should usually be separated from names by whitespace. For example, x-y evaluates to the node set that contains the children of the context node with this name. On the other hand, x - y will take the first child of the context node with name x and the first one with name y, attempt to convert their string values to numbers, and evaluate to the difference of these numbers.
More complicated expressions are location paths (discussed in Section 3.4.2) and boolean expressions. The latter are expressions that evaluate to true or false. The basic boolean expressions contain the comparisons =, !=, >, <, <=, >= with the usual semantics for non-node-set operands (where strings can be converted to numerical types and numerical types to booleans as needed). If at least one of the operands is a node set, then a comparison expression is true if there exists at least one node in the node set whose string value (perhaps after the necessary conversion) satisfies the predicate. More complicated formulas can be formed from atomic logical expressions using the and and or connectives.
Function calls consist of a function name, which identifies a function in
the function library determined by the expression evaluation context, and a
parameter list whose elements can be converted to the type expected by the
function at the given position of the list.
The core function library provides a minimal set of function specifications
that has to be supported by any XPath implementation. Here we summarize
only a small subset of this library (first comes the type of the value the function
returns, then the name of the function):
Node set functions
number last() returns the context size from the expression evaluation context.
number position() returns the context position from the expression
evaluation context.
node-set id(object) if the argument is a node set, then the result is the union of the results of applying the id function to the string value of each node of the set. If the argument is of another type, it is converted to a string, tokenized by whitespace characters, and the function returns the node set containing those elements of the same document as the context node whose unique IDs are given in the list of tokens.
String functions
string string(object?) converts an object to a string. The default
argument is the node set in the context of the expression evaluation.
A node set is converted into a string by returning the string value of
the first element of the set in document order. The objects of type
number and boolean are converted using the usual semantics.
number string-length(string?) returns the number of characters
in the string, which is by default the string value of the context node.
Boolean functions
boolean boolean(object) converts an object to a boolean value. Numbers other than zero and NaN (Not a Number), nonempty node sets, and strings with positive length are converted to true.
boolean not(boolean) negation.
Number functions
number number(object?) converts an object to a number. Strings are converted to numbers by the usual semantics. The boolean false is converted to 0 and true to 1. A node set is first converted to a string and then to a number. The default argument is the node set in context.
number sum(node-set) returns the sum of the values obtained by
converting each element in the node set into a number.
number round(number) rounds the argument to the nearest integer.
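A few illustrative calls against the email document of Section 3.1 (a sketch; the results assume that document):

/folder/email[position() = last()]
(selects the last email child of the folder element)

string-length(string(/folder/email[1]/subject))
(evaluates to 7, the length of the string "Meeting")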
The full specification of the functions in the library can be found in [10].
3.5 XSL Transformations
The XSL Transformation language, or XSLT for short, is another member of the
XSL family, whose primary purpose is to make possible the transformation of
XML documents into other XML documents or even into non-XML text data
(e.g. HTML documents). There are three components of the transformation
process. One is an XSLT stylesheet, which is a valid XML document that adheres to the grammar of the XSLT specification. The other is the source XML
document and the third is the resulting document, which is generated in the
transformation process by applying the stylesheet on the source document. The
structure of the output document can be completely different from the structure
of the input one.
The XSLT document contains a collection of templates that prescribe patterns and transformation rules for the elements of an XML document that match the patterns. In the pattern specifications the XPath language plays a prominent role. The XSLT language also provides additional functionality, for instance named templates (whose usage is analogous to that of functions in programming languages), if-then and for-each structures, sorting of elements in node sets, etc.
In practice XSLT is very often used to transform XML documents into HTML ones that can be displayed by web browsers. Some browsers are actually XSLT-capable. In such cases an XML document can be processed in the following workflow.
1. The server sends the requested XML document to the browser.
2. The browser parses the document and requests from the server the XSLT stylesheet the document references.
3. The server provides the XSLT document.
4. The browser applies the stylesheet to the XML document, obtaining an HTML document. This can still reference a CSS (Cascading Style Sheet) that contains further formatting information.
5. If necessary, the browser requests the CSS from the server and applies it to the HTML document.
6. Finally the browser displays the result.
Another alternative is that the XSLT stylesheet is applied to the XML document on the server side. This is necessary to support browsers without XSLT support.
Example 3.16. Perhaps it is best to start with an example which shows how a simple XML document can be transformed into an HTML one via a straightforward XSLT stylesheet. Let the XML file contain the following.
<?xml version="1.0"?>
<info>
<name>Josef Keller</name>
<email>jkeller@jku.at</email>
</info>
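A minimal stylesheet for this transformation might look as follows (a sketch; the HTML layout chosen here is illustrative).

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="info">
<html><body>
<h1><xsl:value-of select="name/text()"/></h1>
<p>Email: <xsl:value-of select="email/text()"/></p>
</body></html>
</xsl:template>
</xsl:stylesheet>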
3.5.1 Templates
Example 3.17. Let us consider again the XML document containing emails at
the beginning of Section 3.1. We want to transform this document into HTML
format to display it in a nicer form. Here is a stylesheet for that.
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="folder">
<html><body>
<h1>Emails</h1>
<xsl:apply-templates select="email"/>
</body></html>
</xsl:template>
<xsl:template match="email">
<p>
From: <xsl:value-of select="from/text()"/><br/>
To: <xsl:value-of select="to/text()"/><br/>
Date: <xsl:value-of select="@date"/><br/>
Subject: <xsl:value-of select="subject/text()"/><br/><br/>
<pre><xsl:apply-templates select="text()"/></pre>
</p>
</xsl:template>
</xsl:stylesheet>
When the stylesheet is applied to the XML document, the root node is matched by the default template, which just requests template application for each of its children. For the processing instruction the default template applies, which produces no output. Then the folder child comes, and the first template matches it. Please note that the folder pattern selects all folder element children of the context node, which is the root in this case.
With the first template the frame of the HTML document is generated, and template application is requested for each email child (of element type) of the folder node. For the template search the context node will now be the folder node (and not the root).
The email elements of the folder element are processed sequentially, applying the second template of the stylesheet. This just forms a heading including the text data from the child elements named from, to and subject and the value of the date attribute. Finally it also appends the text data of the email node itself, which is just the body of the message. The result (modulo whitespace) is the following.
<html><body>
<h1>Emails</h1>
<p>
From: robert@company.com<br/>
To: oliver@company.com<br/>
Date: 20 Aug 2003<br/>
Subject: Meeting<br/><br/><pre>
Could we meet this week to discuss the interface problem
in the NTL project? --Rob
</pre></p>
<p>
From: oliver@company.com<br/>
To: rob@company.com<br/>
Date: 21 Aug 2003<br/>
Subject: Re: Meeting<br/><br/><pre>
On 20 Aug 2003 rob@company.com wrote
> Could we meet today to discuss the interface problem
> in the NTL project? --Rob
OK. What about today at 16:00?
</pre></p>
</body></html>
Templates can be applied recursively in a natural way. For instance the
following template will apply itself as long as there are item children of the
context node. Please note that the depth of the recursion is always finite.
<xsl:template match="item">
<div style="margin-left:20px">
<xsl:value-of select="text()"/>
<xsl:apply-templates select="item"/>
</div>
</xsl:template>
3.5.2 Additional features
One might wish to include attribute values in the output document that are generated from data of the source document. This would not be possible with the standard constructions discussed so far without violating XML rules in the stylesheet. To solve this problem, two constructions are available in XSLT. The first is the xsl:attribute instruction, which inserts the named attribute with the retrieved value into the element created by the nearest element-creating ancestor of the xsl:attribute element in the stylesheet. The second construction is more compact, using the { and } delimiters to include the XPath expression right in the place its value goes to.
Example 3.18. Assume that one wants to define the text color in the HTML
output according to the value of the color attribute of the message elements in
the source XML document. The corresponding template can be the following.
<xsl:template match="message">
<font>
<xsl:attribute name="color">
<xsl:value-of select="@color"/>
</xsl:attribute>
<xsl:value-of select="text()"/>
</font>
</xsl:template>
Using the more compact notation the same can be achieved by the following
template.
<xsl:template match="message">
<font color="{@color}">
<xsl:value-of select="text()"/>
</font>
</xsl:template>
Elements can be generated not only by including them literally in the stylesheet but also via the xsl:element stylesheet element. It has a mandatory attribute name that specifies the name of the element to be created, and in it we can use the dynamic attribute value expansion discussed above. That is, we can also create elements in the output document dynamically.
Let us consider the following XML document.
<?xml version="1.0" encoding="utf-8"?>
<system>
<stamp>12-03-02 23:13</stamp>
<msgs>
<msg type="info">System started</msg>
<msg type="info">Logging in user maryk</msg>
<msg type="warn">User bobm not found</msg>
</msgs>
</system>
Let the goal be to create output XML documents from source documents of the type above, so that the structure is unchanged, except that the msg elements are replaced by elements whose names are given by the type attributes of the source document. An XSLT stylesheet doing this is the following.
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="system">
<system>
<xsl:apply-templates select="stamp"/>
<xsl:apply-templates select="msgs"/>
</system>
</xsl:template>
<xsl:template match="stamp">
<stamp><xsl:value-of select="text()"/></stamp>
</xsl:template>
<xsl:template match="msgs">
<msgs>
<xsl:apply-templates select="msg"/>
</msgs>
</xsl:template>
<xsl:template match="msg">
<xsl:element name="{@type}">
<xsl:value-of select="."/>
</xsl:element>
</xsl:template>
</xsl:stylesheet>
The transformation of the above example document with this stylesheet looks
as follows.
<system>
<stamp>12-03-02 23:13</stamp>
<msgs>
<info>System started</info>
<info>Logging in user maryk</info>
<warn>User bobm not found</warn>
</msgs>
</system>
<td><xsl:value-of select="@isbn"/></td>
</tr>
</xsl:template>
Here is the three-color version too, just to demonstrate the xsl:choose instruction.
<xsl:template match="book">
<tr>
<xsl:choose>
<xsl:when test="position() mod 3 = 0">
<xsl:attribute name="bgcolor">grey</xsl:attribute>
</xsl:when>
<xsl:when test="position() mod 3 = 1">
<xsl:attribute name="bgcolor">blue</xsl:attribute>
</xsl:when>
<xsl:otherwise>
<xsl:attribute name="bgcolor">white</xsl:attribute>
</xsl:otherwise>
</xsl:choose>
<td><xsl:value-of select="author/text()"/></td>
<td><xsl:value-of select="title/text()"/></td>
<td><xsl:value-of select="@isbn"/></td>
</tr>
</xsl:template>
We mention also the possibility of sorting the node set matched by the pattern of a template or selected by a for-each instruction. The sort instruction has several attributes; the node set can be defined via the select attribute, and the ordering is done on the string values of the nodes in the node set (unless the data-type attribute prescribes something else). Ascending or descending sorting can be specified via the order attribute. If an xsl:apply-templates or xsl:for-each element has more than one sort child, then in the order of appearance the first defines the primary sort key, the second the secondary sort key, and so on. The template will process the elements of the node set in the sorted order.
Example 3.21. Assume that we have an XML document with the following
structure.
<products>
<product>
<id>ASN32255</id>
<price currency="EUR">12.5</price>
</product>
<product>
<id>ASN54345</id>
<price currency="EUR">3.67</price>
</product>
<product>
<id>ASN31345</id>
<price currency="EUR">24.22</price>
</product>
</products>
We would like to produce an HTML table from these products with the prices
being in ascending order. Here is an XSLT stylesheet to do that.
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="products">
<html><body>
<h1>Product list</h1>
<table>
<tr><th>ID</th><th>Price</th></tr>
<xsl:for-each select="product">
<xsl:sort select="price" order="ascending" data-type="number"/>
<xsl:sort select="id" order="ascending"/>
<tr><td><xsl:value-of select="id"/></td>
<td align="right">
<xsl:value-of select="price/text()"/>
<xsl:value-of select="price/@currency"/>
</td></tr>
</xsl:for-each>
</table>
</body></html>
</xsl:template>
</xsl:stylesheet>
Products will be sorted by their price, and equally priced products are sorted by their ID. Please note also that the first xsl:value-of instruction selects the id child of a product element, while the second selects the text node of the price child. The second is clearly what we want, and the first is also correct, because the string value of the id element is just its text data.
Example 3.22. Finally, this simple example shows how one can use the XSLT facilities to establish references between elements of the source XML document. Let us assume that the source XML file contains letter drafts which should be formatted and displayed. Let the source document store dates by numbering the months. We want to convert the dates to the format month-name dd, yyyy. Here is a sketch of the source XML file.
<root>
<letter>
<date y="2003" m="08" d="28"/>
<from>Sender</from>
<to>Recipient</to>
<address>Address of recipient</address>
Letter body.
</letter>
<months>
<month id="01">January</month>
<!-- And so on -->
<month id="12">December</month>
</months>
</root>
The last element declares which month number corresponds to which month name (at this point of the XML introduction provided by these notes we cannot establish references between elements of different XML documents, so we have to add the months element as a child of the source document). The XSLT stylesheet could be the following.
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="root">
<html><body>
<xsl:apply-templates select="letter"/>
</body></html>
</xsl:template>
<xsl:template match="letter">
<p>
To: <xsl:value-of select="to"/><br/>
<xsl:value-of select="address"/><br/><br/>
Dear <xsl:value-of select="to"/>,<br/><br/>
<xsl:apply-templates select="text()"/><br/><br/>
<xsl:value-of select="concat(
/root/months/month[@id=current()/date/@m],
, date/@d, ,, date/@y)"/><br/>
Best regards,<br/>
<xsl:value-of select="from"/>
</p>
</xsl:template>
</xsl:stylesheet>
In the date retrieval we use the concat function to produce the concatenation of the resulting string values. The current function retrieves the current context node outside the square brackets of the location path, so that current()/date/@m refers to the m attribute of the date element of the letter element currently processed (and not to an attribute of some descendant of the months element).
3.5.3 Exercises
3.6 XML Query
The purpose of XML Query (XQuery for short) is to provide a language for
extracting data from XML documents. The queries operate on single documents or fixed sets of documents and can select whole documents or subtrees of
documents that match conditions on content and structure.
The query language is functional (but it also includes universal and existential quantifiers), supports the simple and complex data types defined in XML Schema, defines aggregate functions, handles null values, and is fully compliant with XML namespaces.
Just as in XSLT, the expressions play the central role in XQuery. The
value of an expression is always a sequence, which is a list of zero or more
atomic values (as in XML Schema) or nodes (of the given node types, as in
XPath). The xs: prefix is assumed to be bound to the URI http://www.w3.org/2001/XMLSchema, the fn: prefix to the URI http://www.w3.org/2003/11/xpath-functions and the xdt: prefix to the URI http://www.w3.org/2003/11/xpath-datatypes.
3.6.1 Data Model and Types
The data model of XQuery is an extended version of the data model of XPath (also referred to as the XPath 2.0 data model). The root node type is called the document-node type (identified by the unique URI of the document, so that collections of document nodes can also be treated). The data model can contain various extra information for each node, e.g. the parent, the children, the in-scope namespaces, an indication whether the node is nilled, the type, etc. This information can be obtained via accessors (interface specifications for functions which should be included in any implementation). Here we briefly discuss only the type annotation of nodes in the data model.
Type information can be assigned to element or attribute nodes either using
the post schema validation infoset (PSVI) which, as its name shows, is obtained
by validating the document against schema(s), or, if the PSVI cannot provide
type information for a node, just using the xdt:untypedAny type for element
nodes and the xdt:untypedAtomic type for attribute nodes (see below). For
anonymous types (having no names assigned in the schema) the parser assigns
internal identifiers.
The typed value of a node can be extracted by applying the fn:data function to the node, which corresponds to the dm:typed-value accessor (the string value, a notion recalled from the discussion of XPath, can be obtained by applying the fn:string function). The typed value of a node is computed in the following way:
For text, document, and namespace nodes, the typed value of the node is
the same as its string value, as an instance of the type xdt:untypedAtomic.
The typed value of a comment or processing instruction node is the same
as its string value, as an instance of the type xs:string.
The typed value of an attribute with type annotation xdt:untypedAtomic is its string value as an instance of this type. With any other type annotation it is derived from the string value in a way consistent with the schema validation.
For element nodes the typed value can be computed in the following way:
If the element has the type xdt:untypedAny or a complex type with mixed content, the typed value is the string value of the node as an instance of xdt:untypedAtomic.
If the element has a simple type or a complex type with simple content,
the typed value is a sequence of zero or more atomic values derived from
the string value of the node and its type in a way that is consistent with
the schema validation.
If the node has a complex type with empty content, the typed value is the
empty sequence.
If the node has a complex type with element only complex content, its
typed value is undefined.
XQuery is a strongly typed language. It uses the XML Schema data types,
which have to be qualified with the predefined xs prefix (e.g. xs:integer), and
the following additional data types (with predefined xdt prefix):
xdt:anyAtomicType an abstract data type of all atomic values,
xdt:untypedAny a concrete data type for values of element nodes not having
any specific dynamic type assigned to them,
xdt:untypedAtomic a concrete data type for atomic values not having more
specific type assigned to them,
xdt:dayTimeDuration and xdt:yearMonthDuration both are concrete subtypes of xs:duration.
Expressions are evaluated in two phases, called static analysis and dynamic evaluation.
In the static analysis phase, which depends only on the expression and statically available information (information available before the actual evaluation of the expression, e.g. in-scope namespaces, defined variables, available functions), a concrete or an abstract data type may be assigned to the expression in advance. This phase is also useful for quickly detecting static errors in the expression.
In the dynamic evaluation phase, which comes after the static analysis, the value of the expression is computed with respect to the dynamic context (analogous to the concept of context in XPath, containing additional information on the available documents and node collections, and the current date and time). The type determined in this phase is called the dynamic type of the expression.
Since expressions always have sequence values, to simplify the notation we do not distinguish between an atomic value or a node and the singleton sequence containing it. However, when a sequence of values is longer than one, we have to talk about sequence types. The main difference between types and sequence types is the possibility of prescribing occurrence constraints on the appearance of the given type in the sequence. The character ? denotes zero or one, * denotes zero or more, and + denotes one or more occurrences. The node kinds are denoted by their names with a () suffix, node() standing for a node of any kind. For the detailed specification we refer to [11]. A few examples are:
xs:date refers to the built-in Schema type date,
attribute()? refers to an optional attribute,
element() refers to any element,
element(po:shipto, po:address) refers to an element that has the qualified name po:shipto and has the type annotation po:address (or a subtype of that type),
node()* refers to a sequence of zero or more nodes of any type.
Type matching is also a part of the expression evaluation process, i.e. a sequence of values with a given sequence type has to be checked against an expected sequence type (e.g. the input argument list of a function). The type matching yields true if the given sequence type is known and matches the known expected sequence type, or if it can be derived by extension or restriction from the expected type. In other cases the result is false, except if the expected type is unknown or it is not possible to determine whether the given type can be derived from the expected one, in which case a type error is raised.
3.6.2 Expressions
Function calls consist of the (possibly qualified) name of the function followed by a parenthesized list of arguments. The core function library of XPath is available, but one can also define new functions in the query prolog. The input parameters undergo type matching, in which certain values may be split into sequences of atomic values (atomization) and values of type xdt:untypedAtomic are cast to the required types where possible. If the type matching is successful, the formal parameters of the function are bound to the corresponding input values; however, if an input value is of a subtype of the prescribed type of the formal parameter, it keeps its more specific type.
Path expressions are basically known to us from XPath (see location paths
in Section 3.4).
Sequence expressions can be formed via the comma operator applied to (possibly sequence) expressions. The result is always a flat sequence: the elements of nested sequences are shifted up to the outermost level. For instance:
(1,(2,3),(),(3,4)) results in (1,2,3,3,4),
(book) results in the sequence of all book children of the context node.
Sequences can also be combined by the operators union, intersect and except, which perform the corresponding set-theoretic operations (eliminating duplicates from the resulting sequence).
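For instance, against a hypothetical bib document:

//book/author union //book/editor
(: all author and editor element nodes of book elements, without duplicates :)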
Arithmetic expressions can be built from atomic values that can be cast to an accepted input type of the applied operator, which can be: addition +, subtraction -, multiplication *, division div (integer division idiv), and modulus mod.
Comparison expressions can be of three kinds: value comparisons (eq, ne, lt, le, gt, ge), general comparisons (=, !=, <, <=, >, >=) and node comparisons (is, <<, >>). Value comparisons require atomic operands, e.g. book/author eq "Kennedy" is true if and only if the book element node has exactly one author child and its typed value is "Kennedy" as an instance of xs:string or xdt:untypedAtomic.
General comparisons can also be applied when the atomization of the operands produces longer sequences. The result of the comparison is true if and only if there is a pair of atomic values from the two operands which are in the required relationship (applying the corresponding value comparison to the atomic values, e.g. eq for =).
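For instance, assuming the variable $b is bound to a single book element that may have several author children:

$b/author = "Kennedy" (: true if at least one author child equals "Kennedy" :)
$b/author eq "Kennedy" (: raises a type error if there is more than one author child :)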
In node comparisons the operands must be single nodes or the empty sequence. The is operator tests whether the operands are the same node, in the sense of having the same identity (recall that in an XML document an element can have many children with the same name and the same content, each having its own identity). The << and >> operators compare nodes with respect to the document order. For instance, /book[@isbn="0-387-94269-6"] is /book[@id="L23432"] is true if and only if the two path expressions evaluate to the same single node in the document.
Logical expressions can be formed from subexpressions that evaluate to
boolean type using the and and or operators.
Constructors are provided for every node kind to create XML structures in a query. There are direct and computed constructors; the direct constructors resemble the corresponding concept in XSLT, and it is possible to embed dynamically computed values by enclosing the defining expressions in {}. For instance, the following direct element constructor creates a new node with its own identity in the result of the evaluation.
<p id="{$b/@isbn}">Book:<br/>
{string($b/title)}, {string($b/author)} [{string($b/@isbn)}]</p>
If the variable b is bound to the node
<book isbn="0-812909191-5">
<title>Codenotes for XML</title>
<author>G. Brill</author>
</book>
the result is the following.
<p id="0-812909191-5">Book:<br/>
Codenotes for XML, G. Brill [0-812909191-5]</p>
The curly braces can be included in the result either by doubling them ({{ and }} represent { and } respectively) or by using the corresponding character references (&#x7B; and &#x7D;).
The direct element constructor automatically validates the created element against the available in-scope schemas (depending on the validation mode set, this may result in an error, or in assigning the type xdt:untypedAny to the element and its children if the validation fails).
Computed constructors start with a keyword identifying the node kind to be created (element, document, text, processing-instruction, comment, namespace), followed by a name (or an expression in braces resulting in a name, which is possible only for non-namespace nodes) for element, attribute, namespace or processing instruction nodes. Finally, the content expression defines the content of the node. Computed constructors can be nested, just like the direct ones. The previous example with computed constructors looks as follows.
element p {
attribute id {$b/@isbn},
"Book:", element br {},
string($b/title), ",", string($b/author), ", [",
string($b/@isbn), "]"
}
An important use of computed constructors is to be able to assign the name of
the created element dynamically.
FLWOR expressions support iteration and binding variables to intermediate
results. The acronym was created from the parts of such expressions: for, let,
where, order by, return. The for and let clauses in a FLWOR expression
generate a sequence of tuples of bound variables, called the tuple stream. The
where clause serves to filter the tuple stream, retaining some tuples and discarding others. The order by clause imposes an ordering on the tuple stream.
The return clause constructs the result of the FLWOR expression.
The following example (from [11]) of a FLWOR expression includes all of the
possible clauses. The for clause iterates over all the departments in an input
document, binding the variable $d to each department number in turn. For
each binding of $d, the let clause binds variable $e to all the employees in the
given department, selected from another input document. The result of the for
and let clauses is a tuple stream in which each tuple contains a pair of bindings
for $d and $e ($d is bound to a department number and $e is bound to a set
of employees in that department). The where clause filters the tuple stream
by keeping only those binding-pairs that represent departments having at least
ten employees. The order by clause orders the surviving tuples in descending
order by the average salary of the employees in the department. The return
clause constructs a new big-dept element for each surviving tuple, containing
the department number, headcount, and average salary.
for $d in fn:doc("depts.xml")//deptno
let $e := fn:doc("emps.xml")//emp[deptno = $d]
where fn:count($e) >= 10
order by fn:avg($e/salary) descending
return
<big-dept>
{$d,
<headcount>{fn:count($e)}</headcount>,
<avgsal>{fn:avg($e/salary)}</avgsal>}
</big-dept>
The for clause can contain more than one variable; the constructed tuple stream then contains variable bindings for each combination of values of the variables from the Cartesian product of the sequences over which the variables range. The let clause can also contain more than one variable; however, it binds each variable to its result without iteration. The following example illustrates the difference:
for $i in (<x/>,<y/>,<z/>)
return <a>{$i}</a>
results in
<a><x/></a><a><y/></a><a><z/></a>
while
let $i := (<x/>,<y/>,<z/>)
return <a>{$i}</a>
results in
<a><x/><y/><z/></a>
Each variable bound in a for clause can have an associated positional variable that iterates over the integers from 1 on as the variable iterates over the sequence. The positional variable comes after the ordinary one, separated by the at keyword. For instance
for $author at $i in ("G. Brill", "C. J. Date"),
$book at $j in ("Codenotes for XML", "Database Systems")
generates the following tuple stream.
$i = 1, $author = "G. Brill", $j = 1, $book = "Codenotes for XML"
$i = 1, $author = "G. Brill", $j = 2, $book = "Database Systems"
$i = 2, $author = "C. J. Date", $j = 1, $book = "Codenotes for XML"
$i = 2, $author = "C. J. Date", $j = 2, $book = "Database Systems"
The where clause just contains an expression which is evaluated for each tuple in the tuple stream and which should provide a boolean value. The tuples for which the result is false are filtered out. The return clause is then evaluated for each remaining tuple of the filtered tuple stream, taken in the order prescribed by the order by clause (if it is omitted, the order is that of the tuple stream).
Conditional expressions are analogous to the well-known if-then-else constructions in imperative programming languages. An example (from [12]) follows.
<bib>
{ for $b in doc("http://bstore1.example.com/bib.xml")//book
where fn:count($b/author) > 0
return
<book>
{ $b/title }
{ for $a in $b/author[position()<=2] return $a }
{ if (fn:count($b/author) > 2) then <et-al/> else () }
</book> }
</bib>
Quantified expressions support the universal (every) and the existential (some) quantifiers. The quantifier keyword is followed by possibly several in-clauses that bind sequences to variables, from which a tuple stream is formed as for FLWOR expressions. The final part of the expression is a test clause, separated by the satisfies keyword from the rest of the expression, which is evaluated for each tuple in the stream. The value of an existentially quantified (some) expression is true if at least one evaluation of the test expression on the tuples of the stream is true (the empty stream yields false). The every expression is true if every evaluation of the test expression evaluates to true (the empty stream yields true).
Examples (from [11]): this expression is true if every part element has a
discounted attribute (regardless of the values of these attributes).
every $part in //part satisfies $part/@discounted
This expression is true if at least one employee element satisfies the given comparison expression.
some $emp in //employee satisfies ($emp/bonus > 0.25 * $emp/salary)
3.6.3 Query Prolog
The prolog is a semicolon-separated series of declarations (of variables, namespaces, functions, etc.) and imports (of schemas and modules) that create the environment for query processing.
For instance, the following declaration (from [11]) defines a function that computes a summary, grouped by department, of the employees passed to it.
declare function local:summary($emps as element(employee)*) as element(dept)*
{
for $d in fn:distinct-values($emps/deptno)
let $e := $emps[deptno = $d]
return
<dept>
<deptno>{$d}</deptno>
<headcount> {fn:count($e)} </headcount>
<payroll> {fn:sum($e/salary)} </payroll>
</dept>
};
This application of the previous function computes a summary of employees in
Denver.
local:summary(fn:doc("acme_corp.xml")//employee[location = "Denver"])
3.6.4 Exercises
Exercise 3.6.1. Solve the exercises of Section 3.5.3 using now XQuery instead
of XSLT.
Bibliography
[1] Apache. http://www.apache.org/.
[2] Brill, G. Codenotes for XML. Random House, 2001.
[3] Date, C. J. An Introduction to Database Systems. Addison Wesley, 1995.
[4] Perl. http://www.perl.com/.
[5] PHP. http://www.php.net/.
[6] PostgreSQL. http://www.postgresql.org/.
[7] Search Engine Watch. http://searchenginewatch.com/.
[8] XML. http://www.w3.org/XML/.
[9] XMLSchema. http://www.w3.org/XML/Schema.
[10] XPath. http://www.w3.org/TR/xpath.
[11] XQuery. http://www.w3.org/TR/xquery/.
[12] XQuery Use Cases. http://www.w3.org/TR/xquery-use-cases/.
[13] XSLT. http://www.w3.org/Style/XSL/.
[14] Zaiane, O. Database Systems and Structures, 1998. Lecture notes (also available online: http://www.cs.sfu.ca/CC/354/zaiane/material/notes/contents.html).