XML was designed to be a language for describing the structure of data and for encoding structured or semistructured data in a manner that facilitates interchange. It can be viewed simply as an exchange form or as a more widely used encoding form for persistent data. Among its design goals, we note especially that it "shall support a wide range of applications," the majority of which are presently unknown. We expect the discussions within QL'98 to lead to the adoption of a standard data model for XML that will support as wide a range of applications as possible.
A query language for XML imposes semantics on the encoded data. As such, it forms an integral part of the resulting data model. The properties of the language will therefore promote or inhibit future applications of XML. Among other features, such a language must
In this paper, we do not advocate specific syntactic constructs, because we believe that most interactions with XML data will be through programs that convert query intentions, expressed through a user interface designed to maximize user convenience, into precise query statements expressed in a formal query language.
We propose that an XML document be modelled as a rooted, directed, ordered, labelled tree. In this model, every component of a document is accessible in the form in which it was encoded. Tree pattern matching is used to identify entities, attributes, values, or text fragments that are of interest to the user. Equally important is that updates to the tree through an integrated data manipulation language will result in structures that have unique expressions in XML. Thus the query and manipulation languages can be faithful interpreters of the application's data.
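As an illustration only, ordered tree pattern matching over such a model might be sketched as follows. The sketch assumes Python's standard xml.etree.ElementTree as the XML processor; the pattern representation (a label paired with a list of child patterns) and the names match and find_all are hypothetical choices made for this example, not part of any proposal.

```python
import xml.etree.ElementTree as ET

DOC = """<book year="1998">
<title>Foundations for Object/Relational Databases</title>
<author><lastname>Date</lastname></author>
<author><lastname>Darwen</lastname></author>
<publisher>Addison-Wesley</publisher>
</book>"""


def match(node, pattern):
    """Return True if `node` matches `pattern`.

    A pattern is a (label, [child patterns]) pair (a hypothetical format).
    Child patterns must be matched by distinct children of `node`,
    in the same left-to-right order as they appear in the pattern.
    """
    label, child_patterns = pattern
    if node.tag != label:
        return False
    # A single shared iterator: each child pattern resumes the scan
    # where the previous one stopped, which enforces sibling order.
    children = iter(node)
    return all(any(match(c, p) for c in children) for p in child_patterns)


def find_all(root, pattern):
    """Return every node in the tree rooted at `root` that matches `pattern`."""
    return [n for n in root.iter() if match(n, pattern)]


root = ET.fromstring(DOC)
in_order = ("book", [("title", []), ("author", [("lastname", [])])])
out_of_order = ("book", [("publisher", []), ("title", [])])
```

Here find_all(root, in_order) selects the book element, whereas out_of_order selects nothing, because the pattern asks for a publisher that precedes the title.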
XML-QL, which has recently been proposed as a general-purpose query language for XML, models an XML document instead as a rooted, directed, unordered, labelled graph. We agree with the proponents of XML-QL that adopting a database approach to XML will address applications' needs "for extracting data from large XML documents, for translating XML data between different ontologies (DTD's), for integrating XML data from multiple XML sources, and for transporting large amounts of XML data to clients or for sending queries to XML sources." However, we disagree with the choice of model, as it limits potential applications by conflating distinct data encoding choices.
XML defines an encoding of text viewed as a sequence of characters. As an integral part of the definition, however, an XML application accesses the text through an XML processor, which distinguishes markup from character data and performs various normalizations on the character stream. A query engine will not be able to distinguish aspects of the data that are masked by an XML processor (e.g., each occurrence of "#xD#xA" or of a lone "#xD" must be replaced by "#xA" before the text reaches the application). As a result, a query language for XML must not include constructs specifying such indistinguishable features in a document. Thus the tree model is more abstract than a pure-text string model for XML.
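The masking of line-end conventions can be observed with any conforming processor. The following is a small sketch, assuming Python's standard xml.etree.ElementTree (which is built on the Expat parser) stands in for the XML processor; the element name p and the sample text are invented for the example.

```python
import xml.etree.ElementTree as ET

# Three encodings of the same two-line paragraph, differing only in
# the line-end convention used in the stored text.
crlf = ET.fromstring("<p>line one\r\nline two</p>")
cr = ET.fromstring("<p>line one\rline two</p>")
lf = ET.fromstring("<p>line one\nline two</p>")

# Collect the character data as delivered to the application.
texts = {crlf.text, cr.text, lf.text}
```

After parsing, the three documents deliver the same character data, so a query posed against the processor's output has no means of asking which line-end convention the stored text used.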
Nevertheless, XML processors preserve many properties of documents, and the query language should not mask these. In particular, as opposed to the proponents of XML-QL, we believe that distinguishing the document fragment
<book year="1998">
<title>Foundations for Object/Relational Databases</title>
<author><lastname>Date</lastname></author>
<author><lastname>Darwen</lastname></author>
<publisher>Addison-Wesley</publisher>
</book>
from one in which the authors are listed in the reverse order is a fundamental property of XML encodings. Even though the encoding:
<writers>
<person id="o123">
<name>Smith</name>
</person>
<person id="o234">
<name>Jones</name>
</person>
</writers>
<article author="o123 o234">
<title>Encoding XML</title>
</article>
and the alternative encoding:
<writers person="o123 o234"/>
<article>
<author id="o123">
<name>Smith</name>
</author>
<author id="o234">
<name>Jones</name>
</author>
<title>Encodings</title>
</article>
might be modelled by identical graphs, they represent distinct XML structures. Applications must be able to distinguish which elements contain defining elements and which ones refer to those elements via attributes having values of type IDREF. Without imposing our prejudices on future unknown applications of XML, we cannot safely assert which parts of the encoding will be essential to the data and which will not.
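The distinction can be made concrete with a short sketch. It assumes Python's standard xml.etree.ElementTree, wraps each fragment in a hypothetical doc element so that it parses as a single document, and, because no DTD is supplied, simply assumes that attributes named id have type ID and that any other attribute whose tokens all name known identifiers has type IDREFS. The function name definitions_and_references is likewise invented for this example.

```python
import xml.etree.ElementTree as ET

FIRST = """<doc>
<writers>
<person id="o123"><name>Smith</name></person>
<person id="o234"><name>Jones</name></person>
</writers>
<article author="o123 o234"><title>Encoding XML</title></article>
</doc>"""

SECOND = """<doc>
<writers person="o123 o234"/>
<article>
<author id="o123"><name>Smith</name></author>
<author id="o234"><name>Jones</name></author>
<title>Encodings</title>
</article>
</doc>"""


def definitions_and_references(xml_text):
    """Report where identifiers are defined and where they are referenced."""
    root = ET.fromstring(xml_text)
    parent_of = {child: parent for parent in root.iter() for child in parent}
    # Assumption: an attribute named "id" declares an identifier (type ID).
    # Record the containing element's tag and the defining element's tag.
    defs = {
        e.get("id"): (parent_of[e].tag, e.tag)
        for e in root.iter()
        if e.get("id") and e in parent_of
    }
    # Assumption: any other attribute whose tokens are all known
    # identifiers is treated as a reference list (type IDREFS).
    refs = [
        (e.tag, name, value.split())
        for e in root.iter()
        for name, value in e.attrib.items()
        if name != "id" and value.split() and set(value.split()) <= set(defs)
    ]
    return defs, refs


defs_first, refs_first = definitions_and_references(FIRST)
defs_second, refs_second = definitions_and_references(SECOND)
```

In the first encoding the identifiers are defined by person elements inside writers and referenced from article; in the second the roles are reversed. A tree model keeps that difference available to applications that need it.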
Similarly, we argue that the data model should preserve the order in which attributes are stored. Although some XML processors might normalize the order of attributes, we note that the XML specification does not dictate how a processor must present attributes to an application. For a general-purpose query language, we therefore advocate the conservative position that the order of occurrence for each tag's attributes must be accessible to an application that wishes to exploit that property. (For both an element's children and its attributes, the query language may also include order-independent matching and re-ordering of results to support those applications that depend on an ability to encode unordered sets.)
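Both kinds of matching can coexist over one model, as the following sketch suggests. It assumes Python's standard xml.etree.ElementTree; the lang attribute and the function names ordered_key and unordered_key are hypothetical. Note that the attribute order seen here is whatever this particular processor happens to deliver, which is exactly the freedom the XML specification leaves open.

```python
import xml.etree.ElementTree as ET

A = '<book year="1998" lang="en"><author>Date</author><author>Darwen</author></book>'
B = '<book lang="en" year="1998"><author>Darwen</author><author>Date</author></book>'


def ordered_key(elem):
    """Structural key that is sensitive to attribute and child order."""
    return (
        elem.tag,
        list(elem.attrib.items()),  # order as delivered by the processor
        (elem.text or "").strip(),
        [ordered_key(c) for c in elem],
    )


def unordered_key(elem):
    """Structural key that treats attributes and children as unordered."""
    return (
        elem.tag,
        sorted(elem.attrib.items()),
        (elem.text or "").strip(),
        sorted(unordered_key(c) for c in elem),
    )


a, b = ET.fromstring(A), ET.fromstring(B)
same_when_ordered = ordered_key(a) == ordered_key(b)
same_when_unordered = unordered_key(a) == unordered_key(b)
```

An application that depends on author order compares with ordered_key and sees two different documents; one that encodes an unordered set of authors compares with unordered_key and sees them as equal.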
Of course, not every query need be posed directly against the tree-structured model of the data. Many facilities are best supported through the extensive use of views.
As an example, we integrated our original proposal with the ongoing standardization of SQL, and we therefore included mechanisms to extract relations from documents. We believe that this is only one of many possible extraction forms that should be supported. In particular, we also support extraction of selected data as a directed graph that could be further manipulated by a semistructured database engine such as those underlying XML-QL, Ozone, and others.
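To suggest what extracting a relation from a document might look like, here is a minimal sketch. It is not the SQL mechanism referred to above; it assumes Python's standard xml.etree.ElementTree, a hypothetical catalog wrapper element, and an invented function book_author_relation that emits one tuple per (book, author) pair, carrying the author's position so that document order survives in the otherwise unordered relation.

```python
import xml.etree.ElementTree as ET

DOC = """<catalog>
<book year="1998">
<title>Foundations for Object/Relational Databases</title>
<author><lastname>Date</lastname></author>
<author><lastname>Darwen</lastname></author>
<publisher>Addison-Wesley</publisher>
</book>
</catalog>"""


def book_author_relation(xml_text):
    """Extract rows of (title, year, author_position, lastname)."""
    root = ET.fromstring(xml_text)
    rows = []
    for book in root.iter("book"):
        title = book.findtext("title")
        year = int(book.get("year"))
        # enumerate() records each author's position among its siblings.
        for position, author in enumerate(book.findall("author"), start=1):
            rows.append((title, year, position, author.findtext("lastname")))
    return rows


rows = book_author_relation(DOC)
```

Recording the position as an explicit column is one way to carry the tree's ordering into a model that has none of its own.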
Experience with large text databases has shown views to be an excellent mechanism to allow applications to ignore sections of a document collection or specific document encodings and thus provide convenient access to those parts of the data that provide information for the immediate application needs. For example, a single large document that represents a book of policies and procedures can be accessed through a view that isolates individual policies and individual procedures as if they were each separately stored documents; a second view can isolate those procedures that dictate aspects of purchasing non-capital equipment and present them to an application as if they were stored as a single document. Thus, through the availability of views, application designers need not be limited by their choice of document boundaries.
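The two views in this example might be sketched as follows, assuming Python's standard xml.etree.ElementTree. The manual, policy, and procedure element names, the topic attribute, and both function names are hypothetical; a production system would more likely define such views declaratively and evaluate them without copying.

```python
import copy
import xml.etree.ElementTree as ET

MANUAL = """<manual>
<policy id="p1"><title>Travel</title></policy>
<procedure id="r1" topic="purchasing"><title>Ordering supplies</title></procedure>
<procedure id="r2" topic="hiring"><title>Posting a position</title></procedure>
<procedure id="r3" topic="purchasing"><title>Approving invoices</title></procedure>
</manual>"""


def as_separate_documents(root):
    """View 1: each policy or procedure presented as its own document."""
    return [
        ET.ElementTree(copy.deepcopy(e))
        for e in root
        if e.tag in ("policy", "procedure")
    ]


def as_single_document(root, topic):
    """View 2: all procedures on one topic gathered into one document."""
    view = ET.Element("procedures", {"topic": topic})
    view.extend([
        copy.deepcopy(e)
        for e in root.iter("procedure")
        if e.get("topic") == topic
    ])
    return ET.ElementTree(view)


manual = ET.fromstring(MANUAL)
separate = as_separate_documents(manual)
purchasing = as_single_document(manual, "purchasing")
purchasing_titles = [p.findtext("title") for p in purchasing.getroot()]
```

Both views are computed from the same stored document, so neither the four small documents nor the combined purchasing document needs to exist in storage.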
The view mechanism need not be limited to defining virtual documents; it can also provide means to define virtual tables of data conforming to the relational model, collections of virtual objects accessible by object-oriented language constructs, virtual graphs underlying semistructured data models, or other formats serving future applications' needs. Furthermore, views allow documents to conform to evolving encoding standards. Views hide selected encoding details, assemble widespread data, and rename (as well as reformat and recompute) components to conform to a chosen ontology. Through views, applications can access equivalent forms of XML data even if the sources of that data adopted diverse representations.
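A renaming view of the simplest kind can be sketched as below, assuming Python's standard xml.etree.ElementTree. The two source encodings, the rename tables, and the function renamed_view are all invented for the illustration; real views would also need to reformat and recompute values, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

# Two sources that adopted different element names for the same data.
SOURCE_A = "<book><title>Encoding XML</title><writer>Smith</writer></book>"
SOURCE_B = "<monograph><name>Encoding XML</name><author>Smith</author></monograph>"

# Hypothetical mappings from each source's vocabulary to a chosen ontology.
RENAME_A = {"writer": "author"}
RENAME_B = {"monograph": "book", "name": "title"}


def renamed_view(xml_text, mapping):
    """Parse a fresh copy of the source and rename its element labels."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = mapping.get(elem.tag, elem.tag)
    return root


view_a = ET.tostring(renamed_view(SOURCE_A, RENAME_A))
view_b = ET.tostring(renamed_view(SOURCE_B, RENAME_B))
```

Seen through their respective views, the two sources serialize identically, so an application written against the chosen ontology can treat them as equivalent.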
When it comes to updating the data, an application that has accessed it through a view may or may not be able to express which XML encoding to use to represent the modified data. However, this is a standard problem that continues to be addressed by researchers in the database community: certain views may be deemed to disallow updates, others may define which update among various alternatives to choose, yet others may require an administrator's intervention.
A general-purpose query language for XML should provide access to "base documents" without attaching semantics beyond what is dictated by the markup. In the presence of a DTD, one or more views can be defined over the document to allow graph-like structures through IDREF-ID pairs and to ignore ordering of attributes or elements selectively. In the presence of other XML extensions, such as XSL or XLink and XPointer, further views can be defined for even more abstract modelling of the text. Given other types, such as relations or object classes, additional views can be superimposed on the text to support various applications' information needs. In this way, fidelity, flexibility, and convenience can all be provided to all of us who need to query and manipulate XML data.
S. Abiteboul, J. Widom, T. Lahiri, A Unified Approach for Querying Structured Data and XML, Position paper for QL'98.
L. J. Brown, M. P. Consens, I. J. Davis, C. R. Palmer, and F. W. Tompa, A Structured Text ADT for Object-Relational Databases, in "Objects, Databases, and the WWW," a special issue of Theory and Practice of Object Systems, to appear (1998).
I. Davis, Adding Structured Text to SQL/MM Part 2: Full Text, Department of Computer Science, University of Waterloo, SQL/MM Change Proposal LHR-24, CAC WG3 N334R2, February 12, 1996, 42 pp.
A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu, XML-QL: A Query Language for XML, W3C Submission, NOTE-xml-ql-19980819 (August 19, 1998).
D. R. Raymond, F. W. Tompa, and D. Wood, From Data Representation to Data Model: Meta-Semantic Issues in the Evolution of SGML, Computer Standards & Interfaces 18 (1996) 25-36.
† The opinions expressed in this paper are those of the author and do not necessarily represent positions advocated by the University of Waterloo nor by others associated with the University. Financial assistance has been provided by a University-Industry award from the Conference Board of Canada and the Natural Sciences and Engineering Research Council of Canada.