Abstract
Statistical summaries in relational databases mainly focus on the distribution of data values and have been found useful for various applications, such as query evaluation and data storage. As XML has become widely used, e.g. for online data exchange, the need for corresponding statistical summaries of XML has become evident. While relational techniques may be applicable to the data values in XML documents, novel techniques are required for summarizing the structures of XML documents. In this paper, we propose metrics for major structural properties of XML documents, in particular nestings of entities and one-to-many relationships. Our technique differs from existing ones in that we generate a quantitative summary of an XML structure. Using our approach, we illustrate that some popular real-world and synthetic XML benchmark datasets are indeed highly skewed, hardly hierarchical, and contain few recursions. We hope this preliminary finding sheds light on improving the design of XML benchmarks and experiments.
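To make the idea of a quantitative structural summary concrete, the sketch below counts a few simple structural properties of an XML document: the maximum nesting depth, the largest number of same-tag children under one parent (a rough proxy for one-to-many relationships), and the number of elements that have an ancestor with the same tag (a rough proxy for recursion). These particular measures and the function name structural_summary are illustrative assumptions; they are not the metrics defined in the paper.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def structural_summary(xml_text):
    """Return simple structural counts for an XML document.

    Hypothetical illustration only: these counts are NOT the paper's metrics.
    """
    root = ET.fromstring(xml_text)
    total = 0            # number of elements visited
    max_depth = 0        # deepest nesting level (the root is at depth 1)
    recursive = 0        # elements that have an ancestor with the same tag
    max_fanout = {}      # (parent tag, child tag) -> most same-tag children under one parent

    # Iterative depth-first traversal; each entry carries the tag path from the root.
    stack = [(root, 1, (root.tag,))]
    while stack:
        node, depth, path = stack.pop()
        total += 1
        max_depth = max(max_depth, depth)
        if node.tag in path[:-1]:
            recursive += 1
        # Count the children of this node by tag.
        per_tag = Counter(child.tag for child in node)
        for tag, n in per_tag.items():
            key = (node.tag, tag)
            max_fanout[key] = max(max_fanout.get(key, 0), n)
        for child in node:
            stack.append((child, depth + 1, path + (child.tag,)))

    return {
        "elements": total,
        "max_depth": max_depth,
        "recursive_elements": recursive,
        "max_fanout": max_fanout,
    }


if __name__ == "__main__":
    sample = "<a><b><c/><c/><c/></b><b><a/></b></a>"
    print(structural_summary(sample))
```

Counts such as these only indicate whether a dataset is deeply nested, fan-out heavy, or recursive; a fuller summary would also need to describe how such values are distributed across the document.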
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lin, Z., He, B., Choi, B. (2006). A Quantitative Summary of XML Structures. In: Embley, D.W., Olivé, A., Ram, S. (eds) Conceptual Modeling - ER 2006. ER 2006. Lecture Notes in Computer Science, vol 4215. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11901181_18
DOI: https://doi.org/10.1007/11901181_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-47224-7
Online ISBN: 978-3-540-47227-8
eBook Packages: Computer Science, Computer Science (R0)