Abstract
For querying structured and semistructured data, data retrieval and document retrieval are two valuable and complementary techniques that have not yet been fully integrated. In this paper, we introduce integrated information retrieval (IIR), an XML-based retrieval approach that closes this gap. We introduce the syntax and semantics of an extension of the XQuery language called XQuery/IR. The extended language realizes IIR and thereby allows users to formulate new kinds of queries by nesting ranked document retrieval and precise data retrieval queries. Furthermore, we detail index structures and efficient query processing approaches for implementing XQuery/IR. Based on a new identification scheme for nodes in node-labeled tree structures, the extended index structures require only a fraction of the space of comparable index structures that only support data retrieval.
Cite this article
Bremer, JM., Gertz, M. Integrating document and data retrieval based on XML. The VLDB Journal 15, 53–83 (2006). https://doi.org/10.1007/s00778-004-0150-4
DOI: https://doi.org/10.1007/s00778-004-0150-4