Integrating document and data retrieval based on XML

  • Regular Paper
The VLDB Journal

Abstract

For querying structured and semistructured data, data retrieval and document retrieval are two valuable and complementary techniques that have not yet been fully integrated. In this paper, we introduce integrated information retrieval (IIR), an XML-based retrieval approach that closes this gap. We introduce the syntax and semantics of an extension of the XQuery language called XQuery/IR. The extended language realizes IIR and thereby allows users to formulate new kinds of queries by nesting ranked document retrieval and precise data retrieval queries. Furthermore, we detail index structures and efficient query processing approaches for implementing XQuery/IR. Based on a new identification scheme for nodes in node-labeled tree structures, the extended index structures require only a fraction of the space of comparable index structures that only support data retrieval.
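As a rough illustration of what nesting precise data retrieval and ranked document retrieval can mean, the sketch below first applies an exact predicate (publication year) and then ranks the surviving text sections by a toy relevance score. It is not XQuery/IR syntax and not the authors' implementation: the XML collection, the element and attribute names, and the term-count `score` function are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy collection; element and attribute names are illustrative only.
XML = """
<library>
  <article year="2002"><title>Structural joins</title>
    <section>XML query pattern matching with joins</section>
    <section>Index structures for XML retrieval and XML storage</section>
  </article>
  <article year="1999"><title>Old survey</title>
    <section>XML XML XML retrieval</section>
  </article>
  <article year="2003"><title>Ranked search</title>
    <section>Ranked keyword retrieval over documents</section>
  </article>
</library>
"""


def score(text, terms):
    """Toy relevance score (assumption): total occurrences of the query terms."""
    words = text.lower().split()
    return sum(words.count(term.lower()) for term in terms)


def ranked_sections(root, min_year, terms):
    """Filter articles by an exact year predicate, then rank their sections."""
    hits = []
    for article in root.iter("article"):
        # Data retrieval part: a precise, boolean condition on structured data.
        if int(article.get("year")) < min_year:
            continue
        # Document retrieval part: score free text and keep only matches.
        for section in article.iter("section"):
            relevance = score(section.text or "", terms)
            if relevance > 0:
                hits.append((relevance, article.findtext("title"), section.text))
    # Highest score first; the sort is stable, so ties keep document order.
    return sorted(hits, key=lambda hit: -hit[0])


root = ET.fromstring(XML)
results = ranked_sections(root, 2000, ["XML", "retrieval"])
for relevance, title, text in results:
    print(relevance, title, "|", text)
```

The 1999 article never reaches the ranking step even though its section would score highly, which is the kind of interaction between exact predicates and relevance ranking that an integrated query language has to define.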

Author information

Corresponding author

Correspondence to Michael Gertz.

About this article

Cite this article

Bremer, JM., Gertz, M. Integrating document and data retrieval based on XML. The VLDB Journal 15, 53–83 (2006). https://doi.org/10.1007/s00778-004-0150-4

  • DOI: https://doi.org/10.1007/s00778-004-0150-4
