Ad-Hoc Queries over Document Collections – A Case Study

  • Conference paper
Enabling Real-Time Business Intelligence (BIRTE 2009)

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 41)

Abstract

We discuss the novel problem of supporting analytical business intelligence (BI) queries over web-based textual content, e.g., BI-style reports based on hundreds of thousands of documents from an ad-hoc web search result. Neither conventional search engines nor conventional business intelligence and ETL tools address this problem, which lies at the intersection of their capabilities. “Google Squared” and our system GOOLAP.info are examples of systems that do address it. They execute information extraction methods over one or several document collections at query time and integrate the extracted records into a common view or tabular structure. Frequent extraction and object resolution failures produce incomplete records that cannot be joined into a record answering the query. Our focus is the identification of join-reordering heuristics that maximize the size of the set of complete records answering a structured query. Taking given costs for document extraction into account, we propose two novel join operations: the multi-way CJ operator joins records from multiple relationships extracted from a single document, and the two-way DJ operator ensures data density by removing incomplete records from results. In a preliminary case study we observe that our join-reordering heuristics increase result size and record density and lower execution costs.
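The abstract only names the two join operations, so the following Python sketch is one possible reading of them, not the authors' implementation. It assumes that extracted records are dictionaries whose missing attributes are `None`; the function names `cj_join` and `dj_join`, the conflict rule, and all attribute names and values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional

# Assumption: an extracted record maps attribute names to values,
# with None marking an attribute the extractor failed to fill.
Record = Dict[str, Optional[str]]


def cj_join(relations_in_doc: List[List[Record]]) -> List[Record]:
    """Multi-way join of records from several relationships found in ONE document.

    Records are combined unless they disagree on a shared, non-missing attribute.
    """
    results: List[Record] = [{}]
    for records in relations_in_doc:
        next_results: List[Record] = []
        for partial in results:
            for rec in records:
                conflict = any(
                    partial.get(attr) is not None
                    and value is not None
                    and partial[attr] != value
                    for attr, value in rec.items()
                )
                if conflict:
                    continue
                merged = dict(partial)
                for attr, value in rec.items():
                    if merged.get(attr) is None:
                        merged[attr] = value
                next_results.append(merged)
        results = next_results
    return results


def dj_join(left: List[Record], right: List[Record], key: str) -> List[Record]:
    """Two-way join on `key` that keeps only complete (dense) result records."""
    index: Dict[str, List[Record]] = {}
    for rec in right:
        if rec.get(key) is not None:
            index.setdefault(rec[key], []).append(rec)
    joined: List[Record] = []
    for l_rec in left:
        for r_rec in index.get(l_rec.get(key), []):
            merged = dict(l_rec)
            for attr, value in r_rec.items():
                if merged.get(attr) is None:
                    merged[attr] = value
            # Density filter: drop any joined record that still has a gap.
            if all(value is not None for value in merged.values()):
                joined.append(merged)
    return joined


# Hypothetical records extracted from a single document.
doc = [
    [{"company": "Acme", "ceo": "Ann"}],
    [{"company": "Acme", "hq": "Berlin"}, {"company": "Other", "hq": "Paris"}],
]
per_doc = cj_join(doc)

# Hypothetical records from two different sources, one of them incomplete.
people = [{"company": "Acme", "ceo": "Ann"}, {"company": "Beta", "ceo": None}]
finance = [{"company": "Acme", "revenue": "10M"}, {"company": "Beta", "revenue": "5M"}]
dense = dj_join(people, finance, key="company")
```

In this sketch `per_doc` holds one combined record for "Acme", and `dense` omits the "Beta" record because its `ceo` attribute stays empty after the join. Join reordering and extraction costs, which the abstract also mentions, are not modeled here.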

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Löser, A., Lutter, S., Düssel, P., Markl, V. (2010). Ad-Hoc Queries over Document Collections – A Case Study. In: Castellanos, M., Dayal, U., Miller, R.J. (eds) Enabling Real-Time Business Intelligence. BIRTE 2009. Lecture Notes in Business Information Processing, vol 41. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14559-9_4

  • DOI: https://doi.org/10.1007/978-3-642-14559-9_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14558-2

  • Online ISBN: 978-3-642-14559-9

  • eBook Packages: Computer Science, Computer Science (R0)