Abstract
We discuss the novel problem of supporting analytical business intelligence (BI) queries over web-based textual content, e.g., BI-style reports based on hundreds of thousands of documents from an ad-hoc web search result. Neither conventional search engines nor conventional business intelligence and ETL tools address this problem, which lies at the intersection of their capabilities. “Google Squared” and our system GOOLAP.info are examples of systems that target this intersection. They execute information extraction methods over one or several document collections at query time and integrate the extracted records into a common view or tabular structure. Frequent extraction and object resolution failures produce incomplete records that cannot be joined into a record answering the query. Our focus is the identification of join-reordering heuristics that maximize the number of complete records answering a structured query. Given the costs of document extraction, we propose two novel join operations: the multi-way CJ-operator joins records from multiple relationships extracted from a single document, and the two-way DJ-operator ensures data density by removing incomplete records from results. In a preliminary case study we observe that our join-reordering heuristics increase result size and record density and lower execution costs.
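The abstract names the two operators but does not give their exact semantics, so the following is only a minimal sketch under stated assumptions: records are plain dictionaries, a failed extraction is represented by `None`, and all relationships share one join attribute. The function names `cj_join`, `dj_join` and `is_complete`, as well as the sample data, are hypothetical and are not taken from the paper.

```python
def is_complete(record):
    """Assumption: a record is complete when no attribute value is None."""
    return all(value is not None for value in record.values())


def cj_join(relationships, key):
    """Multi-way join over relationships extracted from ONE document.

    relationships: list of lists of record dicts, one inner list per relationship.
    Records are merged when they agree on the join attribute `key`.
    """
    joined = [{}]
    for records in relationships:
        merged = []
        for left in joined:
            for right in records:
                if right.get(key) is None:
                    continue  # the join attribute itself was not extracted
                if key in left and left[key] != right[key]:
                    continue  # different entities, do not merge
                merged.append({**left, **right})
        joined = merged
    return joined


def dj_join(left_records, right_records, key):
    """Two-way join that keeps only complete records (the "density" filter)."""
    result = []
    for left in left_records:
        for right in right_records:
            if left.get(key) is not None and left.get(key) == right.get(key):
                candidate = {**left, **right}
                if is_complete(candidate):
                    result.append(candidate)
    return result


# Hypothetical extraction output from a single document.
company_ceo = [
    {"company": "Acme", "ceo": "Jane Roe"},
    {"company": "Initech", "ceo": None},  # extraction failure
]
company_hq = [
    {"company": "Acme", "hq": "Berlin"},
    {"company": "Initech", "hq": "Austin"},
]
# Hypothetical records extracted from a second collection.
company_revenue = [
    {"company": "Acme", "revenue": "10M"},
    {"company": "Initech", "revenue": "5M"},
]

cj = cj_join([company_ceo, company_hq], "company")
dj = dj_join(cj, company_revenue, "company")
```

In this sketch `cj` still contains the incomplete Initech record, while `dj` retains only the Acme record. That mirrors the trade-off named in the abstract: applying the density filter earlier yields fewer but complete records and avoids carrying incomplete ones through later joins.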
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Löser, A., Lutter, S., Düssel, P., Markl, V. (2010). Ad-Hoc Queries over Document Collections – A Case Study. In: Castellanos, M., Dayal, U., Miller, R.J. (eds) Enabling Real-Time Business Intelligence. BIRTE 2009. Lecture Notes in Business Information Processing, vol 41. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14559-9_4
DOI: https://doi.org/10.1007/978-3-642-14559-9_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14558-2
Online ISBN: 978-3-642-14559-9
eBook Packages: Computer Science (R0)