Ad-Hoc Queries over Document Collections – A Case Study

Alexander Löser, Steffen Lutter, Patrick Düssel & Volker Markl

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 41)

Included in the following conference series:

International Workshop on Business Intelligence for the Real-Time Enterprise


Abstract

We discuss the novel problem of supporting analytical business intelligence queries over web-based textual content, e.g., BI-style reports based on hundreds of thousands of documents from an ad-hoc web search result. Neither conventional search engines nor conventional Business Intelligence and ETL tools address this problem, which lies at the intersection of their capabilities. “Google Squared” and our system GOOLAP.info are examples of such systems. They execute information extraction methods over one or several document collections at query time and integrate the extracted records into a common view or tabular structure. Frequent extraction and object resolution failures cause incomplete records that cannot be joined into a record answering the query. Our focus is the identification of join-reordering heuristics that maximize the number of complete records answering a structured query. With respect to given costs for document extraction, we propose two novel join operations: the multi-way CJ-operator joins records from multiple relationships extracted from a single document, and the two-way join-operator DJ ensures data density by removing incomplete records from results. In a preliminary case study we observe that our join-reordering heuristics increase result size and record density and lower execution costs.
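The two operators can be illustrated with a minimal sketch. It is only one reading of the abstract, not the chapter's implementation: the record layout (dictionaries with a hypothetical `doc` attribute, `None` marking a failed extraction), the function names `cj_join` and `dj_join`, and the equality-based join condition are all assumptions made for illustration.

```python
from itertools import product
from typing import Dict, List, Optional

# Assumed record layout: attribute name -> value, None marks a failed extraction.
Record = Dict[str, Optional[str]]


def is_complete(record: Record) -> bool:
    """A record counts as complete when no attribute is missing."""
    return all(value is not None for value in record.values())


def dj_join(left: List[Record], right: List[Record], key: str) -> List[Record]:
    """Two-way join on `key` that drops incomplete result records."""
    joined = []
    for l, r in product(left, right):
        if l.get(key) is not None and l.get(key) == r.get(key):
            merged = {**l, **r}  # on shared attributes the right value wins
            if is_complete(merged):
                joined.append(merged)
    return joined


def cj_join(relations: List[List[Record]], key: str) -> List[Record]:
    """Multi-way join of records extracted from the same document."""
    joined = []
    for combo in product(*relations):
        docs = {rec.get("doc") for rec in combo}
        keys = {rec.get(key) for rec in combo}
        # Keep a combination only if all records share one document and one key.
        if len(docs) == 1 and len(keys) == 1 and None not in docs | keys:
            merged: Record = {}
            for rec in combo:
                merged.update(rec)
            joined.append(merged)
    return joined


if __name__ == "__main__":
    # Hypothetical extraction output for two relationships.
    ceo = [
        {"doc": "d1", "company": "Acme", "ceo": "Ann Lee"},
        {"doc": "d2", "company": "Bolt", "ceo": None},  # extraction failed
    ]
    hq = [
        {"doc": "d1", "company": "Acme", "city": "Berlin"},
        {"doc": "d3", "company": "Bolt", "city": "Paris"},
    ]
    print(dj_join(ceo, hq, "company"))
    print(cj_join([ceo, hq], "company"))
```

In this toy data the record for "Bolt" has no CEO value, so `dj_join` discards the joined "Bolt" row, and `cj_join` only combines the two "Acme" records because they come from the same document. How the chapter's operators handle partial matches, object resolution, and extraction costs is not stated in the abstract and is not modelled here.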



Author information

Authors and Affiliations

DIMA Group, Technische Universität Berlin, Einsteinufer 17, 10587, Berlin, Germany
Alexander Löser, Steffen Lutter & Volker Markl
Intelligent Data Analysis, Fraunhofer Institute FIRST, 12489, Berlin, Germany
Patrick Düssel & Volker Markl


Editor information

Editors and Affiliations

Hewlett-Packard, 1501 Page Mill Rd, MS-1142, 94304, Palo Alto, CA, USA
Malu Castellanos
Hewlett-Packard, 1501 Page Mill Rd, 94304, Palo Alto, CA, USA
Umeshwar Dayal
University of Toronto, 40 St. George St, M5S 3H5, Toronto, ON, Canada
Renée J. Miller


About this paper

Cite this paper

Löser, A., Lutter, S., Düssel, P., Markl, V. (2010). Ad-Hoc Queries over Document Collections – A Case Study. In: Castellanos, M., Dayal, U., Miller, R.J. (eds) Enabling Real-Time Business Intelligence. BIRTE 2009. Lecture Notes in Business Information Processing, vol 41. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14559-9_4

DOI: https://doi.org/10.1007/978-3-642-14559-9_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14558-2
Online ISBN: 978-3-642-14559-9
eBook Packages: Computer Science, Computer Science (R0)
