Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/1871437.1871528acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

Efficient temporal keyword search over versioned text

Published: 26 October 2010 Publication History

Abstract

Modern text analytics applications operate on large volumes of temporal text data such as Web archives, newspaper archives, blogs, wikis, and micro-blogs. In these settings, searching and mining needs to use constraints on the time dimension in addition to keyword constraints. A natural approach to address such queries is using an inverted index whose entries are enriched with valid-time intervals. It has been shown that these indexes have to be partitioned along time in order to achieve efficiency. However, when the temporal predicate corresponds to a long time range, requiring the processing of multiple partitions, naive query processing incurs high cost of reading of redundant entries across partitions.
We present a framework for efficient approximate processing of keyword queries over a temporally partitioned inverted index which minimizes this overhead, thus speeding up query processing. By using a small synopsis for each partition we identify partitions that maximize the number of final non-redundant results, and schedule them for processing early on. Our approach aims to balance the estimated gains in the final result recall against the cost of index reading required. We present practical algorithms for the resulting optimization problem of index partition selection. Our experiments with three diverse, large-scale text archives reveal that our proposed approach can provide close to 80% result recall even when only about half the index is allowed to be read.

References

[1]
European archive. http://www.europarchive.org.
[2]
New york times annotated corpus. http://corpus.nytimes.com.
[3]
S. Acharya, P. B. Gibbons, and V. Poosala. Aqua: A Fast Decision Support Systems Using Approximate Query Answers. In VLDB, pages 754--757, 1999.
[4]
S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. In SIGMOD, pages 275--286, 1999.
[5]
O. Alonso, M. Gertz, and R. Baeza-Yates. On the value of temporal information in information retrieval. SIGIR Forum, 41(2):35--41, 2007.
[6]
A. Anand, S. Bedathur, K. Berberich and Ralf Schenkel. Efficient Temporal Keyword Queries over Versioned Text. Technical Report MPI-I-2010-5-003, Max-Planck Institute for Informatics, 2010.
[7]
B. Becker, S. Gschwind, T. Ohler, B. Seeger, and P. Widmayer. An asymptotically optimal multiversion b-tree. The VLDB Journal, 5(4), 1996.
[8]
K. Berberich, S. Bedathur, T. Neumann, and G. Weikum. A Time Machine for Text Search. In SIGIR, 2007.
[9]
K. Berberich, S. Bedathur, T. Neumann, and G. Weikum. FluxCapacitor: Efficient Time-Travel Text Search. In Proc. of VLDB, 2007.
[10]
K. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. On synopses for distinct-value estimation under multiset operations. In SIGMOD, pages 199--210, 2007.
[11]
D. Gao, C. S. Jensen, R. T. Snodgrass, and M. D. Soo. Join operations in temporal databases. The VLDB Journal, 14(1):2--29, 2005.
[12]
S. Khuller, A. Moss, and J. S. Naor. The budgeted maximum coverage problem. Inf. Process. Lett., 70(1):39--45, 1999.
[13]
P. Muth, P. E. O'Neil, A. Pick, and G. Weikum. The LHAM Log-Structured History Data Access Method. VLDB J., 8(3-4):199--221, 2000.
[14]
Wikipedia. http://en.wikipedia.org/.

Cited By

View all
  • (2024)Scalable Range Search over Temporal and Numerical ExpressionsProceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval10.1145/3664190.3672509(91-100)Online publication date: 2-Aug-2024
  • (2024)Temporal JSON Keyword SearchProceedings of the ACM on Management of Data10.1145/36549802:3(1-27)Online publication date: 30-May-2024
  • (2024)QuanTemp: A real-world open-domain benchmark for fact-checking numerical claimsProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657874(650-660)Online publication date: 10-Jul-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
October 2010
2036 pages
ISBN:9781450300995
DOI:10.1145/1871437
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 October 2010

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. partition selection
  2. partitioned inverted index
  3. synopses
  4. time-travel search

Qualifiers

  • Research-article

Conference

CIKM '10

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Upcoming Conference

CIKM '25

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)10
  • Downloads (Last 6 weeks)0
Reflects downloads up to 10 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Scalable Range Search over Temporal and Numerical ExpressionsProceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval10.1145/3664190.3672509(91-100)Online publication date: 2-Aug-2024
  • (2024)Temporal JSON Keyword SearchProceedings of the ACM on Management of Data10.1145/36549802:3(1-27)Online publication date: 30-May-2024
  • (2024)QuanTemp: A real-world open-domain benchmark for fact-checking numerical claimsProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657874(650-660)Online publication date: 10-Jul-2024
  • (2024)Spatio-Temporal Keyword Query Processing Based on Key-Value StoresData Science and Engineering10.1007/s41019-024-00265-8Online publication date: 5-Dec-2024
  • (2023)Indexing for Keyword Search with Structured ConstraintsProceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3584372.3588663(263-275)Online publication date: 18-Jun-2023
  • (2023)An Index for Set Intersection With Post-FilteringIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.3329145(1-14)Online publication date: 2023
  • (2019)Towards Longitudinal Analytics on Social Media Data2019 IEEE 35th International Conference on Data Engineering (ICDE)10.1109/ICDE.2019.00039(350-361)Online publication date: Apr-2019
  • (2017)Top-k temporal keyword search over social media dataWorld Wide Web10.1007/s11280-016-0430-020:5(1049-1069)Online publication date: 1-Sep-2017
  • (2016)Temporal Information RetrievalProceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval10.1145/2911451.2914805(1235-1238)Online publication date: 7-Jul-2016
  • (2016)TempasProceedings of the 25th International Conference Companion on World Wide Web10.1145/2872518.2890555(207-210)Online publication date: 11-Apr-2016
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media