Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

RC-index: diversifying answers to range queries

Published: 01 March 2018 Publication History

Abstract

Query result diversification is widely used in data exploration, Web search, and recommendation systems. The problem of returning diversified query results consists of finding a small subset of valid query answers that are representative and different from one another, usually quantified by a diversity score. Most existing techniques for query diversification first compute all valid query results and then find a diverse subset. These techniques are inefficient when the set of valid query results is large. Other work has proposed efficient solutions for restricted application settings, where results are shared across multiple queries. In this paper, our goal is to support result diversification for general range queries over a single relation. We propose the RC-Index, a novel index structure that achieves efficiency by reducing the number of items that must be retrieved by the database to form a diverse set of the desired size (about 1 second for a dataset of 1 million items). Further, we prove that an RC-Index offers strong approximation guarantees. To the best of our knowledge, this is the first index-based diversification method with a guaranteed approximation ratio for range queries.

References

[1]
Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica. Blinkdb: Queries with bounded errors and bounded response times on very large data. In EuroSys, pages 29--42, 2013.
[2]
Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 18(9):509--517, September 1975.
[3]
Jon Louis Bentley. Decomposable searching problems. Inf. Process. Lett., 8(5):244--251, 1979.
[4]
Jon Louis Bentley and Jerome H. Friedman. Data structures for range searching. ACM Comput. Surv., 11(4):397--409, 1979.
[5]
Alina Beygelzimer, Sham Kakade, and John Langford. Cover trees for nearest neighbor. In ICML, pages 97--104, 2006.
[6]
Jock A. Blackard and Denis J. Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131 -- 151, 1999.
[7]
Jaime Carbonell and Jade Goldstein. The use of mmr, diversity-based reranking for reordering documents and producing summaries. In SIGIR, pages 335--336, 1998.
[8]
Matteo Ceccarello, Andrea Pietracaprina, Geppino Pucci, and Eli Upfal. Mapreduce and streaming algorithms for diversity maximization in metric spaces of bounded doubling dimension. PVLDB, 10(5):469--180, 2017.
[9]
Douglas Comer. Ubiquitous b-tree. ACM Comput. Surv., 11(2):121--137, June 1979.
[10]
Ting Deng and Wenfei Fan. On the complexity of query result diversification. PVLDB, 6(8):577--588, 2013.
[11]
Ting Deng and Wenfei Fan. On the complexity of query result diversification. ACM Trans. Database Syst., 39(2):15:1--15:46, 2014.
[12]
M. Drosou and E. Pitoura. Diverse set selection over dynamic data. IEEE Transactions on Knowledge and Data Engineering, 26(5):1102--1116, 2014.
[13]
Marina Drosou and Evaggelia Pitoura. Search result diversification. SIGMOD Rec., 39(1):41--47, 2010.
[14]
Marina Drosou and Evaggelia Pitoura. Disc diversity: Result diversification based on dissimilarity and coverage. PVLDB, 6(1):13--24, 2012.
[15]
Marina Drosou and Evaggelia Pitoura. Dynamic diversification of continuous data. In EDBT, pages 216--227, 2012.
[16]
Erhan Erkut. The discrete p-dispersion problem. European Journal of Operational Research, 46(1):48 -- 60, 1990.
[17]
R. A. Finkel and J. L. Bentley. Quad trees a data structure for retrieval on composite keys. Acta Inf., 4(1):1--9, March 1974.
[18]
Sreenivas Gollapudi and Aneesh Sharma. An axiomatic approach for result diversification. In WWW, pages 381--390, 2009.
[19]
Joseph M. Hellerstein, Jeffrey F. Naughton, and Avi Pfeffer. Generalized search trees for database systems. In VLDB, pages 562--573, 1995.
[20]
Ramon Huerta, Thiago Mosqueiro, Jordi Fonollosa, Nikolai F Rulkov, and Irene Rodriguez-Lujan. Online decorrelation of humidity and temperature in chemical sensors for continuous monitoring. Chemometrics and Intelligent Laboratory Systems, 157:169 -- 176, 2016.
[21]
Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, and Vahab S. Mirrokni. Composable core-sets for diversity and coverage maximization. In PODS, pages 100--108, 2014.
[22]
David R. Karger and Matthias Ruhl. Finding nearest neighbors in growth-restricted metrics. In STOC, pages 741--750, 2002.
[23]
H. A. Khan and M. A. Sharaf. Progressive diversification for column-based data exploration platforms. In ICDE, pages 327--338, 2015.
[24]
Hina A. Khan, Marina Drosou, and Mohamed A. Sharaf. Dos: An efficient scheme for the diversification of multiple search results. In SSDBM, pages 40:1--40:4, 2013.
[25]
Hina A. Khan, Marina Drosou, and Mohamed A. Sharaf. Scalable diversification of multiple search results. In CIKM, pages 775--780, 2013.
[26]
Hina A. Khan, Mohamed A. Sharaf, and Abdullah Albarrak. Divide: Efficient diversification for interactive data exploration. In SSDBM, pages 15:1--15:12, 2014.
[27]
Robert Krauthgamer and James R. Lee. Navigating nets: Simple algorithms for proximity search. In SODA, pages 798--807, 2004.
[28]
D. T. Lee and C. K. Wong. Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees. Acta Inf., 9(1):23--29, March 1977.
[29]
D. T. Lee and C. K. Wong. Quintary trees: A file structure for multidimensional datbase sytems. ACM Trans. Database Syst., 5(3):339--353, 1980.
[30]
M. Lichman. UCI machine learning repository, 2013.
[31]
George S. Lueker. A data structure for orthogonal range queries. In Proceedings of the 19th Annual Symposium on Foundations of Computer Science, pages 28--34, 1978.
[32]
George S. Lueker and Dan E. Willard. A data structure for dynamic range queries. Information Processing Letters, 15(5):209--213, 1982.
[33]
Enrico Minack, Wolf Siberski, and Wolfgang Nejdl. Incremental diversification for very large sets: A streaming-based approach. In SIGIR, pages 585--594, 2011.
[34]
Sérgio Moro, Paulo Cortez, and Paulo Rita. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62:22 -- 31, 2014.
[35]
J. Nievergelt and E. M. Reingold. Binary search trees of bounded balance. SIAM Journal on Computing, 2(1):33--43, 1973.
[36]
Stephen M. Omohundro. Five balltree construction algorithms. Technical report, 1989.
[37]
Lu Qin, Jeffrey Xu Yu, and Lijun Chang. Diversifying top-k results. PVLDB, 5(11):1124--1135, 2012.
[38]
S. S. Ravi, D. J. Rosenkrantz, and G. K. Tayi. Heuristic and special case algorithms for dispersion problems. Operations Research, 42(2):299--310, 1994.
[39]
Rodrygo L. T. Santos, Craig Macdonald, and Iadh Ounis. Search result diversification. Found. Trends Inf. Retr., 9(1):1--90, March 2015.
[40]
Arie Tamir. Obnoxious facility location on graphs. SIAM J. Discret. Math., 4(4):550--567, September 1991.
[41]
Erik Vee, Utkarsh Srivastava, Jayavel Shanmugasundaram, Prashant Bhat, and Sihem Amer Yahia. Efficient computation of diverse query results. In ICDE, pages 228--236, 2008.
[42]
Marcos R. Vieira, Humberto L. Razente, Maria C. N. Barioni, Marios Hadjieleftheriou, Divesh Srivastava, Caetano Traina, and Vassilis J. Tsotras. On query result diversification. In ICDE, pages 1163--1174, 2011.
[43]
Yue Wang, Alexandra Meliou, and Gerome Miklau. Rc-index: Diversifying answers to range queries. Technical report, https://web.cs.umass.edu/publication/details.php?id=2445, 2017.
[44]
Roger Weber, Hans-Jörg Schek, and Stephen Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, pages 194--205, 1998.
[45]
Dan E. Willard. Predicate-Oriented Database Search Algorithms. Outstanding Dissertations in the Computer Sciences. Garland Publishing, New York, 1978.
[46]
Cong Yu, Laks Lakshmanan, and Sihem Amer-Yahia. It takes variety to make a world: Diversification in recommender systems. In EDBT, pages 368--378, 2009.
[47]
Kaiping Zheng, Hongzhi Wang, Zhixin Qi, Jianzhong Li, and Hong Gao. A survey of query result diversification. Knowl. Inf Syst., 51(1):1--36, 2017.

Cited By

View all
  • (2022)Chunk-oriented dimension ordering for efficient range query processing on sparse multidimensional dataWorld Wide Web10.1007/s11280-022-01098-z26:4(1395-1433)Online publication date: 9-Sep-2022
  • (2022)ORTree: Tuning Diversified Similarity Queries by Means of Data PartitioningAdvances in Databases and Information Systems10.1007/978-3-031-15740-0_13(165-178)Online publication date: 5-Sep-2022
  • (2020)Efficient Indexes for Diverse Top-k Range QueriesProceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3375395.3387667(213-227)Online publication date: 14-Jun-2020
  • Show More Cited By
  1. RC-index: diversifying answers to range queries

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 11, Issue 7
    March 2018
    107 pages
    ISSN:2150-8097
    Issue’s Table of Contents

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 March 2018
    Published in PVLDB Volume 11, Issue 7

    Qualifiers

    • Research-article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)16
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 16 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2022)Chunk-oriented dimension ordering for efficient range query processing on sparse multidimensional dataWorld Wide Web10.1007/s11280-022-01098-z26:4(1395-1433)Online publication date: 9-Sep-2022
    • (2022)ORTree: Tuning Diversified Similarity Queries by Means of Data PartitioningAdvances in Databases and Information Systems10.1007/978-3-031-15740-0_13(165-178)Online publication date: 5-Sep-2022
    • (2020)Efficient Indexes for Diverse Top-k Range QueriesProceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3375395.3387667(213-227)Online publication date: 14-Jun-2020
    • (2020)Personalized top-n influential community search over large social networksWorld Wide Web10.1007/s11280-020-00788-w23:3(2153-2184)Online publication date: 1-May-2020
    • (2019)ReducE-Comm: Effective Inventory Reduction System for E-CommerceProceedings of the 28th ACM International Conference on Information and Knowledge Management10.1145/3357384.3357861(2957-2960)Online publication date: 3-Nov-2019
    • (2018)A Framework for Processing Cumulative Frequency Queries over Medical Data StreamsWeb Information Systems Engineering – WISE 201810.1007/978-3-030-02925-8_9(121-131)Online publication date: 12-Nov-2018

    View Options

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media