Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2483574.2483586acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Efficient and scalable monitoring and summarization of large probabilistic data

Published: 22 June 2013 Publication History

Abstract

In numerous real applications, uncertainty is inherently introduced when massive data are generated. Modern database management systems aim to incorporate and handle data with uncertainties as a first-class citizen, where uncertain data are represented as probabilistic relations. In my thesis, my work has focused on monitoring and summarization of large probabilistic data. Specifically, we extended the distributed threshold monitoring problem to distributed probabilistic data. Instead, we actually need to monitor the aggregated value (e.g. sum) of distributed probabilistic data against both the score threshold and the probability threshold, which make the techniques designed for deterministic data are not directly applicable. Our algorithms have significantly reduced both the communication and computation costs as shown by an extensive experimental evaluation on large real datasets. On the other hand, building histograms to summarize the distribution of certain feature in a large data set is a fundamental problem in data management. Recent work have extended this studies to probabilistic data, but their methods suffer from the limited scalability. We present novel methods to build scalable histograms over large probabilistic data using distributed and parallel algorithms. Extensive experiments on large real data sets have demonstrated the superb scalability and efficiency achieved by our implementations in MapReduce, when compared to the existing, state-of-the-art centralized methods.

References

[1]
P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. Nabar, T. Sugihara, and J. Widom. Trio: A system for data, uncertainty, and lineage. In VLDB, 2006.
[2]
M. Arlitt and T. Jin. A workload characterization study of the 1998 world cup web site. In Technical report, IEEE Network, 1999.
[3]
J. Boulos, N. Dalvi, B. Mandhani, S. Mathur, C. Re, and D. Suciu. Mystiq: A system for finding more answers by using probabilities. In SIGMOD, 2005.
[4]
R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In SIGMOD, 2003.
[5]
R. Cheng, Y. Xia, S. Prabhakar, R. Shah, and J. S. Vitter. Efficient indexing methods for probabilistic threshold queries over uncertain data. In VLDB, 2004.
[6]
G. Cormode and A. Deligiannakis. Probabilistic histograms for probabilistic data. In VLDB, 2009.
[7]
G. Cormode and M. Garofalakis. Histograms and wavelets on probabilistic data. In ICDE, 2009.
[8]
G. Cormode, S. Muthukrishnan, and K. Yi. Algorithms for distributed functional monitoring. In SODA, 2008.
[9]
N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004.
[10]
N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004.
[11]
A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, 2004.
[12]
X. Dong, A. Y. Halevy, and C. Yu. Data integration with uncertainty. In VLDB, 2007.
[13]
P. K. Felix Halim and R. H. C. Yap. Fast and effective histogram construction. In CIKM, 2009.
[14]
R. Fink, A. Hogue, D. Olteanu, and S. Rath. SPROUT2: a squared query engine for uncertain web data. In SIGMOD Conference, pages 1299--1302, 2011.
[15]
A. C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. Fast, small-space algorithms for approximate histogram maintenance. In STOC, 2002.
[16]
L. Gruenwald, H. Chok, and M. Aboukhamis. Using data mining to estimate missing sensor data. In ICDMW, 2007.
[17]
M. Hua and J. Pei. Continuously monitoring top-k uncertain data streams: a probabilistic threshold method. DPD, 26(1):29--65, 2009.
[18]
Y. E. Ioannidis and V. Poosala. Balancing histogram optimality and practicality for query result size estimation. In SIGMOD, 1995.
[19]
H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. Sevcik, and T. Suel. Optimal histograms with quality guarantees. In VLDB, 1998.
[20]
T. S. Jayram, S. Kale, and E. Vee. Efficient aggregation algorithms for probabilistic data. In SODA, 2007.
[21]
T. S. Jayram, A. McGregor, S. Muthukrishnan, and E. Vee. Estimating statistical aggregates on probabilistic data streams. In PODS, 2007.
[22]
K. Y. Jeffrey Jestes and F. Li. Building wavelet histograms on large data in mapreduce. In VLDB, 2011.
[23]
S. Jeyashanker, S. Kashyap, R. Rastogi, and P. Shukla. Efficient constraint monitoring using adaptive thresholds. In ICDE, 2008.
[24]
C. K. Lyublena Antova and D. Olteanu. 10106 worlds and beyond: Efficient representation and processing of incomplete information. In ICDE, 2007.
[25]
Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In SIGMOD Conference, pages 448--459, 1998.
[26]
J. M. P. Mingwang Tang, Feifei Li and J. Jestes. Efficient threshold monitoring for distributed probabilistic data. In ICDE, 2012.
[27]
L. Perez, S. Arumugam, and C. Jermaine. Evaluation of probabilistic threshold queries in MCDB. In SIGMOD, 2010.
[28]
SAMOS. Shipboard Automated Meteorological and Oceanographic System. http://samos.coaps.fsu.edu.
[29]
S. Singh, C. Mayfield, S. Mittal, S. Prabhakar, S. E. Hambrusch, and R. Shah. Orion 2.0: native support for uncertain data. In SIGMOD Conference, pages 1239--1242, 2008.
[30]
D. Suciu, D. Olteanu, C. Ré, and C. Koch. Probabilistic Databases. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2011.
[31]
N. K. Sudipto Guha and K. Shim. Approximation and streaming algorithms for histogram construction problems. TODS, 31(1):396--438, March 2006.
[32]
E. Terzi and P. Tsaparas. Efficient algorithms for sequence segmentation. In SIAM SDM, 2006.
[33]
H. Woo and A. K. Mok. Real-time monitoring of uncertain data streams using probabilistic similarity. In RTSS, 2007.
[34]
K. Y. Zengfeng Huang, Lu Wang and Y. Liu. Sampling based algorithms for quantile computation in sensor networks. In SIGMOD, 2011.
[35]
Y. L. Zengfeng Huang, Ke Yi and G. Chen. Optimal sampling algorithms for frequency estimation in distributed data. In IEEE INFOCOM, 2011.

Cited By

View all
  • (2019)Research on the Optimization of Spark Big Table Equal JoinArtificial Intelligence and Security10.1007/978-3-030-24265-7_37(430-441)Online publication date: 11-Jul-2019
  • (2018)Data Balance Algorithm Based on Histogram in MapReduceXibei Gongye Daxue Xuebao/Journal of Northwestern Polytechnical University10.1051/jnwpu/2018363048036:3(480-486)Online publication date: 8-Oct-2018
  • (2017)Distributed and parallel construction method for equi-width histogram in cloud databaseMultiagent and Grid Systems10.3233/MGS-17027313:3(311-329)Online publication date: 28-Sep-2017

Index Terms

  1. Efficient and scalable monitoring and summarization of large probabilistic data

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SIGMOD'13 PhD Symposium: Proceedings of the 2013 SIGMOD/PODS Ph.D. symposium
    June 2013
    78 pages
    ISBN:9781450321556
    DOI:10.1145/2483574
    • Program Chairs:
    • Lei Chen,
    • Xin Luna Dong
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 June 2013

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. monitor
    2. probabilistic data
    3. summarize
    4. uncertainty

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'13
    Sponsor:

    Acceptance Rates

    SIGMOD'13 PhD Symposium Paper Acceptance Rate 12 of 26 submissions, 46%;
    Overall Acceptance Rate 40 of 60 submissions, 67%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)1
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 20 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2019)Research on the Optimization of Spark Big Table Equal JoinArtificial Intelligence and Security10.1007/978-3-030-24265-7_37(430-441)Online publication date: 11-Jul-2019
    • (2018)Data Balance Algorithm Based on Histogram in MapReduceXibei Gongye Daxue Xuebao/Journal of Northwestern Polytechnical University10.1051/jnwpu/2018363048036:3(480-486)Online publication date: 8-Oct-2018
    • (2017)Distributed and parallel construction method for equi-width histogram in cloud databaseMultiagent and Grid Systems10.3233/MGS-17027313:3(311-329)Online publication date: 28-Sep-2017

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media