research-article

Efficient and scalable monitoring and summarization of large probabilistic data

Author:

Mingwang TangAuthors Info & Claims

SIGMOD'13 PhD Symposium: Proceedings of the 2013 SIGMOD/PODS Ph.D. symposium

Pages 61 - 66

https://doi.org/10.1145/2483574.2483586

Published: 22 June 2013 Publication History

Abstract

In numerous real applications, uncertainty is inherently introduced when massive data are generated. Modern database management systems aim to incorporate and handle data with uncertainties as a first-class citizen, where uncertain data are represented as probabilistic relations. In my thesis, my work has focused on monitoring and summarization of large probabilistic data. Specifically, we extended the distributed threshold monitoring problem to distributed probabilistic data. Instead, we actually need to monitor the aggregated value (e.g. sum) of distributed probabilistic data against both the score threshold and the probability threshold, which make the techniques designed for deterministic data are not directly applicable. Our algorithms have significantly reduced both the communication and computation costs as shown by an extensive experimental evaluation on large real datasets. On the other hand, building histograms to summarize the distribution of certain feature in a large data set is a fundamental problem in data management. Recent work have extended this studies to probabilistic data, but their methods suffer from the limited scalability. We present novel methods to build scalable histograms over large probabilistic data using distributed and parallel algorithms. Extensive experiments on large real data sets have demonstrated the superb scalability and efficiency achieved by our implementations in MapReduce, when compared to the existing, state-of-the-art centralized methods.

References

[1]

P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. Nabar, T. Sugihara, and J. Widom. Trio: A system for data, uncertainty, and lineage. In VLDB, 2006.

Digital Library

[2]

M. Arlitt and T. Jin. A workload characterization study of the 1998 world cup web site. In Technical report, IEEE Network, 1999.

[3]

J. Boulos, N. Dalvi, B. Mandhani, S. Mathur, C. Re, and D. Suciu. Mystiq: A system for finding more answers by using probabilities. In SIGMOD, 2005.

Digital Library

[4]

R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In SIGMOD, 2003.

Digital Library

[5]

R. Cheng, Y. Xia, S. Prabhakar, R. Shah, and J. S. Vitter. Efficient indexing methods for probabilistic threshold queries over uncertain data. In VLDB, 2004.

Digital Library

[6]

G. Cormode and A. Deligiannakis. Probabilistic histograms for probabilistic data. In VLDB, 2009.

Digital Library

[7]

G. Cormode and M. Garofalakis. Histograms and wavelets on probabilistic data. In ICDE, 2009.

Digital Library

[8]

G. Cormode, S. Muthukrishnan, and K. Yi. Algorithms for distributed functional monitoring. In SODA, 2008.

Digital Library

[9]

N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004.

Digital Library

[10]

N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004.

Digital Library

[11]

A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, 2004.

Digital Library

[12]

X. Dong, A. Y. Halevy, and C. Yu. Data integration with uncertainty. In VLDB, 2007.

Digital Library

[13]

P. K. Felix Halim and R. H. C. Yap. Fast and effective histogram construction. In CIKM, 2009.

Digital Library

[14]

R. Fink, A. Hogue, D. Olteanu, and S. Rath. SPROUT2: a squared query engine for uncertain web data. In SIGMOD Conference, pages 1299--1302, 2011.

Digital Library

[15]

A. C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. Fast, small-space algorithms for approximate histogram maintenance. In STOC, 2002.

Digital Library

[16]

L. Gruenwald, H. Chok, and M. Aboukhamis. Using data mining to estimate missing sensor data. In ICDMW, 2007.

Digital Library

[17]

M. Hua and J. Pei. Continuously monitoring top-k uncertain data streams: a probabilistic threshold method. DPD, 26(1):29--65, 2009.

Digital Library

[18]

Y. E. Ioannidis and V. Poosala. Balancing histogram optimality and practicality for query result size estimation. In SIGMOD, 1995.

Digital Library

[19]

H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. Sevcik, and T. Suel. Optimal histograms with quality guarantees. In VLDB, 1998.

Digital Library

[20]

T. S. Jayram, S. Kale, and E. Vee. Efficient aggregation algorithms for probabilistic data. In SODA, 2007.

Digital Library

[21]

T. S. Jayram, A. McGregor, S. Muthukrishnan, and E. Vee. Estimating statistical aggregates on probabilistic data streams. In PODS, 2007.

Digital Library

[22]

K. Y. Jeffrey Jestes and F. Li. Building wavelet histograms on large data in mapreduce. In VLDB, 2011.

Digital Library

[23]

S. Jeyashanker, S. Kashyap, R. Rastogi, and P. Shukla. Efficient constraint monitoring using adaptive thresholds. In ICDE, 2008.

Digital Library

[24]

C. K. Lyublena Antova and D. Olteanu. 10106 worlds and beyond: Efficient representation and processing of incomplete information. In ICDE, 2007.

[25]

Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In SIGMOD Conference, pages 448--459, 1998.

Digital Library

[26]

J. M. P. Mingwang Tang, Feifei Li and J. Jestes. Efficient threshold monitoring for distributed probabilistic data. In ICDE, 2012.

Digital Library

[27]

L. Perez, S. Arumugam, and C. Jermaine. Evaluation of probabilistic threshold queries in MCDB. In SIGMOD, 2010.

Digital Library

[28]

SAMOS. Shipboard Automated Meteorological and Oceanographic System. http://samos.coaps.fsu.edu.

[29]

S. Singh, C. Mayfield, S. Mittal, S. Prabhakar, S. E. Hambrusch, and R. Shah. Orion 2.0: native support for uncertain data. In SIGMOD Conference, pages 1239--1242, 2008.

Digital Library

[30]

D. Suciu, D. Olteanu, C. Ré, and C. Koch. Probabilistic Databases. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2011.

Digital Library

[31]

N. K. Sudipto Guha and K. Shim. Approximation and streaming algorithms for histogram construction problems. TODS, 31(1):396--438, March 2006.

Digital Library

[32]

E. Terzi and P. Tsaparas. Efficient algorithms for sequence segmentation. In SIAM SDM, 2006.

[33]

H. Woo and A. K. Mok. Real-time monitoring of uncertain data streams using probabilistic similarity. In RTSS, 2007.

Digital Library

[34]

K. Y. Zengfeng Huang, Lu Wang and Y. Liu. Sampling based algorithms for quantile computation in sensor networks. In SIGMOD, 2011.

Digital Library

[35]

Y. L. Zengfeng Huang, Ke Yi and G. Chen. Optimal sampling algorithms for frequency estimation in distributed data. In IEEE INFOCOM, 2011.

Cited By

Wang SZhang LZhang Y(2019)Research on the Optimization of Spark Big Table Equal JoinArtificial Intelligence and Security10.1007/978-3-030-24265-7_37(430-441)Online publication date: 11-Jul-2019
https://doi.org/10.1007/978-3-030-24265-7_37
Zhou WZhong YWang Y(2018)Data Balance Algorithm Based on Histogram in MapReduceXibei Gongye Daxue Xuebao/Journal of Northwestern Polytechnical University10.1051/jnwpu/2018363048036:3(480-486)Online publication date: 8-Oct-2018
https://doi.org/10.1051/jnwpu/20183630480
Wang YZhong YMa QYang G(2017)Distributed and parallel construction method for equi-width histogram in cloud databaseMultiagent and Grid Systems10.3233/MGS-17027313:3(311-329)Online publication date: 28-Sep-2017
https://doi.org/10.3233/MGS-170273

Index Terms

Efficient and scalable monitoring and summarization of large probabilistic data
1. Information systems
  1. Data management systems
    1. Database management system engines

Recommendations

Query answering techniques on uncertain and probabilistic data: tutorial summary
SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data

Uncertain data are inherent in some important applications, such as environmental surveillance, market analysis, and quantitative economics research. Due to the importance of those applications and the rapidly increasing amount of uncertain data ...
Finding frequent items in probabilistic data
SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data

Computing statistical information on probabilistic data has attracted a lot of attention recently, as the data generated from a wide range of data sources are inherently fuzzy or uncertain. In this paper, we study an important statistical query on ...
Scalable histograms on large probabilistic data
KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

Histogram construction is a fundamental problem in data management, and a good histogram supports numerous mining operations. Recent work has extended histograms to probabilistic data. However, constructing histograms for probabilistic data can be ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGMOD'13 PhD Symposium: Proceedings of the 2013 SIGMOD/PODS Ph.D. symposium

June 2013

78 pages

ISBN:9781450321556

DOI:10.1145/2483574

Program Chairs:
Lei Chen
Hong Kong University of Science and Technology, China
,
Xin Luna Dong
Google Inc., USA

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 June 2013

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

SIGMOD/PODS'13

Sponsor:

SIGMOD

SIGMOD/PODS'13: International Conference on Management of Data

June 23, 2013

New York, New York, USA

Acceptance Rates

SIGMOD'13 PhD Symposium Paper Acceptance Rate 12 of 26 submissions, 46%;

Overall Acceptance Rate 40 of 60 submissions, 67%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

3
Total Citations
View Citations
125
Total Downloads

Downloads (Last 12 months)1
Downloads (Last 6 weeks)0

Reflects downloads up to 20 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Wang SZhang LZhang Y(2019)Research on the Optimization of Spark Big Table Equal JoinArtificial Intelligence and Security10.1007/978-3-030-24265-7_37(430-441)Online publication date: 11-Jul-2019
https://doi.org/10.1007/978-3-030-24265-7_37
Zhou WZhong YWang Y(2018)Data Balance Algorithm Based on Histogram in MapReduceXibei Gongye Daxue Xuebao/Journal of Northwestern Polytechnical University10.1051/jnwpu/2018363048036:3(480-486)Online publication date: 8-Oct-2018
https://doi.org/10.1051/jnwpu/20183630480
Wang YZhong YMa QYang G(2017)Distributed and parallel construction method for equi-width histogram in cloud databaseMultiagent and Grid Systems10.3233/MGS-17027313:3(311-329)Online publication date: 28-Sep-2017
https://doi.org/10.3233/MGS-170273

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents