research-article

Mining frequent itemsets over uncertain databases

Authors:

Philip S. YuAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 5, Issue 11

Pages 1650 - 1661

https://doi.org/10.14778/2350229.2350277

Published: 01 July 2012 Publication History

Abstract

In recent years, due to the wide applications of uncertain data, mining frequent itemsets over uncertain databases has attracted much attention. In uncertain databases, the support of an itemset is a random variable instead of a fixed occurrence counting of this itemset. Thus, unlike the corresponding problem in deterministic databases where the frequent itemset has a unique definition, the frequent itemset under uncertain environments has two different definitions so far. The first definition, referred as the expected support-based frequent itemset, employs the expectation of the support of an itemset to measure whether this itemset is frequent. The second definition, referred as the probabilistic frequent itemset, uses the probability of the support of an itemset to measure its frequency. Thus, existing work on mining frequent itemsets over uncertain databases is divided into two different groups and no study is conducted to comprehensively compare the two different definitions. In addition, since no uniform experimental platform exists, current solutions for the same definition even generate inconsistent results. In this paper, we firstly aim to clarify the relationship between the two different definitions. Through extensive experiments, we verify that the two definitions have a tight connection and can be unified together when the size of data is large enough. Secondly, we provide baseline implementations of eight existing representative algorithms and test their performances with uniform measures fairly. Finally, according to the fair tests over many different benchmark data sets, we clarify several existing inconsistent conclusions and discuss some new findings.

References

[1]

Frequent itemset mining implementations repository. http://fimi.us.ac.be.

[2]

Wikipedia of poisson binomial distribution.

[3]

C. C. Aggarwal. Managing and Mining Uncertain Data. Kluwer Press, 2009.

Digital Library

[4]

C. C. Aggarwal, Y. Li, J. Wang, and J. Wang. Frequent pattern mining with uncertain data. In KDD, pages 29--38, 2009.

Digital Library

[5]

C. C. Aggarwal and P. S. Yu. Outlier detection with uncertain data. In SDM, pages 483--493, 2008.

[6]

C. C. Aggarwal and P. S. Yu. A survey of uncertain data algorithms and applications. IEEE Trans. Knowl. Data Eng., 21(5):609--623, 2009.

Digital Library

[7]

R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In SIGMOD, pages 207--216, 1993.

Digital Library

[8]

R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487--499, 1994.

Digital Library

[9]

T. Bernecker, H.-P. Kriegel, M. Renz, F. Verhein, and A. Züfle. Probabilistic frequent itemset mining in uncertain databases. In KDD, pages 119--128, 2009.

Digital Library

[10]

T. Calders, C. Garboni, and B. Goethals. Approximation of frequentness probability of itemsets in uncertain data. In ICDM, pages 749--754, 2010.

Digital Library

[11]

T. Calders, C. Garboni, and B. Goethals. Efficient pattern mining of uncertain data with sampling. In PAKDD, pages 480--487, 2010.

Digital Library

[12]

L. L. Cam. An approximation theorem for the poisson binomial distribution. Pacific Journal of Mathematics, 10(4):1181--1197, 1960.

[13]

L. Chen and R. T. Ng. On the marriage of lp-norms and edit distance. In VLDB, pages 792--803, 2004.

Digital Library

[14]

L. Chen, M. T. Özsu, and V. Oria. Robust and fast similarity search for moving object trajectories. In SIGMOD, pages 491--502, 2005.

Digital Library

[15]

R. Cheng, D. V. Kalashnikov, and S. Prabhakar. Querying imprecise data in moving object environments. IEEE Trans. Knowl. Data Eng., 16(9):1112--1127, 2004.

Digital Library

[16]

H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Statist., 23(4):493--507, 1952.

[17]

C. K. Chui and B. Kao. A decremental approach for mining frequent itemsets from uncertain data. In PAKDD, pages 64--75, 2008.

Digital Library

[18]

C. K. Chui, B. Kao, and E. Hung. Mining frequent itemsets from uncertain data. In PAKDD, pages 47--58, 2007.

Digital Library

[19]

J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD, pages 1--12, 2000.

Digital Library

[20]

B. Jiang and J. Pei. Outlier detection on uncertain data: Objects, instances, and inferences. In ICDE, pages 422--433, 2011.

Digital Library

[21]

B. Kao, S. D. Lee, F. K. F. Lee, D. W.-L. Cheung, and W.-S. Ho. Clustering uncertain data using voronoi diagrams and r-tree index. IEEE Trans. Knowl. Data Eng., 22(9), 2010.

Digital Library

[22]

C. K.-S. Leung, M. A. F. Mateo, and D. A. Brajczuk. A tree-based approach for frequent pattern mining from uncertain data. In PAKDD, pages 653--661, 2008.

Digital Library

[23]

M. Li and Y. Liu. Underground coal mine monitoring with wireless sensor networks. TOSN, 5(2):10, 2009.

Digital Library

[24]

Y. Liu, K. Liu, and M. Li. Passive diagnosis for wireless sensor networks. IEEE/ACM Trans. Netw., 18(4):1132--1144, 2010.

Digital Library

[25]

M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized algorithm and probabilistic analysis. Cambridge University Press, 2005.

Digital Library

[26]

L. Mo, Y. He, Y. Liu, J. Zhao, S. Tang, X.-Y. Li, and G. Dai. Canopy closure estimates with greenorbs: sustainable sensing in the forest. In SenSys, pages 99--112, 2009.

Digital Library

[27]

J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-mine: Hyper-structure mining of frequent patterns in large databases. In ICDM, pages 441--448, 2001.

Digital Library

[28]

L. Sun, R. Cheng, D. W. Cheung, and J. Cheng. Mining uncertain data with probabilistic guarantees. In KDD, pages 273--282, 2010.

Digital Library

[29]

S. Suthram, T. Shlomi, E. Ruppin, R. Sharan, and T. Ideker. A direct comparison of protein interaction confidence assignment schemes. BMC Bioinformatics, 7:360, 2006.

[30]

Y. Tong, L. Chen, and B. Ding. Discovering threshold-based frequent closed itemsets over probabilistic data. In ICDE, pages 270--281, 2012.

Digital Library

[31]

L. Wang, R. Cheng, S. D. Lee, and D. W.-L. Cheung. Accelerating probabilistic frequent itemset mining: a model-based approach. In CIKM, pages 429--438, 2010.

Digital Library

[32]

M. J. Zaki. Scalable algorithms for association mining. IEEE Trans. Knowl. Data Eng., 12(3):372--390, 2000.

Digital Library

[33]

Q. Zhang, F. Li, and K. Yi. Finding frequent items in probabilistic data. In SIGMOD, pages 819--832, 2008.

Digital Library

Cited By

Zhou YFang YMa CHou THuang X(2024)Efficient Maximal Motif-Clique Enumeration over Large Heterogeneous Information NetworksProceedings of the VLDB Endowment10.14778/3681954.368197517:11(2946-2959)Online publication date: 30-Aug-2024
https://dl.acm.org/doi/10.14778/3681954.3681975
Fan ZWang RCai YZhang RYang TWu YCui BUhlig S(2023)OneSketch: A Generic and Accurate Sketch for Data StreamsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.327802835:12(12887-12901)Online publication date: 19-May-2023
https://dl.acm.org/doi/10.1109/TKDE.2023.3278028
Tian WLi FLiu YWang ZZhang T(2023)Depth-First Uncertain Frequent Itemsets Mining based on Ensembled Conditional Item-Wise Supports2023 International Conference on Intelligent Supercomputing and BioPharma (ISBP)10.1109/ISBP57705.2023.10061307(121-128)Online publication date: 6-Jan-2023
https://doi.org/10.1109/ISBP57705.2023.10061307
Show More Cited By

Recommendations

Probabilistic frequent itemset mining in uncertain databases
KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining

Probabilistic frequent itemset mining in uncertain transaction databases semantically and computationally differs from traditional techniques applied to standard "certain" transaction databases. The consideration of existential uncertainty of item(sets),...
Efficient algorithms for mining high-utility itemsets in uncertain databases

High-utility itemset mining (HUIM) is a useful set of techniques for discovering patterns in transaction databases, which considers both quantity and profit of items. However, most algorithms for mining high-utility itemsets (HUIs) assume that the ...
Mining probabilistic frequent closed itemsets in uncertain databases
ACMSE '11: Proceedings of the 49th annual ACM Southeast Conference

This paper defines probabilistic support and probabilistic frequent closed itemsets in uncertain databases for the first time. It also proposes a probabilistic frequent closed itemset mining (PFCIM) algorithm to mine probabilistic frequent closed ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 5, Issue 11

July 2012

608 pages

ISSN:2150-8097

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 July 2012

Published in PVLDB Volume 5, Issue 11

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

120
Total Citations
View Citations
317
Total Downloads

Downloads (Last 12 months)1
Downloads (Last 6 weeks)0

Reflects downloads up to 05 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Zhou YFang YMa CHou THuang X(2024)Efficient Maximal Motif-Clique Enumeration over Large Heterogeneous Information NetworksProceedings of the VLDB Endowment10.14778/3681954.368197517:11(2946-2959)Online publication date: 30-Aug-2024
https://dl.acm.org/doi/10.14778/3681954.3681975
Fan ZWang RCai YZhang RYang TWu YCui BUhlig S(2023)OneSketch: A Generic and Accurate Sketch for Data StreamsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.327802835:12(12887-12901)Online publication date: 19-May-2023
https://dl.acm.org/doi/10.1109/TKDE.2023.3278028
Tian WLi FLiu YWang ZZhang T(2023)Depth-First Uncertain Frequent Itemsets Mining based on Ensembled Conditional Item-Wise Supports2023 International Conference on Intelligent Supercomputing and BioPharma (ISBP)10.1109/ISBP57705.2023.10061307(121-128)Online publication date: 6-Jan-2023
https://doi.org/10.1109/ISBP57705.2023.10061307
Fan ZGuo JLi XYang TZhao YWu YCui BXu YUhlig SZhang G(2023)Finding Simplex Items in Data Streams2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00152(1953-1966)Online publication date: Apr-2023
https://doi.org/10.1109/ICDE55515.2023.00152
Malipatil SHanumantha Reddy T(2023)Discovery of interesting frequent item sets in an uncertain database using ant colony optimizationInternational Journal of Computers and Applications10.1080/1206212X.2023.226368945:11(673-679)Online publication date: 9-Oct-2023
https://doi.org/10.1080/1206212X.2023.2263689
Cheng JYan DQu WHao XLong CNg WWang X(2022)Mining Order-preserving Submatrices under Data Uncertainty: A Possible-world Approach and Efficient Approximation MethodsACM Transactions on Database Systems10.1145/352491547:2(1-57)Online publication date: 23-May-2022
https://dl.acm.org/doi/10.1145/3524915
Miao RZhong ZGuo JLi ZYang TCui B(2022)BurstSketch: Finding Bursts in Data StreamsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.322368635:11(11126-11140)Online publication date: 21-Nov-2022
https://dl.acm.org/doi/10.1109/TKDE.2022.3223686
Zhang CZhang HXie WLiu NLi QJiang DLin PWu KChen L(2022)Cleaning Uncertain Data With Crowdsourcing - A General Model With Diverse Accuracy RatesIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2020.302754534:8(3629-3642)Online publication date: 1-Aug-2022
https://doi.org/10.1109/TKDE.2020.3027545
Wu JLiu SLin J(2022)Mining of High-Utility Sequence Patterns in Large-Scale Uncertain Databases2022 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech)10.1109/DASC/PiCom/CBDCom/Cy55231.2022.9927807(1-7)Online publication date: 12-Sep-2022
https://doi.org/10.1109/DASC/PiCom/CBDCom/Cy55231.2022.9927807
Kumar SMohbey K(2022)A review on big data based parallel and distributed approaches of pattern miningJournal of King Saud University - Computer and Information Sciences10.1016/j.jksuci.2019.09.00634:5(1639-1662)Online publication date: 1-May-2022
https://dl.acm.org/doi/10.1016/j.jksuci.2019.09.006
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Issue’s Table of Contents