Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Mining frequent itemsets over uncertain databases

Published: 01 July 2012 Publication History

Abstract

In recent years, due to the wide applications of uncertain data, mining frequent itemsets over uncertain databases has attracted much attention. In uncertain databases, the support of an itemset is a random variable instead of a fixed occurrence counting of this itemset. Thus, unlike the corresponding problem in deterministic databases where the frequent itemset has a unique definition, the frequent itemset under uncertain environments has two different definitions so far. The first definition, referred as the expected support-based frequent itemset, employs the expectation of the support of an itemset to measure whether this itemset is frequent. The second definition, referred as the probabilistic frequent itemset, uses the probability of the support of an itemset to measure its frequency. Thus, existing work on mining frequent itemsets over uncertain databases is divided into two different groups and no study is conducted to comprehensively compare the two different definitions. In addition, since no uniform experimental platform exists, current solutions for the same definition even generate inconsistent results. In this paper, we firstly aim to clarify the relationship between the two different definitions. Through extensive experiments, we verify that the two definitions have a tight connection and can be unified together when the size of data is large enough. Secondly, we provide baseline implementations of eight existing representative algorithms and test their performances with uniform measures fairly. Finally, according to the fair tests over many different benchmark data sets, we clarify several existing inconsistent conclusions and discuss some new findings.

References

[1]
Frequent itemset mining implementations repository. http://fimi.us.ac.be.
[2]
Wikipedia of poisson binomial distribution.
[3]
C. C. Aggarwal. Managing and Mining Uncertain Data. Kluwer Press, 2009.
[4]
C. C. Aggarwal, Y. Li, J. Wang, and J. Wang. Frequent pattern mining with uncertain data. In KDD, pages 29--38, 2009.
[5]
C. C. Aggarwal and P. S. Yu. Outlier detection with uncertain data. In SDM, pages 483--493, 2008.
[6]
C. C. Aggarwal and P. S. Yu. A survey of uncertain data algorithms and applications. IEEE Trans. Knowl. Data Eng., 21(5):609--623, 2009.
[7]
R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In SIGMOD, pages 207--216, 1993.
[8]
R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487--499, 1994.
[9]
T. Bernecker, H.-P. Kriegel, M. Renz, F. Verhein, and A. Züfle. Probabilistic frequent itemset mining in uncertain databases. In KDD, pages 119--128, 2009.
[10]
T. Calders, C. Garboni, and B. Goethals. Approximation of frequentness probability of itemsets in uncertain data. In ICDM, pages 749--754, 2010.
[11]
T. Calders, C. Garboni, and B. Goethals. Efficient pattern mining of uncertain data with sampling. In PAKDD, pages 480--487, 2010.
[12]
L. L. Cam. An approximation theorem for the poisson binomial distribution. Pacific Journal of Mathematics, 10(4):1181--1197, 1960.
[13]
L. Chen and R. T. Ng. On the marriage of lp-norms and edit distance. In VLDB, pages 792--803, 2004.
[14]
L. Chen, M. T. Özsu, and V. Oria. Robust and fast similarity search for moving object trajectories. In SIGMOD, pages 491--502, 2005.
[15]
R. Cheng, D. V. Kalashnikov, and S. Prabhakar. Querying imprecise data in moving object environments. IEEE Trans. Knowl. Data Eng., 16(9):1112--1127, 2004.
[16]
H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Statist., 23(4):493--507, 1952.
[17]
C. K. Chui and B. Kao. A decremental approach for mining frequent itemsets from uncertain data. In PAKDD, pages 64--75, 2008.
[18]
C. K. Chui, B. Kao, and E. Hung. Mining frequent itemsets from uncertain data. In PAKDD, pages 47--58, 2007.
[19]
J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD, pages 1--12, 2000.
[20]
B. Jiang and J. Pei. Outlier detection on uncertain data: Objects, instances, and inferences. In ICDE, pages 422--433, 2011.
[21]
B. Kao, S. D. Lee, F. K. F. Lee, D. W.-L. Cheung, and W.-S. Ho. Clustering uncertain data using voronoi diagrams and r-tree index. IEEE Trans. Knowl. Data Eng., 22(9), 2010.
[22]
C. K.-S. Leung, M. A. F. Mateo, and D. A. Brajczuk. A tree-based approach for frequent pattern mining from uncertain data. In PAKDD, pages 653--661, 2008.
[23]
M. Li and Y. Liu. Underground coal mine monitoring with wireless sensor networks. TOSN, 5(2):10, 2009.
[24]
Y. Liu, K. Liu, and M. Li. Passive diagnosis for wireless sensor networks. IEEE/ACM Trans. Netw., 18(4):1132--1144, 2010.
[25]
M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized algorithm and probabilistic analysis. Cambridge University Press, 2005.
[26]
L. Mo, Y. He, Y. Liu, J. Zhao, S. Tang, X.-Y. Li, and G. Dai. Canopy closure estimates with greenorbs: sustainable sensing in the forest. In SenSys, pages 99--112, 2009.
[27]
J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-mine: Hyper-structure mining of frequent patterns in large databases. In ICDM, pages 441--448, 2001.
[28]
L. Sun, R. Cheng, D. W. Cheung, and J. Cheng. Mining uncertain data with probabilistic guarantees. In KDD, pages 273--282, 2010.
[29]
S. Suthram, T. Shlomi, E. Ruppin, R. Sharan, and T. Ideker. A direct comparison of protein interaction confidence assignment schemes. BMC Bioinformatics, 7:360, 2006.
[30]
Y. Tong, L. Chen, and B. Ding. Discovering threshold-based frequent closed itemsets over probabilistic data. In ICDE, pages 270--281, 2012.
[31]
L. Wang, R. Cheng, S. D. Lee, and D. W.-L. Cheung. Accelerating probabilistic frequent itemset mining: a model-based approach. In CIKM, pages 429--438, 2010.
[32]
M. J. Zaki. Scalable algorithms for association mining. IEEE Trans. Knowl. Data Eng., 12(3):372--390, 2000.
[33]
Q. Zhang, F. Li, and K. Yi. Finding frequent items in probabilistic data. In SIGMOD, pages 819--832, 2008.

Cited By

View all
  • (2024)Efficient Maximal Motif-Clique Enumeration over Large Heterogeneous Information NetworksProceedings of the VLDB Endowment10.14778/3681954.368197517:11(2946-2959)Online publication date: 30-Aug-2024
  • (2022)Mining Order-preserving Submatrices under Data Uncertainty: A Possible-world Approach and Efficient Approximation MethodsACM Transactions on Database Systems10.1145/352491547:2(1-57)Online publication date: 23-May-2022
  • (2020)Top-k Frequent Itemsets Publication of Uncertain Data Based on Differential PrivacyWeb Information Systems and Applications10.1007/978-3-030-60029-7_49(547-558)Online publication date: 23-Sep-2020
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 5, Issue 11
July 2012
608 pages

Publisher

VLDB Endowment

Publication History

Published: 01 July 2012
Published in PVLDB Volume 5, Issue 11

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)1
  • Downloads (Last 6 weeks)0
Reflects downloads up to 14 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Efficient Maximal Motif-Clique Enumeration over Large Heterogeneous Information NetworksProceedings of the VLDB Endowment10.14778/3681954.368197517:11(2946-2959)Online publication date: 30-Aug-2024
  • (2022)Mining Order-preserving Submatrices under Data Uncertainty: A Possible-world Approach and Efficient Approximation MethodsACM Transactions on Database Systems10.1145/352491547:2(1-57)Online publication date: 23-May-2022
  • (2020)Top-k Frequent Itemsets Publication of Uncertain Data Based on Differential PrivacyWeb Information Systems and Applications10.1007/978-3-030-60029-7_49(547-558)Online publication date: 23-Sep-2020
  • (2019)A novel single scan distributed pattern mining algorithm for frequent pattern identificationInternational Journal of Data Analysis Techniques and Strategies10.5555/3302686.330269011:1(81-100)Online publication date: 1-Jan-2019
  • (2019)Cleaning uncertain graphs via noisy crowdsourcingWorld Wide Web10.1007/s11280-018-0624-822:4(1523-1553)Online publication date: 1-Jul-2019
  • (2019)Fine-grained probability counting for cardinality estimation of data streamsWorld Wide Web10.1007/s11280-018-0583-022:5(2065-2081)Online publication date: 1-Sep-2019
  • (2018)Mining frequent itemsets over uncertain data streamsInternational Journal of High Performance Computing and Networking10.1504/IJHPCN.2018.09323411:4(312-321)Online publication date: 1-Jan-2018
  • (2017)Frequent Symptom Sets Identification from Uncertain Medical Data in Differentially Private WayScientific Programming10.1155/2017/75453472017Online publication date: 1-Jan-2017
  • (2017)JOTAComputer Communications10.1016/j.comcom.2017.02.009102:C(17-27)Online publication date: 1-Apr-2017
  • (2017)Finding efficiencies in frequent pattern mining from big uncertain dataWorld Wide Web10.1007/s11280-016-0411-320:3(571-594)Online publication date: 1-May-2017
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media