Abstract
Within the field of exploratory data mining, subgroup discovery is concerned with finding regions in the data that stand out with respect to a particular target. An important question is how to validate the patterns found; how do we distinguish a true finding from a false discovery? A common solution is to apply a statistical significance test that states that a pattern is real iff it is different from a random subset.
In this paper we argue and empirically show that this assumption is often too weak, as almost any realistic pattern language specifies a set of subsets that strongly deviates from random subsets. In particular, our analysis shows that one should expect the unexpected in subgroup discovery: given a dataset and corresponding description language, it is very likely that high-quality subgroups can —and hence will— be found.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
Note that we here use a simpler yet more versatile approach for estimating the number of closed patterns compared to the one proposed in [10].
- 2.
Non-essential derivations can be found in Appendix at https://anon.to/NA9I9A.
- 3.
- 4.
- 5.
We note, however, that for the simpler problem of merely counting accessible subsets using the frequency-biased sampler may give more accurate results. These results are omitted due to length restrictions.
References
Dong, G., Zhang, X., Wong, L., Li, J.: CAEP: classification by aggregating emerging patterns. In: Japkowicz, N., Matwin, S. (eds.) DS 1999. LNCS (LNAI), vol. 1721, pp. 30–42. Springer, Heidelberg (1999). doi:10.1007/3-540-46846-3_4
Duivesteijn, W., Knobbe, A.: Exploiting false discoveries - statistical validation of patterns and quality measures in subgroup discovery. In: Proceedings of the ICDM 2011, pp. 151–160 (2011)
Gionis, A., Mannila, H., Mielikäinen, T., Tsaparas, P.: Assessing data mining results via swap randomization. ACM Trans. Knowl. Discov. Data 1(3), 14 (2007)
Good, P.I.: Permutation, Parametric and Bootstrap Tests of Hypotheses, 3rd edn. Springer, New York (2005)
Grosskreutz, H., Rüping, S.: On subgroup discovery in numerical domains. Data Min. Knowl. Disc. 19(2), 210–226 (2009)
Hämäläinen, W.: Kingfisher: an efficient algorithm for searching for both positive and negative dependency rules with statistical significance measures. Knowl. Inf. Syst. 32(2), 383–414 (2012)
Klösgen, W.: Explora: a multipattern and multistrategy discovery assistant. In: Advances in Knowledge Discovery and Data Mining, pp. 249–271 (1996)
Knuth, D.: Estimating the efficiency of backtrack programs. Math. Comput. 29(129), 122–136 (1975)
van Leeuwen, M., Knobbe, A.: Non-redundant subgroup discovery in large and complex data. In: Gunopulos, D., Hofmann, T., Malerba, D., Vazirgiannis, M. (eds.) ECML PKDD 2011. LNCS (LNAI), vol. 6913, pp. 459–474. Springer, Heidelberg (2011). doi:10.1007/978-3-642-23808-6_30
van Leeuwen, M., Ukkonen, A.: Fast estimation of the pattern frequency spectrum. In: Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds.) ECML PKDD 2014. LNCS (LNAI), vol. 8725, pp. 114–129. Springer, Heidelberg (2014). doi:10.1007/978-3-662-44851-9_8
Lemmerich, F., Puppe, F.: A critical view on automatic significance-filtering in pattern mining. In: Proceedings of ECMLPKDD 2014 Workshop on Statistically Sound Data Mining (2014)
Llinares-López, F., Sugiyama, M., Papaxanthos, L., Borgwardt, K.M.: Fast and memory-efficient significant pattern mining via permutation testing. Proc. KDD 2015, 725–734 (2015)
Minato, S., Uno, T., Tsuda, K., Terada, A., Sese, J.: A fast method of statistical assessment for combinatorial hypotheses based on frequent itemset enumeration. In: Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds.) ECML PKDD 2014. LNCS (LNAI), vol. 8725, pp. 422–436. Springer, Heidelberg (2014). doi:10.1007/978-3-662-44851-9_27
Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, New York (1995)
Ojala, M., Garriga, G.C.: Permutation tests for studying classifier performance. J. Mach. Learn. Res. 11, 1833–1863 (2010)
Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Discovering frequent closed itemsets for association rules. In: Beeri, C., Buneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 398–416. Springer, Heidelberg (1999). doi:10.1007/3-540-49257-7_25
Terada, A., Okada-Hatakeyama, M., Tsuda, K., Sese, J.: Statistical significance of combinatorial regulations. Proc. Natl. Acad. Sci. 110(32), 12996–13001 (2013)
Webb, G.I.: Discovering significant patterns. Mach. Learn. 68(1), 1–33 (2007)
Wrobel, S.: An algorithm for multi-relational discovery of subgroups. In: Komorowski, J., Zytkow, J. (eds.) PKDD 1997. LNCS, vol. 1263, pp. 78–87. Springer, Heidelberg (1997). doi:10.1007/3-540-63223-9_108
Author information
Authors and Affiliations
Corresponding authors
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
van Leeuwen, M., Ukkonen, A. (2016). Expect the Unexpected – On the Significance of Subgroups. In: Calders, T., Ceci, M., Malerba, D. (eds) Discovery Science. DS 2016. Lecture Notes in Computer Science(), vol 9956. Springer, Cham. https://doi.org/10.1007/978-3-319-46307-0_4
Download citation
DOI: https://doi.org/10.1007/978-3-319-46307-0_4
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46306-3
Online ISBN: 978-3-319-46307-0
eBook Packages: Computer ScienceComputer Science (R0)