Nothing Special   »   [go: up one dir, main page]

Skip to main content

Expect the Unexpected – On the Significance of Subgroups

  • Conference paper
  • First Online:
Discovery Science (DS 2016)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 9956))

Included in the following conference series:

Abstract

Within the field of exploratory data mining, subgroup discovery is concerned with finding regions in the data that stand out with respect to a particular target. An important question is how to validate the patterns found; how do we distinguish a true finding from a false discovery? A common solution is to apply a statistical significance test that states that a pattern is real iff it is different from a random subset.

In this paper we argue and empirically show that this assumption is often too weak, as almost any realistic pattern language specifies a set of subsets that strongly deviates from random subsets. In particular, our analysis shows that one should expect the unexpected in subgroup discovery: given a dataset and corresponding description language, it is very likely that high-quality subgroups can —and hence will— be found.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

Notes

  1. 1.

    Note that we here use a simpler yet more versatile approach for estimating the number of closed patterns compared to the one proposed in [10].

  2. 2.

    Non-essential derivations can be found in Appendix at https://anon.to/NA9I9A.

  3. 3.

    http://archive.ics.uci.edu/ml/.

  4. 4.

    https://www.R-project.org.

  5. 5.

    We note, however, that for the simpler problem of merely counting accessible subsets using the frequency-biased sampler may give more accurate results. These results are omitted due to length restrictions.

References

  1. Dong, G., Zhang, X., Wong, L., Li, J.: CAEP: classification by aggregating emerging patterns. In: Japkowicz, N., Matwin, S. (eds.) DS 1999. LNCS (LNAI), vol. 1721, pp. 30–42. Springer, Heidelberg (1999). doi:10.1007/3-540-46846-3_4

    Chapter  Google Scholar 

  2. Duivesteijn, W., Knobbe, A.: Exploiting false discoveries - statistical validation of patterns and quality measures in subgroup discovery. In: Proceedings of the ICDM 2011, pp. 151–160 (2011)

    Google Scholar 

  3. Gionis, A., Mannila, H., Mielikäinen, T., Tsaparas, P.: Assessing data mining results via swap randomization. ACM Trans. Knowl. Discov. Data 1(3), 14 (2007)

    Article  Google Scholar 

  4. Good, P.I.: Permutation, Parametric and Bootstrap Tests of Hypotheses, 3rd edn. Springer, New York (2005)

    MATH  Google Scholar 

  5. Grosskreutz, H., Rüping, S.: On subgroup discovery in numerical domains. Data Min. Knowl. Disc. 19(2), 210–226 (2009)

    Article  MathSciNet  Google Scholar 

  6. Hämäläinen, W.: Kingfisher: an efficient algorithm for searching for both positive and negative dependency rules with statistical significance measures. Knowl. Inf. Syst. 32(2), 383–414 (2012)

    Article  Google Scholar 

  7. Klösgen, W.: Explora: a multipattern and multistrategy discovery assistant. In: Advances in Knowledge Discovery and Data Mining, pp. 249–271 (1996)

    Google Scholar 

  8. Knuth, D.: Estimating the efficiency of backtrack programs. Math. Comput. 29(129), 122–136 (1975)

    Article  MathSciNet  MATH  Google Scholar 

  9. van Leeuwen, M., Knobbe, A.: Non-redundant subgroup discovery in large and complex data. In: Gunopulos, D., Hofmann, T., Malerba, D., Vazirgiannis, M. (eds.) ECML PKDD 2011. LNCS (LNAI), vol. 6913, pp. 459–474. Springer, Heidelberg (2011). doi:10.1007/978-3-642-23808-6_30

    Chapter  Google Scholar 

  10. van Leeuwen, M., Ukkonen, A.: Fast estimation of the pattern frequency spectrum. In: Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds.) ECML PKDD 2014. LNCS (LNAI), vol. 8725, pp. 114–129. Springer, Heidelberg (2014). doi:10.1007/978-3-662-44851-9_8

    Google Scholar 

  11. Lemmerich, F., Puppe, F.: A critical view on automatic significance-filtering in pattern mining. In: Proceedings of ECMLPKDD 2014 Workshop on Statistically Sound Data Mining (2014)

    Google Scholar 

  12. Llinares-López, F., Sugiyama, M., Papaxanthos, L., Borgwardt, K.M.: Fast and memory-efficient significant pattern mining via permutation testing. Proc. KDD 2015, 725–734 (2015)

    Article  Google Scholar 

  13. Minato, S., Uno, T., Tsuda, K., Terada, A., Sese, J.: A fast method of statistical assessment for combinatorial hypotheses based on frequent itemset enumeration. In: Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds.) ECML PKDD 2014. LNCS (LNAI), vol. 8725, pp. 422–436. Springer, Heidelberg (2014). doi:10.1007/978-3-662-44851-9_27

    Google Scholar 

  14. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, New York (1995)

    Book  MATH  Google Scholar 

  15. Ojala, M., Garriga, G.C.: Permutation tests for studying classifier performance. J. Mach. Learn. Res. 11, 1833–1863 (2010)

    MathSciNet  MATH  Google Scholar 

  16. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Discovering frequent closed itemsets for association rules. In: Beeri, C., Buneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 398–416. Springer, Heidelberg (1999). doi:10.1007/3-540-49257-7_25

    Chapter  Google Scholar 

  17. Terada, A., Okada-Hatakeyama, M., Tsuda, K., Sese, J.: Statistical significance of combinatorial regulations. Proc. Natl. Acad. Sci. 110(32), 12996–13001 (2013)

    Article  MathSciNet  MATH  Google Scholar 

  18. Webb, G.I.: Discovering significant patterns. Mach. Learn. 68(1), 1–33 (2007)

    Article  Google Scholar 

  19. Wrobel, S.: An algorithm for multi-relational discovery of subgroups. In: Komorowski, J., Zytkow, J. (eds.) PKDD 1997. LNCS, vol. 1263, pp. 78–87. Springer, Heidelberg (1997). doi:10.1007/3-540-63223-9_108

    Chapter  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Matthijs van Leeuwen or Antti Ukkonen .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

van Leeuwen, M., Ukkonen, A. (2016). Expect the Unexpected – On the Significance of Subgroups. In: Calders, T., Ceci, M., Malerba, D. (eds) Discovery Science. DS 2016. Lecture Notes in Computer Science(), vol 9956. Springer, Cham. https://doi.org/10.1007/978-3-319-46307-0_4

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-46307-0_4

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-46306-3

  • Online ISBN: 978-3-319-46307-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics