Nothing Special   »   [go: up one dir, main page]

Skip to main content
Log in

A user parameter-free approach for mining robust sequential classification rules

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

Sequential data are generated in many domains of science and technology. Although many studies have been carried out for sequence classification in the past decade, the problem is still a challenge, particularly for pattern-based methods. We identify two important issues related to pattern-based sequence classification, which motivate the present work: the curse of parameter tuning and the instability of common interestingness measures. To alleviate these issues, we suggest a new approach and framework for mining sequential rule patterns for classification purpose. We introduce a space of rule pattern models and a prior distribution defined on this model space. From this model space, we define a Bayesian criterion for evaluating the interest of sequential patterns. We also develop a user parameter-free algorithm to efficiently mine sequential patterns from the model space. Extensive experiments show that (i) the new criterion identifies interesting and robust patterns, (ii) the direct use of the mined rules as new features in a classification process demonstrates higher inductive performance than the state-of-the-art sequential pattern-based classifiers.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18

Similar content being viewed by others

Notes

  1. http://docs.oracle.com/javase/7/docs/api/java/util/BitSet.html.

  2. http://web.ist.utl.pt/acardoso/datasets/.

  3. http://www.khiops.com.

  4. In case if MiSeRe has difficulty in finding the required 1024 rules, another constraint based on time is fixed. If MiSeRe cannot find any interesting rule 5 minutes after the generation of last interesting rule, the algorithm is terminated.

References

  1. Agrawal R, Srikant R (1995) Mining sequential patterns. In: ICDE’95, pp 3–14

  2. Aseervatham S, Osmani A, Viennet E (2006) Bitspade: a lattice-based sequential pattern mining algorithm using bitmap representation. In: Sixth International Conference on Data Mining, 2006. ICDM’06. IEEE, pp 792–797

  3. Ayres J, Flannick J, Gehrke J, Yiu T (2002) Sequential pattern mining using a bitmap representation. In: KDD’02. ACM, pp 429–435

  4. Bache K, Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml

  5. Baralis E, Chiusano S, Dutto R, Mantellini L (2008) Compact representations of sequential classification rules. In: Data mining: foundations and practice, pp 1–30

  6. Boullé M (2006) MODL: a Bayes optimal discretization method for continuous attributes. Mach Learn 65(1):131–165

    Article  Google Scholar 

  7. Boullé M (2007) Compression-based averaging of selective naive Bayes classifiers. J Mach Learn Res 8:1659–1685

    MathSciNet  MATH  Google Scholar 

  8. Cardoso-Cachopo A (2007) Improving methods for single-label text categorization. PdD Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa

  9. Cheng H, Yan X, Han J, Hsu CW (2007) Discriminative frequent pattern analysis for effective classification. In: IEEE 23rd international conference on data engineering, 2007. ICDE 2007. IEEE, pp 716–725

  10. Coenen F, Leng PH (2007) The effect of threshold values on association rule based classification accuracy. Data Knowl Eng 60(2):345–360

    Article  Google Scholar 

  11. Companion website (2015) MiSeRe: Mining sequential classification rules. http://misere.co.nf

  12. Cover TM, Thomas JA (2006) Elements of information theory (Wiley series in telecommunications and signal processing). Wiley-Interscience, New York

    Google Scholar 

  13. Dafé G, Veloso A, Zaki M, Meira W Jr. (2015) Learning sequential classifiers from long and noisy discrete-event sequences efficiently. Data Min Knowl Discov, To appear

  14. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

    MathSciNet  MATH  Google Scholar 

  15. Deng K, Zaïane OR (2010) An occurrence based approach to mine emerging sequences. In: DaWaK’10, pp 275–284

  16. Deshpande M, Karypis G (2002) Evaluation of techniques for classifying biological sequences. In: PAKDD’02, pp 417–431

  17. Egho E, Gay D, Boullé M, Voisine N, Clérot F (2015) A parameter-free approach for mining robust sequential classification rules. In: 2015 IEEE international conference on data mining, ICDM 2015, Atlantic City, 14–17 Nov 2015, pp 745–750

  18. Egho E, Raïssi C, Calders T, Jay N, Napoli A (2015) On measuring similarity for sequences of itemsets. Data Min Knowl Discov 29(3):732–764

    Article  MathSciNet  Google Scholar 

  19. Fan W, Zhang K, Cheng H, Gao J, Yan X, Han J, Yu PS, Verscheure O (2008) Direct mining of discriminative and essential frequent patterns via model-based search tree. In: ACM SIGKDD’08, pp 230–238

  20. Fradkin D, Mörchen F (2015) Mining sequential patterns for classification. Knowl Inf Syst, To appear

  21. Gay D, Boullé M (2012) A Bayesian approach for classification rule mining in quantitative databases. In: ECML/PKDD’12, pp 243–259

  22. Gay D, Selmaoui N, Boulicaut J-F (2008) Feature construction based on closedness properties is not that simple. In: Advances in knowledge discovery and data mining. Springer, pp 112–123

  23. Grosskreutz H, Lang B, Trabold D (2013) A relevance criterion for sequential patterns. In: ECML/PKDD’13, pp 369–384

  24. Grünwald PD, Myung IJ, Pitt MA (2005) Advances in minimum description length: theory and applications. MIT press, Cambridge

    Google Scholar 

  25. Hall MA, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor 11(1):10–18

    Article  Google Scholar 

  26. Holat P, Plantevit M, Raïssi C, Tomeh N, Charnois T, Crémilleux B (2014) Sequence classification based on delta-free sequential patterns. In: ICDM’14, pp 170–179

  27. Ji X, Bailey J, Dong G (2005) Mining minimal distinguishing subsequence patterns with gap constraints. In: IEEE ICDM’05, pp 194–201

  28. Jorge AM, Azevedo PL, Pereira F (2006) Distribution rules with numeric attributes of interest. In: PKDD’06, pp 247–258

  29. Lam HT, Moerchen F, Fradkin D, Calders T (2012) Mining compressing sequential patterns. In: SDM’12, pp 319–330

  30. Lam HT, Mörchen F, Fradkin D, Calders T (2014) Mining compressing sequential patterns. Stat Anal Data Min 7(1):34–52

    Article  MathSciNet  Google Scholar 

  31. Lavrac N, Gamberger D, Jovanoski V (1999) A study of relevance for learning in deductive databases. J Log Program 40(2–3):215–249

    Article  MathSciNet  MATH  Google Scholar 

  32. Lawrence R (2002) A tutorial on hidden Markov models and selected applications in speech recognition. IEEE 77(2):419–444

    Google Scholar 

  33. Lesh N, Zaki MJ, Ogihara M (1999) Mining features for sequence classification. In: ACM SIGKDD’99, pp 342–346

  34. Leslie CS, Eskin E, Weston J, Noble WS (2002) Mismatch string kernels for SVM protein classification. In: NIPS’02, pp 1417–1424

  35. Liu B, Hsu W, Ma Y (1998) Integrating classification and association rule mining. In: ACM SIGKDD’98, pp 80–86

  36. Lodhi H, Saunders C, Shawe-Taylor J, Cristianini N, Christopher JCH (2002) Watkins. Text classification using string kernels. J Mach Learn Res 2:419–444

    MATH  Google Scholar 

  37. Mannila H, Toivonen H (1996) Multiple uses of frequent sets and condensed representations (extended abstract). In: KDD’96, pp 189–194

  38. Ming L, Paul V (2013) An introduction to Kolmogorov complexity and its applications. Springer Science & Business Media, New York

    Google Scholar 

  39. Mörchen F, Ultsch A (2007) Efficient mining of understandable patterns from multivariate interval time series. Data Min Knowl Discov 15(2):181–215

    Article  MathSciNet  Google Scholar 

  40. Myers JL, Well AD (2003) Res Des Stat Anal. Lawrence Erlbaum Associates, New Jersey

    Google Scholar 

  41. Rissanen J (1978) Modeling by shortest data description. Automatica 14(5):465–471

    Article  MATH  Google Scholar 

  42. Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv 34(1):1–47

    Article  MathSciNet  Google Scholar 

  43. Shannon CE (2001) A mathematical theory of communication. ACM SIGMOBILE Mob Comput Commun Rev 5(1):3–55

    Article  MathSciNet  Google Scholar 

  44. She R, Chen F, Wang F, Ester M, Gardy FL, Brinkman FSL (2003) Frequent-subsequence-based prediction of outer membrane proteins. In: ACM SIGKDD’03, pp 436–445

  45. Tan P-N, Kumar V (2002) Discovery of web robot sessions based on their navigational patterns. Data Min Knowl Discov 6(1):9–35

    Article  MathSciNet  Google Scholar 

  46. Tatti N, Vreeken J (2012) The long and the short of it: Summarising event sequences with serial episodes. In: ACM SIGKDD’12, pp 462–470

  47. Tseng VS, Lee CH (2005) CBS: a new classification method by using sequential patterns. In: SDM’05, pp 596–600

  48. Vitányi P, Li M (2000) Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans Inf Theory 46(2):446–464

    Article  MathSciNet  MATH  Google Scholar 

  49. Wang J, Han J (2004) BIDE: efficient mining of frequent closed sequences. In: ICDE’04, pp 79–90

  50. Xing Z, Pei J, Keogh EJ (2010) A brief survey on sequence classification. SIGKDD Explor 12(1):40–48

    Article  Google Scholar 

  51. Zaki MJ (2000) Sequence mining in categorical domains: incorporating constraints. In: CIKM’00, pp 422–429

  52. Zaki MJ, Carothers CD, Szymanski BK (2010) VOGUE: a variable order hidden Markov model with duration based on frequent sequence mining. TKDD, 4(1)

  53. Zhou C, Cule B, Goethals B (2013) Itemset based sequence classification. In: ECML/PKDD’13, pp 353–368

  54. Zimmermann A, Nijssen S (2014) Supervised pattern mining and applications to classification. In: Frequent pattern mining, pp 425–442

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Elias Egho.

Additional information

This work was done while Dominique Gay was a research engineer at Orange Labs.

Appendix: Proof of Theorem 1 and  2

Appendix: Proof of Theorem 1 and  2

Here, we will present a complete Proof of Theorem 1, while theorem 2 can be proven in the same way.

Proof

Given the cost of the null model in Eq. , the prior terms are neglected when the number of sequences n is very high. Therefore, when \(n \rightarrow \infty \) the cost of the null model is written as:

$$\begin{aligned} \mathrm{cost}(\pi _{\emptyset })= & {} log (n!)- \sum _{i=1}^{j} log (n_{c_i}!)\\ \end{aligned}$$

Using the approximation \(log(n!) = n(log(n)-1)+O(log(n))\) based on Stirlings formula, the cost of the null model can be written as:

$$\begin{aligned} \mathrm{cost}(\pi _{\emptyset })= & {} n(log(n)-1)- \sum _{i=1}^{j} n_{c_i}(log (n_{c_i})-1)+O(log(n))\nonumber \\= & {} nlog(n)-n - \sum _{i=1}^{j} n_{c_i}log (n_{c_i})+ \sum _{i=1}^{j} n_{c_i}+O(log(n))\\ \end{aligned}$$

Notice that \(n=\sum _{i=1}^{j} n_{c_i}\). Therefore, we have:

$$\begin{aligned} \mathrm{cost}(\pi _{\emptyset })= & {} \sum _{i=1}^{j} n_{c_i}log(n)-n - \sum _{i=1}^{j} n_{c_i}log (n_{c_i})+ n+O(log(n))\\= & {} -\sum _{i=1}^{j} n_{c_i} \bigg ( log(n_{c_i})-log(n) \bigg )+O(log(n))\\= & {} n \times \bigg (-\sum _{i=1}^{j}\frac{n_{c_i}}{n}log( \frac{n_{c_i}}{n})\bigg )+O(log(n))\\= & {} n \times \bigg (-\sum \limits _{i=1}^{j} p(c_i)log(p(c_i))\bigg )+O(log(n))\\= & {} n \times H(y) +O(log(n)) \end{aligned}$$

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Egho, E., Gay, D., Boullé, M. et al. A user parameter-free approach for mining robust sequential classification rules. Knowl Inf Syst 52, 53–81 (2017). https://doi.org/10.1007/s10115-016-1002-4

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-016-1002-4

Keywords

Navigation