Nothing Special   »   [go: up one dir, main page]

skip to main content
article

Frequency-based views to pattern collections

Published: 01 May 2006 Publication History

Abstract

Finding interesting patterns from data is one of the most important problems in data mining and it has been studied actively for more than a decade. However, it is still largely open problem which patterns are interesting and which are not. The problem of detecting the interesting patterns (in a predefined class of patterns) has been attempted to solve by determining quality values for potentially interesting patterns and deciding a pattern to be interesting if its quality value (i.e., the interestingness of the pattern) is higher than a given threshold value. Again, it is very difficult to find a threshold value and a way to determine the quality values such that the collection of patterns with quality values greater than the threshold value would contain almost all truly interesting patterns and only few uninteresting ones. To enable more accurate characterization of interesting patterns, use of constraints to further prune the pattern collection has been proposed. However, most of the constrained pattern discovery research has been focused on structural constraints for the pattern collections and the patterns. We take a complementary approach and focus on constraining the quality values of the patterns. We propose quality value simplifications as a complementary approach to structural constraints on patterns. As a special case of the quality value simplifications, we consider discretizing the quality values. We analyze the worst-case error of certain discretization functions and give efficient discretization algorithms minimizing several loss functions. In addition to that, we show that the discretizations of the quality values can be used to obtain small approximate condensed representations for collections of interesting patterns. We evaluate the proposed condensation approach experimentally using frequent itemsets.

References

[1]
Agrawal, R., Imielinski, T. and Swami, A.N., Mining association rules between sets of items in large databases. In: Buneman, P., Jajodia, S. (Eds.), Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, ACM Press, New York. pp. 207-216.
[2]
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. and Verkamo, A.I., Fast discovery of association rules. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (Eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, New York. pp. 307-328.
[3]
B¿doiu, M., Har-Peled, S. and Indyk, P., Approximate clustering via core-sets. In: Proceedings on 34th Annual ACM Symposium on Theory of Computing, ACM Press, New York. pp. 250-257.
[4]
Bayardo Jr., R.J., Agrawal, R. and Gunopulos, D., Constraint-based rule mining in large, dense databases. Data Min. Knowledge Discovery. v4 i2/3. 217-240.
[5]
R. Bayardo, B. Goethals, M.J. Zaki (Eds.), Proceedings of the Workshop on Frequent Itemset Mining Implementations (FIMI-04), Brighton, UK, 1 November 2004, CEUR Workshop Proceedings, vol. 126, 2004, http://CEUR-WS.org/Vol-126/.
[6]
F. Bonchi, F. Giannotti, A. Mazzanti, D. Pedreschi, ExAnte: anticipated data reduction in constrained pattern mining, in: N. Lavrač, D. Gamberger, H. Blockeel, L. Todorovski (Eds.), Knowledge Discovery in Databases: PKDD 2003, Seventh European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, 22-26 September 2003, Proceedings, Lecture Notes in Artificial Intelligence, vol. 2838, Springer, Berlin, 2003, pp. 59-70.
[7]
J.-F. Boulicaut, A. Bykowski, Frequent closures as a concise representation for binary data mining, in: T. Terano, H. Liu, A.L.P. Chen (Eds.), Knowledge Discovery and Data Mining, Current Issues and New Applications, Fourth Pacific-Asia Conference, 18-20 April 2000, PAKDD 2000, Kyoto, Japan, Proceedings, Lecture Notes in Computer Science, vol. 1805, Springer, Berlin, 2000, pp. 62-73.
[8]
Boulicaut, J.-F., Bykowski, A. and Rigotti, C., Free-sets: a condensed representation of Boolean data for the approximation of frequency queries. Data Min. Knowledge Discovery. v7 i1. 5-22.
[9]
Bykowski, A. and Rigotti, C., A condensed representation to find frequent patterns. In: Proceedings of the 20th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, ACM Press, New York.
[10]
T. Calders, B. Goethals, Mining all non-derivable frequent itemsets, in: T. Elomaa, H. Mannila, H. Toivonen (Eds.), Principles of Data Mining and Knowledge Discovery, Sixth European Conference, PKDD 2002, Helsinki, Finland, 19-23 August 2002, Proceedings, Lecture Notes in Artificial Intelligence, vol. 2431, Springer, Berlin, 2002, pp. 74-865.
[11]
T. Calders, B. Goethals, Minimal k-free representations of frequent sets, in: N. Lavrač, D. Gamberger, H. Blockeel, L. Todorovski (Eds.), Knowledge Discovery in Databases: PKDD 2003, Seventh European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, 22-26 September 2003, Proceedings, Lecture Notes in Artificial Intelligence, vol. 2838, Springer, Berlin, 2003, pp. 71-82.
[12]
G. Casas-Garriga, Discovering unbounded episodes in sequential data, in: N. Lavrač, D. Gamberger, H. Blockeel, L. Todorovski (Eds.), Knowledge Discovery in Databases: PKDD 2003, Seventh European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, 22-26 September 2003, Proceedings, Lecture Notes in Artificial Intelligence, vol. 2838, Springer, Berlin, 2003, pp. 83-94.
[13]
S. Dasgupta, Performance guarantees for hierarchical clustering, in: J. Kivinen, R.H. Sloan (Eds.), Computational Learning Theory, 15th Annual Conference on Computational Learning Theory, COLT 2002, Sydney, Australia, 8-10 July 2002, Proceedings, Lecture Notes in Artificial Intelligence, vol. 2375, Springer, Berlin, 2002, pp. 351-363.
[14]
de la Vega, W.F., Karpinski, M., Kenyon, C. and Rabani, Y., Approximation schemes for clustering problems. In: Proceedings of the 35th Annual ACM Symposium on Theory of Computing, ACM Press, New York.
[15]
L. De Raedt, M. Jaeger, S.D. Lee, H. Mannila, A theory of inductive query answering, in: V. Kumar, S. Tsumoto (Eds.), Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), 9-12 December 2002, Maebashi City, Japan, IEEE Computer Society, Silver Spring, MD, 2002, pp. 123-130.
[16]
Dehaspe, L. and Toivonen, H.T.T., Discovery of relational association rules. In: D¿eroski, S., Lavrač, N. (Eds.), Relational Data Mining, Springer, Berlin. pp. 189-212.
[17]
Elomaa, T. and Rousu, J., On the computational complexity of optimal multisplitting. Fundam. Inform. v47 i1-2. 35-52.
[18]
Feder, T. and Greene, D.H., Optimal algorithms for approximate clustering. In: Proceedings of the 20th Annual ACM Symposium on Theory of Computing, ACM Press, New York. pp. 434-444.
[19]
Fisher, W.D., On grouping for maximum homogeneity. J. Amer. Statist. Assoc. v53 i284. 789-798.
[20]
B. Goethals, J. Van den Bussche, On supporting interactive association rule mining, in: Y. Kambayashi, M.K. Mohania, A.M. Tjoa (Eds.), DaWaK, Lecture Notes in Computer Science, vol. 1874, Springer, Berlin, 2000, pp. 307-316.
[21]
B. Goethals, J. Van den Bussche, Relational association rules: getting WARMeR, in: D.J. Hand, N.M. Adams, R.J. Bolton (Eds.), Pattern Detection and Discovery, ESF Exploratory Workshop, London, UK, 16-19 September 2002, Proceedings, Lecture Notes in Computer Science, vol. 2447, Springer, Berlin, 2002, pp. 125-139.
[22]
B. Goethals, M.J. Zaki (Eds.), Proceedings of the Workshop on Frequent Itemset Mining Implementations (FIMI-03), Melbourne, FL, USA, 19 November 2003, CEUR Workshop Proceedings, vol. 90, 2003, http://CEUR-WS.org/Vol-90/.
[23]
Gunopulos, D., Khardon, R., Mannila, H., Saluja, S., Toivonen, H. and Sharma, R.S., Discovering all most specific sentences. ACM Trans. Database Systems. v28 i2. 140-174.
[24]
R. Gwadera, M. Atallah, W. Szpankowski, Reliable detection of episodes in event sequences, in: X. Wu, A. Tuzhilin, J. Shavlik (Eds.), Proceedings of the Third IEEE International Conference on Data Mining (ICDM 2003), 19-22 December 2003, Melbourne, FL, USA, IEEE Computer Society Press, Silver Spring, MD, 2003, pp. 67-74.
[25]
Han, J., Pei, J., Yin, Y. and Mao, R., Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Min. Knowledge Discovery. v8 i1. 53-87.
[26]
D.J. Hand, Pattern detection and discovery, in: D.J. Hand, N.M. Adams, R.J. Bolton (Eds.), Pattern Detection and Discovery, ESF Exploratory Workshop, London, UK, 16-19 September 2002, Proceedings, Lecture Notes in Computer Science, vol. 2447, Springer, Berlin, 2002, pp. 1-12.
[27]
Hand, D.J., Mannila, H. and Smyth, P., Principles of Data Mining. 2001. MIT Press, Cambridge, MA.
[28]
Har-Peled, S. and Sadri, B., How fast is the k-means method?. Algorithmica. v41. 185-202.
[29]
T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, Springer, Berlin, 2001.
[30]
Inokuchi, A., Washio, T. and Motoda, H., Complete mining of frequent patterns from graphs: mining graph data. Mach. Learn. v50. 321-354.
[31]
Ioannidis, Y.E., The history of histograms (abridged). In: Freytag, J.C., Lockemann, P.C., Abiteboul, S., Carey, M.J., Selinger, P.G., Heuer, A. (Eds.), VLDB 2003, Proceedings of 29th International Conference on Very Large Data Bases, Morgan Kaufmann, Los Altos, CA. pp. 19-30.
[32]
Jagadish, H.V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K.C. and Suel, T., Optimal histograms with quality guarantees. In: Gupta, A., Shmueli, O., Widom, J. (Eds.), VLDB'98, Proceedings of 24th International Conference on Very Large Data Bases, Morgan Kaufmann, Los Altos, CA. pp. 275-286.
[33]
Kannan, R., Vempala, S. and Vetta, A., On clusterings: good, bad and spectral. J. ACM. v51 i3. 497-515.
[34]
Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R. and Wu, A.Y., A local search approximation algorithm for k-means clustering. Comput. Geom.: Theory Appl. v28. 89-112.
[35]
Kifer, D., Gehrke, J., Bucila, C. and White, W., How to quickly find a witness. In: Proceedings of the 22nd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, ACM Press, New York. pp. 272-283.
[36]
Knuth, D.E., Sorting and Searching, The Art of Computer Programming. 1998. second ed. Addison-Wesley, Reading, MA.
[37]
M. Kryszkiewicz, Concise representation of frequent patterns based on disjunction-free generators, in: N. Cercone, T.Y. Lin, X. Wu (Eds.), Proceedings of the 2001 IEEE International Conference on Data Mining, 29 November-2 December 2001, San Jose, California, USA, IEEE Computer Society Press, Silver Spring, MD, 2001, pp. 305-312.
[38]
Kumar, A., Sabharwal, Y. and Sen, S., A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions. In: 45th Symposium on Foundations of Computer Science (FOCS 2004), IEEE Computer Society Press, Silver Spring, MD. pp. 454-462.
[39]
M. Kurakochi, G. Karypis, Frequent subgraph discovery, in: N. Cercone, T.Y. Lin, X. Wu (Eds.), Proceedings of the 2001 IEEE International Conference on Data Mining, 29 November-2 December 2001, San Jose, California, USA, IEEE Computer Society Press, Silver Spring, MD, 2001, pp. 313-320.
[40]
Lakshmanan, L.V.S., Leung, C.K.-S. and Ng, R.T., Efficient dynamic mining of constrained frequent sets. ACM Trans. Database Systems. v28 i4. 337-389.
[41]
J. Maloberti, E. Suzuki, Improving efficiency of frequent query discovery by eliminating non-relevant candidates, in: G. Grieser, Y. Tanaka, A. Yamamoto (Eds.), Discovery Science, Sixth International Conference, DS 2003, Sapporo, Japan, 17-19 October 2003, Proceedings, Lecture Notes in Computer Science, vol. 2843, Springer, Berlin, 2003, pp. 220-232.
[42]
H. Mannila, Local and global methods in data mining: basic techniques and open problems, in: P. Widmayer, F.T. Ruiz, R.M. Bueno, M. Hennessy, S. Eidenbenz, R. Conejo (Eds.), Automata, Languages and Programming, 29th International Colloquium, ICALP 2002, Malaga, Spain, 8-13 July 2002, Proceedings, Lecture Notes in Computer Science, vol. 2380, Springer, Berlin, 2002, pp. 57-68.
[43]
Mannila, H. and Toivonen, H., Levelwise search and borders of theories in knowledge discovery. Data Min. Knowledge Discovery. v1 i3. 241-258.
[44]
Mannila, H., Toivonen, H. and Verkamo, A.I., Discovery of frequent episodes in event sequences. Data Min. Knowledge Discovery. v1 i3. 259-289.
[45]
T. Mielikäinen, Chaining patterns, in: G. Grieser, Y. Tanaka, A. Yamamoto (Eds.), Discovery Science, Sixth International Conference, DS 2003, Sapporo, Japan, 17-19 October 2003, Proceedings, Lecture Notes in Computer Science, vol. 2843, Springer, Berlin, 2003, pp. 232-243.
[46]
Mielikäinen, T., Finding all occurring sets of interest. In: Boulicaut, J.-F., D¿eroski, S. (Eds.), Second International Workshop on Knowledge Discovery in Inductive Databases, pp. 97-106.
[47]
T. Mielikäinen, Separating structure from interestingness, in: H. Dai, R. Srikant, C. Zhang (Eds.), Advances in Knowledge Discovery and Data Mining, Eighth Pacific-Asia Conference, PAKDD 2004, Sydney, Australia, 26-28 May 2004, Proceedings, Lecture Notes in Artificial Intelligence, vol. 3056, Springer, Berlin, 2004, pp. 476-485.
[48]
T. Mielikäinen, H. Mannila, The pattern ordering problem, in: N. Lavrač, D. Gamberger, H. Blockeel, L. Todorovski (Eds.), Knowledge Discovery in Databases: PKDD 2003, Seventh European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, 22-26 September 2003, Proceedings, Lecture Notes in Artificial Intelligence, vol. 2838, Springer, Berlin, 2003, pp. 327-338.
[49]
N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal, Discovering frequent closed itemsets for association rules, in: C. Beeri, P. Buneman (Eds.), Database Theory-ICDT '99, Seventh International Conference, Jerusalem, Israel, 10-12 January 1999, Proceedings, Lecture Notes in Computer Science, vol. 1540, Springer, Berlin, 1999, pp. 398-416.
[50]
J. Pei, G. Dong, W. Zou, J. Han, On computing condensed pattern bases, in: V. Kumar, S. Tsumoto (Eds.), Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), 9-12 December 2002, Maebashi City, Japan, IEEE Computer Society Press, Silver Spring, MD, 2002, pp. 378-385.
[51]
Srikant, R., Vu, Q. and Agrawal, R., Mining association rules with item constraints. In: Heckerman, D., Mannila, H., Pregibon, D. (Eds.), Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97), AAAI Press. pp. 67-73.
[52]
P.-N. Tan, V. Kumar, J. Srivastava, Selecting the right interestingness measure for association patterns, in: D. Hand, D. Keim, R. Ng (Eds.), Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 23-26 July 2002, Edmonton, Alberta, Canada, ACM Press, New York, 2002.
[53]
Toivonen, H., Sampling large databases for association rules. In: Vijayaraman, T.M., Buchmann, A.P., Mohan, C., Sarda, N.L. (Eds.), VLDB'96, Proceedings of 22nd International Conference on Very Large Data Bases, Morgan Kaufmann, Los Altos, CA. pp. 134-145.
[54]
Wang, J. and Han, J., BIDE: efficient mining of frequent closed sequences. In: Proceedings of the 20th International Conference on Data Engineering (ICDE 2004), IEEE Computer Society, Silver Spring, MD.
[55]
Wang, X., Wang, J.T.L., Shasha, D., Shapiro, B.A., Rigoutsos, I. and Zhang, K., Finding patterns in three-dimensional graphs: algorithms and applications to scientific data mining. IEEE Trans. Knowledge Data Eng. v14 i4. 731-749.
[56]
Y. Xiao, J.-F. Yao, Z. Li, M.H. Dunham, Efficient data mining for maximal frequent subtrees, in: X. Wu, A. Tuzhilin, J. Shavlik (Eds.), Proceedings of the Third IEEE International Conference on Data Mining (ICDM 2003), 19-22 December 2003, Melbourne, FL, USA, IEEE Computer Society, Silver Spring, MD, 2003, pp. 379-386.
[57]
Yan, X., Cheng, H., Han, J. and Xin, D., Summarizing itemset patterns: a profile-based approach. . In: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, New York.
[58]
X. Yan, J. Han, gSpan: graph-based substructure pattern mining, in: V. Kumar, S. Tsumoto (Eds.), Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), 9-12 December 2002, Maebashi City, Japan, IEEE Computer Society Press, Silver Spring, MD, 2002, pp. 721-724.
[59]
Zaki, M.J., Scalable algorithms for association mining. IEEE Trans. Knowledge Data Eng. v12 i3. 372-390.
[60]
Zaki, M.J., SPADE: an efficient algorithm for mining frequent sequences. Mach. Learn. v42. 31-60.
[61]
M.J. Zaki, Efficiently mining frequent trees in a forest, in: D. Hand, D. Keim, R. Ng (Eds.), Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 23-26 July 2002, Edmonton, Alberta, Canada, ACM Press, New York, 2002.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Discrete Applied Mathematics
Discrete Applied Mathematics  Volume 154, Issue 7
Special issue: Discrete mathematics & data mining II (DM & DM II)
1 May 2006
127 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Publication History

Published: 01 May 2006

Author Tags

  1. Condensed representations of pattern collections
  2. Data mining
  3. Pattern discovery

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 21 Nov 2024

Other Metrics

Citations

Cited By

View all

View Options

View options

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media