Performance characterization of data mining benchmarks

Authors: Vineeth Mekkat, Ragavendra Natarajan, Antonia Zhai

INTERACT-14: Proceedings of the 2010 Workshop on Interaction between Compilers and Computer Architecture

Article No.: 11, Pages 1 - 8

https://doi.org/10.1145/1739025.1739040

Published: 13 March 2010

Abstract

Explosive growth in the availability of various kinds of data in both commercial and scientific domains has resulted in an unprecedented need to develop novel data-driven knowledge discovery techniques. Data mining is one such data-centric application. It consists of methods to discover interesting, nontrivial, and useful patterns hidden within massive amounts of data. Researchers from both academia and industry have recognized that the challenges of data mining applications will help shape the future of multi-core processor and parallelizing compiler designs. However, relatively little has been done to understand the performance characteristics of these applications on modern multi-core processors.

The exponential growth of on-chip resources makes it critical to exploit parallelism at all granularities to improve the performance of data mining applications. In this paper, we examine the instruction-level, memory-level, and thread-level parallelism available in data mining applications. We observe that (i) data mining applications have a slightly different instruction mix from SPEC integer applications, and this difference can potentially lead to different ILP extraction; (ii) although many data mining applications suffer from data cache miss penalties, similar to SPEC integer applications, different techniques must be developed to enable effective prefetching due to the existence of complex and irregular data structures, such as hash tables; (iii) although data mining applications have a large amount of thread-level parallelism, efficient extraction of such parallelism depends on on-chip cache performance; and (iv) the performance characteristics of data mining applications can vary at runtime, and thus techniques that dynamically tune the applications to adapt to such variations are desired.
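
As an illustration of observation (ii), the following toy sketch contrasts the address stream of a sequential array scan with that of hash-table bucket probes. It is not taken from the paper: the helper names, the multiplicative hash constant, the bucket size, and the "same stride as last time" predictor are all assumptions chosen for illustration, and the model ignores real cache and prefetcher behavior.

```python
import random


def stride_predictable_fraction(addresses):
    """Fraction of accesses whose stride equals the previous stride.

    Toy proxy (an assumption, not a hardware model) for what a simple
    stride-based prefetcher could anticipate.
    """
    if len(addresses) < 3:
        return 0.0
    hits = 0
    for i in range(2, len(addresses)):
        prev_stride = addresses[i - 1] - addresses[i - 2]
        stride = addresses[i] - addresses[i - 1]
        if stride == prev_stride:
            hits += 1
    return hits / (len(addresses) - 2)


def array_scan_trace(n, elem_size=8):
    """Byte offsets touched by a sequential scan of n fixed-size elements."""
    return [i * elem_size for i in range(n)]


def hash_probe_trace(keys, num_buckets, bucket_size=64):
    """Byte offsets of the buckets probed for each key.

    The multiplicative hash is hypothetical and used only for illustration.
    """
    return [
        ((k * 2654435761) % 2**32 % num_buckets) * bucket_size
        for k in keys
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
keys = [rng.randrange(1_000_000) for _ in range(10_000)]

scan = stride_predictable_fraction(array_scan_trace(10_000))
probe = stride_predictable_fraction(hash_probe_trace(keys, num_buckets=4096))
print(f"array scan: {scan:.4f}  hash probes: {probe:.4f}")
```

Under these assumptions, every access in the array scan repeats the previous stride, while consecutive hash probes almost never do, which is one way to picture why stride-oriented prefetching does not carry over to hash-table traversals.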

Cited By

Ekseth O, Meyer J, Hvasshovd S (2018). hpLysis Database-Engine: A New Data-Scheme for Fast Semantic Queries in Biomedical Databases. 2018 IEEE 12th International Conference on Semantic Computing (ICSC), pp. 383-390. Online publication date: Jan-2018. https://doi.org/10.1109/ICSC.2018.00084

Natarajan R, Chaudhuri M (2013). Characterizing multi-threaded applications for designing sharing-aware last-level cache replacement policies. 2013 IEEE International Symposium on Workload Characterization (IISWC), pp. 1-10. Online publication date: Sep-2013. https://doi.org/10.1109/IISWC.2013.6704665

Index Terms

1. General and reference
  1. Cross-computing tools and techniques
2. Information systems
  1. Information systems applications

Published In

INTERACT-14: Proceedings of the 2010 Workshop on Interaction between Compilers and Computer Architecture

March 2010

83 pages

ISBN:9781605589213

DOI:10.1145/1739025

General Chair: Alex K. Jones, University of Pittsburgh

Program Chair: Xin Yuan, Florida State University

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ASPLOS '10: Architectural Support for Programming Languages and Operating Systems

March 13, 2010

Pittsburgh, Pennsylvania

Bibliometrics

Total Citations: 2
Total Downloads: 586
Downloads (Last 12 months): 1
Downloads (Last 6 weeks): 0

Reflects downloads up to 08 Feb 2025
