DOI: 10.1145/1739025.1739040

Performance characterization of data mining benchmarks

Published: 13 March 2010

Abstract

Explosive growth in the availability of various kinds of data in both commercial and scientific domains has resulted in an unprecedented need to develop novel, data-driven knowledge discovery techniques. Data mining is one such data-centric application. It consists of methods to discover interesting, nontrivial, and useful patterns hidden within massive amounts of data. Researchers from both academia and industry have recognized that the challenges of data mining applications will help shape the future of multi-core processor and parallelizing compiler designs. However, relatively little has been done to understand the performance characteristics of these applications on modern multi-core processors.
The exponential growth of on-chip resources makes it critical to exploit parallelism at all granularities to improve the performance of data mining applications. In this paper, we examine the instruction-level, memory-level, and thread-level parallelism available in data mining applications. We observe that (i) data mining applications have a slightly different instruction mix from SPEC integer applications, and this difference can potentially lead to different ILP extraction; (ii) although many data mining applications suffer from data cache miss penalties, similar to SPEC integer applications, different techniques must be developed to enable effective prefetching due to the existence of complex and irregular data structures, such as hash tables; (iii) although data mining applications have a large amount of thread-level parallelism, efficient extraction of such parallelism depends on on-chip cache performance; and (iv) the performance characteristics of data mining applications can vary at runtime, and thus techniques that dynamically tune the applications to adapt to such variations are desired.
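
Observation (ii) can be made concrete with a toy model. The sketch below is not taken from the paper: it assumes a very simple stride predictor (next address = last address + last stride) and compares how often that predictor is right for a sequential array scan versus probes into a hash table. All function names, the hash constant, and the table sizes are hypothetical choices made for illustration only.

```python
import random


def stride_prefetch_hit_rate(addresses):
    """Fraction of accesses that a simple stride predictor would have guessed.

    Illustrative model only: the prediction for access i is
    addresses[i-1] + (addresses[i-1] - addresses[i-2]).
    """
    hits = 0
    predictions = 0
    for i in range(2, len(addresses)):
        predicted = addresses[i - 1] + (addresses[i - 1] - addresses[i - 2])
        predictions += 1
        if predicted == addresses[i]:
            hits += 1
    return hits / predictions if predictions else 0.0


def sequential_scan(n, elem_size=8):
    """Byte addresses touched when walking an array front to back."""
    return [i * elem_size for i in range(n)]


def hash_table_probes(n, num_buckets=1 << 16, bucket_size=64, seed=0):
    """Byte addresses of the buckets touched when looking up n random keys.

    The multiplicative hash and the bucket layout are assumptions, not the
    data structures used by any particular benchmark.
    """
    rng = random.Random(seed)
    keys = [rng.getrandbits(32) for _ in range(n)]
    return [
        (((k * 2654435761) % (1 << 32)) % num_buckets) * bucket_size
        for k in keys
    ]


seq_rate = stride_prefetch_hit_rate(sequential_scan(10_000))
hash_rate = stride_prefetch_hit_rate(hash_table_probes(10_000))
print(f"sequential scan: {seq_rate:.3f}")
print(f"hash probes:     {hash_rate:.3f}")
```

Under these assumptions the sequential scan is predicted on every access, while the hash probes almost never are, which is one way to see why regular-stride prefetching does little for hash-table-heavy code.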

Cited By

  • (2018) hpLysis Database-Engine: A New Data-Scheme for Fast Semantic Queries in Biomedical Databases. 2018 IEEE 12th International Conference on Semantic Computing (ICSC), pp. 383-390. DOI: 10.1109/ICSC.2018.00084. Online publication date: Jan-2018
  • (2013) Characterizing multi-threaded applications for designing sharing-aware last-level cache replacement policies. 2013 IEEE International Symposium on Workload Characterization (IISWC), pp. 1-10. DOI: 10.1109/IISWC.2013.6704665. Online publication date: Sep-2013

Published In

INTERACT-14: Proceedings of the 2010 Workshop on Interaction between Compilers and Computer Architecture
March 2010
83 pages
ISBN:9781605589213
DOI:10.1145/1739025
Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

ASPLOS '10
